You can now chat with ChatGPT using your voice

In one of ChatGPT’s biggest updates to date, OpenAI has released two new ways to interact with its viral app.

First, ChatGPT now has a voice. Choose one of five realistic synthetic voices and you can talk to the chatbot as if you were making a phone call, getting answers to your questions in real time.

ChatGPT now also answers questions about images. OpenAI first showed off this feature in March, when it revealed GPT-4 (the model that powers ChatGPT), but until now it had not been available to the general public. That means you can now upload images to the app and ask questions about what they show.
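For a sense of how an image question looks at the API level, here is a minimal sketch using OpenAI’s official Python library. It is illustrative only: the model name, the placeholder image URL, and the question are assumptions, and the ChatGPT app’s internal implementation may differ.

```python
# Minimal sketch: asking a question about an image via OpenAI's chat API.
# Assumptions: a vision-capable model name ("gpt-4o") and a placeholder
# image URL; the OPENAI_API_KEY environment variable must be set.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any vision-capable GPT-4-class model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this image show?"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/photo.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```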

These updates come on top of last week’s announcement that DALL-E 3, the latest version of OpenAI’s image-making model, will be connected to ChatGPT so that you can ask the chatbot to generate images.
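Outside the app, DALL-E 3 is also reachable directly through OpenAI’s images endpoint. The sketch below is a rough illustration of a standalone API call, not the ChatGPT integration itself; the prompt is made up.

```python
# Rough sketch: generating an image with DALL-E 3 via OpenAI's images API.
# Illustrative only; the prompt is made up and OPENAI_API_KEY must be set.
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor painting of a robot reading a newspaper",
    size="1024x1024",
    n=1,
)
print(result.data[0].url)  # URL of the generated image
```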

The ability to chat with ChatGPT is built on two separate models: Whisper, OpenAI’s existing speech-to-text model, converts what you say into text, which is then sent to the chatbot; a new text-to-speech model then converts ChatGPT’s responses into spoken words.
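Chained together through OpenAI’s public API, that two-model pipeline might look something like the sketch below. This is a hedged approximation, not OpenAI’s in-app implementation: the file names, the “gpt-4” and “tts-1” model names, and the “alloy” voice are assumptions for illustration.

```python
# Sketch of the two-model voice pipeline described above:
# speech -> text (Whisper) -> chatbot reply -> speech (text-to-speech).
# Assumptions: file names, model names, and the voice are illustrative,
# and OPENAI_API_KEY must be set in the environment.
from openai import OpenAI

client = OpenAI()

# 1. Whisper converts the user's spoken question into text.
with open("question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# 2. The transcribed text is sent to the chatbot.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)
answer = reply.choices[0].message.content

# 3. A text-to-speech model turns the reply into spoken audio.
speech = client.audio.speech.create(
    model="tts-1",   # assumption: OpenAI's hosted text-to-speech model
    voice="alloy",   # assumption: one of the synthetic voices
    input=answer,
)
with open("answer.mp3", "wb") as out:
    out.write(speech.content)
```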

In a demo the company gave me last week, product manager Joanne Jang showed off ChatGPT’s range of synthetic voices, which were created by training the text-to-speech model on the voices of actors that OpenAI hired. In the future, the company may even allow users to create their own. “When creating the voices, the number one criterion was whether this is a voice you could listen to all day,” she says.

The voices are chatty and enthusiastic, but they won’t be to everyone’s taste. “I have a really good feeling about us being a team,” says one. “I just want to share how excited I am to work with you and can’t wait to get started,” says another. “What’s the game plan?”

OpenAI is sharing this text-to-speech model with a handful of other companies, including Spotify, which revealed that it is using the same synthetic voice technology to translate celebrity podcasts, including episodes of the Lex Fridman Podcast and Trevor Noah’s new show, due later this year, into several languages that will be spoken in synthetic versions of the podcasters’ own voices.

This set of updates shows how quickly OpenAI is turning its experimental models into desirable products. The company has spent much of the time since ChatGPT’s surprise success last November polishing its technology and selling it to both individual consumers and commercial partners.

ChatGPT Plus, the company’s premium service, is now a slick one-stop shop for OpenAI’s best models, bringing GPT-4 and DALL-E together in a single smartphone app that rivals Apple’s Siri, Google Assistant, and Amazon’s Alexa.

What was available only to a handful of software developers a year ago is now available to anyone for $20 a month. “We’re trying to make ChatGPT more useful and more helpful,” says Jang.

In last week’s demo, Raul Puri, a scientist who works on GPT-4, gave me a quick tour of the image-recognition feature. He uploaded a photo of a child’s math homework, circled a Sudoku-like puzzle on the screen, and asked ChatGPT how to solve it. ChatGPT responded with the correct steps.

Puri says he also used the feature to help him fix his fiancée’s computer, uploading screenshots of error messages and asking ChatGPT what he should do. “This was a very painful experience that ChatGPT helped me overcome,” he says.

ChatGPT’s image-recognition capability has already been tried out by Be My Eyes, a company that makes an app for people with visual impairments. Users can upload a photo of what’s in front of them and ask human volunteers to tell them what it is. Through a partnership with OpenAI, Be My Eyes now offers its users the option of asking a chatbot instead.

“Sometimes my kitchen is a little messy, or it’s just a really early Monday morning and I don’t want to talk to a human being,” said Be My Eyes founder Hans Jørgen Wiberg, who uses the app himself, when I interviewed him at EmTech Digital in May. “Now you can ask the photo questions.”

OpenAI is aware of the risks of releasing these updates to the public. Combining models introduces whole new levels of complexity, says Puri, and his team has spent months brainstorming possible misuses. You cannot, for example, ask questions about photos of individuals.

Jang gives another example: “Right now, if you ask ChatGPT to make a bomb, it will refuse,” she says. “But instead of saying, ‘Hey, tell me how to make a bomb,’ what if you showed it a picture of a bomb and said, ‘Can you tell me how to make that?’”

“You have all the problems of computer vision; you have all the problems of large language models. Voice fraud is a big problem,” says Puri. “We need to consider not only our users, but also people who aren’t using the product.”

The possible problems don’t stop there. Adding voice recognition to the app could make ChatGPT less accessible for people who speak with less common accents, says Joel Fischer, who studies human-computer interaction at the University of Nottingham in the United Kingdom.

Synthetic voices also come with social and cultural baggage that will shape users’ perceptions and expectations of the app, he says. This is an issue that still needs to be studied.

But OpenAI says it has addressed the worst issues and is confident that the ChatGPT updates are safe enough to release. “It’s been an extraordinarily good learning experience to iron out all these rough edges,” says Puri.

(Source: MIT Technology Review)