OpenAI released its advanced voice mode to more people. Here’s how to get it.

24 Sep 2024, 19:08 by James O'Donnell · MIT Technology Review

OpenAI is broadening access to Advanced Voice Mode, a feature of ChatGPT that allows you to speak more naturally with the AI model. You can interrupt its responses midsentence, and it can sense and interpret your emotions from your tone of voice and adjust its responses accordingly.

These features were teased back in May when OpenAI unveiled GPT-4o, but they were not released until July—and then just to an invite-only group. (At least initially, there seem to have been some safety issues with the model; OpenAI gave several Wired reporters access to the voice mode back in May, but the magazine reported that the company “pulled it the next morning, citing safety concerns.”)

Users who’ve been able to try it have largely described the model as an impressively fast, dynamic, and realistic voice assistant—which has made its limited availability particularly frustrating to some other OpenAI users.

Today is the first time OpenAI has promised to bring the new voice mode to a wide range of users. Here’s what you need to know.

What can it do?

Though ChatGPT currently offers a standard voice mode to paid users, its interactions can be clunky. In the mobile app, for example, you can’t interrupt the model’s often long-winded responses with your voice, only with a tap on the screen. The new version fixes that, and also promises to modify its responses based on the emotion it senses in your voice. As with other versions of ChatGPT, users can personalize the voice mode by asking the model to remember facts about themselves. The new mode has also improved its pronunciation of words in non-English languages.

AI investor Allie Miller posted a demo of the tool in August, which highlighted many of the same strengths as OpenAI’s own release videos: The model is fast and adept at changing its accent, tone, and content to match your needs.

The update also adds new voices. Shortly after the launch of GPT-4o, OpenAI was criticized for the similarity between the female voice in its demo videos, named Sky, and that of Scarlett Johansson, who played an AI love interest in the movie Her. OpenAI then removed the voice.

Now it has launched five new voices, named Arbor, Maple, Sol, Spruce, and Vale, which will be available in both the standard and advanced voice modes. MIT Technology Review has not heard them yet, but OpenAI says they were made using professional voice actors from around the world. “We interviewed dozens of actors to find those with the qualities of voices we feel people will enjoy talking to for hours—warm, approachable, inquisitive, with some rich texture and tone,” a company spokesperson says.

Who can access it and when?

For now, OpenAI is rolling out access to Advanced Voice Mode to Plus users, who pay $20 per month for a premium version, and Team users, who pay $30 per month and have higher message limits. The next group to receive access will be those in the Enterprise and Edu tiers. The exact timing, though, is vague; an OpenAI spokesperson says the company will “gradually roll out access to all Plus and Team users and will roll out to Enterprise and Edu tiers starting next week.” The company hasn’t committed to a firm deadline for when all users in these categories will have access. A message in the ChatGPT app indicates that all Plus users will have access by “the end of fall.”

There are geographic limitations. The new feature is not yet available in the EU, the UK, Switzerland, Iceland, Norway, or Liechtenstein.

There is no immediate plan to release Advanced Voice Mode to free users. (The standard mode remains available to all paid users.)

What steps have been taken to make sure it’s safe?

As the company noted upon the initial release in July and again emphasized this week, Advanced Voice Mode has been safety-tested by external experts “who collectively speak a total of 45 different languages, and represent 29 different geographies.” The GPT-4o system card details how the underlying model handles issues like generating violent or erotic speech, imitating people’s voices without their consent, or generating copyrighted content.

Still, OpenAI’s models are not open-source. Open-source models are more transparent about their training data and the “model weights” that govern how the AI produces responses. By comparison, OpenAI’s closed-source models are harder for independent researchers to evaluate from the perspective of safety, bias, and harm.