In late March 2024, OpenAI unveiled a preliminary version of its Voice Engine service, an AI-powered tool capable of replicating a person’s voice using just 15 seconds of speech. Nearly a year later, the service remains in preview, and OpenAI has given no indication of when it might launch, or whether it will launch at all.
The company’s hesitation to widely release the service may be attributed to concerns about potential misuse, as well as a desire to avoid drawing regulatory attention. OpenAI has faced criticism in the past for prioritizing the development of new products over safety considerations, and for rushing to release new technologies in order to stay ahead of its competitors.
According to a statement from an OpenAI spokesperson, the company is continuing to test Voice Engine with a limited group of “trusted partners,” with the goal of gathering feedback and improving the service’s performance and safety features.
“[We’re] learning from how [our partners are] using the technology so we can improve the model’s usefulness and safety,” the spokesperson said. “We’ve been excited to see the different ways it’s being used, from speech therapy, to language learning, to customer support, to video game characters, to AI avatars.”
Pushed back
Voice Engine, which powers the voices available in OpenAI’s text-to-speech API and ChatGPT’s Voice Mode, generates highly realistic speech that closely resembles the original speaker’s voice. The service converts written text into spoken words, within certain content restrictions. However, its release has been marked by delays and shifting timelines from the outset.
As explained in a blog post published by OpenAI in June 2024, the Voice Engine model learns to predict the most likely sounds a speaker will make when reading a given text, taking into account factors such as the speaker’s voice, accent, and speaking style. The model can then generate spoken versions of text, as well as “spoken utterances” that reflect how different types of speakers would read text aloud.
OpenAI originally planned to release Voice Engine, then called Custom Voices, to its API on March 7, 2024, according to a draft blog post obtained by TechCrunch. The plan was to provide access to a group of up to 100 “trusted developers” ahead of a wider release, with priority given to those building applications that provided a social benefit or demonstrated innovative and responsible uses of the technology. OpenAI had even trademarked and priced the service: $15 per million characters for “standard” voices and $30 per million characters for “HD quality” voices.
However, at the last minute, the company postponed the announcement. OpenAI ultimately unveiled Voice Engine a few weeks later, but without a sign-up option. Access to the service would remain limited to a small group of around 10 developers with whom the company had begun working in late 2023.
In the announcement blog post for Voice Engine, OpenAI stated its hope to initiate a discussion on the responsible deployment of synthetic voices and how society can adapt to these new capabilities. The company noted that it would use the results of its small-scale tests and conversations with stakeholders to inform its decision on whether and how to deploy the technology on a larger scale.
Long in the works
According to OpenAI, Voice Engine has been in development since 2022. The company says it demoed the technology to high-level global policymakers in the summer of 2023, highlighting both its potential and its risks.
Several partners currently have access to Voice Engine, including the startup Livox, which is developing devices to enable people with disabilities to communicate more naturally. Livox CEO Carlos Pereira told TechCrunch that while his company was unable to integrate Voice Engine into its products due to the service’s requirement for an internet connection, he found the technology to be “really impressive.”
“The quality of the voice and the possibility of having voices speak in different languages is unique, especially for people with disabilities, who are our customers,” Pereira said in an email to TechCrunch. “It is really the most impressive and easy-to-use tool I’ve seen for creating voices. We hope that OpenAI develops an offline version soon.”
Pereira noted that he has not received any guidance from OpenAI regarding a potential launch of Voice Engine, nor has he seen any indication that the company plans to start charging for the service. To date, Livox has not been required to pay for its usage of Voice Engine.
In its June 2024 blog post, OpenAI hinted that one factor behind the delay in releasing Voice Engine was the potential for abuse during the 2024 U.S. election cycle. The company has implemented various safety measures, including watermarking to trace the origin of generated audio, in an effort to mitigate these risks.
Developers using Voice Engine are required to obtain explicit consent from the original speaker before using the service, and they must clearly disclose to their audience that the voices are AI-generated. However, OpenAI has not disclosed how it plans to enforce these policies, which could prove to be a significant challenge even for a company with its resources.
In its blog posts, OpenAI has also suggested that it aims to develop a “voice authentication experience” to verify speakers and a “no-go” list to prevent the creation of voices that sound too similar to prominent figures. Both are technically ambitious projects, and getting them wrong would reflect poorly on a company that has already been criticized for putting product development ahead of safety.
Effective filtering and ID verification are becoming essential requirements for the responsible release of voice cloning technologies. Scams built on AI voice cloning are growing rapidly, with fraud and bypassed bank security checks on the rise as privacy and copyright laws struggle to keep pace. Malicious actors have also used voice cloning to create deepfakes of celebrities and politicians, which have spread quickly across social media platforms.
OpenAI could release Voice Engine as early as next week, or it may choose never to release it at all. The company has repeatedly stated that it is considering keeping the service limited in scope. However, one thing is clear: due to concerns over optics, safety, or both, the limited preview of Voice Engine has become one of the longest in OpenAI’s history.