
ElevenLabs, an AI startup that recently raised a $180 million funding round, has been known mainly for its audio generation technology. The company has now moved in a new direction with the launch of its first standalone speech-to-text model, called Scribe.

Valued at $3.3 billion, ElevenLabs has helped many companies deliver text-to-speech services through its library of voices. It is now expanding into speech detection, where it will compete with Gladia, Speechmatics, AssemblyAI, Deepgram, and OpenAI’s Whisper models.

At launch, Scribe supports 99 languages. The company rates more than 25 of them as having excellent accuracy, meaning a word error rate below 5%. That group includes English, with a claimed accuracy rate of 97%, as well as French, German, Hindi, Indonesian, Japanese, Kannada, Malayalam, Polish, Portuguese, Spanish, and Vietnamese. The remaining languages are grouped by word error rate into high (5-10%), good (10-25%), and moderate (25-50%) accuracy.
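For readers unfamiliar with the metric, word error rate is conventionally computed as the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The Python sketch below illustrates that calculation and maps a score onto the accuracy bands described above. The function names are hypothetical, and both the handling of boundary values and the treatment of the bands as contiguous up to 50% are assumptions; this is not ElevenLabs' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_tier(wer: float) -> str:
    """Map a word error rate (0.0 to 1.0) to a band name.

    Assumption: upper bounds are inclusive and the bands are contiguous.
    """
    if wer <= 0.05:
        return "excellent"
    if wer <= 0.10:
        return "high"
    if wer <= 0.25:
        return "good"
    if wer <= 0.50:
        return "moderate"
    return "unrated"


if __name__ == "__main__":
    wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
    print(f"WER = {wer:.3f}, tier = {accuracy_tier(wer)}")
```

Under this definition, a claimed accuracy of 97% would correspond to a word error rate of roughly 3%, assuming accuracy is reported as one minus the word error rate.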

According to the company, Scribe outperformed both Google Gemini 2.0 Flash and Whisper Large V3 across multiple languages on the FLEURS and Common Voice benchmarks.

While ElevenLabs had previously developed a speech-to-text component for its AI conversational agent platform, released last year, this marks the company’s first foray into releasing a standalone speech detection model. In a recent conversation with TechCrunch, CEO Mati Staniszewski emphasized the importance of enhancing speech detection models, highlighting the company’s unique advantage in having in-house teams for data annotation and rapid feedback.

Staniszewski expressed the company’s goal of improving its understanding of conversational speech, stating, “We want to understand what’s being said by you in a conversation better. We are working on ways to move away from only generating content and understanding and transcribing speech.” He also noted that while many consider speech-to-text to be a solved problem, the reality is that for numerous languages, the accuracy is still relatively poor, presenting an opportunity for ElevenLabs to develop more effective speech detection models.

The Scribe model incorporates advanced features such as smart speaker diarization, which enables the identification of individual speakers, timestamping at the word level for precise subtitles, and auto-tagging of sound events like audience laughter. Furthermore, the startup is providing customers with the capability to directly transcribe video content and add subtitles or captions within its studio environment.
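To show why word-level timestamps matter for subtitles, the Python sketch below groups timed words into SubRip (SRT) cues. The `Word` record and the fixed words-per-cue rule are hypothetical choices made for illustration; the article does not describe Scribe's actual output format, so the input shape here is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical word record: text plus start and end times in seconds."""
    text: str
    start: float
    end: float


def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, total_ms = divmod(total_ms, 3_600_000)
    minutes, total_ms = divmod(total_ms, 60_000)
    secs, millis = divmod(total_ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_to_srt(words: list[Word], max_words: int = 7) -> str:
    """Group consecutive words into numbered cues of at most max_words each."""
    cues = []
    for number, i in enumerate(range(0, len(words), max_words), start=1):
        chunk = words[i:i + max_words]
        # Each cue spans from its first word's start to its last word's end.
        timing = f"{srt_timestamp(chunk[0].start)} --> {srt_timestamp(chunk[-1].end)}"
        text = " ".join(w.text for w in chunk)
        cues.append(f"{number}\n{timing}\n{text}\n")
    return "\n".join(cues)


if __name__ == "__main__":
    sample = [Word("Hello", 0.0, 0.4), Word("world", 0.5, 0.9)]
    print(words_to_srt(sample))
```

A production subtitle tool would more likely break cues at pauses, sentence boundaries, or speaker changes; the fixed word count simply keeps the sketch short.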

Currently, Scribe works only with pre-recorded audio, and the company plans to release a low-latency, real-time version of the model in the near future. That version would make the model usable for applications such as meeting transcription and voice note-taking.

ElevenLabs has priced Scribe at $0.40 per hour of transcribed audio. The price is competitive, though slightly higher than what some rivals charge, and features vary from one offering to the next. As the speech-to-text market evolves, pricing and feature set are likely to be key factors in Scribe's adoption.
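As a rough illustration of what the $0.40 hourly rate means in practice, the Python sketch below estimates the cost of transcribing a recording of a given length. It assumes simple linear billing with no minimum charge, rounding rules, or volume discounts, none of which the article specifies, and the function name is hypothetical.

```python
def transcription_cost_usd(audio_seconds: float, rate_per_hour: float = 0.40) -> float:
    """Estimate cost assuming linear per-hour billing (an assumption)."""
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds / 3600 * rate_per_hour


if __name__ == "__main__":
    ninety_minutes = 90 * 60
    print(f"${transcription_cost_usd(ninety_minutes):.2f}")
```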

