AI Safety Working Group Releases World’s Largest Collection of Public Domain Voice Recordings for AI Research

A New Resource for AI Researchers

MLCommons, a nonprofit AI safety working group, has partnered with AI development platform Hugging Face to release one of the world’s largest collections of public domain voice recordings for AI research. The dataset, called Unsupervised People’s Speech, contains over a million hours of audio spanning at least 89 different languages.

Motivation Behind the Dataset

The dataset was created by MLCommons to support research and development across a range of speech technologies. The organization aims to bring communication technologies to more people globally by supporting broader natural language processing research for languages other than English.

Potential Risks and Challenges

While the dataset serves an admirable goal, it also carries risks for researchers who choose to use it. Biased data is a chief concern: the recordings in Unsupervised People's Speech came from Archive.org, the nonprofit best known for the Wayback Machine web-archival tool. Because many of Archive.org's contributors are English-speaking and American, almost all of the recordings are in American-accented English.

Risks of Biased Data

Without careful filtering, AI systems like speech recognition and voice synthesizer models trained on Unsupervised People’s Speech could exhibit prejudices. For example, they might struggle to transcribe English spoken by a non-native speaker or have trouble generating synthetic voices in languages other than English.
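To illustrate the kind of careful filtering described above, the sketch below caps the share of any single language in a corpus before training. The metadata schema and the `cap_dominant_language` helper are assumptions for illustration, not part of the released dataset:

```python
import random
from collections import defaultdict

# Hypothetical per-recording metadata; the real dataset's schema may differ.
records = [
    {"id": 1, "language": "en", "accent": "american"},
    {"id": 2, "language": "en", "accent": "american"},
    {"id": 3, "language": "en", "accent": "british"},
    {"id": 4, "language": "sw", "accent": None},
    {"id": 5, "language": "hi", "accent": None},
]

def cap_dominant_language(records, max_share=0.5, seed=0):
    """Subsample so no single language exceeds max_share of the original corpus size."""
    by_lang = defaultdict(list)
    for r in records:
        by_lang[r["language"]].append(r)
    rng = random.Random(seed)
    cap = max(1, int(len(records) * max_share))
    kept = []
    for lang, items in by_lang.items():
        if len(items) > cap:
            # Randomly downsample the over-represented language.
            items = rng.sample(items, cap)
        kept.extend(items)
    return kept

balanced = cap_dominant_language(records, max_share=0.4)
```

With `max_share=0.4`, the three English recordings are downsampled to two, while the Swahili and Hindi recordings are kept intact. Real pipelines would apply this kind of rebalancing (along with accent- and speaker-level checks) at a much larger scale.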

Potential Issues with Data Ownership

Unsupervised People’s Speech might also contain recordings from people unaware that their voices are being used for AI research, including for commercial applications. Although MLCommons claims that all the recordings are in the public domain or available under Creative Commons licenses, mistakes could have been made.

Lack of Transparency in AI Training Datasets

According to an MIT analysis, hundreds of publicly available AI training datasets lack licensing information and contain errors. Creator advocates, including Ed Newton-Rex, the CEO of the AI-ethics-focused nonprofit Fairly Trained, argue that creators shouldn’t be required to "opt out" of AI datasets because of the onerous burden that doing so imposes on them.

Conclusion

While Unsupervised People’s Speech is a valuable resource for AI researchers, it should be used with caution. MLCommons is committed to updating, maintaining, and improving the quality of the dataset, but developers should be aware of its potential flaws, guard against biased data, and verify that data-ownership issues are addressed.
