
Introduction to Ethical AI
AI companies have long claimed that their tools couldn’t exist without training on copyrighted material. A recent study challenges that claim: a team of AI researchers trained a new model, less powerful but more ethically sourced, using only public domain and openly licensed material.

The Study
The study, a collaboration among 14 institutions, including universities such as MIT, Carnegie Mellon, and the University of Toronto, and nonprofits such as the Vector Institute and the Allen Institute for AI, built an 8 TB ethically sourced dataset. The dataset included a set of 130,000 books from the Library of Congress. The team then trained a seven-billion-parameter large language model (LLM) on this data, and the resulting model performed comparably to Meta’s similarly sized Llama 2-7B from 2023.

Challenges in Creating an Ethical AI Model
Creating the model was not without its challenges. Much of the data was not machine-readable, so humans had to sift through it manually, and the team also had to determine which license applied to each website they scanned, making the process even more laborious. As co-author Stella Biderman noted, "We use automated tools, but all of our stuff was manually annotated at the end of the day and checked by people. And that’s just really hard."

Implications of the Study
While the resulting model is not as powerful as those trained on copyrighted material, it serves as a counterpoint to the claims made by AI companies. In 2024, OpenAI told a British parliamentary committee that such a model essentially couldn’t exist, and an Anthropic expert witness said that LLMs would likely not exist if AI firms were required to license the works in their training datasets. This study demonstrates otherwise and may be cited in future legal cases and regulatory debates.

Conclusion
The creation of this model may not change the trajectory of AI companies, which remain focused on building ever more powerful tools. It does, however, puncture one of the industry’s common arguments and sets a precedent for future research, highlighting the importance of weighing ethical considerations in AI development.
