Microsoft has initiated a research project aimed at assessing the impact of specific training examples on the output of generative AI models, including text, images, and other forms of media.
This information was revealed through a job listing, initially posted in December and recently recirculated on LinkedIn, which seeks a research intern for the project.
According to the job listing, the primary objective of the project is to develop a method that can efficiently estimate the influence of particular data, such as photographs and books, on the outputs generated by AI models.
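The listing does not describe a specific technique, but one common research approach to this problem is influence estimation, which scores how much each training example pushed the model toward a given output. As a minimal illustrative sketch only (assuming a linear model with squared loss, and using a TracIn-style gradient-similarity score; the function and toy data are hypothetical and are not Microsoft's method):

```python
import numpy as np

def influence_scores(X_train, y_train, w, x_test):
    """Score each training example's influence on the model's prediction
    for x_test, via the dot product of each example's loss gradient with
    the gradient of the test prediction (a first-order approximation)."""
    # For squared loss on a linear model y = w @ x, the per-example
    # gradient with respect to w is 2 * (w @ x_i - y_i) * x_i.
    residuals = X_train @ w - y_train
    train_grads = 2 * residuals[:, None] * X_train  # shape (n, d)
    # The gradient of the test prediction w @ x_test w.r.t. w is x_test.
    return train_grads @ x_test                     # shape (n,)

# Toy data: two training points, only the first aligned with the query.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([2.0, -1.0])
w = np.zeros(2)  # untrained weights, purely for illustration
scores = influence_scores(X, y, w, x_test=np.array([1.0, 0.0]))
```

Here the second training example gets a score of zero because its gradient is orthogonal to the test query, capturing the intuition that it contributed nothing to that particular output. Scaling such scores to billion-parameter generative models efficiently is precisely the open problem the listing describes.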
The listing highlights the current limitations of neural network architectures in providing transparency regarding the sources of their generated content, stating, “Current neural network architectures are opaque in terms of providing sources for their generations, and there are good reasons to change this.” It further emphasizes the importance of recognizing and potentially compensating individuals who contribute valuable data to the development of future models.
AI-powered generators have been at the center of numerous intellectual property lawsuits against AI companies, many of which rely on massive amounts of data scraped from public websites, including copyrighted material, to train their models. While these companies often argue that the fair use doctrine justifies the practice, creatives and copyright holders frequently disagree.
Microsoft itself is facing legal challenges from copyright holders, including a lawsuit filed by The New York Times in December, which alleges that Microsoft and its collaborator, OpenAI, infringed on The Times’ copyright by using models trained on millions of its articles. Additionally, several software developers have sued Microsoft, claiming that the company’s GitHub Copilot AI coding assistant was unlawfully trained using their protected works.
The research effort, referred to as “training-time provenance,” reportedly involves the participation of Jaron Lanier, a renowned technologist and interdisciplinary scientist at Microsoft Research. Lanier has written about the concept of “data dignity,” which involves connecting digital content with the humans who created it, and has proposed a system where the most unique and influential contributors to a model’s output could be identified and potentially compensated.
Several companies, including AI model developer Bria, Adobe, and Shutterstock, are already exploring ways to compensate data owners based on their contributions to AI models. However, few large labs have established programs to provide individual contributor payouts outside of licensing agreements with publishers and data brokers.
Whether Microsoft’s project will amount to anything remains to be seen; it may never yield significant results. OpenAI’s previously announced tool to let content creators opt out of AI training, for one, has yet to materialize.
Microsoft’s initiative may also be seen as an attempt to “ethics wash” or preempt regulatory and court decisions that could disrupt its AI business. The company’s actions come amid a broader debate on fair use and copyright protections in the context of AI development, with several top labs advocating for weakened copyright rules and the codification of fair use for AI training.
Microsoft did not immediately respond to a request for comment on the project.