Wikipedia’s Struggle with AI Crawlers
Wikipedia has been straining under the load of AI crawlers — bots that scrape text and multimedia from the encyclopedia to train generative AI models. The extra traffic drives up the Wikimedia Foundation's infrastructure costs and, in some cases, slows page loads for human readers. To relieve this pressure, the Wikimedia Foundation, which manages Wikipedia's data, is offering AI developers a dataset they can use freely, with the aim of diverting automated traffic away from the public Wikipedia website and reducing bandwidth consumption.
Collaboration with Kaggle
The organization has partnered with Kaggle — a data science platform owned by Google — to release a beta version of a structured dataset in English and French. The dataset is formatted for machine learning, making it more useful for training, development, and data science applications than raw scraped pages.
Dataset Details
Wikimedia Enterprise notes that the dataset includes abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections. It excludes references and other non-prose elements, such as video clips. While the absence of references may raise attribution concerns, Wikimedia Enterprise points out that the content, being sourced from Wikipedia, is freely licensed under Creative Commons, public domain, and other open licenses.
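To make the structure concrete, here is a minimal sketch of how one might parse a record with the kinds of fields described above (abstract, short description, infobox key-value data, image link, and segmented sections). The field names and the sample record are illustrative assumptions, not the actual Kaggle schema; consult the dataset's documentation for the real layout.

```python
import json

# Hypothetical JSON record illustrating the field types Wikimedia Enterprise
# describes. The keys ("abstract", "infobox", "sections", etc.) are
# assumptions for illustration, not the published schema.
sample_record = json.dumps({
    "name": "Jupiter",
    "description": "Fifth planet from the Sun",
    "abstract": "Jupiter is the fifth planet from the Sun.",
    "infobox": {"Mass": "1.898e27 kg", "Radius": "69,911 km"},
    "image": {"content_url": "https://example.org/jupiter.jpg"},  # placeholder URL
    "sections": [
        {"title": "Formation", "text": "Jupiter likely formed from the solar nebula."},
    ],
})

def summarize(line: str) -> dict:
    """Pull the prose-oriented fields out of one JSON record."""
    record = json.loads(line)
    return {
        "name": record.get("name"),
        "abstract": record.get("abstract"),
        "infobox_keys": sorted(record.get("infobox", {})),
        "section_titles": [s["title"] for s in record.get("sections", [])],
    }

summary = summarize(sample_record)
print(summary["name"], summary["infobox_keys"])
```

Because each record carries its prose in clearly labeled fields rather than wiki markup, a pipeline like this can feed article text into training or analysis without the HTML scraping that burdens Wikipedia's servers.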