Last week, the European digital sovereignty agenda received a significant boost with the announcement of a new initiative to develop a series of open source large language models (LLMs) that cover all European Union languages. This ambitious project, known as OpenEuroLLM, aims to create models that are not only highly accurate but also transparent and mindful of the linguistic and cultural diversity of the European region.
The OpenEuroLLM project is a collaboration among approximately 20 organizations, led by Jan Hajič, a computational linguist from Charles University in Prague, and Peter Sarlin, CEO and co-founder of Finnish AI lab Silo AI, which was acquired by AMD last year for $665 million. The project’s scope includes the development of LLMs for the current 24 official EU languages, as well as the languages of countries that are in the process of joining the EU, such as Albania.
OpenEuroLLM is part of a broader trend in Europe, where digital sovereignty has become a priority. This includes initiatives such as the development of local cloud infrastructure by major cloud providers, ensuring that EU data remains within the region. Additionally, companies like OpenAI have introduced data residency options for European customers, allowing them to process and store data locally.
The EU has also made significant investments in digital sovereignty, including an $11 billion deal to establish a sovereign satellite constellation, rivaling Elon Musk’s Starlink. In this context, OpenEuroLLM aligns with the EU’s goals, aiming to create open source LLMs that are tailored to European languages and cultures.
Although the project’s budget of €37.4 million, with €20 million coming from the EU’s Digital Europe Programme, may seem modest compared to the investments made by corporate AI giants, the partnership with EuroHPC supercomputer centers provides access to significant computational resources. The actual budget for the project is larger, considering the funding allocated for related work and the cost of compute resources.
However, some have raised concerns about the project’s feasibility, given the large number of participating organizations and the complexity of the task. Anastasia Stasenko, co-founder of LLM company Pleias, questioned whether a “sprawling consortia of 20+ organizations” could achieve the same level of focus as a private AI firm. She pointed to the success of smaller, focused teams like Mistral AI and LightOn, which have made significant advancements in AI.
Up to scratch
The OpenEuroLLM project is not starting from scratch, as it has a predecessor in the High Performance Language Technologies (HPLT) project, which has been developing free and reusable datasets, models, and workflows using high-performance computing (HPC) since 2022. Many of the partners from HPLT are also participating in OpenEuroLLM, providing a foundation of expertise and resources.
Jan Hajič, the project’s co-lead, expects the first versions of the LLMs to be released by mid-2026, with the final iterations arriving by 2028. The project’s partners include EuroHPC supercomputer centers in Spain, Italy, Finland, and the Netherlands, as well as organizations from academia and research in Czechia, the Netherlands, Germany, Sweden, Finland, and Norway.
Notable absentees from the project include French AI unicorn Mistral, which has positioned itself as an open source alternative to incumbents like OpenAI. Despite attempts to initiate discussions, Mistral did not participate in the project.
Build up
The OpenEuroLLM project aims to create a series of foundation models for transparent AI in Europe, preserving the linguistic and cultural diversity of all EU languages. The project’s deliverables are still being defined, but they will likely include a core multilingual LLM for general-purpose tasks and smaller, quantized versions for edge applications.
The project’s goals are ambitious, and the team is aware of the challenges involved in achieving equal quality across all languages, particularly those with scarce digital resources. To address this, the project will use benchmarks that are representative of each language and culture, rather than relying on generic benchmarks.
The project will leverage the HPLT dataset, which was built from 4.5 petabytes of web crawls and over 20 billion documents. Additional data from Common Crawl, an open repository of web-crawled data, will also be used to train the models.
The open source definition
The debate between open source and proprietary software is a perennial one, and the Open Source Initiative (OSI) maintains a formal definition of what counts as an open source license. Recently, the OSI has also defined open source AI, although not everyone agrees with the outcome. Open source AI proponents argue that not only the models but also the datasets, pretrained models, and weights should be freely available.
The OpenEuroLLM project faces similar challenges, and despite its intentions to be “truly open,” it may need to make compromises to fulfill its quality obligations. The project’s co-lead, Jan Hajič, acknowledged that while the goal is to have everything open, there may be limitations due to the European copyright directive and the need to use proprietary data.
Two for one
Another criticism of the OpenEuroLLM project is that it bears similarities to an existing project, EuroLLM, which launched its first model in September and a follow-up in December. EuroLLM is also co-funded by the EU and shares similar goals, including building an open source European Large Language Model that supports 24 official European languages.
Andre Martins, head of research at Unbabel, pointed out the similarities between the two projects, suggesting that OpenEuroLLM is appropriating a name that already exists. Hajič called the situation “unfortunate” and expressed hope for potential cooperation between the two projects, although he noted that OpenEuroLLM’s funding restrictions limit its ability to collaborate with non-EU entities.
Funding gap
The arrival of China’s DeepSeek has sparked discussions about the cost-to-performance ratio of AI initiatives. While some have questioned the true costs involved in building DeepSeek, Peter Sarlin, technical co-lead on the OpenEuroLLM project, believes that the project’s funding is sufficient, since it primarily needs to cover personnel costs.
Sarlin noted that the project’s partnership with EuroHPC centers provides access to significant computational resources, which will help cover the costs of building the models. He also emphasized that the project’s goal is not to build a consumer- or enterprise-grade product but rather to create open source foundation models that can serve as AI infrastructure for European companies.
Sovereign state
As critics have noted, OpenEuroLLM has a lot of moving parts, which Hajič acknowledges as a challenge. However, he believes that the combination of academic expertise and corporate focus can bring something new and valuable to the table.
The ultimate goal of OpenEuroLLM is digital sovereignty, with the creation of open source LLMs that are built by and for Europe. While it may not be about outmaneuvering Big Tech or billion-dollar AI startups, the project would still count as a positive result even if its model is not number one. Having a “good” model with all components based in Europe would be a significant achievement, promoting digital sovereignty and reducing dependence on external AI solutions.