TL;DR
Italy’s Minerva-3B, a European sovereign large language model trained from scratch, achieved high technical performance but scored near chance on Italian school benchmarks. This challenges assumptions about scale and investment in country-specific AI models.
Italy’s Minerva-3B, part of a model family trained from scratch on up to 2.5 trillion tokens with approximately 50% Italian content, scored only 4.9% on the INVALSI Italian school-exam benchmark, despite a sound technical design. The result highlights a significant challenge for the European sovereign-LLM strategy: scale alone may not produce deep country-specific knowledge.
The Minerva project, led by Sapienza University of Rome and supported by Italy’s national research infrastructure, built a family of models from scratch, up to 7 billion parameters, with the largest trained on a dataset of 2.5 trillion tokens, roughly half of which were Italian. The project aimed to demonstrate that a domestic, open-weights LLM could outperform multilingual models on Italian benchmarks and serve as a model for European AI sovereignty.
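The headline figures above can be turned into two rough quantities: the approximate number of Italian tokens and the number of training tokens per parameter. The sketch below is a back-of-the-envelope calculation only, using the article's rounded numbers; the constant and function names are illustrative, not taken from the Minerva codebase.

```python
# Rough arithmetic on the figures quoted in the article (all values rounded).
TOTAL_TOKENS = 2.5e12   # 2.5 trillion training tokens
ITALIAN_SHARE = 0.50    # "approximately 50%" Italian content
PARAMS = 7e9            # 7-billion-parameter model


def italian_tokens(total_tokens: float, share: float) -> float:
    """Approximate number of Italian tokens in the training corpus."""
    return total_tokens * share


def tokens_per_parameter(total_tokens: float, params: float) -> float:
    """Training tokens seen per model parameter."""
    return total_tokens / params


if __name__ == "__main__":
    it = italian_tokens(TOTAL_TOKENS, ITALIAN_SHARE)
    ratio = tokens_per_parameter(TOTAL_TOKENS, PARAMS)
    print(f"Italian tokens: about {it:.2e}")
    print(f"Tokens per parameter: about {ratio:.0f}")
```

On these rounded inputs, the corpus holds roughly 1.25 trillion Italian tokens, and the 7B model sees a few hundred tokens per parameter.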
However, despite its technical achievements, Minerva-3B scored just 4.9% on the INVALSI Italian academic-content tests, a near-chance result that indicates a disconnect between training data scale and actual language understanding in complex tasks. Researchers noted that while dataset composition matters, overall size and parameters are more critical for handling complex language tasks.
This empirical finding suggests that the substantial investment in native-language data and infrastructure may still be insufficient at the current parameter scales, raising questions about the optimal investment levels needed for meaningful country-specific AI capabilities. The results challenge the assumption that larger, native-language models automatically translate into deeper language and knowledge understanding.
Minerva. The opposite path.
Italy spent years building a European sovereign LLM from scratch. Then Minerva-3B scored 4.9% on the INVALSI Italian school exam.
Where AMÁLIA layered Portuguese specialization onto a multilingual foundation, Minerva trained from scratch on 2.5 trillion tokens with approximately 50% Italian content. Where AMÁLIA’s weights are not yet public, Minerva published its weights, training data, and code openly from day one. By every institutional measure, the Italian approach worked. But the empirical results contain a finding the press coverage has been quiet about, and its implications extend well beyond Italy.
Same problem. Opposite path.
European sovereign-LLM development has two primary architectural approaches. Italy chose to train from scratch on a substantial native-language foundation. Portugal chose continued pre-training of an existing multilingual model. Comparing the two shows what each commitment actually requires operationally.
The comparison is not “Italy did it better than Portugal.” Both projects respond to the same structural problem with different architectural strategies under different institutional and economic constraints. Italy’s national-AI investment is structurally larger by an order of magnitude — and Minerva is the visible artifact of that scale.
4.9% on INVALSI. The bitter lesson surfaces.
In June 2024, researchers evaluated Minerva-3B on the INVALSI Italian school-exam benchmark. The result was unambiguous: 4.9%. This is not a critique of Minerva. It is a critique of the public discourse around what Minerva’s empirical results actually demonstrate.
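Whether a benchmark score counts as near chance depends on the answer format and the number of items, neither of which the article states. The sketch below shows one common way to make the comparison: compute the random-guessing baseline for k-option multiple choice and a Wilson confidence interval around the observed accuracy. The item count and option count in the usage lines are hypothetical placeholders, not properties of the INVALSI benchmark.

```python
import math


def chance_accuracy(num_options: int) -> float:
    """Expected accuracy of uniform random guessing on k-option items."""
    if num_options < 2:
        raise ValueError("need at least two answer options")
    return 1.0 / num_options


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("require 0 <= correct <= total and total > 0")
    p = correct / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical: 49 correct out of 1000 items (4.9%), 4 options per item.
    low, high = wilson_interval(correct=49, total=1000)
    baseline = chance_accuracy(num_options=4)
    print(f"observed interval: [{low:.3f}, {high:.3f}], chance baseline: {baseline:.3f}")
    print("baseline inside interval:", low <= baseline <= high)
```

If the baseline falls inside the interval, the score cannot be told apart from guessing. If it falls outside, the score is measurably above or below chance under these assumptions.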
350M to 7B. Four parameter scales, one architecture.
The Minerva model family covers four parameter tiers, each with specific training corpora. Each scale level reveals what the from-scratch path actually requires at different operating points.
[Table not recovered: Minerva model tiers and their training corpora. Surviving entries: Italian + English; 100B English; ~50% English; + 200B code.]
Three answers. Same question.
Minerva, AMÁLIA, and OpenEuroLLM represent the three operational answers to the European sovereign-LLM question. Each makes different architectural and institutional bets. The strategic discourse benefits from treating all three as data points in the same empirical experiment.
Three standards the movement should adopt.
The structural critique generalizes beyond Minerva. The European sovereign-LLM movement benefits from internalizing these lessons across every subsequent national project. Italy modeled the openness standard; the movement should adopt it as norm.
Minerva is one valid answer to the European sovereign-LLM question. AMÁLIA is another. OpenEuroLLM is potentially a third. The strategic discourse benefits from treating all three as data points in the same empirical experiment rather than as competing national-prestige projects. More analysis like this is needed. Not less.
Implications for European Sovereign-LLM Strategies
The performance gap revealed by Minerva-3B indicates that European efforts to develop domestic LLMs cannot rely solely on scale and native-language data. It underscores the need for a reevaluation of investment levels and architectural strategies to achieve genuine country-specific knowledge and capabilities. This finding has broad implications for policymakers and AI developers across Europe, emphasizing that scale alone may not suffice for complex language understanding and academic competence.
European Sovereign-LLM Development Approaches and Challenges
Italy’s Minerva project represents a significant investment in building a large-scale, open-weights LLM from scratch, contrasting with approaches like Portugal’s AMÁLIA, which layered specialization onto multilingual foundations. Minerva trained on a dataset of 2.5 trillion tokens, with approximately 50% Italian content, and was supported by Italy’s national research infrastructure and funding programs.
Despite these efforts, Minerva’s performance on academic benchmarks has been disappointing, prompting a broader discussion on the effectiveness of scaling and data composition strategies in European sovereign AI projects. The project’s empirical results suggest that current scaling may not be sufficient to produce truly country-specific language understanding, challenging assumptions underlying many national AI initiatives.
This situation underscores a broader debate within the European AI community about the optimal balance between data, model size, and architectural design to meet national and linguistic needs effectively.
“The results suggest that dataset composition and size are critical, but current parameter scales still fall short for complex language tasks in specific languages.”
— Research team, Minerva project
Unresolved Questions About Scale and Effectiveness
It remains unclear what scale and investment thresholds are needed to achieve meaningful country-specific language understanding. The performance of Minerva-3B raises the question of whether larger models or different training strategies could overcome current limitations, but definitive answers are not yet available.
Next Steps for European Sovereign AI Development
The Minerva team is continuing to iterate on their methodology, including ongoing experiments in continual training and larger models. Policymakers and researchers are expected to reassess investment strategies and architectural approaches based on these empirical insights, aiming to bridge the gap between technical performance and real-world language understanding.
Key Questions
Why did Minerva-3B perform poorly on Italian academic tests?
Despite extensive native-language training data, the empirical results suggest that data volume alone is insufficient for complex language understanding at a 3-billion-parameter scale. The model may lack the nuanced knowledge and reasoning capabilities needed for academic content, indicating a need for different training approaches or larger models.
Does this mean European sovereign models are ineffective?
Not necessarily. The results highlight the challenges of scale and data composition, and ongoing research aims to identify the most effective strategies. They also show that achieving deep language and domain knowledge may require more than larger datasets and models.
What are the implications for future AI investments in Europe?
The findings suggest that European AI initiatives should consider increasing investment in model scale, data quality, and architectural innovation. Simply scaling up current models may not suffice to meet national language and knowledge needs.
Source: ThorstenMeyerAI.com