The AI Alliance Russia launches MERA Code, the first open benchmark for evaluating code generation on applied programming tasks in Russian
The AI Alliance has unveiled a new evaluation tool, MERA Code: the first comprehensive open benchmark for evaluating large language models (LLMs) in applied programming tasks in Russian. The development of the benchmark involved teams from Sber AI, T-Bank, MWS AI (part of MTS Web Services), Rostelecom, Innopolis University, ITMO University, Skoltech, Central University, and Siberian Neuronets.
With the advancement of large language models, developers are increasingly using AI tools for code generation, task automation, and documentation processing. However, until now there has been no unified way to assess how well these models handle practical tasks in a Russian-language environment. MERA Code is a significant step toward standardized, objective evaluation of LLMs on Russian programming tasks: it helps determine how useful and effective modern LLMs truly are for real-world work in the Russian market.
Key Features of MERA Code:
- A transparent LLM evaluation methodology for Russian: the first standard that accounts for the specifics of task formulation and documentation in Russian.
- Tasks and evaluation methods reflect typical cases encountered by programmers in a Russian-language environment.
- 11 diverse tasks in text2code, code2text, and code2code formats (see the task-format sketch after this list), covering 8 programming languages: Python, Java, C#, JavaScript, Go, C, C++, and Scala.
- Fair testing: generated code is executed in isolated environments rather than merely being scored as text (illustrated in the execution sketch below).
- An open platform with an end-to-end scoring system, leaderboard, and a user-friendly testing framework.
- Analysis and results cover both open general-purpose models and proprietary code-generation APIs.
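To make the three task formats concrete, the sketch below shows what a text2code task instance could look like. This is a hypothetical illustration: the field names and structure are assumptions for this article, not the actual MERA Code data schema.

```python
# Hypothetical text2code task instance (field names are illustrative
# assumptions, not the actual MERA Code schema).
text2code_task = {
    "task_type": "text2code",    # the other formats: "code2text", "code2code"
    "language": "python",        # one of the 8 supported languages
    "instruction": (
        "Напишите функцию sum_even(xs), которая возвращает сумму "
        "чётных чисел списка."
        # "Write a function sum_even(xs) that returns the sum of the
        # even numbers in a list."
    ),
    "tests": [
        "assert sum_even([1, 2, 3, 4]) == 6",
        "assert sum_even([]) == 0",
    ],
}
```

A code2text instance would invert this (source code in, a Russian description or docstring out), while code2code would pair a source snippet with a target snippet, e.g. for translation between languages or bug fixing.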
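Because submissions are executed rather than string-matched, the scoring step conceptually looks like the sketch below: run the candidate solution together with a test in a separate process under a timeout. This is a simplified, assumed illustration of execution-based evaluation in general, not MERA Code's actual harness, which isolates execution far more strictly.

```python
import subprocess
import sys
import tempfile

def passes_test(solution_code: str, test_snippet: str, timeout: float = 5.0) -> bool:
    """Run a generated solution plus one test in a child process.

    A non-zero exit code (failed assert, crash) or a timeout counts as
    a failure. A simplified stand-in for real sandboxed execution.
    """
    # Write the solution and the test into a single temporary script.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_code + "\n" + test_snippet + "\n")
        path = f.name
    try:
        result = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout,  # guard against infinite loops in generated code
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

# Example: check one model-generated solution against one test.
solution = "def sum_even(xs):\n    return sum(x for x in xs if x % 2 == 0)"
print(passes_test(solution, "assert sum_even([1, 2, 3, 4]) == 6"))  # True
```

Executing code this way catches solutions that look plausible as text but fail at runtime, which is exactly what text-similarity metrics miss.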
The MERA Code tool will be valuable for:
- Developers and engineers, enabling them to choose the most effective models for their projects.
- Researchers, who can objectively compare models under uniform conditions.
- Companies, which can make informed decisions based on open and transparent data about LLM quality.