24 Sep 2025

AI Alliance Launches Dynamic SWE-MERA Benchmark for Evaluating Code Models

 

The AI Alliance has expanded its benchmark lineup with a new tool: SWE-MERA, a dynamic benchmark designed for comprehensive evaluation of coding models on tasks close to real development conditions. SWE-MERA is the result of a collaboration among leading Russian AI teams: MWS AI (part of MTS Web Services), Sber, and ITMO University.

Like the MERA CODE benchmark, SWE-MERA assesses code models, but it takes a fundamentally different approach, and its key advantage is its dynamic nature. Unlike classical static benchmarks, SWE-MERA is automatically and regularly replenished with fresh, relevant tasks and proposed changes drawn from public GitHub repositories. This makes it possible to evaluate and retrain models on the newest data, as close as possible to real development conditions.

Key Features of SWE-MERA:

·      Dynamism and Relevance: An automated data collection pipeline ensures constant updating of the task set, preventing benchmark obsolescence and minimizing the risk of model overfitting.

·      Data Contamination Protection: A unique leaderboard feature lets users select tasks from specific time periods. This makes it easier to identify models whose results may have been inflated by test data ending up in their training sets (a minimal sketch of this time-window filtering follows the list).

·      Automated Methodology: The evaluation process includes careful task selection, filtering using an LLM-as-a-judge approach, and solution verification through a reliable testing framework, ensuring high result reliability.

·      Scalability: The number of tasks will be increased severalfold to cover a broader range of scenarios.
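As a minimal sketch of the contamination-protection idea from the list above (not SWE-MERA's actual implementation; the Task record and select_uncontaminated helper are illustrative names), filtering tasks by the date they appeared on GitHub lets a model be scored only on material it could not have seen during training:

from dataclasses import dataclass
from datetime import date

@dataclass
class Task:
    repo: str          # GitHub repository the task was mined from
    created_at: date   # date the underlying pull request was opened
    task_id: str

def select_uncontaminated(tasks: list[Task], training_cutoff: date) -> list[Task]:
    """Keep only tasks created after the model's training-data cutoff,
    so scores cannot be inflated by memorized test data."""
    return [t for t in tasks if t.created_at > training_cutoff]

# Example: a model whose training data ends on 1 March 2025 is scored
# only on tasks mined after that date.
tasks = [
    Task("org/repo-a", date(2025, 2, 10), "a-101"),
    Task("org/repo-b", date(2025, 6, 5), "b-207"),
]
print([t.task_id for t in select_uncontaminated(tasks, date(2025, 3, 1))])  # ['b-207']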

Plans include further expanding the task base, extending coverage to five programming languages (C++, Java, JavaScript, TypeScript, and Go), and developing the leaderboard for deeper and more objective model evaluation.

SWE-MERA was created as an open tool for the community; it complements existing practices and could become a standard for evaluating code models. The benchmark will help researchers and developers avoid the stagnation that comes from models memorizing a fixed task set, making evaluation more objective, dynamic, and close to real development conditions.

Developers can test their own models by following the instructions.
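The exact submission workflow is described in those instructions. As a rough, hypothetical sketch of what such an evaluation measures (assuming SWE-MERA, like other SWE-bench-style benchmarks, counts a task as resolved when the project's tests pass after the model's patch is applied; generate_patch and run_tests stand in for a user's own model backend and test harness):

from typing import Callable, Dict, List

def resolve_rate(tasks: List[Dict],
                 generate_patch: Callable[[Dict], str],
                 run_tests: Callable[[Dict, str], bool]) -> float:
    """Share of tasks whose test suite passes after applying the
    model-generated patch."""
    if not tasks:
        return 0.0
    resolved = sum(run_tests(task, generate_patch(task)) for task in tasks)
    return resolved / len(tasks)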

The SWE-MERA benchmark will be presented this year at EMNLP, one of the leading conferences in natural language processing and artificial intelligence.

You can read more about the SWE-MERA project in the article.

 

Valentin Malykh, Head of Fundamental Research at MWS AI (part of MTS Web Services):

"The agent approach to coding is actively discussed today. Unlike ordinary generation, when the model immediately produces a ready fragment, the agent acts like a developer: setting goals, breaking the task into steps, writing and checking code, fixing errors, and gradually arriving at a working solution. Currently, static benchmarks are used to evaluate models, but they quickly become outdated and create a risk of overfitting on the open code from which the benchmark was compiled. Therefore, we propose a benchmark format that can be regularly updated. This approach better reflects real-world agent system scenarios and allows for more accurate assessment of how well models cope with coding in changing conditions."

Sergey Markov, Director of AI Development and Head of Sberbank’s AI Department:

"The task of objectively evaluating modern generative models working with code is of great practical importance. Although a number of specialized benchmarks have been developed in recent years, in the fast-paced AI race they become quickly outdated, suffer from data leaks, and do not always reflect the realities of practical development well. Creating dynamic benchmarks aims to address these challenges. We hope that in the near future generative models will significantly contribute to improving their own codebase, which will gradually expand the models’ capabilities. This makes the task of dynamic benchmarking of code models even more relevant."

 

***

SWE-MERA is a dynamic benchmark developed by the AI Alliance for comprehensive evaluation of coding models on real programming tasks. The Alliance also offers the MERA CODE benchmark, a static evaluation suite for code models.

The MERA benchmark was first presented at the international AI Journey conference in 2023. The test methodology was subsequently presented at ACL, the leading scientific conference on computational linguistics, held since 1963 and supported by major IT companies worldwide, including Apple, Google DeepMind, Baidu, IBM, and others. In summer 2025, the MERA benchmark also introduced an industry branch, MERA INDUSTRIAL.