Task description
RealCode is a benchmark for evaluating the ability of language models to generate function bodies in real-world Python repositories. The benchmark focuses on realistic completions that use project-level context and validates correctness through test execution. The dataset contains 802 tasks.
Evaluated skills: Instruction Following, Code Perception, Completion
Contributors: Pavel Zadorozhny, Rodion Levichev, Pavel Adamenko, Aidar Valeev, Dmitrii Babaev, Denis Kokosinskiy
Motivation
This dataset tests how well models can:
- Generate function bodies based on surrounding code context;
- Integrate into existing Python projects;
- Pass real unit tests after insertion.
The main evaluation metric is pass@k, computed by executing repository-specific tests inside Docker containers.
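For reference, pass@k over n generated samples (of which c pass the tests) is commonly computed with the unbiased estimator introduced by Chen et al. (2021). The sketch below illustrates that standard formula; it is not the benchmark's own harness code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n total samples (c of them correct) passes all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples generated for a task, 3 of them pass the repository tests.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```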
Data description
Data fields
Each dataset question includes data in the following fields:
- `instruction` [str] — string containing the task formulation for function body generation;
- `inputs` — input data that forms the task for the model; can include one or several modalities (video, audio, image, text):
  - `left_context` [str] — code appearing before the target function;
- `outputs` [str] — one-dimensional array of strings of size n_samples, where n_samples is the number of generations required to compute pass@k;
- `meta` — metadata related to the test example, not used in the question (hidden from the tested model):
  - `id` [int] — unique identifier of the example;
  - `repo` [str] — GitHub repository name the task is taken from;
  - `base_commit` [str] — commit hash fixing the repository state;
  - `gt` [str] — ground-truth function body (no signature);
  - `right_context` [str] — code appearing after the target function;
  - `left_context` [str] — code appearing before the target function;
  - `image_name` [str] — Docker image for running the project;
  - `build_command` [str] — command to build the project before running the tests;
  - `test_command` [str] — command to run the tests;
  - `fn` [str] — path to the file containing the function;
  - `PASS_TO_PASS` [list] — tests that pass with the generated function;
  - `FAIL_TO_PASS` [list] — tests that used to fail and now pass;
  - `intent` [str] — function or method name;
  - `intent_type` [str] — element type (function, class, etc.)
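For illustration only, a single record could look roughly like the sketch below. All values are invented and abridged; they are not taken from the actual dataset.

```python
# Hypothetical, abridged RealCode record (all values invented for illustration).
example = {
    "instruction": "Continue only the body of the function shown in the context ...",
    "inputs": {
        "left_context": "import os\n\ndef read_config(path):\n",
    },
    "outputs": ["    with open(path) as f:\n        return f.read()\n"],
    "meta": {
        "id": 0,
        "repo": "org/example-repo",        # hypothetical repository name
        "base_commit": "abc123",            # hypothetical commit hash
        "gt": "    with open(path) as f:\n        return f.read()\n",
        "right_context": "\ndef main():\n    ...\n",
        "left_context": "import os\n\ndef read_config(path):\n",
        "image_name": "python:3.11",        # hypothetical image
        "build_command": "pip install -e .",
        "test_command": "pytest tests/",
        "fn": "src/config.py",
        "PASS_TO_PASS": ["tests/test_config.py::test_defaults"],
        "FAIL_TO_PASS": ["tests/test_config.py::test_read_config"],
        "intent": "read_config",
        "intent_type": "function",
    },
}
```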
Prompts
For this task, 10 prompts were prepared and evenly distributed among the questions on the principle of "one prompt per question". The placeholders in curly braces in each prompt are filled in from the fields of the `inputs` field of each question.
Prompt example:
Format your answer as follows: ```python
<code>```
Context:
{left_context}
Required: continue only the body of one function. Strictly follow Python indentation. Do not add any extra text and do not write other functions. Your generation will be inserted immediately after the context and run against the tests.
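A minimal sketch of how such a template could be filled from a record shaped as in the data fields above; the template string and helper name here are hypothetical stand-ins, not one of the 10 actual prompts.

```python
# Hypothetical, abbreviated prompt template; the benchmark's 10 real prompts differ.
PROMPT_TEMPLATE = (
    "Context:\n{left_context}\n"
    "Required: continue only the body of one function. "
    "Strictly follow Python indentation."
)

def build_prompt(sample: dict) -> str:
    # Placeholders in curly braces are filled from the record's `inputs` field.
    return PROMPT_TEMPLATE.format(**sample["inputs"])

# Usage with a record shaped as described above:
# prompt = build_prompt(example)
```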
Dataset creation
The benchmark is built from 95 public Python GitHub repositories created in 2024. There are 802 tasks in total: for each sample, a function is extracted along with its surrounding code (`left_context`) and evaluated on whether the generated body passes the original unit tests. All examples come from real repositories and are reproducibly executable.
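Conceptually, evaluation reassembles the source file from the contexts and the generated body, builds the project, and runs the repository tests. The sketch below illustrates that flow with a hypothetical local helper; the benchmark itself executes these steps inside the task's Docker image, not via a bare subprocess call.

```python
import subprocess  # hypothetical local runner; the benchmark runs inside Docker containers

def evaluate_sample(sample: dict, generated_body: str, repo_dir: str) -> bool:
    meta = sample["meta"]
    # Reassemble the target file: context before + generated body + context after.
    new_source = meta["left_context"] + generated_body + meta["right_context"]
    with open(f"{repo_dir}/{meta['fn']}", "w") as f:
        f.write(new_source)
    # Build the project, then run the repository's tests.
    build = subprocess.run(meta["build_command"], shell=True, cwd=repo_dir)
    if build.returncode != 0:
        return False
    tests = subprocess.run(meta["test_command"], shell=True, cwd=repo_dir)
    return tests.returncode == 0
```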
Metrics
Metrics for aggregated evaluation of responses:
- Pass@1: fraction of tasks where at least one generation passes all tests;
- execution_success: fraction of tasks where the project built and the tests executed without failure.
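A minimal sketch of how these two aggregate numbers could be computed from per-task results, assuming a hypothetical results list in which each entry records whether the build/test run executed and which generations passed:

```python
# Each entry is a hypothetical per-task result:
#   "executed": the project built and the tests ran without failure,
#   "passed":   per-generation booleans (True if all repository tests passed).
results = [
    {"executed": True, "passed": [True]},
    {"executed": True, "passed": [False]},
    {"executed": False, "passed": []},
]

pass_at_1 = sum(any(r["passed"]) for r in results) / len(results)
execution_success = sum(r["executed"] for r in results) / len(results)
print(pass_at_1, execution_success)  # 0.333..., 0.666...
```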