Task description
RealCode is a benchmark for evaluating the ability of language models to generate function bodies in real-world Python repositories. The benchmark focuses on realistic completions that use project-level context and validates correctness through test execution. The dataset contains 802 tasks.
Evaluated skills: Instruction Following, Code Perception, Completion
Contributors: Pavel Zadorozhny, Rodion Levichev, Pavel Adamenko, Aidar Valeev, Dmitrii Babaev, Denis Kokosinskiy
Motivation
This dataset tests how well models can:
- Generate function bodies based on surrounding code context;
- Integrate into existing Python projects;
- Pass real unit tests after insertion.
The main evaluation metric is pass@k, computed by executing repository-specific tests inside Docker containers.
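For reference, pass@k over n generated samples (of which c pass the tests) is commonly computed with the unbiased estimator introduced by Chen et al. (2021). The sketch below illustrates that standard formula; it is not the benchmark's own harness code.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k samples
    drawn from n total samples (c of them correct) passes all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples generated for a task, 3 of them pass the repository tests.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```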
Data description
Data fields
Each dataset question includes data in the following fields:
- `instruction` [str] — string containing the task formulation for function body generation;
- `inputs` — input data that forms the task for the model; can include one or several modalities (video, audio, image, text):
  - `left_context` [str] — code appearing before the target function;
- `outputs` [str] — one-dimensional array of strings of size n_samples, where n_samples is the number of generations required to compute pass@k;
- `meta` — metadata related to the test example, not used in the question (hidden from the tested model):
  - `id` [int] — unique identifier of the example;
  - `repo` [str] — GitHub repository name the task is taken from;
  - `base_commit` [str] — commit hash fixing the repository state;
  - `gt` [str] — ground-truth function body (no signature);
  - `right_context` [str] — code appearing after the target function;
  - `left_context` [str] — code appearing before the target function;
  - `image_name` [str] — Docker image for running the project;
  - `build_command` [str] — command to build the project before running the tests;
  - `test_command` [str] — command to run the tests;
  - `fn` [str] — path to the file containing the function;
  - `PASS_TO_PASS` [list] — tests that pass with the generated function;
  - `FAIL_TO_PASS` [list] — tests that used to fail and now pass;
  - `intent` [str] — function or method name;
  - `intent_type` [str] — element type (function, class, etc.)
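For illustration only, a single record could look roughly like the sketch below. All values are invented and abridged; they are not taken from the actual dataset.

```python
# Hypothetical, abridged RealCode record (all values invented for illustration).
example = {
    "instruction": "Continue only the body of the function shown in the context ...",
    "inputs": {
        "left_context": "import os\n\ndef read_config(path):\n",
    },
    "outputs": ["    with open(path) as f:\n        return f.read()\n"],
    "meta": {
        "id": 0,
        "repo": "org/example-repo",        # hypothetical repository name
        "base_commit": "abc123",            # hypothetical commit hash
        "gt": "    with open(path) as f:\n        return f.read()\n",
        "right_context": "\ndef main():\n    ...\n",
        "left_context": "import os\n\ndef read_config(path):\n",
        "image_name": "python:3.11",        # hypothetical image
        "build_command": "pip install -e .",
        "test_command": "pytest tests/",
        "fn": "src/config.py",
        "PASS_TO_PASS": ["tests/test_config.py::test_defaults"],
        "FAIL_TO_PASS": ["tests/test_config.py::test_read_config"],
        "intent": "read_config",
        "intent_type": "function",
    },
}
```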
Prompts
For this task, 10 prompts were prepared and evenly distributed among the questions on the principle of "one prompt per question". The placeholders in curly braces in each prompt are filled in from the fields of the `inputs` field of each question.
Prompt example:
Format your answer as follows: ```python
<code>```
Context:
{left_context}
Required: continue only the body of one function. Strictly follow Python indentation. Do not add any extra text and do not write other functions. Your generation will be inserted immediately after the context and run against the tests.
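A minimal sketch of how such a template could be filled from a record shaped as in the data fields above; the template string and helper name here are hypothetical stand-ins, not one of the 10 actual prompts.

```python
# Hypothetical, abbreviated prompt template; the benchmark's 10 real prompts differ.
PROMPT_TEMPLATE = (
    "Context:\n{left_context}\n"
    "Required: continue only the body of one function. "
    "Strictly follow Python indentation."
)

def build_prompt(sample: dict) -> str:
    # Placeholders in curly braces are filled from the record's `inputs` field.
    return PROMPT_TEMPLATE.format(**sample["inputs"])

# Usage with a record shaped as described above:
# prompt = build_prompt(example)
```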
Dataset creation
The benchmark is built from 95 public Python GitHub repositories created in 2024. There are 802 tasks in total: for each sample, a function is extracted along with its surrounding code (`left_context`) and evaluated on whether the generated body passes the original unit tests. All examples come from real repositories and are reproducibly executable.
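Conceptually, evaluation reassembles the source file from the contexts and the generated body, builds the project, and runs the repository tests. The sketch below illustrates that flow with a hypothetical local helper; the benchmark itself executes these steps inside the task's Docker image, not via a bare subprocess call.

```python
import subprocess  # hypothetical local runner; the benchmark runs inside Docker containers

def evaluate_sample(sample: dict, generated_body: str, repo_dir: str) -> bool:
    meta = sample["meta"]
    # Reassemble the target file: context before + generated body + context after.
    new_source = meta["left_context"] + generated_body + meta["right_context"]
    with open(f"{repo_dir}/{meta['fn']}", "w") as f:
        f.write(new_source)
    # Build the project, then run the repository's tests.
    build = subprocess.run(meta["build_command"], shell=True, cwd=repo_dir)
    if build.returncode != 0:
        return False
    tests = subprocess.run(meta["test_command"], shell=True, cwd=repo_dir)
    return tests.returncode == 0
```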
Metrics
Metrics for aggregated evaluation of responses:
- Pass@1: fraction of tasks where at least one generation passes all tests;
- execution_success: fraction of tasks where the project built and the tests executed without failure.
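A minimal sketch of how these two aggregate numbers could be computed from per-task results, assuming a hypothetical results list in which each entry records whether the build/test run executed and which generations passed:

```python
# Each entry is a hypothetical per-task result:
#   "executed": the project built and the tests ran without failure,
#   "passed":   per-generation booleans (True if all repository tests passed).
results = [
    {"executed": True, "passed": [True]},
    {"executed": True, "passed": [False]},
    {"executed": False, "passed": []},
]

pass_at_1 = sum(any(r["passed"]) for r in results) / len(results)
execution_success = sum(r["executed"] for r in results) / len(results)
print(pass_at_1, execution_success)  # 0.333..., 0.666...
```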