Task description
RealCodeJava is a benchmark for evaluating the ability of language models to generate function bodies in real-world Java repositories. The benchmark focuses on realistic completions using project-level context and validates correctness through test execution. The dataset contains 298 tasks.
Evaluated skills: Instruction Following, Code Perception, Completion
Contributors: Dmitry Vorobiev, Pavel Zadorozhny, Rodion Levichev, Pavel Adamenko, Aidar Valeev, Dmitry Salikhov, Dmitrii Babaev
Motivation
This dataset tests how well models can:
- Generate function bodies based on surrounding code context;
- Integrate generated code into existing Java projects;
- Produce code that passes real unit tests after insertion.
The main evaluation metric is pass@k, computed via execution of repository-specific tests inside Docker containers.
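For reference, here is a minimal sketch of the commonly used unbiased pass@k estimator (an assumption of this sketch; the benchmark's own aggregation code is not reproduced in this card), where n is the number of generated samples for a task and c is the number of those samples that pass all tests:
```java
// Minimal sketch of the unbiased pass@k estimator:
// pass@k = 1 - C(n - c, k) / C(n, k), averaged over all tasks,
// where n is the number of samples per task and c the number that pass all tests.
public final class PassAtK {

    /** Per-task estimate of pass@k from n generated samples, c of which passed. */
    public static double estimate(int n, int c, int k) {
        if (n - c < k) {
            return 1.0; // every size-k subset of samples contains a passing one
        }
        double failProb = 1.0;
        for (int i = n - c + 1; i <= n; i++) {
            // incrementally compute C(n - c, k) / C(n, k) without large binomials
            failProb *= 1.0 - (double) k / i;
        }
        return 1.0 - failProb;
    }

    public static void main(String[] args) {
        // Example: 10 samples generated, 3 passed all tests.
        System.out.printf("%.3f%n", estimate(10, 3, 1)); // pass@1 = 0.300
    }
}
```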
Data description
Data fields
Each dataset question includes data in the following fields (a sketch of a matching record type is shown after the list):
- instruction [str] — string containing the task formulation for function body generation;
- inputs — input data that forms the task for the model; can include one or multiple modalities (video, audio, image, text):
  - left_context [str] — code appearing before the target function;
- outputs [str] — one-dimensional array of strings of size n_samples, where n_samples is the number required to compute pass@k;
- meta — metadata related to the test example, not used in the question (hidden from the tested model):
  - id [int] — unique identifier of the example;
  - repo [str] — GitHub repository name the task is taken from;
  - base_commit [str] — commit hash fixing the repo state;
  - gt [str] — ground truth function body (no signature);
  - stub [str] — stub function body (no signature);
  - right_context [str] — code appearing after the target function;
  - left_context [str] — code appearing before the target function;
  - image_name [str] — Docker image for running the project;
  - build_command [str] — command to build the project before tests;
  - test_command [str] — command to run the tests;
  - file_path [str] — path to the file containing the function;
  - PASS_TO_PASS [list] — tests that pass with the generated function;
  - FAIL_TO_PASS [list] — tests that used to fail and now pass;
  - intent [str] — function or method name;
  - intent_type [str] — element type (function, class, etc.).
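The schema above can be summarised, as a sketch only, by the following record type (the class and field names on the Java side are this card's illustration; the dataset itself is distributed as data rows, not as a Java class):
```java
import java.util.List;

// Illustrative sketch only: a plain container mirroring the fields described above.
record RealCodeJavaTask(
        String instruction,            // task formulation for function body generation
        Inputs inputs,                 // input data forming the task for the model
        List<String> outputs,          // n_samples generations used to compute pass@k
        Meta meta) {                   // metadata hidden from the tested model

    record Inputs(String leftContext) {}  // left_context: code before the target function

    record Meta(
            int id,                    // unique identifier of the example
            String repo,               // GitHub repository the task is taken from
            String baseCommit,         // commit hash fixing the repo state
            String gt,                 // ground truth function body (no signature)
            String stub,               // stub function body (no signature)
            String rightContext,       // code after the target function
            String leftContext,        // code before the target function
            String imageName,          // Docker image for running the project
            String buildCommand,       // command to build the project before tests
            String testCommand,        // command to run the tests
            String filePath,           // path to the file containing the function
            List<String> passToPass,   // tests that pass with the generated function
            List<String> failToPass,   // tests that used to fail and now pass
            String intent,             // function or method name
            String intentType) {}      // element type (function, class, etc.)
}
```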
Prompts
For the task, 10 prompts were prepared and evenly distributed among the questions following the principle of "one prompt per question". The templates in curly braces in each prompt are filled in from the fields inside the inputs field of each question.
Prompt example:
```
Here is the task context:
{left_context}
Write the contents of the last function after its header with arguments. Only the body of a single function is expected in the answer. Do not add new functions or classes to the answer; try to use those that are already present in the context or imported at the very beginning. Preserve the code indentation and formatting as in the example. Format your answer as follows:
```java
place the content of your answer here```
```
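To make the expected answer format concrete, here is a purely hypothetical illustration (the class, method, and logic are invented for this example and are not taken from the benchmark): the model sees a left_context that ends with the function header and must return only the indented body, with no new signature, class, or imports.
```java
// Purely hypothetical task, invented for illustration.
// The left_context shown to the model ends right after the header line below.
public final class IntervalUtils {

    /** Returns true if the half-open intervals [aStart, aEnd) and [bStart, bEnd) overlap. */
    public static boolean overlaps(int aStart, int aEnd, int bStart, int bEnd) {
        // A compliant answer consists only of the body lines below,
        // keeping the file's indentation.
        return aStart < bEnd && bStart < aEnd;
    }
}
```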
Dataset creation
The benchmark is built from 27 public Java GitHub repositories created in 2024-2025. For each sample, a function is extracted along with its surrounding code (left_context, right_context); a generated body is judged correct if, once inserted in place of the original body, the repository's original unit tests pass. All examples come from real repositories and are reproducibly executable.
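A minimal sketch of the evaluation step implied above, assuming the source file is reconstructed by concatenating left_context, the generated body, and right_context, and that build_command and test_command are run inside the task's Docker image (the helper names and the docker invocation details are assumptions of this sketch, not the benchmark's actual harness):
```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative sketch only: reconstruct the source file with the generated body,
// then build and test inside the task's Docker image.
public final class EvalSketch {

    static void writeCandidateFile(Path repoRoot, String filePath,
                                   String leftContext, String generatedBody,
                                   String rightContext) throws IOException {
        // The candidate file is the left context, the model's body, and the right context.
        String candidate = leftContext + generatedBody + rightContext;
        Files.writeString(repoRoot.resolve(filePath), candidate);
    }

    static boolean buildAndTest(Path repoRoot, String imageName,
                                String buildCommand, String testCommand)
            throws IOException, InterruptedException {
        // Run the repository's own build and test commands inside its Docker image.
        Process p = new ProcessBuilder(
                "docker", "run", "--rm",
                "-v", repoRoot.toAbsolutePath() + ":/workspace",
                "-w", "/workspace",
                imageName,
                "sh", "-c", buildCommand + " && " + testCommand)
                .inheritIO()
                .start();
        return p.waitFor() == 0; // exit code 0 means the build and tests succeeded
    }
}
```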
Metrics
Metrics for aggregated evaluation of responses:
- pass@1: fraction of tasks where at least one generation passes all tests;
- execution_success: fraction of tasks where the project built and the tests executed without failure.