Task description
Evaluation of the correctness of written code for Python, Java and Go. Correctness means the absence of any errors (including SyntaxError, RuntimeError, etc.) and the successful passing of the tests. The dataset contains 1361 tasks.
Evaluated skills: Instruction Following, Code Perception, Simulation, Error Classification
Contributors: Elena Bruches, Ivan Bondarenko, Daniil Grebenkin, Oleg Sedukhin, Roman Derunets, Georgii Mkrtchyan, Vadim Alperovich, Nikolay Bushkov, Stanislav Moiseev
Motivation
It is assumed that during the training process the model learns not only to generate code and solve different tasks but also to process and analyze code, e.g. to detect whether the code is correct, contains any errors, etc. This dataset was developed to evaluate this ability automatically. Any model that assesses code correctness has to work within a given context. To make it possible to determine whether the code executes successfully, we collected {focal_code, test_code} pairs that do not import from other files of their projects. We also kept only the files that do not use any assets, e.g. load data from files.
Data description
Data fields
Each dataset question includes data in the following fields:
instruction [str] — Instruction prompt template with placeholders for the question elements;
inputs — Input data that forms the task for the model. Can include one or multiple modalities (video, audio, image, text):
    focal_code [str] — Source code from the focal file;
    test_code [str] — Source code from the test file;
    lang [str] — Programming language of this sample;
outputs [str] — Answer of the model; should be either "success" or "failed";
meta — Metadata related to the test example, not used in the question (hidden from the tested model);
id [int] — Identification number of the question in the dataset.
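For illustration, a hypothetical question record in Python notation (all field values below are invented and abbreviated, not taken from the dataset):

sample = {
    "instruction": "Here is the code from the focal file in {lang}:\n{focal_code}\n...",
    "inputs": {
        "focal_code": "def add(a, b):\n    return a + b",
        "test_code": "def test_add():\n    assert add(2, 3) == 5",
        "lang": "python",
    },
    "outputs": "success",
    # meta is hidden from the tested model
    "meta": {"repo": "example/repo"},
    "id": 0,
}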
Prompts
For the task, 11 prompts were prepared and distributed evenly among the questions on the principle of "one prompt per question". The templates in curly braces in each prompt are filled in from the fields inside the inputs field of each question.
Prompt example:
Here is the code from the focal file in {lang}:
{focal_code}
Check whether the test for this code is correct:
{test_code}
Give a short answer: if the test passes without errors, say "success", otherwise say "failed".
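Rendering a question reduces to substituting the inputs fields into the template. A minimal Python sketch, assuming the record structure described above (the template text and values are illustrative):

template = (
    "Here is the code from the focal file in {lang}:\n"
    "{focal_code}\n"
    "Check whether the test for this code is correct:\n"
    "{test_code}\n"
    'Give a short answer: if the test passes without errors, say "success", otherwise say "failed".'
)
inputs = {
    "lang": "python",
    "focal_code": "def add(a, b):\n    return a + b",
    "test_code": "def test_add():\n    assert add(2, 3) == 5",
}
prompt = template.format(**inputs)
print(prompt)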
Dataset creation
The dataset creation includes the following stages:
1) Automatic retrieval, parsing and processing of open-source repositories from GitHub, selected by the number of stars, recency (date of the latest commits) and executability (a check that the project builds successfully and that the focal and test files execute successfully);
2) Collection of dataset samples from the repository data in the following format: focal file source code | test file source code;
3) Creation of two subsets: original (samples containing the original test files) and generated (samples containing test cases generated by an LLM);
4) Tagging of the samples with features such as the number of lines in the test case, the number of lines in the focal file, syntax correctness, import types, etc. (a sketch of such tagging is given after this list);
5) Filtering of the samples to keep only those for which the task of determining correctness can be solved without additional inputs;
6) Formation of the final version of the dataset from the filtered data.
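As an illustration of stage 4, simple features such as line counts and syntax correctness can be computed automatically. A minimal Python sketch for Python-language samples (the actual tagging pipeline is not specified in this document):

import ast

def tag_sample(focal_code: str, test_code: str) -> dict:
    # Compute simple features for one {focal_code, test_code} pair.
    def syntax_ok(source: str) -> bool:
        try:
            ast.parse(source)
            return True
        except SyntaxError:
            return False
    return {
        "focal_lines": len(focal_code.splitlines()),
        "test_lines": len(test_code.splitlines()),
        "focal_syntax_ok": syntax_ok(focal_code),
        "test_syntax_ok": syntax_ok(test_code),
    }

# The second argument contains a deliberate syntax error, so test_syntax_ok is False.
print(tag_sample("def add(a, b):\n    return a + b", "def test_add(:\n    pass"))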
Metrics
Metrics for aggregated evaluation of responses:
Exact Match: measures the proportion of model predictions that exactly match the reference among the total number of cases processed.
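A minimal Python sketch of this computation, assuming predictions and references are the plain strings "success" or "failed":

def exact_match(predictions, references):
    # Proportion of predictions that exactly match the reference answer.
    assert len(predictions) == len(references)
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

# Two of three predictions match, so the score is 2/3.
print(exact_match(["success", "failed", "failed"],
                  ["success", "failed", "success"]))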