
UnitTests

Taxonomies: Instruction Following, Testing, Long Context Comprehension, Synthesis

Metric: CodeBLEU

Languages: Python, Go, Java, C#, JavaScript

Task description

Evaluation of unit-test generation for functions and methods in five programming languages (Java, Python, Go, JavaScript, and C#). The dataset contains 2500 tasks.

Evaluated skills: Instruction Following, Long Context Comprehension, Synthesis, Testing

Contributors: Alena Pestova, Valentin Malykh

Motivation

Unit testing is an important software development practice in which individual components of a software system are evaluated in isolation.

This benchmark evaluates the ability of models to generate unit tests for methods and functions in five programming languages: Java, Python, Go, JavaScript, and C#.

The task of unit-test generation is formulated as follows: given a function or method (the focal function/method), generate a unit test for it (the test function/method).

The dataset is intended for instruction-following code and multimodal models.

The evaluation results may be useful:

- for researchers in the field of automatic code generation in general and unit-test generation in particular;

- for developers of tools for automatic code generation.

Based on the results, it is possible to assess how similar the tests generated by the model are to tests written by humans.

The task evaluates the ability of the model to generate unit tests for a given function/method, taking into account the additional context gathered from the project.

Thus, the ability to generate code at the project level is also tested, which is important when using models to generate unit tests for real projects.

The model is provided with an instruction that contains:

- the task (generate a unit test)

- the programming language used

- the text of the method/function to be tested

- the path to the file of the method/function to be tested, as well as the rest of the code from this file

- the path to the file where the generated test function will be located

- (optional) the test framework that needs to be used

- the type of unit test to generate (method or function)

- additional context from the future test file

About the additional context from the test file:

Since the dataset was gathered from files of real GitHub repositories, a test file usually contains more than one test function; it may also contain the necessary imports, variables, and auxiliary functions and methods, as well as test functions for other units.

In the basic scenario, we could simply give the model a unit (and even some additional context for this unit) and ask it to generate a test.

However, in this case the model does not see the context from the test file.

If such data is used for training, we intentionally teach the model to hallucinate, namely, to use libraries, functions, and classes in the test function that have not been described anywhere in the input.

If such data is used for evaluation, the model is given a very limited context, and the comparison with human-written tests is not fair.

One could simply give the model the rest of the text from the test file and ask it to generate the target test function, but then there is a high probability of data leakage: the target test function could be used elsewhere in the file, and other test functions could also end up in the context.

Therefore, we decided to collect a trimmed context for each test function from the test file and include it in the model input.
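As an illustration only, the sketch below shows one possible way to build such a trimmed context for a Python test file: keep imports, module-level variables, and helper definitions, and drop the test functions (including the target one). The actual trimming procedure used for the dataset is not specified here and may differ.

```python
import ast

def build_test_context(test_file_source: str, target_test_name: str) -> str:
    """Hypothetical sketch: keep imports, variables, and helper definitions
    from the test file, but drop the target test and other test functions
    so the model cannot copy the answer from its own input."""
    tree = ast.parse(test_file_source)
    source_lines = test_file_source.splitlines()
    kept_lines = []
    for node in tree.body:
        is_test = isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and (
            node.name == target_test_name or node.name.startswith("test")
        )
        if is_test:
            continue  # drop test functions, including the target one
        kept_lines.extend(source_lines[node.lineno - 1:node.end_lineno])
    return "\n".join(kept_lines)
```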

The quality assessment metric is CodeBLEU, which evaluates the similarity between a generated test and a test written by a human.

Data description

Data fields

Each dataset question includes data in the following fields:

  • instruction [str] — Instruction prompt template with placeholders for the question elements.
  • inputs — Input data that forms the task for the model. Can include one or multiple modalities: video, audio, image, text.
    • focal_func [str] — the focal function/method;
    • test_func_type [str] — the type of the test;
    • test_func_context [str] — the test function context from the test file;
    • language [str] — the programming language (python, java, csharp, js, go);
    • focal_file_path [str] — the path to the focal function/method file;
    • test_file_path [str] — the path to the test function/method file;
    • focal_func_context [str] — the focal function context: the focal file text where the focal method is replaced with the text `#focal function/method here`;
    • test_framework [str] — the test framework that should be used (only for JS).
  • outputs [str] — The correct answer to the question.
  • meta — Metadata related to the test example, not used in the question (hidden from the tested model).
    • id [int] — Identification number of the question in the dataset;
    • repo_id [str] — the GitHub ID of the repository;
    • focal_func_type [str] — the type of the focal object.
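For illustration, a record with these fields might look like the following; the field names follow the description above, while all values (including the instruction text) are invented and do not come from the actual dataset.

```python
# Hypothetical example record; field names follow the description above,
# all values are invented for illustration.
sample = {
    "instruction": "Write a {test_func_type} in {language} for this code from "
                   "'{focal_file_path}':\n{focal_func}\n"
                   "The test will be placed in '{test_file_path}'.",
    "inputs": {
        "focal_func": "def add(a, b):\n    return a + b",
        "test_func_type": "test function",
        "test_func_context": "import pytest\nfrom calc import add",
        "language": "python",
        "focal_file_path": "src/calc.py",
        "test_file_path": "tests/test_calc.py",
        "focal_func_context": "# focal function/method here",
        "test_framework": "",
    },
    "outputs": "def test_add():\n    assert add(1, 2) == 3",
    "meta": {"id": 0, "repo_id": "123456", "focal_func_type": "function"},
}
```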

Prompts

For the task, 20 prompts were prepared and evenly distributed among the questions on the principle of "one prompt per question". The placeholders in curly braces in each prompt are filled in from the fields inside the inputs field of each question.

Prompt example:

"Сгенерируйте функцию на языке {language}.

Напиши тест для этого кода на языке {language} из файла '{focal_file_path}'.

Вот код, который надо протестировать:

{focal_func}

Тебе необходимо написать {test_func_type} на языке {language}. Тест будет помещен в файл '{test_file_path}'.

Обязательно учитывай код, собранный из будущего тестового файла:

{test_func_context}

Для тебя собран код из репозитория, который может помочь тебе в написании теста:

{focal_func_context}

Напиши только {test_func_type} без пояснений и комментариев. Не забывай соблюдать синтаксис языка {language}.

Оформи свой ответ с соблюдением markdown разметки для кода:```{language}

<your code>```
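As a minimal sketch of how such a template could be rendered (assuming simple str.format-style substitution, which is an assumption about the exact mechanics; the template and values are illustrative):

```python
# Render a prompt template from the "inputs" fields (illustrative only).
template = (
    "Write a {test_func_type} in {language} for this code from '{focal_file_path}':\n"
    "{focal_func}\n"
    "The test will be placed in '{test_file_path}'."
)
inputs = {
    "test_func_type": "test function",
    "language": "python",
    "focal_file_path": "src/calc.py",
    "focal_func": "def add(a, b):\n    return a + b",
    "test_file_path": "tests/test_calc.py",
}
print(template.format(**inputs))
```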

Dataset creation

The dataset collection process consisted of the following steps:

1. Parsing the repository list, filtering it, and downloading the repositories.

2. Parsing the repositories: functions, methods, and tests.

3. Matching methods/functions with their corresponding tests.

These steps are described in more detail below.

A list of repositories was downloaded using the GitHub API for each language.

We chose only repositories with permissive licenses and more than 10 stars, and filtered out forks. The licenses represented in the dataset are: MIT License, Apache License 2.0, The Unlicense, Mozilla Public License 2.0, BSD 2-Clause "Simplified" License, BSD 3-Clause "New" or "Revised" License, EPL 1.0, EPL 2.0, and 0BSD.
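For illustration, repository lists like this can be obtained through the GitHub search API; the query below (language, stars, fork, and license qualifiers) is a hedged sketch, not the exact query used to build the dataset.

```python
import requests

def search_repos(language: str, token: str, page: int = 1) -> list[str]:
    """Sketch: fetch one page of repositories for a language with >10 stars,
    excluding forks and restricting to a permissive license (MIT here)."""
    query = f"language:{language} stars:>10 fork:false license:mit"
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": query, "per_page": 100, "page": page},
        headers={"Authorization": f"token {token}"},
    )
    resp.raise_for_status()
    return [item["full_name"] for item in resp.json()["items"]]
```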

When building the dataset, the same filtering rules were applied for all languages (a code sketch follows the list):

+ Empty tests are removed.

+ No more than 200 method-test pairs were collected from one repository. If there were more pairs, they were sampled randomly.

+ The test case should be less than 5000 characters. This limit is set to remove overly long tests from the data.

+ The maximum input length (focal function with context) should be less than 70000 characters.

+ The maximum number of assertions (occurrences of the word "assert" in the test case) is 20.

+ For Python and Java, tests with syntax errors were additionally filtered out (using the ast and javalang libraries, respectively).

+ The training data was filtered for duplicate test cases within the split, and possible overlaps with the validation and test data were removed.
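The sketch below restates these rules in code; the pair structure (dictionaries with "test", "input", and "repo" keys) is an assumption made for illustration.

```python
import random
from collections import defaultdict

def filter_pairs(pairs):
    """Apply the filtering rules listed above to candidate method-test pairs."""
    kept = []
    for p in pairs:
        test, model_input = p["test"], p["input"]
        if not test.strip():                 # empty tests are removed
            continue
        if len(test) >= 5000:                # overly long tests are removed
            continue
        if len(model_input) >= 70000:        # focal function with context is too long
            continue
        if test.count("assert") > 20:        # too many assertions
            continue
        kept.append(p)

    # at most 200 method-test pairs per repository, sampled randomly otherwise
    by_repo = defaultdict(list)
    for p in kept:
        by_repo[p["repo"]].append(p)
    result = []
    for repo_pairs in by_repo.values():
        if len(repo_pairs) > 200:
            repo_pairs = random.sample(repo_pairs, 200)
        result.extend(repo_pairs)
    return result
```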

For all languages except Python, tree-sitter was used for code parsing, specifically for searching and parsing functions/methods and classes, identifying calls, etc. For Python, we use the built-in ast library.
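As an illustration of the Python side, a minimal ast-based pass can enumerate functions/methods and the names they call; this is only a sketch, not the parser used to build the dataset.

```python
import ast

source = """
class Calculator:
    def add(self, a, b):
        return a + b

def test_add():
    assert Calculator().add(1, 2) == 3
"""

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        # names of everything the function/method calls
        calls = [
            n.func.attr if isinstance(n.func, ast.Attribute) else getattr(n.func, "id", "?")
            for n in ast.walk(node)
            if isinstance(n, ast.Call)
        ]
        print(node.name, "calls:", calls)
```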

After parsing all the classes, methods, and functions in a repository, we need to determine which method/function each test function is testing; in other words, we need to match them and create a list of test-method pairs. Methods/functions and their unit tests were mapped using a method adapted from the paper (which compared only Java methods and tests). Below we briefly describe how this method was adapted for each language.

+ For Java, test classes are mapped to focal classes by their paths and names. Focal and test methods are then matched using two heuristics: names and unique method invocations.

+ For Python, all parsed functions and methods were mapped to tests in accordance with the pytest test naming conventions. Based on the logic that one function can be tested by several tests, but one test is aimed at only one function, only tests matched with exactly one function/method are added to the dataset (a sketch of this convention-based matching follows the list).

+ For Go, test functions were identified and mapped to focal functions following the naming conventions of the testing package. The test-mapping procedure was performed in the same way as for Python.

+ For C#, a focal method and a test method are mapped if the name of the test method includes the name of a non-test method from the repository and the test invokes this method. Only tests mapped to exactly one focal method are added to the dataset.

+ For JavaScript, the test framework used in the repository was also identified by searching for one of the following libraries among the dependencies in the "package.json" file: "mocha", "jest", "jasmine", "qunit", "nightwatch". The name of the framework was then added to the model input as part of the test file context. Unlike for the other languages, this is necessary because imports often do not contain information about the test framework. If no test framework from the list was found in the repository dependencies, the test function was still added to the dataset, but the test framework was marked as "Unknown". As for method-test mapping, JavaScript is the only language where it was based solely on the last local method/function invocation, because test functions declared inside it() and test() do not have identifiers.
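The sketch below illustrates the convention-based matching described for Python, keeping only tests that match a single focal function; the exact rules used for the dataset may be more involved.

```python
def map_tests_to_focal(test_names, focal_names):
    """Illustrative matching: a test is kept only if its name matches
    exactly one focal function/method name."""
    mapping = {}
    for test in test_names:
        if not test.startswith("test"):
            continue
        candidates = [f for f in focal_names if f in test]
        if len(candidates) == 1:  # discard ambiguous matches
            mapping[test] = candidates[0]
    return mapping

# "test_parse_config" matches both "parse_config" and "parse", so it is discarded
print(map_tests_to_focal(
    ["test_add", "test_parse_config"],
    ["add", "multiply", "parse_config", "parse"],
))  # -> {'test_add': 'add'}
```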

Metrics

Metrics for aggregated evaluation of responses:

CodeBLEU: CodeBLEU measures surface-level n-gram overlap, as in the original BLEU, but also accounts for syntactic and semantic correctness by leveraging the abstract syntax tree and the data-flow structure of the code.
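Concretely, CodeBLEU is defined as a weighted combination of four components (n-gram match, weighted n-gram match, AST match, and data-flow match); the sketch below uses the commonly cited equal weights of 0.25 and placeholder component scores.

```python
def codebleu(ngram, weighted_ngram, ast_match, dataflow_match,
             weights=(0.25, 0.25, 0.25, 0.25)):
    """CodeBLEU = a*BLEU + b*weighted BLEU + c*AST match + d*data-flow match."""
    a, b, c, d = weights
    return a * ngram + b * weighted_ngram + c * ast_match + d * dataflow_match

print(codebleu(0.30, 0.35, 0.60, 0.50))  # -> 0.4375
```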
