StRuCom

Taxonomies
Instruction Following
Code Perception
Simulation
Documentation
Metric
chrF
Languages
Python
Go
Java
C#
JavaScript

Task description

The dataset contains structured Russian-language docstrings for functions in five programming languages (Python, Java, C#, Go, JavaScript) and comprises 500 tasks.

Key features:

  • First specialized corpus for Russian-language documentation
  • Combination of real GitHub data (for testing) and synthetic data from Qwen2.5-Coder-32B-Instruct (for training)
  • Strict filtering for completeness and compliance with documentation standards
  • All comments conform to the specified format for each language (Python - Google style, JavaScript - JSDoc, Java - Javadoc, C# - XML, Go - GoDoc); see the example after this list
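
For reference, here is a hypothetical Python function with a Russian-language Google-style docstring of the kind the dataset targets; the function and the comment text are invented for illustration and are not drawn from the corpus (the docstring reads "Looks up a user name by identifier", with Args and Returns sections):

    def find_user(users: dict, user_id: int):
        """Ищет имя пользователя по идентификатору.

        Args:
            users: Отображение идентификаторов пользователей в имена.
            user_id: Идентификатор искомого пользователя.

        Returns:
            Имя пользователя или None, если идентификатор не найден.
        """
        return users.get(user_id)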

Evaluated skills: Instruction Following, Code Perception, Simulation, Documentation

Contributors: Maria Dziuba, Valentin Malykh

Motivation

Target Models and Limitations

Designed for evaluating models that support structured documentation generation (e.g., DeepSeek-Coder, Qwen2.5-Coder).

Not suitable for:

  • Unstructured comment generation
  • Code summarization
  • Code explanation

Users and Result Interpretation

Primary users:

  • NLP developers and researchers working on automated documentation tools

The results make it possible to:

  • Assess models' ability to generate technically accurate comments compliant with documentation standards

Metrics:

  • chrF evaluates the similarity between generated and reference texts using character n-grams, taking morphology, spelling, and grammatical endings into account; this is particularly important for Russian because of its morphological complexity (see the scoring sketch after this list)
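
As illustration, a minimal scoring sketch using the sacrebleu implementation of chrF (using sacrebleu and its default settings is an assumption; the benchmark's exact scoring pipeline may differ):

    # pip install sacrebleu
    from sacrebleu.metrics import CHRF

    chrf = CHRF()  # defaults: character 6-grams, beta = 2

    reference = "Возвращает сумму двух чисел."     # gold comment
    hypothesis = "Возвращает сумму двух значений."  # model output

    # sentence_score takes the hypothesis and a list of references
    score = chrf.sentence_score(hypothesis, [reference])
    print(score.score)  # value in [0, 100]; higher means closer to the reference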

Data description

Data fields

Each dataset question includes data in the following fields (an illustrative record follows the list):

  • instruction [str] — Instruction prompt template with placeholders for question elements.
  • inputs — Input data that forms the task for the model. May include one or several modalities (video, audio, image, text).
    • function [str] — The function to generate a structured comment for.
  • outputs [str] — The correct answer to the question.
  • meta — Metadata related to the test example, not used in the question (hidden from the tested model).
    • id [int] — Identification number of the question in the dataset.
    • language [str] — The programming language in which the function is written.
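
An illustrative record (all values below are invented to show the schema and are not taken from the dataset):

    example = {
        "instruction": "Напиши русскоязычную документацию к функции.\n\n"
                       "Функция:\n\n{function}",
        "inputs": {
            # the function the model must document
            "function": "def add(a, b):\n    return a + b",
        },
        # reference structured comment (abridged here)
        "outputs": '"""Возвращает сумму двух аргументов. ..."""',
        "meta": {
            "id": 1,               # question id (invented)
            "language": "python",  # programming language of the function
        },
    }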

Prompts

For the task, 10 prompts were prepared and evenly distributed among the questions on the principle of "one prompt per question". The placeholders in curly braces in each prompt are filled in from the fields inside the inputs field of each question.

Prompt example:

"Напиши русскоязычную документацию к функции.

Функция:

{function}"

Dataset creation

Stage 1: Data Collection

  • Crawling Russian-language GitHub repositories with permissive or absent licenses; language identification via Lingua (see the sketch after this list)
  • Function extraction using function_parser and Code-Text
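
The language-identification step might look like the following sketch with the lingua-language-detector Python package; the chosen language set is an assumption, not the authors' exact configuration:

    # pip install lingua-language-detector
    from lingua import Language, LanguageDetectorBuilder

    detector = LanguageDetectorBuilder.from_languages(
        Language.RUSSIAN, Language.ENGLISH
    ).build()

    comment = "Возвращает сумму двух чисел."
    # keep only comments detected as Russian
    if detector.detect_language_of(comment) == Language.RUSSIAN:
        print("keep")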

Stage 2: Synthetic Data

  • Qwen2.5-Coder-32B-Instruct model used for synthetic data generation

Stage 3: Cleaning and Standardization

  • Strict structural filtering (requiring complete coverage of all documented code elements)
  • Style standardization of all comments
  • Length filtering (250-1000 characters); a minimal sketch follows
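
A minimal sketch of the length filter; the 250-1000 character bounds come from the list above, while applying them to the raw comment text is an assumption:

    MIN_LEN, MAX_LEN = 250, 1000

    def passes_length_filter(comment: str) -> bool:
        """True if the comment length falls within the allowed range."""
        return MIN_LEN <= len(comment) <= MAX_LEN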

Metrics

Metrics for aggregated evaluation of responses:

chrF: a metric that evaluates character n-gram matches with the reference text; well suited to Russian morphology and spelling accuracy.
