FAQ

What is MERA?

MERA (Multimodal Evaluation for Russian-language Architectures) is an independent benchmark for evaluating state-of-the-art models in Russian, developed and maintained jointly by researchers from industry and academia. The benchmark comprises 23 instruction tasks for the Russian language covering various problems and domains.

How to use the MERA benchmark?

To evaluate your model on the MERA benchmark, do the following:

  • Download all the data from the Tasks section.
  • Use the evaluation code provided in the official benchmark GitHub repository. Support for all tasks is provided within this project. Add your model to the code and run it following the instructions. Do not change the parameters or prompts.
  • After running the lm-harness scripts, you will get a ZIP file with all the results in the correct format. Please do not change the file names or IDs in the answers; this may lead to incorrect scoring metrics and non-representative scores for your model.
  • Register on our website.
  • In the account, find the "Create a submission" button.
  • Add as much information about your model, including links, as you can. As a community, it is essential for us to know which models are represented on the leaderboard. We believe in reproducible and open research. Fill in the fields in the form, attach the ZIP file from the lm-harness step, and send your submission.
  • The automatic script will score the results in several minutes, and you will see all scores in your account.

For pre-trained models, all you need to do is run the prepared code, adding your model in Hugging Face format. Do not change the other parameters; they are fixed. For SFT models, add your model's system prompt and describe it in the form during submission.

The limit is 10 submissions per day. An example of a correct submission is here.
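If you want to see what an evaluation run looks like programmatically, here is a minimal sketch using the Python API of the upstream lm-evaluation-harness (v0.4+), on which the MERA evaluation code is based. The model path and task name are placeholders, and the official MERA run scripts with their fixed parameters and prompts remain the supported way to produce a submission.

```python
# Minimal sketch of a run through the upstream lm-evaluation-harness (v0.4+)
# Python API. "your-org/your-model" and the task name are placeholders; for a
# real MERA submission, use the official scripts without changing parameters
# or prompts.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                   # Hugging Face backend
    model_args="pretrained=your-org/your-model",  # placeholder model path
    tasks=["chegeka"],                            # placeholder task name
    batch_size=1,
)

# Per-task metrics are returned under the "results" key.
print(results["results"])
```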

What does version MERA v.1.2.0 mean?

MERA is a constantly evolving benchmark. The first version was presented in November 2023; you can find information about it in our academic publication. Since September 2024, the MERA benchmark has switched to the new version v.1.2.0, which is now the only supported version. All information provided on the website and in our GitHub is relevant to this version, and we ask users to stick to v.1.2.0. You can read more about the differences between the new version and the previous one in the Habr post.

Can I evaluate a private model on the MERA benchmark?

YES, you can! Use the code we prepared for model evaluation based on the lm-harness framework. Download the tasks, evaluate your model, and submit the result. You will see the model scores in your account; they are not visible externally. If you want to submit a model to the public leaderboard, add a careful model description in the submission form (training process, data, architecture, parameter configuration - everything necessary for the reproducibility of the results) and submit it for moderation by experts. You will be contacted soon to clarify the details. Your submission answers will be known only to the holders of the leaderboard and will not be open to the general public, even if published on the leaderboard.

Chat Template and System Prompt support. What is it?

A Chat Template is an algorithm that takes a list of dictionaries as input, e.g. [{"role": "system", "content": "брат, помоги решить задачу"}, {"role": "user", "content": "сколько будет 2+2"}], and outputs a string for the model input. The System Prompt is an instruction for the model given in the format {"role": "system", "content": "SYSTEM PROMPT"}. Through these two concepts, it is possible to take into account the format used for the model's SFT fine-tuning. Therefore, results with the Chat Template and the System Prompt are expected to be higher than without them.
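For illustration, a minimal sketch of how a Chat Template and System Prompt turn such a message list into the final model input string, using the Hugging Face transformers API; the model name is a placeholder for the chat model you are evaluating.

```python
# Sketch: render a message list (System Prompt + user turn) into a prompt string
# with the model's own chat template. The model name is a placeholder.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-org/your-chat-model")  # placeholder

messages = [
    {"role": "system", "content": "брат, помоги решить задачу"},  # System Prompt
    {"role": "user", "content": "сколько будет 2+2"},
]

# apply_chat_template inserts the role markers and special tokens the model was
# fine-tuned with and appends the assistant prefix for generation.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```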

Also, pay attention to the Multi-turn mode in HuggingFace for assistant models, which can affect the final prompt format. That can be crucial for model evaluation (for example, the evaluation for the Llama model family does not work correctly without the Multi-turn mode).

Is it possible to evaluate models using API?

Yes! Starting from version v.1.2.0, the MERA benchmark supports evaluating models through an API. To do this, you need to add support for your model to the code base. The instructions from the authors of lm-evaluation-harness on adding API models to the framework can be found here.
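For orientation, here is a rough skeleton of how an API-backed model can be plugged into the upstream lm-evaluation-harness (v0.4) interface; the MERA fork may differ, so the linked instructions take precedence. The class name, registry key, and call_my_api stub are hypothetical placeholders for your own client code.

```python
# Rough skeleton of an API-backed model for the upstream lm-evaluation-harness
# (v0.4) interface. The registry key, class name, and call_my_api() stub are
# hypothetical placeholders; follow the official instructions for the real setup.
from lm_eval.api.model import LM
from lm_eval.api.registry import register_model


def call_my_api(context: str, gen_kwargs: dict) -> str:
    """Hypothetical stub: send `context` to your inference API and return the text."""
    raise NotImplementedError("plug in your HTTP client here")


@register_model("my-api-model")  # hypothetical registry key
class MyAPIModel(LM):
    def generate_until(self, requests):
        # Each request's args hold (context, generation kwargs); return the
        # generated continuations in the same order.
        outputs = []
        for request in requests:
            context, gen_kwargs = request.args
            outputs.append(call_my_api(context, gen_kwargs))
        return outputs

    def loglikelihood(self, requests):
        # Required by the interface; APIs without token logprobs cannot support
        # loglikelihood-based tasks.
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        raise NotImplementedError
```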

How can I add my model's results to the public leaderboard?

An uploaded model submission does not automatically become public. To request publication on the leaderboard, tick the “Publish” box. MERA administrators (who also serve as experts on the benchmark expert council) will then receive a notification to check the submission. As soon as they approve it (they may contact you to clarify details), you will receive a notification by email, and your model will appear on the leaderboard. If you want to update this submission, the procedure is repeated. Please review your submission before submitting it and requesting to make it public.

Only submissions that contain answers for all main tasks and include a link to the evaluated model, an article, or a short model description can become public. In addition, for a fair evaluation, we ask authors to indicate all sources, model parameters, and data they used to create their system.

Are there any limitations for model submissions?

Systems can use any public or private data in the process of language model training with a few exceptions:

  1. Systems must use the task data from the official MERA website, repository, or the official HuggingFace page. Other sources may contain incorrect training/validation/test splits and metadata.
  2. Systems should not use unlabeled test data from MERA tasks to train models and should not distribute information between test samples in any form. It is not good to learn from the test data!
  3. The training data is given to participants as examples for the few-shot evaluation. Do not add these datasets to the pre-training corpora of your model. You can submit results of any model, provided they are in the correct format and use the same IDs and labels. However, we mean systems (based on machine learning), not manual problem-solving!

Is it possible to make an anonymous submission on the public leaderboard?

Yes, it is possible. The leaderboard displays team names and models, but you can create an anonymous account. The most important thing is that participants and administrators can contact you.

What license do your datasets have?

All MERA tasks are based on open resources. All datasets are published under the MIT license.

Why do I not see my submission/model results?

If you do not see the results, do the following:

1) Wait several minutes, as processing the submission may take some time.

2) Then check that your submission has been successfully uploaded into the system. If it has, it appears in the list of your submissions; otherwise, an error message appears.

3) If your submission is incorrect, you will receive a text description of the error:

- The uploaded ZIP archive does not contain the necessary files for the tasks.

- Something is wrong with the metadata (for example, you missed the ID).

- All IDs for each task in the JSON files are required and start from 0. Check that all IDs match those in the test set (see the sketch below for a quick check).

4) If the submission was not processed for some other reason, please contact us at mera@a-ai.ru.
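Before uploading, you can run a quick local sanity check on the IDs. The sketch below assumes each task's answers are stored as a JSON list of objects with an integer "id" field; this layout is an assumption, so adapt the field names to the files actually produced by the lm-harness scripts.

```python
# Hypothetical sanity check: verify that the IDs in a task's answer file start
# from 0 and contain no gaps or duplicates. The assumed layout (a JSON list of
# objects with an "id" field) may differ from the real lm-harness output.
import json

def check_ids(path: str) -> None:
    with open(path, encoding="utf-8") as f:
        answers = json.load(f)  # assumed: list of {"id": int, ...}
    ids = sorted(item["id"] for item in answers)
    expected = list(range(len(ids)))
    if ids != expected:
        problems = sorted(set(expected) ^ set(ids))
        raise ValueError(f"IDs are not 0..{len(ids) - 1}: check {problems}")
    print(f"{path}: {len(ids)} answers, IDs look consistent")

# check_ids("chegeka.json")  # placeholder file name
```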

I found a bug. I have suggestions and comments!

You can contact us by email at mera@a-ai.ru. For suggestions and errors in the evaluation code or data, please create an issue in our official GitHub repository.

How many tasks are there in MERA?

The benchmark contains 23 instruction tasks, of which 15 are test tasks with closed answers and 8 are diagnostic tasks with open answers.

What diagnostic tasks are there in MERA?

The benchmark includes 8 diagnostic datasets with open answers:

  • BPS is a diagnostic dataset that aims to measure language models' ability to learn CS algorithmic concepts. The model has to predict whether a given parentheses sequence is balanced or not (a reference solution is sketched below this list).
  • ruHateSpeech is a diagnostic dataset that identifies the ability of the model to recognize negative statements directed at a particular group of people.
  • ruDetox is a diagnostic detoxification dataset. The task is to rewrite the toxic replica in the correct style.
  • ruEthics is a diagnostic dataset for assessing the perception of ethics by language models.
  • ruHHH is a diagnostic dataset to assess the honesty/harm/help that the model can potentially cause. It is an analog of the English HHH.
  • ruHumanEval is a diagnostic dataset based on HumanEval created to evaluate the ability of language models to generate code in the Python programming language to solve simple problems.
  • ruMMLU is a diagnostic dataset based on MMLU that aims to assess the model's professional knowledge in different fields of science.
  • SimpleAr is a diagnostic dataset that tests language models' basic arithmetic capabilities by asking them to perform n-digit addition for a range of n.

These datasets are not used in the general evaluation of the model; they are intended to identify the model's ethical biases, analyze how safely it can be applied, and test its basic algorithmic skills.
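For reference, here is a minimal sketch of the classic algorithmic problem behind BPS (checking whether a bracket sequence is balanced); it is not MERA evaluation code, only an illustration of the task the model has to solve.

```python
# Classic stack-based balanced-brackets check: the problem BPS asks a model to
# solve. Illustration only, not part of the MERA evaluation code.
def is_balanced(sequence: str) -> bool:
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for char in sequence:
        if char in "([{":
            stack.append(char)
        elif char in pairs:
            if not stack or stack.pop() != pairs[char]:
                return False
    return not stack

print(is_balanced("([]{})"))  # True
print(is_balanced("([)]"))    # False
```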

What is the target variable in the ruEthics dataset?

The dataset is a binary classification task evaluated in a somewhat non-standard way: given a textual description of a situation and a pair of actors selected from the text, the model must answer three questions:

  1. Does the first actor act right towards the second actor?
  2. Does the first actor act good towards the second actor?
  3. Does the first actor act ethically towards the second actor?

A key feature is that there are no correct answers for the initial questions because the general concept of ethics is too philosophical and ambiguous. Instead, for each example, ethical compliance in five categories (binary criterion — norm observed/norm violated) is noted. The evaluation process calculates the Matthews correlation between the model predictions and each of the five norms.

What are the golden answers in ruEthics? Where can I find them?

A key feature of ruEthics is that there are no golden answers to the three initial questions, because the general concept of ethics is too philosophical and ambiguous. Instead, each example is annotated for ethical compliance in five categories (a binary criterion: norm observed/norm violated), and these annotations serve as the labels against which model predictions are correlated. How the score is computed from them is described in the next question.

How do we calculate the score on the ruEthics dataset?

When evaluating on ruEthics, three sets of model predictions are generated, one for each of the three questions ("Does the first actor act right/good/ethically towards the second actor?"). The Matthews correlation coefficient (MCC) between each of the model prediction sets and each of the 5 ethical criteria is then calculated. In total, for each of the 3 questions, we obtain 5 correlations corresponding to the decomposition of that question into the 5 ethical criteria. In this way, we obtain the "overall ethical portrait of the model", i.e. how the most general concepts related to ethics are decomposed for the model according to these 5 criteria. For example, the model may consider as ethical those situations where the norms of law, morality, and justice are observed, while its predictions do not correlate at all with utilitarianism, i.e. the model does not include it in the concept of ethics. On the other hand, the model may include justice and lawfulness in the concept of "right" but pay less attention to morality.
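As an illustration of the metric itself, here is a minimal sketch of computing the Matthews correlation between one set of model predictions and one ethical criterion with scikit-learn; the binary vectors are toy data, not real ruEthics annotations.

```python
# Toy illustration of the MCC computed for one (question, criterion) pair;
# the vectors below are made-up binary labels, not real ruEthics data.
from sklearn.metrics import matthews_corrcoef

# Model answers to one question ("Does the first actor act ethically?"), 1 = yes.
model_predictions = [1, 0, 1, 1, 0, 1, 0, 0]
# Annotation for one criterion (e.g. law), 1 = norm observed, 0 = norm violated.
criterion_labels = [1, 0, 1, 0, 0, 1, 1, 0]

print(matthews_corrcoef(criterion_labels, model_predictions))
```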