How MERA works
Modern large language models (such as ChatGPT, Llama, YandexGPT, GigaChat) are developing rapidly and need fair comparison and independent evaluation. There is no single evaluation standard, so models cannot be compared fairly: measurements are performed in disparate experimental settings (different evaluation data, different measurement methods). Openness and transparency of the procedure are a key problem of evaluation, not least because models are usually evaluated by their own developers, who are interested in their models getting high scores. We present a Russian-language industrial benchmark for comprehensive validation of large language models in the agriculture, medicine, and healthcare industries. The benchmark website ranks models by the quality of their solutions to a fixed set of expert-composed tasks, with standardized prompt and parameter configurations. The project is supported by the AI Alliance, leading industry players, and academic partners who research language models.

We propose a new methodology for evaluating SOTA language models.
It includes a wide range of challenging tests focused on critical professional areas such as agriculture and medicine. All tasks in this set were created by leading experts in agriculture and medicine, edited by professional editors, and then each task was manually re-checked, one by one, by three experts.
How are task prompts designed?
For each task, experts manually created several diverse, universal, model-agnostic instruction prompts with clearly defined requirements for the answer output format. These prompts are distributed uniformly across all questions in the task, so that each question is assigned exactly one prompt.
This format makes it possible to obtain an average score across different prompts, and all models are evaluated under equal conditions: prompts do not "favor" specific models. For these reasons, the instructions, as well as the generation parameters and few-shot examples, cannot be changed during model evaluation.
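To make the assignment concrete, here is a minimal Python sketch of one way a one-prompt-per-question mapping could be produced. The round-robin-after-shuffle scheme, the field names, and the fixed seed are assumptions for illustration only; the benchmark ships its data with prompts already assigned.

```python
import random

def assign_prompts(questions, prompts, seed=0):
    """Give every question exactly one instruction prompt, spreading the
    prompts (near-)uniformly over the whole task."""
    rng = random.Random(seed)          # fixed seed keeps the mapping reproducible
    order = list(range(len(questions)))
    rng.shuffle(order)                 # avoid correlating prompts with question order
    assignment = {}
    for rank, idx in enumerate(order):
        # Round-robin over the prompt list: each prompt ends up on
        # roughly len(questions) / len(prompts) questions.
        assignment[questions[idx]["id"]] = prompts[rank % len(prompts)]
    return assignment

# Toy usage: 3 prompts over 9 questions -> each prompt is used exactly 3 times.
questions = [{"id": i, "text": f"question {i}"} for i in range(9)]
prompts = [
    "Answer with a single letter.",
    "Reply only with the option number.",
    "Output the correct option.",
]
mapping = assign_prompts(questions, prompts)
```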
How is measurement done?
The benchmark's scoring system is based on the international LM Evaluation Harness framework, which provides generative and log-likelihood setups for model evaluation; the two setups are compared in the table below.

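As an illustration of what a harness run looks like, below is a minimal sketch using the upstream lm-evaluation-harness Python API (`lm_eval.simple_evaluate`). The checkpoint and the task identifier are placeholders, not real benchmark task ids, and the benchmark's own setup may wrap this call differently.

```python
import lm_eval

# Minimal harness run: a Hugging Face causal LM evaluated on one task.
# "some_industrial_task" is a placeholder id, not a real task name.
results = lm_eval.simple_evaluate(
    model="hf",                                         # Hugging Face backend
    model_args="pretrained=ai-forever/ruGPT-3.5-13B",   # placeholder checkpoint
    tasks=["some_industrial_task"],
    num_fewshot=0,        # few-shot examples are fixed by the task configuration
    batch_size=8,
)

# Aggregated metrics per task (accuracy, F1, etc., depending on the task).
print(results["results"])
```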
| Generative evaluation | Log-likelihood evaluation |
|---|---|
| Does not require access to logits; suitable for any model capable of generating text. | Not suitable for API models, since they typically do not return the logits needed for log-likelihood scoring. |
| Requires answer post-processing (there are no universal heuristics; human side-by-side (SBS) review, LLM-as-a-Judge, or special parsers are used). | Does not require post-processing of the model response, because the answer is a fixed letter or number. |
| Smaller models tend to generate irrelevant responses. | Makes it possible to measure the probability a language model assigns to specific responses. |
| We recommend evaluating instruction-tuned (SFT-like) models and APIs only in the generative setup. | Better suited for evaluating pretrained and smaller models. |
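To show what the log-likelihood setup measures in practice, here is a rough sketch using Hugging Face transformers: each fixed answer option is scored by the total log-probability of its tokens as a continuation of the prompt, and the highest-scoring option is taken as the model's prediction. The model name, prompt, and answer options are placeholders, not the benchmark's actual harness code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def option_loglikelihood(model, tokenizer, prompt, option):
    """Total log-probability the model assigns to `option` as a continuation
    of `prompt` -- the quantity compared across options in the log-likelihood setup."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    option_ids = tokenizer(option, add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probabilities of each option token, conditioned on everything before it.
    log_probs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    return log_probs.gather(1, option_ids[0].unsqueeze(1)).sum().item()

# Placeholder model and multiple-choice question.
name = "ai-forever/rugpt3small_based_on_gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name).eval()

prompt = "Question: ...\nAnswer options: 1, 2, 3, 4\nAnswer:"
options = [" 1", " 2", " 3", " 4"]
scores = {o: option_loglikelihood(model, tokenizer, prompt, o) for o in options}
prediction = max(scores, key=scores.get)  # option with the highest log-likelihood
```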