Large language models handle routine tasks more effectively than ever before, but to give high-quality answers to highly specialized questions they need to engage deeply with the substance of specific fields. This test takes such a step deeper into medicine, testing whether the model's knowledge approaches that of a general practitioner who has recently graduated from university.
The test covers the fundamental medical sciences: in-depth knowledge of how the human body functions at every level, from the cell (Biology, Biophysics, Biochemistry) to organ systems (Anatomy, Physiology, the pathological disciplines), as well as skills in the main areas of medicine, such as surgery, therapy, hygiene, laboratory diagnostics, and pharmacology.
The fundamental sciences are the necessary basis on which the clinical specialties are built, so this body of knowledge is possessed not only by every graduate of the "General Medicine" specialty but by any specialist in the medical field. Without it, a language model cannot provide a detailed and accurate answer to a medical question, explain the significance of a pathology, or justify the importance of following the instructions for a medicinal product.
The test includes 17 fundamental medical sciences, each comprising 270 test questions and 30 thematic training tasks (4,590 test questions and 510 training tasks in total). Each question has four answer options, only one of which is correct.
Keywords: Medicine, fundamental medicine, Anatomy, Biochemistry, Bioorganic Chemistry, Biophysics, Clinical Laboratory Diagnostics, Faculty Surgery, General Chemistry, General Surgery, Histology, Hygiene, Microbiology, Normal Physiology, Parasitology, Pathological Anatomy, Pathophysiology, Pharmacology, Propaedeutics of Internal Diseases
Authors: Almazov National Medical Research Center of the Ministry of Health of the Russian Federation
Motivation
This task is one of six benchmarks in the medicine and healthcare set, which is intended to assess professional knowledge of the fundamental medical sciences. It resembles the well-known MMLU test in structure and purpose and is suitable for comprehensive testing of the professional quality of a language model's understanding and responses. We provide a public MMLU-style version of the medical benchmark in Russian to assess model capabilities on real professional tasks.
Data description
Data fields
instruction — a string containing the instruction for the task;
inputs — a dict with the input data:
    question — a string with the task question;
    option_a — answer option A;
    option_b — answer option B;
    option_c — answer option C;
    option_d — answer option D;
outputs — a string containing the correct answer for the task (one or more letters (A-H), separated by commas and written in alphabetical order);
meta — a dict with task meta information:
    id — an integer, the task's unique number in the dataset;
    domain — a string with the task's domain name.
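For illustration, a single record in this schema might look as follows. This is a hypothetical sketch: the placeholder values (including the "Anatomy" domain label) are invented, and only the field structure follows the description above.

    # A hypothetical record matching the schema above; all values are
    # placeholders, not actual dataset content.
    sample = {
        "instruction": "Short test on a medical topic. Question: {question} "
                       "Possible answers: {option_a} {option_b} {option_c} {option_d} "
                       "Write down your answer using only one letter.",
        "inputs": {
            "question": "<task question>",
            "option_a": "<answer option A>",
            "option_b": "<answer option B>",
            "option_c": "<answer option C>",
            "option_d": "<answer option D>",
        },
        "outputs": "A",  # the correct answer letter
        "meta": {"id": 1, "domain": "Anatomy"},
    }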
Prompts
Ten prompts of varying complexity were prepared for the dataset.
Example:
"Short test on a medical topic. Question: {question} Possible answers: {option_a} {option_b} {option_c} {option_d} Write down your answer using only one letter. Only answers consisting of one letter will be accepted. Answers containing any other information will not be accepted or evaluated."
Dataset Creation
All tasks in this set were written by top experts (practicing physicians and medical researchers), professionally edited, and then manually double-checked by three different experts.
Metric
Two quality metrics are used: Exact Match (the share of tasks whose predicted answer matches the reference exactly) and F1.
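As an illustration, a minimal implementation of both metrics might look as follows. This is a sketch under assumptions: predictions and references are single letters, and F1 is macro-averaged over the four answer options (the benchmark's exact aggregation scheme may differ).

    def exact_match(preds, golds):
        # Share of tasks whose predicted letter equals the reference letter.
        return sum(p == g for p, g in zip(preds, golds)) / len(golds)

    def macro_f1(preds, golds, labels="ABCD"):
        # Per-letter F1 scores, averaged with equal weight (macro averaging).
        scores = []
        for label in labels:
            tp = sum(p == label and g == label for p, g in zip(preds, golds))
            fp = sum(p == label and g != label for p, g in zip(preds, golds))
            fn = sum(p != label and g == label for p, g in zip(preds, golds))
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            scores.append(2 * precision * recall / (precision + recall)
                          if precision + recall else 0.0)
        return sum(scores) / len(scores)

    # Example: two predictions, one correct.
    print(exact_match(["A", "B"], ["A", "C"]))  # 0.5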