MaMuRAMu
Task Description
Massive Multitask Russian AMplified Understudy (MaMuRAMu) is a dataset designed to measure the professional knowledge a model acquires during pretraining in various fields. The task covers 57 subjects (subdomains) grouped into topics (domains): HUMANITIES; SOCIAL SCIENCE; SCIENCE, TECHNOLOGY, ENGINEERING, AND MATHEMATICS (STEM); OTHER. The dataset was created based on the English MMLU proposed in [1] and follows its methodology and instruction format. Each example contains a question from one of the categories with four possible answers, only one of which is correct.
Warning: to avoid data leakage, MaMuRAMu was created as a new closed dataset that follows the original MMLU design. Thus, results on the MMLU and MaMuRAMu datasets cannot be directly compared with each other.
Keywords: logic, world knowledge, factual, expert knowledge
Motivation
This set continues the idea of the GLUE [2] and SuperGLUE [3] benchmarks, which focus on a generalized assessment of Natural Language Understanding (NLU) tasks. Unlike sets such as ruWorldTree and ruOpenBookQA (whose questions are similar in format to MMLU but cover the school curriculum and elementary knowledge), MaMuRAMu is designed to test professional knowledge in various fields.
Dataset Description
Data Fields
instruction is a string containing instructions for the task and information about the requirements for the model output format;
inputs is a dictionary that contains the following information:
    text is the test question;
    option_a is the option A;
    option_b is the option B;
    option_c is the option C;
    option_d is the option D;
    subject is the topic of the question (a generalization of a group of subdomains by meaning);
outputs is the result: one of the following string values: "A", "B", "C", "D";
meta is a dictionary containing meta information:
    id is an integer indicating the index of the example;
    domain is the subdomain of the question.
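For illustration, a single record might look like the sketch below. This is a hypothetical example assembled from the field descriptions above; the question text, options, and meta values are invented.

```python
# A hypothetical MaMuRAMu record matching the field descriptions above;
# the question, options, and meta values are invented for illustration.
example = {
    "instruction": (
        "Вопрос:\n{text}. Варианты ответа:\nA {option_a}\nB {option_b}\n"
        "C {option_c}\nD {option_d}\nИспользуй знания по теме {subject} "
        "и выбери правильный ответ. Выведи только одну букву. Ответ:"
    ),
    "inputs": {
        "text": "Какой газ преобладает в атмосфере Земли?",
        "option_a": "Кислород",
        "option_b": "Азот",
        "option_c": "Углекислый газ",
        "option_d": "Аргон",
        "subject": "География",
    },
    "outputs": "B",  # one of the strings "A", "B", "C", "D"
    "meta": {
        "id": 0,                # integer index of the example
        "domain": "geography",  # subdomain of the question
    },
}
```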
Prompts
For this task, 10 prompts of varying difficulty were created. Example:
"Вопрос:\n{text}. Варианты ответа:\nA {option_a}\nB {option_b}\nC {option_c}\nD {option_d}\nИспользуй знания по теме {subject} и выбери правильный ответ. Выведи только одну букву. Ответ:"
Dataset Creation
The test set follows the methodology of the original MMLU dataset. We present the adapted MMLU set as the public set.
The MaMuRAMu dataset was created from scratch by Russian experts. The set was assembled manually in the original format, with domains as close as possible to those of the original set, and was adapted for the Russian language and culture. The distribution of tasks across individual domains and subjects is balanced.
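One way to sanity-check this balance is to count examples per subdomain; the sketch below assumes the dataset has been loaded as a list of records shaped as in the Data Fields section.

```python
from collections import Counter

def domain_counts(dataset: list[dict]) -> Counter:
    """Count examples per subdomain using the meta.domain field."""
    return Counter(ex["meta"]["domain"] for ex in dataset)

# A balanced set should show roughly equal counts across subdomains:
# print(domain_counts(dataset).most_common())
```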
Evaluation
Human Benchmark
The MaMuRAMu test set was measured for the updated version of the leaderboard: human experts scored 79.6%, and non-experts scored 46%.
The annotation overlap was 5 on the evaluated subset of categories.
Note: for this set, due to limited expert resources, annotation was conducted only for selected domains and categories (biology, geography, history, general facts, physics; 100 samples per category). This score does not reflect average performance across all dataset domains and therefore cannot be directly compared with model scores on this dataset.
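Since both the gold answers and the expected model outputs are single letters, scoring naturally reduces to exact-match accuracy. The sketch below assumes that metric; the section above reports scores as percentages but does not spell out the scoring script.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of examples whose predicted letter matches the gold letter."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    correct = sum(
        p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold)
    )
    return correct / len(gold)

# For example: accuracy(["A", "C", "B"], ["A", "B", "B"]) == 2 / 3
```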
References
[1] Hendrycks, Dan, et al. "Measuring Massive Multitask Language Understanding." International Conference on Learning Representations. 2021.
[2] Wang, Alex, et al. "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding." International Conference on Learning Representations. 2019.
[3] Wang, Alex, et al. "SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems." Advances in Neural Information Processing Systems 32 (2019).
[4] The original MMLU translated into Russian (without filtering): https://github.com/NLP-Core-Team/mmlu_ru
[5] The 🤗 Open LLM Leaderboard (includes MMLU; measured in a 5-shot setting): https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard