Qwen2.5-32B-Instruct

RCC MSU Created at 13.11.2024 13:47
Overall result: 0.603
Place in the rating: 21

In the top by tasks (place in the rating):
ruHateSpeech: 10

Weak tasks (place in the rating):
RWSD: 20
PARus: 25
RCB: 84
ruEthics: 120
MultiQ: 28
ruWorldTree: 23
ruOpenBookQA: 34
CheGeKa: 123
ruMMLU: 42
ruDetox: 39
ruHHH: 48
ruTiE: 25
ruHumanEval: 35
USE: 31
MathLogicQA: 21
ruMultiAr: 21
SimpleAr: 35
LCS: 22
BPS: 37
ruModAr: 88
MaMuRAMu: 39
ruCodeEval: 33

Ratings for leaderboard tasks


Task name: Result (Metric)
LCS: 0.17 (Accuracy)
RCB: 0.564 / 0.519 (Accuracy / F1 macro)
USE: 0.339 (Grade norm)
RWSD: 0.669 (Accuracy)
PARus: 0.936 (Accuracy)
ruTiE: 0.865 (Accuracy)
MultiQ: 0.585 / 0.452 (F1 / Exact match)
CheGeKa: 0.191 / 0.147 (F1 / Exact match)
ruModAr: 0.628 (Exact match)
MaMuRAMu: 0.823 (Accuracy)
ruMultiAr: 0.428 (Exact match)
ruCodeEval: 0.364 / 0.454 / 0.5 (Pass@k)
MathLogicQA: 0.704 (Accuracy)
ruWorldTree: 0.983 / 0.983 (Accuracy / F1 macro)
ruOpenBookQA: 0.923 / 0.741 (Accuracy / F1 macro)
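Several tasks above (RCB, ruWorldTree, ruOpenBookQA) report both Accuracy and macro-averaged F1. As a reminder of what the second number measures, here is a minimal macro-F1 sketch in plain Python; the example labels are illustrative, not MERA data:

```python
def f1_macro(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then take the
    unweighted mean, so rare classes count as much as frequent ones."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)

# Illustrative 3-class example (RCB-style entailment labels):
y_true = ["entailment", "contradiction", "neutral", "entailment"]
y_pred = ["entailment", "neutral", "neutral", "entailment"]
print(round(f1_macro(y_true, y_pred), 3))  # 0.556
```

Because the mean is unweighted, a gap between Accuracy and F1 macro (as in ruOpenBookQA: 0.923 vs 0.741) usually signals that the model does poorly on one or more infrequent classes.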

Evaluation on open tasks:



Task name: Result (Metric)
BPS: 0.99 (Accuracy)
ruMMLU: 0.747 (Accuracy)
SimpleAr: 0.994 (Exact match)
ruHumanEval: 0.348 / 0.432 / 0.463 (Pass@k)
ruHHH: 0.848
ruHateSpeech: 0.857
ruDetox: 0.327
ruEthics (Correct / Good / Ethical):
Virtue: 0.37 / 0.352 / 0.418
Law: 0.35 / 0.342 / 0.396
Moral: 0.387 / 0.378 / 0.446
Justice: 0.314 / 0.306 / 0.381
Utilitarianism: 0.3 / 0.299 / 0.378
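ruHumanEval and ruCodeEval each report three Pass@k values. The widely used unbiased combinatorial estimator for this metric can be sketched as follows; the leaderboard does not state the number of generations n or the exact k values, so the numbers in the example are illustrative:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples drawn without replacement from n generations is correct,
    given that c of the n generations passed the tests."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with misses
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative: 10 generations per task, 4 of which pass the tests
print(round(pass_at_k(10, 4, 1), 3))  # 0.4
print(round(pass_at_k(10, 4, 5), 3))  # 0.976
```

Averaging this quantity over all tasks gives the reported Pass@k; larger k credits the model for solving a task in any of its samples, which is why the three numbers increase left to right.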

Information about the submission:

MERA version: v.1.2.0
Torch version: 2.4.0
Codebase version: 430295f
CUDA version: 12.1
Model weights precision: auto
Seed: 1234
Batch size: 4
Transformers version: 4.45.2
Number of GPUs and their type: 1 x NVIDIA A100
Architecture: vllm

Team:

RCC MSU

Name of the ML model:

Qwen2.5-32B-Instruct

Model size

32.0B

Model type:

Open

SFT

Additional links:

https://qwenlm.github.io/blog/qwen2.5-llm/

Architecture description:

Type: Causal Language Model
Training Stage: Pretraining & Post-training
Architecture: transformer with RoPE, SwiGLU, RMSNorm, and Attention QKV bias
Number of Parameters: 32.5B
Number of Parameters (Non-Embedding): 31.0B
Number of Layers: 64
Number of Attention Heads (GQA): 40 for Q and 8 for KV
Context Length: full 131,072 tokens; generation up to 8,192 tokens
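The GQA configuration (40 query heads sharing 8 KV heads) cuts the KV cache to a fifth of what full multi-head attention would need. A rough sizing sketch, assuming a head dimension of 128 (an assumption, not stated in the card) and an fp16 cache:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """KV-cache size for one sequence: keys and values (factor 2),
    one set per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

# Config from the card: 64 layers, 8 KV heads; head_dim=128 is assumed
# (40 Q heads x 128 = 5120 hidden size, typical for this model class).
gqa = kv_cache_bytes(64, 8, 128, 131_072)
print(f"{gqa / 2**30:.0f} GiB")  # 32 GiB at the full 131k context

mha = kv_cache_bytes(64, 40, 128, 131_072)
print(f"{mha / gqa:.0f}x larger with 40 KV heads")  # 5x
```

Under these assumptions, serving the full 131k context costs about 32 GiB of cache per sequence in fp16, which is why GQA matters for a 32B model on a single A100.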

Description of the training:

-

Pretrain data:

The pre-training dataset was expanded from 7 trillion to as many as 18 trillion tokens.

License:

apache-2.0

Inference parameters

Generation Parameters:
simplear: do_sample=false; until=["\n"]
chegeka: do_sample=false; until=["\n"]
rudetox: do_sample=false; until=["\n"]
rumultiar: do_sample=false; until=["\n"]
use: do_sample=false; until=["\n", "."]
multiq: do_sample=false; until=["\n"]
rumodar: do_sample=false; until=["\n"]
ruhumaneval: do_sample=true; until=["\nclass", "\ndef", "\n#", "\nif", "\nprint"]; temperature=0.6
rucodeeval: do_sample=true; until=["\nclass", "\ndef", "\n#", "\nif", "\nprint"]; temperature=0.6

The size of the context:
simplear, chegeka, rudetox, rumultiar, use, multiq, rumodar, ruhumaneval, rucodeeval: 8192
rutie: 3000
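The `until` lists above are stop sequences: generation is truncated at the first occurrence of any of them. A minimal sketch of that behavior (`truncate_at_stop` is a hypothetical helper for illustration, not MERA's implementation):

```python
def truncate_at_stop(text: str, until: list[str]) -> str:
    """Cut generated text at the earliest occurrence of any stop
    sequence, mirroring the until=[...] generation parameter."""
    cut = len(text)
    for stop in until:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# Arithmetic-style tasks stop at the first newline:
print(truncate_at_stop("42\nExplanation: ...", ["\n"]))  # 42

# Code tasks stop at the next top-level construct, so only the
# function body survives:
gen = "    return a + b\nprint(add(1, 2))"
print(truncate_at_stop(gen, ["\nclass", "\ndef", "\n#", "\nif", "\nprint"]))
```

This is why the code-task stop list targets `\nclass`, `\ndef`, etc.: the model is asked to complete a function body, and anything it emits at column zero afterwards is discarded.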

System prompt (translated from Russian):
Solve the task strictly according to the instruction. Give only the answer, with no explanations. For a numeric answer, output only the number; for a letter, digit, or word, output only that; for a multiple-choice answer, output a single letter or digit. The answer must be exact, with no extra symbols or words. If Python code has to be generated, your answer must consist only of the code (a continuation of the code from the instruction): do not repeat the function name, give no explanations, write no comments, do not use input, write the code so that it completes the function from the instruction (with the appropriate indentation), and always start the code with a line break!

Description of the template:
{%- if tools %}
    {{- '<|im_start|>system\n' }}
    {%- if messages[0]['role'] == 'system' %}
        {{- messages[0]['content'] }}
    {%- else %}
        {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}
    {%- endif %}
    {{- "\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>" }}
    {%- for tool in tools %}
        {{- "\n" }}
        {{- tool | tojson }}
    {%- endfor %}
    {{- "\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}\n</tool_call><|im_end|>\n" }}
{%- else %}
    {%- if messages[0]['role'] == 'system' %}
        {{- '<|im_start|>system\n' + messages[0]['content'] + '<|im_end|>\n' }}
    {%- else %}
        {{- '<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>\n' }}
    {%- endif %}
{%- endif %}
{%- for message in messages %}
    {%- if (message.role == "user") or (message.role == "system" and not loop.first) or (message.role == "assistant" and not message.tool_calls) %}
        {{- '<|im_start|>' + message.role + '\n' + message.content + '<|im_end|>' + '\n' }}
    {%- elif message.role == "assistant" %}
        {{- '<|im_start|>' + message.role }}
        {%- if message.content %}
            {{- '\n' + message.content }}
        {%- endif %}
        {%- for tool_call in message.tool_calls %}
            {%- if tool_call.function is defined %}
                {%- set tool_call = tool_call.function %}
            {%- endif %}
            {{- '\n<tool_call>\n{"name": "' }}
            {{- tool_call.name }}
            {{- '", "arguments": ' }}
            {{- tool_call.arguments | tojson }}
            {{- '}\n</tool_call>' }}
        {%- endfor %}
        {{- '<|im_end|>\n' }}
    {%- elif message.role == "tool" %}
        {%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != "tool") %}
            {{- '<|im_start|>user' }}
        {%- endif %}
        {{- '\n<tool_response>\n' }}
        {{- message.content }}
        {{- '\n</tool_response>' }}
        {%- if loop.last or (messages[loop.index0 + 1].role != "tool") %}
            {{- '<|im_end|>\n' }}
        {%- endif %}
    {%- endif %}
{%- endfor %}
{%- if add_generation_prompt %}
    {{- '<|im_start|>assistant\n' }}
{%- endif %}
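For the common no-tools path, the template above reduces to ChatML-style turns with a default system message. A minimal Python re-implementation of just that branch (in practice one would call `tokenizer.apply_chat_template` from transformers, which renders the full Jinja template):

```python
def render_chatml(messages, add_generation_prompt=True):
    """Minimal sketch of the template's no-tools branch: ChatML-style
    <|im_start|>/<|im_end|> turns, with the default Qwen system message
    inserted when the conversation supplies none."""
    out = []
    if messages and messages[0]["role"] == "system":
        out.append(f"<|im_start|>system\n{messages[0]['content']}<|im_end|>\n")
        messages = messages[1:]
    else:
        out.append("<|im_start|>system\nYou are Qwen, created by Alibaba Cloud. "
                   "You are a helpful assistant.<|im_end|>\n")
    for m in messages:
        out.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        out.append("<|im_start|>assistant\n")
    return "".join(out)

print(render_chatml([{"role": "user", "content": "2+2?"}]))
```

With `add_generation_prompt=True` the rendered string ends in an open `<|im_start|>assistant\n` header, so the model's completion becomes the assistant turn.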


Ratings by subcategory

USE by task number (Grade Norm):
1: 0.433
2: 0.5
3: 0.833
4: 0.1
5: 0.233
6: 0.467
7: 0.133
8: -
9: 0.067
10: 0.1
11: 0.1
12: 0.067
13: 0.167
14: 0.133
15: 0.167
16: 0.35
17: 0.033
18: 0.033
19: 0
20: 0.067
21: 0.033
22: 0.767
23: 0.467
24: 0.367
25: 0.233
26: 0.758
8_0: 0.2
8_1: 0.3
8_2: 0.533
8_3: 0.633
8_4: 0.633

ruHHH:
Honest: 0.836
Helpful: 0.831
Harmless: 0.879
ruMMLU by subcategory (Accuracy):
Anatomy: 0.659
Virology: 0.536
Astronomy: 0.921
Marketing: 0.85
Nutrition: 0.814
Sociology: 0.846
Management: 0.806
Philosophy: 0.801
Prehistory: 0.849
Human aging: 0.749
Econometrics: 0.658
Formal logic: 0.635
Global facts: 0.55
Jurisprudence: 0.787
Miscellaneous: 0.824
Moral disputes: 0.757
Business ethics: 0.8
Biology (college): 0.896
Physics (college): 0.644
Human sexuality: 0.84
Moral scenarios: 0.524
World religions: 0.871
Abstract algebra: 0.7
Medicine (college): 0.728
Machine learning: 0.652
Medical genetics: 0.89
Professional law: 0.538
PR: 0.667
Security studies: 0.792
Chemistry (college): 0.56
Computer security: 0.82
International law: 0.843
Logical fallacies: 0.791
Politics: 0.889
Clinical knowledge: 0.8
Conceptual physics: 0.842
Math (college): 0.68
Biology (high school): 0.894
Physics (high school): 0.709
Chemistry (high school): 0.729
Geography (high school): 0.869
Professional medicine: 0.787
Electrical engineering: 0.703
Elementary mathematics: 0.854
Psychology (high school): 0.893
Statistics (high school): 0.806
History (high school): 0.887
Math (high school): 0.641
Professional accounting: 0.553
Professional psychology: 0.748
Computer science (college): 0.76
World history (high school): 0.861
Macroeconomics: 0.838
Microeconomics: 0.882
Computer science (high school): 0.87
European history: 0.83
Government and politics: 0.876
ruDetox:
SIM: 0.629
FL: 0.697
STA: 0.777
MaMuRAMu by subcategory (Accuracy):
Anatomy: 0.667
Virology: 0.911
Astronomy: 0.767
Marketing: 0.676
Nutrition: 0.895
Sociology: 0.828
Management: 0.707
Philosophy: 0.667
Prehistory: 0.827
Gerontology: 0.815
Econometrics: 0.833
Formal logic: 0.792
Global facts: 0.583
Jurisprudence: 0.791
Miscellaneous: 0.807
Moral disputes: 0.765
Business ethics: 0.794
Biology (college): 0.844
Physics (college): 0.737
Human sexuality: 0.842
Moral scenarios: 0.807
World religions: 0.932
Abstract algebra: 0.933
Medicine (college): 0.888
Machine learning: 0.867
Genetics: 0.848
Professional law: 0.821
PR: 0.702
Security: 0.947
Chemistry (college): 0.8
Computer security: 0.844
International law: 0.936
Logical fallacies: 0.821
Politics: 0.912
Clinical knowledge: 0.712
Conceptual physics: 0.821
Math (college): 0.911
Biology (high school): 0.867
Physics (high school): 0.737
Chemistry (high school): 0.754
Geography (high school): 0.869
Professional medicine: 0.889
Electrical engineering: 0.844
Elementary mathematics: 1
Psychology (high school): 0.897
Statistics (high school): 0.911
History (high school): 0.897
Math (high school): 0.932
Professional accounting: 0.877
Professional psychology: 0.947
Computer science (college): 0.867
World history (high school): 0.87
Macroeconomics: 0.861
Microeconomics: 0.779
Computer science (high school): 0.628
European history: 0.778
Government and politics: 0.889
ruEthics, "Correct" criterion:
Virtue: 0.37
Law: 0.35
Moral: 0.387
Justice: 0.314
Utilitarianism: 0.3

ruEthics, "Good" criterion:
Virtue: 0.352
Law: 0.342
Moral: 0.378
Justice: 0.306
Utilitarianism: 0.299

ruEthics, "Ethical" criterion:
Virtue: 0.418
Law: 0.396
Moral: 0.446
Justice: 0.381
Utilitarianism: 0.378
ruHateSpeech by target group:
Women: 0.88
Men: 0.771
LGBT: 0.882
Nationalities: 0.811
Migrants: 0.857
Other: 0.885