ruGPT-3-medium

Created on 12.01.2024 14:46

General assessment: 0.201

Task name Result Metric
BPS 0.43 Accuracy
LCS 0.102 Accuracy
RCB 0.333 / 0.167 Avg. F1 / Accuracy
USE 0.002 Grade Norm
RWSD 0.5 Accuracy
PARus 0.498 Accuracy
ruTiE 0.5 Accuracy
MultiQ 0.106 / 0.043 F1-score/EM
ruMMLU 0.271 Accuracy
CheGeKa 0.005 / 0 F1 / EM
ruModAr 0.001 Accuracy
SimpleAr 0.008 Accuracy
ruMultiAr 0.012 Accuracy
MathLogicQA 0.248 Accuracy
ruHumanEval 0 / 0 / 0 pass@k
ruWorldTree 0.251 / 0.248 Avg. F1 / Accuracy
ruOpenBookQA 0.273 / 0.271 Avg. F1 / Accuracy
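
The general assessment of 0.201 can be reproduced from the table above under one assumption that this page does not state: the overall score is the unweighted mean over the 17 tasks, with each multi-metric task first reduced to the mean of its own metrics. A minimal Python sketch of that aggregation follows (the names `results` and `overall_score` are illustrative, not part of the benchmark code):

```python
from statistics import mean

# Per-task results copied from the table above.
# Tasks with several metrics keep all of their values.
results = {
    "BPS": [0.43],
    "LCS": [0.102],
    "RCB": [0.333, 0.167],
    "USE": [0.002],
    "RWSD": [0.5],
    "PARus": [0.498],
    "ruTiE": [0.5],
    "MultiQ": [0.106, 0.043],
    "ruMMLU": [0.271],
    "CheGeKa": [0.005, 0.0],
    "ruModAr": [0.001],
    "SimpleAr": [0.008],
    "ruMultiAr": [0.012],
    "MathLogicQA": [0.248],
    "ruHumanEval": [0.0, 0.0, 0.0],
    "ruWorldTree": [0.251, 0.248],
    "ruOpenBookQA": [0.273, 0.271],
}


def overall_score(task_results):
    """Assumed aggregation: mean over tasks of each task's mean metric."""
    return mean(mean(values) for values in task_results.values())


print(round(overall_score(results), 3))  # 0.201
```

Under this assumption the rounded result matches the reported general assessment; the diagnostic datasets below are excluded from it.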

Evaluation on diagnostic datasets:

These results are not taken into account in the overall assessment.

Task name Result Metric
ruHHH 0.483 Accuracy
  • Honest: 0.508
  • Harmless: 0.466
  • Helpful: 0.475
ruHateSpeech 0.543 Accuracy
  • Women: 0.519
  • Men: 0.686
  • LGBT: 0.588
  • Nationality: 0.595
  • Migrants: 0.286
  • Other: 0.492
ruDetox
  • Overall average score (J): 0.348
  • Assessment of the preservation of meaning (SIM): 0.713
  • Assessment of naturalness (FL): 0.618
  • Style Transfer Accuracy (STA): 0.755

ruEthics (metric: 5 MCC)
Criterion Correct Good Ethical
Virtue 0.076 0.03 -0.072
Law 0.083 0.035 -0.035
Moral 0.086 0.042 -0.064
Justice 0.061 0.026 -0.068
Utilitarianism 0.076 0.033 -0.063
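
The ruEthics results are reported as Matthews correlation coefficients (MCC), where values near 0 indicate agreement no better than chance. As a rough illustration of the metric, the sketch below computes MCC for binary labels in plain Python. The toy `labels` and `preds` lists are hypothetical and are not taken from the benchmark, and the benchmark's own scoring code may differ in detail.

```python
from math import sqrt


def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary (0/1) labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: return 0.0 when a row or column of the
    # confusion matrix is empty and the ratio is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical toy data, for illustration only.
labels = [1, 0, 1, 1, 0, 0]
preds = [1, 0, 0, 1, 0, 1]
print(round(mcc(labels, preds), 3))  # 0.333
```

MCC ranges from -1 (total disagreement) to 1 (perfect agreement), so the values in the table, all within about 0.09 of zero, suggest the model's answers are nearly uncorrelated with the ethical criteria.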

Information about the submission:

Team:

MERA

Name of the ML model:

ruGPT-3-medium

Additional links:

https://arxiv.org/abs/2309.10931

Architecture description:

ruGPT-3 is a Russian counterpart of GPT-3 (Brown et al., 2020). We use the model architecture description by Brown et al. and the GPT-2 code base (Radford et al., 2019) from the Transformers library. ruGPT-3 is pretrained on the language modeling objective. We use the BBPE tokenizer with a vocabulary size of 5 · 10^4 tokens.

Description of the training:

The model was trained by the SberDevices team with sequence length 1024, using the Transformers library, on 80B tokens for 3 epochs. After that, the model was fine-tuned for 1 epoch with sequence length 2048. Total training time was around 14 days on 128 GPUs for the 1024 context and a few days on 16 GPUs for the 2048 context. The final perplexity on the test set is 13.6.
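
The description does not say how the reported perplexity was computed. Assuming the standard definition (the exponential of the mean negative log-likelihood per token, using natural logarithms), a minimal sketch looks like this. The `log_probs` values are synthetic and chosen only so that the result equals 13.6.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)


# Synthetic example: a model that assigns probability 1/13.6 to every token
# has perplexity 13.6, i.e. it is as uncertain as a uniform choice
# among 13.6 options at each step.
log_probs = [math.log(1 / 13.6)] * 1000
print(round(perplexity(log_probs), 1))  # 13.6
```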

Pretrain data:

450GB of texts. The corpus includes texts from various publicly available resources, which represent diverse domains: Wikipedia, News, Books, Colossal Clean Crawled Corpus, OpenSubtitles.

Training Details:

The ruGPT-3 models are pretrained with a maximum sequence length of 1024 tokens for three epochs and of 2048 tokens for one epoch. We use an initial learning rate of 1e−4 and the Adam optimizer with β1 = 0.9, β2 = 0.99, and ϵ = 1e−8. The total number of tokens seen during pretraining is 80B. The pretraining of ruGPT3-large took 16 days on a cluster of 64 V100-SXM3 GPUs.
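
To make the optimizer settings concrete, the following sketch applies one textbook Adam update to a single scalar parameter using the hyperparameters quoted above. It is an illustration only: weight decay, warmup, and the learning-rate schedule are not described here and are therefore omitted, and the function name `adam_step` is hypothetical.

```python
import math

# Hyperparameters quoted in the training details.
LR, BETA1, BETA2, EPS = 1e-4, 0.9, 0.99, 1e-8


def adam_step(param, grad, m, v, t):
    """One Adam update for a scalar parameter at step t (1-indexed)."""
    m = BETA1 * m + (1 - BETA1) * grad          # first-moment estimate
    v = BETA2 * v + (1 - BETA2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - BETA1 ** t)                # bias correction
    v_hat = v / (1 - BETA2 ** t)
    param = param - LR * m_hat / (math.sqrt(v_hat) + EPS)
    return param, m, v


# On the first step the bias-corrected ratio is close to sign(grad),
# so the parameter moves by roughly the learning rate.
param, m, v = adam_step(param=1.0, grad=2.0, m=0.0, v=0.0, t=1)
print(param)
```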

License:

MIT

Strategy, generation and parameters:

Code version v.1.1.0. No parameters were changed; all are used as prepared by the organizers. Details:
  • 1 x NVIDIA A100
  • dtype auto
  • PyTorch 2.1.2 + CUDA 12.1
  • Transformers 4.36.2
  • Context length 2048