ruGPT-3-small

Created at 12.01.2024 14:47

Overall score: 0.191

Task name       Result             Metric
BPS             0.367              Accuracy
LCS             0.08               Accuracy
RCB             0.333 / 0.167      Avg. F1 / Accuracy
USE             0.001              Grade Norm
RWSD            0.492              Accuracy
PARus           0.498              Accuracy
ruTiE           0.5                Accuracy
MultiQ          0.063 / 0.009      F1 / EM
ruMMLU          0.263              Accuracy
CheGeKa         0.007 / 0          F1 / EM
ruModAr         0.001              Accuracy
SimpleAr        0.0                Accuracy
ruMultiAr       0.009              Accuracy
MathLogicQA     0.244              Accuracy
ruHumanEval     0 / 0 / 0          pass@k
ruWorldTree     0.257 / 0.254      Avg. F1 / Accuracy
ruOpenBookQA    0.258 / 0.253      Avg. F1 / Accuracy
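
The overall figure of 0.191 reported at the top of this page appears consistent with an unweighted mean over the 17 tasks, where a task that reports several metrics first contributes the mean of its own values. That aggregation rule is an assumption inferred from the numbers, not something this page states, and the names `task_results` and `overall_score` are hypothetical. A minimal sketch:

```python
from statistics import mean

# Per-task results copied from the table above; a task that reports
# several metrics keeps all of its values in one tuple.
task_results = {
    "BPS": (0.367,),
    "LCS": (0.08,),
    "RCB": (0.333, 0.167),
    "USE": (0.001,),
    "RWSD": (0.492,),
    "PARus": (0.498,),
    "ruTiE": (0.5,),
    "MultiQ": (0.063, 0.009),
    "ruMMLU": (0.263,),
    "CheGeKa": (0.007, 0.0),
    "ruModAr": (0.001,),
    "SimpleAr": (0.0,),
    "ruMultiAr": (0.009,),
    "MathLogicQA": (0.244,),
    "ruHumanEval": (0.0, 0.0, 0.0),
    "ruWorldTree": (0.257, 0.254),
    "ruOpenBookQA": (0.258, 0.253),
}


def overall_score(results):
    """Unweighted mean of per-task scores (assumed aggregation rule)."""
    return mean(mean(values) for values in results.values())


score = overall_score(task_results)
print(f"{score:.4f}")
```

Under this assumption the result lands within rounding distance of the reported 0.191. The diagnostic datasets below are left out of the dictionary because they do not count toward the overall figure.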

Evaluation on diagnostic datasets:

These results are not included in the overall score.

Task name       Result    Metric
ruHHH           0.478     Accuracy
  • Honest: 0.475
  • Harmless: 0.466
  • Helpful: 0.492
ruHateSpeech    0.54      Accuracy
  • Women: 0.519
  • Men: 0.657
  • LGBT: 0.588
  • Nationality: 0.595
  • Migrants: 0.286
  • Other: 0.492
ruDetox
  • Overall average score (J): 0.316
  • Preservation of meaning (SIM): 0.676
  • Naturalness (FL): 0.612
  • Style transfer accuracy (STA): 0.713

ruEthics
                  Correct    Good    Ethical
Virtue            0          0       0
Law               0          0       0
Moral             0          0       0
Justice           0          0       0
Utilitarianism    0          0       0

Table results:

[[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 0]]

Metric: 5 MCC

Information about the submission:

Team:

MERA

Name of the ML model:

ruGPT-3-small

Additional links:

https://arxiv.org/abs/2309.10931

Architecture description:

ruGPT-3 is a Russian counterpart of GPT-3 (Brown et al., 2020). We use the model architecture description by Brown et al. and the GPT-2 code base (Radford et al., 2019) from the Transformers library. ruGPT-3 is pretrained on the language modeling objective. The BBPE tokenizer with the vocabulary size of 5 · 10^4 tokens was used.

Description of the training:

The model was trained by the SberDevices team with sequence length 1024, using the Transformers library, on 80B tokens for 3 epochs. After that, the model was fine-tuned for 1 epoch with sequence length 2048. Total training time was around 14 days on 128 GPUs for the 1024 context and a few days on 16 GPUs for the 2048 context. The final perplexity on the test set is 13.6.
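
To put the perplexity figure in context, the sketch below shows the usual relationship between perplexity and the average per-token negative log-likelihood. It assumes per-token perplexity computed with natural logarithms, which this page does not confirm; the function name `perplexity` and the toy inputs are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Perplexity as exp of the mean negative log-likelihood per token.

    Assumes natural-log probabilities, one entry per predicted token.
    """
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Toy input: every token is assigned probability 1/13.6, so the
# perplexity equals 13.6 by construction.
toy_log_probs = [math.log(1 / 13.6)] * 4
toy_ppl = perplexity(toy_log_probs)

# Going the other way: the cross-entropy (in nats per token) that
# corresponds to a perplexity of 13.6.
nats_per_token = math.log(13.6)
print(f"{toy_ppl:.1f} {nats_per_token:.2f}")
```

Read this way, a perplexity of 13.6 corresponds to roughly 2.6 nats of cross-entropy per token.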

Pretrain data:

450GB of texts. The corpus includes texts from various publicly available resources, which represent diverse domains: Wikipedia, News, Books, Colossal Clean Crawled Corpus, OpenSubtitles.

Training Details:

The ruGPT-3 models are pretrained with a maximum sequence length of 1024 tokens for three epochs and of 2048 tokens for one epoch. We use an initial learning rate of 1e−4 and the Adam optimizer with β1 = 0.9, β2 = 0.99, and ϵ = 1e−8. The total number of tokens seen during pretraining is 80B. The pretraining of ruGPT3-large took 16 days on a cluster of 32 V100-SXM3 GPUs.
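
To make the optimizer settings above concrete, here is a minimal pure-Python sketch of a single Adam update for one scalar parameter, using the stated learning rate, β1, β2, and ϵ. It illustrates the standard Adam update rule only and is not the training code used for ruGPT-3; the function name `adam_step` and the toy gradient are hypothetical.

```python
import math

# Hyperparameters quoted in the training details above.
LR, BETA1, BETA2, EPS = 1e-4, 0.9, 0.99, 1e-8


def adam_step(param, grad, m, v, t):
    """One Adam update for a scalar parameter at step t (t starts at 1)."""
    m = BETA1 * m + (1 - BETA1) * grad         # running mean of gradients
    v = BETA2 * v + (1 - BETA2) * grad * grad  # running mean of squared gradients
    m_hat = m / (1 - BETA1 ** t)               # bias-corrected first moment
    v_hat = v / (1 - BETA2 ** t)               # bias-corrected second moment
    param -= LR * m_hat / (math.sqrt(v_hat) + EPS)
    return param, m, v


# Toy usage: one step from param = 1.0 with an illustrative gradient of 0.5.
param, m, v = adam_step(1.0, 0.5, m=0.0, v=0.0, t=1)
print(param, m, v)
```

On the first step the bias-corrected update has a magnitude close to the learning rate regardless of the gradient's scale, so the parameter moves by about 1e−4.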

License:

MIT

Strategy, generation and parameters:

Code version v.1.1.0. No parameters were changed; all are used as prepared by the organizers. Details:
- 1 x NVIDIA A100
- dtype auto
- PyTorch 2.1.2 + CUDA 12.1
- Transformers 4.36.2
- Context length 2048