Task | Result | Metric |
---|---|---|
LCS | 0.132 | Accuracy |
RCB | 0.331 / 0.194 | Avg. F1 / Accuracy |
USE | 0.025 | Grade Norm |
RWSD | 0.523 | Accuracy |
PARus | 0.504 | Accuracy |
ruTiE | 0.488 | Accuracy |
MultiQ | 0.115 / 0.036 | F1 / EM |
CheGeKa | 0.037 / 0 | F1 / EM |
ruModAr | 0.001 | EM |
ruMultiAr | 0.025 | EM |
MathLogicQA | 0.258 | Accuracy |
ruWorldTree | 0.246 / 0.22 | Avg. F1 / Accuracy |
ruOpenBookQA | 0.223 / 0.208 | Avg. F1 / Accuracy |

Task | Result | Metric |
---|---|---|
BPS | 0.492 | Accuracy |
ruMMLU | 0.246 | Accuracy |
SimpleAr | 0.029 | EM |
ruHumanEval | 0.001 / 0.003 / 0.006 | pass@k |
ruHHH | 0.472 | Accuracy |
ruHateSpeech | 0.543 | Accuracy |
ruDetox | | Overall average score (J), meaning preservation score (SIM), naturalness score (FL), style transfer accuracy (STA) |
ruEthics | Table results: [[-0.036, -0.023, -0.025, -0.017, -0.016], | 5 MCC |

MERA
ruGPT-3.5 13B
ruGPT-3 is a Russian counterpart of GPT-3 (Brown et al., 2020). The model has 13B parameters. It is the largest model so far and was used to train the first version of GigaChat.
The model was trained with the DeepSpeed and Megatron libraries on a 300B-token dataset for 3 epochs, which took around 45 days on 512 V100 GPUs. It was then fine-tuned for 1 epoch with a sequence length of 2048, for around 20 days on 200 A100 GPUs, on additional data (described below).
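As a rough sanity check on the compute figures quoted above, the arithmetic can be sketched as follows. The inputs are the approximate numbers from the text; the helper names and the derived per-GPU throughput are illustrative only and assume the whole 45 days were spent processing tokens.

```python
# Approximate figures quoted in the text ("around" 45 and 20 days).
PRETRAIN_TOKENS = 300e9   # dataset size in tokens
PRETRAIN_EPOCHS = 3
PRETRAIN_DAYS = 45
PRETRAIN_GPUS = 512       # V100
FINETUNE_DAYS = 20
FINETUNE_GPUS = 200       # A100

SECONDS_PER_DAY = 24 * 60 * 60


def gpu_days(gpus: int, days: float) -> float:
    """Total GPU-days consumed by a run (hypothetical helper)."""
    return gpus * days


def tokens_per_gpu_second(tokens: float, gpus: int, days: float) -> float:
    """Average tokens processed per GPU per second, assuming no idle time."""
    return tokens / (gpus * days * SECONDS_PER_DAY)


tokens_seen = PRETRAIN_TOKENS * PRETRAIN_EPOCHS
print("pretraining GPU-days:", gpu_days(PRETRAIN_GPUS, PRETRAIN_DAYS))
print("fine-tuning GPU-days:", gpu_days(FINETUNE_GPUS, FINETUNE_DAYS))
print("tokens seen in pretraining:", tokens_seen)
print("tokens per V100 per second:",
      tokens_per_gpu_second(tokens_seen, PRETRAIN_GPUS, PRETRAIN_DAYS))
```

Because the durations are rounded, the throughput is an order-of-magnitude estimate rather than a measured value.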
The model was pretrained on 300 GB of text from various domains, then additionally trained on 100 GB of code and legal documents. The training data was deduplicated: each text in the corpus is hashed with a 64-bit hash, and only texts with a unique hash are kept. We also filter the documents by their text compression rate, measured with zlib. The most strongly and most weakly compressing deduplicated texts are discarded.
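The two filtering steps described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the hash function (first 8 bytes of SHA-256), the function names, and the `low`/`high` thresholds are all assumptions, since the text states neither which 64-bit hash was used nor which compression cut-offs were applied.

```python
import hashlib
import zlib


def hash64(text: str) -> int:
    """64-bit hash of a text (assumption: first 8 bytes of its SHA-256 digest)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def deduplicate(texts):
    """Keep only the first text seen for each 64-bit hash."""
    seen = set()
    unique = []
    for text in texts:
        h = hash64(text)
        if h not in seen:
            seen.add(h)
            unique.append(text)
    return unique


def compression_ratio(text: str) -> float:
    """Raw byte size divided by zlib-compressed size; higher means more compressible."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def filter_by_compression(texts, low=1.2, high=8.0):
    """Drop texts that compress too weakly (< low) or too strongly (> high).

    The thresholds are hypothetical placeholders, not values from the source.
    """
    return [t for t in texts if low <= compression_ratio(t) <= high]
```

Highly repetitive texts (for example, boilerplate repeated many times) compress very strongly, while very short or noisy texts barely compress at all, so a band-pass on the ratio removes both extremes.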
After the final training stage, the model's perplexity on Russian was around 8.8.
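For readers unfamiliar with the metric, perplexity is conventionally the exponential of the mean negative log-likelihood per token. The sketch below assumes natural-log token probabilities and a hypothetical `perplexity` helper; the text does not say how the 8.8 figure was computed or on which evaluation set.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over the tokens."""
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/8.8 to every token has perplexity 8.8.
uniform_log_probs = [math.log(1 / 8.8)] * 10
print(perplexity(uniform_log_probs))
```

Intuitively, a perplexity of about 8.8 means the model is, on average, as uncertain as if it were choosing uniformly among roughly nine tokens at each step.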
MIT
Code version v.1.1.0. None of the parameters were changed; they are used as prepared by the organizers. Details:
- 1 x NVIDIA A100
- dtype auto
- PyTorch 2.1.2 + CUDA 12.1
- Transformers 4.36.2
- Context length 2048