About the project
Modern foundation language models are developing rapidly.
Foundation models such as ChatGPT, YandexGPT, GigaChat, and LLaMA require objective comparison and independent evaluation. International experience shows that models are evaluated on different benchmarks, in different experimental setups and scenarios. As a result, it is unclear what the models can genuinely accomplish, and their abilities cannot be assessed in a unified setup.
Openness and transparency of the evaluation process are key, because a proprietary model is otherwise assessed only inside the company that built it, according to that company's own standards, and each company will claim that its models are superior.
We propose a new methodology for evaluating foundation models:
21 challenging tasks for foundation models, covering world knowledge, logic, cause-and-effect relationships, AI ethics, and much more.
We have developed an open instruction-based benchmark for evaluating large language models in Russian. A unified leaderboard on the website is built on fixed, expert-verified tasks and standardized configurations of prompts and parameters.
The project is supported by the AI Alliance, leading industry players, and academic partners engaged in language model research.