Current problems in LLM evaluation:
There is no unified methodology or standard for independent, expert comparison of SOTA models.
Previous benchmarks (such as RussianSuperGLUE and TAPE) are becoming outdated: new models are trained on instruction data and operate across different modalities.
Each model creator evaluates their solution under their own local conditions, metrics, scenarios, and benchmarks, which leads to a lack of reproducibility of results.
What does this project offer?
A unified platform for evaluating and comparing models and reflecting their capabilities across domains, tasks, and modalities.
Tasks that are challenging not only for automatic systems but also for humans, with comparison against human performance.
A realistic view of the capabilities of AI technologies.
An information portal and a platform for research on large language models.