Leaderboard

The final score is the average of the task scores; public tasks are excluded from this average. If a task has multiple metrics, those metrics are also averaged.
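The scoring rule can be sketched in a few lines of Python. This is only an illustration, not the benchmark's actual implementation: the function name `final_score`, the task and metric names, and the numbers are all hypothetical. It also assumes that each task's metrics are averaged into a single task score first, and that every non-public task has equal weight.

```python
from statistics import mean


def final_score(task_metrics, public_tasks=frozenset()):
    """Average per-task scores over all non-public tasks.

    task_metrics: dict mapping task name -> dict of metric name -> value.
    public_tasks: names of tasks to leave out of the final score.
    Assumption: a task with several metrics is scored by their plain mean.
    """
    task_scores = {
        task: mean(list(metrics.values()))
        for task, metrics in task_metrics.items()
        if task not in public_tasks
    }
    if not task_scores:
        raise ValueError("no non-public tasks to score")
    return mean(list(task_scores.values()))


# Hypothetical results for one model (made-up names and values).
example = {
    "task_a": {"accuracy": 0.80},
    "task_b": {"f1": 0.60, "accuracy": 0.70},   # two metrics, averaged to 0.65
    "task_public": {"accuracy": 0.99},          # excluded from the final score
}

score = final_score(example, public_tasks={"task_public"})
print(round(score, 3))
```

In this example the public task does not affect the result: the final score is the mean of 0.80 and 0.65.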

The current version of the benchmark is a static model ranking. Support for testing user models and sending submissions will be added in the near future. Stay tuned!