
MathLogicQA

Type of task: Maths, Logic
Output format: Choosing an answer
Metric: Accuracy
Domains: Mathematics, Systems thinking, Statistics
Dataset size: dev: 680, test: 1143


Task Description

The task is to solve mathematical problems formulated in natural language.

Mathematical problems can be divided into several types:

  • forming and solving equations,
  • forming and solving systems of equations,
  • solving problems on proportions and comparison,
  • matching the objects described in the problem to the variables in the equation.
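
To make the first types concrete, here is a toy problem of the kind described (invented for illustration, not an actual dataset item), together with a short SymPy sketch that forms and solves the corresponding equation:

    # Toy problem: "Petya has 3 more apples than Vanya; together they
    # have 11 apples. How many apples does Vanya have?"
    # Forming the equation: v + (v + 3) = 11.
    from sympy import Eq, solve, symbols

    v = symbols("v")                       # v = number of Vanya's apples
    print(solve(Eq(v + (v + 3), 11), v))   # [4] -> Vanya has 4, Petya has 7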

Motivation

The goal of the task is to analyze the model's ability to solve mathematical problems using simple operations: addition, subtraction, multiplication, division, and comparison.

Dataset Description

Each dataset sample consists of the problem text and 4 answer options, only one of which is correct.

Data Fields

  • instruction — is a string containing instructions for the task and information about the requirements for the model output format. All of the prompts used are presented in the project repository;
  • inputs — is a dictionary containing input data for the model:
    • id — is an integer indicating the index of the example;
    • text — is a string containing the text of the problem;
    • option_a — is a string containing answer option A;
    • option_b — is a string containing answer option B;
    • option_c — is a string containing answer option C;
    • option_d — is a string containing answer option D;
  • outputs — is a string containing the letter of the correct answer;
  • meta — is a dictionary containing meta information:
    • id — is an integer indicating the index of the example;
    • task — is a string containing information about the task type: math includes solving systems of equations and comparing quantities; logimath includes matching the objects described in the problem with the variables in the equation and solving it.
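
For illustration, a hypothetical sample in this schema could look as follows (all values are invented, not taken from the dataset):

    # A made-up sample illustrating the field layout described above.
    sample = {
        "instruction": "Решите математическую задачу: {text}\n...",
        "inputs": {
            "id": 0,
            "text": "2*x + 3 = 11. Чему равен x?",  # "2*x + 3 = 11. What is x?"
            "option_a": "3",
            "option_b": "4",
            "option_c": "5",
            "option_d": "6",
        },
        "outputs": "B",                      # 2*x + 3 = 11  ->  x = 4
        "meta": {"id": 0, "task": "math"},
    }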

Prompts

Ten prompts of varying difficulty were created for this task. Example:

"Решите математичеcкую задачу: {text}\nA) {option_a}\nB) {option_b}\nC) {option_c}\nD) {option_d}\nВыберите один правильный ответ. В ответе укажите только букву правильного ответа.\nОтвет:"

Dataset Creation

The dataset includes two types of problems: logic and math.

logic

Logic problems are mathematical problems formulated in natural language. To solve such a problem, one must construct a system of equations (or a single equation) and solve it by matching the objects described in the problem to the variables in the equation. Problems of this type were drawn from open sources containing databases of mathematical problems.
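
As an illustration of this type, here is another invented problem that requires matching objects to variables and solving a two-equation system (again with SymPy):

    # Toy problem: "A notebook and a pen cost 110 rubles together, and the
    # notebook costs 100 rubles more than the pen. What does each cost?"
    # Matching objects to variables: n = notebook price, p = pen price.
    from sympy import Eq, solve, symbols

    n, p = symbols("n p")
    print(solve([Eq(n + p, 110), Eq(n - p, 100)], (n, p)))  # {n: 105, p: 5}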

math

Math problems consist of a mathematical expression (a linear equation or a system of linear equations) and a question about that expression. To answer the question, one must solve the equation or system; for some tasks a comparison operation is also required. The mathematical expressions are synthetic data generated with an open-source library, using its linear_1d and linear_2d modules. The generated expressions were manually rewritten by experts from mathematical notation into natural Russian. The experts then formulated, for each expression, a question in natural language and the correct answer.
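
The description does not name the generation library or show its API, so the following is only a plain-Python stand-in that produces comparable one-variable (linear_1d-style) expressions with integer roots:

    import random

    def generate_linear_1d(rng: random.Random) -> tuple[str, int]:
        """Generate a solvable equation a*x + b = c with an integer root.
        Illustrative stand-in, not the actual library's API."""
        a = rng.choice([i for i in range(-9, 10) if i != 0])
        x = rng.randint(-9, 9)      # the intended integer root
        b = rng.randint(-9, 9)
        return f"{a}*x + {b} = {a * x + b}", x

    print(generate_linear_1d(random.Random(0)))  # an (expression, root) pair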

When creating the dataset, experts added instructions in natural language to some tasks. The experts also formulated 3 incorrect answer options for each task from the dataset.

Validation

All examples in the dataset were validated on the Yandex.Toloka platform. Annotators checked that the problem statement and the answer were correct. The validation pool included 2000 examples of type math and 570 examples of type logic. Each example had an overlap of 3 annotators, increased to 5 when agreement on the answer fell below 70%. Responses from annotators whose accuracy on control tasks was below 50% were excluded.

After validation, the final test set included only examples on which the annotators agreed unanimously; the training set included the remaining examples with agreement above 60%.
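
A minimal sketch of this routing logic under the stated thresholds (a hypothetical reconstruction; the actual pipeline is not published in this description):

    from collections import Counter

    def agreement(labels: list[str]) -> float:
        """Fraction of annotators choosing the majority answer."""
        return Counter(labels).most_common(1)[0][1] / len(labels)

    def route_example(labels: list[str]) -> str:
        a = agreement(labels)
        if len(labels) == 3 and a < 0.70:
            return "escalate"   # collect 2 more annotations (overlap 3 -> 5)
        if a == 1.0:
            return "test"       # complete agreement -> test set
        if a > 0.60:
            return "train"      # remaining examples above 60% agreement
        return "discard"

    print(route_example(["A", "A", "B"]))  # escalate (agreement 2/3 < 0.70)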

Human Benchmark

The human-level score was measured on the test set via a Yandex.Toloka project with an overlap of 5 annotators per task. The resulting human accuracy is 0.99.
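
Since the metric is plain accuracy over the predicted answer letters, scoring reduces to an exact-match count; a small sketch:

    def accuracy(predictions: list[str], gold: list[str]) -> float:
        """Exact-match accuracy over answer letters ("A".."D")."""
        assert len(predictions) == len(gold)
        hits = sum(p.strip().upper() == g for p, g in zip(predictions, gold))
        return hits / len(gold)

    print(accuracy(["A", "b", "C"], ["A", "B", "D"]))  # 0.666...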
