ruModAr
Task Description
Modified Arithmetic is a mathematical task from BIG-bench. The task tests a model's ability to learn new operations from in-context examples and then compute results using those newly acquired skills. Each question in each subtask begins with a prompt and five examples of arithmetic expressions with their results; the sixth example is incomplete, and the model's task is to complete it correctly.
Keywords: arithmetic, free response, few-shot, mathematics
Motivation
Can large language models learn new skills and understand operations from a few examples? This task probes this question with a series of simple few-shot tasks, each of which involves computing a simple arithmetic function while correctly recognizing a pattern that is very similar to, yet subtly different from, the standard arithmetic operations common in training data.
Dataset Description
Each subtask (addition, subtraction, and multiplication, with and without adding +1 to the result) includes 1000 questions. The symbol -> is used instead of = because the latter already has a definite canonical meaning; the symbol -> can therefore mean either "=" or "+ 1 =". In the end, there are sets for 6 subtasks: addition_control, addition_plus_one, subtraction_control, subtraction_plus_one, multiplication_control, multiplication_plus_one. The arguments of the two-digit subtasks (multiplication_ prefix) are randomly generated from [0, 100), and the arguments of the three-digit subtasks (addition_ and subtraction_ prefixes) from [0, 1000).
Data fields
instruction — an instructional prompt specified for the current task;
inputs — five expressions for recognizing the pattern, with a sixth to be calculated by the model;
outputs — the target, i.e., the answer to the last expression;
meta — an additional information field:
    id — the id of the example from the dataset;
    task_type — the subtask type.
Data Instances
Below is an example from the subtask three_digit_addition_plus_one:
{
    "instruction": "В следующих строках символ -> представляет собой одну простую математическую операцию. Определи операцию и вычисли последний пример:\n{inputs}",
    "inputs": "102 + 435 -> 538\n860 + 270 -> 1131\n106 + 71 -> 178\n700 + 20 -> 721\n614 + 121 -> 736\n466 + 214 ->",
    "outputs": "681",
    "meta": {
        "id": 1,
        "task_type": "three_digit_addition_plus_one"
    }
}
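The Russian instruction above translates to: "In the following lines, the symbol -> represents one simple mathematical operation. Determine the operation and compute the last example:". A minimal sketch of how such an instance is turned into the final model prompt, assuming the instruction template is filled via plain string formatting (the instance dict below mirrors the example above):

```python
instance = {
    "instruction": "В следующих строках символ -> представляет собой одну простую "
                   "математическую операцию. Определи операцию и вычисли последний "
                   "пример:\n{inputs}",
    "inputs": "102 + 435 -> 538\n860 + 270 -> 1131\n106 + 71 -> 178\n"
              "700 + 20 -> 721\n614 + 121 -> 736\n466 + 214 ->",
    "outputs": "681",
}

# Substitute the few-shot expressions into the instruction template.
prompt = instance["instruction"].format(inputs=instance["inputs"])
```

The model's completion is then compared against the "outputs" field ("681" here, i.e., 466 + 214 + 1).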
Data Splits
The dataset consists of a public test set (train split, 6000 samples) with labeled examples and a closed test set (test split, 6000 samples) for model evaluation.
Prompts
Ten prompts of varying difficulty were created for this task. Example:
"Вычисли результат последнего выражения, определив математическую операцию, которая скрывается под символом \"->\". Запиши в качестве ответа только число без дополнительных слов и символов.\n{inputs}"
(English: "Compute the result of the last expression by determining the mathematical operation hidden behind the symbol \"->\". Write only the number as the answer, without additional words or symbols.")
Dataset creation
The public test set was taken from BIG-bench. The closed test set was generated from scratch following the original BIG-bench methodology.
Evaluation
Metrics
The task is evaluated using Exact Match (EM). For each example, a score of 1.0 is given if the target sequence exactly matches the predicted sequence, and 0.0 otherwise.
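The metric can be sketched as follows. Stripping surrounding whitespace before comparison is an assumption of this sketch, not something stated in the source.

```python
def exact_match(prediction: str, target: str) -> float:
    """Return 1.0 iff the prediction exactly matches the target
    (after stripping surrounding whitespace; an assumption of this sketch)."""
    return 1.0 if prediction.strip() == target.strip() else 0.0

def em_score(predictions: list[str], targets: list[str]) -> float:
    """Average Exact Match over a set of examples."""
    return sum(exact_match(p, t) for p, t in zip(predictions, targets)) / len(targets)
```

Note that under EM, a prediction like "681." or "answer: 681" scores 0.0, which is why the prompts ask the model to output only the number.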
Human Benchmark
The human benchmark is measured on a subset of size 1800 (300 samples per subtask, drawn from the test set while preserving the original target distribution). Evaluation was carried out in a single pool (all subtasks) with an overlap of 5 reviewers per task.
The final human score is 0.999.