BPS
Task Description
The balanced sequence is an algorithmic task from BIG-bench. The primary purpose of this task is to measure language models' ability to learn CS algorithmic concepts like stacks, recursion, or dynamic programming.
Each subtask contains a parentheses sequence. The model's goal is to correctly predict whether the sequence is balanced.
An input string is valid if:
- Open brackets must be closed by the same type of brackets.
- Open brackets must be closed in the correct order.
- Every close bracket has a corresponding open bracket of the same type.
Warning: This is a diagnostic dataset with an open test and is not used for general model evaluation on the benchmark.
Keywords: algorithms, numerical response, context length, parantheses, binary answer
Authors: Harsh Mehta, Behnam Neyshabur
Motivation
Algorithms are a way to extrapolate examples and are some of the most concise descriptions of a pattern. In that sense, the ability of language models to learn them is a prominent measure of intelligence.
Dataset Description
Data Fields
instruction
is a string containing instructions for the task and information about the requirements for the model output format;inputs
is an example of the parentheses sequence;outputs
is a string containing the correct answer: “1” if the parentheses sequence is valid, “0” otherwise;meta
is a dictionary containing meta information:id
is an integer indicating the index of the example.
Data Instances
Below is an example from the dataset:
{
"instruction": "Проверьте, сбалансирована ли входная последовательность скобок.\n\"{inputs}\"\nВыведите 1, если да и 0 в противном случае.",
"inputs": "} } ) [ } ] ) { [ { { ] ( ( ] ) ( ) [ {",
"outputs": "0",
"meta": {
"id": 242
}
}
Data Splits
The train consists of 250
examples, and the test set includes 1000
examples.
Prompts
10 prompts of varying difficulty were created for this task. Example:
""Проверьте входную последовательность скобок: \"{inputs}\" на сбалансированность. В случае положительного ответа выведите 1, иначе 0."."
.
Dataset Creation
The parentheses sequences of the length 2, 4, 8, 12, 20 were generated with the following distribution: {20: 0.336, 12: 0.26, 8: 0.24, 4: 0.14, 2: 0.024} for the train set and {20: 0.301, 12: 0.279, 8: 0.273, 4: 0.121, 2: 0.026} for the test set.
Evaluation
Metrics
The task is evaluated using Accuracy.
Human benchmark
The human benchmark is measured on a subset of size 100
(sampled with the same original distribution). The accuracy for this task is 1.0
.