BPS

Task Description

The balanced parentheses sequence (BPS) task is an algorithmic task from BIG-bench. Its primary purpose is to measure language models' ability to learn computer-science algorithmic concepts such as stacks, recursion, and dynamic programming.

Each subtask contains a parentheses sequence. The model's goal is to correctly predict whether the sequence is balanced.

An input string is valid if all of the following hold:

- Open brackets are closed by brackets of the same type.
- Open brackets are closed in the correct order.
- Every close bracket has a corresponding open bracket of the same type.
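
The three conditions above can be checked with a single pass over the sequence using a stack. The following is a minimal sketch, not the benchmark's reference implementation. It assumes three bracket types, `()`, `[]`, and `{}`, and that whitespace between brackets can be ignored. The function name `is_balanced` is chosen here for illustration.

```python
# Illustrative sketch: the bracket alphabet and whitespace handling are assumptions.
PAIRS = {")": "(", "]": "[", "}": "{"}  # maps each closing bracket to its opener
OPENERS = set(PAIRS.values())


def is_balanced(sequence: str) -> bool:
    """Return True if every bracket is closed by the same type, in the right order."""
    stack = []
    for ch in sequence:
        if ch.isspace():
            continue  # separators between brackets carry no meaning
        if ch in OPENERS:
            stack.append(ch)
        elif ch in PAIRS:
            # A closer needs a matching opener on top of the stack.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    # Any opener left on the stack was never closed.
    return not stack


print(is_balanced("( [ ] ) { }"))
print(is_balanced("( ] )"))
```

The stack holds the openers that are still waiting for a match, so the most recently opened bracket is always the first one that must be closed.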

Warning: This is a diagnostic dataset with an open test and is not used for general model evaluation on the benchmark.

Keywords: algorithms, numerical response, context length, parentheses, binary answer

Authors: Harsh Mehta, Behnam Neyshabur

Motivation

Algorithms are a way to extrapolate from examples and are among the most concise descriptions of a pattern. In that sense, the ability of language models to learn them is a prominent measure of intelligence.

Dataset Description

Data Fields

instruction is a string containing instructions for the task and information about the requirements for the model output format;
inputs is an example of the parentheses sequence;
outputs is a string containing the correct answer: “1” if the parentheses sequence is valid, “0” otherwise;
meta is a dictionary containing meta information:
- id is an integer indicating the index of the example.

Prompts

Ten prompts of varying difficulty were created for this task. Example:

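"Проверьте входную последовательность скобок: \"{inputs}\" на сбалансированность. В случае положительного ответа выведите 1, иначе 0."

English translation: "Check the input bracket sequence: \"{inputs}\" for balance. If the answer is positive, output 1, otherwise 0."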
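
To make the roles of the data fields and the `{inputs}` placeholder concrete, here is a small sketch of how a record might be turned into a model prompt and scored. The record below is a hypothetical example made up for illustration, not an entry from the dataset, and the helper names `build_prompt` and `is_correct` are assumptions rather than part of any benchmark tooling.

```python
# Prompt template as quoted in this section; {inputs} is filled per example.
PROMPT_TEMPLATE = (
    'Проверьте входную последовательность скобок: "{inputs}" на сбалансированность. '
    "В случае положительного ответа выведите 1, иначе 0."
)

# Hypothetical record following the documented fields (not taken from the dataset).
record = {
    "instruction": PROMPT_TEMPLATE,
    "inputs": "( [ ] ) { }",
    "outputs": "1",
    "meta": {"id": 0},
}


def build_prompt(rec: dict) -> str:
    """Substitute the sequence into the instruction's {inputs} placeholder."""
    return rec["instruction"].format(inputs=rec["inputs"])


def is_correct(rec: dict, model_answer: str) -> bool:
    """Compare the model's answer with the gold label, ignoring outer whitespace."""
    return model_answer.strip() == rec["outputs"]


print(build_prompt(record))
print(is_correct(record, " 1\n"))
```

Because `outputs` is a string, the comparison is done on strings rather than integers.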

Dataset Creation

Parentheses sequences of lengths 2, 4, 8, 12, and 20 were generated with the following length distribution: {20: 0.336, 12: 0.26, 8: 0.24, 4: 0.14, 2: 0.024} for the train set and {20: 0.301, 12: 0.279, 8: 0.273, 4: 0.121, 2: 0.026} for the test set.
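
As a rough illustration of what sampling under such a length distribution could look like, the sketch below draws a length with the stated train-set probabilities and fills it with random brackets. This is not the authors' generation procedure, which is not described here. The bracket alphabet, the space-separated output, and the function name `sample_sequence` are assumptions, and the sketch makes no attempt to control the ratio of balanced to unbalanced sequences.

```python
import random

# Length distributions as stated in this section (length -> probability).
TRAIN_LENGTH_DIST = {20: 0.336, 12: 0.26, 8: 0.24, 4: 0.14, 2: 0.024}
TEST_LENGTH_DIST = {20: 0.301, 12: 0.279, 8: 0.273, 4: 0.121, 2: 0.026}

BRACKETS = "()[]{}"  # assumed alphabet


def sample_sequence(length_dist: dict, rng: random.Random) -> str:
    """Draw a length from length_dist, then fill it with uniformly random brackets."""
    lengths = list(length_dist)
    weights = list(length_dist.values())
    length = rng.choices(lengths, weights=weights, k=1)[0]
    return " ".join(rng.choice(BRACKETS) for _ in range(length))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
for _ in range(3):
    print(sample_sequence(TRAIN_LENGTH_DIST, rng))
```

Uniformly random brackets are almost always unbalanced at the longer lengths, so a real generator would need a separate step to produce valid sequences.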

Human benchmark

The human benchmark is measured on a subset of 100 examples, sampled with the same distribution as the original set. Human accuracy on this task is 1.0.