BPS

Task Description

The balanced sequence is an algorithmic task from BIG-bench. The primary purpose of this task is to measure language models' ability to learn CS algorithmic concepts like stacks, recursion, or dynamic programming.

Each subtask contains a parentheses sequence. The model's goal is to correctly predict whether the sequence is balanced.

An input string is valid if:

  1. Open brackets must be closed by the same type of brackets.
  2. Open brackets must be closed in the correct order.
  3. Every close bracket has a corresponding open bracket of the same type.

Warning: This is a diagnostic dataset with an open test and is not used for general model evaluation on the benchmark.

Keywords: algorithms, numerical response, context length, parantheses, binary answer

Authors: Harsh Mehta, Behnam Neyshabur

Motivation

Algorithms are a way to extrapolate examples and are some of the most concise descriptions of a pattern. In that sense, the ability of language models to learn them is a prominent measure of intelligence.

Dataset Description

Data Fields

  • instruction is a string containing instructions for the task and information about the requirements for the model output format;
  • inputs is an example of the parentheses sequence;
  • outputs is a string containing the correct answer: “1” if the parentheses sequence is valid, “0” otherwise;
  • meta is a dictionary containing meta information:
    • id is an integer indicating the index of the example.

Data Instances

Below is an example from the dataset:

{
    "instruction": "Проверьте, сбалансирована ли входная последовательность скобок.\n\"{inputs}\"\nВыведите 1, если да и 0 в противном случае.",
    "inputs": "} } ) [ } ] ) { [ { { ] ( ( ] ) ( ) [ {",
    "outputs": "0",
    "meta": {
        "id": 242
    }
}

Data Splits

The train consists of 250 examples, and the test set includes 1000 examples.

Prompts

10 prompts of varying difficulty were created for this task. Example:

""Проверьте входную последовательность скобок: \"{inputs}\" на сбалансированность. В случае положительного ответа выведите 1, иначе 0.".".

Dataset Creation

The parentheses sequences of the length 2, 4, 8, 12, 20 were generated with the following distribution: {20: 0.336, 12: 0.26, 8: 0.24, 4: 0.14, 2: 0.024} for the train set and {20: 0.301, 12: 0.279, 8: 0.273, 4: 0.121, 2: 0.026} for the test set.

Evaluation

Metrics

The task is evaluated using Accuracy.

Human benchmark

The human benchmark is measured on a subset of size 100 (sampled with the same original distribution). The accuracy for this task is 1.0 .