MultiQ

Task Description

MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks. The dataset is based on the dataset of the same name from the TAPE benchmark [1].

Keywords: Multi-hop QA, World Knowledge, Logic, Question-Answering

Authors: Ekaterina Taktasheva, Tatiana Shavrina, Alena Fenogenova, Denis Shevelev, Nadezhda Katricheva, Maria Tikhonova, Albina Akhmetgareeva, Oleg Zinkevich, Anastasiia Bashmakova, Svetlana Iordanskaia, Alena Spiridonova, Valentina Kurenshchikova, Ekaterina Artemova, Vladislav Mikhailov

Motivation

Question answering is an essential task in natural language processing and information retrieval. However, certain areas of QA remain challenging for modern approaches, including multi-hop QA, which is traditionally considered an intersection of graph methods, knowledge representation, and state-of-the-art language modeling.

Dataset Description

Data Fields

  • meta is a dictionary containing meta-information about the example:
    • id is the task ID;
    • bridge_answers is a list of spans for the bridge entities that link the two texts and are needed to derive the final answer given in the outputs field;
  • instruction is an instructional prompt specified for the current task;
  • inputs is a dictionary containing the following information:
    • text is the main text;
    • support_text is the supporting text;
    • question is the question whose answer is contained in the two texts;
  • outputs is the answer information:
    • label is the answer label;
    • length is the answer length;
    • offset is the answer start index;
    • segment is a string containing the answer.
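
A minimal sketch of reading one record and accessing these fields (assuming the split is distributed as JSON Lines; the file name multiq_train.jsonl is hypothetical):

import json

# Read the first record of the (hypothetical) training file.
with open("multiq_train.jsonl", encoding="utf-8") as f:
    sample = json.loads(f.readline())

question = sample["inputs"]["question"]          # the multi-hop question
main_text = sample["inputs"]["text"]             # main passage
support_text = sample["inputs"]["support_text"]  # supporting passage
answers = sample["outputs"]                      # list of gold answer spans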

Data Instances

Each dataset sample consists of two texts (the main one and the supporting one) and a question that requires both texts to answer. Below is an example from the dataset, in which the question asks into which river flows the river whose tributary is the Visvozh; the answer is «Айювы» (the Ayuva), and «Кыбантывис» (Kybantyvis) serves as the bridge entity:

{
    "instruction": "Прочитайте два текста и ответьте на вопрос.\\\\nТекст 1: {support_text}\\\\nТекст 2: {text}\\\\nВопрос: {question}\\\\nОтвет:",
    "inputs": {
        "question": "В какую реку впадает река, притоком которой является Висвож?",
        "support_text": "Висвож — река в России, протекает по Республике Коми. Устье реки находится в 6 км по левому берегу реки Кыбантывис. Длина реки составляет 24 км.",
        "text": "Кыбантывис (Кабан-Тывис) — река в России, протекает по Республике Коми. Левый приток Айювы. Длина реки составляет 31 км. Система водного объекта: Айюва → Ижма → Печора → Баренцево море."
    },
    "outputs": [{
        "label": "answer",
        "length": 5,
        "offset": 85,
        "segment": "Айювы"
    }],
    "meta": {
        "id": 9,
        "bridge_answers": [{
            "label": "passage",
            "length": 10,
            "offset": 104,
            "segment": "Кыбантывис"
        }]
    }
}
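
In this record, offset and length are character indices into the passage that contains the span: the answer span indexes inputs.text, while the bridge span indexes inputs.support_text. A small helper illustrating the convention (a sketch; extract_span is not part of the dataset tooling):

def extract_span(passage: str, span: dict) -> str:
    # Recover the span string from a passage using its offset and length.
    return passage[span["offset"] : span["offset"] + span["length"]]

# For the record above:
# extract_span(inputs["text"], outputs[0])                   -> "Айювы"
# extract_span(inputs["support_text"], bridge_answers[0])    -> "Кыбантывис"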

Data Splits

The dataset consists of 1056 training examples (train set) and 900 test examples (test set).

Prompts

We prepared five diverse prompts of varying difficulty for this task. An example prompt is given below:

"Прочитайте два текста и ответьте на вопрос.\\\\nТекст 1: {support_text}\\\\nТекст 2: {text}\\\\nВопрос: {question}\\\\nОтвет:".

Dataset Creation

The dataset is based on the corresponding dataset from the TAPE benchmark [1]; the examples were originally sampled from Wikipedia and Wikidata. The full data collection pipeline is described in [1].

Evaluation

Metrics

To evaluate models on this dataset, two metrics are used: F1-score and Exact Match (EM).
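
A sketch of the two metrics in their common SQuAD-style form; the benchmark's exact string normalization may differ (this version only lower-cases, strips, and splits on whitespace):

from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    # 1.0 if the normalized strings are identical, else 0.0.
    return float(pred.strip().lower() == gold.strip().lower())

def f1_score(pred: str, gold: str) -> float:
    # Token-level F1: harmonic mean of precision and recall over the
    # multiset of whitespace tokens shared by prediction and gold answer.
    pred_toks = pred.lower().split()
    gold_toks = gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)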

Human Benchmark

The human benchmark results are 0.928 F1-score and 0.91 EM.

References

[1] Taktasheva, Ekaterina, et al. "TAPE: Assessing Few-shot Russian Language Understanding." Findings of the Association for Computational Linguistics: EMNLP 2022. 2022.