Task Description

MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks. The dataset is based on the dataset of the same name from the TAPE benchmark [1].

Keywords: Multi-hop QA, World Knowledge, Logic, Question-Answering

Authors: Ekaterina Taktasheva, Tatiana Shavrina, Alena Fenogenova, Denis Shevelev, Nadezhda Katricheva, Maria Tikhonova, Albina Akhmetgareeva, Oleg Zinkevich, Anastasiia Bashmakova, Svetlana Iordanskaia, Alena Spiridonova, Valentina Kurenshchikova, Ekaterina Artemova, Vladislav Mikhailov


Question answering has long been an essential task in natural language processing and information retrieval. However, some areas of QA remain challenging for modern approaches. One of them is multi-hop QA, which is traditionally considered to lie at the intersection of graph methods, knowledge representation, and state-of-the-art language modeling.

Dataset Description

Data Fields

  • meta is a dictionary containing meta-information about the example:
    • id is the task ID;
    • bridge_answers is a list of the intermediate entities needed to arrive at the answer contained in the outputs field using the two available texts;
  • instruction is an instructional prompt specified for the current task;
  • inputs is a dictionary containing the following information:
    • text is the main text;
    • support_text is the supporting (additional) text;
    • question is the question, the answer to which is contained in these texts;
  • outputs is a list with the answer information:
    • label is the answer label;
    • length is the answer length in characters;
    • offset is the character index at which the answer starts;
    • segment is a string containing the answer.
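The way these fields fit together can be sketched in Python. This is a minimal illustration, not the benchmark's own code: the helper names render_prompt and extract_span are hypothetical, and the record is copied from the sample in the Data Instances section, with the escaped newline sequence read as an ordinary "\n". It assumes that offset and length count characters, and that the answer span indexes into text while the bridge span indexes into support_text. Both assumptions hold for this sample but are not stated as a general rule in the dataset description.

```python
def render_prompt(instruction: str, inputs: dict) -> str:
    """Fill the {support_text}, {text} and {question} placeholders."""
    return instruction.format(**inputs)


def extract_span(source: str, offset: int, length: int) -> str:
    """Return the substring of `source` that an (offset, length) pair points to."""
    return source[offset:offset + length]


# Values copied from the sample record in the Data Instances section.
record = {
    "instruction": (
        "Прочитайте два текста и ответьте на вопрос.\n"
        "Текст 1: {support_text}\nТекст 2: {text}\n"
        "Вопрос: {question}\nОтвет:"
    ),
    "inputs": {
        "question": "В какую реку впадает река, притоком которой является Висвож?",
        "support_text": (
            "Висвож — река в России, протекает по Республике Коми. "
            "Устье реки находится в 6 км по левому берегу реки Кыбантывис. "
            "Длина реки составляет 24 км."
        ),
        "text": (
            "Кыбантывис (Кабан-Тывис) — река в России, протекает по Республике Коми. "
            "Левый приток Айювы. Длина реки составляет 31 км. "
            "Система водного объекта: Айюва → Ижма → Печора → Баренцево море."
        ),
    },
    "outputs": [
        {"label": "answer", "length": 5, "offset": 85, "segment": "Айювы"}
    ],
    "meta": {
        "id": 9,
        "bridge_answers": [
            {"label": "passage", "length": 10, "offset": 104, "segment": "Кыбантывис"}
        ],
    },
}

# Build the model input from the instruction template and the inputs dictionary.
prompt = render_prompt(record["instruction"], record["inputs"])

answer = record["outputs"][0]
bridge = record["meta"]["bridge_answers"][0]

# Assumption (true for this sample): the answer span points into `text`,
# the bridge span points into `support_text`.
answer_segment = extract_span(record["inputs"]["text"], answer["offset"], answer["length"])
bridge_segment = extract_span(record["inputs"]["support_text"], bridge["offset"], bridge["length"])

print(answer_segment == answer["segment"], bridge_segment == bridge["segment"])
```

Note that str.format treats any brace in the instruction as a placeholder, so a template containing literal braces would need them doubled; the field values themselves are inserted verbatim.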

Data Instances

Each dataset sample consists of two texts (the main one and the supporting one) and a question based on both. Below is an example from the dataset. The question asks which river is fed by the river that has Visvozh as a tributary; the supporting text identifies that intermediate river as Kybantyvis (the bridge answer), and the main text gives the final answer, "Айювы" (the Ayuva river, in the genitive case).

    {
        "instruction": "Прочитайте два текста и ответьте на вопрос.\nТекст 1: {support_text}\nТекст 2: {text}\nВопрос: {question}\nОтвет:",
        "inputs": {
            "question": "В какую реку впадает река, притоком которой является Висвож?",
            "support_text": "Висвож — река в России, протекает по Республике Коми. Устье реки находится в 6 км по левому берегу реки Кыбантывис. Длина реки составляет 24 км.",
            "text": "Кыбантывис (Кабан-Тывис) — река в России, протекает по Республике Коми. Левый приток Айювы. Длина реки составляет 31 км. Система водного объекта: Айюва → Ижма → Печора → Баренцево море."
        },
        "outputs": [{
            "label": "answer",
            "length": 5,
            "offset": 85,
            "segment": "Айювы"
        }],
        "meta": {
            "id": 9,
            "bridge_answers": [{
                "label": "passage",
                "length": 10,
                "offset": 104,
                "segment": "Кыбантывис"
            }]
        }
    }

Data Splits

The dataset consists of 1056 training examples (train set) and 900 test examples (test set).


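We prepared five prompts of varying difficulty for this task. An example prompt is given below:

"Прочитайте два текста и ответьте на вопрос.\nТекст 1: {support_text}\nТекст 2: {text}\nВопрос: {question}\nОтвет:".

In English, this prompt reads: "Read the two texts and answer the question.\nText 1: {support_text}\nText 2: {text}\nQuestion: {question}\nAnswer:".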

Dataset Creation

The dataset is derived from the corresponding dataset in the TAPE benchmark [1], which was originally sampled from Wikipedia and Wikidata. The full data collection pipeline can be found here.



Models are evaluated on this dataset with two metrics: F1-score and Exact Match (EM).
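For readers unfamiliar with these metrics, the sketch below shows one common way to compute them for extractive QA, in the style popularized by SQuAD: EM checks whether the normalized prediction equals the normalized reference, and F1 is the harmonic mean of token-level precision and recall. The normalization steps (lowercasing, removing ASCII punctuation, collapsing whitespace) and the function names are assumptions made for illustration; the benchmark's official scoring code may normalize differently.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed scheme)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# A prediction with one extra token: perfect recall, precision 0.5.
print(exact_match("река Айювы", "Айювы"), f1_score("река Айювы", "Айювы"))
```

Dataset-level scores would then be the mean of the per-example values. Because the sketch compares surface tokens, a prediction in a different grammatical case from the reference (for example "Айюва" instead of "Айювы") scores zero on both metrics.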

Human Benchmark

The F1-score / EM results are 0.928 / 0.91, respectively.


[1] Taktasheva, Ekaterina, et al. "TAPE: Assessing Few-shot Russian Language Understanding." Findings of the Association for Computational Linguistics: EMNLP 2022. 2022.