MultiQ

Type of task: Reasoning
Output format: Open question
Metric: F1, Exact match
Domains: Geography, History, Sports
Statistics: dev: 1056, test: 900

Task Description

MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks. The dataset is based on the dataset of the same name from the TAPE benchmark [1].

Keywords: Multi-hop QA, World Knowledge, Logic, Question-Answering

Authors: Ekaterina Taktasheva, Tatiana Shavrina, Alena Fenogenova, Denis Shevelev, Nadezhda Katricheva, Maria Tikhonova, Albina Akhmetgareeva, Oleg Zinkevich, Anastasiia Bashmakova, Svetlana Iordanskaia, Alena Spiridonova, Valentina Kurenshchikova, Ekaterina Artemova, Vladislav Mikhailov

Motivation

Question answering has long been an essential task in natural language processing and information retrieval. However, some areas of QA remain challenging for modern approaches. One of them is multi-hop QA, which is traditionally considered to lie at the intersection of graph methods, knowledge representation, and state-of-the-art language modeling.

Dataset Description

Data Fields

  • meta is a dictionary containing meta-information about the example:
    • id is the task ID;
    • bridge_answer is a list of intermediate entities needed to get from the question to the answer (given in the outputs field) using the two available texts;
  • instruction is an instructional prompt specified for the current task;
  • inputs is a dictionary containing the following information:
    • text is the main text passage;
    • support_text is an additional, supporting text passage;
    • question is the question whose answer is contained in these texts;
  • outputs is a string containing the answer.
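
The field layout above can be sketched as Python type definitions. This is only an illustration: the class names, the `has_expected_fields` helper, the integer type of `id`, and every value in the sample record are hypothetical and are not taken from the dataset.

```python
from typing import List, TypedDict


class Meta(TypedDict):
    id: int  # assumption: the task ID is an integer
    bridge_answer: List[str]


class Inputs(TypedDict):
    text: str
    support_text: str
    question: str


class MultiQExample(TypedDict):
    meta: Meta
    instruction: str
    inputs: Inputs
    outputs: str


# Hypothetical record: all values are invented for illustration only.
example: MultiQExample = {
    "meta": {"id": 0, "bridge_answer": ["Entity B"]},
    "instruction": "Text 1: {support_text}\nText 2: {text}\nQuestion: {question}\nAnswer:",
    "inputs": {
        "text": "Entity B is located in Country C.",
        "support_text": "Entity A was founded in Entity B.",
        "question": "In which country was Entity A founded?",
    },
    "outputs": "Country C",
}


def has_expected_fields(record: dict) -> bool:
    """Check that a record exposes exactly the fields listed above."""
    return (
        set(record) == {"meta", "instruction", "inputs", "outputs"}
        and set(record["meta"]) == {"id", "bridge_answer"}
        and set(record["inputs"]) == {"text", "support_text", "question"}
        and isinstance(record["outputs"], str)
    )
```

In the invented record, the bridge entity ("Entity B") links the supporting text to the main text, which is the two-hop pattern the task is built around.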

Prompts

We prepared 10 different prompts of varying difficulty for this task. An example prompt is given below:

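"Текст 1: {support_text}\nТекст 2: {text}\nОпираясь на данные тексты, ответьте на вопрос: {question}\nЗапишите только ответ без дополнительных объяснений.\nОтвет:"

English translation: "Text 1: {support_text}\nText 2: {text}\nBased on these texts, answer the question: {question}\nWrite only the answer without additional explanations.\nAnswer:"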
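
As a minimal sketch of how such a template could be filled from the inputs field, the snippet below uses Python's built-in `str.format`. The `build_prompt` helper and the sample texts are hypothetical. How the benchmark itself substitutes the placeholders is not stated here, so this is an assumption.

```python
# The Russian prompt template quoted above, with "\n" as real newlines.
PROMPT = (
    "Текст 1: {support_text}\n"
    "Текст 2: {text}\n"
    "Опираясь на данные тексты, ответьте на вопрос: {question}\n"
    "Запишите только ответ без дополнительных объяснений.\n"
    "Ответ:"
)


def build_prompt(instruction: str, inputs: dict) -> str:
    """Fill the {text}, {support_text} and {question} placeholders."""
    return instruction.format(**inputs)


# Hypothetical inputs, invented for illustration only.
sample_inputs = {
    "text": "Entity B is located in Country C.",
    "support_text": "Entity A was founded in Entity B.",
    "question": "In which country was Entity A founded?",
}

prompt = build_prompt(PROMPT, sample_inputs)
print(prompt)
```

Note that in this template the supporting text comes first ("Текст 1") and the main text second ("Текст 2").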

Dataset Creation

The dataset was derived from the corresponding dataset in the TAPE benchmark [1], which was originally sampled from Wikipedia and Wikidata. The whole pipeline of the data collection can be found here.

Human Benchmark

Human performance on this task is 0.928 F1 and 0.91 exact match (EM).
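
For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them for short open answers: a token-level F1 and an exact match after light normalization, in the style of SQuAD evaluation. The normalization rules (lowercasing, dropping ASCII punctuation, whitespace tokenization) and the function names are assumptions. The benchmark's actual scoring code may differ.

```python
import string
from collections import Counter


def normalize(text: str) -> list:
    """Lowercase, drop ASCII punctuation, and split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return text.split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = normalize(prediction)
    ref_tokens = normalize(reference)
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a prediction with one extra word, such as "город Москва" against the reference "Москва", gets partial F1 credit but an exact match of zero.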

References

[1] Taktasheva, Ekaterina, et al. "TAPE: Assessing Few-shot Russian Language Understanding." Findings of the Association for Computational Linguistics: EMNLP 2022. 2022.
