MultiQ
Task Description
MultiQ is a multi-hop QA dataset for Russian, suitable for general open-domain question answering, information retrieval, and reading comprehension tasks. The dataset is based on the dataset of the same name from the TAPE benchmark [1].
Keywords: Multi-hop QA, World Knowledge, Logic, Question-Answering
Authors: Ekaterina Taktasheva, Tatiana Shavrina, Alena Fenogenova, Denis Shevelev, Nadezhda Katricheva, Maria Tikhonova, Albina Akhmetgareeva, Oleg Zinkevich, Anastasiia Bashmakova, Svetlana Iordanskaia, Alena Spiridonova, Valentina Kurenshchikova, Ekaterina Artemova, Vladislav Mikhailov
Motivation
Question answering has long been an essential task in natural language processing and information retrieval. However, some areas of QA remain challenging for modern approaches. One of them is multi-hop QA, which is traditionally considered to lie at the intersection of graph methods, knowledge representation, and state-of-the-art language modeling.
Dataset Description
Data Fields
meta is a dictionary containing meta-information about the example:
    id is the task ID;
    bridge_answer is a list of entities necessary to answer the question contained in the outputs field using two available texts;
instruction is an instructional prompt specified for the current task;
inputs is a dictionary containing the following information:
    text is the main text line;
    support_text is a line with additional text;
    question is the question, the answer to which is contained in these texts;
outputs is a string containing the answer.
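The field layout above can be sketched as a small Python check. This is only an illustration: the function name `validate_record` and all values in `example_record` are hypothetical placeholders, not taken from the dataset, and the type of `id` is not stated in the description, so it is only checked for presence.

```python
from typing import Any


def validate_record(record: dict[str, Any]) -> bool:
    """Return True if the record has the fields and types described above."""
    meta = record.get("meta")
    inputs = record.get("inputs")
    if not isinstance(meta, dict) or not isinstance(inputs, dict):
        return False
    # meta: task ID plus a list of bridge entities
    if "id" not in meta or not isinstance(meta.get("bridge_answer"), list):
        return False
    # inputs: two texts and a question, all strings
    required_inputs = ("text", "support_text", "question")
    if not all(isinstance(inputs.get(key), str) for key in required_inputs):
        return False
    # instruction and outputs are plain strings
    return isinstance(record.get("instruction"), str) and isinstance(
        record.get("outputs"), str
    )


# Hypothetical placeholder record; the values are NOT real dataset content.
example_record = {
    "meta": {"id": 0, "bridge_answer": ["<bridge entity>"]},
    "instruction": "<prompt template with {text}, {support_text}, {question}>",
    "inputs": {
        "text": "<main text>",
        "support_text": "<additional text>",
        "question": "<question>",
    },
    "outputs": "<answer>",
}

print(validate_record(example_record))
```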
Prompts
We prepared 10 different prompts of varying difficulty for this task. One example prompt is given below:
"Текст 1: {support_text}\nТекст 2: {text}\nОпираясь на данные тексты, ответьте на вопрос: {question}\nЗапишите только ответ без дополнительных объяснений.\nОтвет:"
Dataset Creation
The dataset was built from the corresponding dataset in the TAPE benchmark [1], which was originally sampled from Wikipedia and Wikidata. The whole data collection pipeline can be found here.
Human Benchmark
The F1-score / EM results are 0.928 / 0.91, respectively.
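For readers unfamiliar with these metrics, the sketch below shows one common way to compute exact match and token-level F1 for a single prediction, in the style of extractive QA evaluation. It is an assumption, not the benchmark's evaluation code: the normalization steps (lowercasing, removing ASCII punctuation, collapsing whitespace) and the function names are illustrative, and the official scoring may differ.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Москва.", "москва"))
print(f1_score("город Москва", "Москва"))
```

Dataset-level scores would then be the average of these per-example values over all examples.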
References
[1] Taktasheva, Ekaterina, et al. "TAPE: Assessing Few-shot Russian Language Understanding." Findings of the Association for Computational Linguistics: EMNLP 2022. 2022.