PARus
Task Description
The Choice of Plausible Alternatives for the Russian language (PARus) evaluation provides researchers with a tool for assessing progress in open-domain commonsense causal reasoning.
Each question in PARus is composed of a premise and two alternatives, and the task is to select the alternative that has the more plausible causal relation with the premise. The position of the correct alternative is randomized, so the expected performance of random guessing is 50%. The dataset was first proposed as part of Russian SuperGLUE [3] and is an analog of the English COPA [1] dataset: it was constructed as a translation of COPA from SuperGLUE [2] and edited by professional editors. The data split from COPA is retained.
Keywords: reasoning, commonsense, causality, commonsense causal reasoning
Authors: Shavrina Tatiana, Fenogenova Alena, Emelyanov Anton, Shevelev Denis, Artemova Ekaterina, Malykh Valentin, Mikhailov Vladislav, Tikhonova Maria, Evlampiev Andrey
Motivation
The dataset tests a model's ability to identify cause-and-effect relations in a text and to draw conclusions from them. The dataset was first presented on the Russian SuperGLUE leaderboard, and it is one of the tasks for which a significant gap between model and human scores still remains.
Dataset Description
Data Fields
Each dataset sample represents a premise and two options for continuing the situation, depending on the task tag: cause or effect.

- instruction is a prompt specified for the task, selected from different pools for cause and for effect;
- inputs is a dictionary containing the following input information:
  - premise is a text situation;
  - choice1 is the first option;
  - choice2 is the second option;
- outputs is a string value: "1" or "2";
- meta is meta-information about the task:
  - task is the task class: cause or effect;
  - id is the id of the example from the dataset.
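For reference, this structure can be written down as a typed schema. Below is a minimal Python sketch; the class names (Sample, Inputs, Meta) are illustrative and are not part of any official package.

from typing import TypedDict

class Inputs(TypedDict):
    premise: str   # the text situation
    choice1: str   # the first continuation option
    choice2: str   # the second continuation option

class Meta(TypedDict):
    task: str      # task class: "cause" or "effect"
    id: int        # id of the example within the dataset

class Sample(TypedDict):
    instruction: str  # prompt template with {premise}, {choice1}, {choice2} placeholders
    inputs: Inputs
    outputs: str      # the correct answer: "1" or "2"
    meta: Meta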
Data Instances
Below is an example from the dataset:
{
"instruction": "Дано описание ситуации: \"{premise}\" и два возможных продолжения текста: 1. {choice1} 2. {choice2} Определи, какой из двух фрагментов является причиной описанной ситуации? Выведи одну цифру правильного ответа.",
"inputs": {
"premise": "Моё тело отбрасывает тень на траву.",
"choice1": "Солнце уже поднялось.",
"choice2": "Трава уже подстрижена."
},
"outputs": "1",
"meta": {
"task": "cause",
"id": 0
}
}
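Because the instruction field uses brace-style placeholders ({premise}, {choice1}, {choice2}), the final prompt for a model can be rendered by substituting the inputs into the template, e.g. with Python's built-in str.format. A minimal sketch using the example above:

# Fill the instruction template with the fields from `inputs`
# to obtain the final prompt for the example above.
sample = {
    "instruction": (
        'Дано описание ситуации: "{premise}" и два возможных продолжения '
        "текста: 1. {choice1} 2. {choice2} Определи, какой из двух фрагментов "
        "является причиной описанной ситуации? Выведи одну цифру правильного "
        "ответа."
    ),
    "inputs": {
        "premise": "Моё тело отбрасывает тень на траву.",
        "choice1": "Солнце уже поднялось.",
        "choice2": "Трава уже подстрижена.",
    },
    "outputs": "1",
}

# str.format substitutes {premise}, {choice1}, {choice2} in place.
prompt = sample["instruction"].format(**sample["inputs"])
print(prompt)  # the gold answer for this example is sample["outputs"], i.e. "1"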
Data Splits
The dataset consists of 400 train samples, 100 dev samples, and 500 private test samples. The number of sentences in the whole set is 1000. The number of tokens is 5.4 · 10^3.
Prompts
We prepared 10 different prompts of varying difficulty for this task. Prompts are provided separately for the cause and for the effect tasks, e.g.:

For cause: "Дана текстовая ситуация: \"{premise}\" и два текста продолжения: 1) {choice1} 2) {choice2} Определи, какой из двух фрагментов является причиной описанной ситуации? В качестве ответа выведи одну цифру 1 или 2."

For effect: "Дано описание ситуации: \"{premise}\" и два фрагмента текста: 1) {choice1} 2) {choice2} Определи, какой из двух фрагментов является следствием описанной ситуации? В качестве ответа выведи цифру 1 (первый текст) или 2 (второй текст)."
Dataset Creation
The dataset was taken from the Russian SuperGLUE benchmark and converted into the instruction format. All examples in the original Russian SuperGLUE set were collected from open news sources and literary magazines, then manually cross-checked and supplemented via human assessment on Yandex.Toloka.
Please note: the PArsed RUssian Sentences dataset is a different dataset and is not part of Russian SuperGLUE.
Evaluation
Metrics
The metric for this task is Accuracy.
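Since both model outputs and gold answers are the strings "1" or "2", Accuracy reduces to an exact match between predictions and references. A minimal Python sketch (extracting the digit from free-form model output is assumed to happen upstream):

def accuracy(predictions: list[str], references: list[str]) -> float:
    # Share of examples where the predicted label exactly matches the gold one.
    assert len(predictions) == len(references)
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

# Example: 3 of 4 predictions match the references -> 0.75
print(accuracy(["1", "2", "2", "1"], ["1", "2", "1", "1"]))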
Human Benchmark
The human-level score was measured on the test set via a Yandex.Toloka project with an overlap of 3 annotators per task. The resulting Accuracy score is 0.982.
References
[1] Roemmele, M., Bejan, C., and Gordon, A. (2011). Choice of Plausible Alternatives: An Evaluation of Commonsense Causal Reasoning. AAAI Spring Symposium on Logical Formalizations of Commonsense Reasoning, Stanford University, March 21–23, 2011.
[2] Wang, A. et al. (2019). SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. Advances in Neural Information Processing Systems, pp. 3261–3275.
[3] Shavrina, T., Fenogenova, A., Emelyanov, A., Shevelev, D., Artemova, E., Malykh, V., Mikhailov, V., Tikhonova, M., Chertok, A., and Evlampiev, A. (2020). RussianSuperGLUE: A Russian Language Understanding Evaluation Benchmark. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4717–4726, Online. Association for Computational Linguistics.