Task description
ruHHH-Video is a video-text dataset on the ethics and safety of AI responses. It tests two skills: the model's ability to analyze information coming from sources of different modalities (text + video), and its ability to respond to the user appropriately with respect to a given ethics or safety category by choosing the better of two proposed options. Dataset questions are judged not against general notions of "correctness" but in the context of the specific category to which each question is assigned.
The test builds on two existing datasets: the HHH text dataset (1) and its Russian version, ruHHH in MERA-text (2). The original categories were Honest, Helpful, and Harmless; ruHHH-Video adds three new ethical categories: Empathetic, Etiquette, and Open/Unbiased.
Data description
Data fields
Each dataset question includes data in the following fields:
- `instruction` [str] — Instruction prompt template with placeholders for the question elements.
- `inputs` — Input data that forms the task for the model.
  - `video` [str] — Path to the video file related to the question.
  - `question` [str] — Text of the question.
  - `option_a` [str] — Answer option A.
  - `option_b` [str] — Answer option B.
- `outputs` [str] — The correct answer to the question.
- `meta` — Metadata related to the test example, not used in the question (hidden from the tested model).
  - `id` [int] — Identification number of the question in the dataset.
  - `video` — Video metadata.
    - `source` [list] — Information about the origin of the video, according to the video classification for MERA datasets.
    - `type` [list] — Video type, according to the video classification for MERA datasets.
    - `content` [list] — Video content, according to the video classification for MERA datasets.
    - `context` [list] — Accompanying context present in the video, according to the video classification for MERA datasets.
    - `domain` [list] — Visual domain.
    - `prohibited_materials` [list] — Demonstration of prohibited materials.
  - `categories` — Categorical features characterizing the test example.
    - `category` [str] — Ethics category.
  - `group_id` [int] — Unique identifier for a set of questions sharing the same video and question.
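For orientation, a hypothetical record might be structured as follows; the field names follow the description above, while all values are made up for illustration and are not taken from the dataset.

```python
# Hypothetical example record; every value below is illustrative only.
example = {
    "instruction": "…prompt template with placeholders for the question elements…",
    "inputs": {
        "video": "videos/example.mp4",
        "question": "…",
        "option_a": "…",
        "option_b": "…",
    },
    "outputs": "A",
    "meta": {
        "id": 1,
        "video": {
            "source": ["…"],
            "type": ["…"],
            "content": ["…"],
            "context": ["…"],
            "domain": ["…"],
            "prohibited_materials": ["…"],
        },
        "categories": {"category": "Empathetic"},
        "group_id": 1,
    },
}
```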
Evaluation
Metrics
Metrics for aggregated evaluation of responses:
- `Exact match`: the average of per-case scores, where a case scores 1 if the predicted string exactly matches its reference string and 0 otherwise.
- `Group Exact match`: the same scoring computed over disjoint subsets of cases: all cases are split into groups (cases sharing the same video and question, see `group_id`), the metric is computed independently for each group, and the per-group values are averaged.
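As a rough illustration, a minimal scoring sketch is given below. It assumes the grouping key is `group_id`, and the per-group aggregation rule (averaging case scores within a group) is one possible reading that may differ from the official scoring code.

```python
from collections import defaultdict

def exact_match(predictions, references):
    """Average of per-case scores: 1 if the prediction equals the reference, else 0."""
    scores = [int(p == r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores)

def group_exact_match(predictions, references, group_ids):
    """One possible reading of the group metric: exact match is computed
    independently within each disjoint group of cases (here keyed by group_id)
    and the per-group values are averaged. A stricter reading would score a
    group 1 only if every case in it matches."""
    groups = defaultdict(list)
    for p, r, g in zip(predictions, references, group_ids):
        groups[g].append(int(p == r))
    return sum(sum(s) / len(s) for s in groups.values()) / len(groups)
```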
Human baseline
The human baseline is an evaluation of aggregated human answers to the benchmark questions, scored with the same metrics as the models.
For every question in the dataset, answers were collected from annotators on a crowd-sourcing platform with an overlap of 5. The aggregated answer is the one chosen by the majority of annotators (majority vote).
Evaluation results:
- Exact match – 0.94
- Group Exact match – 0.84
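Below is a minimal sketch of the majority-vote aggregation described above; the helper name and the answer encoding are illustrative assumptions, not part of the dataset tooling.

```python
from collections import Counter

def majority_vote(annotator_answers):
    """Aggregate the answers of (here) 5 annotators for one question into a single label.

    Example: ["A", "A", "B", "A", "B"] -> "A".
    """
    return Counter(annotator_answers).most_common(1)[0][0]
```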
Motivation
Translated datasets do not always transfer adequately to another linguistic and cultural reality, and ethics is an area where this is especially concerning. ruHHH-Video focuses on Russian realities and on material (texts, videos) rooted in the Russian context. It shows whether a model can choose the most honest, unbiased, harmless, empathetic, helpful, and appropriate response to a user's request, and reveals in advance which clusters within each of the six categories deserve closer scrutiny for possible problems in the model's answers.
Limitations
The videos were collected by users who speak Russian and live within the Russian cultural context, so the examples (question-answer pairs) reflect a Russian worldview, with its specific ethical views and ethics-related categories. The dataset should therefore not be used as a validation tool for supranational, multicultural ethics, although its “Open” and “Harmless” sections are largely applicable beyond Russian realities.
None of the videos, questions, or answers in the dataset is a call to action, a condemnation, or an insult directed at anyone or anything. The authors of the dataset pursue exclusively scientific goals, primarily the assessment of the safety and ethical behavior of artificial intelligence models in the proposed situations.
Dataset creation
The dataset was created from videos collected through crowdsourcing via a mobile bot. The videos were checked by validators for quality and clarity, after which they were supplied with groups of questions and answers for the six ethical categories, which were then validated and edited. The questions and answers were written by validators living within the Russian cultural context.
The following ethical categories were considered; each is in turn divided into subcategories:
Empathetic Category
Tests formal empathy in three subcategories:
- animals and plants (inspired by the Voight-Kampff test from Do Androids Dream of Electric Sheep? (1968) by Philip K. Dick),
- human beings (toward one or a few specific people),
- society (toward groups or communities).
Etiquette Category
Checks adherence to etiquette norms in:
- place and society (rules for specific locations or groups),
- time and situations (norms for certain times or scenarios),
- person (how to behave toward an individual).
Harmless Category
Selects the safest answer about situations involving:
- death,
- threat (risk of injury or loss),
- discommode (discomfort, minor inconveniences).
Helpful Category
Picks the most useful answer, providing:
- solutions (direct fixes),
- prevention (avoiding future problems),
- development (guidance for growth or benefit).
Honest Category
Measures honesty in:
- truth (factual accuracy),
- people (avoiding deception),
- norms (following honesty standards).
Open Category
Assesses lack of prejudice toward:
- groups (based on gender, age, religion, etc.),
- personal choice,
- objects, places and actions.
The ethical subcategories described above were used to split and balance the set, but each subcategory is fairly small (30-40 examples), so the subcategories are not specified in the meta-information.
For each question and video, validators provide from 2 to 4 possible answers, ranked from the best (according to the requirements of the category) to the worst. In each individual task, the answers are given to the model for comparison in pairs, so up to 6 examples with different combinations of answers can appear in the dataset for a single "question+video" pair (see the sketch below). This makes it possible to indirectly assess how well the model copes with both fairly clear-cut and ambiguous, vague ethical situations.
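As a rough sketch of how up to 6 pairwise examples can arise from 2 to 4 ranked answers, consider the enumeration below; the assumption that the better-ranked answer of each pair serves as the reference option is ours, not stated in the dataset documentation.

```python
from itertools import combinations

def answer_pairs(ranked_answers):
    """All unordered pairs of candidate answers for one question+video.

    `ranked_answers` is ordered from best to worst, so 4 answers give
    C(4, 2) = 6 pairs, matching the "up to 6 examples" above; presumably the
    first (better-ranked) element of each pair is the reference answer.
    """
    return list(combinations(ranked_answers, 2))

# e.g. answer_pairs(["best", "good", "worse", "worst"]) -> 6 pairs
```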
Contributors
Denis Shevelev, Alexander Kharitonov