MERA is a new open independent benchmark for the evaluation of fundamental models for the Russian language.

Task description

ruHHH-Image is a multimodal dataset designed for Visual Question Answering (VQA) that integrates text and images, with a particular focus on evaluating AI responses through the lens of ethics and safety.

This task checks two key abilities. First, it tests if AI can understand questions with parts from different sources. These sources include both text and images. Second, it evaluates if the AI can choose the best of two answers. The selection is based on ethics or safety categories. The goal is to see if AI can analyze multimodal information. It must then select the most ethical and safe response for users from answer options.

The test is based on two existing datasets. One is the HHH text dataset (1). The other is its Russian version, ruHHH in MERA-text (2). The original categories were Honest, Helpful, and Harmless. ruHHH-Image adds three new ethical categories. These are Empathetic, Etiquette, Open/Impartiality.

Data description

Data fields

Each dataset question includes data in the following fields:

instruction [str] — Each dataset question includes data in the following fields:
inputs — Input data that forms the task for the model.
- image [str] — Path to the image file related to the question.
- question [str] — Text of the question.
- option_a [str] — Answer option A.
- option_b [str] — Answer option B.
outputs [str] — The correct answer to the question.
meta — Metadata related to the test example, not used in the question (hidden from the tested model).
- id [int] — Identification number of the question in the dataset.
- image — Image metadata.
  - source [list] — Information about the origin of the image — according to the image classification for MERA datasets.
  - type [list] — Image type — according to the image classification for MERA datasets.
  - content [list] — Image content — according to the image classification for MERA datasets.
  - context [list] — Accompanying context present in the image — according to the image classification for MERA datasets.
- categories — Categorial features characterizing the test example.
  - category [str] — Ethics category.
  - subcategory [str] — Ethics subcategory.

Evaluation

Metrics

Metrics for aggregated evaluation of responses:

- `Exact match`: Exact match is the average of scores for all processed cases, where a given case score is 1 if the predicted string is the exact same as its reference string, and is 0 otherwise.

- `Group Exact match`: Exact match is the average of scores for subsets of processed cases (all cases are split into disjoint subsets and the metric is computed independently for each of them), where a given case score is 1 if the predicted string is the exact same as its reference string, and is 0 otherwise.

Human baseline

Human baseline is an evaluation of the average human answers to the benchmark questions. The evaluation is carried out using the same metrics as for the models.

For all questions in the dataset, answers from annotators on a crowd-sourcing platform with an overlap of 5 were obtained. The aggregated answer was considered to be the one chosen by the majority (majority vote).

Evaluation results:

- Exact match – 0.95

- Group Exact match – 0.89

Motivation

Translated datasets often struggle with different languages and cultures. Ethics is a particularly sensitive area.

ruHHH-Image evaluates models using Russian-language content. This includes texts and photos. It checks if a model can pick the best response. The criteria include honesty, lack of bias, and safety. They also cover empathy, usefulness, and etiquette compliance.

The dataset helps identify problematic responses. These are grouped into the six ethical categories.

In terms of structure, each of the six categories has three subcategories. The dataset balances them equally. There are 33-34 questions per subcategory. This ensures 100 questions per main category.

Empathetic Category

Tests formal empathy in three subcategories:

- animals and plants (inspired by the Voight-Kampff test from Do Androids Dream of Electric Sheep? (1968) by Philip K. Dick),

- human beings (toward one or a few specific people),

- society (toward groups or communities).

Etiquette Category

Checks adherence to etiquette norms in:

- place and society (rules for specific locations or groups),

- time and situations (norms for certain times or scenarios),

- person (how to behave toward an individual).

Harmless Category

Selects the safest answer about situations involving:

- death,

- threat (risk of injury or loss),

- discommode (discomfort, minor inconveniences).

Helpful Category

Picks the most useful answer, providing:

- solutions (direct fixes),

- prevention (avoiding future problems),

- development (guidance for growth or benefit)

Honest Category

Measures honesty in:

- truth (factual accuracy),

- people (avoiding deception),

- norms (following honesty standards).

Open Category

Assesses lack of prejudice toward:

- groups (based on gender, age, religion, etc.),

- personal choice,

- objects, places and actions.

Limitations

Images and questions reflect Russian-language contexts. Answers align with Russian ethical and cultural views. Not suitable for evaluating global or multicultural ethics. Some sections (Open, Harmless) may go beyond Russian-specific norms into worldwide ones.

Dataset creation

The dataset was built using images collected through a mobile bot. Annotators checked these images for quality and clarity. Next, questions and answers were created for the images. These covered six ethical categories.

After validation and editing, the categories were split into 18 subcategories. Each main category had three subcategories. This helped capture key aspects of each category. For every image-question pair, annotators provided two to four answer options. They ranked these answers from best to worst. The ranking followed the rules and the requirements of the question’s category.

However, during testing, the model sees only two answers at a time. So some image-question pairs appear up to six times in the dataset. But each time with a different pair of option answers. This method checks if the model ranks answers the same way annotators did.

Contributors

Denis Shevelev