Task description
CommonVideoQA is a public Russian-language question-answering dataset for evaluating video-text models (Video-LLMs), consisting of multiple-choice questions about video clips. It comprehensively assesses the following competencies: general video comprehension and detail recognition, possession of common and domain-specific knowledge, the ability to determine the precise order of actions within a video and reconstruct the complete sequence, the capability to count objects and actions over time, and the skill to associate actions with their corresponding temporal boundaries in the video. Given an input video and a question, the task requires selecting the single correct answer from four provided options. Correct answers do not require comprehension of the audio track. All video clips are sourced from open public repositories.
Data description
Data fields
Each dataset question includes data in the following fields:
- instruction [str] — Instruction prompt template with placeholders for the question elements.
- inputs — Input data that forms the task for the model:
  - video [str] — Path to the video file related to the question.
  - question [str] — Text of the question.
  - option_a [str] — Answer option A.
  - option_b [str] — Answer option B.
  - option_c [str] — Answer option C.
  - option_d [str] — Answer option D.
- outputs [str] — The correct answer to the question.
- meta — Metadata related to the test example, not used in the question (hidden from the tested model):
  - id [int] — Identification number of the question in the dataset.
  - video — Video metadata:
    - source [list] — Information about the origin of the video, according to the video classification for MERA datasets.
    - type [list] — Video type, according to the video classification for MERA datasets.
    - content [list] — Video content, according to the video classification for MERA datasets.
    - context [list] — Accompanying context present in the video, according to the video classification for MERA datasets.
    - domain [list] — Video domain.
  - categories — Categorical features characterizing the test example:
    - category [str] — Question type.
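For orientation, the sketch below shows how a single record with this layout could be assembled into a text prompt. All values, the instruction template, and the metadata labels are invented for illustration and are not taken from the dataset; the real answer format (letter vs. full option text) may differ.

```python
# Hypothetical example record following the field layout above.
# Every value here (template text, paths, options, metadata labels) is invented.
example = {
    "instruction": (
        "Watch the video and answer the question.\n"
        "{question}\nA. {option_a}\nB. {option_b}\nC. {option_c}\nD. {option_d}\n"
        "Answer with a single letter."
    ),
    "inputs": {
        "video": "videos/000001.mp4",
        "question": "What is the person doing at the start of the clip?",
        "option_a": "Washing dishes",
        "option_b": "Cutting vegetables",
        "option_c": "Opening the fridge",
        "option_d": "Pouring water",
    },
    "outputs": "C",  # assumed here to be the letter of the correct option
    "meta": {
        "id": 1,
        "video": {
            "source": ["open-source dataset"],  # placeholder labels
            "type": ["real-life video"],
            "content": ["actions"],
            "context": ["no accompanying context"],
            "domain": ["cooking"],
        },
        "categories": {"category": "General Description"},
    },
}

# Render the text part of the prompt; the video itself is passed to the model
# separately. str.format ignores the unused "video" key in inputs.
prompt = example["instruction"].format(**example["inputs"])
```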
Evaluation
Metrics
Metrics for aggregated evaluation of responses:
- `Exact match`: the average of per-case scores, where a case scores 1 if the predicted string exactly matches its reference string and 0 otherwise.
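A minimal sketch of this metric, assuming predictions and references are plain strings:

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Average of per-case scores: 1 if the prediction equals the reference, else 0."""
    assert len(predictions) == len(references) and predictions
    return sum(p == r for p, r in zip(predictions, references)) / len(predictions)

# exact_match(["A", "C", "B", "D"], ["A", "B", "B", "D"]) -> 0.75
```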
Human baseline
The human baseline is an evaluation of aggregated human answers to the benchmark questions, carried out with the same metrics as for the models.
For every question in the dataset, annotator answers were collected on a crowd-sourcing platform with an overlap of 5. Free-form answers were normalized (case, whitespace) before comparison with the reference. The aggregated answer was the one chosen by the majority of annotators (majority vote).
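A minimal sketch of this aggregation, assuming normalization means lowercasing and collapsing whitespace (the exact rules are an assumption, not specified here):

```python
from collections import Counter

def normalize(answer: str) -> str:
    # Case- and whitespace-insensitive form used for comparison with the reference.
    return " ".join(answer.lower().split())

def aggregate(annotator_answers: list[str]) -> str:
    # Majority vote over the normalized answers of the 5 annotators.
    counts = Counter(normalize(a) for a in annotator_answers)
    return counts.most_common(1)[0][0]

# aggregate(["Option B", "option b", "Option C", " option B", "option b"]) -> "option b"
```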
Evaluation results:
- `Exact match`: 0.96
Motivation
Most published video-understanding benchmarks focus on English-language content, and no Russian-language benchmark is currently available in the public domain. The CommonVideoQA dataset is designed to bridge this gap: it enables evaluation of how effectively video models can address questions requiring video comprehension (the VideoQA task). The dataset covers both basic and advanced model capabilities, including general video comprehension and detail recognition (excluding audio track perception), understanding of diverse question types, and the ability to select the correct answer from the suggested options.
The "General Description" category requires to answer a question about the main action in the video or the object in the foreground. Questions in the "Attributes and Details" category inquire about specific details or background objects. The "Common and Domain Knowledge" category comprises questions necessitating both common sense knowledge and expertise in specific applied domains (e.g., "In what order should the presented dish be prepared?"). The "Action Sequences" category includes questions testing the understanding of actions in the video, their sequential order, and the ability to reconstruct this sequence. The "Counting" category involves questions assessing the capability to count objects, repetitions of actions over time, and perform basic arithmetic operations with the counts. The "Temporal Intervals" category evaluates the ability to associate actions with temporal boundaries (video timestamps) during which these actions occur. Thus, the dataset evaluates key competencies essential for the video domain.
The examples do not require audio comprehension, and all videos are sourced from open repositories (EPIC-KITCHENS), which must be considered during evaluation interpretation.
Dataset creation
Video clips for the dataset were sourced from the EPIC-KITCHENS-100 dataset. Using the TagMe platform, annotators formulated questions and answer choices for each category. Each example includes only one correct answer, eliminating ambiguity. Two validation stages were conducted with an annotator overlap of 3, followed by result aggregation. Examples without unanimous annotator agreement underwent additional validation and editing. Post-processing was performed to correct typos. Correct answer options are balanced across classes.
Contributors
Vildan Saburov