DramaQA Dataset Description




1. Overview


Drama is a narrative genre that unfolds as a series of events involving several main characters. These characteristics make drama a suitable target for video story research. We built the dataset from the popular Korean drama Another Miss Oh, which has 18 episodes totaling 20.5 hours. The DramaQA dataset consists of sequences of video frames (3 frames per second), character-centered video annotations, and QA pairs with hierarchical difficulty levels.


1) Question-Answer Hierarchy for Levels of Difficulty

To classify question-answer pairs into hierarchical levels of understanding, we propose two criteria: Memory Capacity and Logical Complexity. Memory Capacity (MC) is defined as the length of the video clip and corresponds to working memory in the human cognitive process. Logical Complexity (LC) is defined as the number of logical reasoning steps required to answer the question, in line with Piaget’s developmental stages.
The hierarchical levels are as follows:

  • Difficulty 1: QAs are based on a shot-level video clip and require a single supporting fact to answer the question.
  • Difficulty 2: QAs are based on a shot-level video clip and require multiple supporting facts to answer the question.
  • Difficulty 3: QAs are based on a scene-level video clip and require multiple supporting facts with a time factor to answer the question.
  • Difficulty 4: QAs are based on a scene-level video clip and require causal reasoning to answer the question.
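
The four levels above can be written down as a small lookup table. This is only an illustrative sketch: the names DifficultyLevel, DIFFICULTY_LEVELS, and clip_length_for are hypothetical and are not part of the released dataset or baseline code.

```python
from typing import NamedTuple


class DifficultyLevel(NamedTuple):
    """One row of the hierarchy (hypothetical helper type)."""
    clip: str       # Memory Capacity: "shot" or "scene"
    reasoning: str  # Logical Complexity: what is needed to answer


# Transcribed from the list of difficulty levels above.
DIFFICULTY_LEVELS = {
    1: DifficultyLevel("shot", "single supporting fact"),
    2: DifficultyLevel("shot", "multiple supporting facts"),
    3: DifficultyLevel("scene", "multiple supporting facts with time factor"),
    4: DifficultyLevel("scene", "reasoning for causality"),
}


def clip_length_for(level: int) -> str:
    """Return the clip granularity ("shot" or "scene") for a difficulty level."""
    return DIFFICULTY_LEVELS[level].clip


for level, spec in DIFFICULTY_LEVELS.items():
    print(level, spec.clip, "-", spec.reasoning)
```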

2) Character-Centered Video Annotations

As characters are the primary components of stories, we provide rich annotations for the main characters in the drama Another Miss Oh. As visual metadata, all image frames in the video clips are annotated with information about the main characters. In each image frame, both a face bounding box and a full-body bounding box are annotated for each main character, together with the character's name. Along with the bounding boxes, the behaviors and emotions of the characters shown in the image frames are annotated. A total of 28 behavioral verbs, such as drink, hold, and cook, are used to express behavior, including a "none" behavior. Also, to give a consistent view of the main characters, all coreferences to the main characters are resolved in the scripts of the video clips.

More details are given in our paper.



2. Dataset Specification

  • 23,928 video clips
    • 803 scenes
    • 23,125 shots
  • 17,983 question-answer pairs with multi-level difficulties
    • Multiple-choice QA with 5 candidate answers
    • Four levels of difficulty for the questions
    • 9,313 level 1 questions
    • 4,530 level 2 questions
    • 2,074 level 3 questions
    • 2,066 level 4 questions

1) Image Frames

  • AnotherMissOh_images.zip contains the image frames of each video clip.
  • The image frames of a scene are saved in the {episodeName/sceneNum} folder, and the image frames of a shot are saved in the {episodeName/sceneNum/shotNum} folder.
    • e.g., the AnotherMissOh01/002/0003 folder holds the 3rd shot of the 2nd scene of episode 1.
  • The image frames are captured at 3 frames per second (FPS).
  • In our baseline code, each image frame is fed into ResNet-50, and the output of the network's last layer is used as its feature.
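
The folder layout above can be sketched as a small path helper. The function name frame_dir is hypothetical, and the zero-padding widths (2 digits for the episode, 3 for the scene, 4 for the shot) are inferred only from the single example AnotherMissOh01/002/0003, so they should be checked against the extracted archive.

```python
from pathlib import PurePosixPath
from typing import Optional


def frame_dir(episode: int, scene: int, shot: Optional[int] = None) -> str:
    """Build the relative frame folder for a scene, or for one shot in it.

    Padding widths are an assumption based on the documented example.
    """
    parts = [f"AnotherMissOh{episode:02d}", f"{scene:03d}"]
    if shot is not None:
        parts.append(f"{shot:04d}")
    return str(PurePosixPath(*parts))


print(frame_dir(1, 2))     # scene-level folder
print(frame_dir(1, 2, 3))  # shot-level folder
```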

2) Hierarchical QA

AnotherMissOh_QA.zip contains 3 JSON files, one for each split of the DramaQA dataset:

File                              # QA pairs   Usage
AnotherMissOhQA_train_set.json    11,118       Model training
AnotherMissOhQA_val_set.json      3,412        Hyperparameter tuning
AnotherMissOhQA_test_set.json     3,453        Model testing
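
A minimal sketch of loading one split is given here. It assumes that the archive has been extracted into a local folder (the qa_dir argument is hypothetical) and that each file is a single JSON document; the top-level container type is not stated in this description, so the function simply returns whatever json.load produces.

```python
import json
from pathlib import Path

# File names as listed in the table of splits.
SPLIT_FILES = {
    "train": "AnotherMissOhQA_train_set.json",
    "val": "AnotherMissOhQA_val_set.json",
    "test": "AnotherMissOhQA_test_set.json",
}


def load_split(qa_dir, split):
    """Load one split ("train", "val", or "test") from the extracted archive."""
    path = Path(qa_dir) / SPLIT_FILES[split]
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

For example, load_split("AnotherMissOh_QA", "val") would read the validation file if the archive were extracted to a folder with that name.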

The QA files contain questions, answers, question levels, and other useful information. Each QA consists of one question and five candidate answers, of which exactly one is correct.
Each question has two levels: a Memory Capacity level and a Logical Complexity level.

Each entry in these files has the following keys:

Key              Type             Description
correct_idx      int              Index of the correct answer among the candidates (1~5)
answers          list of string   List of candidate answers
que              string           Question
shot_contained   list of int      First and last shot index of the target video (only one element when the target video is a shot)
q_level_logic    int              Logical Complexity level, values from 1 to 4
vid              string           Video clip id: episodeName_sceneNum_shotNum (for a scene-level video, episodeName_sceneNum_0000 is used)
q_level_mem      int              Memory Capacity level, values from 2 to 3
qid              int              Question id
videoType        string           Video type: shot or scene
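
The key table can be turned into a simple sanity check for loaded entries. This is a sketch rather than an official validator: the name check_entry is hypothetical, and the types and value ranges are copied from the table as written, so any mismatch it reports may also point to the table being out of date.

```python
# Expected Python type for each key, following the key table.
EXPECTED_TYPES = {
    "qid": int,
    "que": str,
    "answers": list,
    "correct_idx": int,
    "vid": str,
    "videoType": str,
    "shot_contained": list,
    "q_level_logic": int,
    "q_level_mem": int,
}


def check_entry(entry):
    """Return a list of human-readable problems; an empty list means no issue."""
    problems = []
    for key, expected in EXPECTED_TYPES.items():
        if key not in entry:
            problems.append(f"missing key: {key}")
        elif not isinstance(entry[key], expected):
            problems.append(f"{key}: expected {expected.__name__}")
    if problems:
        return problems  # the range checks below need well-typed values

    if len(entry["answers"]) != 5:
        problems.append("answers: expected 5 candidates")
    if not 1 <= entry["correct_idx"] <= 5:
        problems.append("correct_idx: outside the documented range 1~5")
    if not 1 <= entry["q_level_logic"] <= 4:
        problems.append("q_level_logic: outside 1 to 4")
    if entry["q_level_mem"] not in (2, 3):
        problems.append("q_level_mem: outside 2 to 3")
    if entry["videoType"] not in ("shot", "scene"):
        problems.append("videoType: expected shot or scene")
    if len(entry["shot_contained"]) not in (1, 2):
        problems.append("shot_contained: expected 1 or 2 shot indices")
    return problems
```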

Here is an example entry from a QA JSON file:

{
    "correct_idx": 3, 
    "answers": ["Dokyung texted the message to mom.", 
                "Dokyung texted the message to dad.", 
                "Dokyung texted the message to Haeyoung1.",
                "Dokyung texted the message to sister.", 
                "Dokyung texted the message to brother."], 
    "que": "What did Dokyung do in his home?", 
    "shot_contained": [48, 115], 
    "q_level_logic": 3, 
    "vid": "AnotherMissOh16_002_0000", 
    "q_level_mem": 3, 
    "qid": 3707, 
    "videoType": "scene"
}
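
The fields of such an entry can be unpacked as sketched below. The helper names are hypothetical. The index_base parameter takes the documented range of correct_idx (1~5) at face value; if the released files turn out to be 0-based, pass index_base=0 instead.

```python
def correct_answer(entry, index_base=1):
    """Return the answer string selected by correct_idx.

    index_base=1 follows the documented range (1~5); this is an assumption.
    """
    return entry["answers"][entry["correct_idx"] - index_base]


def shot_range(entry):
    """Shot indices covered by the clip, first to last inclusive.

    Works for both a [first, last] pair and a single-element list.
    """
    shots = entry["shot_contained"]
    return range(shots[0], shots[-1] + 1)


def split_vid(vid):
    """Split 'episodeName_sceneNum_shotNum' into its three parts."""
    episode, scene, shot = vid.rsplit("_", 2)
    return episode, scene, shot


entry = {
    "correct_idx": 3,
    "answers": ["Dokyung texted the message to mom.",
                "Dokyung texted the message to dad.",
                "Dokyung texted the message to Haeyoung1.",
                "Dokyung texted the message to sister.",
                "Dokyung texted the message to brother."],
    "que": "What did Dokyung do in his home?",
    "shot_contained": [48, 115],
    "vid": "AnotherMissOh16_002_0000",
    "videoType": "scene",
}

print(correct_answer(entry))
print(len(shot_range(entry)), "shots in this clip")
print(split_vid(entry["vid"]))
```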

3) Visual Metadata

  • Bounding Box: In each image frame, both a face bounding box and a full-body bounding box are annotated for each main character, together with the character's name. In total, 20 main characters are annotated, each with a unique name.
  • Behavior & Emotion: Along with the bounding boxes, the behaviors and emotions of the characters shown in the image frames are annotated. A total of 28 behavioral verbs, such as drink, hold, and cook, are used to express behavior, including a "none" behavior. The characters’ emotions are expressed with 7 emotional adjectives: Anger, Disgust, Fear, Happiness, Sadness, Surprise, and Neutral.
  • The full list of person_id, behavior, and emotion values is available here.

Here is an example entry from a visual metadata JSON file:

{
    "frame_id": "AnotherMissOh17_013_0261_IMAGE_0000021778",
    "persons": [
        {
            "person_info": {
                "behavior": "stand up",
                "face_rect": {
                    "min_x": 427,
                    "max_x": 498,
                    "max_y": 234,
                    "min_y": 124
                },
                "full_rect": {
                    "min_x": 330,
                    "max_x": 569,
                    "max_y": 617,
                    "min_y": 74
                },
                "emotion": "Sadness"
            },
            "person_id": "Jiya"
        }
    ]
}
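
A short sketch of reading one frame annotation follows. It assumes that each element of "persons" is an object holding "person_id" and "person_info", which is how the example above reads; the helper names box_size and summarize_frame are hypothetical.

```python
def box_size(rect):
    """Width and height of a rectangle given as min/max x and y."""
    return rect["max_x"] - rect["min_x"], rect["max_y"] - rect["min_y"]


def summarize_frame(frame):
    """One flat record per annotated person in the frame."""
    rows = []
    for person in frame["persons"]:
        info = person["person_info"]
        rows.append({
            "person_id": person["person_id"],
            "behavior": info["behavior"],
            "emotion": info["emotion"],
            "face_size": box_size(info["face_rect"]),
            "full_size": box_size(info["full_rect"]),
        })
    return rows


frame = {
    "frame_id": "AnotherMissOh17_013_0261_IMAGE_0000021778",
    "persons": [{
        "person_info": {
            "behavior": "stand up",
            "face_rect": {"min_x": 427, "max_x": 498, "max_y": 234, "min_y": 124},
            "full_rect": {"min_x": 330, "max_x": 569, "max_y": 617, "min_y": 74},
            "emotion": "Sadness",
        },
        "person_id": "Jiya",
    }],
}

for row in summarize_frame(frame):
    print(row)
```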

4) Coreference Resolved Scripts

To understand video stories, especially dramas, it is important to understand the dialogue between the characters. In particular, information such as “Who is talking to whom about who did what?” is significant for understanding the whole story. In DramaQA, we provide this information by resolving the coreferences to the main characters in the scripts.

Here is an example entry from a script JSON file:

"AnotherMissOh01_001_0109": {
    "contained_subs": [
    {
        "et": "295.595",
        "speaker": "Haeyoung1",
        "st": "293.685",
        "utter": "I(Haeyoung1) said I(Haeyoung1)'m not going to get married."
    },
    {
        "et": "292.426",
        "speaker": "Deogi",
        "st": "290.376",
        "utter": "Just what in the world are you(Haeyoung1) trying to say now?"
    }],
    }],
    "et": "294.6",
    "st": "291.56"
}
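
The resolved coreferences can be pulled out of an utterance with a regular expression, as sketched below. The mention(Name) pattern is inferred only from the example above and may not cover every annotation in the scripts; the helper names are hypothetical. Note that "st" and "et" are stored as strings, so they are converted to float before sorting.

```python
import re

# A word immediately followed by a parenthesized character name, e.g. you(Haeyoung1).
COREF = re.compile(r"(\w+)\((\w+)\)")


def resolved_mentions(utter):
    """Return (mention, character) pairs found in an utterance."""
    return COREF.findall(utter)


def strip_annotations(utter):
    """Remove the parenthesized names, leaving the plain subtitle text."""
    return COREF.sub(r"\1", utter)


def subs_in_order(shot):
    """Subtitles of a shot sorted by start time."""
    return sorted(shot["contained_subs"], key=lambda s: float(s["st"]))


shot = {
    "contained_subs": [
        {"st": "293.685", "et": "295.595", "speaker": "Haeyoung1",
         "utter": "I(Haeyoung1) said I(Haeyoung1)'m not going to get married."},
        {"st": "290.376", "et": "292.426", "speaker": "Deogi",
         "utter": "Just what in the world are you(Haeyoung1) trying to say now?"},
    ],
}

for sub in subs_in_order(shot):
    print(sub["speaker"], resolved_mentions(sub["utter"]), strip_annotations(sub["utter"]))
```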