Dataset Description





1) Overview

Drama is a narrative genre in which a story unfolds as a series of events centered on several main characters, which makes it a suitable target for video story research. We built the dataset from the popular Korean drama Another Miss Oh, which has 18 episodes totaling 20.5 hours. The DramaQA dataset consists of sequences of video frames (3 frames per second), character-centered video annotations, and QA pairs with hierarchical difficulty levels.


2) DramaQA

DramaQA.zip contains three JSON files, one for each split of the DramaQA dataset:

Files                             #Data    Usage
AnotherMissOhQA_train_set.json    18,465   Model training
AnotherMissOhQA_val_set.json      3,889    Hyperparameter tuning
AnotherMissOhQA_test_set.json     4,019    Model testing

The QA files contain questions, answers, question levels, and other useful information. Each QA item consists of one question and five candidate answers, exactly one of which is correct.
Each question is assigned two kinds of level: a Memory Capacity level and a Logical Complexity level.

Each file contains the following entries:

Key             Type            Description
correct_idx     int             index of the correct answer among the candidates (0 to 4)
answers         list of string  list of candidate answers
que             string          question
shot_contained  list of int     the first and the last shot index of the target video
                                (when the target video is a single shot, the list contains only one element)
q_level_logic   int             Logical Complexity level, values are from 1 to 4
vid             string          video clip id: episodeName_sceneNum_shotNum
                                (for a scene-level video, episodeName_sceneNum_0000 is used)
q_level_mem     int             Memory Capacity level, values are from 2 to 3
qid             int             question id
videoType       string          video type: shot or scene

Here is an example entry from a QA JSON file:

{
    "correct_idx": 3,
    "answers": ["Dokyung texted the message to mom.",
                "Dokyung texted the message to dad.",
                "Dokyung texted the message to Haeyoung1.",
                "Dokyung texted the message to sister.",
                "Dokyung texted the message to brother."],
    "que": "What did Dokyung do in his home?",
    "shot_contained": [48, 115],
    "q_level_logic": 3,
    "vid": "AnotherMissOh16_002_0000",
    "q_level_mem": 3,
    "qid": 3707,
    "videoType": "scene"
}
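As a minimal sketch of how these entries might be consumed, the snippet below looks up the correct answer of an entry and computes accuracy per Logical Complexity level. It assumes, without confirmation from the files themselves, that each split file holds a list of entries like the one above. The helper names and the qid-to-index prediction format are hypothetical and used only for illustration.

```python
import json
from collections import Counter

# One entry copied from the example above, wrapped in a list.
# Assumption: a real split file holds a list of such entries.
entries = json.loads("""
[{"correct_idx": 3,
  "answers": ["Dokyung texted the message to mom.",
              "Dokyung texted the message to dad.",
              "Dokyung texted the message to Haeyoung1.",
              "Dokyung texted the message to sister.",
              "Dokyung texted the message to brother."],
  "que": "What did Dokyung do in his home?",
  "shot_contained": [48, 115],
  "q_level_logic": 3,
  "vid": "AnotherMissOh16_002_0000",
  "q_level_mem": 3,
  "qid": 3707,
  "videoType": "scene"}]
""")


def correct_answer(entry):
    """Return the text of the correct candidate answer."""
    return entry["answers"][entry["correct_idx"]]


def accuracy_by_logic_level(entries, predictions):
    """Accuracy per q_level_logic.

    predictions: dict mapping qid -> predicted answer index
    (a hypothetical format chosen for this sketch).
    """
    total, hits = Counter(), Counter()
    for entry in entries:
        level = entry["q_level_logic"]
        total[level] += 1
        hits[level] += int(predictions.get(entry["qid"]) == entry["correct_idx"])
    return {level: hits[level] / total[level] for level in total}


print(correct_answer(entries[0]))
print(accuracy_by_logic_level(entries, {3707: 3}))
```

Grouping by q_level_mem instead would only require changing the key that is read inside the loop.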

3) DramaCap

DramaCap.zip contains three JSON files, one for each split of the DramaCap dataset:

Files                  #Data    Usage
DramaCap_train.json    11,602   Model training
DramaCap_val.json      2,471    Hyperparameter tuning
DramaCap_test.json     2,329    Model testing

Each DramaCap file is a dictionary that maps a vid to its caption.
There are two types of descriptions: shot-level and scene-level. For a scene-level description, the vid ends with _0000.


Here is an example of a DramaCap file:

{
    ...
    "AnotherMissOh01_018_0000": "Heeran and Haeyoung1 are laughing. Heeran and Haeyoung1 made a bet. Haeyoung1 tried to drink an energy drink at once but Haeyoung1 failed. Haeyoung1 fell backwards from the chair.",
    "AnotherMissOh01_018_0738": "Heeran is laughing on the chair.",
    "AnotherMissOh01_018_0739": "Heeran is talking.",
    "AnotherMissOh01_018_0740": "Haeyoung1 is talking.",
    ...
}
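Because the two description types share one dictionary, a consumer usually has to separate them by the _0000 suffix. The following is a minimal sketch of that split, using the entries from the example above. The function name is hypothetical.

```python
# Captions copied from the example above (the elided "..." entries are omitted).
captions = {
    "AnotherMissOh01_018_0000": (
        "Heeran and Haeyoung1 are laughing. Heeran and Haeyoung1 made a bet. "
        "Haeyoung1 tried to drink an energy drink at once but Haeyoung1 failed. "
        "Haeyoung1 fell backwards from the chair."
    ),
    "AnotherMissOh01_018_0738": "Heeran is laughing on the chair.",
    "AnotherMissOh01_018_0739": "Heeran is talking.",
    "AnotherMissOh01_018_0740": "Haeyoung1 is talking.",
}


def split_by_level(captions):
    """Split a vid -> caption dict into (scene_level, shot_level) dicts.

    A vid whose last field is 0000 is treated as scene-level,
    as described in the text above.
    """
    scene_level = {v: c for v, c in captions.items() if v.endswith("_0000")}
    shot_level = {v: c for v, c in captions.items() if not v.endswith("_0000")}
    return scene_level, shot_level


scene_level, shot_level = split_by_level(captions)
print(len(scene_level), "scene-level caption(s)")
print(len(shot_level), "shot-level caption(s)")
```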

4) Image Frames

  • AnotherMissOh_images.zip contains the image frames of each video clip.
  • The image frames of a scene are saved in the {episodeName/sceneNum} folder, and the image frames of a shot are saved in the {episodeName/sceneNum/shotNum} folder.
    • e.g., the AnotherMissOh01/002/0003 folder holds the 3rd shot of the 2nd scene in episode 1.
  • The image frames are captured at 3 frames per second (FPS).
  • In our baseline code, each image frame is fed into ResNet-50 and converted to features taken from the last layer of the network.
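The folder layout above can be derived directly from a vid. The sketch below maps a vid of the form episodeName_sceneNum_shotNum to its frame folder. The function name and the root directory name (the extracted archive is assumed to be called AnotherMissOh_images) are assumptions for illustration, not part of the dataset specification.

```python
from pathlib import PurePosixPath


def frame_dir(vid, root="AnotherMissOh_images"):
    """Map a vid (episodeName_sceneNum_shotNum) to its frame folder.

    Assumption: `root` is wherever AnotherMissOh_images.zip was extracted.
    A shotNum of 0000 marks a scene-level vid, which maps to the scene folder.
    """
    episode, scene, shot = vid.split("_")
    if shot == "0000":
        return PurePosixPath(root, episode, scene)
    return PurePosixPath(root, episode, scene, shot)


print(frame_dir("AnotherMissOh01_002_0003"))  # shot-level folder
print(frame_dir("AnotherMissOh16_002_0000"))  # scene-level folder
```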

5) Visual Metadata

Because characters are the primary components of a story, we provide rich annotations for the main characters of the drama Another Miss Oh. As visual metadata, every image frame in the video clips is annotated with information about the main characters who appear in it. In addition, to give a consistent view of the main characters, all coreferences to them are resolved in the scripts of the video clips.

More details are given in our paper.

  • Bounding Box: In each image frame, a face rectangle and a full-body rectangle are annotated for every main character, together with the character's name. In total, 20 main characters are annotated, each with a unique name.
  • Behavior & Emotion: Along with the bounding boxes, the behaviors and emotions of the characters shown in the image frames are annotated. A total of 28 behavioral verbs, such as drink, hold, and cook, are used to express behavior, including a none behavior. Characters' emotions are expressed with 7 emotional adjectives: Anger, Disgust, Fear, Happiness, Sadness, Surprise, and Neutral.
  • You can check the list of person_id, behavior, and emotion values here.

Here is an example of a visual metadata JSON entry:

{
    "frame_id": "AnotherMissOh17_013_0261_IMAGE_0000021778",
    "persons": [
        {
            "person_info": {
                "behavior": "stand up",
                "face_rect": {
                    "min_x": 427,
                    "max_x": 498,
                    "max_y": 234,
                    "min_y": 124
                },
                "full_rect": {
                    "min_x": 330,
                    "max_x": 569,
                    "max_y": 617,
                    "min_y": 74
                },
                "emotion": "Sadness"
            },
            "person_id": "Jiya"
        }
    ]
}
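The rectangles are stored as min/max corner coordinates, so widths and heights have to be derived. The snippet below is a minimal sketch of reading one frame annotation and computing box sizes. It assumes each element of persons is an object holding person_info and person_id, and the helper name rect_size is hypothetical.

```python
import json

# Frame annotation with the values from the example above.
# Assumption: each element of "persons" is an object with
# "person_info" and "person_id" keys.
frame = json.loads("""
{
  "frame_id": "AnotherMissOh17_013_0261_IMAGE_0000021778",
  "persons": [
    {
      "person_info": {
        "behavior": "stand up",
        "face_rect": {"min_x": 427, "max_x": 498, "max_y": 234, "min_y": 124},
        "full_rect": {"min_x": 330, "max_x": 569, "max_y": 617, "min_y": 74},
        "emotion": "Sadness"
      },
      "person_id": "Jiya"
    }
  ]
}
""")


def rect_size(rect):
    """Return (width, height) of a min/max rectangle in pixels."""
    return rect["max_x"] - rect["min_x"], rect["max_y"] - rect["min_y"]


for person in frame["persons"]:
    info = person["person_info"]
    face_w, face_h = rect_size(info["face_rect"])
    body_w, body_h = rect_size(info["full_rect"])
    print(person["person_id"], info["behavior"], info["emotion"])
    print("  face box:", face_w, "x", face_h)
    print("  full-body box:", body_w, "x", body_h)
```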

6) Coreference Resolved Scripts

To understand video stories, especially dramas, it is important to understand the dialogue between the characters. In particular, information such as “Who is talking to whom about who did what?” is essential for understanding the whole story. In DramaQA, we provide this information by resolving the coreferences to the main characters in the scripts.

Here is an example entry from a script JSON file:

"AnotherMissOh01_001_0109": {
    "contained_subs": [
    {
        "et": "295.595",
        "speaker": "Haeyoung1",
        "st": "293.685",
        "utter": "I(Heayoung1) said I(Heayoung1)'m not going to get married."
    },
    {
        "et": "292.426",
        "speaker": "Deogi",
        "st": "290.376",
        "utter": "Just what in the world are you(Heayoung1) trying to say now?"
    }],
    "et": "294.6",
    "st": "291.56"
}
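In the example, each resolved mention is written as the original word followed by the character name in parentheses, and the st and et fields are strings. The sketch below assumes that pattern holds for single-word mentions throughout the scripts, which the example alone does not guarantee. It extracts the (word, character) pairs, recovers the plain utterance, and orders subtitles by start time. All helper names are hypothetical.

```python
import re

# Subtitles copied from the example above.
subs = [
    {"et": "295.595", "speaker": "Haeyoung1", "st": "293.685",
     "utter": "I(Heayoung1) said I(Heayoung1)'m not going to get married."},
    {"et": "292.426", "speaker": "Deogi", "st": "290.376",
     "utter": "Just what in the world are you(Heayoung1) trying to say now?"},
]

# Assumption: an annotation looks like word(Name) with no space before "(".
MENTION = re.compile(r"(\w+)\((\w+)\)")


def mentions(utter):
    """Return a list of (word, character) pairs found in an utterance."""
    return MENTION.findall(utter)


def strip_annotations(utter):
    """Remove the (Name) annotations and keep the original wording."""
    return MENTION.sub(r"\1", utter)


# st/et are stored as strings, so convert to float before sorting.
for sub in sorted(subs, key=lambda s: float(s["st"])):
    print(sub["speaker"], "->", mentions(sub["utter"]))
    print("  plain:", strip_annotations(sub["utter"]))
```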