Dataset Description
1) Overview
Drama is a narrative genre that can be described as a series of events involving several main characters. These characteristics make drama a suitable target for video story research. We collected the dataset from the popular Korean drama Another Miss Oh, which has 18 episodes and runs 20.5 hours in total. The DramaQA dataset consists of sequences of video frames (3 frames per second), character-centered video annotations, and QA pairs with hierarchical difficulty levels.
2) DramaQA
DramaQA.zip contains 3 JSON files, each of which is one split of the DramaQA dataset:
Files | #Data | Usage |
---|---|---|
AnotherMissOhQA_train_set.json | 18,465 | Model training |
AnotherMissOhQA_val_set.json | 3,889 | Hyperparameter tuning |
AnotherMissOhQA_test_set.json | 4,019 | Model testing |
The QA files contain questions, answers, question levels, and other useful information. Each QA item consists of one question and five candidate answers, exactly one of which is correct.
Each question has two kinds of level: a Memory Capacity level and a Logical Complexity level.
Each file contains the following entries:
Key | Type | Description |
---|---|---|
correct_idx | int | index of correct answer among candidates (0~4) |
answers | list of string | list of candidate answers |
que | string | question |
shot_contained | list of int | a list of the first and the last shot index of the target video (when the target video is a shot, it only contains one element.) |
q_level_logic | int | Logical Complexity level, values are from 1 to 4 |
vid | string | video clip id: episodeName_sceneNum_shotNum (for scene-level video vid, episodeName_sceneNum_0000 is used.) |
q_level_mem | int | Memory Capacity level, values are from 2 to 3 |
qid | int | question id |
videoType | string | video type: shot or scene |
Here is an example entry from a QA JSON file:
{
    "correct_idx": 3,
    "answers": ["Dokyung texted the message to mom.",
                "Dokyung texted the message to dad.",
                "Dokyung texted the message to Haeyoung1.",
                "Dokyung texted the message to sister.",
                "Dokyung texted the message to brother."],
    "que": "What did Dokyung do in his home?",
    "shot_contained": [48, 115],
    "q_level_logic": 3,
    "vid": "AnotherMissOh16_002_0000",
    "q_level_mem": 3,
    "qid": 3707,
    "videoType": "scene"
}
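As a minimal sketch of how such an entry might be consumed, the snippet below looks up the text of the correct answer and tallies questions by level pair. It assumes that each split file holds a JSON list of entries with the keys listed in the table; the helper names `correct_answer` and `count_by_level` are hypothetical and not part of any released tooling.

```python
import json
from collections import Counter

# One entry, copied from the example in this section.
sample = json.loads("""
{
    "correct_idx": 3,
    "answers": ["Dokyung texted the message to mom.",
                "Dokyung texted the message to dad.",
                "Dokyung texted the message to Haeyoung1.",
                "Dokyung texted the message to sister.",
                "Dokyung texted the message to brother."],
    "que": "What did Dokyung do in his home?",
    "shot_contained": [48, 115],
    "q_level_logic": 3,
    "vid": "AnotherMissOh16_002_0000",
    "q_level_mem": 3,
    "qid": 3707,
    "videoType": "scene"
}
""")


def correct_answer(entry):
    """Return the text of the correct candidate (correct_idx is 0-based)."""
    return entry["answers"][entry["correct_idx"]]


def count_by_level(entries):
    """Count questions per (Memory Capacity, Logical Complexity) pair."""
    return Counter((e["q_level_mem"], e["q_level_logic"]) for e in entries)


print(correct_answer(sample))
print(count_by_level([sample]))

# Assumption: a split file is a JSON list of such entries, so a full split
# could be loaded like this (path shown for illustration only):
# with open("AnotherMissOhQA_train_set.json", encoding="utf-8") as f:
#     entries = json.load(f)
```

If the top-level structure of the files turns out to be a dictionary rather than a list, only the loading step needs to change; the two helpers work on individual entries.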
3) DramaCap
DramaCap.zip contains 3 JSON files, each of which is one split of the DramaCap dataset:
Files | #Data | Usage |
---|---|---|
DramaCap_train.json | 11,602 | Model training |
DramaCap_val.json | 2,471 | Hyperparameter tuning |
DramaCap_test.json | 2,329 | Model testing |
Each DramaCap file is a dictionary that maps a vid (key) to its caption (value).
There are two types of descriptions: shot-level descriptions and scene-level descriptions. For a scene-level description, the vid ends with _0000.
Here is an example of DramaCap file:
{
    ...
    "AnotherMissOh01_018_0000": "Heeran and Haeyoung1 are laughing. Heeran and Haeyoung1 made a bet. Haeyoung1 tried to drink an energy drink at once but Haeyoung1 failed. Haeyoung1 fell backwards from the chair.",
    "AnotherMissOh01_018_0738": "Heeran is laughing on the chair.",
    "AnotherMissOh01_018_0739": "Heeran is talking.",
    "AnotherMissOh01_018_0740": "Haeyoung1 is talking.",
    ...
}
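The `_0000` suffix rule can be used to separate the two description types. The following is a small sketch under that rule only; `is_scene_level` and `split_captions` are hypothetical helper names, and the sample dictionary reuses the captions from the example in this section.

```python
def is_scene_level(vid):
    """A vid ending in _0000 refers to a whole scene rather than one shot."""
    return vid.endswith("_0000")


def split_captions(captions):
    """Split a vid -> caption dictionary into (scene_level, shot_level)."""
    scene = {v: c for v, c in captions.items() if is_scene_level(v)}
    shot = {v: c for v, c in captions.items() if not is_scene_level(v)}
    return scene, shot


captions = {
    "AnotherMissOh01_018_0000": (
        "Heeran and Haeyoung1 are laughing. Heeran and Haeyoung1 made a bet. "
        "Haeyoung1 tried to drink an energy drink at once but Haeyoung1 failed. "
        "Haeyoung1 fell backwards from the chair."
    ),
    "AnotherMissOh01_018_0738": "Heeran is laughing on the chair.",
    "AnotherMissOh01_018_0739": "Heeran is talking.",
    "AnotherMissOh01_018_0740": "Haeyoung1 is talking.",
}

scene_caps, shot_caps = split_captions(captions)
print(len(scene_caps), "scene-level,", len(shot_caps), "shot-level")
```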
4) Image Frames
- AnotherMissOh_images.zip contains the image frames of each video clip.
- The image frames in a scene are saved in the {episodeName/sceneNum} folder, and the image frames in a shot are saved in the {episodeName/sceneNum/shotNum} folder.
  - e.g., the AnotherMissOh01/002/0003 folder holds the 3rd shot of the 2nd scene in episode 1.
- The image frames are captured at 3 frames per second (FPS).
- In our baseline code, each image frame is fed into ResNet-50 and transformed into features taken from the last layer of the network.
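The folder layout above can be derived from a vid of the form episodeName_sceneNum_shotNum. The sketch below is one way to do that mapping; the function name `frame_dir` and the default root folder `AnotherMissOh_images` (the place where the archive is assumed to be extracted) are illustrative assumptions, not something the dataset prescribes.

```python
from pathlib import PurePosixPath


def frame_dir(vid, root="AnotherMissOh_images"):
    """Map a vid to the folder that should hold its image frames.

    Scene-level vids (shot number 0000) map to root/episode/scene;
    shot-level vids map to root/episode/scene/shot.
    """
    episode, scene, shot = vid.split("_")
    if shot == "0000":
        return PurePosixPath(root, episode, scene)
    return PurePosixPath(root, episode, scene, shot)


print(frame_dir("AnotherMissOh01_002_0003"))  # shot-level folder
print(frame_dir("AnotherMissOh16_002_0000"))  # scene-level folder
```

`PurePosixPath` is used so the example prints the same forward-slash paths on every platform; for real file access, `pathlib.Path` would be the natural replacement.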
5) Visual Metadata
As the characters are primary components of stories, we provide rich annotations for the main characters in the drama Another Miss Oh.
As visual metadata, all image frames in the video clips are annotated with information about the main characters.
In each image frame, both a face bounding box and a full-body bounding box are annotated for each main character, together with the character's name.
Along with the bounding boxes, the behaviors and emotions of the characters shown in the image frames are annotated.
Including the none behavior, a total of 28 behavioral verbs, such as drink, hold, and cook, are used to express behavior.
Also, to give a consistent view of the main characters, all coreferences to the main characters are resolved in the scripts of the video clips.
More details are given in our paper.
- Bounding Box: In each image frame, both a face bounding box and a full-body bounding box are annotated for each main character, together with the character's name. In total, 20 main characters are annotated with their unique names.
- Behavior & Emotion: Along with the bounding boxes, the behaviors and emotions of the characters shown in the image frames are annotated. Including the none behavior, a total of 28 behavioral verbs, such as drink, hold, and cook, are used to express behavior. Also, we represent characters’ emotions with 7 emotional adjectives: Anger, Disgust, Fear, Happiness, Sadness, Surprise, and Neutral.
- You can check the list of person_id, behavior, and emotion values here.
Here is an example of a visual metadata JSON entry:
{
    "frame_id": "AnotherMissOh17_013_0261_IMAGE_0000021778",
    "persons": [
        {
            "person_info": {
                "behavior": "stand up",
                "face_rect": {
                    "min_x": 427,
                    "max_x": 498,
                    "max_y": 234,
                    "min_y": 124
                },
                "full_rect": {
                    "min_x": 330,
                    "max_x": 569,
                    "max_y": 617,
                    "min_y": 74
                },
                "emotion": "Sadness"
            },
            "person_id": "Jiya"
        }
    ]
}
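To illustrate how these annotations might be read, the sketch below computes the width and height of each rectangle and recovers the vid from a frame_id. It rests on two assumptions that the example suggests but the text does not state: each element of persons is an object holding person_info and person_id, and a frame_id is a vid followed by `_IMAGE_` and a frame number. The helper names are hypothetical.

```python
def rect_size(rect):
    """Return (width, height) of a rectangle given min/max corner values."""
    return rect["max_x"] - rect["min_x"], rect["max_y"] - rect["min_y"]


def vid_of(frame_id):
    """Assumption: frame_id is '<vid>_IMAGE_<frame number>'."""
    return frame_id.split("_IMAGE_")[0]


def summarize(frame):
    """Collect one summary row per annotated character in a frame."""
    rows = []
    for person in frame["persons"]:
        info = person["person_info"]
        rows.append({
            "person_id": person["person_id"],
            "behavior": info["behavior"],
            "emotion": info["emotion"],
            "face_size": rect_size(info["face_rect"]),
            "full_size": rect_size(info["full_rect"]),
        })
    return rows


# Sample frame using the values from the example in this section.
frame = {
    "frame_id": "AnotherMissOh17_013_0261_IMAGE_0000021778",
    "persons": [
        {
            "person_info": {
                "behavior": "stand up",
                "face_rect": {"min_x": 427, "max_x": 498, "max_y": 234, "min_y": 124},
                "full_rect": {"min_x": 330, "max_x": 569, "max_y": 617, "min_y": 74},
                "emotion": "Sadness",
            },
            "person_id": "Jiya",
        }
    ],
}

print(vid_of(frame["frame_id"]))
for row in summarize(frame):
    print(row)
```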
6) Coreference Resolved Scripts
To understand video stories, especially drama, it is important to understand the dialogue between the characters. In particular, information such as “Who is talking to whom about who did what?” is significant for understanding whole stories. In DramaQA, we provide this information by resolving the coreferences to the main characters in the scripts.
Here is an example entry from a script JSON file:
"AnotherMissOh01_001_0109": {
    "contained_subs": [
        {
            "et": "295.595",
            "speaker": "Haeyoung1",
            "st": "293.685",
            "utter": "I(Heayoung1) said I(Heayoung1)'m not going to get married."
        },
        {
            "et": "292.426",
            "speaker": "Deogi",
            "st": "290.376",
            "utter": "Just what in the world are you(Heayoung1) trying to say now?"
        }
    ],
    "et": "294.6",
    "st": "291.56"
}
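A short sketch of how such a script entry might be processed follows. It orders the subtitles by start time, extracts the resolved character names, and strips the annotations to recover the plain utterance. The assumptions here are that st and et are timestamps stored as strings (presumably seconds) and that every coreference annotation is a name in parentheses placed directly after the word it resolves; the helper names are hypothetical.

```python
import re

# Assumption: annotations look like "I(Heayoung1)", a name in parentheses.
COREF = re.compile(r"\((\w+)\)")


def ordered_subs(shot):
    """Return the shot's subtitles sorted by start time."""
    return sorted(shot["contained_subs"], key=lambda s: float(s["st"]))


def referenced_names(utter):
    """List the character names attached to words in an utterance."""
    return COREF.findall(utter)


def plain_text(utter):
    """Remove the parenthesized names to get the unannotated utterance."""
    return COREF.sub("", utter)


# Sample shot using the values from the example in this section.
shot = {
    "contained_subs": [
        {"et": "295.595", "speaker": "Haeyoung1", "st": "293.685",
         "utter": "I(Heayoung1) said I(Heayoung1)'m not going to get married."},
        {"et": "292.426", "speaker": "Deogi", "st": "290.376",
         "utter": "Just what in the world are you(Heayoung1) trying to say now?"},
    ],
    "et": "294.6",
    "st": "291.56",
}

for sub in ordered_subs(shot):
    print(sub["speaker"], "->", referenced_names(sub["utter"]), plain_text(sub["utter"]))
```

Scripts that contain other parenthesized text (for example, stage directions) would need a stricter pattern, such as matching only against the known list of person_id values.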