Workshop 2019

1. Call for Papers

The ability to craft and understand stories is a crucial cognitive tool that humans use for communication. According to computational linguists, narrative theorists and cognitive scientists, story understanding is a good proxy for measuring a reader's intelligence. Readers can approach a story as a form of problem-solving in which, for example, they keep track of how the main characters overcome certain obstacles throughout the story. Readers need to make inferences both in prospect and in retrospect about the causal relationships between different events in the story.

In particular, video story data such as TV shows and movies can serve as an excellent testbed for evaluating human-level AI algorithms, for two reasons. First, video data span several modalities, such as sequences of images, audio (including dialogue, sound effects and background music) and text (subtitles or added comments). Second, video data show various cross-sections of everyday life. Therefore, understanding video stories can be thought of as a significant challenge for current AI technology, one that involves analyzing and simulating human vision, language, thinking, and behavior.

Towards human-level video understanding, machine intelligence needs to extract meaningful information such as events from sequential multimodal video data, consider the causal relationships between different events, and make inferences both in prospect and in retrospect about what events will occur and how these events could occur. The story in a video is highly abstracted information that consists of a series of events across multiple scenes in a scenario.

In this workshop, we emphasize the need for findings and insights from various research domains for video story understanding. We aim to invite experts in a variety of related fields, including vision, language processing, computational narratology and neuro-symbolic computing, to provide a perspective on existing research and initiate discussion of future challenges in data-driven video understanding. Topics of interest include, but are not limited to:

Deep learning architecture for multi-modal video story representation
Question answering about video story
Summarization and retrieval from long story video contents
Scene description generation for video understanding
Scene graph generation and relationship detection from video
Activity/Event recognition from video
Character identification & interaction modeling in video
Emotion recognition in video
Novel tasks about video understanding and challenge dataset

This workshop will host invited talks by a selected set of leading researchers in the related fields. We also encourage submissions of papers as extended abstracts of up to 4 pages.

Submission Instructions

We invite submissions of papers as extended abstracts of up to 4 pages, excluding references and supplementary materials. All submissions must be in PDF format as a single file (including supplementary materials), use the templates below, and be submitted through this CMT link. The review process is single-round and double-blind. All submissions must be anonymized.

LaTeX/Word Templates (tar): iccv2019AuthorKit.tgz

All accepted papers will be presented as posters during the workshop and listed on the website. Additionally, a small number of accepted papers will be selected to be presented as contributed talks.

Dual Submissions

Note that this workshop will not publish official proceedings, and accepted submissions will not be counted as publications. We encourage submissions of relevant work that has been previously published or is to be presented at the main conference.

2. Important Dates

Paper Submission Deadline	September 10, 2019 (GMT+9)
Notification to Authors	October 7, 2019
Paper Camera-Ready Deadline	October 18, 2019
Workshop Date	November 2, 2019

3. Schedule

Chair: Chang Dong Yoo (KAIST)

Time	Presentation
08:30 - 08:45	Opening Remarks: Video Turing Test, Byoung-Tak Zhang (Seoul National University)
08:45 - 09:15	Invited Talk 1: Video Understanding: Action, Activities and Beyond, Leonid Sigal (University of British Columbia)
09:15 - 09:45	Invited Paper 1: VideoMem: Constructing, Analyzing, Predicting Short-Term and Long-Term Video Memorability
	Romain Cohendet, Claire-Hélène Demarty, Ngoc Q. K. Duong, Martin Engilberge
	Invited Paper 2: Progressive Attention Memory Network for Movie Story Question Answering
	Junyeong Kim, Minuk Ma, Kyungsu Kim, Sungjin Kim, Chang D. Yoo
	Invited Paper 3: HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
	Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, Josef Sivic
09:45 - 10:15	Spotlight Talks (5 minutes each)
	DIFRINT: Deep Iterative Frame Interpolation for Full-frame Video Stabilization
	Jinsoo Choi, In So Kweon
	Adversarial Inference for Multi-Sentence Video Description
	Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach
	Robust Person Re-identification via Graph Convolution Networks
	Guisik Kim, Dongwook Shu, Junseok Kwon
	Enhancing Performance of Character Identification on Multiparty Dialogues of Drama via Multimodality
	Donghwan Kim
	Dual Attention Networks for Visual Reference Resolution in Visual Dialog
	Gi-Cheon Kang, Jaeseo Lim, Byoung-Tak Zhang
	Event Structure Frame-Annotated WordNet for Multimodal Inferencing
	Seohyun Im
10:15 - 11:25	Coffee Break & Poster Session
11:25 - 11:55	Invited Talk 2: High-level Video Understanding via Adversarial Inference and Spatio-temporal Graphs, Trevor Darrell (University of California, Berkeley)
11:55 - 12:25	Invited Talk 3: Video Recognition from a Story, Cees Snoek (University of Amsterdam)
12:25 - 12:30	Closing

4. Invited Speakers

Invited Speaker 1: Leonid Sigal, University of British Columbia
Title: Video Understanding: Action, Activities and Beyond
Abstract: Automatic understanding and interpretation of videos is one of the core challenges in computer vision. Many real-world systems could benefit from various levels of human and non-human video "action" understanding. In this talk, I will discuss some of the approaches we developed over the years for addressing aspects of this challenging problem. In particular, I will first discuss a strategy for learning activity progression in LSTM models, using structured rank losses, which explicitly encourage the architecture to increase its confidence in prediction over time. The resulting model turns out to be especially effective in early action detection. I will then talk about some of our recent work on single-frame situational recognition. Situational recognition goes beyond traditional action and activity understanding, which only focuses on detection of salient actions. Situational recognition further requires recognition of semantic roles for each action: for example, who is performing the action, and what the source and/or target of the action is. We propose a mixture-kernel attention Graph Neural Network (GNN) for addressing this problem. Finally, time permitting, I will also discuss our recent work on audio-visual weakly-supervised dense video captioning.

Invited Speaker 2: Trevor Darrell, University of California, Berkeley
Title: High-level Video Understanding via Adversarial Inference and Spatio-temporal Graphs
Abstract: In this talk I'll present recent work towards high-level video understanding, including results on multi-sentence video description using novel adversarial inference methods. Our adversarial method relies on a hybrid discriminator design, with constituent elements for linguistic coherence, visual relevance, and paragraph consistency. I'll also present models for few-shot video activity recognition, leveraging scene graphs defined over space and time. Our method targets activity recognition where labeled training data are expensive and rare, i.e., in few-shot conditions, such as crash detection for autonomous driving. Time permitting, I'll cover other ongoing efforts on video description and activity recognition.

Invited Speaker 3: Cees Snoek, University of Amsterdam
Title: Video Recognition from a Story
Abstract: By 2022 there will be 45 billion cameras in the world, many of them tiny, connected and live streaming 24/7. Self-driving cars, drones and service robots are just three manifestations. For all these applications it will be of critical importance to understand what is happening where and when in the video streams. The common tactic for spatiotemporal video recognition is to track a human-specified box or to learn a deep classification network from a set of predefined action classes. In this talk I will present an alternative approach that allows for spatiotemporal recognition from a natural language sentence as input, and show its potential for object tracking and action segmentation. For object tracking, rather than specifying the target in the first frame of a video by a bounding box, a natural language specification of the target provides a more natural human-machine interaction as well as a means to improve tracking results. For action segmentation, rather than learning to segment from a fixed vocabulary of actor and action pairs, inference from a natural language input sentence makes it possible to distinguish between fine-grained actors in the same super-category, identify actor and action instances, and segment pairs that are completely outside of the vocabulary. For both tasks we discuss the realization via multimodal network architectures and sentence-augmented datasets, comparisons with the traditional state-of-the-art, as well as their potential for application in surveillance and other live video streams.

5. Accepted Papers

DIFRINT: Deep Iterative Frame Interpolation for Full-frame Video Stabilization

(spotlight)

Jinsoo Choi (KAIST); In So Kweon (KAIST)

Adversarial Inference for Multi-Sentence Video Description

(spotlight)

Jae Sung Park (UC Berkeley); Marcus Rohrbach (Facebook AI Research); Trevor Darrell (UC Berkeley); Anna Rohrbach (UC Berkeley)

Robust Person Re-identification via Graph Convolution Networks

(spotlight)

Guisik Kim (Chung-Ang Univ., Korea); Dongwook Shu (Chung-Ang Univ., Korea); Junseok Kwon (Chung-Ang Univ., Korea)

Enhancing Performance of Character Identification on Multiparty Dialogues of Drama via Multimodality

(spotlight)

Donghwan Kim (Korea Advanced Institute of Science and Technology)

Dual Attention Networks for Visual Reference Resolution in Visual Dialog

(spotlight)

Gi-Cheon Kang (Seoul National University); Jaeseo Lim (Seoul National University); Byoung-Tak Zhang (Seoul National University)

Event Structure Frame-Annotated WordNet for Multimodal Inferencing

(spotlight)

Seohyun Im (Seoul National University)

A Neural Question-Answering Manager for Video Question Answering

A-Yeong Kim (Kyungpook National University); Gyu-Min Park (Kyung Hee University); Su-Hwan Yun (Kyung Hee University); Seong-Bae Park (Kyung Hee University)

Emotion-based Story Event Clustering

Hye-Yeon Yu (Department of Electrical and Computer Engineering, Sungkyunkwan University, Suwon, Korea); Seohui Park (Sungkyunkwan University); Yun-Gyung Cheong (Sungkyunkwan University); Byung-Chull Bae (Hongik University)

Tripartite Heterogeneous Graph Propagation for Multimodal Entity Interaction Prediction

Kyung-Min Kim (Clova AI Research); Donghyun Kwak (Search Solution Inc.); Hanock Kwak (LINE Plus Corp.); Young-Jin Park (Naver R&D Center); Sangkwon Sim (Clova AI Research); Jae-Han Cho (Clova AI Research); Minkyu Kim (Clova AI Research); Jihun Kwon (Naver R&D Center); Nako Sung (Clova AI Research); Jung-Woo Ha (Clova AI Research)