1. Call for Papers
The ability to craft and understand stories is a crucial cognitive tool that humans use for communication. According to computational linguists, narrative theorists, and cognitive scientists, story understanding is a good proxy for measuring a reader's intelligence. Readers can approach a story as a form of problem-solving, for example by tracking how the main characters overcome obstacles throughout the story. They must make inferences, both in prospect and in retrospect, about the causal relationships between different events in the story.
Video story data such as TV shows and movies can serve as an excellent testbed for evaluating human-level AI algorithms, for two reasons. First, video data combines multiple modalities: a sequence of images, audio (including dialogue, sound effects, and background music), and text (subtitles or added comments). Second, video data captures diverse cross-sections of everyday life. Understanding video stories is therefore a significant challenge for current AI technology, as it involves analyzing and simulating human vision, language, thinking, and behavior.
This workshop aims to invite experts from various fields, including vision, language processing, multimedia, and speech recognition, to share their perspectives on existing research and to initiate discussion on future challenges in data-driven video understanding. To assess current state-of-the-art methodologies and encourage rapid progress in the field, we will also host a challenge based on the DramaQA dataset, which is designed for story-centered video question answering. Topics of interest include but are not limited to:
This workshop will invite leading researchers from various fields. We also encourage submissions of papers to both archival and non-archival tracks. All accepted papers will be presented as posters during the workshop and listed on the website. Additionally, a small number of accepted papers will be selected for presentation as contributed talks.
Note that we provide both archival and non-archival tracks. All submissions will be handled through CMT, at this CMT link.
2. Challenge: DramaQA Challenge
Please see the challenge page for details.
3. Important Dates
4. Invited Speakers
-- Live QnA Session 1: 4 P.M.-6 P.M., August 27th (PDT)
Title: Towards generating stories about video
Abstract: Humans can easily tell a story about what they see in a video. Such a story would be relevant, coherent, and non-redundant; it would mention distinct person identities and significant objects or places, make use of co-references, and so on. Machines are still far from being able to generate such stories. In my talk I will focus on how to make progress towards the following two goals. First, I will talk about how to obtain more coherent and diverse multi-sentence video descriptions. Next, I will discuss how to connect video descriptions to person identities.
Title: Imagination-Supervised Visiolinguistic Learning
Title: Commonsense Modeling with Vision and Language
Abstract: Despite considerable advances in deep learning, AI remains narrow and brittle. One fundamental limitation comes from its lack of commonsense intelligence: reasoning about everyday situations and events, which in turn requires knowledge about how the physical and social world works. In this talk, I will share some of our recent efforts that attempt to crack commonsense intelligence, with emphasis on intuitive causal inferences and the synergistic connection between declarative knowledge in multimodal knowledge graphs and observed knowledge in unstructured web data. I will conclude the talk by discussing major open research questions, including the importance of algorithmic solutions to reduce incidental biases in data that can lead to overestimation of true AI capabilities.
-- Live QnA Session 2: 6 A.M.-8 A.M., August 28th (PDT)
Title: Reasoning about Complex Media from Weak Multi-modal Supervision
Abstract: In a world of abundant information targeting multiple senses, and increasingly powerful media, we need new mechanisms to model content. Techniques for representing individual channels, such as visual data or textual data, have greatly improved, and some techniques exist to model the relationship between channels that are “mirror images” of each other and contain the same semantics. However, multimodal data in the real world contains little redundancy; the visual and textual channels complement each other. We examine the relationship between multiple channels in complex media in two domains: advertisements and political articles. We develop a large annotated dataset of advertisements and public service announcements, covering almost forty topics (ranging from automobiles and clothing to health and domestic violence). We pose decoding the ads as automatically answering the questions “What should the viewer do, according to the ad?” (the suggested action) and “Why should the viewer do the suggested action, according to the ad?” (the suggested reason). We collect annotations and train a variety of algorithms to choose the appropriate action-reason statement, given the ad image and potentially a slogan embedded in it. The task is challenging because of the great diversity in how different users annotate an ad, even if they draw similar conclusions. One approach mines information from external knowledge bases, but there is a plethora of information that can be retrieved yet is not relevant. We show how to automatically transform the training data in order to focus our approach’s attention on relevant facts, without relevance annotations for training. We also present an approach for learning to recognize new concepts given supervision only in the form of noisy captions. Next, we collect a dataset of multimodal political articles containing lengthy text and a small number of images.
We learn to predict the political bias of the article, as well as perform cross-modal retrieval. To better understand political bias, we use generative modeling to show how the face of the same politician appears differently at each end of the political spectrum. To understand how image and text contribute to persuasion and bias, we learn to retrieve sentences for a given image, and vice versa. The task is challenging because, unlike image-text pairs in captioning, the images and text in political articles overlap in only a very abstract sense. To better model the visual domain, we leverage the semantic domain. Specifically, when performing retrieval, we impose a loss requiring images that correspond to similar text to lie close together in a projection space, even if they appear very diverse purely visually. We show that our loss significantly improves performance in conjunction with a variety of existing recent losses.
Title: Machine Understanding of Social Situations
Abstract: There is a growing interest in AI to build socially intelligent robots. This requires machines to have the ability to read people's emotions, motivations, and other factors that affect behavior. Towards this goal, I will discuss three directions from our recent work. First, as one of the building blocks towards video analysis, I will show how video face clustering can be performed in a realistic setting, without knowing the number of characters and considering all background characters. Second, I will introduce MovieGraphs, a dataset that provides detailed, graph-based annotations of social situations depicted in movie clips. I will show how common sense can emerge by analyzing various social aspects of movies, and also describe applications including querying videos with graphs, understanding interactions via ordering, and reason prediction. Finally, I will present our recent work on analyzing whether machines can recognize and benefit from the joint interplay of interactions and relationships between movie characters.
Title: Ten Questions for a Theory of Vision
Abstract: By and large, the remarkable progress in visual recognition and understanding in the last few years is attributed to the availability of huge labelled datasets paired with strong and suitable computational resources. This has opened the doors to the massive use of deep learning, which has led to remarkable improvements on common benchmarks. While subscribing to this view, in this talk I claim that the time has come to begin working towards a deeper understanding of visual computational processes, which, instead of being regarded as applications of general-purpose machine learning algorithms, are likely to require appropriate learning schemes. The major claim is that the object recognition problems posed nowadays turn out to be significantly more difficult than those offered by nature. This is due to learning algorithms that operate on collections of still images while neglecting the crucial role of frame temporal coherence, which naturally leads to the principle of motion invariance. I advocate an in-depth rethinking of the discipline by shifting to truly visual environments, where time plays a central role. In order to provide a well-posed formulation, I propose addressing ten questions for a good theory of vision, which must be well suited to better grasp the connection with language.