1. Schedule
Session 1 (August 28th, 00:00 - 03:05 (UTC+1))

Time (UTC+1) | Title & Presenter | Links |
---|---|---|
00:00 - 00:10 | Opening Remarks | |
00:10 - 00:55 | Invited Talk 1: Towards Generating Stories about Video - Anna Rohrbach, UC Berkeley | [video & slides] [Q&A] |
00:55 - 01:35 | Invited Talk 2: Imagination Supervised Visiolinguistic Learning - Mohamed H. Elhoseiny, KAUST (King Abdullah University of Science and Technology) | [video & slides] [Q&A] |
01:35 - 02:35 | Invited Talk 3: Commonsense Intelligence: Cracking the Longstanding Challenge in AI - Yejin Choi, University of Washington & Allen Institute for Artificial Intelligence | [video & slides] [Q&A] |
02:35 - 03:05 | Accepted Paper 1: Object-Centric Relational Reasoning for Video Question Answering - Long Dang, Thao Le, Vuong Le, Truyen Tran | [video & slides] [Q&A] |
02:35 - 03:05 | Accepted Paper 2: GCF-Net: Gated Clip Fusion Network for Video Action Recognition - Jenhao Hsiao, Jiawei Chen, Chiuman Ho | [video & slides] [Q&A] |
02:35 - 03:05 | Accepted Paper 3: A Unified Framework for Shot Type Classification Based on Subject Centric Lens - Anyi Rao, Jiaze Wang, Linning Xu, Xuekun Jiang, Qingqiu Huang, Bolei Zhou, Dahua Lin | [video & slides] [Q&A] |
02:35 - 03:05 | Accepted Paper 4: Detecting Human-Object Interactions with Action Co-occurrence Priors - Dong-Jin Kim, Xiao Sun, Jinsoo Choi, Stephen Lin, In So Kweon | [video & slides] [Q&A] |
02:35 - 03:05 | Accepted Paper 5: Watch Hours in Minutes: Summarizing Video with User Intent - Saiteja Nalla, Mohit Agrawal, Vishal Kaushal, Rishabh Iyer, Ganesh Ramakrishnan | [video & slides] [Q&A] |
Session 2 (August 28th, 14:00 - 17:10 (UTC+1))

Time (UTC+1) | Title & Presenter | Links |
---|---|---|
14:00 - 14:10 | Opening Remarks | |
14:10 - 14:45 | Invited Talk 1: Reasoning about Complex Media from Weak Multi-modal Supervision - Adriana Kovashka, University of Pittsburgh | [video & slides] [Q&A] |
14:45 - 15:40 | Invited Talk 2: Machine Understanding of Social Situations - Makarand Tapaswi, Inria Paris | [video & slides] [Q&A] |
15:40 - 16:25 | Invited Talk 3: Ten Questions for a Theory of Vision - Marco Gori, University of Siena | [video & slides] [Q&A] |
16:25 - 16:45 | Accepted Paper 6: Long Term Spatio-Temporal Modeling for Action Detection - Vijay Kumar*, Makarand Tapaswi*, Ivan Laptev | [video & slides] [Q&A] |
16:25 - 16:45 | Accepted Paper 7: Condensed Movies: Story Based Retrieval with Contextual Embeddings - Max Bain, Arsha Nagrani, Andrew Brown, Andrew Zisserman | [video & slides] [Q&A] |
16:25 - 16:45 | Accepted Paper 8: Late Temporal Modeling in 3D CNN Architectures with BERT for Action Recognition - Muhammet Esat Kalfaoglu, Sinan Kalkan, Abdullah Aydin Alatan | [video & slides] [Q&A] |
16:45 - 16:55 | Overview of DramaQA Challenge - Seongho Choi, Yu-Jung Heo, Kyoung-Woon On, Youwon Jang, Ahjeong Seo, Minsu Lee, Byoung-Tak Zhang | [video & slides] |
16:55 - 17:10 | DramaQA Challenge Winner Talk 1: Who is who? (Character Information as Attention Priors) - Jiwan Chung, Youngjae Yu, Heeseung Yun, Jongseok Kim, Eunkyu Park, Gunhee Kim | [video & slides] [Q&A] |
16:55 - 17:10 | DramaQA Challenge Winner Talk 2: A Character-guided QA based Feature Fusion Method for DramaQA Challenge - Zhicheng Guo, Jiaxuan Zhao | [video & slides] [Q&A] |
16:55 - 17:10 | DramaQA Challenge Winner Talk 3: Deep Contextualized QA Representations Based on Bidirectional Matching Network - KyungTae Lim, Youhan Lee, Yonggyun Yu | [video & slides] [Q&A] |
2. Invited Speakers
Title: Towards Generating Stories about Video
Abstract: Humans can easily tell a story about what they see in a video. Such a story would be relevant, coherent, and non-redundant; it would mention distinct person identities and significant objects or places, make use of co-references, and so on. Machines are still far from being able to generate such stories. In my talk I will focus on how to make progress towards the following two goals. First, I will talk about how to obtain more coherent and diverse multi-sentence video descriptions. Next, I will discuss how to connect video descriptions to person identities.
Title: Imagination Supervised Visiolinguistic Learning
Abstract: Most existing learning algorithms can be categorized into supervised, semi-supervised, and unsupervised methods. Most of these approaches rely on defining losses / empirical risks on the provided labeled and/or unlabeled data. In this talk, we first reflect on an existing class of methods that goes beyond the training data to improve recognition through imaginary data produced by the learning algorithm. We refer to these as imagination supervised methods; examples include Mixup and CutMix, which produce imaginary data at the pixel level, and (Bharath et al., 2017), which produces mid-level generations for the few-shot learning task. In this talk we focus on zero-shot learning and long-tail recognition. In these settings, the imaginary visual data/features are semantically guided and hence can be connected to language. We will cover how language-guided generative models have recently been adopted to improve understanding of the unseen (zero-shot learning), briefly touch on creating the unseen (creative AI), and show how both areas benefit from improved representations of the unseen. Towards the end, we will reflect on how imagination supervised learning may benefit more complicated visiolinguistic tasks such as 3D object localization from natural language and Vision & Language.
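The abstract cites Mixup and CutMix as examples of producing imaginary training data at the pixel level. As a rough illustration of that idea (not the speaker's code; names and shapes are made up for the example), a minimal Mixup-style sketch looks like this:

```python
# Minimal Mixup-style sketch: blend two training examples (images and
# one-hot labels) into a new, "imaginary" example at the pixel level.
# All names and shapes here are illustrative.
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2):
    lam = np.random.beta(alpha, alpha)   # mixing coefficient in (0, 1)
    x = lam * x1 + (1.0 - lam) * x2      # pixel-level interpolation
    y = lam * y1 + (1.0 - lam) * y2      # matching label interpolation
    return x, y

# Example: two 32x32 RGB images with 10-class one-hot labels.
img_a, img_b = np.random.rand(32, 32, 3), np.random.rand(32, 32, 3)
lbl_a, lbl_b = np.eye(10)[3], np.eye(10)[7]
mixed_img, mixed_lbl = mixup(img_a, lbl_a, img_b, lbl_b)
```

CutMix follows the same principle but pastes a rectangular patch from one image into another instead of blending every pixel.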
Title: Commonsense Intelligence: Cracking the Longstanding Challenge in AI
Abstract: Despite considerable advances in deep learning, AI remains narrow and brittle. One fundamental limitation comes from its lack of commonsense intelligence: reasoning about everyday situations and events, which in turn requires knowledge about how the physical and social world works. In this talk, I will share some of our recent efforts that attempt to crack commonsense intelligence, with emphasis on intuitive causal inferences and the synergistic connection between declarative knowledge in multimodal knowledge graphs and observed knowledge in unstructured web data. I will conclude the talk by discussing major open research questions, including the importance of algorithmic solutions to reduce incidental biases in data that can lead to overestimation of true AI capabilities.
Title: Reasoning about Complex Media from Weak Multi-modal Supervision
Abstract: In a world of abundant information targeting multiple senses, and increasingly powerful media, we need new mechanisms to model content. Techniques for representing individual channels, such as visual data or textual data, have greatly improved, and some techniques exist to model the relationship between channels that are “mirror images” of each other and contain the same semantics. However, multimodal data in the real world contains little redundancy; the visual and textual channels complement each other. We examine the relationship between multiple channels in complex media, in two domains: advertisements and political articles. We develop a large annotated dataset of advertisements and public service announcements, covering almost forty topics (ranging from automobiles and clothing, to health and domestic violence). We pose decoding the ads as automatically answering the questions “What should the viewer do, according to the ad?” (the suggested action) and “Why should the viewer do the suggested action, according to the ad?” (the suggested reason). We collect annotations and train a variety of algorithms to choose the appropriate action-reason statement, given the ad image and potentially a slogan embedded in it. The task is challenging because of the great diversity in how different users annotate an ad, even if they draw similar conclusions. One approach mines information from external knowledge bases, but there is a plethora of information that can be retrieved yet is not relevant. We show how to automatically transform the training data in order to focus our approach’s attention on relevant facts, without relevance annotations for training. We also present an approach for learning to recognize new concepts given supervision only in the form of noisy captions.

Next, we collect a dataset of multimodal political articles containing lengthy text and a small number of images. We learn to predict the political bias of the article, as well as perform cross-modal retrieval. To better understand political bias, we use generative modeling to show how the face of the same politician appears differently at each end of the political spectrum. To understand how image and text contribute to persuasion and bias, we learn to retrieve sentences for a given image, and vice versa. The task is challenging because, unlike image-text pairs in captioning, the images and text in political articles overlap in only a very abstract sense. To better model the visual domain, we leverage the semantic domain. Specifically, when performing retrieval, we impose a loss requiring images that correspond to similar text to lie close by in a projection space, even if they appear very diverse purely visually. We show that our loss significantly improves performance in conjunction with a variety of existing recent losses.
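The abstract describes a retrieval loss that pulls together images whose accompanying text is similar, even when the images look visually diverse. A minimal sketch of that general idea is shown below (PyTorch; the function name `text_guided_image_loss`, the threshold, and the exact loss form are assumptions for illustration, not the authors' implementation):

```python
# Sketch: encourage images whose paired sentences are semantically similar
# to lie close together in a shared projection space.
import torch
import torch.nn.functional as F

def text_guided_image_loss(img_emb, txt_emb, sim_threshold=0.8):
    """img_emb, txt_emb: (N, D) projected image and text embeddings."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    txt_sim = txt_emb @ txt_emb.t()            # text-text cosine similarity
    img_dist = torch.cdist(img_emb, img_emb)   # image-image distances
    # Penalize large image distances only for pairs whose texts are similar.
    mask = (txt_sim > sim_threshold).float()
    mask.fill_diagonal_(0)
    return (mask * img_dist).sum() / mask.sum().clamp(min=1)

# Example usage with random embeddings for 8 image-text pairs.
loss = text_guided_image_loss(torch.randn(8, 128), torch.randn(8, 128))
```

In practice such a term would be combined with a standard cross-modal retrieval loss, consistent with the abstract's note that it improves performance in conjunction with a variety of existing losses.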
Title: Machine Understanding of Social Situations
Abstract: There is a growing interest in AI to build socially intelligent robots. This requires machines to have the ability to read people's emotions, motivations, and other factors that affect behavior. Towards this goal, I will discuss three directions from our recent work. First, as one of the building blocks towards video analysis, I will show how video face clustering can be performed in a realistic setting, without knowing the number of characters and considering all background characters. Second, I will introduce MovieGraphs, a dataset that provides detailed, graph-based annotations of social situations depicted in movie clips. I will show how common sense can emerge by analyzing various social aspects of movies, and also describe applications including querying videos with graphs, interaction understanding via ordering, and reason prediction. Finally, I will present our recent work on analyzing whether machines can recognize and benefit from the joint interplay of interactions and relationships between movie characters.
Title: Ten Questions for a Theory of Vision
Abstract: By and large, the remarkable progress in visual recognition and understanding in the last few years is attributed to the availability of huge labelled datasets paired with strong and suitable computational resources. This has opened the doors to the massive use of deep learning, which has led to remarkable improvements on common benchmarks. While subscribing to this view, in this talk I claim that the time has come to begin working towards a deeper understanding of visual computational processes, which, instead of being regarded as applications of general-purpose machine learning algorithms, are likely to require their own appropriate learning schemes. The major claim is that the object recognition problems we pose nowadays turn out to be significantly more difficult than those offered by nature. This is due to learning algorithms that work on images while neglecting the crucial role of frame temporal coherence, which naturally leads to the principle of motion invariance. I advocate an in-depth re-thinking of the discipline by shifting to truly visual environments, where time plays a central role. In order to provide a well-posed formulation, I propose addressing ten questions for a good theory of vision, one that must also be well-suited to grasp the connection with language.
3. Call for Papers
The ability to craft and understand stories is a crucial cognitive tool used by humans for communication. According to computational linguists, narrative theorists, and cognitive scientists, story understanding is a good proxy for measuring a reader's intelligence. Understanding a story can be seen as a form of problem-solving in which, for example, readers keep track of how the main characters overcome obstacles throughout the story. Readers need to make inferences, both in prospect and in retrospect, about the causal relationships between different events in the story.
Video story data such as TV shows and movies can serve as an excellent testbed to evaluate human-level AI algorithms from two points of view. First, video data has different modalities such as a sequence of images, audio (including dialogue, sound effects and background music) and text (subtitles or added comments). Second, video data shows various cross-sections of everyday life. Therefore, understanding video stories can be thought of as a significant challenge to current AI technology, which involves analyzing and simulating human vision, language, thinking, and behavior.
This workshop aims to invite experts from various fields, including vision, language processing, multimedia, and speech recognition, to share their perspectives on existing research and to initiate discussion on future challenges in data-driven video understanding. To assess current state-of-the-art methodologies and encourage rapid progress in the field, we will also host a challenge based on the DramaQA dataset, which targets story-centered video question answering. Topics of interest include but are not limited to:
This workshop will invite leading researchers from various fields. We also encourage paper submissions to both archival and non-archival tracks. All accepted papers will be presented as posters during the workshop and listed on the website. Additionally, a small number of accepted papers will be selected for presentation as contributed talks.
Submission Instructions
Note that we provide both archival and non-archival tracks. All submissions will be handled through CMT, at this CMT link.
4. Challenge: DramaQA Challenge
Please see the challenge page for details.
5. Important Dates
6. Organizers
7. Program Committee