1. Call for Papers
The ability to craft and understand stories is a crucial cognitive tool that humans use for communication. According to computational linguists, narrative theorists, and cognitive scientists, story understanding is a good proxy for measuring a reader's intelligence. Readers can approach a story as a form of problem-solving, for example by tracking how the main characters overcome obstacles throughout the story. They must make inferences, both in prospect and in retrospect, about the causal relationships between different events in the story.
Video story data such as TV shows and movies can serve as an excellent testbed for evaluating human-level AI algorithms, for two reasons. First, video data combines multiple modalities: a sequence of images, audio (including dialogue, sound effects, and background music), and text (subtitles or added comments). Second, video data captures diverse cross-sections of everyday life. Understanding video stories is therefore a significant challenge for current AI technology, as it involves analyzing and simulating human vision, language, thinking, and behavior.
This workshop aims to invite experts from various fields, including vision, language processing, multimedia, and speech recognition, to share their perspectives on existing research and to initiate discussion on future challenges in data-driven video understanding. To assess current state-of-the-art methodologies and encourage rapid progress in the field, we will also host a challenge based on the DramaQA dataset, which is designed for story-centered video question answering. Topics of interest include but are not limited to:
This workshop will invite leading researchers from various fields. We also encourage submissions of papers to both archival and non-archival tracks. All accepted papers will be presented as posters during the workshop and listed on the website. Additionally, a small number of accepted papers will be selected for presentation as contributed talks.
Note that we provide both archival and non-archival tracks. All submissions will be handled through CMT, at this CMT link.
2. Challenge: DramaQA Challenge
Please see the challenge page for details.
3. Important Dates
4. Invited Speakers
-- Live QnA Session 1: 4 P.M.-6 P.M., August 27th (PDT)
Title: Towards generating stories about video
Abstract: Humans can easily tell a story about what they see in a video. Such a story would be relevant, coherent, and non-redundant; it would mention distinct person identities and significant objects or places, make use of co-references, and so on. Machines are still far from being able to generate such stories. In my talk I will focus on how to make progress towards the following two goals. First, I will talk about how to obtain more coherent and diverse multi-sentence video descriptions. Next, I will discuss how to connect video descriptions to person identities.
Title: Imagination-Supervised Visiolinguistic Learning
Title: Commonsense Modeling with Vision and Language
Abstract: Despite considerable advances in deep learning, AI remains narrow and brittle. One fundamental limitation comes from its lack of commonsense intelligence: reasoning about everyday situations and events, which in turn requires knowledge about how the physical and social world works. In this talk, I will share some of our recent efforts that attempt to crack commonsense intelligence, with emphasis on intuitive causal inferences and the synergistic connection between declarative knowledge in multimodal knowledge graphs and observed knowledge in unstructured web data. I will conclude the talk by discussing major open research questions, including the importance of algorithmic solutions to reduce incidental biases in data that can lead to overestimation of true AI capabilities.
-- Live QnA Session 2: 6 A.M.-8 A.M., August 28th (PDT)
Title: Reasoning about Complex Media from Weak Multi-modal Supervision
Abstract: In a world of abundant information targeting multiple senses, and increasingly powerful media, we need new mechanisms to model content. Techniques for representing individual channels, such as visual data or textual data, have greatly improved, and some techniques exist to model the relationship between channels that are “mirror images” of each other and contain the same semantics. However, multimodal data in the real world contains little redundancy; the visual and textual channels complement each other. We examine the relationship between multiple channels in complex media in two domains: advertisements and political articles. We develop a large annotated dataset of advertisements and public service announcements, covering almost forty topics (ranging from automobiles and clothing to health and domestic violence). We pose decoding the ads as automatically answering the questions “What should the viewer do, according to the ad?” (the suggested action) and “Why should the viewer do the suggested action, according to the ad?” (the suggested reason). We collect annotations and train a variety of algorithms to choose the appropriate action-reason statement, given the ad image and potentially a slogan embedded in it. The task is challenging because of the great diversity in how different users annotate an ad, even if they draw similar conclusions. One approach mines information from external knowledge bases, but there is a plethora of information that can be retrieved yet is not relevant. We show how to automatically transform the training data in order to focus our approach’s attention on relevant facts, without relevance annotations for training. We also present an approach for learning to recognize new concepts given supervision only in the form of noisy captions. Next, we collect a dataset of multimodal political articles containing lengthy text and a small number of images.
We learn to predict the political bias of the article, as well as perform cross-modal retrieval. To better understand political bias, we use generative modeling to show how the face of the same politician appears differently at each end of the political spectrum. To understand how image and text contribute to persuasion and bias, we learn to retrieve sentences for a given image, and vice versa. The task is challenging because, unlike image-text pairs in captioning, the images and text in political articles overlap in only a very abstract sense. To better model the visual domain, we leverage the semantic domain. Specifically, when performing retrieval, we impose a loss requiring images that correspond to similar text to lie close together in a projection space, even if they appear very diverse purely visually. We show that our loss significantly improves performance in conjunction with a variety of existing recent losses.
Title: Machine Understanding of Social Situations
Abstract: There is a growing interest in AI to build socially intelligent robots. This requires machines to have the ability to read people's emotions, motivations, and other factors that affect behavior. Towards this goal, I will discuss three directions from our recent work. First, as one of the building blocks towards video analysis, I will show how video face clustering can be performed in a realistic setting, without knowing the number of characters and considering all background characters. Second, I will introduce MovieGraphs, a dataset that provides detailed, graph-based annotations of social situations depicted in movie clips. I will show how common sense can emerge by analyzing various social aspects of movies, and also describe applications including querying videos with graphs, understanding interactions via ordering, and reason prediction. Finally, I will present our recent work on analyzing whether machines can recognize and benefit from the joint interplay of interactions and relationships between movie characters.
Title: Ten Questions for a Theory of Vision
Abstract: By and large, the remarkable progress in visual recognition and understanding in the last few years is attributed to the availability of huge labelled datasets paired with strong and suitable computational resources. This has opened the doors to the massive use of deep learning, which has led to remarkable improvements on common benchmarks. While subscribing to this view, in this talk I claim that the time has come to begin working towards a deeper understanding of visual computational processes, which, instead of being regarded as applications of general-purpose machine learning algorithms, are likely to require appropriate learning schemes. The major claim is that the object recognition problems posed nowadays turn out to be significantly more difficult than those offered by nature. This is due to learning algorithms that operate on collections of still images while neglecting the crucial role of frame temporal coherence, which naturally leads to the principle of motion invariance. I advocate an in-depth rethinking of the discipline by shifting to truly visual environments, where time plays a central role. In order to provide a well-posed formulation, I propose addressing ten questions for a good theory of vision, which must be well suited to better grasp the connection with language.