1. Call for Papers

The ability to craft and understand stories is a crucial cognitive tool used by humans for communication. According to computational linguists, narrative theorists and cognitive scientists, story understanding is a good proxy for measuring a reader's intelligence. Readers can approach a story as a form of problem-solving in which, for example, they track how the main characters overcome certain obstacles throughout the story. Readers need to make inferences, both in prospect and in retrospect, about the causal relationships between different events in the story.

Video story data such as TV shows and movies can serve as an excellent testbed to evaluate human-level AI algorithms from two points of view. First, video data has different modalities such as a sequence of images, audio (including dialogue, sound effects and background music) and text (subtitles or added comments). Second, video data shows various cross-sections of everyday life. Therefore, understanding video stories can be thought of as a significant challenge to current AI technology, which involves analyzing and simulating human vision, language, thinking, and behavior.

This workshop aims to invite experts from various fields, including vision, language processing, multimedia, and speech recognition, to provide their perspective on existing research, and initiate discussion on future challenges in data-driven video understanding. To assess current state-of-the-art methodologies and encourage rapid progress in the field, we will also host a challenge based on the DramaQA dataset, which encourages story-centered video question answering. Topics of interest include but are not limited to:

  • Deep learning architectures for multi-modal video story representation
  • Question answering for video stories
  • Summarization and retrieval from long story video content
  • Scene description generation for video understanding
  • Activity/event recognition from video
  • Character identification & interaction modeling in video
  • Scene graph generation and relationship detection from video
  • Emotion recognition in video
  • Novel tasks and challenge datasets for video understanding

This workshop will invite leading researchers from various fields. We also encourage paper submissions to both the archival and non-archival tracks. All accepted papers will be presented as posters during the workshop and listed on the website. Additionally, a small number of accepted papers will be selected for presentation as contributed talks.

    Submission Instructions

    Note that we provide both archival and non-archival tracks. All submissions will be handled through CMT, at this CMT link.

  • Archival full paper track (up to 14 pages excluding references): All submissions must be in PDF format as a single file (incl. supplementary materials) using this ECCV’20 template. Accepted papers in this track will be published in the ECCV’20 workshop proceedings. The review process is single-round and double-blind. All submissions in this track must be anonymized.
  • Non-archival short paper track (up to 4 pages including references): All submissions must be in PDF format as a single file (incl. supplementary materials) using this ECCV’20 template. Accepted papers in this track will not be published in the ECCV’20 workshop proceedings. Non-archival short paper submissions can share content with a paper under review for ECCV’20 (or any other conference); that is, this track does not conflict with the dual submission policy of ECCV’20. The review process is single-round and double-blind. All submissions in this track must be anonymized.
  • Non-archival long paper track (for published papers from previous conferences): This track is only for previously published papers, or papers set to appear in the main ECCV’20 conference. There is no page limit and no submission template. Accepted papers in this track will not be published in the ECCV’20 workshop proceedings. Submissions in this track do not need to be anonymized.

2. Challenge: DramaQA Challenge

    Please see the challenge page for details.

3. Important Dates

  • Sign up to receive updates: Link
  • Paper submission deadline: July 24, 2020 at 11:59pm (PST) (Extended)
  • Notification to authors: August 7, 2020 (Extended)
  • Camera-ready paper submission deadline: August 14, 2020 at 11:59pm (PST)
  • Workshop date: August 28, 2020 (Full day)

4. Invited Speakers

    -- Live QnA Session-1 at 4 P.M. - 6 P.M., August 27th (PDT)

  • Anna Rohrbach, UC Berkeley
    Title: Towards generating stories about video
    Abstract: Humans can easily tell a story about what they see in a video. Such a story would be relevant, coherent and non-redundant; it would mention distinct person identities and significant objects or places, make use of co-references, etc. Machines are still far from being able to generate such stories. In my talk I will focus on how to make progress towards the following two goals. First, I will talk about how to obtain more coherent and diverse multi-sentence video descriptions. Next, I will discuss how to connect video descriptions to person identities.

  • Mohamed H. Elhoseiny, KAUST (King Abdullah University of Science and Technology)
    Title: Imagination supervised visiolinguistic Learning
    Abstract: TBA

  • Yejin Choi, University of Washington & Allen Institute for Artificial Intelligence
    Title: Commonsense Modeling with Vision and Language
    Abstract: Despite considerable advances in deep learning, AI remains narrow and brittle. One fundamental limitation comes from its lack of commonsense intelligence: reasoning about everyday situations and events, which, in turn, requires knowledge about how the physical and social world works. In this talk, I will share some of our recent efforts that attempt to crack commonsense intelligence with emphasis on intuitive causal inferences and the synergistic connection between declarative knowledge in multimodal knowledge graphs and observed knowledge in unstructured web data. I will conclude the talk by discussing major open research questions, including the importance of algorithmic solutions to reduce incidental biases in data that can lead to overestimation of true AI capabilities.

    -- Live QnA Session-2 at 6 A.M. - 8 A.M., August 28th (PDT)

  • Adriana Kovashka, University of Pittsburgh
    Title: Reasoning about Complex Media from Weak Multi-modal Supervision
    Abstract: In a world of abundant information targeting multiple senses, and increasingly powerful media, we need new mechanisms to model content. Techniques for representing individual channels, such as visual data or textual data, have greatly improved, and some techniques exist to model the relationship between channels that are “mirror images” of each other and contain the same semantics. However, multimodal data in the real world contains little redundancy; the visual and textual channels complement each other. We examine the relationship between multiple channels in complex media, in two domains: advertisements and political articles. We develop a large annotated dataset of advertisements and public service announcements, covering almost forty topics (ranging from automobiles and clothing, to health and domestic violence). We pose decoding the ads as automatically answering the questions “What should the viewer do, according to the ad” (the suggested action), and “Why should the viewer do the suggested action, according to the ad” (the suggested reason). We collect annotations and train a variety of algorithms to choose the appropriate action-reason statement, given the ad image and potentially a slogan embedded in it. The task is challenging because of the great diversity in how different users annotate an ad, even if they draw similar conclusions. One approach mines information from external knowledge bases, but there is a plethora of information that can be retrieved yet is not relevant. We show how to automatically transform the training data in order to focus our approach’s attention on relevant facts, without relevance annotations for training. We also present an approach for learning to recognize new concepts given supervision only in the form of noisy captions. Next, we collect a dataset of multimodal political articles containing lengthy text and a small number of images. We learn to predict the political bias of the article, as well as perform cross-modal retrieval. To better understand political bias, we use generative modeling to show how the face of the same politician appears differently at each end of the political spectrum. To understand how image and text contribute to persuasion and bias, we learn to retrieve sentences for a given image, and vice versa. The task is challenging because, unlike image-text in captioning, the images and text in political articles overlap in only a very abstract sense. To better model the visual domain, we leverage the semantic domain. Specifically, when performing retrieval, we impose a loss requiring images that correspond to similar text to live close by in a projection space, even if they appear very diverse purely visually. We show that our loss significantly improves performance in conjunction with a variety of existing recent losses.

  • Makarand Tapaswi, Inria Paris
    Title: Machine Understanding of Social Situations
    Abstract: There is a growing interest in AI to build socially intelligent robots. This requires machines to have the ability to read people's emotions, motivations, and other factors that affect behavior. Towards this goal, I will discuss three directions from our recent work. First, as one of the building blocks towards video analysis, I will show how video face clustering can be performed in a realistic setting, without knowing the number of characters and considering all background characters. Second, I will introduce MovieGraphs, a dataset that provides detailed, graph-based annotations of social situations depicted in movie clips. I will show how common sense can emerge by analyzing various social aspects of movies, and also describe applications ranging from querying videos with graphs, interaction understanding via order, and reason prediction. Finally, I will present our recent work on analyzing whether machines can recognize and benefit from the joint interplay of interactions and relationships between movie characters.

  • Marco Gori, University of Siena
    Title: Ten Questions for a Theory of Vision
    Abstract: By and large, the remarkable progress in visual recognition and understanding in the last few years is attributed to the availability of huge labelled datasets paired with strong and suitable computational resources. This has opened the doors to the massive use of deep learning, which has led to remarkable improvements on common benchmarks. While subscribing to this view, in this talk I claim that the time has come to begin working towards a deeper understanding of visual computational processes, which, instead of being regarded as applications of general purpose machine learning algorithms, are likely to require appropriate learning schemes. The major claim is that nowadays object recognition problems turn out to be significantly more difficult than the one offered by nature. This is due to learning algorithms that are working on images while neglecting the crucial role of frame temporal coherence, which naturally leads to the principle of motion invariance. I advocate an in-depth re-thinking of the discipline by shifting to truly visual environments, where time plays a central role. In order to provide a well-posed formulation, I propose addressing ten questions for a good theory of vision, which must be well-suited to better grasp the connection with language.

5. Organizers

    Yu-Jung Heo
    Seoul National University
    Seongho Choi
    Seoul National University
    Kyoung-Woon On
    Seoul National University
    Minsu Lee
    Seoul National University
    Vicente Ordonez
    University of Virginia
    Leonid Sigal
    University of British Columbia
    Chang Dong Yoo
    Gunhee Kim
    Seoul National University
    Marcello Pelillo
    University of Venice
    Byoung-Tak Zhang
    Seoul National University

    © 2020 Video Intelligence Center @ Seoul National University