Featured Publications

We propose a video story question-answering (QA) architecture, Multimodal Dual Attention Memory (MDAM). The key idea is to use a dual attention mechanism with late fusion. MDAM uses self-attention to learn the latent concepts in scene frames and captions. Given a question, MDAM applies a second attention over these latent concepts. Multimodal fusion is performed after the dual attention processes (late fusion). Using this processing pipeline, MDAM learns to infer a high-level vision-language joint representation from an abstraction of the full video content. We evaluate MDAM on the PororoQA and MovieQA datasets, which have large-scale QA annotations on cartoon videos and movies, respectively. On both datasets, MDAM achieves new state-of-the-art results by significant margins over the runner-up models. Ablation studies confirm that the dual attention mechanism combined with late fusion performs best. We also perform qualitative analysis by visualizing the inference mechanisms of MDAM.
In ECCV, 2018

In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions between two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit eight attention maps of the BAN efficiently. We quantitatively and qualitatively evaluate our model on the visual question answering (VQA 2.0) and Flickr30k Entities datasets. With a simple ensemble of BANs, we were the runners-up in the 2018 VQA Challenge, while BAN was the winner among the single-model entries.
In NIPS, 2018

In this work, we propose a goal-driven collaborative task that contains vision, language, and action in a virtual environment as its core components. Specifically, we develop a collaborative ‘Image Drawing’ game between two agents, called CoDraw. We collect the CoDraw dataset of ~10K dialogs consisting of 138K messages exchanged between a Teller and a Drawer from Amazon Mechanical Turk (AMT).
In arXiv, 2017

Recent Publications

Multimodal Dual Attention Memory for Video Story Question Answering. In ECCV, 2018.


Bilinear Attention Networks. In NIPS, 2018.

CoDraw: Visual Dialog for Collaborative Drawing. In arXiv, 2017.

Overcoming Catastrophic Forgetting by Incremental Moment Matching. In NIPS (Spotlight), 2017.

Hadamard Product for Low-rank Bilinear Pooling. In ICLR, 2017.

Multimodal Residual Learning for Visual QA. In NIPS, 2016.

Recent & Upcoming Talks

Bilinear Attention Networks for VizWiz Grand Challenge 2018
Sep 14, 2018 10:50 AM
Multimodal Deep Learning
Sep 6, 2018 2:50 PM
Multimodal Deep Learning for Visually-Grounded Reasoning
Jun 27, 2018 1:00 PM
Visually-Grounded Question and Answering: from VisualQA to MovieQA
Jun 26, 2018 8:45 AM
Bilinear Attention Networks for Visual Question Answering
Jun 18, 2018 11:35 AM
Multimodal Deep Learning for Visually-Grounded Reasoning
Mar 28, 2017 1:00 PM

Recent Posts