Learning Representations of Vision and Language

Abstract

Multimodal deep learning provides a way to bridge computer vision and natural language processing. At the heart of this, learning joint representations of vision and language is critical to designing effective models. I will review recent work on multimodal pooling methods, covering tensor operations, the progression from linear to bilinear models, and various attentional networks. Through this, I hope the talk helps the audience understand a facet of how the deep learning community has responded to this challenging mission, and opens a discussion of the limitations of current multimodal learning approaches on these challenging tasks.
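
To make the pooling idea concrete, below is a minimal sketch of low-rank bilinear (Hadamard-product) pooling, one common way to fuse a visual feature with a language feature. The class name, dimensions, and activation choice are illustrative assumptions, not the specific method presented in the talk.

```python
import torch
import torch.nn as nn

class LowRankBilinearPooling(nn.Module):
    """Illustrative sketch: fuse a vision vector v and a language vector q.

    Full bilinear pooling v^T W q requires a large third-order tensor W;
    the low-rank factorization replaces it with two linear projections
    followed by an elementwise (Hadamard) product.
    """
    def __init__(self, v_dim, q_dim, joint_dim, out_dim):
        super().__init__()
        self.proj_v = nn.Linear(v_dim, joint_dim)  # project vision features
        self.proj_q = nn.Linear(q_dim, joint_dim)  # project language features
        self.out = nn.Linear(joint_dim, out_dim)   # map fused vector to output space

    def forward(self, v, q):
        joint = torch.tanh(self.proj_v(v)) * torch.tanh(self.proj_q(q))
        return self.out(joint)

# Usage with illustrative sizes (e.g., CNN image features and an RNN question encoding):
fuse = LowRankBilinearPooling(v_dim=2048, q_dim=1024, joint_dim=512, out_dim=1000)
logits = fuse(torch.randn(8, 2048), torch.randn(8, 1024))
```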

Date
Location
COEX Convention Center 317A
Links