The advancement of deep learning in computer vision and natural language processing has led to remarkable progress in visually-grounded language learning. In particular, visual question answering (VQA) has become one of the most active and popular research topics in computer vision since the VQA Challenge began in 2016. Beyond VQA, MovieQA has also emerged as a challenging new multimodal semantic learning task. In this talk, we introduce two visually-grounded question answering tasks, VQA and MovieQA. The talk covers the task definitions, data preparation, and various state-of-the-art approaches for both multimodal QA tasks. In addition, we explain the challenging issues of these tasks and discuss their future directions.
This tutorial is jointly presented with Jung-Woo Ha (NAVER Corp.) and Kyung Min Kim (Seoul National University).