Bilinear Attention Networks for VizWiz Grand Challenge 2018


The VizWiz challenge addresses two tasks: visual question answering and visual question answerability. Unlike conventional VQA datasets, the VizWiz dataset was collected by visually impaired people using mobile phones and has the following characteristics: (1) the images contain a significant amount of blur and unstable lighting, while the questions are conversational in style; (2) the questions can be unrelated to the images because of the limited visual information. In this paper, we propose bilinear attention networks (BAN), which exploit bilinear interactions between multimodal input channels, followed by a two-layer MLP for each task, trained with a joint loss. Experimental results on the VizWiz dataset show that the proposed method significantly outperforms previous methods.
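The pipeline described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the low-rank form of the bilinear attention, the reuse of the same projections `U` and `V` for attention and pooling, the omission of biases, the random weights, and the unweighted sum of the two losses are all assumptions made here for illustration, and every function name is hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def rand_matrix(rows, cols, rng):
    """Random weights stand in for learned parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def bilinear_attention(img, qst, U, V, p):
    """Return (attention map, joint feature) for one image-question pair."""
    xu = matmul(img, U)  # image regions projected to rank k
    yv = matmul(qst, V)  # question tokens projected to rank k
    # Low-rank bilinear logit for every (region, token) pair.
    logits = [[sum(pk * a * b for pk, a, b in zip(p, xi, yj)) for yj in yv]
              for xi in xu]
    # Softmax over all pairs, shifted by the maximum for numerical stability.
    top = max(max(row) for row in logits)
    exps = [[math.exp(v - top) for v in row] for row in logits]
    total = sum(sum(row) for row in exps)
    att = [[e / total for e in row] for row in exps]
    # Attention-weighted bilinear pooling into one k-dimensional vector.
    joint = [sum(att[i][j] * xu[i][c] * yv[j][c]
                 for i in range(len(xu)) for j in range(len(yv)))
             for c in range(len(p))]
    return att, joint


def mlp(x, W1, W2):
    """Two-layer MLP with a ReLU hidden layer (biases omitted for brevity)."""
    hidden = [max(0.0, h) for h in matmul([x], W1)[0]]
    return matmul([hidden], W2)[0]


def joint_loss(ans_logits, ans_target, able_logit, able_target):
    """Answer cross-entropy plus answerability binary cross-entropy."""
    top = max(ans_logits)
    log_z = top + math.log(sum(math.exp(v - top) for v in ans_logits))
    ce = log_z - ans_logits[ans_target]
    # Numerically stable binary cross-entropy computed from a raw logit.
    bce = (max(able_logit, 0.0) - able_logit * able_target
           + math.log1p(math.exp(-abs(able_logit))))
    return ce + bce  # equal weighting is an assumption


rng = random.Random(0)
regions, tokens, dx, dy, k, hidden, n_answers = 4, 3, 6, 5, 8, 16, 10
img = rand_matrix(regions, dx, rng)  # stand-in for image region features
qst = rand_matrix(tokens, dy, rng)   # stand-in for question token features
U, V = rand_matrix(dx, k, rng), rand_matrix(dy, k, rng)
p = [rng.uniform(-0.5, 0.5) for _ in range(k)]

att, joint = bilinear_attention(img, qst, U, V, p)
ans_logits = mlp(joint, rand_matrix(k, hidden, rng), rand_matrix(hidden, n_answers, rng))
able_logit = mlp(joint, rand_matrix(k, hidden, rng), rand_matrix(hidden, 1, rng))[0]
loss = joint_loss(ans_logits, 2, able_logit, 1.0)
```

Both heads read the same joint feature, so a gradient step on the summed loss would update the shared attention parameters for answering and answerability together; that sharing is the point of training with a joint loss.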

Theresianum 601, TU München, Munich, Germany