Machine learning for computer vision and natural language processing is accelerating the advancement of artificial intelligence. Since vision and natural language are the primary modalities of human interaction, understanding and reasoning over both has become a key challenge for machine intelligence. This talk explores advances in multimodal deep learning models for visual question answering: multimodal residual networks (MRN), multimodal low-rank bilinear attention networks (MLB), and the recently proposed bilinear attention networks (BAN). To answer correctly, a model must learn a joint representation of visual and linguistic information. We will discuss three essential principles for obtaining that joint representation effectively and efficiently: residual learning in multimodality, low-rank bilinear approximation, and a bilinear attention method that considers every interaction between multimodal channels.
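
As a rough illustration of the low-rank bilinear approximation named in the abstract, the NumPy sketch below replaces each full bilinear weight matrix W_i with a factored form U diag(p_i) V^T, so the output can be computed from two projections and an elementwise product. The function name, variable names, and dimensions are assumptions chosen for illustration, and the sketch leaves out the nonlinearities and attention that the models in the talk build on top of this idea.

```python
import numpy as np

def low_rank_bilinear(x, y, U, V, P):
    """Low-rank bilinear pooling: f = P^T ((U^T x) * (V^T y)).

    Each output f_i equals the bilinear form x^T W_i y with
    W_i = U diag(P[:, i]) V^T, which has rank at most d.
    """
    return P.T @ ((U.T @ x) * (V.T @ y))

# Hypothetical sizes: visual dim n, question dim m, rank d, output dim c.
rng = np.random.default_rng(0)
n, m, d, c = 6, 5, 3, 2
x = rng.standard_normal(n)       # stand-in for a visual feature vector
y = rng.standard_normal(m)       # stand-in for a question feature vector
U = rng.standard_normal((n, d))  # projection for the visual input
V = rng.standard_normal((m, d))  # projection for the question input
P = rng.standard_normal((d, c))  # pooling matrix onto the output

f_low_rank = low_rank_bilinear(x, y, U, V, P)

# Reference: build each full n-by-m weight matrix explicitly and
# evaluate the bilinear form directly.
f_full = np.array([x @ (U @ np.diag(P[:, i]) @ V.T) @ y for i in range(c)])
```

The factored form stores d(n + m + c) parameters instead of the n·m·c a full bilinear tensor would need, which is the efficiency argument behind the approximation.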