Multimodal deep learning provides a way to interleave computer vision and natural language processing. At the heart of this, learning joint representations of vision and language is critical to designing an efficient model. I will review recent work on multimodal pooling methods in terms of tensor operations, the progression from linear to bilinear models, and various attentional networks. I hope this talk sheds light on one facet of how the deep learning community has responded to this challenging mission, and opens a discussion of the limitations of current multimodal learning approaches for these tasks.
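As a minimal sketch of the bilinear pooling idea mentioned above (the function name, dimensions, and features below are hypothetical, not drawn from any particular paper): given a visual feature vector and a language feature vector, bilinear pooling takes their outer product so that the joint representation captures every pairwise feature interaction, at the cost of a large output dimension that motivates compact and low-rank variants.

```python
import numpy as np

def bilinear_pool(v: np.ndarray, q: np.ndarray) -> np.ndarray:
    """Joint representation via the outer product of a visual feature
    vector v of shape (d_v,) and a language feature vector q of shape (d_q,).

    Captures all pairwise interactions v_i * q_j; the resulting
    d_v * d_q output size is the blow-up that compact bilinear and
    low-rank factorized methods aim to reduce.
    """
    return np.outer(v, q).ravel()

# Hypothetical feature dimensions, for illustration only.
v = np.random.randn(2048)  # e.g., CNN image features
q = np.random.randn(300)   # e.g., word-embedding question features
z = bilinear_pool(v, q)    # 2048 * 300 = 614,400-dim joint vector
```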