Attention networks, inspired by the human attention mechanism, reduce the redundancy in multi-source information for a given task. To exploit this advantage, neural networks adopt attention networks as sub-networks of a model, often in an ad-hoc design for individual tasks. In this talk, I generalize attention networks as a probabilistic model that estimates a probability distribution over multiple sources, which leads to geometric analysis and empirical techniques. Based on this observation, we will explore the recently proposed transformer networks, which consist of self-attention and guided-attention sub-modules. Lastly, I introduce our recent work, Bilinear Attention Networks, applied to a challenging multimodal learning task: visual question answering. This novel attention network, which models the joint probability distribution over visual and linguistic information, achieved a shared second place in the VQA Challenge 2018 and has been recognized as a state-of-the-art VQA model in recent works.
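As a rough sketch of this probabilistic view, the following illustrates dot-product attention, where a softmax over query-source scores yields a probability distribution used to mix the sources. The function name, shapes, and scoring choice here are illustrative assumptions, not details from the talk:

```python
import numpy as np

def attention(query, sources):
    """Attend over multiple sources: softmax scores define a
    probability distribution, which weights the source vectors."""
    # Dot-product score between the query and each source vector
    scores = sources @ query                 # shape: (n_sources,)
    # Softmax (shifted for numerical stability) -> distribution over sources
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Expected source vector under that distribution
    return weights, weights @ sources

# Toy example: the query aligns with the first source, so it gets more mass
weights, mixed = attention(np.array([1.0, 0.0]),
                           np.array([[2.0, 0.0],
                                     [0.0, 2.0]]))
```

In this framing, the attention weights form a proper probability distribution (non-negative, summing to one), which is what makes the geometric analysis mentioned above possible.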