Jin-Hwa Kim

PhD, Research Scientist

NAVER AI Lab, South Korea

Biography

Jin-Hwa Kim has been the Leader of Generation Research at NAVER AI Lab since August 2021 and a Guest Associate Professor at the Artificial Intelligence Institute of Seoul National University (SNU AIIS) since August 2022. He has studied multimodal deep learning, multimodal generation, ethical and safe AI, and other related topics. In 2018, he received a Ph.D. from Seoul National University under the supervision of Professor Byoung-Tak Zhang for his work on “Multimodal Deep Learning for Visually-grounded Reasoning.” In September 2017, he received the 2017 Google Ph.D. Fellowship in Machine Learning and the Ph.D. Completion Scholarship from Seoul National University, and he was a runner-up in the VQA Challenge 2018 at the CVPR 2018 VQA Challenge and Visual Dialog Workshop. He was a Research Intern at Facebook AI Research (Menlo Park, CA), mentored by Yuandong Tian, Devi Parikh, and Dhruv Batra, from January to May 2017. He worked for SK Telecom (August 2018 to July 2021) and SK Communications (January 2011 to October 2012).

Notice

📣 We’re hiring research and engineering interns and full-time research scientists.
You can find my recent publications on my Google Scholar or CV. I must apologize for not keeping it up-to-date due to other commitments. 🙏

Teaching

Co-lectured 4190.773: Multimodal Deep Learning Theories and Applications at SNU (Fall 2025) with Sangdoo Yun.
Co-lectured 4190.773: Multimodal Deep Learning Theories and Applications at SNU (Fall 2024) with Jiyoung Lee and Sangdoo Yun.
Co-lectured 4190.773: Multimodal Deep Learning Theories and Applications at SNU (Fall 2023) with Jiyoung Lee.

Academic Services

Area Chair for ICML 2025, NeurIPS 2024-2025, and ACL ARR (June 2024)
Co-organizer for ICML 2023 Social - ML in Korea 🤖🇰🇷🌺
Reviewer for NeurIPS 2018-2023, ICLR 2019, 2021-2023, ICML 2019-2021, 2024, and COLING 2022
Reviewer for Neural Networks (2020)
Reviewer for IEEE Transactions on Neural Networks and Learning Systems (2019)
Topic Editor for the Frontiers Research Topic “Identifying, Analyzing, and Overcoming Challenges in Vision and Language Research”
Program Committee for the 1st Workshop on Video-Language Models at NeurIPS 2024
Program Committee for the 2nd Workshop on Video Turing Test at ECCV 2020
Program Committee for the 1st Workshop on Video Turing Test at ICCV 2019
Program Committee for Workshop on SiVL at ECCV 2018

Interests

Multimodal Deep Learning
Multimodal Generation
Ethical and Safe AI

Education

Ph.D. in Computer Science and Engineering, 2018
Seoul National University

M.Sc. in Computer Science and Engineering, 2015
Seoul National University

B.Eng. in Computer Software, 2011
Kwangwoon University

Featured Publications

Polyhedral Complex Derivation from Piecewise Trilinear Networks

Recent advancements in deep neural network visualization reveal insights into mesh extraction from Continuous Piecewise Affine (CPWA) functions, while neural surface representation learning addresses spectral bias but complicates CPWA-based techniques. We focus on trilinear interpolation as positional encoding, showing how hypersurfaces transform into flat planes under the eikonal constraint and introducing a method for approximating intersection points among three hypersurfaces. Empirical validation confirms the correctness and efficiency of our approach through chamfer and angular distance, highlighting the link between eikonal loss and hypersurface planarity.

Jin-Hwa Kim

In NeurIPS, 2024

Preprint Code

Mutual Information Divergence: A Unified Metric for Multimodal Generative Models

Text-to-image generation and image captioning have recently emerged as a new experimental paradigm to assess machine intelligence. They predict continuous quantities accompanied by their sampling techniques in the generation, making evaluation complicated and marginal distributions intractable to obtain. Based on a recent trend that multimodal generative evaluations exploit a vision-and-language pre-trained model, we propose the negative Gaussian cross-mutual information using the CLIP features as a unified metric, coined Mutual Information Divergence (MID). To validate, we extensively compare it with competing metrics using carefully-generated or human-annotated judgments in text-to-image generation and image captioning tasks. The proposed MID significantly outperforms the competitive methods by having consistency across benchmarks, sample parsimony, and robustness toward the exploited CLIP model. We look forward to seeing the underrepresented implications of the Gaussian cross-mutual information in multimodal representation learning and future work based on this novel proposition.

Jin-Hwa Kim, Yunji Kim, Jiyoung Lee, Kang Min Yoo, Sang-Woo Lee

In NeurIPS, 2022

Preprint PDF Code Dataset Poster Slides

Reasoning Visual Dialog with Sparse Graph Learning and Knowledge Transfer

Visual dialog is a task of answering a sequence of questions grounded in an image utilizing a dialog history. Previous studies have implicitly explored the problem of reasoning semantic structures among the history using softmax attention. However, we argue that the softmax attention yields dense structures that could distract from answering questions requiring partial or even no contextual information. In this paper, we formulate the visual dialog tasks as graph structure learning tasks. To tackle the problem, we propose Sparse Graph Learning Networks (SGLNs) consisting of a multimodal node embedding module and a sparse graph learning module. The proposed model explicitly learns sparse dialog structures by incorporating binary and score edges, leveraging a new structural loss function. Then, it finally outputs the answer, updating each node via a message passing framework. As a result, the proposed model outperforms the state-of-the-art approaches on the VisDial v1.0 dataset, only using 10.95% of the dialog history, and also improves interpretability compared to baseline methods.

Gi-Cheon Kang, Junseok Park, Hwaran Lee, Byoung-Tak Zhang, Jin-Hwa Kim

In Findings of EMNLP, 2021

Preprint Code

CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication

In this work, we propose a goal-driven collaborative task that combines language, perception, and action. Specifically, we develop a Collaborative image-Drawing game between two agents, called CoDraw. Our game is grounded in a virtual world that contains movable clip art objects. The game involves two players: a Teller and a Drawer. The Teller sees an abstract scene containing multiple clip art pieces in a semantically meaningful configuration, while the Drawer tries to reconstruct the scene on an empty canvas using available clip art pieces. The two players communicate with each other using natural language. We collect the CoDraw dataset of ~10K dialogs consisting of ~138K messages exchanged between human players. We define protocols and metrics to evaluate learned agents in this testbed, highlighting the need for a novel crosstalk evaluation condition which pairs agents trained independently on disjoint subsets of the training data. We present models for our task and benchmark them using both fully automated evaluation and by having them play the game live with humans.

Jin-Hwa Kim†, Nikita Kitaev†, Xinlei Chen, Marcus Rohrbach, Byoung-Tak Zhang, Yuandong Tian, Dhruv Batra, Devi Parikh

In ACL, 2019

Preprint Code

Bilinear Attention Networks

In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly. BAN considers bilinear interactions among two groups of input channels, while low-rank bilinear pooling extracts the joint representations for each pair of channels. Furthermore, we propose a variant of multimodal residual networks to exploit eight-attention maps of the BAN efficiently. We quantitatively and qualitatively evaluate our model on visual question answering (VQA 2.0) and Flickr30k Entities datasets. With a simple ensemble of BANs, we were runners-up in the 2018 VQA Challenge, while BAN was a winner among the single-model entries.

Jin-Hwa Kim, Jaehyun Jun, Byoung-Tak Zhang

In NeurIPS, 2018

Preprint Code Poster Slides Video

Recent Publications

Jin-Hwa Kim. Polyhedral Complex Derivation from Piecewise Trilinear Networks. In NeurIPS, 2024.

Preprint Code

Jin-Hwa Kim, Yunji Kim, Jiyoung Lee, Kang Min Yoo, Sang-Woo Lee. Mutual Information Divergence: A Unified Metric for Multimodal Generative Models. In NeurIPS, 2022.

Preprint PDF Code Dataset Poster Slides

Gi-Cheon Kang, Junseok Park, Hwaran Lee, Byoung-Tak Zhang, Jin-Hwa Kim. Reasoning Visual Dialog with Sparse Graph Learning and Knowledge Transfer. In Findings of EMNLP, 2021.

Preprint Code

Jin-Hwa Kim, Junyoung Park, Yongseok Choi. Multi-step Estimation for Gradient-based Meta-learning. In arXiv, 2020.

Preprint

Jin-Hwa Kim†, Nikita Kitaev†, Xinlei Chen, Marcus Rohrbach, Byoung-Tak Zhang, Yuandong Tian, Dhruv Batra, Devi Parikh. CoDraw: Collaborative Drawing as a Testbed for Grounded Goal-driven Communication. In ACL, 2019.

Preprint Code

Kyungmin Kim, Seong-Ho Choi, Jin-Hwa Kim, Byoung-Tak Zhang. Multimodal Dual Attention Memory for Video Story Question Answering. In ECCV, 2018.

PDF

Jin-Hwa Kim, Jaehyun Jun, Byoung-Tak Zhang. Bilinear Attention Networks. In NeurIPS, 2018.

Preprint Code Poster Slides Video

Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, Byoung-Tak Zhang. Overcoming Catastrophic Forgetting by Incremental Moment Matching. In NIPS (Spotlight), 2017.

Preprint PDF Code Poster

Jin-Hwa Kim, Kyoung-Woon On, Woosang Lim, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak Zhang. Hadamard Product for Low-rank Bilinear Pooling. In ICLR, 2017.

Preprint PDF Code Slides

Jin-Hwa Kim, Sang-Woo Lee, Donghyun Kwak, Min-Oh Heo, Jeonghee Kim, Jung-Woo Ha, Byoung-Tak Zhang. Multimodal Residual Learning for Visual QA. In NIPS, 2016.

Preprint PDF Code Poster Video