- Title
- [Seminar] Frontiers of Offline Interactive Machine Learning: From Contextual Bandits to LLM Alignment (Kwang-Sung Jun, Ph.D.)
- Author
- School of Advanced Computing
- Date posted
- 2025.11.07
- Last modified
- 2025.11.07
- Category
- Seminar
- Post content
-
Date & Time: November 12, 2025 (Wednesday), 2:00 PM
Venue: Engineering Building 2, Room B041
Speaker: Kwang-Sung Jun, Ph.D. (Assistant Professor in the Department of Computer Science at the University of Arizona)
Title: Frontiers of Offline Interactive Machine Learning: From Contextual Bandits to LLM Alignment
Abstract:
At the heart of modern machine learning lies a fundamental challenge: how can an intelligent system not just learn from data but also decide which data to collect for learning? This is the essence of interactive machine learning (IML) -- a paradigm that encompasses reinforcement learning, contextual bandits, and active learning. Recently, the offline version of IML has gained popularity because the traditional online version often cannot be run due to real-world constraints. In this talk, I will present two recent advances in offline IML. First, I will discuss the contextual bandit problem, which has applications in recommendation systems. I will show how an improved confidence bound for [0,∞)-valued random variables translates into a superior learning algorithm, both in theory and in practice. Second, I will show that the LLM alignment problem is an instance of offline IML and that existing training objectives for it lack theoretical justification, leaving us wondering whether they are the right ones to use. As such, I will present a novel theoretical framework for alignment from which three different alignment algorithms are derived along with theoretical guarantees, which is a strong form of justification. Surprisingly, two of them are very similar to the existing algorithms Direct Preference Optimization (DPO) and reinforcement learning from human feedback (RLHF), respectively. Together with our theoretical guarantees, our work can be seen as providing theoretical justification for DPO and RLHF, with minor corrections. Furthermore, our theory confirms the existing empirical finding that RLHF performs better than DPO. I will conclude with empirical results and exciting future research directions.
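For readers unfamiliar with DPO, the sketch below illustrates the standard per-example DPO objective (the widely used Rafailov et al. formulation, not the corrected variant the talk derives). The function name and the example log-probabilities are hypothetical, chosen only for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard per-example DPO loss: -log sigmoid(beta * reward margin).

    Each response's implicit reward is its log-probability under the policy
    minus its log-probability under the frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(x)) rewritten as log(1 + exp(-x))
    return math.log(1.0 + math.exp(-beta * margin))

# Hypothetical sequence log-probabilities: the policy has shifted mass toward
# the chosen response relative to the reference, so the loss is small.
loss = dpo_loss(logp_chosen=-10.0, logp_rejected=-15.0,
                ref_logp_chosen=-12.0, ref_logp_rejected=-11.0)
print(loss)
```

The temperature-like parameter beta controls how strongly the policy is pulled away from the reference model; the loss shrinks as the policy widens the margin between chosen and rejected responses.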
Bio:
Kwang-Sung Jun is an assistant professor in the Department of Computer Science at the University of Arizona. His research interest is interactive machine learning. In particular, he is interested in the theoretical aspects of bandits, active learning, confidence bounds, and, more recently, emerging problems in LLMs, along with their practical implications. (Personal website: https://kwangsungjun.github.io)


