DPO

19 March 2026

Why DPO is a Misspecified Estimator and How to Fix It

ICLR'26 Oral

๐Ÿ’กDPO์˜ ์ „์ œ๊ฐ€ realisticํ•˜์ง€ ์•Š์Œ์„ ์œ„์ƒํ•™์ ์œผ๋กœ ํŒŒํ—ค์นจ AuxDPO๋ฅผ ํ†ตํ•ด DPO์˜ Misspecifection๋ฅผ ์™„ํ™”ํ•˜์ž!

최민영
19 March 2026

SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

ICLR'26 Oral

๐Ÿ’กPreference Alignment์—์„œ ์•ˆ์ „(์œ„ํ—˜ํ•œ ๋‹ตX)์„ ๊ฐ•ํ•˜๊ฒŒ ๋ณด์žฅํ•˜๋ฉด์„œ๋„, ๊ธฐ์กด RLHF์ฒ˜๋Ÿผ ๋ณต์žกํ•œ ํŒŒ์ดํ”„๋ผ์ธ ์—†์ด DPO์ฒ˜๋Ÿผ ๊ฐ„๋‹จํ•˜๊ฒŒ ๋ชจ๋ธ์„ ์ •๋ ฌํ•˜๋Š” ๋ฐฉ๋ฒ•์ธ SafeDPO ๋ฅผ ์ œ์‹œ๊ธฐ์กด์˜ ๋ณด์ƒ ํ•จ์ˆ˜๋ฅผ ์žฌ์ •์˜ํ•˜๊ณ , ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ์žฌ์ •๋ ฌํ•ด ๋ชจ๋ธ์ด ์•ˆ์ „ํ•œ ๋‹ต์„ ์ผ๊ด€๋˜๊ฒŒ ๋” ์„ ํ˜ธํ•˜๋„๋ก ํ•จ

19 March 2026

Multiplayer Nash Preference Optimization

ICLR'26 Poster

💡The goal of alignment should not be maximizing a reward, but reaching a stable equilibrium that loses to no one across a population of many values and policies!

19 March 2026

Beyond Pairwise: Empowering LLM Alignment With (Ranked) Choice Modeling

ICLR'26 Poster

💡Methods like RLHF and DPO are built around pairwise preference optimization, so they pass up the chance to learn from richer human feedback. ⇒ Rank responses beyond pairs and train the model on those full rankings.
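A standard way to go from pairwise comparisons to full rankings is a listwise Plackett–Luce likelihood, which reduces to the Bradley–Terry pairwise model for lists of length two — a minimal sketch assuming per-response scalar scores (whether this is the paper's exact objective is not stated in the summary):

```python
import math

def plackett_luce_nll(scores_ranked):
    """Negative log-likelihood of a full ranking under Plackett-Luce.

    scores_ranked: model scores for the responses, ordered from the
    human's most-preferred to least-preferred. Each step selects the
    next item via a softmax over the items not yet chosen.
    """
    nll = 0.0
    for i in range(len(scores_ranked) - 1):
        remaining = scores_ranked[i:]
        denom = sum(math.exp(s) for s in remaining)
        nll -= scores_ranked[i] - math.log(denom)
    return nll
```

With two items this is exactly the pairwise negative log-sigmoid loss; with longer lists each ranked response contributes a term, which is the extra supervision the summary argues pairwise methods throw away.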