Blog

19 March 2026

Why DPO is a Misspecified Estimator and How to Fix It

ICLR'26 Oral

๐Ÿ’กDPO์˜ ์ „์ œ๊ฐ€ realisticํ•˜์ง€ ์•Š์Œ์„ ์œ„์ƒํ•™์ ์œผ๋กœ ํŒŒํ—ค์นจ AuxDPO๋ฅผ ํ†ตํ•ด DPO์˜ Misspecifection๋ฅผ ์™„ํ™”ํ•˜์ž!

이승환
19 March 2026

Whatโ€™s In My Human Feedback? Learning Interpretable Descriptions of Preference Data

ICLR'26 Oral

💡 Proposes WIMHF, a method that uses an SAE to automatically extract the latent feature axes that decide between two responses in a preference dataset, and explains in interpretable natural language which response characteristics determine human preference.
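A crude illustration of the core idea (not WIMHF's actual pipeline): given SAE feature activations for chosen and rejected responses, rank features by how strongly their mean activation separates the two sides. The function name and input format here are hypothetical:

```python
def preference_feature_axes(chosen_acts, rejected_acts, top_k=3):
    """Rank SAE features by mean activation gap between chosen and
    rejected responses -- a toy proxy for finding interpretable axes
    that drive preference. Each *_acts is a list of per-example
    feature-activation vectors of equal length.
    """
    n_feat = len(chosen_acts[0])
    diffs = []
    for j in range(n_feat):
        mean_c = sum(x[j] for x in chosen_acts) / len(chosen_acts)
        mean_r = sum(x[j] for x in rejected_acts) / len(rejected_acts)
        diffs.append((abs(mean_c - mean_r), j))
    # Largest-gap features first
    return [j for _, j in sorted(diffs, reverse=True)[:top_k]]
```

WIMHF additionally produces natural-language descriptions of the selected features, which this sketch omits.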

최민영
19 March 2026

SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety

ICLR'26 Oral

💡 Presents SafeDPO, a method that strongly guarantees safety (no harmful answers) in preference alignment while keeping training as simple as DPO, without the complex pipeline of conventional RLHF. It redefines the reward function and reorders the training data so that the model consistently prefers safe answers.
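The data-reordering idea can be sketched as follows. This is an illustrative curation step under my own assumptions (per-response safety labels are available), not SafeDPO's exact construction:

```python
def reorder_for_safety(pairs):
    """Illustrative sketch: whenever exactly one response in a
    preference pair is unsafe, make the safe one the 'chosen'
    response, overriding the original helpfulness preference.
    Each pair is (chosen, rejected, chosen_safe, rejected_safe).
    """
    reordered = []
    for chosen, rejected, chosen_safe, rejected_safe in pairs:
        if rejected_safe and not chosen_safe:
            # Swap: the safe answer must win
            reordered.append((rejected, chosen))
        else:
            # Both safe, both unsafe, or already safe-first:
            # keep the original preference
            reordered.append((chosen, rejected))
    return reordered
```

Standard DPO training on the reordered pairs then pushes the model to consistently prefer the safe response, which matches the summary's description of the mechanism.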