19 March 2026

Why DPO is a Misspecified Estimator and How to Fix It

๐Ÿ’กDPO์˜ ์ „์ œ๊ฐ€ realisticํ•˜์ง€ ์•Š์Œ์„ ์œ„์ƒํ•™์ ์œผ๋กœ ํŒŒํ—ค์นจ AuxDPO๋ฅผ ํ†ตํ•ด DPO์˜ Misspecifection๋ฅผ ์™„ํ™”ํ•˜์ž!

๐Ÿฅˆ


Review

Nickname · Strengths & Weaknesses & Suggestions · Rating (0/5)

커피
Strengths: I had assumed DPO was unconditionally good because it bypasses the last, complicated two stages of standard RLHF (reward modeling, RL), but I learned it trades this off against reward accuracy.
Weaknesses: LLMs have a huge number of parameters; is null-space exploration actually tractable in practice? And won't the extra computation DPO picks up here increase cost accordingly…?
Suggestions: A method that explores the null space efficiently while accounting for reward accuracy seems necessary.
Rating: 4.3

코스피
Strengths: I had believed DPO was simply accurate; pointing out that its reliability degrades under the influence of the data distribution itself is this paper's strength.
Weaknesses: The method relies on the null-space difference and the extra degrees of freedom; couldn't errors creep in there?
Suggestions: An optimization scheme that explicitly accounts for the failure modes would also be worthwhile.
Rating: 4.5

얼라
Strengths: The very notion of a misspecified estimator is fascinating. An ICLR Oral, impressive.
Weaknesses: How to initialize and optimize the auxiliary variables placed along the null-space directions is left vague.
Suggestions: Additional validation on larger models and more diverse alignment settings seems needed.
Rating: 4.5

비요뜨
Strengths: Something that sounds very abstract in words is explained well through the figures. The problem caused by DPO's narrow reward/policy space is tackled intuitively by adding variables to increase expressiveness (though the material is definitely hard).
Weaknesses: The null space seems to mean the space where (the reward?) does not change during policy optimization, but the outcome could depend on which null space is chosen.
Suggestions: Curious how the projection changes when using ranking feedback richer than pairwise comparisons.
Rating: 4.2

칫솔
Strengths: Mathematically rigorous error analysis with a proposed fix; strong in theory and with large empirical gains.
Weaknesses: Curious how much the cost increases; is it substantial?
Suggestions: Results on larger models; analysis of, or improvements to, efficiency.
Rating: 4.7

설향딸기
Strengths: Identifies and resolves DPO's approximation problem in a finite space. As always, the result looks intuitive in hindsight, and I wonder how they came up with it. I see it as a way to compensate without using a reward in DPO.
Weaknesses: The null space adds degrees of freedom, but isn't it still equally affected by the data distribution in the end?
Suggestions: Actually, given this paper's methodology and results, I can't find a compelling reason to exclude the reward from DPO. Even if not as the main reward, wouldn't introducing one as an auxiliary reward be more robust to the data distribution?
Rating: 4.3

나스닥
Strengths: Papers that come with proofs, like DPO, usually target practical issues (e.g., better data tends to be more detailed and therefore longer), whereas pinpointing a model-level issue is something only a few can do; they hit a genuinely important problem! Soundness 11/10!
Weaknesses: The baselines feel somewhat outdated!! After doing so well up front, why pick on such easy opponents?
Suggestions: If finiteness is the problem, scaling from small models and datasets up to large ones while showing the gap with DPO would help emphasize the theoretical contribution!
Rating: 5

404
Strengths: Catches a chronic (yet universally overlooked) issue in a training method as influential as DPO. Very clever; their research skill is enviable. They are also excellent at expressing and defending their motivation mathematically and geometrically.
Weaknesses & Suggestions: Experiments on more tasks and a wider range of backbone LLMs would have made the paper richer.
Rating: 4.5

AI
Strengths: It is striking to reinterpret DPO, which at a glance is just an RLHF variant, as a statistical problem of estimating the reward.
Weaknesses: At industrial scale, creating extra variables per data point may be somewhat limiting?
Suggestions: An interpretation from a global perspective, beyond the local region the model moves in, is needed.
Rating: 4.7

국밥
Strengths: Proving mathematically that DPO is misspecified at the design level, not the data level, and expanding the update directions with auxiliary variables is a great idea.
Weaknesses: The LLMs in the experiments are small, so we cannot tell whether large models suffer the same misspecification.
Suggestions: An experiment showing the DPO-vs-AuxDPO gap as a function of model parameter count.
Rating: 4.5

TL;DR

๐Ÿ’ก

DPO์˜ ์ „์ œ๊ฐ€ realisticํ•˜์ง€ ์•Š์Œ์„ ์œ„์ƒํ•™์ ์œผ๋กœ ํŒŒํ—ค์นจ

AuxDPO๋ฅผ ํ†ตํ•ด DPO์˜ Misspecifection๋ฅผ ์™„ํ™”ํ•˜์ž!

Summary

  • Authors: Indian Institute of Science (IISc Bangalore), HP AI Research
  • GitHub: none
  • Citations: 0

Background & Motivation

  • Preference-based alignment
    ๐Ÿ’ก

    given comparison data (s, a_w, a_l), the goal is to shape a policy π whose induced responses align with a latent reward model that generated those preferences.

    ** s: state, a: action, a_w: winning action, a_l: losing action

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model (NIPSโ€™23)
    • the policy model must be trained in two stages:
      • a pretrained model (stage 1)
      • is steered toward the reward model's preferences (stage 2)
    • a separate reward model must be trained as well

    ⇒ the computational cost is huge!!
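
    For reference, the stage-2 "KL-regularized objective" mentioned here is the standard RLHF objective (textbook form, reproduced for context; r_φ is the learned reward model, π_ref the stage-1 model):

    $$\max_{\pi_\theta}\;\mathbb{E}_{s\sim\mathcal{D},\,a\sim\pi_\theta(\cdot\mid s)}\big[r_\phi(s,a)\big]\;-\;\beta\,\mathbb{D}_{\mathrm{KL}}\!\big[\pi_\theta(\cdot\mid s)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid s)\big]$$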

  • DPO (Direct Preference Optimization)
    Direct Preference Optimization: Your Language Model is Secretly a Reward Model (NIPSโ€™23)
    • instead of the stage-2 training (the KL-regularized objective), it directly uses preference data about which responses humans prefer / disprefer
      • how to?

        by reparameterizing the policy formula, the reward function (Eq 4) is expressed through the policy's probability distribution (Eq 5),

        and this approximation turns the stage-2 training (KL-regularized objective) into a single supervised loss (see the reconstruction below)


      โ‡’ 1๋ฒˆ์˜ training๋งŒ์œผ๋กœ ํ•™์Šต

      โ‡’ cost๋ฅผ ์ค„์ผ ์ˆ˜ ์žˆ์–ด, Preference-based alignment ์˜ ๋Œ€์•ˆ์ด ๋จ
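
      Reconstructing Eq 4–5 from the DPO paper cited above: the KL-regularized optimum lets the reward be rewritten in terms of the policy, and plugging that into the BTL preference likelihood cancels the partition function Z(s), leaving a single maximum-likelihood loss:

      $$r(s,a) \;=\; \beta \log\frac{\pi_r(a\mid s)}{\pi_{\mathrm{ref}}(a\mid s)} \;+\; \beta \log Z(s)$$

      $$\mathcal{L}_{\mathrm{DPO}}(\theta) \;=\; -\,\mathbb{E}_{(s,a_w,a_l)}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(a_w\mid s)}{\pi_{\mathrm{ref}}(a_w\mid s)} \;-\; \beta\log\frac{\pi_\theta(a_l\mid s)}{\pi_{\mathrm{ref}}(a_l\mid s)}\right)\right]$$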

    • it approximates KL-regularized policy optimization under the idealized assumption that "the policy class is tabular"

      ** policy class: the conditional distribution defined by the neural network (θ), i.e., the distribution over outputs given an input

      ** tabular: the value (e.g., reward) for each particular row (e.g., input) and column (e.g., output) can be laid out like a table!

      That is, the policy class is a tabular class able to represent every conditional distribution, carrying a free conditional probability (π(a|s))_{s,a} for every (input s, output a) pair

      ⇒ in reality this is not the case!!

      why? Transformers are neural architectures, so the number of parameters is finite! (non-tabular; a toy contrast is sketched below)
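
      To make tabular vs. non-tabular concrete, here is a toy numpy sketch (my own illustration, not from the paper; the features below are made up): a tabular policy has one free logit per (s, a) cell and can match any target conditionals exactly, while a low-dimensional parametric policy cannot.

      ```python
      import numpy as np

      # Tabular policy over |S| = 2 prompts and |A| = 3 responses:
      # one free logit per (s, a) cell, so ANY target conditional
      # distribution can be matched exactly.
      target = np.array([[0.7, 0.2, 0.1],
                         [0.1, 0.3, 0.6]])
      tabular_logits = np.log(target)             # one parameter per cell
      tabular_policy = np.exp(tabular_logits)
      tabular_policy /= tabular_policy.sum(axis=1, keepdims=True)
      assert np.allclose(tabular_policy, target)  # tabular class: exact fit

      # A non-tabular policy with a single shared parameter theta and
      # fixed (hypothetical) features only sweeps a 1-d curve of
      # distributions, so for most targets no theta reproduces them.
      features = np.array([[1.0, 0.0, -1.0],
                           [0.5, -1.0, 0.5]])

      def parametric_policy(theta):
          logits = theta * features
          p = np.exp(logits - logits.max(axis=1, keepdims=True))
          return p / p.sum(axis=1, keepdims=True)

      errs = [np.abs(parametric_policy(t) - target).max()
              for t in np.linspace(-5, 5, 1001)]
      print(f"best achievable max error: {min(errs):.3f}")  # > 0: misspecified
      ```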

  • Main Motivation
    Is minimizing the DPO loss over a non-tabular policy class equivalent to full two-stage RLHF?
    If not, how does it differ from the ideal RLHF-optimal policy?
    
    Is there any guarantee that it matches the performance of the ideal RLHF-optimal policy?
    If not, is there a fix?

Contributions (What theyโ€™ve revealed)

(Left) DPO essentially performs a projection of the true preference-generating reward function (r* in black) onto the manifold of reward functions implicitly expressed by the policy class. If r* is in the manifold, then DPO finds the correct KL-regularized RLHF policy, but otherwise, the policy found (any orange point) is unreliable. (Right, zoomed inset) Locally linearizing the manifold around the base policy's implicit reward function (r_{θ_0}) uncovers geometric insights. To reliably drive the solution to the reward function corresponding to the ideal RLHF solution (r_{θ_RLHF} in blue), AuxDPO introduces additional controlled degrees of freedom along the null space of a base-policy-dependent matrix to sidestep misspecification.

DPO projects the reward function that actually generated the preferences (r*; black dot) onto the manifold of reward functions implicitly expressed by the policy class (DPO's implicit reward manifold; green curve).

  • Ideally r* lies on the manifold, but when it does not, the projection is influenced by the data distribution and lands on an unreliable solution (orange dot)
  • this is not caused by noisy data: the solution is unreliable because it is shaped by the distribution itself!

⇒ AuxDPO adds extra controllable degrees of freedom, reducing the projection error

⇒ the goal: make it possible to project onto the reward function corresponding to the ideal RLHF solution θ (r_{θ_RLHF}; blue dot)!

  • Shows mathematically that the DPO algorithm can exhibit various failure modes

    : even with high-quality data (= infinite preference data generated from a Bradley-Terry-Luce (BTL) model based on the true reward function r*) and a very simple setup (a single prompt, 3 responses, and a 1-d policy parameter), DPO can

    be trained to prefer the second-best response (order reversal of preferences), or

    drive the probability of the highest-reward response below that of the base policy (overall reward reduction)

    ⇒ in particular, it is highly sensitive to how often each preference pair occurs in the data (= counts); a toy reconstruction follows the BTL note below.

    • about the Bradley-Terry-Luce (BTL) model

      : one of the simplest generative models for pairwise comparisons, modeling "which of two options is preferred" as a probability: P(a_i ≻ a_j | s) = σ(r*(s, a_i) − r*(s, a_j))
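
    A toy reconstruction of the failure modes above (my own numpy sketch; the paper's exact construction may differ): one prompt, 3 responses with true rewards r*, a scalar policy parameter θ with made-up features f, and infinite BTL-labeled data. Depending on which pair dominates the counts n_{1,2}, n_{2,3}, n_{3,1}, the population DPO minimizer either reverses the top-2 preference or suppresses the best response.

    ```python
    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical setup: 3 responses to one prompt, true rewards r*,
    # and a 1-d policy pi_theta(a_i) ∝ exp(theta * f_i) with fixed features f.
    r_star = np.array([2.0, 1.0, 0.0])   # a1 is truly best
    f = np.array([1.0, 1.5, 0.0])        # feature geometry disagrees with r*
    beta = 1.0
    pairs = [(0, 1), (1, 2), (2, 0)]     # pairs counted by n_12, n_23, n_31

    def population_dpo_loss(theta, weights):
        """Infinite-data DPO loss: BTL labels, pair (i, j) drawn w.p. weights."""
        loss = 0.0
        for w, (i, j) in zip(weights, pairs):
            p_ij = sigmoid(r_star[i] - r_star[j])   # BTL: P(a_i beats a_j)
            h = beta * theta * (f[i] - f[j])        # DPO logit (log Z cancels)
            loss += w * -(p_ij * np.log(sigmoid(h))
                          + (1.0 - p_ij) * np.log(sigmoid(-h)))
        return loss

    thetas = np.linspace(-5.0, 5.0, 2001)
    for weights in ([0.90, 0.05, 0.05], [0.05, 0.05, 0.90]):
        losses = [population_dpo_loss(t, weights) for t in thetas]
        t_hat = thetas[int(np.argmin(losses))]
        pi = np.exp(t_hat * f) / np.exp(t_hat * f).sum()
        print(f"counts {weights} -> theta* = {t_hat:+.2f}, pi = {pi.round(3)}")
    # When n_31 dominates, theta* > 0 and pi(a2) > pi(a1): order reversal.
    # When n_12 dominates, theta* < 0 and pi(a1) falls below the uniform
    # base policy: overall reward reduction. Clean BTL data in both cases.
    ```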

  • DPO ์•Œ๊ณ ๋ฆฌ์ฆ˜์˜ misspecified statistical estimation problem์„ ๋ฐํž˜
    • tabular assumption ๋•Œ๋ฌธ์—, ์‹ค์ œ DPO์˜ projection์ด optimalํ•˜์ง€ ์•Š๋‹ค!
      An example with 3 responses and a 1-d policy parameter showing failure modes of DPO. r* is the latent reward. The red line denotes the linear approximation C(A^⊤_{θ_0}) of the implicit reward manifold R^β. The region shaded in orange represents all possible implicit reward functions that DPO can project onto, depending on the relative proportion of pairwise preference counts n_{1,2}, n_{2,3}, n_{3,1}. If n_{3,1} dominates the rest, then the projection r^β_θ induces a post-optimization policy parameter θ > 0, leading to preference reversal and reduction of expected reward, causing DPO to fail.

      DPO projects the true reward (r*; blue dot) onto the implicit reward manifold (C(A^⊤_{θ_0}); the reward space DPO can actually express; red line), and in most cases the data distribution causes a mis-projection (red dashed line)

    • the reward tells us in which direction to update the policy parameters so the model makes better choices. But DPO's actual rewards live on a finite-dimensional manifold, so changing the reward does not always update the policy meaningfully!

      In other words, there exists a null space N(A_{ρ,θ_0}) of reward directions that do not affect the policy (one reading of this notation is sketched below)
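
      One way to read the notation above (my reconstruction from the figure captions; the paper's exact definitions may differ): linearizing the implicit reward map θ ↦ r^β_θ around the base parameter θ_0,

      $$r^\beta_\theta \;\approx\; r^\beta_{\theta_0} \;+\; A^\top_{\theta_0}\,(\theta-\theta_0),$$

      so the reward directions DPO can reach lie in the column space C(A^⊤_{θ_0}), while perturbations δ with A_{ρ,θ_0} δ = 0, i.e. δ ∈ N(A_{ρ,θ_0}), change the reward without changing the induced policy.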

  • DPO์˜ misspecification์„ ์šฐํšŒํ•˜๊ธฐ ์œ„ํ•œ AuxDPO ์ œ์•ˆ

    : policy์— ์˜ํ–ฅ์„ ์ฃผ์ง€ ์•Š๋Š” null space Aฯ,ฮธ0A_{ฯ,ฮธ_0} ๏ปฟ ๋ฅผ ์ž์œ ๋„๋กœ ํ™œ์šฉํ•˜์ž!

    AuxDPO fixes DPO's misspecification. r* is the latent reward. The blue line denotes the equivalence class R^β_eq(θ*) of all reward functions that yield the RLHF-optimal policy π_{θ*}. The red line denotes the linear approximation C(A^⊤_{θ_0}) of the implicit reward manifold R^β. The region shaded in orange represents all possible implicit reward functions that DPO can project onto. The green line depicts the domain of optimization over AuxDPO's auxiliary variables δ ∈ N(A_{ρ,θ_0}) for a fixed θ (the line shifts in parallel for other θ). δ introduces additional degrees of freedom, which help push the KL projection of r* to lie in the equivalence class R^β_eq(θ*). The projection induces the optimal policy π_{θ*}.

    What should happen: the projection of the true reward (r*; blue dot) onto the implicit reward manifold (C(A^⊤_{θ_0}); the reward space DPO can actually express; red line) should land in its equivalence class (the set of reward functions that have the same effect from the policy's standpoint; blue line).

    In reality, r* and the reward inducing the optimal policy differ by some δ ∈ N(A_{ρ,θ_0}) ⊊ R^m (an element of the null space N(A_{ρ,θ_0}), not of the full reward space R^m).

    ** r* = r^β_{θ*} + δ

    ⇒ so add extra degrees of freedom that move along the null-space directions (δ; green line) and let the optimization explore the reward space R^m!

    That is, add a δ ∈ N(A_{ρ,θ_0}) term to the original DPO objective and optimize it (green line), so that a δ* satisfying r* = r^β_{θ*} + δ* can be found!

    ⇒ the red line and the green line can now move together, mitigating the misspecification!

    • the AuxDPO objective
      • the term involving δ is the part that exploits the null space (a hedged sketch follows below)
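
      Since these notes do not reproduce the exact formula, here is a minimal numpy sketch of the idea as described above, assuming the auxiliary variable δ enters the preference logit additively per response and is pushed toward the null space N(A_{ρ,θ_0}) by a penalty; the names auxdpo_loss, A_rho, and lam are my placeholders, not the paper's:

      ```python
      import numpy as np

      def sigmoid(z):
          return 1.0 / (1.0 + np.exp(-z))

      def auxdpo_loss(logratio, delta, pairs, A_rho, beta=1.0, lam=0.1):
          """Sketch of an AuxDPO-style objective (my reading of the notes above).

          logratio: log pi_theta(a|s) - log pi_ref(a|s), one entry per
                    (prompt, response) pair, shape (m,)
          delta:    auxiliary reward offsets, shape (m,) -- the extra degrees
                    of freedom, meant to move along N(A_rho)
          pairs:    list of (winner_index, loser_index) preference pairs
          A_rho:    base-policy-dependent matrix (placeholder), shape (k, m);
                    the penalty pushes delta toward its null space
          """
          w = np.array([i for i, _ in pairs])
          l = np.array([j for _, j in pairs])
          # DPO logit per pair, augmented by the auxiliary variables delta
          h = beta * (logratio[w] - logratio[l]) + (delta[w] - delta[l])
          nll = -np.log(sigmoid(h)).mean()
          # keep delta (approximately) inside the null space: A_rho @ delta = 0
          return nll + lam * np.linalg.norm(A_rho @ delta) ** 2
      ```

      In training, the policy parameters (entering through logratio) and δ would be optimized jointly; the paper's actual parameterization of δ and the form of the constraint may differ.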

Categories

DPO RL research