19 March 2026

Beyond Pairwise: Empowering LLM Alignment With (Ranked) Choice Modeling


Review

Reviewer · Strength & Weakness & Suggestions · Rating (out of 5)

코스피
Strength: Learning ranking-preference information so that ordering is reflected in the training samples, which improves performance.
Weakness: It is ambiguous in what sense the preference signal became "richer."
Suggestion: A proof or explanation showing that the preference information actually increased would be a good addition.
Rating: 3.8

얼라
Strength: Consistent gains across several benchmarks lend weight to the claim that richer preference information is being learned.
Weakness: The need to go beyond pairs is not convincing; considering data-collection cost and similar factors, this does not look like a practical method.
Suggestion: With multi-preference settings mattering these days, it would be good to verify whether RCPO stays effective in alignment settings where several preferences coexist.
Rating: 3.7

비요뜨
Strength: I worried that (excessively) many responses per preference would blur the distinctions between responses and hurt training, but apparently knowing the global structure is more effective.
Weakness: As the experiments also show, picking an appropriate k looks important.
Suggestion: DPO appears to use the top/bottom pair; I am curious what happens if top-2 information is used in addition.
Rating: 4

칫솔
Strength: Given that preferences are usually binary choices, modeling a larger option set is a well-motivated move, and treating preferences over multiple options as a distribution is a convincing design choice.
Weakness: With this much granularity, it becomes unclear what the advantage over SFT would be.
Suggestion: An in-depth analysis of the pros and cons of pairwise PO vs. RCPO vs. SFT.
Rating: 3.5

설향딸기
Strength: Proposes a new direction for preference-optimization training.
Weakness: With more candidates, the information likely gets noisier and why a given rank was assigned becomes more ambiguous, so learning from it may actually be harder. Ranking once and then iterating over pairs would make sense, but collecting multi-way preferences every single time is hard to accept.
Suggestion: What happens if the ranking is kept as-is but the training is done with DPO?
Rating: 3.6

나스닥
Strength: Performance went up!
Weakness: This seems to overlap with a paper we saw before that used instruction evolution and exploited per-rank characteristics; the idea is not very original.
Suggestion: It would have been better if the data ranking came with interpretability; also, unlike the evol-style method, emphasizing the hard-negative aspect more could help.
Rating: 2.5

404
Strength: Instead of reflecting preferences in a binary fashion, ranking a larger candidate pool can (presumably?) capture more fine-grained preferences.
Weakness (more of a question): If a>b>c>d, couldn't the model learn better from the exploded pairs a>b, a>c, a>d, b>c, b>d, c>d than from feeding the whole ranking to RCPO at once?
Suggestion: Rather than supporting "RCPO is a richer preference signal that helps training" with benchmark numbers alone, analyze how the loss actually decreases and how the distribution in the learned space shifts!
Rating: 4

AI
Strength: Rather than pairwise-only preference optimization, the ranking over multiple responses can be incorporated in one shot.
Weakness: The method rests on choice models, but real human preferences depend on contextual knowledge and are not always consistent; it is a pity this is not taken into account.
Suggestion: Wouldn't it be better to consider preferences from multiple perspectives (e.g., culture, values)? This seems combinable with the OrthAlign paper.
Rating: 3.5

국밥
Strength: The motivation that pairwise training fails to capture rich information is simple, yet the method is non-obvious; the top-k formulation feels more natural than pairwise.
Weakness: The optimal k and s likely differ across tasks, which could undercut practicality in real deployments.
Suggestion: More evidence behind the explanation that top-2 is optimal would help.
Rating: 3.6

커피
Strength: Reading the DPO and RLHF papers, pairwise training always felt like a given; this paper showed that multi-way rank comparisons can also be incorporated, as long as a choice model and its probability distribution are defined.
Weakness: One would expect that packing in more information is naturally better, but the performance-versus-cost trade-off between pairwise and k≥2 needs careful handling. Judging from the paper alone, the variety of choice models also looks thin, and at a high level the model structure does not differ much from the existing RLHF recipe.
Suggestion: Additional experiments across a wider range of choice models would give the consistency claim a firmer footing.
Rating: 3.6

Citations: 0

TL;DR

💡

Methods like RLHF and DPO are built around pairwise preference optimization,

and thus pass up the chance to learn from more detailed information (human feedback).

⇒ Instead of ranking only a pair of responses, rank a larger set and train the model on that ranking.

Summary

Introduction & Background (⭐)

RLHF and DPO have emerged as a new paradigm for fine-tuning existing LLMs.

⇒ However, because these methods depend on "preference pairs" alone, they collapse rich multi-response information down to two items and risk discarding valuable signal.

To resolve this, the paper proposes RCPO (Ranked Choice Preference Optimization).

💡

When the model receives an input prompt x,

RCPO trains it on the "ordering" information over the candidate response set, so the model learns the preference order itself.


โญ ํ™•๋ฅ ๋ถ„ํฌ๋กœ ์ •์˜๊ฐ€ ๊ฐ€๋Šฅํ•˜๋ฉด, MLE ํ‘œํ˜„์ด ๊ฐ€๋Šฅํ•˜๊ณ , Objective(=-loss) ํ‘œํ˜„์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

โญ Choice model : ์—ฌ๋Ÿฌ ๊ฐœ์˜ ์„ ํƒ์ง€ ์ค‘์—์„œ ์–ด๋–ค ๊ฒƒ์ด ์„ ํƒ๋  ํ™•๋ฅ ์„ ํ‘œํ˜„ํ•œ ๋ชจ๋ธ

โ‡’ Ranking data๋ฅผ ํ™•๋ฅ ๋กœ ์“ฐ๊ธฐ ์œ„ํ•ด ์‚ฌ์šฉํ•œ๋‹ค!

โญ Reward model : response์˜ ํ’ˆ์งˆ์„ ์ ์ˆ˜๋กœ ํ‰๊ฐ€ํ•˜๋Š” ๋ชจ๋ธ


  • Conceptual framework: connects LLM fine-tuning with choice modeling.

    ⇒ Because a choice model is a probability distribution, the chain LLM fine-tuning → choice model → MLE goes through.

  • Concrete examples of choice models: MNL and Mallows-RMJ serve as the representative instances.

    ⇒ An objective function is defined for each choice model.

  • Experiments: the RCPO framework is evaluated on Llama-3-8B-Instruct, Gemma-2-9B-it, and Mistral-7B-Instruct.

    ⇒ Evaluation covers both in-distribution and out-of-distribution benchmarks.

Motivation
💡
  1. If the pairwise setup risks losing rich information, must the response sample really be limited to two?

    ⇒ Allow more than two.

  2. If there are more than two, how are the candidate responses produced?

    ⇒ Candidates generated by the model are scored and ranked with a reward model.

  3. Can a choice model be connected to an LLM?

    ⇒ Fine-tuning needs an objective function, and that objective rests on a probability distribution and MLE, so the connection works whenever the choice model can be defined as a probability distribution.

  4. Once there are more than two candidate responses, does the number of ranked outcomes matter as well?

    ⇒ More preference-ordering information looks obviously better; checked in the experiments.

⭐ As a result, by learning the preference-induced ordering, the model moves from local (pairwise) information to the global ordering!

The figure below shows the Pairwise, Single-Best Feedback, and Top-k Feedback structures.

Contribution

Overall pipeline

  • (1) Start from an already-trained LLM

    Llama-3-8B-Instruct, Gemma-2-9B-it, Mistral-7B-Instruct

  • (2) Choose a choice model (MNL, Mallows-RMJ); a small numeric illustration follows this item.
    • What is MNL?

      A utility-based choice model.

      Each response carries a utility (score) → the higher the score, the higher the probability of being chosen.

    • What is the Mallows-RMJ model?

      A rank-based choice model ⇒ it defines the probability with which a given ranking is generated.
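
To make the MNL intuition concrete, here is a tiny self-contained sketch (made-up utility scores, not the paper's code): the single-best choice probability is just a softmax over utilities.

```python
import math

def mnl_choice_probs(utilities):
    """MNL single-best choice: P(i | S) = exp(u_i) / sum_j exp(u_j)."""
    m = max(utilities)  # max-shift for numerical stability
    exps = [math.exp(u - m) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical reward-model scores for four candidate responses.
scores = [2.1, 1.3, 0.4, -0.5]
print(mnl_choice_probs(scores))  # higher utility -> higher choice probability
```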

  • (3) Define the objective function (= loss) based on the chosen choice model; the standard Top-k decomposition is sketched right after this list.
    • MNL probability distribution and objective function

      Both the Single-best case (one outcome) and the Top-k case (several outcomes) are covered, and each can be re-expressed in closed form.

      Single-best:

      $$P(y_i \mid S; x) = \frac{e^{\nu_{y_i}(x)}}{\sum_{j \in S} e^{\nu_{y_j}(x)}}$$

      ⇒ The winning response is compared against all remaining responses in S simultaneously.

      Top-k: [closed-form equation shown only as a figure in the note]

    • Mallows-RMJ probability distribution and objective function

      The Single-best and Top-k distributions are likewise defined and can each be re-expressed in closed form. [equations shown only as figures in the note]
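
For reference, under MNL the Top-k case is usually handled with the sequential (Plackett-Luce / "exploded logit") decomposition: the ranking is read as k successive single-best choices from a shrinking set. This is the standard form from the choice-modeling literature; treating it as the paper's exact expression is an assumption here.

$$P\big(y_{(1)} \succ \cdots \succ y_{(k)} \mid S; x\big) = \prod_{i=1}^{k} \frac{e^{\nu_{y_{(i)}}(x)}}{\sum_{j \in S \setminus \{y_{(1)},\dots,y_{(i-1)}\}} e^{\nu_{y_j}(x)}}$$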

  • (4) Build ranked preference data from the training set.

    The UltraFeedback dataset is used.

    • Feed each prompt x from the UltraFeedback dataset to the LLM.
    • Score the multiple responses the LLM generates for x (the candidate set) with a reward model, then sort them.
    • Assemble the data as (x, S, μ_k) tuples; a construction sketch follows this item.
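
A minimal sketch of this construction step, following the description above; `generate` and `reward_score` are hypothetical stand-ins for an LLM sampler and a reward model, not APIs from the paper:

```python
# Build one (x, S, mu_k) tuple: sample candidates, score with a reward
# model, sort by score, keep the top-k ranking as the observed feedback.
def build_ranked_example(prompt, generate, reward_score, n_candidates=5, k=2):
    candidates = [generate(prompt) for _ in range(n_candidates)]   # candidate set S
    ranked = sorted(candidates, key=reward_score, reverse=True)    # best -> worst
    return {"x": prompt, "S": candidates, "mu_k": ranked[:k]}      # observed top-k
```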
  • (5) Fine-tune the LLM on the data built above with the objective function; a loss sketch follows this item.

    $$\nabla_{\theta}\,\mathcal{L}_{\text{Mallows-RMJ-PO-Top-}k}(\pi_{\theta})$$

    • Mallows-RMJ-PO-Top-2 performs best, so it is described as the representative case.
    • The update raises the weight on higher-ranked responses and lowers it on lower-ranked ones.
    • It also modulates the update strength according to a response's position in the ranking over S and how similar the rewards are.
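
To make the fine-tuning step concrete, here is a minimal PyTorch sketch of a Top-k ranking loss in the sequential MNL form shown earlier, assuming the DPO-style implicit utility ν_y(x) = β·log(π_θ(y|x)/π_ref(y|x)). Both that utility parameterization and the loss shape are assumptions for illustration; the paper's Mallows-RMJ-PO objective has its own closed form.

```python
import torch

def topk_ranking_loss(logp_theta, logp_ref, beta=0.1, k=2):
    """Sequential-MNL (Plackett-Luce) Top-k NLL over one candidate set.

    logp_theta / logp_ref: shape (n,) sequence log-probs of the n candidates
    under the policy / frozen reference model, already sorted by the
    reward-model ranking (index 0 = best). The implicit utility
    nu = beta * (logp_theta - logp_ref) is a DPO-style assumption.
    """
    nu = beta * (logp_theta - logp_ref)
    loss = 0.0
    for i in range(k):
        # The i-th pick competes against every candidate not yet chosen.
        loss = loss - (nu[i] - torch.logsumexp(nu[i:], dim=0))
    return loss

# Usage sketch with dummy log-probs (candidates ranked best -> worst).
lp_theta = torch.tensor([-12.0, -14.5, -15.1, -16.0], requires_grad=True)
lp_ref   = torch.tensor([-13.0, -14.0, -15.0, -15.5])
print(topk_ranking_loss(lp_theta, lp_ref, beta=0.1, k=2))
```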

  • Experiment & Result
    • Out-of-distribution: AlpacaEval 2.0 / Arena-Hard-v0.1 (instruction-following benchmarks)

      AlpacaEval 2.0 = measures WR (win rate) and LC (length-controlled win rate) between answers generated by the fine-tuned LLM and by GPT-4-Turbo.

      Arena-Hard-v0.1 = measures WR between the fine-tuned LLM and GPT-4-0314.

      ⇒ Q: What is the point of this setup?

      ⇒ A: The fine-tuned model's output and the reference model's output are placed side by side, and GPT-4.1-mini judges which output is the better one.

      For Arena-Hard-v0.1, GPT-5-mini is additionally used in the judge role.


    • In-distribution

      The fine-tuned model's output is compared against the preferred response from the existing test set, and GPT-4.1-mini judges which is the better output.


    Llama-3-8B-Instruct

    • Overall, Mallows-RMJ-PO-Top-2 delivers the best performance.
    • Why Top-2?

      ⇒ Training on Top-2 feedback was generally observed to outperform Top-1.

    • Effect of the choice model

      Which choice model is used makes a real difference in performance.

    ⇒ Q) Why does SimPO score better on the LC portion of AlpacaEval 2?

    ⇒ A) LC compares results after length correction, and SimPO is less length-dependent and more stable.

    Results when applied to the other LLMs

    • Ablation Study

      As before, Llama-3-8B-Instruct is used, and the ablation examines the number of ranked positions K and the candidate-set size S.

      1. K and performance are not always positively correlated.

        ⇒ As K grows, producing the ranking and distinguishing the items both become harder.

      2. Performance generally scales with S, and even S=3 already achieves a substantial improvement over S=2 (pairwise).

        ⇒ Moreover, a larger S introduces more negative samples, letting the LM learn to discriminate.

      💡

      An intermediate value that balances S and K is ideal.


  • Conclusion

    RCPO is a framework that connects preference optimization with choice-model estimation.

    Using MLE, RCPO unifies pairwise, single-best, and top-k preference data.

    With a utility-based and a rank-based choice model as the examples, RCPO demonstrates performance gains that come from preserving richer feedback than pairwise training.

💬 Prior work trains on pairwise preferences only, so it fails to learn the richer signal.

⇒ RCPO converts ranking-preference information over multiple responses into choice-model probabilities and trains the LLM on them.

⭐ The model can thereby learn richer preference information.


Categories

DPO MLE research