19 March 2026

Multiplayer Nash Preference Optimization


Review

Nickname | One-line review | Rating (out of 5)
코스피
Strength: Instead of treating alignment as min-max, the paper offers a measure of how well a model does in general relative to "other models".
Weakness: Fitting to the overall criteria will probably leave it weak when optimizing for one specific criterion.
Suggestion: A training method that fits a specific oracle while still meeting the overall criteria would be good.
3.7
์–ผ๋ผ๊ฐ•์ : ๋กœ์ง“์ด ์ ˆ๋Œ€๊ฐ’๋ณด๋‹ค ์ƒ๋Œ€์  ๋น„๊ต๋กœ ์˜๋ฏธ๋ฅผ ๊ฐ–๋“ฏ์ด alignment๋„ ๋‹จ์ผ ์ ์ˆ˜ maximization์ด ์•„๋‹ˆ๋ผ ์—ฌ๋Ÿฌ ์ •์ฑ… ๊ฐ„ ์ƒ๋Œ€์  ์šฐ์œ„์™€ ๊ท ํ˜•์œผ๋กœ ํ•ด์„ํ•œ ๊ด€์ ์ด ์‹ ์„ ํ•จ
์•ฝ์ : ์‹ค์ œ ๋น…ํ…Œํฌ๊ธฐ์—…๋“ค์€ ํŠน์ • Preference์— ์ง‘์ค‘ํ•ด์„œ alignment๋ฅผ ํ•˜๊ณ  ์ด๋ฅผ selling point๋กœ ์‚ฌ์šฉํ•˜๋Š”๊ฑธ๋กœ ์•Œ๊ณ  ์žˆ๋Š”๋ฐ ํ•ด๋‹น ๋ฐฉ๋ฒ•๋ก ์€ ์‹œ์žฅ ๊ฒฝ์Ÿ๋ ฅ์—์„œ ์•ฝํ•  ๊ฒƒ ๊ฐ™์Œ
์ œ์•ˆ: ํ•˜๋‚˜์˜ preference๋Š” ์˜ฌ๋ฆฐ๋‹ค๋Š” ๊ฐ€์ • ํ•˜์— ๋‚˜๋จธ์ง€์˜ ํ‰๊ท ์„ ์˜ฌ๋ฆฌ๋Š” ๋ฐฉ๋ฒ•๋ก ์ด ๋‚˜์˜ค๋ฉด ์ข‹์„๋“ฏ
4.0
비요뜨
Strength: The paper's view of alignment is certainly novel. It resembles the real-world attitude of "just do better than average compared with everyone else".
Weakness: A "policy that is strong on average" is not necessarily a "policy that is good for a particular user". It simply produces an all-round, inoffensive policy that is not the best from any single perspective.
Suggestion: The "opponents" could also be composed adaptively to suit the situation or the objective to be achieved.
4.1
칫솔
Strength: The idea that there can be several criteria to satisfy matches realistic scenarios well.
Weakness: Being good at everything is unrealistic, so the model may end up aligned to neither one thing nor the other.
Suggestion: Rather than fitting the average, identify the criteria that must be met and optimize for those.
3.9
설향딸기
Strength: I read it not as "better than some model" but as "at least land in the middle", which is a good direction depending on the point of view. If anything, it feels like the optimization direction a general-purpose model should have.
Weakness: This is acceptable if the goal is generality, but after that a step of training and improving on specific tasks is naturally required. This study does not consider improving the model after it has been optimized this way.
Suggestion: Can preference optimization really be repeated forever? Shouldn't we also consider how it affects subsequent training? Not closely related to this paper, just something that came to mind while reading.
4.0
나스닥
Strength: The perspective is interesting, and I think at least one study like this is needed!
Weakness: The case for why this research is needed is quite unconvincing… Why do we need a model that does not lose? If GPT is good at code, Claude at safety, and Gemini at RAG, where would a model that merely does not lose be used?
Suggestion: Show that it can be practical when conflicting values must be served! E.g. in LLM safety the model has to speak proactively yet cautiously at the same time.
3.5
커피
Strength: The paper argues that alignment to human preference should reflect the realistic view of real-world "diversity". Also, given how hard opponents are to design, using the model's own earlier versions as opponents looked logically sound and novel.
Weakness: Scalability looks good, but I wonder whether building a model that performs well across every property is worth the cost that comes with it.
Suggestion: Taking objectives that are actually relevant or important as the criteria, and presenting experiments on how alignment performance changes with them, would demonstrate scalability better.
3.5
404
Strength: Research that pursues "we may not come first, but let's not come last!". Underwhelming from an academic angle, but in practice this direction (decent service at a decent cost) may well be preferred!
Weakness: The need for the research does not come across strongly.
Suggestion: Shouldn't training still have at least one clear objective? Something that compensates for the trade-offs when aligning several objectives at once.
3.5
AI
Strength: Reflects the fact that real human preferences are non-transitive, and extends the framework while keeping its theoretical justification.
Weakness: The paper claims to be multiplayer, but the opponents are in practice a mixture of past policies, so isn't this most likely a single model's trajectory...? Calling it a population game may be somewhat of an overstatement.
Suggestion: Multi-agent research that does not fix the number of players and considers gradually evolving policies could follow.
3.7
국밥
Strength: The shift in perspective, targeting an equilibrium that loses to no one instead of reward maximization, feels fresh.
Weakness: Is using past snapshot policies as opponents in Time-dependent MNPO really multiplayer? Isn't the model ultimately just competing with its own past?
Suggestion: It would be good to add experiments that include actual other LLMs in the opponent pool, not only the model's own past snapshots.
3.7

TL;DR

💡 The goal of alignment is not to maximize reward but to reach a stable equilibrium that loses to no one within a population of many values and policies!

Summary

  • ๋ญ์— ์“ฐ๋ ค๊ณ  ์ด ์—ฐ๊ตฌ๋ฅผ ํ–ˆ์„๊นŒ?
    • ๋‹ค์–‘ํ•œ ๊ฐ€์น˜์™€ ๊ด€์ ์ด ์กด์žฌํ•˜์ง€๋งŒ, RLHF๋Š” ์ ์ˆ˜ ๊ธฐ๋ฐ˜ ์ตœ์ ํ™”๊ธฐ ๋•Œ๋ฌธ์— ๋ชจ๋‘๋ฅผ ๋งŒ์กฑ์‹œํ‚ค๊ธฐ ์–ด๋ ค์›€
      • ์–ด๋–ค ๊ด€์ ๊ณผ ๊ฐ€์น˜๋กœ ํ•™์Šตํ•˜๊ณ  ์ตœ์ ํ™”๋˜๋А๋ƒ์— ๋”ฐ๋ผ ๋งค๋ฒˆ ๋‹ฌ๋ผ์ ธ์„œ, ์–ด๋–ค ๊ฒฝ์šฐ์—๋Š” ๋” ์•ˆ ์ข‹์•„์งˆ ์ˆ˜ ์žˆ์Œ
    • Nash ์ตœ์ ํ™”๋Š” ์ด๋ฅผ ๋ณด์™„ํ•จ. ์–ด๋–ค ์‹ฌํŒ์ด ์˜ค๋˜, ์ƒ๋Œ€ ๋ชจ๋ธ์— ๋Œ€ํ•ด์„œ ์ตœ์†Œํ•œ ์ง€์ง€๋Š” ๋ง์ž!
      • ๊ทธ๋Ÿฐ๋ฐ, ๊ธฐ์กด ๋ฐฉ๋ฒ•์€ 2-player๋งŒ์„ ๊ณ ๋ ค. ๊ทธ๋Ÿฐ๋ฐ, ์ƒ๋Œ€ ๋ชจ๋ธ์€ ๋ณดํ†ต ์—ฌ๋Ÿฌ๊ฐœ ์•„๋‹Œ๊ฐ€?
    • ๊ทธ๋ž˜์„œ, Multiplayer Nash PO๋ฅผ ์ œ์•ˆ
      • ์–ด๋–ค ๊ฐ€์น˜๊ฐ€ ์˜ค๋“ , ์–ด๋–ค ์ƒ๋Œ€ ๋ชจ๋ธ์ด ์˜ค๋“ , ํ‰๊ท ์ ์œผ๋กœ ์ง€์ง€๋Š” ๋ง์ž!
      • ์–ด๋–ค ์ƒํ™ฉ์—์„œ๋“  ์ตœ์„ ์˜ ์„ ํƒ์ด ๊ฐ€๋Šฅํ•˜๋„๋ก ๋งŒ๋“ค๊ธฐ
    • ์˜ˆ)ํŠธ๋žœ์Šคํฌ๋จธ ์„ค๋ช…ํ•ด์ค˜
      • RLHF (concise) (๋ณด์ƒ ์ ์ˆ˜ ์ตœ์ ํ™”)
        • ํŠธ๋žœ์Šคํฌ๋จธ๋Š” ์–ดํ…์…˜ ๊ธฐ๋ฐ˜ ๋ชจ๋ธ์ž…๋‹ˆ๋‹ค
      • RLHF(detail)
        • ํŠธ๋žœ์Šคํฌ๋จธ๋Š” ์ธ์ฝ”๋” ๋””์ฝ”๋” ๊ธฐ๋ฐ˜์œผ๋กœ ๊ตฌ์„ฑ๋˜๋ฉฐ, ์–ดํ…์…˜โ€ฆ
      • NLHF(concise vs detail) (์ƒ๋Œ€ ํ•œ๋ช…๊ณผ ํ† ๋ก ํ•ด์„œ ์ง€์ง€ ์•Š๊ธฐ)
        • ์ƒ๋Œ€ ๋ชจ๋ธ์˜ ํŠน์„ฑ์— ๋”ฐ๋ผ, oracle์— ๋”ฐ๋ผ ํŠน์ • ๋ฐฉํ–ฅ์œผ๋กœ ์ ๋ฆด ์ˆ˜ ์žˆ์Œ
      • MNPO (ํ† ๋ก  ์ƒ๋Œ€๊ฐ€ ์—ฌ๋Ÿฌ๋ช…์ด์–ด๋„ ๋ˆ„๊ตฌ์—๊ฒŒ๋„ ์™„ํŒจํ•˜์ง€ ์•Š๊ธฐ)
        • concise+detail+โ€ฆ ๊ฐ€ ๊ฒฝ์Ÿํ•˜๊ณ , ์–ด๋А ํ•˜๋‚˜๋„ ๋†“์น˜์ง€ ์•Š๋„๋ก ๊ตฌ์„ฑ

Background

  • The paper summarizes the background well
  • Bradley-Terry (BT)
    • Optimizes through a single scalar reward function
    • Transitivity assumption
      • If RLHF on A produces A', then A < A'
      • If RLHF on A' produces B, then A' < B
      • It then follows that A < B
  • Nash equilibrium
    • If the opponent does not change strategy, I have no reason to change mine either
    • Not always Pareto optimal (not necessarily the best outcome)
    • What does this mean from a training standpoint?
      • If I train one more step, I may end up worse than the opponent, so training further is pointless
        • A case where staying put is better
    • Reference
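
A minimal sketch of the two background ideas, using made-up numbers. Bradley-Terry win probabilities derived from scalar rewards are always transitive, whereas a cyclic (rock-paper-scissors-like) preference cannot be ranked by any scalar reward; in that case a mixture that no single response beats is the natural target. The reward values, the matrix `CYCLIC`, and the helper names are illustrative assumptions, not taken from the paper.

```python
import math

def bt_prob(r_i: float, r_j: float) -> float:
    """Bradley-Terry: probability that i beats j, from scalar rewards."""
    return 1.0 / (1.0 + math.exp(-(r_i - r_j)))

# Hypothetical scalar rewards for the three models A, A', B.
rewards = {"A": 0.0, "A'": 1.0, "B": 2.0}
p_ap_over_a = bt_prob(rewards["A'"], rewards["A"])   # A' beats A
p_b_over_ap = bt_prob(rewards["B"], rewards["A'"])   # B beats A'
p_b_over_a = bt_prob(rewards["B"], rewards["A"])     # so B beats A as well

# Hypothetical cyclic preference: CYCLIC[i][j] = P(response i beats response j).
# 0 beats 1, 1 beats 2, 2 beats 0, so no scalar reward reproduces it.
CYCLIC = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]

def win_rate(policy, opponent, pref):
    """Expected probability that a sample from `policy` beats one from `opponent`."""
    return sum(policy[i] * opponent[j] * pref[i][j]
               for i in range(len(policy)) for j in range(len(opponent)))

uniform = [1 / 3, 1 / 3, 1 / 3]
pure = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Against the uniform mixture every pure response wins exactly half the time,
# so nobody gains by deviating: the uniform mixture is the Nash equilibrium here.
deviation_win_rates = [win_rate(p, uniform, CYCLIC) for p in pure]
```

In the cyclic case "keep training to beat the current opponent" never settles, which is the sense in which staying put at the equilibrium is the better option.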

Motivation

  • Prior work extended BT-based RLHF (Reinforcement Learning from Human Feedback) to NLHF (Nash Learning from Human Feedback)
    • Problems with RLHF
      • Real human preferences are not transitive
      • Many properties play a role (safety, helpfulness, conciseness, etc.), and every annotator applies different criteria
    • To address this, optimization is defined as a 2-player Nash game
      • Make the policy hold up no matter which opponent shows up
      • How?
        • Put a judge in place, have the judge pick the better response, and make sure the policy at least does not lose to the opponent
      • That is, the RLHF reward score ⇒ replaced by comparison-based advantage
  • However, alignment should be an n-player game
    • Is there necessarily only one opponent policy? (No)
    • Is there a single objective? (No)
  • How is this handled?
    • Use the average over several policies, so that the policy wins more often than the average opponent
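
The "win more often than the average opponent" idea can be sketched with a toy judge. Everything here is hypothetical: the preference matrix `PREF`, the two opponent policies, and the names `specialist` and `generalist` are only there to show why a best response to one opponent can be routed by another, while the average win rate rewards the policy that holds up against the whole population.

```python
from typing import List, Sequence

# Hypothetical judge: PREF[i][j] = probability that response style i beats style j.
PREF: List[List[float]] = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]

def win_rate(policy: Sequence[float], opponent: Sequence[float],
             pref: List[List[float]] = PREF) -> float:
    """Probability that a response drawn from `policy` beats one from `opponent`."""
    return sum(policy[i] * opponent[j] * pref[i][j]
               for i in range(len(policy)) for j in range(len(opponent)))

def avg_win_rate(policy: Sequence[float], opponents: Sequence[Sequence[float]],
                 pref: List[List[float]] = PREF) -> float:
    """Mean win rate against a population of opponent policies."""
    return sum(win_rate(policy, opp, pref) for opp in opponents) / len(opponents)

opponents = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]]
specialist = [0.0, 0.0, 1.0]   # best pure response to the first opponent only
generalist = [1.0, 0.0, 0.0]   # best pure response to the average opponent

for name, pi in [("specialist", specialist), ("generalist", generalist)]:
    per_opponent = [round(win_rate(pi, opp), 2) for opp in opponents]
    print(name, per_opponent, round(avg_win_rate(pi, opponents), 2))
```

With these numbers the specialist wins 0.78 against the first opponent but only 0.22 against the second (0.50 on average), while the generalist scores 0.50 and 0.78 (0.64 on average).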

Idea

  • Alignment should be viewed not as min-max but as an equilibrium across all viewpoints
    • The policy must compete with a population of policies that hold different criteria; competing with a single policy is meaningless!
    • Is it better than average? ⇒ It at least lands above the middle!
    • A concept that subsumes DPO and SimPO!
      • Whenever there is some opponent and training uses comparative advantage to become better than that opponent, it falls under this method (according to the authors, at least)
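
For reference, a sketch of the standard DPO and SimPO losses, written so that the "beat some opponent by a margin" reading is visible. These are the usual published forms of the two losses, not the paper's derivation of them as special cases (these notes omit the equations); the function names and default hyperparameters are illustrative assumptions.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """DPO: the margin is measured relative to a frozen reference policy,
    which plays the role of the single implicit opponent."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)

def simpo_loss(logp_w: float, logp_l: float, len_w: int, len_l: int,
               beta: float = 2.0, gamma: float = 0.5) -> float:
    """SimPO: reference-free, length-normalized log-probabilities
    with a target margin gamma."""
    margin = beta * (logp_w / len_w - logp_l / len_l) - gamma
    return -log_sigmoid(margin)

# Sequence log-probabilities of the chosen (w) and rejected (l) responses.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
print(simpo_loss(-10.0, -12.0, len_w=5, len_l=5))
```

Both losses shrink as the chosen response pulls ahead of the rejected one; they differ only in what the comparison is anchored to.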

Method (all equations omitted)

  • Proposes Multiplayer Nash Preference Optimization (MNPO)
    • The preference oracle (judge) is shared (Homogeneous)
      • Theoretically clean, but further from the real world
    • Each player has a different preference oracle (Heterogeneous)
      • The theory becomes a little shaky, but it is closer to the real world
        • The judgment can differ by viewpoint, such as safety or helpfulness

Homogeneous MNPO

  • A single oracle, shared by all models
  • The oracle has to be good
  • Theoretically nice!
    • A Nash equilibrium can be guaranteed
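
To make "an equilibrium exists under a shared oracle" concrete, here is a toy self-play loop. It is a textbook multiplicative-weights update, not the MNPO training procedure, and the cyclic matrix `PREF`, the step size, and the starting policy are all assumptions chosen for illustration. The time-averaged policy approaches the mixture that no single response can beat.

```python
import math
from typing import List

# Hypothetical shared oracle over three responses (cyclic: no response dominates).
PREF = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def self_play(pref: List[List[float]], init: List[float],
              eta: float = 0.05, steps: int = 5000) -> List[float]:
    """Multiplicative-weights self-play; returns the time-averaged policy."""
    n = len(init)
    logits = [math.log(p) for p in init]
    avg = [0.0] * n
    for _ in range(steps):
        policy = softmax(logits)
        for i in range(n):
            avg[i] += policy[i] / steps
        # Advantage of each response against the current policy (0 means a tie).
        gains = [sum(policy[j] * (pref[i][j] - 0.5) for j in range(n))
                 for i in range(n)]
        logits = [logits[i] + eta * gains[i] for i in range(n)]
    return avg

nash_estimate = self_play(PREF, init=[0.5, 0.3, 0.2])
print([round(p, 3) for p in nash_estimate])
```

The individual iterates keep cycling; only the average settles near the uniform mixture, which is why equilibrium-seeking methods typically rely on averaging or regularization.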

Heterogeneous MNPO

  • Several oracles (each model has its own preferences)
  • The Nash-theoretic guarantees weaken, but the setting is closer to reality
  • This study only proposes the setting and does not treat it in detail

Time-dependent MNPO

  • Uses previously trained policies as the opponents
  • Running several policies at once hits GPU memory limits, so snapshots are used for efficiency
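
A rough sketch of the snapshot idea on the same kind of toy game: keep the last few versions of the policy in a bounded pool and train against their average. The pool size, the uniform weighting of snapshots, the update rule, and the name `train_with_snapshots` are assumptions for illustration; the paper's actual weighting and loss are not reproduced in these notes.

```python
import math
from collections import deque
from typing import List, Tuple

# Hypothetical shared judge: PREF[i][j] = probability that response i beats j.
PREF = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]

def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_with_snapshots(pref: List[List[float]], init_logits: List[float],
                         rounds: int = 20, steps_per_round: int = 50,
                         pool_size: int = 3, eta: float = 0.1
                         ) -> Tuple[List[float], List[List[float]]]:
    n = len(pref)
    logits = list(init_logits)
    # Bounded pool: appending beyond `pool_size` evicts the oldest snapshot.
    pool = deque([softmax(logits)], maxlen=pool_size)
    for _ in range(rounds):
        # Average opponent built from stored snapshots (uniform weights assumed).
        opponent = [sum(s[j] for s in pool) / len(pool) for j in range(n)]
        for _ in range(steps_per_round):
            gains = [sum(opponent[j] * (pref[i][j] - 0.5) for j in range(n))
                     for i in range(n)]
            logits = [logits[i] + eta * gains[i] for i in range(n)]
        pool.append(softmax(logits))  # store the new policy as a snapshot
    return softmax(logits), list(pool)

policy, pool = train_with_snapshots(PREF, init_logits=[0.3, 0.0, -0.3])
print(len(pool), [round(p, 3) for p in policy])
```

Only the stored snapshots (here three small probability vectors, in practice model checkpoints) need to be kept around, which is the efficiency argument for using them instead of separately trained opponents.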

Experiments

  • ์ •๋ ฌ์€ ๋ณด์ƒ์„ ๋งŽ์ด ๋ฐ›๋Š” ๊ฒƒ์ด ์•„๋‹ˆ๋ผ, ์—ฌ๋Ÿฌ ์ƒ๋Œ€๋ณด๋‹ค ๋ชปํ•˜์ง€ ์•Š๋Š” ๊ฒƒ! โ‡’ ์•ˆ์ „์„ฑ, MNPO์˜ motivation
  • Instruction following ๋ฒค์น˜๋งˆํฌ
  • Knowledge, commonsense ๋ฒค์น˜๋งˆํฌ

Analysis

  • Works even with several oracles
    • Performance can even improve!
  • Multiplayer is better than a single opponent
    • Outperforms INPO
    • A single opponent carries a risk of overfitting; the multiplayer setting learns more robustly
  • Stronger alignment + preserved capability
    • Loses less of what the model already knew and absorbs new behavior well
    • RLHF-family methods show performance variance, whereas MNPO performs consistently

Categories

DPO RL research