19 March 2026

OrthAlign: Orthogonal Subspace Decomposition for Non-Interfering Multi-Objective Alignment

💡 When optimizing multiple preferences, decompose the parameter update space into orthogonal subspaces to eliminate interference between objectives at the source

염규환

Review

Nickname | Strength & Weakness & Suggestions | Rating (0/5)
나스닥
Strength: In a trend that relies heavily on superposition, the paper explicitly solves the problem this causes.
Weakness: Given the nature of the MPA problem tackled here, I am not sure that training preferences to be truly orthogonal to one another is a good idea. In the end, the proposed method seems mainly to make training smoother (it appears to help optimization).
Suggestion: How about focusing on catastrophic forgetting rather than MPA and applying the method to related tasks?
4
커피
Strength: Pinpoints the existing problem well and applies singular value decomposition to separate the space by preference type, removing conflict. The method is also well designed, step by step.
Weakness: Conflict is removed, but if the gap between singular vectors is ambiguous when computing the safe subspace, there seems to be a risk of touching the principal space.
Adaptive k appears to compensate for this to some extent, but an additional mechanism would be welcome.
Suggestion: Apply additional criteria for choosing k.
4.1
코스피
Strength: Separating the parameter update space to resolve the conflicts that arise when existing models learn several values at once is novel.
Weakness: When updates are projected into the safe subspace, orthogonality means no interference and the spaces are separated, but it is unclear how the parts where different properties are actually related get handled.
Suggestion: How about a clear criterion for how much of the tail space to allow, or an additional space?
4.1
์–ผ๋ผ๊ฐ•์ : ๋‹ค์ค‘ preference ์ตœ์ ํ™” ๊ด€ํ•œ ๋…ผ๋ฌธ๋“ค์ด preference ๊ฐ„์˜ trade-off๋ฅผ ์–ด์ฉ” ์ˆ˜ ์—†๋Š” ๋ฌธ์ œ๋กœ ์—ฌ๊ธฐ๊ณ  ๋„˜์–ด๊ฐ€๋Š” ๋…ผ๋ฌธ๋“ค์ด ๋งŽ์€๋ฐ ํŠน์ž‡๊ฐ’ ๋ถ„ํ•ด๋ฅผ ํ†ตํ•ด ์ด trade-off๋ฅผ ์ตœ์†Œํ™”ํ•˜๋Š” ์•„์ด๋””์–ด๊ฐ€ ์ข‹๋‹ค๊ณ  ์ƒ๊ฐํ•จ.
์•ฝ์ : ๋‹ค์–‘ํ•œ Preference๋“ค ์ค‘์— helpfulness, harmlessness, truthfulness 3๊ฐœ์˜ preference.์— ๋Œ€ํ•œ ์‹คํ—˜๋งŒ ์žˆ๋Š” ์ ์ด ์•„์‰ฌ์›€
์ œ์•ˆ: ๋‹ค์–‘ํ•œ preference์— ๋Œ€ํ•œ ๋ฐฉ๋ฒ•๋ก  ์ ์šฉ์ด ๊ถ๊ธˆํ•จ
4.2
국밥
Strength: Proposes a simple yet reliable way to learn three preferences in sequence without breaking the earlier ones, which resolves the existing problem.
Weakness: The later a preference is learned, the narrower the space left for it, so its performance may drop. An experiment comparing different orderings would help.
Suggestion: An experiment comparing performance when the preference training order is changed.
4.1
비요뜨
Strength: Orthogonality seems usable in a wide variety of settings. That performance improves considerably just by adding a projection to existing models is a big strength in terms of generality.
Weakness: If the number of objectives grows, it is unclear how many more objectives can be handled stably within the proposed space.
Suggestion: To handle more objectives, the method could be extended to allow some interference according to objective importance while controlling it with a penalty.
4.2
칫솔
Strength: Enforcing orthogonality seems to make things such as how far each individual objective has been trained easier to interpret.
Weakness: Are the objectives really orthogonal to one another? Might there be additional performance to gain from interactions between objectives that are not separated?
Suggestion: Model or evaluate the balance between orthogonality and task performance.
4.3
설향딸기
Strength: Proposes decomposing the mutual interference that can arise in preference learning into orthogonal spaces so that learning in new directions proceeds smoothly. Intuitive, with a clear motivation.
Weakness: Is a trade-off between preferences really a bad thing? Managing that trade-off well seems more important. Rather than making every space orthogonal, it may be more efficient to manage and learn what is good as good and what is bad as bad.
Suggestion: In this paper too, preferences are ultimately handled through learning-based optimization. When the training criterion is adjusted to improve on the previous state, could the adjustment aim to improve the sum over the whole space in the long run rather than a single orthogonal space? (Think of it like MCTS.)
4.0
404
Strength: Raises an intuitive question about concepts that underlie current academic trends, such as preferences and parameter space. The motivation is very clear and the impact is large.
Weakness: Are objectives always orthogonal in the multi-objective setting? Can this be explained with a mathematical proof? What characterizes objectives that are not orthogonal? Are there cases where objectives complement each other?
Suggestion: Analyze the orthogonality of multiple objectives / reflect per-objective importance.
4.5
AI
Strength: The framing of the research itself is solid. Most prior work feels like plain reward engineering, whereas this paper approaches the MPA problem concretely from the viewpoint of parameter geometry and guarantees theoretical stability.
Weakness: Wouldn't the projection matrices add very large overhead when applied to large-scale models? The paper contains no cost analysis.
Suggestion: Explore quantizing the matrices so that the method can be applied to larger LLMs.
4.2

TL; DR

💡 When optimizing multiple preferences, decompose the parameter update space into orthogonal subspaces to eliminate interference between objectives at the source

Summary

  • Authors: China Telecom, Renmin University of China, University of Science and Technology of China
  • Citations: 1

Preliminary

  • What is MPA (Multi-preference alignment)?
    • The process of optimizing a model so that it simultaneously satisfies human preferences that can conflict with one another
      • Helpfulness
      • Harmlessness (safety)
      • Truthfulness
      • Honesty, Fairness
    • Ex) “How do I make a bomb?”
      • Helpful model → explains
      • Harmless model → refuses
  • Conflict Mitigation of MPA
    • MPA usually takes a base model $\pi_0$ trained with SFT as its reference.
      • $\pi_0(y|x)$: the initial policy that generates response $y$ for input $x$
    • Mathematical modeling of human preferences
      • Humans judge preference data as follows
        • $y_1 \succ y_2$ (for the same prompt $x$, response $y_1$ is better than $y_2$)
      • Latent reward definition
        • $r^*_i(x,y)$
    • Bradley-Terry model (defines the preference probability)
      • Combine multiple preferences into a weighted sum, then apply softmax

      ⇒ This assumption becomes a source of multi-objective conflict

    • DPO
      • Make the probability of the preferred response $y_w$ larger relative to the reference model $\pi_0$, and the probability of the dispreferred response $y_l$ smaller!
      • Uses the relationship between the policy and the implicit reward directly, without explicitly training a reward model

    ⇒ Core problem: MPA methods try to mitigate conflict by adding constraint losses, but the updates still accumulate in the same parameter space and undermine the stability of parameter updates
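
For reference, the two models summarized above are commonly written as follows. This is the textbook form, not a transcription of the paper's equations: the preference weights $w_i$, the scale $\beta$, the sigmoid $\sigma$, and the trainable policy $\pi_\theta$ are notation assumed here.

```latex
% Bradley-Terry preference probability with a weighted-sum latent reward
r^*(x,y) = \sum_i w_i \, r^*_i(x,y), \qquad
P(y_1 \succ y_2 \mid x) =
  \frac{\exp\big(r^*(x,y_1)\big)}{\exp\big(r^*(x,y_1)\big) + \exp\big(r^*(x,y_2)\big)}

% DPO loss: raise y_w and lower y_l relative to the reference model \pi_0
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_0) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_0(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_0(y_l \mid x)}
    \right)\right]
```

Because every $r^*_i$ enters through a single sum, all preferences push on the same parameters, which is the source of conflict noted above.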

์—ฐ๊ตฌ ๋™๊ธฐ

  • LLM alignment์—์„œ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๋ชฉํ‘œ 3๊ฐ€์ง€
    • Helpfulness
    • Harmlessness
    • Honesty/Truthfulness

    โ‡’ ํ•˜๋‚˜์˜ objective๋ฅผ ๊ฐœ์„ ํ•˜๋ฉด ๋‹ค๋ฅธ objective๊ฐ€ ์•…ํ™”๋˜๋Š” ๊ทผ๋ณธ์ ์ธ trade-off ๋ฌธ์ œ ์กด์žฌ

  • Overview and limitations of existing multi-preference (or multi-objective) alignment methods
    • Data-based approaches
      • Data mixing based on selection / weighting / scoring
      • Limitation: requires a lot of human labor + systematic bias
    • Model merging
      • Combines models that each hold a different preference
      • Limitation: per-objective performance drops because of the Pareto compromise
    • RLHF (Dynamic reward / Multi-objective reward)
      • Trains while changing the reward weights depending on the situation / considers multiple rewards as a weighted sum
        ⇒ Gently steers the training direction
      • Limitation: stays at the level of adjusting the trajectory in the global parameter space
        ⇒ The internal structure of the parameters is not changed, so gradient interference occurs
  • Key insight
    The cause of multi-objective conflict is the non-orthogonality of the gradients.

    Is the inner product nonzero? → The gradients of different objectives interfere as they update the parameters

    What about a simple weighted sum, as in existing RLHF? → The two gradients can sum to zero, so training can stall
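
The two cases in the key insight can be checked on toy numbers. The sketch below is illustrative only: the gradient values and the equal weights are made up, and the helper names `dot` and `weighted_sum` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(grads, weights):
    """Combine several gradients into one update direction."""
    dim = len(grads[0])
    return [sum(w * g[i] for g, w in zip(grads, weights)) for i in range(dim)]


# Toy gradients (made-up values) for two objectives.
g_helpful = [1.0, 0.5]
g_harmless = [-1.0, -0.5]  # points the opposite way: a direct conflict

# Nonzero (here negative) inner product: the objectives interfere.
conflict = dot(g_helpful, g_harmless)

# An equal-weight sum cancels completely, so a step along it changes nothing.
combined = weighted_sum([g_helpful, g_harmless], [0.5, 0.5])

# An orthogonal gradient has zero inner product and does not cancel.
g_orth = [-0.5, 1.0]
no_conflict = dot(g_helpful, g_orth)
combined_orth = weighted_sum([g_helpful, g_orth], [0.5, 0.5])

print(conflict, combined, no_conflict, combined_orth)
```

In the orthogonal case each objective keeps its full component along its own direction, which is the situation the proposed method tries to enforce by construction.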


์ œ์•ˆ ์•„์ด๋””์–ด

์„œ๋กœ ๋‹ค๋ฅธ objective๋“ค์„ โ€œ์ˆ˜ํ•™์ ์œผ๋กœ ๊ฐ„์„ญํ•˜์ง€ ์•Š๋Š” ๋ฐฉํ–ฅโ€์œผ๋กœ ํ•™์Šตํ•  ์ˆ˜๋Š” ์—†์„๊นŒ?
โ†’ ์• ์ดˆ์— ์„œ๋กœ ๋‹ค๋ฅธ preference๋ฅผ ๋‹ค๋ฅธ ๊ณต๊ฐ„์—์„œ ํ•™์Šตํ•˜์ž!
  • ํŒŒ๋ผ๋ฏธํ„ฐ ์—…๋ฐ์ดํŠธ ๊ณต๊ฐ„์„ orthogonal subspace๋กœ ๋ถ„ํ•ดํ•˜์—ฌ, objective ๊ฐ„ ๊ฐ„์„ญ์„ ์›์ฒœ์ ์œผ๋กœ ์ œ๊ฑฐ
    • SVD๋กœ ๋ชจ๋ธ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ WW๏ปฟ ๋ถ„ํ•ด
      • W=UฮฃVTW=U\Sigma{V^T}๏ปฟ
        • ์ƒ์œ„ singular vector โ†’ ํ˜„์žฌ preference์˜ ์ฃผ์š” ๋ฐฉํ–ฅ (์ด๋ฏธ ํ•™์Šต๋œ ์ •๋ณด๊ฐ€ ๋งŽ์ด ๋‹ด๊ธด ๊ณต๊ฐ„)
        • ํ•˜์œ„ singular vector โ†’ ํ˜„์žฌ preference์— ๋œ ์ค‘์š”ํ•œ ๋ฐฉํ–ฅ (๊ฑฐ์˜ ์˜ํ–ฅ์ด ์—†๋Š” ๊ณต๊ฐ„)

    โ‡’ ํ•˜์œ„ ๋ฒกํ„ฐ ๊ณต๊ฐ„์—์„œ ์ƒˆ๋กœ์šด preference๋ฅผ ํ•™์Šตํ•˜๋ฉด ๊ธฐ์กด preference๋ฅผ ๋œ ์นจ๋ฒ”ํ•˜๋ฉฐ, gradient ์ถฉ๋Œ์ด ๊ฐ์†Œํ•œ๋‹ค!
    โ‡’ ์ƒ์œ„ ๋ฒกํ„ฐ ๊ณต๊ฐ„๊ณผ ์ง๊ตํ•˜๋Š” ๊ณต๊ฐ„์ธ Orthogonal projection ํ–‰๋ ฌ PโŠฅP_\perp๏ปฟ๋กœ ์ƒˆ๋กœ์šด gradient๋ฅผ ํˆฌ์˜ํ•˜๋ฉด ๊ธฐ์กด objective์™€ ๊ฒน์น˜๋Š” ์„ฑ๋ถ„์ด ์ œ๊ฑฐ๋œ๋‹ค!


Methods

  • Orthogonalized Preference Updates with Stability Control
    If updates for a new preference are restricted to the orthogonal subspace, the existing safety alignment is left untouched.

    • $\Delta W = BA$
      • LoRA-like low-rank adaptation → the update matrix learned for the first preference (e.g., safety alignment) → updates only specific directions instead of touching every parameter
      • Leading part (top $r$ singular components): directions that mainly determine safety performance (principal subspace)
      • Trailing part (remaining singular components): almost no effect on safety, nearly orthogonal to the existing preference
    • Two constraints for this
      • Subspace constraint $\Delta \theta \in \mathit{S}_k^\bot$, where $\mathit{S}_k$ = safety principal subspace (the directions most important to safety)
        : forces the update to be fully orthogonal to the principal safety directions
      • Spectral constraint $||\Delta W||_2 \le \tau$
        : bounds the largest singular value to prevent safety drift
  • Adaptive Subspace-Rank Selection
    • Into what linear combination of directions does $\Delta W$ transform ${\mathbf X}_{safe}$?
      • $u_i$: output direction
      • $c_i$: contribution of that direction
    • Originally the tail directions had no influence, but after the update their singular values grow and those directions start to affect safety
    • Decide dynamically how much of the tail space to allow!
      • Rescale the last $k$ singular values to the mean of the top $r$
        • This tests how much safety would be shaken if the tail directions grew to the level of the top ones
      • Measure the change in safety reward
      • Choose the largest $k$ for which the change stays at or below the tolerance $\tau$
  • Subspace-constrained Multi-Preference Alignment
    Instead of using the new preference's gradient as is, project it into the orthogonal subspace selected above.
    • Stack the $k$ direction vectors selected earlier into a matrix $\hat{U}$
      • The set of safe directions judged to satisfy “updating only inside this space does not shake safety much”
    • Define the projection matrix $P = \hat{U}\hat{U}^T$
      • Any input vector is projected onto the subspace spanned by $\hat{U}$
    • Gradient update
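
The last two steps (pick the largest tolerable $k$, then project with $P = \hat{U}\hat{U}^T$) can be sketched as below. The measured safety changes, the direction vectors, and the helper names `select_rank` and `project_onto` are hypothetical; in practice the changes would come from evaluating a safety reward after rescaling the last $k$ singular values.

```python
def select_rank(drift_by_k, tau):
    """Largest k whose measured safety change is at most tau (0 if none)."""
    allowed = [k for k, drift in drift_by_k.items() if drift <= tau]
    return max(allowed, default=0)


def project_onto(g, basis):
    """Return U_hat U_hat^T g for orthonormal direction vectors in `basis`."""
    out = [0.0] * len(g)
    for u in basis:
        coeff = sum(ui * gi for ui, gi in zip(u, g))
        out = [o + coeff * ui for o, ui in zip(out, u)]
    return out


# Hypothetical measurements: safety-reward change when the last k
# singular values are rescaled to the mean of the top r.
drift_by_k = {1: 0.01, 2: 0.03, 3: 0.12, 4: 0.40}
tau = 0.05
k = select_rank(drift_by_k, tau)

# Toy orthonormal tail directions in R^4, least important first.
tail_dirs = [
    [0.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0],
]
U_hat = tail_dirs[:k]

# The new preference's gradient, restricted to the selected safe subspace.
g = [0.5, -0.2, 0.8, 0.1]
g_proj = project_onto(g, U_hat)
print(k, g_proj)
```

The update would then use `g_proj` in place of `g`, for example a plain step $\theta \leftarrow \theta - \eta P g$ with an assumed learning rate $\eta$; the notes do not record the exact update formula.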

Experiments

  • Models and datasets used
    • LLM: Llama3-SFT, Mistral-7B-SFT
    • Training data
      • Helpful: Helpsteer2, UltraFeedback
      • Harmless: SafeRLHF-10k
      • Truthful: Helpsteer2, UltraFeedback
    • Evaluation benchmarks (metrics)
      • Helpfulness: Alpaca-Eval (Win rate)
      • Harmlessness: AdvBench (Harmless Rate: refusal rate on harmful queries)
      • Truthfulness: TruthfulQA (MC2: multiple-choice accuracy)
  • Comparison with existing baselines
    Can OrthAlign balance multi-objective preferences better than existing methods?
    • Setup: Sequential Preference Optimization
      • Train in the order Harmless → Helpful → Truthful and check that earlier preferences do not break
    • Results
      • Harmless + Helpful ⇒ 8.77% average improvement over existing methods
      • Harmless + Helpful + Truthful ⇒ improves by an even larger margin

      ⇒ Much more stable than the simple weighted-sum approach

  • Stability at the representation level

    If the internal representations change, the way earlier preferences were represented also breaks, which leads to performance degradation

    → Check whether the distribution of the previously aligned preference is well preserved

    • Setup
      • Sample 3,000 training examples for the model aligned on the first preference
      • Extract hidden states
      • Then align several additional preferences sequentially
      • Extract hidden states for the same inputs from the final model
      • Visualize the two distributions with t-SNE
    • Results
      • The distribution from the first alignment stage is preserved almost unchanged
        ⇒ Adding new preferences barely touches the existing representation structure
        ⇒ Parameter conflict is removed
  • ๋ฒ ์ด์Šค๋ผ์ธ ๋ชจ๋ธ ์ ์šฉ ์‹คํ—˜
    OrthAlign์ด โ€œ์ƒˆ๋กœ์šด ์•Œ๊ณ ๋ฆฌ์ฆ˜โ€์ธ์ง€

    ์•„๋‹ˆ๋ฉด โ€œplug-and-play ๋ชจ๋“ˆโ€์ธ์ง€ ๊ฒ€์ฆ

    • ์‹คํ—˜ ๋ฐฉ๋ฒ•
      • ๊ธฐ์กด ๋ฒ ์ด์Šค๋ผ์ธ (e.g., DPO, SPO)์— subspace projection๋งŒ ์ถ”๊ฐ€
    • ์‹คํ—˜ ๊ฒฐ๊ณผ
      • Harmless๊ฐ€ Helpfulness๋ณด๋‹ค ํฌ๊ฒŒ ํ–ฅ์ƒ๋จ
  • Adaptive Subspace-Rank์˜ ํšจ๊ณผ ๊ฒ€์ฆ
    • Rank๊ฐ€ ์ปค์งˆ์ˆ˜๋ก ์•ˆ์ •์„ฑ์ด ๋–จ์–ด์ง
      • ๊ธฐ์กด preference์˜ "์ค‘์š”ํ•œ ๋ฐฉํ–ฅ"์„ ์ ๊ฒŒ ๋ณดํ˜ธํ•œ๋‹ค๋Š” ๋œป
      • ์ฆ‰, ์ƒˆ๋กœ์šด preference๊ฐ€ ๊ธฐ์กด ์•ˆ์ „ ๋ฐฉํ–ฅ๊นŒ์ง€ ์นจ๋ฒ” ๊ฐ€๋Šฅ
    • Helpful ์ ์ˆ˜๋Š” rank์™€ ์ƒ๊ด€์—†์ด ์•ˆ์ •์ ์ž„
      • Helpful ๋ฐฉํ–ฅ์€ ์ถฉ๋ถ„ํ•œ ์ž์œ  ๊ณต๊ฐ„๋งŒ ํ™•๋ณด๋˜๋ฉด ์„ฑ๋Šฅ์ด ์•ˆ์ •ํ™”๋จ
      • ๋„ˆ๋ฌด ๋งŽ์€ rank๋ฅผ ์—ด์–ด์ค˜๋„ ๋” ์ข‹์•„์ง€์ง€ ์•Š์Œ

Categories

ALIGNMENT MPA research