21 January 2026

MAP: Multi-Human-Value Alignment Palette

이승환

Review

Nickname: one-line review (rating out of 5)

  • 맹구: The idea seems clear. In this respect it may even do better than people, who often lose sight of their initial goal while learning and wander off on a tangent. It makes me curious whether that actually happened during the experiments. (4.2)
  • 계란초밥: I had always taken hyperparameter search on a linear objective, and trade-offs among several objectives, for granted, so the idea of looking for the theoretical ideal is refreshing. It felt like a knock on the head! If a model can be optimized in the direction and with the strength we want, models may not need to get any bigger. (4.3)
  • 국밥: Reading multi-value alignment as a dual convex optimization, so that it can automatically be decided whether the target levels are satisfiable, is a striking idea. It should clearly cut down on trial and error. I look forward to follow-up work that, instead of using fixed targets, adjusts the value targets on its own as questions arrive in dynamically changing situations. (4.5)
  • 피자: When a model is trained around a single capability, other capabilities tend to drop. The paper is meaningful in that it tries to solve this by approaching it the other way around, starting from the targets. If alignment over several rounds were combined with growing the model size, follow-up work might get close to a near-perfect model. (4.4)
  • 치킨: Changing one's point of view is really hard. Bringing the Pareto frontier into the multi-human-value alignment problem, so that the desired value levels go in as input and the weights come out as output, has a lot of impact. (4)
  • 햄버거: At first I worried that I might underestimate the model's ability and miss a better state (a higher combination of values), but when the goal is set clearly and you want behavior that matches it, this actually seems like the right approach. (4.3)
  • 페브리즈: Previously the goal was reached by adjusting weights. Here you start from the goal and first ask whether it is achievable, which feels more direct. It is an intuitively convincing idea. Why did I never think of this?! (4.5)

TL; DR

💡

Instead of tuning weights as existing methods do, first specify the desired target levels (the palette), then automatically find the λ that satisfies them, turning multi-value alignment into alignment that guarantees Pareto improvement!

Summary

  • cited: 14

Preliminary

Pareto Frontier

  • When several objectives trade off against one another, the Pareto frontier is the set of boundary points where improving any one objective necessarily worsens at least one other
  • Pareto Optimization
    • In a multi-objective optimization problem, a way of finding optimal states in which no objective can be improved any further without hurting another
    • Put simply... let's achieve every goal we set, as far as they can all go together!
      • e.g., goals: studying, exercise
        • Exercising too hard makes you sleepy and gets in the way of studying (a trade-off!)
        • → Exercising for just 30 minutes and then studying is good for both the mind and the body (both goals met!)

⇒ The paper's viewpoint: can several human values be Pareto-optimized at the same time?
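
To make the frontier definition concrete, here is a minimal sketch that filters a set of points down to the non-dominated ones. The `plans` data (study score, fitness score) is made up for illustration, and higher is assumed to be better on both axes.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (study, fitness) outcomes of five daily plans.
plans = [(9, 2), (7, 6), (4, 8), (5, 5), (3, 3)]
frontier = pareto_frontier(plans)
print(frontier)
```

Here `(5, 5)` and `(3, 3)` drop out because `(7, 6)` is better on both axes, while the three remaining plans can only be traded against each other.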

Quantile ← (will show up in the experiments)

  • Indicates where a score sits within the whole distribution
  • e.g.,
    • 50% quantile = median
      • Half of all results are below it and half are above it
    • 80% quantile = the cutoff for the top 20%
    • 90% quantile = the cutoff for the top 10%
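
A small sketch of the same idea in code, assuming a simple nearest-rank rule on a sorted sample (other quantile conventions interpolate and can give slightly different cutoffs). The `scores` list is illustrative.

```python
def quantile_threshold(scores, percent):
    """Nearest-rank cutoff: the score sitting at the given percent of the sorted sample."""
    ordered = sorted(scores)
    index = min(percent * len(ordered) // 100, len(ordered) - 1)
    return ordered[index]


def share_at_or_above(scores, threshold):
    """Fraction of the sample that reaches the threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


scores = list(range(1, 11))  # 1, 2, ..., 10
cutoff_80 = quantile_threshold(scores, 80)
print(cutoff_80, share_at_or_above(scores, cutoff_80))
```

With ten scores, the 80% cutoff is 9, and exactly the top 20% of the sample (9 and 10) reaches it.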

Introduction

Background

  • Existing approaches to human value alignment
    • Human value alignment of LLMs has mostly been done by reinforcing a specific value (e.g., helpfulness, harmlessness, …) through a reward function or preference data
    • Multi-Objective Reinforcement Learning (MORL) has been used to align multiple human values
  • Limitations of existing multi-value alignment
    • Most work approximates the trade-off by linearly combining several reward functions

      $R = \lambda_1 r_1 + \lambda_2 r_2 + \cdots$

    • e.g., Rewarded Soup: models trained on different values are mixed after the fact
  • Problems
    • How should λ (the weights) be chosen?
    • How can we tell whether the chosen λ is Pareto optimal?

Motivation

  • Aligning several human values at the same time comes with several challenges

  • RQ1: Can several human values be improved simultaneously without damaging any of them? And can this be quantified?
    • Aligning one value can unintentionally lower another
      • Helpfulness ↑ → Harmlessness ↓
      • Humor ↑ → Coherence ↓

  • RQ2: Can all human values be aligned toward a Pareto improvement with a single configuration, without trial and error?
    • Uncertainty in hyperparameter choice: in RLHF, can the reward function $R$ and the hyperparameter $\beta$ needed to turn the original model $p_0$ into the aligned model $p$ be obtained in one shot?
    • Finding good weight parameters is really hard
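
For reference, the symbols in RQ2 are tied together by the standard KL-regularized RLHF objective. The block below is the textbook form and its well-known closed-form solution, written here as background; it is not quoted from the paper, and $\mathcal{D}$ (the prompt distribution) is notation assumed for this sketch.

```latex
\max_{p}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim p(\cdot \mid x)}\big[ R(x, y) \big]
\;-\; \beta \, \mathrm{KL}\big( p \,\|\, p_0 \big)
\qquad\Longrightarrow\qquad
p(y \mid x) \;\propto\; p_0(y \mid x)\, \exp\!\big( R(x, y) / \beta \big)
```

Read this way, choosing $R$ and $\beta$ fixes the aligned model, which is why picking them by trial and error is costly.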

Contribution

  1. Proposes the MAP framework
    • A framework that aligns several human values at once while letting the user directly specify a target level for each value
  2. Recasts multi-value alignment as a constrained optimization problem of meeting target levels, not a problem of tuning reward weights

Method: MAP

Existing approaches (RLHF / DPO / MORL)

  • To align several values, the usual recipe is a weighted sum like the one below
    Reward = λ₁·Helpfulness + λ₂·Harmlessness + λ₃·Humor + …
  • Problems:
    • There is no clear criterion for choosing the weights λ (it is hard to get a feel for how to set them)
    • Even a small change in λ can change the result a lot
    • Most choices of λ raise only one value and ruin the others (trade-off)
    • Only a tiny fraction of λ choices are good
    • The more values there are to align, the harder the search gets (very, very hard)
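
The sensitivity and search-cost problems with a weighted sum can be seen in a toy sketch. The candidate scores and weight vectors below are invented for illustration; only the weighted-sum form comes from the notes above.

```python
def scalarize(scores, lam):
    """Weighted-sum reward: lam_1 * score_1 + lam_2 * score_2 + ..."""
    return sum(weight * score for weight, score in zip(lam, scores))


# Hypothetical (helpfulness, harmlessness, humor) scores of two responses.
candidates = {
    "A": (0.9, 0.2, 0.5),  # helpful but less harmless
    "B": (0.4, 0.8, 0.5),  # harmless but less helpful
}


def best(lam):
    """Name of the response preferred under the weights lam."""
    return max(candidates, key=lambda name: scalarize(candidates[name], lam))


print(best((0.55, 0.45, 0.0)), best((0.50, 0.50, 0.0)))

# A grid with `steps` settings per weight grows exponentially in the number of values.
steps = 10
grid_sizes = {n_values: steps ** n_values for n_values in (3, 6)}
print(grid_sizes)
```

Shifting 0.05 of weight flips the preferred response from A to B, and a 10-point grid over six values already has a million combinations to try.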

⇒ Let's change the perspective!

  • Existing: "Will the result be okay with weights about like this?"
  • MAP: "These levels must be met, no matter what!!"

    ⇒ Use the target levels themselves as the input, instead of λ

  • Not maximization, but a constraint: guarantee at least this level
  • ⇒ Enter the MULTI-HUMAN-VALUE ALIGNMENT PALETTE (MAP)
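
One way to write this constrained view down is sketched below. It is a reading of the idea under stated assumptions, not a quotation of the paper: the prompt $x$ is suppressed for brevity, $c = (c_1, \dots, c_m)$ denotes the palette, and the dual is derived with a standard Lagrangian argument.

```latex
% Primal: stay close to the base model while meeting every target level
\min_{p}\; \mathrm{KL}\big( p \,\|\, p_0 \big)
\quad \text{s.t.} \quad
\mathbb{E}_{y \sim p}\big[ r_i(y) \big] \;\ge\; c_i, \qquad i = 1, \dots, m

% For multipliers \lambda \ge 0, the inner minimizer is an exponential tilt of p_0
p_\lambda(y) \;\propto\; p_0(y)\, \exp\!\big( \lambda^\top r(y) \big)

% Dual: concave in \lambda, with Z(\lambda) = \mathbb{E}_{y \sim p_0}\big[ \exp(\lambda^\top r(y)) \big]
\max_{\lambda \ge 0}\; \lambda^\top c \;-\; \log Z(\lambda)
```

Under this reading, the weights are no longer an input: they come out as the multipliers $\lambda$ of the target constraints, and the combined reward is $R = \lambda^\top r$.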

MAP's three-step process

Input/Output

  • Input
    • $r$: score functions, one per value
    • e.g., $r = [r_{\text{help}}, r_{\text{harmless}}, r_{\text{humor}}, \dots]$
      • Each is a function that returns the score of a generated response $y$ for its value (help, harmless, humor, …)
    • $p_0(\cdot \mid x)$: the base model before alignment
    • $x \sim \mathcal{D}$: a prompt $x$ drawn from the data distribution $\mathcal{D}$
  • Output
    • The final generated answer $y$, with alignment to the multiple values applied

Step 1: Value Palette (setting target levels)

  • Key idea: change the perspective!!
  • Existing methods vary the weights ($\lambda$) until the targets are reached, whereas MAP goes the other way and sets the targets first
  • Value Palette: a vector collecting the target level the user wants for each value
    • The user specifies the target level of each value directly
  • Example: Harmlessness 70%, Humor 60%, Helpfulness 80%
    # Target quantiles (in percent) of the base model's score distribution
    palette = {
        "Helpfulness": 80,   # 80% quantile: top 20% level
        "Harmlessness": 70,  # 70% quantile: top 30% level
        "Humor": 60,         # 60% quantile: top 40% level
    }
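
A quantile palette only becomes a numeric constraint once it is mapped onto the base model's score distribution. The sketch below shows one plausible way to do that, assuming scores of responses sampled from the base model are available per value. The `base_scores` numbers and the nearest-rank rule are assumptions made for illustration.

```python
def palette_to_thresholds(base_scores, palette):
    """Turn percent quantiles into absolute score targets, one per value."""
    thresholds = {}
    for value, percent in palette.items():
        ordered = sorted(base_scores[value])
        index = min(percent * len(ordered) // 100, len(ordered) - 1)
        thresholds[value] = ordered[index]  # nearest-rank cutoff
    return thresholds


# Hypothetical scores of ten base-model responses under each score function.
base_scores = {
    "Helpfulness": [-3, -2, -1, 0, 1, 2, 3, 4, 5, 6],
    "Harmlessness": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "Humor": [0, 10, 20, 30, 40, 50, 60, 70, 80, 90],
}
example_palette = {"Helpfulness": 80, "Harmlessness": 70, "Humor": 60}
targets = palette_to_thresholds(base_scores, example_palette)
print(targets)
```

Each target is expressed on the scale of its own score function, so values measured in very different units can still share one palette.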

Step 2: Feasibility Check

  • Key idea: verify whether the targets set in Step 1, i.e. the Value Palette, are actually achievable
  • With existing methods you cannot know whether a run will fail until you try it, but MAP rules out failure in advance
    • First check whether the targets can all be satisfied at once
    • Decide whether the Value Palette is theoretically attainable
      • Infeasible ⇒ report that the targets are out of reach for the current model and suggest an alternative palette (see the example)
      • Feasible ⇒ automatically compute the weight vector $\lambda$ that meets the targets and the final single reward function $R(x,y) = \lambda^\top r(x,y)$!
    # Feasibility check (pseudocode: MAP.check is an illustrative name)
    result = MAP.check([80, 70, 60])  # [Helpfulness, Harmlessness, Humor]

    # Case 1: feasible
    # → "Feasible: returns λ and R(x, y) = λ^T r(x, y)"

    # Case 2: infeasible
    # → "Infeasible. How about [70, 60, 65]?"
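
As a rough illustration of how such a check could work, the sketch below runs projected gradient ascent on λ over a fixed set of base-model samples and then tests whether the reweighted sample means reach the targets. This is a simplified stand-in under stated assumptions, not the paper's solver: `check_palette`, the step size, the tolerance, and the synthetic Gaussian rewards are all hypothetical.

```python
import math
import random


def tilted_means(rewards, lam):
    """Mean reward vector when sample j is weighted by exp(lam . r_j)."""
    logits = [sum(l * r for l, r in zip(lam, row)) for row in rewards]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [
        sum(w * row[i] for w, row in zip(weights, rewards)) / total
        for i in range(len(lam))
    ]


def check_palette(rewards, targets, lr=0.5, steps=300, tol=1e-2):
    """Return (feasible, lam, achieved) for absolute targets on sampled rewards."""
    lam = [0.0] * len(targets)
    for _ in range(steps):
        means = tilted_means(rewards, lam)
        # Raise lam_i while value i is below target; never let it go negative.
        lam = [max(0.0, l + lr * (c - m)) for l, c, m in zip(lam, targets, means)]
    achieved = tilted_means(rewards, lam)
    feasible = all(m >= c - tol for m, c in zip(achieved, targets))
    return feasible, lam, achieved


# Synthetic stand-in for two score functions evaluated on 500 base-model samples.
rng = random.Random(0)
rewards = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]

ok, lam, achieved = check_palette(rewards, [0.5, 0.5])
too_high, _, _ = check_palette(rewards, [10.0, 10.0])
print(ok, lam, too_high)
```

When the targets are reachable, the returned `lam` plays the role of λ in R(x, y) = λ^T r(x, y). When they are not, λ keeps growing and the achieved means stall below the targets, which is the signal to propose a more modest palette.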

Step 3: Align model

  • Using the final reward $R$ built in Step 2, actually perform the alignment (there are two ways to do it)
  1. MAP-D (Decoding)
    • Adjusts only at generation time
    • How
      1. For a prompt $x$, generate candidate answers $y^{(1)}, \dots, y^{(m)}$
      2. Sample among them with softmax probabilities, so that candidates with larger $R(x, y^{(i)})$ are picked more often
    • Pros: fast and simple, since the model parameters are left untouched
    • Cons: the model itself does not change, so the alignment effect may be limited
  2. MAP-F (Finetuning)
    • Train the model itself to reach the palette
    • How
      1. Fine-tune $p_0 \to \hat{p}$ with PPO, using $R$ as the reward
      2. Afterwards, generate answers with $\hat{p}$
    • Pros: a stronger alignment effect
    • Cons: expensive training, and retraining is needed every time the palette changes
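
The MAP-D steps can be sketched in a few lines. This is a minimal illustration that assumes the candidates have already been generated by the base model and that a combined reward function is available. `map_d_sample` and the toy reward are hypothetical names, not part of any library.

```python
import math
import random


def map_d_sample(candidates, reward, rng=random):
    """Pick one candidate with probability proportional to exp(R(x, y_i))."""
    scores = [reward(y) for y in candidates]
    top = max(scores)  # subtract the max so exp() cannot overflow
    weights = [math.exp(s - top) for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy combined reward standing in for R(x, y) = lam^T r(x, y).
toy_reward = {"rude reply": 0.0, "vague reply": 0.0, "kind, useful reply": 50.0}
rng = random.Random(0)
picked = map_d_sample(list(toy_reward), toy_reward.get, rng=rng)
print(picked)
```

Because the candidates come from $p_0$ and are reweighted by the exponentiated reward, no parameters change, but the quality of the result is capped by what the base model happens to propose.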

Experiment

Experiment Setup

  • Datasets
    • Anthropic Harmless Data: conversations delimited by "Human:" and "Assistant:" tags
    • IMDB (movie reviews of at least 30 characters)
  • Models
    • OPT-1.3B
    • Llama2-7B-chat
  • Aligned Values
    • Humor
    • Positiveness
    • Harmlessness
    • Helpfulness
    • Diversity
    • Coherence
    • Perplexity
  • Evaluation Models
    • Humor: humor detection logits
    • Positiveness: DistilBERT (IMDB)
    • Harmlessness, Helpfulness: GPT-2 with a fine-tuned value head
    • Diversity: ratio of unique n-grams (n=2,3,4)
    • Coherence: SimCSE BERT sentence embeddings
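
Of these metrics, the unique n-gram ratio is simple enough to sketch directly. The version below assumes whitespace tokenization and a plain average over n = 2, 3, 4. Both are assumptions for illustration, and the paper's exact aggregation may differ.

```python
def distinct_n(tokens, n):
    """Share of n-grams in the token list that are unique."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def diversity(text, ns=(2, 3, 4)):
    """Average unique n-gram ratio over the given n values."""
    tokens = text.split()
    return sum(distinct_n(tokens, n) for n in ns) / len(ns)


varied = diversity("the quick brown fox jumps")
repetitive = diversity("ha ha ha ha ha")
print(varied, repetitive)
```

A response with no repeated phrases scores 1.0, while one that loops over the same token scores far lower.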

Effect of Multi-value Alignment

  • Goal: verify whether MAP can effectively align several values at the same time
  • Experimental setup
    • model: OPT-1.3B
    • data: Anthropic conversational data
    • aligned values: Humor, Harmlessness, Helpfulness, Diversity, Coherence, Perplexity
    • HHH-{number}%

      Align so that each of the three values Helpfulness, Harmlessness, and Humor reaches at least the {number}% quantile of the original model's scores (for 50%, the median)

    • Value Palette settings
      1. Multi-value palettes (let's improve 3 values at once!)
        • HHH-50%: the first 3 values at the 50% quantile
        • HHH-60%: the first 3 values at the 60% quantile
        • HHH-70%: the first 3 values at the 70% quantile
        • HHH-80%: the first 3 values at the 80% quantile (judged infeasible in Step 2)
      2. Single-value palettes (align only 1 value)
        • Humor-80%
        • Helpfulness-80%
        • Harmlessness-80%
    • Implementation
      • MAP-D (Decoding): Best-of-N sampling
      • MAP-F (Finetuning): uses PPO
  • Results
    1. Strengths of multi-value alignment
      • Balanced improvement: all 3 values improve together (HHH-50%, 60%, 70%)
      • Minimal trade-off: the remaining 3 values (Diversity, Coherence, Perplexity) are maintained
      • The higher the quantile, the larger the improvement
    2. Problems with single-value alignment
      • Severe trade-off: improving one value sharply degrades others
        • Humor-80%: Helpfulness worsens to -2.49
        • Helpfulness-80%: Harmlessness worsens to -0.58
        • Harmlessness-80%: Helpfulness worsens to -2.02
      • Unpredictable: there is no way to know in advance which value will degrade

Larger model Ablation Study

  • Goal: verify whether the range of palettes MAP can achieve (the feasible palettes) expands as the model gets larger
  • Experimental setup
    • model: Llama2-7B-chat (more than 5 times larger than OPT-1.3B)
    • data: Anthropic prompt data
    • Constraint: because of GPU memory limits, only MAP-D (Decoding) could be run, not MAP-F
  • Results
    • The larger the model, the more multi-value palettes become feasible
    • Llama2-7B is more expressive and flexible, so it can reach targets that were infeasible for OPT-1.3B
    • The feasibility decision in Step 2 reflects the difference in model capacity

Simultaneous vs Sequential Alignment

  • Goal: compare the performance of aligning multiple values all at once (MAP) with aligning them one at a time, repeatedly (Sequential)
  • Experimental setup
    • model: OPT-1.3B
    • data: Anthropic conversational data
    • baselines
      1. MAP (Simultaneous): all 6 values aligned at once
      2. Sequential Round 1: each value aligned once, in order (6 alignment passes)
      3. Sequential Round 5: each value aligned 5 times, in order (30 alignment passes)
    • Alignment order

      Round 1: Humor → Harmlessness → Helpfulness → Diversity → Coherence → Perplexity
      Round 2: Humor → Harmlessness → Helpfulness → Diversity → Coherence → Perplexity
      ...
      Round 5: Humor → Harmlessness → Helpfulness → Diversity → Coherence → Perplexity

  • Results
    1. 1 round is not enough ⇒ catastrophic forgetting occurs
      • Values aligned later improve, but
      • values aligned earlier degrade again
    2. 5 rounds are enough ⇒ nearly on par with MAP
      • With enough repetitions, all values converge to a balanced level

Categories

research