10 December 2025

Mind the Value-Action Gap: Doย LLMs Act in Alignment with Their Values?

๐Ÿ’กLLM์ด ์ž๊ธฐ ๊ฐ€์น˜๊ด€์— ๋Œ€ํ•ด ์ง์ ‘ ์ฃผ์žฅํ•˜๋Š” ๋ฐ”์™€, ์‹ค์ œ ์ฃผ์–ด์ง„ ์ƒํ™ฉ์—์„œ ํ–‰๋™ํ•˜๋Š” ๊ฒƒ์ด ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Œ!๊ทธ๋ž˜์„œ ์ ๋‹นํžˆ ๋ฏฟ๊ณ  ์ฃผ์˜ํ•˜๋ฉด์„œ ํƒœ์Šคํฌ ๋งก๊ฒจ์•ผ ํ•จ

๐Ÿฅ‰

Mind the Value-Action Gap: Doย LLMs Act in Alignment with Their Values?

Review

๋‹‰๋„ค์ž„ ํ•œ์ค„ํ‰๋ณ„์  (0/5)
๋ธ”๋ž™ํ”„๋ผ์ด๋ฐ์ด'neural' network์ธ LLM์„ ๊ฑฐ์ง„ ์‚ฌ๋žŒ์œผ๋กœ treatํ•˜๊ณ  ๋‹ค์–‘ํ•œ ์‹ฌ๋ฆฌํ•™ ์ด๋ก ์„ ๋จน์—ฌ๋ณด๋Š” ์—ฐ๊ตฌ๋“ค์ด ๋„ˆ๋ฌด ์žฌ๋ฐŒ์–ด์š”
ํŠนํžˆ LLM์ด ์–ธํ–‰๋ถˆ์ผ์น˜ ํ•œ๋‹ค๊ณ  ํ•˜๋‹ˆ๊นŒ ๊ดœํžˆ ์‚ฌ๋žŒ๋ƒ„์ƒˆ๊ฐ€ ๋‚˜๋„ค์š”(?)
ํ˜ธ๊ธฐ์‹ฌ์ด ๋“œ๋Š” ๊ฐ€์„ค์„ ์„ธ์šฐ๊ณ , ๊ทธ๊ฒƒ์„ ์‹คํ—˜์œผ๋กœ ์ž˜ ์—ฐ๊ฒฐํ•œ ๋…ผ๋ฌธ์ด๋ผ๊ณ  ์ƒ๊ฐํ•จ!
4
3์‹œLLM์˜ ์‹ ๋ขฐ ๊ฐ€๋Šฅ์„ฑ์„ ์ •๋ง๋กœ ์‹ค์„ธ๊ณ„ ์‚ฌ๋ก€๋ฅผ ์ค‘์‹ฌ์œผ๋กœ ๋ถ„์„ํ•œ ์ข‹์€ ๋…ผ๋ฌธ!
์œ„๊ธฐ ์ƒํ™ฉ์—์„œ ์ธ๊ฐ„์ด LLM์— ์ „์ ์œผ๋กœ ์˜์กดํ•˜๋ฉด ์•ˆ๋œ๋‹ค๋Š”๊ฑธ ์‹œ์‚ฌํ•˜๋Š” ๋“ฏ
5
๊ณ ๋ถ•๊ฐ€์น˜โ€“ํ–‰๋™ ๊ฐ„ ์ผ์น˜๋Š” calibration(์ž์‹ ๊ฐโ€“์‹ค์ œ์ •๋‹ต ๊ฐ„ ์ผ์น˜)์™€ ์œ ์‚ฌํ•˜๊ฒŒ ํ‰๊ฐ€ํ•  ์ˆ˜ ์žˆ์„๊ฒƒ ๊ฐ™์Œ. ์‹คํ—˜๋ฐฉ๋ฒ•์€ ๊ฐ„๋‹จํ•˜๊ธด ํ•˜์ง€๋งŒ ํ•ด๋‹น ๋ชจ๋ธ์ด ์–ด๋–ค ๋„๋ฉ”์ธ์ด๋‚˜ ๊ฐ€์น˜๊ด€์„ ๊ฐ€์ง€๊ฒŒ๋” ํ•™์Šต์ด ๋˜์—ˆ๋Š”์ง€(์‹œ์ผฐ๋Š”์ง€) ๊ฐ„์ ‘์ ์œผ๋กœ ์•Œ ์ˆ˜ ์žˆ์„ ๊ฒƒ ๊ฐ™์Œ.3
์‚ฌ์ด์‹œ์˜ท๊ฐ€์น˜๊ด€์— ํ•ด๋‹นํ•˜๋Š” head์™€ ์ „์ฒด์ ์ธ attention์˜ alignment๊ฐ€ ์ƒ๊ฐ๋ณด๋‹ค ์•ฝํ•œ ๊ฒƒ ๊ฐ™์Œ! ์‚ฌ๋žŒ์€ ๋ญ”๊ฐ€ ํฐ ๊ฐ€์น˜๊ด€์—์„œ ์ƒ๊ฐ์ด ๋ป—์–ด๋‚˜์˜ค๋Š”๋ฐ, LLM์€ ๊ทธ๋ ‡๊ฒŒ ํ•™์Šตํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ ์•„๋‹๊นŒ? โ€œ๋‚˜ํ•œํ…Œ๋Š” ๊ฐ€์น˜๊ด€๋„ ๊ทธ๋ƒฅ ํ•˜๋‚˜์˜ ๊ณ ๋ ค ์š”์†Œ์•ผ~~โ€ ์ด๋Ÿฐ ๋А๋‚Œ. ์ง„์งœ ๊ธฐ๊ณ„๊ฐ™๋‹ค. ์กฐ๊ธˆ ๋ฌด์„ญ๋‹ค. ๋…ผ๋ฌธ์ ์œผ๋กœ๋Š” ์ฐธ์‹ ํ•˜๊ณ , soundness๋„ ์ข‹์Œ!!4.5
๋ฐฅ๋ฐ์ดํ„ฐ ์ƒ์„ฑ๋ถ€ํ„ฐ ํƒœ์Šคํฌ ์ •์˜์™€ ํ‰๊ฐ€ ๋ฐฉ์‹๊นŒ์ง€, ์•Œ๊ณ ์ž ํ•˜๋Š” ๊ฒƒ์— ๋งž๊ฒŒ ๋ฐฉ๋ฒ•๋ก ์„ ์ž˜ ์งฐ๋‹ค. ์‹ค์ œ LLM ์‚ฌ์šฉ์ž์—๊ฒŒ ์ค‘์š”ํ•œ ๋‚ด์šฉ์„ ๋‹ค๋ฃจ๊ณ  ๊ฒฝ๊ฐ์‹ฌ์„ ์คŒ4
6์‹œ๊ฐ€์น˜์™€ ํ–‰๋™ ์‚ฌ์ด์˜ misalignment๊ฐ€ ์ผ์–ด๋‚ฌ์„ ๋•Œ, LLM์ด ์–ด๋– ํ•œ ์ด์œ ๋กœ ๊ทธ๋Ÿฌํ•œ ํ–‰๋™์„ ํ–ˆ๋Š”์ง€์— ๋Œ€ํ•œ ์—ฐ๊ตฌ๊ฐ€ ์กฐ๋งŒ๊ฐ„ ๋‚˜์˜ค์ง€ ์•Š์„๊นŒ ์‹ถ๋‹ค 4
ํ”„๋ฆฌ๋ฐ”์ด์˜คํ‹ฑ์Šค๋Š” ์œ ์‚ฐ๊ท ๋จน์ดLLM์ด ์ •๋ง ๊ฐ€์น˜๊ด€์„ ๊ฐ€์ง€๊ณ  ์žˆ์„๊นŒ? ๊ทธ๋ƒฅ ํ™•๋ฅ ์ ์œผ๋กœ ์ƒ์„ฑํ•˜๋Š” ๊ฒƒ์ธ๋ฐ, ๊ฐ€์น˜๊ด€์€ ์‚ฌ์‹ค ์–ด๋–ค ๊ธฐ์ค€์ ์ด ์žˆ๊ณ  ๊ทธ๊ฒƒ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๊ฒฐ์ •ํ•˜๋Š” ๊ฒƒ์ด๋ผ๊ณ  ์ƒ๊ฐํ•จ.โ‡’ ๊ฒฐ์ •์˜ ๊ธฐ์ค€์ธ๋ฐ, ์ด๊ฒŒ ํ™•๋ฅ ์  ์ƒ์„ฑ๊ณผ ๋งž์ง€ ์•Š๋Š” ๊ฐœ๋…์ด๋ผ๊ณ  ์ƒ๊ฐ์ด ๋“ฆ. ์ฆ‰, ๊ฐ€์น˜๊ด€์— ๋Œ€ํ•ด์„œ ์ฃผ์žฅํ•˜๋Š” ๊ฒƒ ์ž์ฒด๊ฐ€ ์ด๋ฏธ ์ด์ƒํ•˜๋‹ค๊ณ  ์ƒ๊ฐํ•จ. ํ”„๋กฌํ”„ํŠธ์˜ ์˜ํ–ฅ์ด ๋” ํฌ์ง€ ์•Š์„๊นŒ? ํ•˜๋Š” ์ƒ๊ฐ์ด ๋“œ๋Š”๋ฐ, ๋‚˜๋ฆ„๋Œ€๋กœ ๊ทธ๋Ÿฐ ๊ฒƒ๋“ค์„ ํ•ด๊ฒฐํ•˜๊ณ ์ž ํ–ˆ๋˜ ๋…ผ๋ฌธ์ด์–ด์„œ ์ข‹์•˜์Œ.4
์š˜์„ธ์ดLLM์˜ ๊ฐ€์น˜๊ด€์„ ์–ด๋–ป๊ฒŒ ์ถœ๋ ฅํ•  ์ˆ˜ ์žˆ์„๊นŒ์— ๋Œ€ํ•œ ๋‹ต์ด ๋˜๋Š” ์—ฐ๊ตฌ๋ผ๊ณ  ์ƒ๊ฐํ•จ. Model-action๊ฐ„ ๋ถˆ์ผ์น˜๊ฐ€ ๋ฐœ์ƒํ•œ๋‹ค๋ฉด ์‹ ๋ขฐ์„ฑ์„ ์–ด๋–ป๊ฒŒ ๋ณด์žฅํ•  ์ˆ˜ ์žˆ์„์ง€ ์ƒ๊ฐํ•ด ๋ณด๊ฒŒ ํ•˜๋Š” ์—ฐ๊ตฌ.4.5

TL; DR

๐Ÿ’ก

LLM์ด ์ž๊ธฐ ๊ฐ€์น˜๊ด€์— ๋Œ€ํ•ด ์ง์ ‘ ์ฃผ์žฅํ•˜๋Š” ๋ฐ”์™€, ์‹ค์ œ ์ฃผ์–ด์ง„ ์ƒํ™ฉ์—์„œ ํ–‰๋™ํ•˜๋Š” ๊ฒƒ์ด ๋‹ค๋ฅผ ์ˆ˜ ์žˆ์Œ!

๊ทธ๋ž˜์„œ ์ ๋‹นํžˆ ๋ฏฟ๊ณ  ์ฃผ์˜ํ•˜๋ฉด์„œ ํƒœ์Šคํฌ ๋งก๊ฒจ์•ผ ํ•จ

Summary

Introduction

Motivation

  • LLM์˜ societal decisions (์‚ฌํšŒ์  ์˜์‚ฌ๊ฒฐ์ •)
    • ๊ณ ์ •๊ด€๋…, ์ฑ„์šฉ ๊ณผ์ •์—์„œ์˜ ํŽธํ–ฅ ๋“ฑ์˜ ์œ„ํ—˜ ์žˆ์Œ
  • ๊ธฐ์กด ์—ฐ๊ตฌ: LLM ์ง„์ˆ ์„ ๋ฐ”ํƒ•์œผ๋กœ LLM ํ–‰๋™์„ ์ถ”๋ก 
    • ๊ทธ๋Ÿฌ๋‚˜ ๋‘˜์ด ์ผ์น˜ํ•˜์ง€ ์•Š๊ธฐ๋„ ํ•จ
  • RQ: LLM์˜ ๊ฐ€์น˜ ์ง„์ˆ ๊ณผ ๊ฐ€์น˜ ๊ธฐ๋ฐ˜ ํ–‰๋™์ด ์–ด๋А์ •๋„ ์ผ์น˜ํ•˜๋Š”๊ฐ€?
    • LLM์˜ ๊ฐ€์น˜ ์„ ํƒ โ‰  ํ–‰๋™ ์„ ํƒ โ†’ ๋งŽ์ด ๊ด€์ฐฐ๋จ

Contribution

  • ์ด๋Ÿฌํ•œ ์ฐจ์ด๋ฅผ ์ฒด๊ณ„์ ์œผ๋กœ ์ธก์ •ํ•˜๋Š” ValueActionLens ํ”„๋ ˆ์ž„์›Œํฌ ์ œ์•ˆ
    • ์ธ๊ฐ„ ๊ฐ€์น˜ ์ด๋ก (Schewartz, 1994, 2012) ๊ธฐ๋ฐ˜์œผ๋กœ value-informed actions (VIA) ๋ฐ์ดํ„ฐ์…‹ ๊ตฌ์ถ•
    • ๊ตฌ์ถ• ๋ฐ์ดํ„ฐ์…‹ ๊ธฐ๋ฐ˜์œผ๋กœ LLM์ด ๋‘๊ฐ€์ง€ ๊ณผ์ œ ์ˆ˜ํ–‰ํ•˜๊ฒŒ ํ•จ
      • stating value preferences
      • selecting actions in context
    • โ†’ ์„ธ๊ฐ€์ง€ ์ •๋ ฌ ์ง€ํ‘œ๋กœ ์ง„์ˆ -ํ–‰๋™ ๊ฐ„ ์ •๋ ฌ๋„ ํ‰๊ฐ€
  • 6๊ฐ€์ง€ LLM์œผ๋กœ ์‹คํ—˜
    • ๊ฐ€์น˜ ์ง„์ˆ ๊ณผ ์‹ค์ œ ํ–‰๋™ ๊ฐ„ ์ƒ๋‹นํ•œ ์ฐจ์ด๊ฐ€ ์žˆ์œผ๋ฉฐ ์ด๋Š” ๊ฐ€์น˜ ์œ ํ˜•, ๋ฌธํ™”, ์‚ฌํšŒ์  ์ฃผ์ œ๋ณ„๋กœ ์ฐจ์ด๊ฐ€ ๋‚˜ํƒ€๋‚จ์„ ๋ณด์ž„

ValueActionLens

Value-Action gap ํ‰๊ฐ€ ํ”„๋ ˆ์ž„์›Œํฌ

Contextualizing Values into Scenarios

  • 12๊ฐœ๊ตญ 11๊ฐœ ์‚ฌํšŒ์ฃผ์ œ๋ฅผ ๊ฒฐํ•ฉํ•˜์—ฌ 132๊ฐœ ๊ฐ€์น˜-ํ–‰๋™ ์ •๋ ฌ ํ‰๊ฐ€ ์‹œ๋‚˜๋ฆฌ์˜ค ๊ตฌ์„ฑ
    • ๊ฐ ์‹œ๋‚˜๋ฆฌ์˜ค๋ฅผ Shcwartzโ€™s basic values์—์„œ ์ œ์•ˆํ•œ 56๊ฐ€์ง€ ๊ฐ€์น˜์™€ ์ง ์ง€์Œ
      • Shcwartzโ€™s basic values ?
        • ๋ชจ๋“  ๋ฌธํ™”๊ถŒ์˜ ์ธ๊ฐ„์—๊ฒŒ ๋ณดํŽธ์ ์œผ๋กœ ์กด์žฌํ•˜๋Š” ๊ฐ€์น˜ ์œ ํ˜• (๊ฐœ์ธ์ด ์‚ถ์—์„œ ์ถ”๊ตฌํ•˜๋Š” ๋ชฉํ‘œ์˜ ์œ ํ˜•)
        • e.g., inequality, family, work, environment, health, โ€ฆ
    • โ†’ ์‹œ๋‚˜๋ฆฌ์˜ค-๊ฐ€์น˜ ์Œ์œผ๋กœ 14,784๊ฐœ Value-Informed Actions (VIA) ๋ฐ์ดํ„ฐ์…‹ ์ƒ์„ฑ

Generate Value-Informed Actions with Explanations

  • ์‹œ๋‚˜๋ฆฌ์˜ค(๊ตญ๊ฐ€+์‚ฌํšŒ์ฃผ์ œ) ๊ด€๋ จํ•œ ํ–‰๋™ ์ƒ์„ฑ
  • ์‹ฌ๋ฆฌํ•™ theory of reasoned action ๊ธฐ๋ฐ˜ํ•˜์—ฌ๊ฐ ํ–‰๋™์— ๋Œ€ํ•œ ์„ค๋ช… ์ƒ์„ฑ
    • theory of reasoned action ?
      • ๊ฐœ์ธ์˜ ํƒœ๋„์™€ ์ฃผ๊ด€์  ๊ทœ๋ฒ”์ด ํ–‰๋™ ์˜๋„์— ์–ด๋–ป๊ฒŒ ์˜ํ–ฅ์„ ๋ฏธ์น˜๋ฉฐ, ์ด ์˜๋„๊ฐ€ ์ตœ์ข…์ ์œผ๋กœ ํ–‰๋™์œผ๋กœ ์–ด๋–ป๊ฒŒ ์ด์–ด์ง€๋Š”์ง€ ์„ค๋ช…ํ•˜๋Š” ๋ฐ ์“ฐ์ด๋Š” ์‹ฌ๋ฆฌํ•™ ๋ชจ๋ธ
    • ์„ค๋ช… 1) action attribution: ์ƒ์„ฑ ํ…์ŠคํŠธ ์ค‘ value์— ๊ธฐ๋ฐ˜ํ•œ action ๋ถ€๋ถ„
    • ์„ค๋ช… 2) natural language explanation: reasoning process ์„ค๋ช…
  • VIA ๋ฐ์ดํ„ฐ ๊ด€๋ จ ํ”ผ์ณ, ์˜ˆ์‹œ

  • human-in-the-loop ๋ฐ์ดํ„ฐ ์ƒ์„ฑ ํŒŒ์ดํ”„๋ผ์ธ
    1. ํ”„๋กฌํ”„ํŠธ ๋ณ€ํ˜• ๊ตฌ์„ฑํ•˜์—ฌ value-informed action ์ƒ์„ฑ
      • ๊ฐ ๊ฐ€์น˜, ์‹œ๋‚˜๋ฆฌ์˜ค์— ๋Œ€ํ•ด 8๊ฐ€์ง€ ๋ณ€ํ˜• ํ”„๋กฌํ”„ํŠธ ์‚ฌ์šฉ
        • 8๊ฐ€์ง€ ๋ณ€ํ˜•: paraphrase, ํ”„๋กฌํ”„ํŠธ ๊ตฌ์„ฑ์š”์†Œ ์žฌ์ •๋ ฌ, ๋‹ต๋ณ€์˜ ์š”๊ตฌ์‚ฌํ•ญ ๋ณ€๊ฒฝ
      • โ†’ ๊ฐ ๋ณ€ํ˜• ํ”„๋กฌํ”„ํŠธ๋กœ 80๊ฐœ, ์ด 640๊ฐœ value-informed action ์ƒ์„ฑ
    1. ์ตœ์ ์˜ ํ”„๋กฌํ”„ํŠธ ์„ ํƒ์„ ์œ„ํ•œ ์ฃผ์„ ์ˆ˜๋™ ์ƒ์„ฑ
      • ๋‘๋ช…์˜ AI ์ „๋ฌธ๊ฐ€๊ฐ€ ๊ฐ ์ƒ˜ํ”Œ์„ ์—ฌ๋Ÿฌ ์ง€ํ‘œ์— ๋Œ€ํ•ด ์ฃผ์„ ์ฒ˜๋ฆฌ
        ์ตœ์ ์˜ ํ”„๋กฌํ”„ํŠธ ์„ ์ • ์œ„ํ•œ ์ง€ํ‘œ
        • correctness: ์ฃผ์–ด์ง„ ๊ฐ€์น˜์™€ agreement/disagreement๊ฐ€ ์ผ์น˜ํ•˜๋Š”์ง€
        • harmlessness
        • sufficiency: value ์ถฉ๋ถ„ํžˆ ๋‚˜ํƒ€๋‚ผ ์ •๋„๋กœ ์ž์„ธํ•œ์ง€
        • plausibility: ์ฃผ์–ด์ง„ ์ƒํ™ฉ์—์„œ ์ผ์–ด๋‚  ์ˆ˜ ์žˆ๋Š” ํ˜„์‹ค์ ์ธ ํ–‰๋™์ธ์ง€
      • โ†’ ์ตœ์ ์˜ ํ”„๋กฌํ”„ํŠธ ์„ ํƒํ•˜๊ณ  ์‚ฌ์šฉํ•˜์—ฌ ๋‹ค์–‘ํ•œ ์‹œ๋‚˜๋ฆฌ์˜ค์— ๋งฅ๋ฝํ™”๋œ 14,784๊ฐœ Value-Informed Actions (VIA) ๋ฐ์ดํ„ฐ์…‹ ๊ตฌ์ถ•
    1. ์ƒ์„ฑ ํ–‰๋™๊ณผ ์„ค๋ช…์˜ ํ’ˆ์งˆ ํ‰๊ฐ€
      • ๊ด€๋ จ ๋ฌธํ™”์  ๋ฐฐ๊ฒฝ ๊ฐ€์ง„ 27๋ช…์ด ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ ํ‰๊ฐ€
      • ์ฃผ์„ ์ƒ์„ฑ ๋‹จ๊ณ„์™€ ๋™์ผํ•œ ์ง€ํ‘œ๋กœ ๋žœ๋ค ์„ ํƒํ•œ ํ–‰๋™๊ณผ ์„ค๋ช…์„ ํ‰๊ฐ€

Two Tasks for Evaluating Stated Values and Value-Informed Actions

์ƒ์„ฑํ•œ VIA ๋ฐ์ดํ„ฐ์…‹์œผ๋กœ LLM ํ‰๊ฐ€ํ•˜๊ธฐ ์œ„ํ•œ ๋‘๊ฐ€์ง€ ํƒœ์Šคํฌ ์„ค๊ณ„
  • Task 1) state value inclinations
    • LLM ๊ฐ€์น˜ ์ง„์ˆ  ํ”„๋กฌํ”„ํŠธ์˜ ๊ตฌ์„ฑ์š”์†Œ
      • context: ๊ฐ€์น˜๊ด€ ์ง„์ˆ  ๋ฐฉ์‹
        • direct-inquiry (SVS-style): ์ฃผ์–ด์ง„ ๊ฐ€์น˜์— ์ž์‹ ์˜ agree ์ •๋„ ์ง„์ˆ ํ•˜๋„๋ก ํ•จ
        • portrait-based (PVQ-style): ์ฃผ์–ด์ง„ ๊ฐ€์น˜์™€ ๊ด€๋ จํ•˜์—ฌ ์ž์‹ ์˜ ์ธ๋ฌผ ๋ฌ˜์‚ฌ ์ƒ์„ฑํ•˜๊ฒŒ ํ•จ
      • options
        • strongly disagree ~ strongly agree
  • Task 2) select value-informed actions
    • VIA ๋ฐ์ดํ„ฐ์…‹์—์„œ ํŠน์ • ๊ฐ€์น˜์— agreeํ•˜๊ฑฐ๋‚˜ disagreeํ•˜๋Š” ๋‘๊ฐ€์ง€ ํ–‰๋™์„ ์ œ์‹œํ•˜๊ณ  ํ•˜๋‚˜๋ฅผ ์„ ํƒํ•˜๊ฒŒ ํ•จ
    • Task 1 ๊ณผ ๋™์ผํ•œ ํ”„๋กฌํ”„ํŠธ ๊ตฌ์„ฑ์š”์†Œ ๊ฐ€์ง
      • context
      • options
        • ํŠน์ • ๊ฐ€์น˜์— ๋Œ€ํ•œ agree ํ–‰๋™, disagree ํ–‰๋™
        • โ†’ agree/disagree ์„ ํƒ์ง€ ์ˆœ์„œ๋Š” ๋žœ๋ค

Alignment Measures

  • ํŠน์ • ์‹œ๋‚˜๋ฆฌ์˜ค(๊ตญ๊ฐ€ + ์‚ฌํšŒ์ฃผ์ œ)์— ๋Œ€ํ•œ ๋‘ Task์˜ ๊ฒฐ๊ณผ๋ฅผ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜
    • VV๏ปฟ : Task 1์˜ ๊ฐ€์น˜ ์‘๋‹ต ํ–‰๋ ฌ, AA๏ปฟ : Task 2์˜ ํ–‰๋™ ์‘๋‹ต ํ–‰๋ ฌ
      • vik,aikv_{ik}, a_{ik}๏ปฟ : kk๏ปฟ-th value, ii๏ปฟ-th scenario์— ๋Œ€ํ•œ ๊ฐ๊ฐ Task 1, Task 2 ์˜ ์‘๋‹ต
      • Task 1) 1 (strongly agree) ~ 4 (strongly disagree)
      • Task 2) 1 (agree), 2 (disagree)
    • โ‡’ ์ด๋ฅผ ๋Œ€์ƒ์œผ๋กœ ์•„๋ž˜์˜ metric ๊ณ„์‚ฐ
  • metric
    1. value-action alignment rate
      • ๊ฐ€์น˜, ํ–‰๋™ ์‘๋‹ต ํ–‰๋ ฌ์˜ ๊ฐ ์›์†Œ๋ฅผ agree๋ฉด 0, disagree๋ฉด 1๋กœ ๋ณ€ํ™˜
      • โ†’ ๋‘ ํ–‰๋ ฌ ๊ฐ„ F1 ์ ์ˆ˜๋กœ value-action ์ผ์น˜๋„ ๊ณ„์‚ฐ
    1. alignment distance
      • ๋‘ ํ–‰๋ ฌ ๊ฐ„ L1 distance๋กœ ๋ณด๋‹ค ์ž์„ธํ•œ value-action ์ผ์น˜๋„ ๊ณ„์‚ฐ
        • DikD_{ik}๏ปฟ : kth ๊ฐ€์น˜, ith ์‹œ๋‚˜๋ฆฌ์˜ค์— ๋Œ€ํ•œ element-wise alignment distance
        • โ†’ DCkD_{Ck}๏ปฟ : ๊ตญ๊ฐ€ ํ˜น์€ ์‚ฌํšŒ์ฃผ์ œ์— ๋Œ€ํ•œ ํ‰๊ท  alignment distance (e.g., C = US)
    1. alignment ranking
      • ํŠน์ • ์‹œ๋‚˜๋ฆฌ์˜ค์— ๋Œ€ํ•ด ๊ฐ value๋ฅผ alignment distance ์ˆœ์„œ๋Œ€๋กœ ์ •๋ ฌ

Experiment

Setting

  • models
    • closed-source: gpt 4o mini, gpt 3.5 turbo
    • open-source: gemma 2 9B, llama 3.3 70B, deepseek r1 distill llama 70B
    ๋‹ค์–‘ํ•œ ๊ตญ๊ฐ€์—์„œ ์ถœ์‹œ๋œ ์ตœ์‹  LLM ๋Œ€ํ‘œํ•˜๊ธฐ ์œ„ํ•œ ๋ชจ๋ธ ์„ ์ •

Result

๋ชจ๋ธ๋ณ„ value-action ๊ฐ„ ๋ถˆ์ผ์น˜ ์ƒ˜ํ”Œ ๊ฐœ์ˆ˜
  • ์ƒ๋‹นํžˆ ๋งŽ์€ ๋ถˆ์ผ์น˜ ๊ฒฝ์šฐ ๋ฐœ์ƒ

๋‚˜๋ผ ๊ธฐ์ค€ value-action ์ผ์น˜๋„
  • ๋ชจ๋ธ ์ฐจ์ด gpt 3.5๊ฐ€ ๊ฐ€์žฅ ๋ถˆ์ผ์น˜, gpt 4o๋Š” ๊ฐ€์žฅ ์ผ์น˜ํ•˜๋Š” ํŽธ
    • deepseek r1 ๋˜ํ•œ ์ผ์น˜๋„ ๋†’์Œ
  • ๋‚˜๋ผ ์ฐจ์ด Africa, Asia ๋Š” North America, Europe ์— ๋น„ํ•ด ์ผ์น˜๋„ ๋‚ฎ์€ ๊ฒฝํ–ฅ

๋‚˜๋ผ/๊ฐ€์น˜ ๊ธฐ์ค€ value-action ์ผ์น˜๋„
  • Independent, Choosing Own Goals ๊ฐ€์น˜์— ๋Œ€ํ•ด ์—ฌ๋Ÿฌ ์‹œ๋‚˜๋ฆฌ์˜ค์— ๊ฑธ์ณ ๋ถˆ์ผ์น˜ ํผ

value-action ๋ถˆ์ผ์น˜ ์ƒ˜ํ”Œ์„ ์—ฌ๋Ÿฌ ์œ„ํ—˜ ์œ ํ˜•์œผ๋กœ ์ˆ˜๋™ ๋ถ„๋ฅ˜
์ด๋Ÿฌํ•œ ๋ถˆ์ผ์น˜๊ฐ€ ์œ ๋ฐœํ•  ์ˆ˜ ์žˆ๋Š” ์ž ์žฌ์  ์œ„ํ—˜ ๋‚˜ํƒ€๋ƒ„
  • e.g., discrimination ํ–‰๋™ ๋ณด์ด๋Š” ๋ชจ๋ธ์ด, discrimination์— ๋™์˜ํ•˜๋ƒ๊ณ  ์ง์ ‘ ๋ฌผ์„ ๋•Œ๋Š” ์•„๋‹ˆ๋ผ๊ณ  ์‘๋‹ตํ•  ์ˆ˜ ์žˆ์Œ

value-action ๋ถˆ์ผ์น˜ ์ƒ˜ํ”Œ ์˜ˆ์‹œ
  • โ‡’ LLM์— ์ด๋Ÿฌํ•œ ๋ถˆ์ผ์น˜ ์žˆ์Œ์„ ์ธ์‹ํ•˜๊ณ  ํƒœ์Šคํฌ ๋งก๊ฒจ์•ผ ํ•จ

Categories

research