19 March 2026

What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

이승환
🥇

Review

Nickname / One-line review / Rating (0/5)

  • 코스피 (rating: 4.5)
    • Strength: The paper's strength is that it uses a sparsity technique so that only the features of the preferred response light up, which makes them interpretable.
    • Weakness: I wonder how the method would be applied when features have similar influence and 4 of them cannot be singled out.
    • Suggestion: For interpretability, it seems to need a way to amplify or sharpen the extraction of the latent feature axes.
  • 커피 (rating: 4.2)
    • Strength: A paper that shows preferences can be explained by compressing the "difference embedding" of the responses into a latent space. I expected a lot of information loss from "compressing" the embedding dimensions into the latent space, but it was surprising that the validation results stay close to the baselines even with such a small representation. In other words, the value of the "explainability" gained looks large relative to the information lost.
    • Weakness: The information loss that inevitably comes from compressing dimensions into the latent space.
    • Suggestion: Information loss is unavoidable, but presenting more experiments and results across K and M would help show how to minimize it.
  • 얼라 (rating: 4.2)
    • Strength: Even humans may not know why they chose a given response, so I think this work is meaningful. The paper also answers the "so what?" question about its idea by showing practical uses such as data curation, so the overall narrative arc was very good.
    • Weakness: BatchTopK(32,4) is said to be empirically best, but it is unclear why. Wouldn't a somewhat larger latent space capture subtler differences?
    • Suggestion: It would have been better to show experiments with different M and K.
  • 비요뜨 (rating: 4.3)
    • Strength: Until now we used preference data on trust; why did we never think about "why" a response was chosen! There is also the intuition that "using diverse data is generally good", but data diversity is not always a gain, because conflicting preference signals can get mixed together.
    • Weakness: Because the SAE learns features only from the embedding difference of the response pair, it probably cannot capture cases where the preference for a response depends on the prompt context.
    • Suggestion: Could prompt information be reflected in the embedding or in the feature analysis? More experiments on M and K would also be welcome.
  • 칫솔 (rating: 4.3)
    • Strength: The goal of automatically discovering preference characteristics and the choice of an SAE for it fit together well.
    • Weakness: Because only a few SAE latents are used, preference prediction performance can hardly be very high.
    • Suggestion: Since interpretability is the priority, a bit more analysis of the automatically discovered preference characteristics would have been good (how they differ from previously defined preference characteristics, and whether existing LLMs follow them well).
  • 설향딸기 (rating: 4.8)
    • Strength: It shows why preferences in the data were decided the way they were and what the model is being asked to learn. A very important and clear motivation.
    • Weakness: It makes sense because we are human, but do we really need to see and understand the explanation at the natural-language level? Isn't it enough for the model alone to know?
    • Suggestion: There must be a gap between model preferences and human preferences, yet the paper seems to look only from the human side. Shouldn't the model's perspective be considered too?
  • 나스닥 (rating: 5)
    • Strength: Interpreting human thinking at the model level is always interesting! Using an LLM to make the interpretation readable in natural language makes it even more impactful, and going as far as analyzing real datasets with it makes the work very sound!!! Only four people did this?
    • Weakness: Honestly, I would probably just fine-tune a 3B model and use that. The cost is not a burden these days, so I wonder whether interpretation with an SAE is really necessary.
    • Suggestion: It would be good to derive interpretations that can only be obtained through SAE analysis, in a more challenging setting!
  • 404 (rating: 5)
    • Strength: Everything from the title to the experiments is fun!!! All of the content is reasonable, so it read smoothly without getting stuck anywhere.
    • Weakness: The performance in the Validating Learned Features part is a bit underwhelming?
    • Suggestion: Additional SAE-related analysis, as in the "Do I know this entity?" paper, would be good.
  • 국밥 (rating: 4.5)
    • Strength: Using an LLM in the step that describes features in natural language is interesting, and the fact that the descriptions match the reasons written by human labelers more than 60% of the time is convincing.
    • Weakness: In the validation, having 3 external ML researchers do the evaluation is small in scale and limited to ML experts, isn't it?
    • Suggestion: Run the experiment with evaluators from a broader range of domains.
  • AI (rating: 4.6)
    • Strength: A reward model predicts well but does not tell us why a response was chosen. Tackling the data interpretation problem directly is a strength on the interpretability side.
    • Weakness: When only the embedding difference between responses is considered, domain knowledge in the prompt may be reflected rather weakly.
    • Suggestion: To generate features conditioned on the prompt, one could propose a foundation model instead of training separately for each dataset.

TL; DR

💡 Proposes WIMHF, a method that uses a sparse autoencoder (SAE) to automatically extract, from a preference dataset, the latent feature axes that determine which of two responses is preferred, and that explains in natural language which response characteristics drive human preference

  • Cited: 0
  • ICLR'26 Oral

Preliminary

Autoencoder
  • Components
    • Encoder:
      • Goal: compress the input $x$ into a low-dimensional representation $z$
      • Discards unnecessary information (noise) so that only the important features remain
    • Decoder:
      • Goal: reconstruct, from the compressed representation $z$, an output $y$ that is as close as possible to the original input $x$
    • Latent space (bottleneck):
      • The space in which the encoder's compressed low-dimensional representation $z$ lives
      • The core features of the data are condensed here
      • Also called the bottleneck because it has the lowest dimension
  • Training objective
    • Train the encoder and decoder well so that
      1. the input data is compressed and then reconstructed as faithfully as possible
      2. in the process, the model removes unnecessary noise and learns by itself which features matter most for describing the data
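The encoder, bottleneck, and decoder described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a trained model: the weight matrix `W` is hand-set (an assumption made for the example), with a single latent unit along the direction (0.6, 0.8), and the same weights are reused for decoding. It shows that an input lying on that direction is reconstructed almost exactly, while a component orthogonal to it (the "noise") is dropped at the bottleneck.

```python
def encode(x, w):
    """Encoder: project the input x onto each weight row to get the latent z."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def decode(z, w):
    """Decoder: rebuild y in input space as a weighted sum of the weight rows."""
    dim = len(w[0])
    return [sum(zk * w[k][i] for k, zk in enumerate(z)) for i in range(dim)]

def mse(x, y):
    """Mean squared reconstruction error between input x and output y."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

# Hypothetical hand-set weights: one latent unit along the unit direction (0.6, 0.8).
W = [[0.6, 0.8]]

clean = [3.0, 4.0]   # lies on the latent direction
noisy = [3.8, 3.4]   # clean plus noise (0.8, -0.6), orthogonal to that direction

z_clean, z_noisy = encode(clean, W), encode(noisy, W)
y_clean, y_noisy = decode(z_clean, W), decode(z_noisy, W)

print(z_clean, z_noisy)                        # both latents are close to 5
print(mse(clean, y_clean), mse(noisy, y_noisy))
```

Both inputs land on nearly the same latent value, so the decoder returns roughly (3, 4) in both cases; the reconstruction error of the noisy input is exactly the energy of the discarded noise component.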
Sparse Autoencoder (SAE)

Definition: an autoencoder that is trained to map an input vector into a latent space and reconstruct it, while forcing most entries of the latent vector to 0 so that only a few neurons are active

  • It is called sparse because only a few neurons are switched on
  • Why Sparse Autoencoder?
    • A plain autoencoder compresses the data (dimensionality reduction), whereas an SAE can either reduce or expand the dimension!

      ⇒ By letting only a few latent units fire and encouraging each latent to carry one distinct meaning,
      it decomposes complex data into interpretable concept units
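As a rough sketch of the sparsity idea, the snippet below expands a 2-dimensional input into 4 latent units and then keeps only the single largest activation. The encoder weights `ENC_W`, the ReLU nonlinearity, and the per-sample top-k rule are assumptions chosen for illustration; a real SAE learns its weights, and sparsity can be enforced in other ways (for example with an L1 penalty).

```python
def relu(v):
    """Clamp negative pre-activations to zero."""
    return [max(0.0, a) for a in v]

def top_k(z, k):
    """Keep the k largest activations and set every other unit to 0."""
    keep = set(sorted(range(len(z)), key=lambda i: z[i], reverse=True)[:k])
    return [z[i] if i in keep else 0.0 for i in range(len(z))]

def sae_encode(x, enc_w, k):
    """Linear encoder followed by ReLU and per-sample top-k sparsification."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in enc_w]
    return top_k(relu(pre), k)

# Hypothetical encoder: 2 input dimensions expanded to 4 latent units.
ENC_W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

z = sae_encode([2.0, -0.5], ENC_W, k=1)
print(z)  # only the first unit stays active
```

Because each surviving unit corresponds to one direction in input space, a sparse code like this is easier to read than a dense one where every unit is slightly active.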

Introduction

Background

  • Preference Fine-Tuning (PFT)
    • A core method for LLM alignment
    • Aligns the model using human preference data
  • How PFT works
    Prompt → (Response A, Response B) → a human picks the better response → model training
    • Given two candidate responses generated for one prompt, the human selects one of them
    • RQ: Why did the human choose that particular response over the other?

Motivation & Contribution

  • RQ: By which characteristics (features) do humans pick the preferred response?
  • Limitations of existing methods
    • A reward model can predict preferences, but it cannot explain which characteristics drove the choice
    • Defining the characteristics in advance (e.g., politeness, humor) restricts which characteristics can be discovered

    ⇒ WIMHF method proposed

    • Discovers preference characteristics automatically from the data, without defining hypotheses in advance
    • Uses a Sparse Autoencoder (SAE) to decompose the difference between responses into interpretable features
  • The preference dataset $D$ consists of tuples $(p, r_A, r_B, y)$ sampled from the following distribution

Generative distribution of the preference dataset

$$(p, r_A, r_B, y) \sim \underbrace{\Pr(p)}_{\text{(1) prompt dist.}} \cdot \underbrace{\Pr(r_A, r_B \mid p)}_{\text{(2) response dist.}} \cdot \underbrace{\Pr(y \mid r_A, r_B, p)}_{\text{(3) label dist.}}$$
  • $p$: the prompt, written by a human
  • $r_A$, $r_B$: two responses to the prompt, generated by LLMs
  • $y$: the label, given by a human ($y=1$ if $r_A$ is chosen, $y=0$ if $r_B$ is chosen)
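To make the three-stage factorization concrete, here is a small sketch of a record type and a sampler that draws the prompt, then the two responses, then the label. The names `PreferencePair`, `sample_pair`, `toy_respond`, and the callback `prob_a_wins` are hypothetical, and drawing the two responses independently is a simplifying assumption (the formula only requires a joint distribution over the pair).

```python
import random
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str       # p, written by a human
    response_a: str   # r_A, generated by an LLM
    response_b: str   # r_B, generated by an LLM
    label: int        # y: 1 if r_A was chosen, 0 if r_B was chosen

def sample_pair(prompts, respond, prob_a_wins, rng):
    """Draw one (p, r_A, r_B, y) tuple following the three factors in order."""
    p = rng.choice(prompts)                                   # (1) prompt dist.
    r_a, r_b = respond(p, rng), respond(p, rng)               # (2) response dist.
    y = 1 if rng.random() < prob_a_wins(p, r_a, r_b) else 0   # (3) label dist.
    return PreferencePair(p, r_a, r_b, y)

# Toy stand-ins (hypothetical): two canned response styles and an
# annotator who always picks r_A.
STYLES = ["Sure! Here is a friendly answer.", "No."]

def toy_respond(prompt, rng):
    return rng.choice(STYLES)

rng = random.Random(0)
pair = sample_pair(["How do I boil an egg?"], toy_respond, lambda p, a, b: 1.0, rng)
print(pair)
```

Separating the three factors this way mirrors the point of the formula: datasets can differ in their prompts, in how responses were generated, or in how annotators label them.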

⭐ Measurable Preferences

  • Definition: an axis that describes how two responses $r_A$ and $r_B$ differ
    • e.g., $r_A$ is friendly, $r_B$ is curt …
      [Figure: example natural-language descriptions of Measurable Preferences that distinguish $r_A$ from $r_B$]
  • Problem: there is no ready-made tool for measuring Measurable Preferences
    • Converting each response into a text embedding and taking the difference is meaningful, but the result cannot be explained

⭐ Expressed Preferences

  • Definition: the characteristics that actually predict the label $y$
    • Among the many measurable preferences, those that actually influenced the choice
    • e.g., if $r_A$ is secular and $r_B$ is not, and $r_A$ is chosen (preferred) more often

      → expressed preference: secular

  • ⭐ Knowing the expressed preferences is what tells us which objective the model is being aligned toward!!

Method: WIMHF

  • 3 Step Method
    1. Train an SAE to extract Measurable Preferences (in vector form)
    2. Generate a natural-language description for each feature
    3. Analyze which features actually determine the preference label (Expressed Preferences)
Step 1: Learning measurable preferences with SAEs
  • Goal: take a preference pair ($p$, $r_A$, $r_B$) as input and find its measurable preferences!
    • (Let's find out along which characteristics the two responses differ!)
    • The difference between the two responses' text embeddings, $e_\Delta = e_{r_A} - e_{r_B}$, does not by itself explain the basis of the preference

      ⇒ Train an SAE on $e_\Delta$ with the BatchTopK(32,4) technique!

      • BatchTopK (M,K): a sparsity technique that keeps only K of the M latent dimensions produced by the SAE active
        BatchTopK (32,4): in the latent vector reduced to 32 dimensions, only 4 units are active on average
  • How it works
    $e_\Delta$ (1536 dimensions, from text-embedding-3-small)
    ↓
    SAE encoder
    ↓
    32-dimensional z
    ↓
    BatchTopK sparsity (32,4)
    ↓
    only 4 units active on average
    ↓
    final sparse representation Z (about 4 active latent units per example)
  • Structure of the final Z (an $N \times M$ matrix)

    $$Z = \begin{bmatrix} z_1^{(1)} & z_2^{(1)} & \dots & z_M^{(1)} \\ z_1^{(2)} & z_2^{(2)} & \dots & z_M^{(2)} \\ \vdots & & & \vdots \\ z_1^{(N)} & z_2^{(N)} & \dots & z_M^{(N)} \end{bmatrix}$$

    • On average, only 4 latent units are active in each row
      • row: the sparse representation $z^{(i)}$ of one example
      • column: one feature $z_j$ describing the data
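The two ingredients of Step 1, the embedding difference and the BatchTopK selection, can be sketched as follows. This is a simplified illustration under stated assumptions: `embedding_difference` and `batch_topk` are hypothetical helper names, the toy matrix stands in for SAE pre-activations, and ties and negative values are not treated specially. The key point it shows is that BatchTopK keeps the N×K largest entries across the whole batch, so each row has K active units on average rather than exactly K.

```python
def embedding_difference(e_a, e_b):
    """e_delta = e_{r_A} - e_{r_B}, computed element-wise."""
    return [a - b for a, b in zip(e_a, e_b)]

def batch_topk(Z, k):
    """Keep the N*k largest entries of the N x M matrix Z; zero out the rest."""
    n = len(Z)
    flat = [(val, i, j) for i, row in enumerate(Z) for j, val in enumerate(row)]
    flat.sort(key=lambda t: t[0], reverse=True)
    kept = {(i, j) for _, i, j in flat[: n * k]}
    return [
        [val if (i, j) in kept else 0.0 for j, val in enumerate(row)]
        for i, row in enumerate(Z)
    ]

# Toy batch: N = 2 examples, M = 4 latent units, K = 1.
Z_dense = [
    [0.9, 0.8, 0.1, 0.0],
    [0.2, 0.1, 0.05, 0.3],
]
Z_sparse = batch_topk(Z_dense, k=1)
active = sum(1 for row in Z_sparse for v in row if v != 0.0)
print(Z_sparse, active)
```

In this toy batch both surviving entries fall in the first row, so one example ends up with two active units and the other with none, while the batch average is still K = 1.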
Step 2: Describing measurable preferences in natural language
  • Goal: use the final representation $Z$ from Step 1 to make the meaning of each feature interpretable to humans
  • How it works
    1. For each feature $z_j$, sample 5 preference pairs with large values of that feature
      • $z_j$ is large ⇒ the two responses differ strongly along that direction of $e_\Delta$ ⇒ the pair is easy to tell apart!
    2. Ask an LLM (gpt-5-low) to write a natural-language description of the concept (Measurable Preference) that best distinguishes the two responses
      [Figure: example natural-language descriptions generated by the LLM for the Reddit dataset]

    ⇒ This process yields a natural-language description of what causes each feature to activate
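The selection and prompting steps above might look roughly like this. The helper names and, in particular, the instruction text are hypothetical; the wording of the actual prompt used in the paper is not given in these notes, and no LLM call is made here.

```python
def top_activating_pairs(Z, pairs, j, n=5):
    """Return the n preference pairs with the largest activation of feature j."""
    order = sorted(range(len(Z)), key=lambda i: Z[i][j], reverse=True)[:n]
    return [pairs[i] for i in order]

def build_description_prompt(examples):
    """Assemble a (hypothetical) instruction asking an LLM to name the concept."""
    parts = [
        "Each example shows two responses to the same prompt. "
        "In one short phrase, describe the concept that best distinguishes "
        "response A from response B across all examples."
    ]
    for n, (prompt, r_a, r_b) in enumerate(examples, start=1):
        parts.append(
            f"Example {n}\nPrompt: {prompt}\nResponse A: {r_a}\nResponse B: {r_b}"
        )
    return "\n\n".join(parts)

# Toy data: 6 pairs, a single feature column (j = 0).
Z = [[0.1], [0.9], [0.0], [0.5], [0.7], [0.3]]
pairs = [(f"p{i}", f"a{i}", f"b{i}") for i in range(6)]

top = top_activating_pairs(Z, pairs, j=0, n=3)
prompt_text = build_description_prompt(top)
print(prompt_text)
```

The resulting text would then be sent to whichever LLM is used for description generation, once per feature.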

Step 3: Identifying expressed preferences
  • Goal: use logistic regression to estimate how much each interpretable feature $z_j$ influences the preference label $y$
    $$\Pr(y = 1) = \sigma(\alpha + \beta_j z_j + \gamma x)$$
    • Probability that $r_A$ is chosen, $\Pr(y=1)$ ⇒ effect of feature $z_j$ + effect of the length difference $x$
      • $x = \text{length}(r_A) - \text{length}(r_B)$
      • Longer answers tend to be preferred, so the length difference is included as a control
  • $\beta_j$: how much $z_j$ influenced the preference
  • if $\beta_j > 0$
    • the larger $z_j$ is, the higher the probability that $r_A$ is chosen
  • if $\beta_j < 0$
    • the larger $z_j$ is, the lower the probability that $r_A$ is chosen
  • The larger $|\beta_j|$ is, the stronger the feature's influence on the preference
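The regression above can be fitted with a few lines of gradient ascent on the log-likelihood. The sketch below is a minimal stdlib-only version, not the authors' implementation; the function name `fit_feature_logit`, the learning rate, and the toy data are assumptions. The toy data is built so that the true coefficients are known: the win rate of $r_A$ is 3/4, 2/4, 2/4, 1/4 in the four (feature, length) cells, which corresponds to $\beta_j = \gamma = \ln 3$.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def fit_feature_logit(z, x, y, lr=1.0, steps=3000):
    """Fit Pr(y=1) = sigmoid(alpha + beta*z + gamma*x) by gradient ascent."""
    alpha = beta = gamma = 0.0
    n = len(y)
    for _ in range(steps):
        g_a = g_b = g_g = 0.0
        for zi, xi, yi in zip(z, x, y):
            resid = yi - sigmoid(alpha + beta * zi + gamma * xi)
            g_a += resid
            g_b += resid * zi
            g_g += resid * xi
        alpha += lr * g_a / n
        beta += lr * g_b / n
        gamma += lr * g_g / n
    return alpha, beta, gamma

# Toy data: (feature value z_j, scaled length difference x, wins of r_A out of 4).
cells = [(1.0, 0.5, 3), (1.0, -0.5, 2), (0.0, 0.5, 2), (0.0, -0.5, 1)]
z, x, y = [], [], []
for z_val, x_val, n_pos in cells:
    for i in range(4):
        z.append(z_val)
        x.append(x_val)
        y.append(1 if i < n_pos else 0)

alpha, beta, gamma = fit_feature_logit(z, x, y)
print(round(beta, 3), round(gamma, 3))
```

A positive fitted `beta` here means the feature raises the chance that $r_A$ wins even after the length term has absorbed its share. With real data the length difference would typically be rescaled first so that the step size stays reasonable.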

Validating Learned Features

  • Three ways to check whether the features produced by the SAE are really meaningful
    • 1. Preference prediction performance
      • settings
        • baseline
          1. Finetuned Reward Model (Oracle)
            • Llama-3.2-3B reward model
            • fine-tuned directly on the preference dataset
          2. Embedding (P+R)
            • Input: the prompt + response embedding $e_{p,r}$ is used as the feature
          3. Embedding (R)
            • Input: only the response embedding $e_r$ is used
          4. SAE
        • metric
          • AUC (Area Under the Curve): measures how well a classifier separates the two classes
            AUC value   Meaning
            0.5         Random
            0.7         Decent
            1.0         Perfect prediction

      • Results
        • The reward model performs best
          • It is a large LLM plus fine-tuning, so this is to be expected (it cannot be beaten)
        • ⭐ The SAE performs slightly below the baselines
          • Good performance, considering that the SAE features are a very small representation: 32 dimensions with only 4 active on average!
    • 2. Do the features match explanations written by people?
      • Settings
        • The CA dataset contains natural-language explanations written by the annotators themselves about why they preferred a response
        • WIMHF learns its features without seeing these explanations
        • The experiment samples 5000 preference pairs in total
        • metric
          • Explanation match rate: the fraction of cases in which an LLM judge finds that the annotator explanation and the SAE feature agree
          • [Figure: the prompt used by the LLM judge]
        • baseline
          • Top Features
            • the 4 SAE features that are actually active
          • Random Features
            • 4 randomly selected inactive features
      • Results
        • People find it hard to state the reasons for their own judgments precisely, and the explanations are short or noisy, yet the match rate is a high 60.4%
        [Figure: examples of preference explanations from real annotators alongside the matching SAE features]
    • 3. Qualitative evaluation by experts
      • settings
        • 3 external ML researchers were recruited
        • They evaluated 47 statistically significant features from 5 datasets
        • Criteria: Predictive, Helpful, Interpretable
      • Results
        • 41 of 47 (87%) → rated "helpful"
        • all 47 (100%) → rated "interpretable"
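For the AUC metric used in the first validation, one direct way to compute it is as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, counting ties as one half. The function name `roc_auc` and the toy scores below are illustrative assumptions; this pairwise form is quadratic in the number of examples and is meant for clarity, not speed.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
print(roc_auc(labels, [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs ranked correctly
print(roc_auc(labels, [0.5, 0.5, 0.5, 0.5]))   # uninformative scores
print(roc_auc(labels, [0.1, 0.2, 0.8, 0.9]))   # perfectly separated
```

The three calls correspond to the rows of the table above: an uninformative scorer sits at 0.5 and a perfect one at 1.0.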

Experiment

  • Datasets
    • Datasets (experiments focus on the datasets shown in bold)
      • LMArena
      • Community Alignment (CA)
      • HH-RLHF
      • PRISM
      • Reddit
      • PKU-SafeRLHF
      • Tulu 3 mixture

    (Queries with objectively correct answers, such as coding and math, were removed)

Measurable Preferences by dataset
Differences in Measurable Preferences across datasets
  • Both datasets ask for value-laden conversations, yet the learned features are completely different
    • PRISM features
      • Responses differ mainly in whether they refuse or answer, and in style and tone
        • e.g., for sensitive questions such as abortion or religion, some responses dodge and others answer concretely
      • why?
        • Responses were drawn at random from 21 different LLMs, so answering style and refusal criteria differ from model to model
    • CA features
      • Responses differ less in whether they refuse and more in which topics and values they talk about
        • e.g., environmental issues vs social justice, positive attitude vs critical attitude
      • why?
        • A single LLM was used, but the prompt directly instructed it to "answer from 4 different value perspectives" ⇒ the tone (style) is similar but the content is diverse
Differences in Expressed Preferences across datasets
  • x-axis (Δ win-rate): the further right (+), the more the response with that feature is preferred; the further left (-), the less it is preferred
  • Each point: one of 5 datasets (ChatbotArena, CommunityAlign, HH-RLHF, PRISM, Reddit)
  • Results
    • Responses with structured formatting are preferred across the board
      • Positive direction in most datasets (a large preference of about +40% in CommunityAlign)
    • Expressing uncertainty or saying "I don't know" is dispreferred across the board
      • People dislike it when the AI says it does not know!
      • Most strongly dispreferred on Reddit, at about -25%
    • Contrasting preferences
      • Informal, expressive tone (jokes, emoji)
        • Strongly dispreferred in PRISM at about -30% / slightly preferred in ChatbotArena and Reddit
      • Discussion of systemic inequality and equity
        • Dispreferred in CommunityAlign and HH-RLHF / preferred in PRISM and Reddit

      ⇒ There is no universal preference model!!

      • A response that is preferred in the Reddit dataset may be dispreferred in another dataset
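The idea of a feature pulling in opposite directions in different datasets can be illustrated with a simplified proxy for Δ win-rate: among pairs where the feature is active, the fraction in which the response showing the feature was chosen, minus 0.5. This is an assumption made for illustration; the paper's own estimate may differ (for instance by controlling for length). The signed activation convention (positive when $r_A$ shows the feature more, negative when $r_B$ does) and the toy data are also hypothetical.

```python
def delta_win_rate(signed_acts, labels):
    """Win rate of the response carrying the feature, minus the 0.5 chance level."""
    wins = total = 0
    for s, y in zip(signed_acts, labels):
        if s == 0:
            continue                 # feature inactive for this pair
        total += 1
        feature_in_a = s > 0
        chose_a = y == 1
        wins += feature_in_a == chose_a
    return wins / total - 0.5 if total else 0.0

# Toy datasets: same feature, opposite annotator tastes.
dataset_1 = delta_win_rate([1, 1, -1, 1], [1, 1, 0, 0])
dataset_2 = delta_win_rate([1, -1, 1, 1], [0, 1, 0, 1])
print(dataset_1, dataset_2)
```

A positive value for one dataset and a negative value for the other is the pattern described above for features such as informal tone, and is what makes naive mixing of preference datasets risky.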

Effective Data Curation (let's use WIMHF to pick data well!)

Problem: in the Arena dataset

  • $r_A$: safely refuses to answer
  • $r_B$: generates unsafe content

  • But people tend to prefer $r_B$

    → Training a model on this data produces an unsafe model

Solution: Label Flipping

  • Use WIMHF to find the examples where the unsafe feature is strongly active, then flip the preference labels of those examples
    • $r_B$ preferred → changed to $r_A$ preferred
  • Results
    • The more labels are flipped, the more sharply Safety rises, from 8.9% → 46.2%
    • + Flipping the labels leaves overall performance almost unchanged ⇒ very nice!
      • x-axis: Safety (green) and Overall, i.e., overall performance excluding Safety (blue)
      • y-axis: RewardBench 2 Accuracy (%), a benchmark for evaluating reward models
        • Left (Safety): accuracy on safety-related items
        • Right (Overall excl. Safety): overall accuracy excluding safety
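The label flipping step can be sketched as a simple filter over the dataset. The details here are assumptions rather than the paper's exact procedure: `unsafe_acts` is a hypothetical signed activation of the unsafe feature (positive when $r_A$ is the unsafe response, negative when $r_B$ is), `threshold` is an arbitrary cutoff, and a label is flipped only when the unsafe response is the one currently marked as preferred.

```python
def flip_unsafe_labels(labels, unsafe_acts, threshold):
    """Flip the label of pairs where a strongly unsafe response was preferred."""
    flipped = []
    for y, s in zip(labels, unsafe_acts):
        unsafe_is_a = s > 0
        unsafe_chosen = (y == 1) == unsafe_is_a
        if abs(s) >= threshold and unsafe_chosen:
            flipped.append(1 - y)    # prefer the safe response instead
        else:
            flipped.append(y)        # leave the original annotation alone
    return flipped

labels      = [0,    0,    1,    0]    # 1 = r_A chosen, 0 = r_B chosen
unsafe_acts = [-0.9, -0.1, -0.8, 0.7]  # negative = r_B is the unsafe one

new_labels = flip_unsafe_labels(labels, unsafe_acts, threshold=0.5)
print(new_labels)
```

Only the first pair changes: its unsafe response ($r_B$) was preferred and the activation is strong. The second pair is below the threshold, and in the last two the safe response was already the preferred one.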

Why does it matter to know the characteristics of a preference dataset?

  • Discovering dataset bias
    • A preference dataset can contain hidden biases such as style or manner of expression

    ⇒ Makes it possible to detect when a dataset trains the model in an unintended direction

  • Discovering conflicts between datasets
    • Different datasets can hold different preferences for features such as humor, tone, and refusal

    ⇒ Analyzing this makes it possible to detect the conflicts that arise when several preference datasets are mixed for training

  • Enabling personalization
    • Preferred style differs from person to person
    • e.g.,
      • bullet list vs paragraph
      • formal tone vs informal tone

    ⇒ Analyzing preference features makes it possible to build models personalized for each user

Categories

RLHF SAE research