26 March 2026

Language Model Personalization via Reward Factorization


이승환 🥇


Review

Nickname · One-line review · Score (out of 5)
댓인 노노 (5/5)
• Pros: The motivation is clear and leads naturally into the method! It reformulates personalization, which can otherwise feel vague, in an innovative way.
• Cons & improvements: -
아이리스 (4.7/5)
• Pros: Feels like they reached into my head.. I think people ultimately differ in small ways that stand out strongly; since common sense draws a shared baseline, the paper implements that intuition well.
• Cons: This holds for other papers too, but personally I think true per-user customization is impossible; every user personalization scheme is really a process of elimination. A pity this isn't that kind of approach, but maybe I could try it myself?
• Improvement: What if the per-axis weights were based on negative signals (what the user dislikes)?
ํ•ธ๋“œํฌ๋ฆผโ€ข ์žฅ์ : ์ƒˆ๋กœ์šด ์‚ฌ์šฉ์ž์— ๋Œ€ํ•ด ๋น ๋ฅด๊ฒŒ ์ •๋ ฌํ•  ์ˆ˜ ์žˆ๋Š” ๋ฐฉ๋ฒ•
โ€ข ๋‹จ์ : ์ดˆ๊ธฐ ํ–‰๋ ฌ ๊ตฌ์„ฑ์—์„œ ๋งŽ์ด ๋ฒ—์–ด๋‚˜๋Š” ์‚ฌ์šฉ์ž๋Š” reward function ์„ค์ •์ด ์ž˜ ์•ˆ๋  ์ˆ˜๋„ ์žˆ์–ด๋ณด์ž„
โ€ข ๋ณด์™„์ : base reward function์„ ์œ ์ง€ํ•˜๋˜ outlier ์‚ฌ์šฉ์ž๋ฅผ ์ปค๋ฒ„ํ•  ๋ฐฉ๋ฒ•
4.5
3월 (4.5/5)
• Pros: At inference time only the per-user reward weights are estimated, with no model retraining, which improves cost efficiency.
• Cons & improvements: Like earlier papers, this assumes user preferences can be expressed as a linear combination... For example, someone may prefer short, precise answers for work but long, friendly ones for small talk; can a linear model capture that? Is there a better non-linear approach?
에너지 (4.9/5)
• Pros: Standard RLHF with PPO collapses the reward into a single neural network to optimize preferences, but in hindsight it was obvious that preferences deserve finer-grained treatment. The overly simple reward design seems to have caused non-uniqueness problems, and those limits of reward modeling likely had a big influence on the emergence of DPO, much as they motivate this paper. (Compressing preferences into one unified reward representation is the problem!)
• Cons: I wondered whether a small set of reward axes might fail to fully express the preference axes, but the experimental results demonstrated that they can..
피즈치자 (4.5/5)
• Pros: Since we cannot build a model that reflects every user's personalized preferences individually, framing the problem this way is realistic.
• Cons: User preferences are assumed to be linear (the experiments validate this to some degree), but are there really no cases where they behave non-linearly?
• Suggestion: The personalization attributes here are relatively interpretable styles such as length, humor, politeness, and confidence; how could subtler styles be reflected (e.g., reasoning style)?
ํ™”์ดํŠธ๋…ธ์ด์ฆˆ โ€ข ์žฅ์ : ๊ทธ๋ž˜๋„ ์‚ฌ์šฉ์ž๋“ค๋ผ๋ฆฌ ๊ณตํ†ต๋œ ๋ช‡ ๊ฐœ์˜ ์„ ํ˜ธ ์ถ•์ด ์กด์žฌํ•œ๋‹ค๊ณ  ๊ฐ€์ •ํ•œ ์ง€์ ์ด ์ง€๊ธˆ๊นŒ์ง€ ๋ณธ personalization ๋…ผ๋ฌธ ์ค‘์—์„œ ๊ฐ€์žฅ motivated ๋œ ๋…ผ๋ฌธ
(์ ๋‹นํžˆ โ†’ ๋” ์ข‹์€ ์ ๋‹นํžˆ)
โ€ข ๋‹จ์ : ์‚ฌ๋žŒ ์ทจํ–ฅ์ด๋ผ๋Š”๊ฒŒ ์ •๋ง ๋ณต์žกํ•œ๋ฐ ์„ ํ˜• ๊ด€๊ณ„๋กœ ๊ฐ„๋‹จํžˆ ํ‘œํ˜„ํ•˜๋Š”๊ฒŒ ๋งž์„๊นŒ?
โ€ข ๋ณด์™„์ : ๋ฌธํ™”์ ์œผ๋กœ ์ƒ๋ฐ˜๋œ ์œ ์ €๋“ค(e.g., ๊ณ ๋งฅ๋ฝ์‚ฌํšŒ vs ์ €๋งฅ๋ฝ ์‚ฌํšŒ)์ด ๊ฐ™์€ ์ถ• ๊ณต๊ฐ„ ์•ˆ์— ๊ณต์กดํ•  ์ˆ˜ ์žˆ๋Š”์ง€๊ฐ€ ์˜๋ฌธ
โ€ข ์ œ์•ˆ: ์‚ฌ์‹ค ํ•ด๋‹น ๋…ผ๋ฌธ์˜ ๊ณตํ†ต ์„ ํ˜ธ์ถ•์˜ ๊ฐ ํŒŒ๋ผ๋ฏธํ„ฐ๊ฐ€ ์–ด๋– ํ•œ peference๋ฅผ ๋‚˜ํƒ€๋‚ด๋Š”์ง€ ํ•ด์„ ๋ถˆ๊ฐ€ํ•˜๋‹ค๋Š” ํ•œ๊ณ„๊ฐ€ ์žˆ๋Š”๋ฐ Whatโ€™s In My Human Feedback? Learning Interpretable Descriptions of Preference Data (ICLR'26 Oral) ๋…ผ๋ฌธ์—์„œ ์ œ์‹œํ•œ ๋ฐฉ๋ฒ•์œผ๋กœ ๋ฐ์ดํ„ฐ์…‹์—์„œ ์„ ํ˜ธ ์ถ•์„ ๋ฏธ๋ฆฌ ์ถ”์ถœํ•˜๊ณ  ํ•ด๋‹น ์ถ•์„ ๋ฐ”ํƒ•์œผ๋กœ ๊ณตํ†ต ์„ ํ˜ธ์ถ• ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์„ค์ •ํ•˜๋ฉด ์–ด๋–จ๊นŒ?
4.5
์ œ๋กœ์ฝœ๋ผ โ€ข ์žฅ์ : ์‚ฌ๋žŒ๋งˆ๋‹ค ์„ ํ˜ธ๊ฐ€ ์™„์ „ํžˆ ์ œ๊ฐ๊ฐ์ด ์•„๋‹ˆ๋ผ ๊ณตํ†ต๋œ ๋ช‡ ๊ฐœ์˜ ์ถ•์œผ๋กœ ํ‘œํ˜„๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ๊ฐ€์ •์ด ์ง๊ด€์ ์ด๊ณ  ์ข‹์Œ
โ€ข ์•ฝ์ : ์‚ฌ์šฉ์ž์˜ ์„ ํ˜ธ๊ฐ€ ์—ฌ๋Ÿฌ ์ถ•์˜ ์„ ํ˜• ๊ฒฐํ•ฉ์œผ๋กœ ํ‘œํ˜„๋œ๋‹ค๊ณ  ๊ฐ€์ •ํ•˜๋Š”๋ฐ, ๊ฐ™์€ ์‚ฌ๋žŒ์ด ์ƒํ™ฉ์— ๋”ฐ๋ผ ์„ ํ˜ธ๊ฐ€ ๋‹ฌ๋ผ์ง€๋Š” ๊ฒฝ์šฐ๋ฅผ ์„ ํ˜• ๋ชจ๋ธ๋กœ ์žก์•„๋‚ผ ์ˆ˜ ์žˆ์„์ง€ ์˜๋ฌธ์ด ์ƒ๊น€.
โ€ข ๋ณด์™„์ : ํ˜„์žฌ ์‹คํ—˜์—์„œ ์‚ฌ์šฉ๋œ ์„ ํ˜ธ ์ถ•์€ ๋น„๊ต์  ํ•ด์„ํ•˜๊ธฐ ์‰ฌ์šด ์Šคํƒ€์ผ ์†์„ฑ์ธ๋ฐ, ๋ฌธํ™”์  ๋ฐฐ๊ฒฝ์ด๋‚˜ ๊ฐ€์น˜๊ด€์ฒ˜๋Ÿผ ๋” ๋ณต์žกํ•œ ๊ฐœ์ธ ์„ ํ˜ธ๊นŒ์ง€ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ ํฌ์ฐฉํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ๋‹ค์–‘ํ•œ ์‚ฌ์šฉ์ž ์ง‘๋‹จ์„ ๋Œ€์ƒ์œผ๋กœ ์ถ”๊ฐ€ ์‹คํ—˜์ด ์žˆ์œผ๋ฉด ์ข‹์„ ๊ฒƒ ๊ฐ™์Œ.
4.5
창백카츄 (3.5/5)
• Pros: A very good methodology for personalization, and it looks able to easily reflect cultural and geographic biases!
• Cons: The reliability of the authors' synthetic data is somewhat questionable, and it's unclear whether the preference axes are independent. Is there an experiment on this?
• Suggestion: A real human study or case study would strengthen the paper.

TL;DR

💡 Decompose many users' preferences into a set of shared preference axes (e.g., friendliness, conciseness, formality); then, when a new user arrives, quickly estimate their personalized preferences by assigning a different weight to each axis!

  • Cited: 19

Introduction

Motivation

  • Limitations of standard RLHF
    • Universal preference model: a model aligned universally to all users rather than to each user's own preferences

    ⇒ ⭐ It can align to the average human preference, but falls short on personalization that reflects each individual user's preferences

RQ: Per-user preferences may not be completely idiosyncratic; can they be expressed on a few shared preference axes (a low-dimensional preference space)?

Contribution

  • Proposes the Personalization via Reward Factorization (PReF) framework: reframes personalization as a reward factorization problem
    • Instead of training a separate reward model per user, first learn shared base reward functions, then estimate only the per-axis weights for each new user
    • Once the base rewards are trained, a new user requires estimating only a user-specific weight vector, with no full-model retraining
  • Introduces active-learning-based adaptation: selects the query/response pairs that reduce uncertainty the most, improving data efficiency

Methods

Step 1. Learn the shared preference axes (offline)
  1. Collect preference data from many users and learn to express responses through a few shared base reward functions
    • Show response pairs $(X, Y)$ to hundreds of users and collect their choices ⇒ build a user × response-pair matrix $A$
      Sparse preference matrix (each user labels only a few pairs):

      |        | Pair 1  | Pair 2  | Pair 3  | Pair 4  | Pair 5  |
      |--------|---------|---------|---------|---------|---------|
      | User A | chose X |         | chose X |         |         |
      | User B |         | chose X |         | chose X |         |
      | User C |         |         | chose X |         |         |
      | User D |         |         |         |         | chose X |
  2. From this data, train the model so that each response is expressed through a few shared base reward functions
    • Each response is scored on several shared preference axes, and each user weights those axes differently
    • Matrix factorization:
      $A \approx U V^\top$
      • Matrix $U$ (user factors): each user's weights over the preference axes ($\lambda_i$)
      • Matrix $V$ (item factors): how strongly each response pair expresses each preference axis ($\phi_j$)
    • User $i$'s reward function (dot product of the $U$ and $V$ factors):
    $r_i(x, y) = \lambda_i^\top \phi(x, y) = \sum_{j=1}^{J} \lambda_i^j \, \phi_j(x, y)$
    • $\phi_j$: shared preference axes (common to all users), e.g., $\phi_1$: conciseness, $\phi_2$: formality, $\phi_3$: friendliness, $\phi_4$: creativity
    • $\lambda_i^j$: user $i$'s individual weight on axis $j$
    • A neural network is trained to take a prompt and response and output the $J$-dimensional vector $\phi(x, y)$
      • Uses SVD initialization and L2 regularization

โ‡’ ์‚ฌ์šฉ์ž๋งˆ๋‹ค ์™„์ „ํžˆ ๋ณ„๋„ ๋ชจ๋ธ์„ ๋งŒ๋“ค์ง€ ์•Š๊ณ , ๋ชจ๋“  ์‚ฌ์šฉ์ž์—๊ฒŒ ๊ณตํ†ต์œผ๋กœ ์“ธ ์ˆ˜ ์žˆ๋Š” reward ๋ชจ๋ธ

Step 2. Estimate the new user's preference vector (online)
  • Ask the new user a few comparison questions to estimate $\lambda_i$, i.e., how strongly they weight each of the shared axes
    • Actively select the response pairs for which the current $\lambda$ estimate is most uncertain (active learning), maximizing data efficiency
      • Active learning: rather than asking arbitrary questions, pick the questions that reveal the user's taste fastest
      • e.g., "Which answer is better, A or B?"

⇒ For a new user, keep $\phi$ fixed and fit only the user's weight vector with logistic regression
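Because $\phi$ is frozen here, estimating $\lambda$ really is plain logistic regression over feature differences. A sketch under that setup; the "probability closest to 0.5" rule below is a generic uncertainty heuristic for query selection, not necessarily the paper's exact acquisition function:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def estimate_lambda(phi_diffs: np.ndarray, prefs: np.ndarray) -> np.ndarray:
    """phi_diffs: (N, J) rows of phi(x, y_a) - phi(x, y_b) for answered queries.
    prefs: 1 if the user chose y_a, else 0 (both classes must appear at least once).
    Since P(y_a > y_b) = sigmoid(lambda^T (phi_a - phi_b)), lambda is the coefficient."""
    lr = LogisticRegression(fit_intercept=False).fit(phi_diffs, prefs)
    return lr.coef_.ravel()

def pick_next_query(lam: np.ndarray, candidate_diffs: np.ndarray) -> int:
    """Ask about the pair whose predicted outcome is least certain (prob near 0.5)."""
    probs = 1.0 / (1.0 + np.exp(-candidate_diffs @ lam))
    return int(np.argmin(np.abs(probs - 0.5)))
```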

Step 3. Generate personalized responses
  • Select responses with the personalized reward instead of retraining the LLM
    1. Combine the learned shared axes $\phi$ with the new user's weights $\lambda$ to compute the personalized reward
    2. At inference time, use this reward value to select among candidate responses
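In practice this amounts to best-of-$n$ selection: sample several candidates, score each with $\lambda^\top \phi(x, y)$, and return the argmax. A minimal sketch, with `phi_fn` standing in for the learned axis network:

```python
import numpy as np

def select_response(prompt: str, candidates: list[str], phi_fn, lam: np.ndarray) -> str:
    """Return the candidate with the highest personalized reward lambda^T phi(x, y)."""
    scores = [float(lam @ phi_fn(prompt, y)) for y in candidates]
    return candidates[int(np.argmax(scores))]
```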

Experiments

Setup
  • Model: Qwen 2.5 family
  • Datasets
    1. Attributes
      • A synthetic personalization dataset built by the authors
      • Defines 7 preference attributes, each with a positive and a negative trait
        • Each user is randomly assigned two traits, yielding 84 synthetic users
        • 100 preferences are collected per user over AlpacaEval prompts
    2. PRISM
      • A dataset of LLM preferences from diverse respondents around the world
      • 1.5K users, 3K prompts and answers
  • Metrics (see the sketch after this list)
    • User Preference AUC-ROC: how well the model predicts which response in a pair the user prefers
    • Win rate: how often responses generated with the personalized reward are preferred over a non-personalized baseline
  • Baselines
    • Standard RLHF: trains a single global reward over all users
    • Model per User: trains an individual reward model for each user
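Both metrics reduce to a few lines once per-pair reward margins are available; a hedged sketch (sklearn's `roc_auc_score` handles the AUC):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def preference_auc(margins: np.ndarray, labels: np.ndarray) -> float:
    """AUC-ROC for predicting the preferred response from margins r(y_a) - r(y_b);
    labels are 1 where the user actually preferred y_a."""
    return float(roc_auc_score(labels, margins))

def win_rate(personalized_won: np.ndarray) -> float:
    """Fraction of head-to-head judgments won by the personalized response."""
    return float(np.mean(personalized_won))
```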
Evaluation against the baselines
  • Goal: when feedback arrives from a new user, which method learns that user's preferences fastest
  • Results
    • Standard RLHF
      • Not personalized, so performance does not change with more feedback
    • Model per User
      • Per-user data is scarce, so performance only starts to rise after many responses accumulate
    • PReF (ours)
      • Improves personalization quickly with only a small number of user responses
      • x-axis: number of preference responses received from the new user
      • y-axis: User Preference AUC-ROC and win rate (as defined in Setup)
Comparison with existing personalization methods
  • VPL: strong in the few-shot regime, but as an in-context method its performance stalls once user examples accumulate
  • PReF: becomes SOTA once user feedback exceeds about 10 comparisons
Ablation Study

(A) Are SVD initialization and regularization really necessary?

  • Full: the complete method
  • No Reg.: w/o regularization
  • No SVD: w/o SVD initialization

⇒ Without regularization and SVD initialization, performance degrades and training becomes unstable

(B) How many base reward functions $J$ are appropriate?

  • x-axis: number of shared preference axes
  • Results
    • Performance rises quickly as $J$ goes from 1 to 3, but improves little from 4 to 6

→ Adding more preference axes does not keep improving performance

⇒ ⭐ Shared human preferences really do have a low-dimensional structure!
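One quick way to probe this low-dimensional claim on one's own preference matrix is to inspect the singular-value spectrum of a densified $A$; a sanity-check sketch under the assumption that missing entries have been imputed first, not the paper's procedure:

```python
import numpy as np

def rank_energy_profile(A: np.ndarray) -> np.ndarray:
    """Cumulative share of variance in a user x response-pair matrix explained by
    its top-k singular directions. energy[k-1] close to 1 for small k suggests
    user preferences occupy a low-dimensional space."""
    s = np.linalg.svd(A - A.mean(axis=0), compute_uv=False)  # centered spectrum
    return np.cumsum(s**2) / np.sum(s**2)
```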

๋ฐ์ดํ„ฐ์…‹ ํฌ๊ธฐ์— ๋”ฐ๋ฅธ ์„ฑ๋Šฅ ๋น„๊ต

๋ชฉํ‘œ ๋ฐ์ดํ„ฐ์…‹ ํฌ๊ธฐ๊ณผ base reward model โ†” ์‚ฌ์šฉ์ž ์„ ํ˜ธ ์˜ˆ์ธก ์„ฑ๋Šฅ ์‚ฌ์ด์˜ ์—ฐ๊ด€์„ฑ

  • ๋ฐ์ดํ„ฐ์…‹ = PRISM ๊ธฐ๋ฐ˜ ํ•™์Šต ๋ฐ์ดํ„ฐ

  • ์‹คํ—˜ ๊ฒฐ๊ณผ
    • ๋ฐ์ดํ„ฐ์…‹์ด ์ปค์งˆ์ˆ˜๋ก ๋ชจ๋“  ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ
    • ๊ฐ™์€ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ์—์„œ์˜ ์„ฑ๋Šฅ: 3B > 1B > 0.5B
    • ๋ฐ์ดํ„ฐ๊ฐ€ ์ถฉ๋ถ„ํžˆ ๋งŽ์•„์งˆ์ˆ˜๋ก ๋ชจ๋ธ ๊ฐ„ ์„ฑ๋Šฅ ์ฐจ์ด ์ค„์–ด๋“ฆ

โ‡’ ๋” ํฐ reward model๊ณผ ๋” ๋งŽ์€ ๋ฐ์ดํ„ฐ๋Š” personalization ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ด

Human eval
  • 28 volunteers from the MIT/Harvard community; each user's preferences were learned from their first 15 comparisons and evaluated on the next 15
    • Personalized responses achieved a 67% win rate over default GPT-4o responses

Categories

RLHF · SVD · research