RLHF

이승환
26 March 2026

Language Model Personalization via Reward Factorization

COLM'25

💡 Decompose the preferences of many users into shared preference axes (e.g., friendliness, conciseness, formality) and learn those axes; then, when a new user arrives, assign a different weight to each axis to quickly estimate that user's personalized preferences!
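
The idea summarized above can be illustrated with a minimal sketch, assuming the shared axes are already learned and each response is scored on them. A new user's weights are then fit from a handful of pairwise comparisons with a Bradley-Terry (logistic) model. The function name `fit_user_weights`, the synthetic data, and the hyperparameters are hypothetical and chosen for illustration only; they are not taken from the paper.

```python
import numpy as np

def fit_user_weights(feature_diffs, prefs, lr=0.5, steps=500, l2=1e-2):
    """Fit one user's weights over shared preference axes (illustrative only).

    feature_diffs: (n, k) array, phi(response A) - phi(response B) on k shared axes.
    prefs: (n,) array, 1.0 if the user preferred A, 0.0 if the user preferred B.
    """
    w = np.zeros(feature_diffs.shape[1])
    for _ in range(steps):
        # Bradley-Terry probability that A is preferred under the current weights.
        p = 1.0 / (1.0 + np.exp(-(feature_diffs @ w)))
        # Gradient of the mean logistic loss plus a small L2 penalty.
        grad = feature_diffs.T @ (p - prefs) / len(prefs) + l2 * w
        w -= lr * grad
    return w

# Synthetic user over three assumed axes: [friendliness, conciseness, formality].
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.5, 1.0])
diffs = rng.normal(size=(500, 3))
prefs = (rng.random(500) < 1.0 / (1.0 + np.exp(-(diffs @ true_w)))).astype(float)

w_hat = fit_user_weights(diffs, prefs)

# The personalized reward of a response is the weighted sum of its axis scores.
new_response_features = np.array([0.3, -0.2, 0.9])
personalized_reward = float(new_response_features @ w_hat)
```

Because only the low-dimensional weight vector is fit per user, a small number of comparisons can be enough in this toy setting; how the shared axes themselves are learned is outside this sketch.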

이승환
19 March 2026

What's In My Human Feedback? Learning Interpretable Descriptions of Preference Data

ICLR'26 Oral

💡 Proposes WIMHF, a method that uses a sparse autoencoder (SAE) to automatically extract the latent feature axes that determine which of two responses is preferred in a preference dataset, and that explains in interpretable natural language which response characteristics drive human preferences.
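
To make the SAE step concrete, here is a rough sketch of a single forward pass of a top-k sparse autoencoder. It assumes, for illustration, that the input is the difference between the embeddings of the two responses in a preference pair. The weights are random and untrained, and the function name, dimensions, and the top-k sparsity choice are all assumptions rather than details confirmed by the summary above.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Forward pass of a top-k sparse autoencoder (illustrative only).

    x: (n, d) inputs. Returns (codes, reconstruction), where each row of
    codes has at most k nonzero, non-negative activations.
    """
    pre = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    # Keep the k largest activations in each row and zero out the rest.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    codes = np.zeros_like(pre)
    np.put_along_axis(codes, idx, np.take_along_axis(pre, idx, axis=1), axis=1)
    recon = codes @ W_dec + b_dec
    return codes, recon

rng = np.random.default_rng(0)
n, d, m, k = 8, 16, 32, 4            # pairs, embedding dim, latent features, active features
emb_a = rng.normal(size=(n, d))      # stand-in embeddings of response A
emb_b = rng.normal(size=(n, d))      # stand-in embeddings of response B
x = emb_a - emb_b                    # assumed SAE input: embedding difference per pair

W_enc = rng.normal(size=(d, m)) / np.sqrt(d)
W_dec = rng.normal(size=(m, d)) / np.sqrt(m)
b_enc, b_dec = np.zeros(m), np.zeros(d)

codes, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

In a trained model, each column of `codes` would correspond to one candidate feature axis; relating those features to the preference labels and describing them in natural language are separate steps not shown here.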