ALIGNMENT

26 March 2026

Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games

COLM'25

💡 Unless current reasoning optimization separately aligns models for cooperation, it can produce individualistic models that pursue rational self-interest rather than cooperation! In other words, reasoning ability and the ability to collaborate (in the sense of accepting a cost) are separate things!

염규환
19 March 2026

OrthAlign: Orthogonal Subspace Decomposition for Non-Interfering Multi-Objective Alignment

ICLR'26 Poster

💡 When optimizing for multiple preferences, decompose the parameter-update space into orthogonal subspaces so that interference between objectives is eliminated at the source.

19 March 2026

How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence

COLM'25

💡 A mechanistic analysis of how a model's internal knowledge, truthfulness, safety, and confidence change after post-training!