RL

염규환
26 March 2026

TROLL: Trust Regions Improve Reinforcement Learning for Large Language Models

ICLR'26 Oral

💡When an LLM is trained with RL, the model breaks if it changes too much in a single step, so update it only within an allowed range to keep training safe.

이두호
26 March 2026

LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

ICLR'26 Oral

💡Get strong long-context (128K) reasoning from short-context (16K) RL training alone. How? ⇒ RL training on KeyChain, a high-difficulty synthetic dataset that hides the question behind a UUID chain, gives rise to a plan–retrieve–reason–recheck thinking pattern, letting small 7B/14B models reach strong long-context reasoning performance.

26 March 2026

Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning

COLM'25

💡For mathematical reasoning tasks, implement RL indirectly to keep the approach simple (= solve math problems effectively in a reinforcement-learning form!).

19 March 2026

Why DPO is a Misspecified Estimator and How to Fix It

ICLR'26 Oral

💡Digs topologically into why DPO's premise is not realistic. Mitigate DPO's misspecification with AuxDPO!

19 March 2026

Multiplayer Nash Preference Optimization

ICLR'26 Poster

💡The goal of alignment should not be to maximize reward, but to reach a stable equilibrium that loses to no one within a population of multiple values and policies!

이두호
19 March 2026

Diffusion Alignment as Variational Expectation-Maximization

ICLR'26 Poster

💡Solve the reward over-optimization and mode collapse that arise when aligning a diffusion model to an objective function, using an EM algorithm that alternates an E-step (test-time search) → M-step (forward-KL)!