Blog

์ด๋‘ํ˜ธ
26 March 2026

LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts

ICLR'26 Oral

💡 Make a model reason well over long contexts (128K) using only short-context (16K) RL training. How? ⇒ RL training on KeyChain, a high-difficulty synthetic dataset that hides the question behind a chain of UUIDs, gives rise to a plan–retrieve–reason–recheck thinking pattern, which lets small 7B/14B models reach strong long-context reasoning performance.
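To make the "question hidden behind a UUID chain" idea concrete, here is a minimal sketch of how such a sample might be built and resolved. The summary above does not give the actual KeyChain data format, so everything here is an assumption for illustration: the helper names `make_uuid`, `build_keychain`, and `follow_chain` are hypothetical, as are the pair layout and the distractor scheme.

```python
import random
import uuid


def make_uuid(rng: random.Random) -> str:
    """Draw a UUID-formatted string from a seeded RNG so the example is reproducible."""
    return str(uuid.UUID(int=rng.getrandbits(128)))


def build_keychain(question: str, chain_len: int, n_distractors: int, rng: random.Random):
    """Hide `question` behind a chain of UUID keys (hypothetical layout).

    Returns the starting key and a shuffled list of (key, value) pairs.
    Each chain key points to the next key; the last key points to the question.
    """
    keys = [make_uuid(rng) for _ in range(chain_len)]
    pairs = [(keys[i], keys[i + 1]) for i in range(chain_len - 1)]
    pairs.append((keys[-1], question))
    # Distractor pairs lead to dead-end UUIDs, so they never reach the question.
    pairs += [(make_uuid(rng), make_uuid(rng)) for _ in range(n_distractors)]
    rng.shuffle(pairs)
    return keys[0], pairs


def follow_chain(start_key: str, pairs):
    """Follow key -> value links from `start_key` until a value is no longer a key."""
    lookup = dict(pairs)
    value = lookup[start_key]
    hops = 1
    while value in lookup:
        value = lookup[value]
        hops += 1
    return value, hops


rng = random.Random(0)
question = "What year was the bridge completed?"
start, pairs = build_keychain(question, chain_len=4, n_distractors=20, rng=rng)
recovered, hops = follow_chain(start, pairs)
print(hops, recovered)
```

In a training sample the pairs would presumably be scattered through a long document, so the model has to retrieve each link before it even knows which question to answer.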

이승환
26 March 2026

Language Model Personalization via Reward Factorization

COLM'25

💡 Decompose the preferences of many users into shared preference axes (e.g., friendliness, conciseness, formality) and learn those axes; then, when a new user arrives, assign a different weight to each axis to quickly estimate that user's personalized preference!
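The weighting idea can be sketched as a linear combination of per-axis scores, with a new user's weights fitted from a handful of pairwise comparisons. This is only an illustration under stated assumptions: the summary does not say how the weights are estimated, so the logistic (Bradley-Terry style) fit below is a stand-in, the function names `user_reward` and `fit_user_weights` are hypothetical, and the axis scores are made-up numbers rather than outputs of any learned reward model.

```python
import math


def user_reward(weights, axis_scores):
    """Personalized reward as a weighted sum of shared axis scores."""
    return sum(w * s for w, s in zip(weights, axis_scores))


def fit_user_weights(comparisons, n_axes, lr=0.5, steps=200):
    """Estimate one user's axis weights from (chosen, rejected) score pairs.

    Assumption: a logistic preference model, fitted by plain gradient ascent
    on the log-likelihood that the chosen response beats the rejected one.
    """
    w = [0.0] * n_axes
    for _ in range(steps):
        grad = [0.0] * n_axes
        for chosen, rejected in comparisons:
            diff = [c - r for c, r in zip(chosen, rejected)]
            p = 1.0 / (1.0 + math.exp(-user_reward(w, diff)))
            for k in range(n_axes):
                grad[k] += (1.0 - p) * diff[k]
        w = [wk + lr * g / len(comparisons) for wk, g in zip(w, grad)]
    return w


# Axes: (friendliness, conciseness, formality). Scores are illustrative only.
comparisons = [
    ([0.2, 0.9, 0.5], [0.9, 0.1, 0.5]),  # user picked the concise response
    ([0.1, 0.8, 0.3], [0.8, 0.2, 0.3]),  # and again
]
weights = fit_user_weights(comparisons, n_axes=3)
print([round(w, 2) for w in weights])
```

Because only a few weights are fitted per user while the axes are shared, a small number of comparisons is enough to get a usable estimate in this toy setting.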

26 March 2026

Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning

COLM'25

💡 For mathematical reasoning tasks, implement RL indirectly so that it stays simple. (= Solve math problems effectively in a reinforcement-learning form!)