26 March 2026

LoongRL: Reinforcement Learning for Advanced Reasoning over Long Contexts


이두호

Review

Nickname | Strength & Weakness & Suggestions | Rating (0/5)
댓츠노노 • Strength: Uses UUIDs to force the model to reason without tricks / performs well regardless of context length / two-way substring matching allows a reasonable comparison of answers / feels a bit like an assorted gift set
• Weakness: Removing problems that are too hard in order to pick "an appropriate difficulty for RL to work effectively" is not convincing.
• Suggestion: In the long-context expansion step, wouldn't adding documents on a similar topic make for harder negatives than inserting unrelated real documents?
3.5
화이트노이즈 • Strength: The three problems of long-context reasoning, i.e. the motivation, are clear, and the methodology is clean.
• Weakness: Using UUIDs may help raise RL performance today, but in the long run I think it is no more than a stopgap. In the end, raising the capability of the model itself seems more important.
• Suggestion: An experiment on whether the method also works on long-context narrative datasets would be welcome.
3.0
아이리스 • Strength: Efficient training, good performance, clear experiments. The paper is well organized, down to the writing and experiments that preempt points reviewers are likely to raise, and it solves a practical problem.
• Weakness: Somewhat doubtful whether it generalizes, for example to other models or to keeping performance on datasets other than QA.
• Suggestion: Experiments on a wider range of models and datasets?
3.5
핸드크림 • Strength: Everything the model is meant to learn is packed into a single training set, and training on it is effective. Does this show that training-data composition matters?
• Weakness: The training procedure is simple, so it assumes the model will pick up the important elements of the data on its own.
• Suggestion: An ablation study for each component of the training data.
3.5
3월 • Strength: Hiding the "real question" blocks semantic shortcuts and makes the model trace the chain sequentially, which is clever.
• Weakness: As the text grows longer, could tracing by chain become harder? Generalization looks difficult once conflicting sentences and ambiguous referring expressions increase.
• Suggestion: Extend to cases with contradictions between documents, or to reasoning over multimodal inputs such as tables.
3.6
에너지 • Strength: Proposes a way to make the problem-solving (QA) process reflect solid reasoning and produce meaningful answers.
• Weakness: The direction ultimately seems to be "manipulate the data well so that reasoning becomes logical", but there appears to be nothing from the model perspective, only the data. Will it generalize?
• Suggestion: Adding research from the model perspective would help.
3.5
제로콜라 • Strength: Hiding the question behind UUID chains blocks shortcuts, and the paper offers a way to raise long-context reasoning performance in small models without worrying about memory and compute cost.
• Weakness: What the plan–retrieve–reason–recheck thinking pattern actually is does not seem clear. The results are good, but the link between them and the thinking pattern needs more study.
• Suggestion: Study the connection between the results and the thinking pattern.
3.3
피즈치자 • Strength: Training on short inputs and generalizing to long ones seems practical.
• Limitation: KeyChain forces thinking patterns such as stepwise retrieval, but can chain tracing itself be called natural long-context reasoning?
• Suggestion: The task could be extended by defining the sequence one wants to train, such as document comparison or synthesis, and expanded to other long-context tasks in a similar way.
3.5
창백카츄 • Strength: Big tech companies advertise that their models have very long context, yet the models do poorly once that context is actually filled. Pinpointing this and proposing a method that improves it is hard, and the paper does it well.
• Weakness: Including irrelevant information in the experiments is realistic, but simply filtering for relevant information before passing it to the LLM seems far more efficient. Shouldn't the method be tested in a setting where the evidence is surrounded by similar information?
• Suggestion: Expand the training and experiment settings!
3.6
오차 • Strength: Training is kept short and a practical problem of the target task is solved through the dataset, which is a strength.
• Weakness: What "balanced training data" means needs to be studied more clearly.
• Suggestion: The experimental data and models should be diversified.
3.5

TL;DR

💡

Get strong long-context (128K) reasoning from short-context (16K) RL training alone.

How?

⇒ RL training on KeyChain, a high-difficulty synthetic dataset that hides the real question behind UUID chains, gives rise to a plan–retrieve–reason–recheck thinking pattern, so small 7B/14B models reach strong long-context reasoning performance.

Summary

  • Authors:
  • Citations: 3

Background & Motivation

Background

  • What is long-context reasoning?

    The ability to retrieve relevant information from external documents spanning tens to hundreds of thousands of tokens and then reason over it.

    Modern models have long context windows, accept large inputs, and retrieve well from short documents, but they are weak at retrieving from long documents and reasoning over what they find.

  • Many real tasks, such as legal document analysis and codebase analysis, require reasoning that integrates information across tens to hundreds of thousands of tokens (long-context reasoning).
  • Recent RL-based models such as DeepSeek-R1 and the OpenAI o-series are trained with reinforcement learning to elicit longer CoT, self-reflection, and similar behaviors, mainly in short-context reasoning that relies on internal knowledge.
  • Retrieving information from external documents and reasoning over it (long-context reasoning) still works poorly.

Three main problems with existing approaches

Existing long-context RL methods currently face three problems.

  • Problem 1 - Lack of high-difficulty training data
    • Training needs hard long-context problems that cannot be solved by simple retrieval alone.
    • Such data is scarce, and answers come in many surface forms, which makes automatic verification difficult.
  • Problem 2 - Compute cost
    • RL training requires generating several answers (rollouts) per problem.
    • With long inputs of around 128K tokens, the memory and compute cost becomes prohibitive.
  • Problem 3 - Degraded short-context ability after long-context training
    • Training only on long-context data degrades short-context reasoning such as math, so general performance drops.

⇒ The paper proposes LoongRL, a data-centric long-context RL method that addresses these three problems.

Contributions (What they've revealed)

  • KeyChain data synthesis
    • Converts existing short-context multi-hop QA into high-difficulty long-context problems by inserting distracting documents and UUID chains.
    • The UUID chains prevent the model from using lexical or semantic shortcuts and require it to single out the real question.
    • Rule-based reward design (two-way substring matching)
      • Verifies free-form QA answers with a single rule, without LLM-as-a-judge.
      • Tolerates variation in phrasing while preventing reward hacking, i.e. learning to exploit the reward.
  • Eliciting a new reasoning pattern
    • With RL training on KeyChain data, the plan–retrieve–reason–recheck pattern emerges on its own.
    • Training at 16K tokens generalizes to 128K inference → long-context performance without the cost of long-context RL.
  • Balanced data mix + three-stage RL curriculum
    • Short-context math data is mixed in with the long-context reasoning and retrieval data to preserve short-context reasoning.
    • Three-stage curriculum: Warm-up → Stage I (introducing KeyChain) → Stage II (focusing on difficulty)

Methods

KeyChain data construction

  • Three-step data construction
    • Step 1 - Seed data filtering
      • Secures the difficulty range in which RL works effectively.
      • Collect 277K instances from HotpotQA, MuSiQue, and 2WikiMultiHopQA → have Qwen2.5-32B answer each question 8 times and remove questions with a pass rate of 0 (too hard) or 1 (too easy) → 72K medium-difficulty examples remain.
    • Step 2 - Long-context expansion
      • Simulates the real-world situation in which relevant information is buried in a large amount of irrelevant text.
      • Each of the 72K filtered examples keeps its original short context and is extended to a long context of about 16K tokens by inserting unrelated real documents sampled from the 200K examples removed during filtering. The original question is left unchanged.
    • Step 3 - KeyChain insertion
      • Insert two kinds of UUID chains at random positions in the long context.
      • True chain (one): a chain that leads to the original question oq_i. The model must trace the chain step by step from the starting UUID to find the real question, then retrieve evidence from the long context and reason over it to produce the answer.
      • Fake chains (several): chains that lead to distractor questions sampled from other QA instances. They connect to plausible but irrelevant questions and are meant to confuse the model.
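
The three steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: `build_keychain_example`, its parameters, the default chain length, and the JSON serialization of each link are hypothetical choices, documents are treated as opaque strings, and nothing here counts tokens toward the 16K budget.

```python
import json
import random
import uuid


def build_keychain_example(question, gold_docs, distractor_docs,
                           distractor_questions, chain_len=3, seed=0):
    """Hypothetical sketch: return (context_pieces, start_uuid)."""
    rng = random.Random(seed)

    def new_id():
        # UUID-shaped identifier drawn from the seeded RNG for reproducibility.
        return str(uuid.UUID(int=rng.getrandbits(128)))

    def make_chain(final_question):
        # id_1 -> id_2 -> ... -> id_n -> question, one {key: value} per hop.
        ids = [new_id() for _ in range(chain_len)]
        links = [{ids[i]: ids[i + 1]} for i in range(chain_len - 1)]
        links.append({ids[-1]: final_question})
        return ids[0], links

    # The single true chain ends at the original question.
    start_uuid, links = make_chain(question)
    # Each fake chain ends at a question sampled from another QA instance.
    for other_question in distractor_questions:
        _, fake_links = make_chain(other_question)
        links.extend(fake_links)

    # Step 2: mix the original documents with unrelated real documents.
    pieces = list(gold_docs) + list(distractor_docs)
    rng.shuffle(pieces)
    # Step 3: place every chain link at a random position in the context.
    for link in links:
        pieces.insert(rng.randint(0, len(pieces)), json.dumps(link))
    return pieces, start_uuid
```

With one distractor question and `chain_len=3`, the result contains six link entries scattered among the documents, and only the chain that begins at `start_uuid` leads to the original question.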

  • Why UUIDs?

    A UUID is a high-entropy identifier with no meaning, so the model cannot rely on lexical or semantic shortcuts while generating tokens.

    In an ablation that replaced the UUIDs with random strings of the same length, performance was essentially unchanged (72.4 vs 72.2), confirming that the meaninglessness of the identifier is the key property.

    → The UUID format itself is not essential; any meaningless identifier works.

  • Example of a KeyChain-augmented long-context question

    Instruction given to the model: "Starting from the initial UUID, follow the chain of consecutive key:value pairs to find the real question, then answer it."

    {"UUIDA-1": "UUIDA-2"} ← true chain, hop 1
    {"UUIDA-2": "UUIDA-3"} ← true chain, hop 2
    {"UUIDA-3": "original question oq"} ← real question reached

    {"UUIDB-1": "UUIDB-2"} ← fake chain
    {"UUIDB-2": "distractor question q'"} ← distractor question
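
What the instruction asks of the model can be written as a plain lookup loop. The sketch below assumes the links have already been parsed into Python dicts; `follow_chain` and `max_hops` are hypothetical names, and the keys are the placeholder strings from the example rather than real UUIDs.

```python
def follow_chain(links, start_key, max_hops=32):
    """Follow key -> value hops until the value is no longer a key."""
    table = {}
    for link in links:
        table.update(link)
    current = start_key
    for _ in range(max_hops):
        if current not in table:
            # Not a key anymore, so this is the question at the end of the chain.
            return current
        current = table[current]
    raise ValueError("chain did not terminate within max_hops")


links = [
    {"UUIDA-1": "UUIDA-2"},
    {"UUIDB-1": "UUIDB-2"},
    {"UUIDA-2": "UUIDA-3"},
    {"UUIDB-2": "distractor question q'"},
    {"UUIDA-3": "original question oq"},
]
print(follow_chain(links, "UUIDA-1"))  # original question oq
```

The model has no such lookup table. It has to perform the same hops in-context, with the links scattered among the surrounding documents, and only then answer the question it recovers.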

  • KeyChain elicits a new reasoning pattern: plan–retrieve–reason–recheck
    • With RL training on KeyChain data, the model acquires the plan–retrieve–reason–recheck pattern on its own.
      • Plan: decompose the problem into sub-steps and lay out the solution path first.
      • Retrieve: explicitly extract the information needed at each step from the long context.
      • Reason: reason step by step over the extracted information.
      • Recheck: when uncertain, go back to the relevant documents and verify.
    • Models trained on ordinary long-context QA data show a pattern in which retrieval and reasoning are interleaved with no explicit planning step, and this is a major cause of wrong answers.
    • The pattern applies independently of context length.
      • Training at 16K tokens generalizes to 128K inference, giving long-context performance without the heavy compute cost of full-length RL.

Reward design

  • Reward design via two-way substring matching
    • Answers in general QA can take many surface forms, which makes automatic verification difficult.
      • Strict exact match, however, marks an answer wrong when only the phrasing differs, as with "1 December 2010" vs "2010년 12월 1일".
    • To address this, the following rule-based reward is used, with no LLM-as-a-judge.
    • The prompt requires the model to put its final answer inside "\boxed{}", which makes answer extraction unambiguous; the reward then checks two-way containment between the extracted answer and the ground truth.
      • (The reward is 1 if the extracted answer contains the ground truth or the ground truth contains the extracted answer, and 0 otherwise.)
    • Two-way substring matching tolerates variation in phrasing while keeping precision → prevents reward hacking.

    Reward comparison (ablation): F1 (65.1) < LLM-as-a-judge (65.2) < strict exact match (69.2) < two-way substring matching (72.4)
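
The rule described above fits in a few lines of Python. This is a sketch under assumptions rather than the paper's implementation: the function names are hypothetical, the normalization (lowercasing and collapsing whitespace) is an assumed choice, and the regular expression does not handle nested braces inside `\boxed{}`.

```python
import re


def extract_boxed(response):
    # Take the content of the last \boxed{...}; nested braces are not handled.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None


def normalize(text):
    # Assumed normalization: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def two_way_substring_reward(response, gold_answer):
    """Return 1 if either string contains the other, else 0."""
    extracted = extract_boxed(response)
    if extracted is None:
        return 0
    pred, gold = normalize(extracted), normalize(gold_answer)
    if not pred or not gold:
        return 0  # an empty string would trivially be a substring
    return 1 if (pred in gold or gold in pred) else 0
```

For example, `two_way_substring_reward(r"The answer is \boxed{1 December 2010}", "December 2010")` returns 1 because the ground truth is contained in the extracted answer, while a response with no `\boxed{}` gets 0.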

Training data composition

  • Data composition (22,024 examples in total)
    • High difficulty (with KeyChain) + medium difficulty (without KeyChain, ordinary multi-hop QA) + long-context retrieval + short-context math.
      • The medium-difficulty QA data matters especially for the small model (7B).
        • When KeyChain problems are too hard early on and the RL signal is unstable, medium-difficulty problems let the model build basic ability first.
      • The short-context math data acts as a buffer against the loss of short-context ability caused by long-context training.

Three-stage RL curriculum

  • The three RL stages of LoongRL
    • Warm-up (42 steps): train on the data without KeyChain to build basic retrieval and reasoning ability.
      • The 14B model already has strong base ability and skips this stage.
    • Stage I - Introducing KeyChain (7B: 168 steps, 14B: 168 steps)
      • KeyChain data is added to training, pushing the model toward planning, precise retrieval, and multi-step reasoning.
      • Recheck behavior emerges in this stage, and response length grows gradually.
    • Stage II - Focusing on difficulty (7B: 118 steps, 14B: 150 steps)
      • Run 8 rollouts per example with the best Stage I checkpoint and remove the easy examples that are answered correctly every time → only the hard examples, 30~40% of the total, remain for focused training.
      • Explicit plan behavior additionally appears in this stage, and responses become shorter and more accurate.
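
The Stage II selection rule can be sketched as a pass-rate filter over rollout rewards. The function name, the flags, and the input format (a dict from example id to a list of 0/1 rewards) are assumptions made for illustration. Setting `drop_all_wrong=True` as well gives the variant used for seed data filtering, where questions with a pass rate of 0 are also removed.

```python
def filter_by_pass_rate(rollout_rewards, drop_all_correct=True,
                        drop_all_wrong=False):
    """Return the ids of examples kept for the next training stage."""
    kept = []
    for example_id, rewards in rollout_rewards.items():
        if not rewards:
            raise ValueError(f"no rollouts for {example_id!r}")
        pass_rate = sum(rewards) / len(rewards)
        if drop_all_correct and pass_rate == 1.0:
            continue  # solved on every rollout: too easy to give a signal
        if drop_all_wrong and pass_rate == 0.0:
            continue  # never solved: treated as too hard
        kept.append(example_id)
    return kept


rewards = {
    "easy": [1, 1, 1, 1, 1, 1, 1, 1],
    "mixed": [1, 0, 1, 1, 0, 1, 1, 1],
    "hard": [0, 0, 0, 0, 0, 0, 0, 0],
}
print(filter_by_pass_rate(rewards))  # ['mixed', 'hard']
```

Under this reading, Stage II drops only the always-correct examples, so examples that are never solved stay in the pool.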

Experiments

Result

  • Main results
    • Long-context reasoning (LongBench v1)
      • LoongRL-14B: 74.2 → close to o3-mini (74.5) and DeepSeek-R1 (74.9)
      • Absolute accuracy gain over the base models: 7B +23.5%, 14B +21.1%
      • LoongRL-7B (72.4) outperforms QwenLong-L1-32B (70.1), which has 4.6x as many parameters
    • Short-context reasoning is preserved
      • MMLU: improves (+2.8%, +1.1%)
      • IFEval: small drop (-0.3%, -2.6%), far less degradation than R1-Distill

  • Generalization to long context
    • Trained at 16K → strong performance is maintained at 128K
    • The R1-Distill family (existing RL-trained models) degrades sharply at 128K, whereas LoongRL stays stable
  • Improved long-context retrieval (Needle-in-a-Haystack)
    • Measures retrieval from long documents at various depths
    • LoongRL-7B reaches 100% accuracy across the whole range up to 128K
    • The baselines Qwen2.5-7B-Instruct and QwenLong-L1-32B fail in some ranges
    • LoongRL substantially improves retrieval performance, with LoongRL-7B reaching perfect accuracy at every depth

Categories

Long Context Reasoning RL research