26 March 2026

TROLL: Trust Regions Improve Reinforcement Learning for Large Language Models

๐Ÿ’กLLM์„ RL๋กœ ํ•™์Šตํ•  ๋•Œ ๋ชจ๋ธ์ด ํ•œ ๋ฒˆ์— ๋„ˆ๋ฌด ํฌ๊ฒŒ ๋ฐ”๋€Œ๋ฉด ๋ง๊ฐ€์ง€๋ฏ€๋กœ, ํ—ˆ์šฉ๋œ ๋ฒ”์œ„ ์•ˆ์—์„œ๋งŒ ์—…๋ฐ์ดํŠธํ•ด์„œ ์•ˆ์ „ํ•˜๊ฒŒ ํ•™์Šต์‹œํ‚ค์ž

์—ผ๊ทœํ™˜


Review

Nickname / One-line review / Rating (0/5)
๋Œ“์ธ ๋…ธ๋…ธ • Pros: defines a trust region and optimizes efficiently within it; shows better performance than existing clipping.
• Cons: the connection between the two motivations is weak. What does sparse projection have to do with PPO clipping?
• Suggestion: experiments emphasizing the efficiency of the token-distribution representation.
2.8
์•„์ด๋ฆฌ์Šค • Pros: presents the idea of gradual, flexible learning intuitively, and clearly identifies and addresses the limitations of existing methods. Strong performance.
• Cons: it feels ambiguous what exactly the method solves. Projecting only some of the tokens is also a bit hard to follow. Hard to tell what the paper's contribution is.
• Suggestion: couldn't a large update actually land closer to the optimum? Shouldn't the bound be handled dynamically?
3.5
ํ•ธ๋“œํฌ๋ฆผโ€ข ์žฅ์ : ์ •์ฑ… ์—…๋ฐ์ดํŠธ๊ฐ€ trust region ๋‚ด์—์„œ ์ผ์–ด๋‚˜๋Š” ๊ฒƒ์€ ๋ณด์žฅํ•˜๋˜ ์ž„์˜์˜ clipping ๊ธฐ์ค€๊ฐ’์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š์Œ. ํ•™์Šต์ด ์•ˆ์ •์ ์ธ ๋ฒ”์œ„์—์„œ ์ตœ๋Œ€ํ•œ์˜ ํšจ๊ณผ๋กœ ์ผ์–ด๋‚˜๊ฒŒ๋” ํ•จ
โ€ข ๋‹จ์ : policy clipping์ด ๋” ํšจ๊ณผ ์ข‹์€ ๊ฒฝ์šฐ๋Š” ์—†์„๊นŒ? reasoning ๋„๋ฉ”์ธ์ด ์•„๋‹ˆ๋ผ๋ฉด?
โ€ข ๋ณด์™„์ : ํƒ€ ๋„๋ฉ”์ธ ๋ฒค์น˜๋งˆํฌ ์‹คํ—˜
4.3
์—๋„ˆ์ง€ • Pros: seeing that the KL-constraint side of PPO (alongside its reward and policy-update issues) can still be improved was a reminder that no algorithm is ever perfect. Notably, the work considers not only the size of the policy update but also its direction.
• Cons: considering top-k looks like the best available choice, but losing long-tail tokens seems unavoidable.
• Suggestion: the trade-off is inevitable, but experiments on ways to preserve token diversity would be welcome.
3.8
3์›” • Pros: unlike existing clipping, which is heuristic and hard to analyze locally, applying a per-token constraint prevents specific tokens from changing excessively.
• Cons: no ablation on projection and sparsification, the direct sources of the training stability.
• Suggestion: add an experiment showing the trade-off of how much the policy is actually distorted for the sake of stability.
3.6
ํ™”์ดํŠธ๋…ธ์ด์ฆˆ โ€ข ์žฅ์ : clipping ๋งŒ ๋Œ€์ฒดํ•˜๋ฉด ๋œ๋‹ค๋Š” ์ ์—์„œ ํ”Œ๋Ÿฌ๊ทธ์ธ ํ˜ธํ™˜์„ฑ์ด ์ข‹์Œ
โ€ข ๋‹จ์ : PPO๋…ผ๋ฌธ์€ ์‹ค์ œ๋กœ ๋‹ค์–‘ํ•œ ๋„๋ฉ”์ธ์— ๋Œ€ํ•œ ์‹คํ—˜์ด ์žˆ๋Š” ๋ฐ”์— ๋น„ํ•ด ํ•ด๋‹น ๋…ผ๋ฌธ์€ ์ˆ˜ํ•™์ชฝ ๋ฐ–์— ์—†์Œ
โ€ข ๋ณด์™„์ : ์ˆ˜ํ•™ ์ด์™ธ์˜ ๋„๋ฉ”์ธ์—์„œ๋„ ์„ฑ๋Šฅ์ด ์–ด๋–จ์ง€ ๊ถ๊ธˆํ•จ
3.0
ํ”ผ์ฆˆ์น˜์ž • Pros: correctly identifies that the existing CLIP objective is only an approximation, and the attempt to improve it explicitly and directly is a good one.
• Cons: the method aims for stability at the token level, but it is unclear whether low per-token KL guarantees sequence-level stability.
• Suggestion: add long-form generation experiments or a more global analysis.
3.9
์ œ๋กœ์ฝœ๋ผ โ€ข ์žฅ์ : PPO clipping์€ ๊ฐ’ ํ•˜๋‚˜๋กœ ๋ชจ๋“  ํ† ํฐ์„ ๋˜‘๊ฐ™์ด ์ œํ•œํ•˜๋Š”๋ฐ, TROLL์€ ํ† ํฐ๋งˆ๋‹ค ๊ฐœ๋ณ„์ ์œผ๋กœ ์–ผ๋งˆ๋‚˜ ๋ฐ”๋€Œ์—ˆ๋Š”์ง€ ๋ณด๊ณ  ์ œ์–ดํ•œ๋‹ค๋Š” ์ ์ด ํ•ฉ๋ฆฌ์ ์œผ๋กœ ๋А๊ปด์ง.
โ€ข ๋‹จ์ : Trust region ์•ˆ์œผ๋กœ projectionํ•  ๋•Œ ๊ธฐํ•˜ ํ‰๊ท ์„ ์“ฐ๋Š” ๊ฒŒ ์ตœ์ ํ•ด๋ผ๊ณ  ํ•˜๋Š”๋ฐ, ์ด๊ฒŒ ์™œ ์ตœ์ ์ธ์ง€ ์ถฉ๋ถ„ํžˆ ์„ค๋ช…๋˜์ง€ ์•Š๋Š”๊ฒƒ ๊ฐ™์Œ.
โ€ข ๋ณด์™„์ : ์ˆ˜ํ•™์ฒ˜๋Ÿผ ์ •๋‹ต์ด ๋ช…ํ™•ํ•œ ํƒœ์Šคํฌ๊ฐ€ ์•„๋‹ˆ๋ผ, reward ์ž์ฒด๊ฐ€ ๋ชจํ˜ธํ•œ ๋„๋ฉ”์ธ์—์„œ๋„ TROLL์ด ์ž˜ ์ž‘๋™ํ•˜๋Š”์ง€ ์‹คํ—˜์ด ์žˆ์œผ๋ฉด ๋” ์„ค๋“๋ ฅ์ด ์žˆ์„ ๊ฒƒ ๊ฐ™์Œ.
3.5
์˜ค์ฐจ • Pros: simple clipping caps values and fails to use the available information fully; this method projects onto the trust region and thereby fixes the gradient problem, which is the strength of the work.
• Cons: with sparse projection the compute cost drops, but it is unclear whether optimality is preserved; experiments or proofs on this are lacking.
• Suggestion: rather than only showing that TROLL beats PPO clipping on LLMs, add generality experiments across diverse tasks.
3.6
์ฐฝ๋ฐฑ์นด์ธ„ • Pros: unlike prior papers that just take PPO off the shelf, examining whether it is optimal and proposing an optimal point is a big contribution! Papers using RL may now need to use TROLL instead of PPO..
• Cons: the verification that a large policy gradient is always bad seems insufficient; is there prior work on this?
• Suggestion: a case study would make it land better!
4

TL; DR

๐Ÿ’ก

When training an LLM with RL, the model breaks if it changes too much at once, so update only within an allowed range to keep training safe.

Summary

  • Research team: Karlsruhe Institute of Technology, Microsoft
  • Citations: 2

Preliminary

  • What is a Trust Region method?
    "Don't move too far in one step; update only within a safe region"
    • The standard gradient update:
      • $\theta_{new} = \theta_{old} + \alpha \nabla J(\theta)$

        ⇒ If the gradient is large, the step becomes too big and training stability degrades

        ⇒ Fatal in RL, where reward variance is high and even a small policy change can drastically change outcomes

    • Then how is the region defined? ⇒ With KL divergence
      • $KL(\pi_{new} \| \pi_{old}) \leq \epsilon$
        • Constrains the new policy from drifting too far from the old policy

      ⇒ Optimization goal: maximize reward while keeping the policy change small!


  • What is PPO (Proximal Policy Optimization)?
    "Computing the KL constraint takes too long, so approximate it with clipping!"
    • Define a ratio measuring how much more/less the new policy prefers a given action
      • $\frac{\pi_{\theta}(a_t|s_t)}{\pi_{old}(a_t|s_t)}$
    • But the ratio can still grow or shrink too much → apply clipping
      • Ratio within the normal range → $\frac{\pi_{\theta}(a_t|s_t)}{\pi_{old}(a_t|s_t)} \in (1-\epsilon, 1+\epsilon)$ → a sound update, used as-is
      • Ratio too large → $\frac{\pi_{\theta}(a_t|s_t)}{\pi_{old}(a_t|s_t)} > 1+\epsilon$ → clipped to prevent an excessive update
      • Ratio too small → $\frac{\pi_{\theta}(a_t|s_t)}{\pi_{old}(a_t|s_t)} < 1-\epsilon$ → clipped to prevent an excessive update

์—ฐ๊ตฌ ๋™๊ธฐ

LLM์˜ย post-training ๋‹จ๊ณ„์—์„œ

  • RLHF / RLVR ๋“ฑย ๊ฐ•ํ™”ํ•™์Šต ๊ธฐ๋ฐ˜ fine-tuning์ด ํ‘œ์ค€ ๋ฐฉ๋ฒ•์ด ๋จ
  • ๋Œ€๋ถ€๋ถ„์˜ ๋ฐฉ๋ฒ•์€ย PPOย ๊ธฐ๋ฐ˜ policy gradient ์•Œ๊ณ ๋ฆฌ์ฆ˜์„ ์‚ฌ์šฉ
    • ๋ชจ๋ธ์ด ์ƒ์„ฑํ•œ ํ† ํฐ์— ๋Œ€ํ•ด advantage ๊ณ„์‚ฐ
    • ๊ธฐ์กด policy์™€ ์ƒˆ๋กœ์šด policy์˜ ๋น„์œจ ๊ณ„์‚ฐ์„ ํ†ตํ•ด policy gradient ์—…๋ฐ์ดํŠธ
    • ์—…๋ฐ์ดํŠธ ํญ์ด ๋„ˆ๋ฌด ์ปค์ง€์ง€ ์•Š๋„๋ก clipping ์ ์šฉ

    โ‡’ policy ๋ณ€ํ™”๊ฐ€ ๋„ˆ๋ฌด ์ปค์ง€๋Š” ๊ฒƒ์„ ๋ง‰์•„ ํ›ˆ๋ จ ์•ˆ์ •์„ฑ ํ™•๋ณด

์ด ๋…ผ๋ฌธ์€ ๋‹ค์Œ ์งˆ๋ฌธ์—์„œ ์ถœ๋ฐœํ•จ

โ€œLLM ๊ฐ•ํ™”ํ•™์Šต์—์„œ PPO clipping์ด ์•„๋‹Œ ๋” principledํ•œ trust region ๋ฐฉ์‹์ด ํ•„์š”ํ•˜์ง€ ์•Š์„๊นŒ?โ€

Limitations of the existing PPO clipping mechanism

  • Clipping is not a theoretically exact trust region; it is a heuristic that simply truncates the ratio
  • Once outside the clipping range, the gradient vanishes → slow convergence
  • Sensitive to hyperparameters, leading to low reproducibility
  • Designed around continuous actions, so hard to apply directly to LLMs with discrete token distributions

์ œ์•ˆ ์•„์ด๋””์–ด

  • ๊ทธ๋ฆผ ์„ค๋ช…
    • 3 ํ† ํฐ ๋ถ„ํฌ (๊ณ ์–‘์ด / ํŠธ๋กค / ํ–„์Šคํ„ฐ) ๋ฅผ ๋‚˜ํƒ€๋ƒ„
    • ๊ธฐ์กด policy๋Š” ํŠธ๋กค ํ† ํฐ์„ ์„ ํ˜ธํ•˜๊ณ , ์ƒˆ๋กœ์šด policy๋Š” ํ–„์Šคํ„ฐ์ชฝ์œผ๋กœ ์ด๋™
      • ๊ทผ๋ฐ ๋„ˆ๋ฌด ๋ฉ€๋ฆฌ ์ด๋™์‹œํ‚ค๋ฉด ์•ˆ๋˜๋‹ˆ๊นŒ trust region ์•ˆ์œผ๋กœ projectionํ•ด์„œ ๋Œ์–ด์˜ค์ž!
  • ๊ทผ์‚ฌ ๋ฐฉ์‹์ธ clipping์ด ์•„๋‹Œ, ์ •ํ™•ํ•œ trust region์„ ํ™œ์šฉํ•˜์—ฌ projectionํ•˜์ž!
    • ์ƒˆ๋กœ์šด policy์™€ ์ตœ๋Œ€ํ•œ ๊ฐ€๊นœ๊ฒŒ ์œ ์ง€ํ•˜๋ฉด์„œ old policy์™€ KL ๊ฑฐ๋ฆฌ ์ œํ•œ
    • Token-level KL constraint
      • LLM์€ ์‹œํ€€์Šค์ด๊ธฐ ๋•Œ๋ฌธ์— ๊ฐ ํ† ํฐ ๋ถ„ํฌ์— ๋Œ€ํ•ด trust region ์ ์šฉ
    • Sparse projection (LLM scaling ๋ฌธ์ œ ํ•ด๊ฒฐ)
      • ํ™•๋ฅ  ๋†’์€ ํ† ํฐ๋งŒ ์œ ์ง€ํ•˜์—ฌ projection ๊ณ„์‚ฐ ๋น„์šฉ์„ ๋‚ฎ์ถค

Methods

  • Trust Region Projection
    • Find the distribution closest to the new model's policy $\tilde{\pi}_{\theta}$ that does not drift too far from the old policy $\pi_{old}$!

      ⇒ So what is the optimal solution?

      • The geometric mean of the old and new policies
        • Solving the KL-constrained optimization problem linearizes in log-space → taking the exponential yields a geometric mean
    • Meanwhile, the projection is only needed for some tokens!
      • Most tokens already satisfy the KL constraint and are used as-is; only a few tokens need projection
  • Sparse & Efficient Representations of Token Distributions
    • Qwen3's vocab size is 151,936
      • 150K probabilities must be handled per token → too expensive to be practical

    ⇒ Make the distribution sparse and keep only the important tokens!
    ⇒ Prevents OOM and increases compute efficiency

    • How?
      • Select the top-K tokens by probability
        • Keep only the tokens whose cumulative probability reaches a threshold (e.g., 99.9%)
      • Always include the token the model actually sampled, so it contributes to the gradient

Experiments

  • Goals
    • Does replacing PPO clipping with TROLL improve performance?
    • Does the effect hold across different models and RL algorithms?
    • Does it also help on real RLVR tasks such as math reasoning / code generation?
  • Datasets
    • DAPO-Math: used to train math reasoning with RL
    • Math-Eval: a collection of math benchmarks, up to olympiad-level problems
    • GSM8K: grade-school-level problems
    • Eurus-2-RL: includes math reasoning & code generation problems
  • LLMs used
    • Qwen3-{0.6B~14B}
    • Qwen2.5-{0.5B~7B}
    • Llama3.1-8B, Llama3.2-3B, Apertus-8B, Smol-LM3-3B
  • Result 1: performance with Qwen models
    "When training Qwen3 models with GRPO, what changes if TROLL replaces PPO clipping?"
    • TROLL trains faster on every model ⇒ better training efficiency!
    • Final performance is higher, with large gains even on small models
    • The same pattern appears on code-generation data
    "Does TROLL actually improve performance over PPO clipping across different RL algorithms?"
    • TROLL greatly improves training stability
      • Suggests PPO clipping itself may be limiting RL optimization
  • Result 2: verifying the same effect on other LLMs
    • TROLL achieves higher performance on most LLMs
    • There are cases where Clip fails to train at all (e.g., Llama3.1-8B)
    • TROLL starts learning much earlier and also improves training stability
  • Result 3: hyperparameter analysis
    • Sweeps of the KL bound (trust-region size) $\epsilon$ & the number of sparsification tokens $K$
      • A small KL bound heavily restricts policy change, slowing training, but final performance is unchanged
      • A large KL bound degrades performance and training quality
      • Too few sparsification tokens worsen the approximation of the true distribution ⇒ worse policy updates
      • Too many sparsification tokens increase compute cost with little performance gain

  • Result 4: an entropy perspective
    • Existing problem: PPO-like clipping trains in a direction that reduces entropy (to prevent excessive updates)
      → the distribution collapses toward particular modes
      → learning new reasoning strategies becomes difficult
    • TROLL keeps entropy high throughout!
      • Performing the projection within the KL constraint keeps the distance to the previous policy bounded → gradients keep flowing

Categories

RL research