26 November 2025

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?


🥉


Review

| Nickname | One-line review | Rating (/5) |
|---|---|---|
| MNG | Intuitively, RL seems to be about making the model better at the subset of what it already does that the reward design targets. From that angle, isn't a narrowing reasoning scope an unavoidable consequence? If so, couldn't we instead look for ways to exploit that narrowing deliberately? | 4 |
| Bangeo-nyamnyam | Since LLMs have been trained on most of the data that currently exists, they arguably already hold all the world knowledge. So among the "reasoning abilities the LLM lacks," is there anything we actually know of? In other words, maybe RL can only ever push existing reasoning to be done better. On the other hand, I also wondered: do we really need reasoning abilities the LLM doesn't have? Feels like research for research's sake! | 3.8 |
| Ochazuke | RL is ultimately a stage for learning "behaviors that earn more reward," and a model's fundamental reasoning ability seems largely fixed at pretraining. So RL and SFT don't create new abilities; they adjust how existing abilities are expressed, more efficiently and in preferred forms. | 4 |
| Yakitori | In brain terms, RL reasoning is less about growing the brain's fixed capacity and more about helping it use what it has to the fullest (the movie Lucy comes to mind). Conclusion: both RL and the model itself matter. | 4 |
| 42REN | Presenting answers and rewards for a specific task is what RLVR boils down to, so the reasoning scope can only narrow. Still, the paper's point that RLVR merely draws abilities out of the base model could prompt a search for ways to properly train the underlying model. | 4.2 |
| Tumbler | A fairly predictable scenario, but isn't finding answers within a small k of samples exactly RL's purpose? If you expect classical RL, new ideas emerging would be nice, but the RL we feed LLMs today is supervision toward doing better, so for now RLVR is faithfully doing exactly its job. The insight is good though!! | 3.5 |
| Gamja | RL is said to strengthen a model's reasoning, but the experiments suggest it makes the model better at what it already knows rather than teaching anything new. For a domain entirely new to the LLM, doing RL from the start may not be the best choice. | 4 |
| Saewoo | A logically well-argued paper on the point that RL doesn't raise an LLM's "stats" but only reinforces the reasoning patterns that earn reward within already-learned world knowledge. | 4.1 |

TL;DR

💡

With RLVR, the model does find correct paths among its sampling paths more efficiently, but it is not considering anything the original model wouldn't consider! What's more, as you scale up sampling, the reasoning scope actually ends up narrower than the base model's!
my insight: is this, too, a curse of knowledge?!


Summary

Background & Motivation

  • RLVR (Reinforcement Learning with Verifiable Rewards)
    • Treat the LLM's next-token prediction as the policy in an RL setup!
      • Reward is given when the model generates the correct answer
      • RLVR algorithms use PPO's objective
      $$L_{\text{CLIP}} = \mathbb{E}\left[\min\left(r_t(\theta)A_t,\ \mathrm{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)A_t\right)\right], \qquad r_t(\theta) = \frac{\pi_\theta(y_t \mid x,\, y_{<t})}{\pi_{\mathrm{old}}(y_t \mid x,\, y_{<t})}$$
      • Here, clip acts as a threshold that blocks overly large policy updates
      • $A_t$ is the advantage: it weights how much better this action (generation) was than the average action
        • Getting a hard problem right (where the average success rate is low) yields an especially large advantage!
  • Traditional RL is known for producing genuinely new strategies and ideas (e.g. AlphaGo's move 37),
    so does RLVR for LLMs likewise create reasoning abilities the LLM does not already have, or does it merely make existing reasoning work better?
  • → Measure the reasoning capacity boundary of the base model vs. the RLVR model: which problems can each potentially solve?
    • Evaluate whether the model can generate the correct answer given a very large sampling budget!
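The clipped objective above can be written as a tiny stdlib-only Python sketch; the function name and its inputs are my own illustration, not code from the paper:

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective L_CLIP (to be maximized).

    logp_new / logp_old: per-token log-probs under the current and old policy;
    advantages: A_t, how much better than average each generation was.
    """
    total = 0.0
    for ln, lo, a in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)                        # r_t(theta)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))  # clip(r_t, 1-eps, 1+eps)
        total += min(ratio * a, clipped * a)             # pessimistic (min) bound
    return total / len(advantages)
```

When `logp_new == logp_old` the ratio is 1 and the objective reduces to the mean advantage; for a positive advantage, pushing the ratio past 1+ε contributes nothing extra because the clipped term takes over, which is exactly the "threshold against too-large updates" role described above.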

Key Findings

  1. The reasoning boundary of current RLVR models is narrower than that of their base models
  2. The reasoning paths RLVR models generate already exist in the base models
  3. Different RLVR algorithms perform similarly, and all remain far from optimal
  4. RLVR and distillation are fundamentally different

Experiments

Experimental setup
  • Math tasks use models without SFT; the other tasks use SFT'd models
  • RL is applied to each start model, then before/after are compared!
  • The "Deep Analysis" row is the setting for the in-depth analysis in Section 4, not for the main results
| Task | Start Model | RL Framework | RL Algorithm(s) | Benchmarks |
|---|---|---|---|---|
| Mathematics | LLaMA-3.1-8B / Qwen2.5-7B/14B/32B Base / Qwen2.5-Math-7B | SimpleRLZoo, Oat-Zero, DAPO | GRPO | GSM8K, MATH500, Minerva, Olympiad, AIME24, AMC23 |
| Code Generation | Qwen2.5-7B-Instruct / DeepSeek-R1-Distill-Qwen-14B | Code-R1 / DeepCoder | GRPO | LiveCodeBench, HumanEval+, MBPP+ |
| Visual Reasoning | Qwen2.5-VL-7B | EasyR1 | GRPO | MathVista, MathVision |
| Deep Analysis | Qwen2.5-7B Base & Instruct / R1-Distill-Qwen-7B | VeRL | PPO, GRPO, Reinforce++, RLOO, ReMax, DAPO | Omni-Math-Rule, MATH500 |
Evaluation protocol
  • Metric: pass@k
    • Conventional sampling evaluations measure only average behavior; they ignore whether the model could solve a problem given enough attempts
    • Sample k outputs from the model; if at least one is correct, pass@k = 1, if all are wrong, pass@k = 0
    • → This tells us whether a problem is solvable by the model within k attempts
    • Averaged over a whole benchmark, pass@k is the fraction of problems the model can solve within k attempts
      → Reasoning coverage
    • Since k attempts in math could succeed just by guessing numbers, the authors manually inspected the CoTs
  • Sampling settings
    • Temperature = 0.6
    • Top-p = 0.95
    • max token generation length = 16,384 tokens
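The protocol above can be made concrete in a few lines; a minimal sketch (function names are mine) of per-problem pass@k and benchmark-level coverage:

```python
def pass_at_k(correct_flags, k):
    """pass@k for one problem: 1 if any of the first k sampled outputs is correct."""
    return 1.0 if any(correct_flags[:k]) else 0.0

def coverage(benchmark_flags, k):
    """Mean pass@k over the benchmark: fraction of problems solvable within k tries."""
    return sum(pass_at_k(flags, k) for flags in benchmark_flags) / len(benchmark_flags)
```

For example, with `[[False, False, True], [False, False, False]]` coverage is 0.0 at k=2 but 0.5 at k=3: coverage can only grow with k, which is why comparing models at large k reveals the boundary rather than average behavior.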

Evaluation Results

Math tasks

  • At small k, the RLVR model does better, but as k grows the coverage flips in favor of the base model!
  • RLVR did not teach new reasoning patterns; it reshaped the distribution so the model draws more often on patterns it already had!
  • Examples where the base model already does well (Figures 20, 21)
    • Reasoning of this length and quality from a mere base model..?

Code generation and visual reasoning tasks

  • Again, the RLVR model is superior at low k, but as k grows the base model's coverage becomes broader

Deep analysis

  • Why does the base model's coverage overtake? Why can't RLVR expand into new paths?
  • Accuracy distribution analysis
    • Overall, the RLVR model's generations have higher accuracy
    • But oddly, the fraction of problems with accuracy 0 is also higher for the RL model
  • Perplexity analysis
    • Reasoning generated by the RL model has low perplexity under the base model
      • So these were reasoning paths the base model already knew and could produce!
  • Distillation is different!
    • Compare DeepSeek-R1-Distill-Qwen-7B (distilled from DeepSeek) against Qwen2.5-Math-7B (base) and Qwen2.5-Math-7B-Oat-Zero (RL)
    • Even as k grows, the distilled model's coverage stays the highest → learning from a teacher does teach you things you didn't know!
  • Experiments across RL algorithms
    • With a different algorithm, or with deeper training steps, the outcome still ends up much the same
  • Is RL itself the problem? Can't this be fixed within RL?
    • Increasing training steps doesn't help
    • Increasing rollout sampling during training doesn't help
    • Adding a KL loss to stay close to the base model doesn't help
    • Since the RL model's entropy keeps dropping as training proceeds, try raising the temperature → doesn't help either
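The perplexity analysis above boils down to scoring RL-model generations with the base model's per-token log-probabilities; a minimal sketch of the metric itself (obtaining the log-probs from the base model is assumed to happen elsewhere):

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a sequence given per-token log-probs under the scoring model.

    Low perplexity of an RL-generated reasoning path scored by the *base* model
    means the base model already assigned that path high probability.
    """
    avg_nll = -sum(token_logprobs) / len(token_logprobs)  # mean negative log-likelihood
    return math.exp(avg_nll)
```

A path whose tokens the base model assigns probability 0.5 each gets perplexity 2; the paper's observation is that RL outputs land in the base model's low-perplexity region, i.e. inside paths it already knew.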

Categories

research