26 March 2026

Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning

💡 For mathematical reasoning tasks, implement RL indirectly and solve them simply (= solve math problems effectively in a reinforcement-learning setup!)


Review

Nickname | Strengths, Weaknesses & Suggestions | Rating (0/5)

노노
• Strengths: compensates for the sparse reward by using both positive and negative trajectories; gains robustness by weighting the reward with per-token importance.
• Weaknesses & suggestions: lacks experiments applying the method to LLM families other than Qwen (e.g., models specialized for NR, like the DeepSeek prober).
Rating: 3.8

아이리스
• Strengths: a solid technical paper. I'm personally very interested in the keyword "trajectory", and I think this paper applies it well to reinforcement learning.
• Weaknesses: unconvinced about both the efficiency and the performance.
• Suggestions: as with RL in general, it would also be interesting to test whether the method lets the model solve problems it previously could not. It's a direction I've been considering myself, so I'm curious. There was related prior work; I'll note it here if I remember it.
Rating: 3.8

핸드크림
• Strengths: obtains both positive and negative answers for each question via sampling and uses both for training; uses an intermediate reward.
• Weaknesses: unclear whether the quality of the sampled answers is guaranteed, or whether per-token contribution makes a good intermediate reward.
• Suggestions: add a step that guarantees training-data quality.
Rating: 3.5

3월
• Strengths: sets up an environment well suited to mathematical reasoning. I think this is the domain where intermediate reasoning trajectories fit best!
• Weaknesses: what if the reward model is wrong? The model could learn that unimportant tokens are important.
• Suggestions: instead of giving every correct answer the same reward, grade the reward by problem difficulty, taking uncertainty into account.
Rating: 3.5

화이트노이즈
• Strengths: the motivation that the 0/1 reward is sparse is distinctive, and the trajectory-based approach actually delivers good performance.
• Weaknesses: in BoN sampling, the computational cost seems to grow heavily with N.
• Suggestions: curious whether the methodology would transfer to datasets other than math.
Rating: 3.1

피즈치자
• Strengths: squarely faces the practical problem that process rewards are hard to construct, and the attempt to probe the ceiling reachable with outcome reward alone is valuable.
• Weaknesses: it's not entirely pure final-answer supervision, though. Because it relies on trajectory-level usage and token-level rewards, it would be hard to apply as-is when enough trajectories cannot be secured.
• Suggestions: curious how the results change with trajectory quality in trajectory selection, not just the positive/negative split.
Rating: 3.9

에너지
• Strengths: addresses reward sparsity by distributing reward over the reasoning process, letting the LLM learn (process + outcome)-based reasoning rather than outcome-only reasoning.
• Weaknesses: when a trajectory has many tokens (long solutions, hard problems), the reward seems likely to be spread nearly uniformly; can the model still learn to reason well in that case?
• Suggestions: further methods could be proposed to make reward distribution more efficient, e.g., adjusting the initial data setup (the trajectory distribution).
Rating: 3.4

제로콜라
• Strengths: points out that a correct/incorrect signal on the final answer alone cannot evaluate the whole solution process; the idea of splitting the reward across tokens is intuitively convincing.
• Weaknesses: as solutions get longer there are more tokens, and with the reward spread across them, the per-token signal may become too weak to tell which parts are essential.
• Suggestions: the experiments focus only on mathematical reasoning, so I'm curious whether the same approach works elsewhere; experiments applying it to various domains would be a good addition.
Rating: 3.4

창백카츄
• Strengths: implements the RL mechanism efficiently.
• Weaknesses: the research feels somewhat done in reverse; the rationale of the methodology is unclear, so it doesn't come across as very impactful.
• Suggestions: it might be better to raise the feedback strength only at the point where the solution goes wrong.
Rating: 2.5

오차
• Strengths: makes problem solving more efficient by implementing RL indirectly for reasoning.
• Weaknesses: hard to see what the significance is.
• Suggestions: the reasoning method could be changed so that the reward/feedback scheme becomes more efficient.
Rating: 3.4

Author

Citations: 42

TL;DR

💡

For mathematical reasoning tasks, implement RL indirectly and solve them simply.

(= solve math problems effectively in a reinforcement-learning setup!)

Summary

Introduction & Background & Motivation

Introduction & Background

Recent LLMs have become strong at reasoning,

largely because they adopt RL (reinforcement learning) combined with CoT (long chains of thought).

However, approaching mathematical reasoning with the standard RL recipe runs into the sparse reward problem.

Also, evaluating the reasoning at every RL step is very inefficient in terms of annotation effort.

  • example
    โ“

    Q ) 1 + 3 x 2 + 5= ?

    = 1 + 6 + 5 (reasoning)

    = 7 + 5 (reasoning)

    = 12 (reasoning)

    step๋งˆ๋‹ค ๊ณ„์† ํ‰๊ฐ€ํ•˜๋Š” ๊ฒƒ์€ ๋น„ํšจ์œจ์ .

    โ‡’ ๋”ฐ๋ผ์„œ ๊ฒฐ๊ณผ๊ฐ’์— ๋Œ€ํ•ด์„œ๋งŒ reward๋ฅผ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒƒ์ด ํšจ์œจ์ ์ผ ๊ฒƒ ๊ฐ™์ง€๋งŒ,

    ๋งŽ์€ ์ถ”๋ก ๊ณผ์ •์„ ์Šคํ‚ตํ•˜๊ณ  ๊ฒฐ๊ณผ์— ๋Œ€ํ•ด์„œ๋งŒ ํ‰๊ฐ€ํ•˜๋Š” ๊ฒƒ์€ Sparse ํ•˜๋‹ค.


When an LLM is used for mathematical reasoning,

the prompt induces the LLM policy to output a multi-step reasoning process composed of many tokens.

In the standard RL recipe, the LLM policy samples (generates) several reasoning trajectories (solution processes),

and the policy is optimized with binary feedback (reward 1 for a correct final answer, 0 for an incorrect one) based only on final-answer correctness.
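As a minimal sketch of this baseline setup (the `policy.generate` sampler and the answer-extraction heuristic below are hypothetical stand-ins, not the paper's code):

```python
import re

def extract_answer(trajectory: str) -> str:
    """Toy heuristic: take the last number in the text as the final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", trajectory)
    return numbers[-1] if numbers else ""

def outcome_reward(trajectory: str, gold_answer: str) -> float:
    """Binary outcome reward: 1 if the final answer matches exactly, else 0."""
    return 1.0 if extract_answer(trajectory) == gold_answer else 0.0

def collect_rollouts(policy, question: str, gold_answer: str, n: int = 16):
    """Sample n trajectories and attach the sparse 0/1 reward to each."""
    trajectories = [policy.generate(question) for _ in range(n)]
    return [(s, outcome_reward(s, gold_answer)) for s in trajectories]
```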


Symbol

Markov decision process (MDP) = (S, A, P, r, γ)

S: the solution written so far / A: the next token / P: the rule for moving to the next state /

r: reward / γ: discount factor

policy = the current LLM's generation policy (which token it generates with what probability)

trajectory = the reasoning process (= the solution process)

⇒ positive trajectory (correct solution) / negative trajectory (incorrect solution)


Policy optimization objective

$$J(\theta) \triangleq \mathbb{E}_{s \sim \rho_0,\, a \sim \pi_\theta(\cdot \mid s)}\left[Q^{\pi_\theta}(s, a)\right] - \alpha \cdot \mathbb{E}_{s \sim \rho_0}\left[D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid s) \,\|\, \pi_0(\cdot \mid s)\big)\right]$$

= maximize the reward while not drifting too far from the original model.

Solving the objective above,

$$\pi^*(a \mid s) = \frac{\pi_0(a \mid s)\, \exp\!\big(Q^\pi(s, a)/\alpha\big)}{Z(s)}$$

the form of the optimal policy emerges (= original probability × exp(reward-based weight)).
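As a quick supporting step (a standard KL-regularized derivation sketch; λ below is the Lagrange multiplier for normalization, not a symbol from the paper):

```latex
\[
\max_{\pi}\; \sum_a \pi(a \mid s)\, Q(s,a)
  \;-\; \alpha \sum_a \pi(a \mid s) \log\frac{\pi(a \mid s)}{\pi_0(a \mid s)}
\quad \text{s.t.} \quad \sum_a \pi(a \mid s) = 1
\]
\[
Q(s,a) - \alpha\!\left(\log\frac{\pi(a \mid s)}{\pi_0(a \mid s)} + 1\right) - \lambda = 0
\;\Rightarrow\;
\pi^*(a \mid s) = \frac{\pi_0(a \mid s)\, e^{Q(s,a)/\alpha}}{Z(s)}
\]
```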


Best of N sampling

Have the model solve one problem N times and pick the single answer with the best reasoning (solution):

$$a^* = \operatorname{arg\,max}_{a^{(i)}}\; Q(s, a^{(i)})$$
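A minimal BoN sketch (reusing `outcome_reward` from the earlier sketch as a stand-in for the scorer Q):

```python
def best_of_n(policy, question: str, gold_answer: str, n: int = 16) -> str:
    """Sample n trajectories and return the one with the highest score."""
    candidates = [policy.generate(question) for _ in range(n)]
    return max(candidates, key=lambda s: outcome_reward(s, gold_answer))
```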


Reward structure

Reward is assigned only to the outcome of the final answer: 1 for correct, 0 for incorrect.

Problem)

  • Since reward is considered only for the outcome, it is sparse.

    (Including the actual reasoning process, there are many tokens, so considering reward just once is odd.)

  • A solution counts as long as the final answer is right, even if the reasoning is flawed.

    (This way, incorrect solution methods get learned.)


Motivation

💡

The reward for the computed result is a 0/1 value, and depending on it alone is too sparse.

(when positive trajectories are sparse, there is no gradient)

⇒ The model should not be updated from the outcome reward alone.


Contribution
  • In the standard setup, negative trajectories dominated training, so the method corrects the balance so that positive trajectories surface more often:
    • Solving problems requires the correct solution process to be learned sufficiently.
    • BoN sampling is used so that at least one positive trajectory is always drawn.
    • An additional correction is applied so negative trajectories can be learned from as well.
  • The outcome dependence of the reward is decomposed into per-token dependence along the trajectory:
    • If the solution-process tokens appearing in a trajectory are weighted, the model can learn which solution steps matter.
    • Token-level weights are trained to match the outcome reward: 1 (correct), 0 (incorrect).
  • The policy is updated based on the learned weights.
  • Full pipeline
Method
  • Learning from positive sample


    $$\pi_{\text{BoN}}(s) = n \cdot [P(s)]^{n-1} \cdot \pi(s)$$ : the probability that positive trajectory s is selected.

    The larger the sample count n, the more likely at least one positive trajectory is drawn.

    Why? The paper points out reward sparsity under the premise that sampling mostly yields wrong answers, so BoN is used so that at least one positive trajectory can be selected.

    e.g.)

    • Standard RL: look at the problem and compute once → wrong → update (little information about the correct answer, hence sparse reward)
    • BoN: generate 10 solutions for the problem → sample the correct one from the generated results


    $$\mathbf{KL}(\pi_{\text{BoN}} \,\|\, \pi) = \log n - \frac{n-1}{n}$$ : constraint



    $$n(\epsilon) = \operatorname{arg\,min}_{n}\; \mathbb{E}_{s \sim \pi_{\text{BoN}}}\left[-R(s)\right]$$ : choosing how many BoN samples to draw

    ⇒ the process of picking the optimal n
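To build intuition for the n trade-off, a toy illustration (my own, not the paper's exact selection rule): with per-sample success probability p, n samples contain at least one positive trajectory with probability 1 − (1 − p)^n, while BoN's KL cost grows like log n − (n − 1)/n.

```python
import math

def bon_kl(n: int) -> float:
    """KL(pi_BoN || pi) for best-of-n selection."""
    return math.log(n) - (n - 1) / n

def smallest_n(p: float, target: float = 0.99) -> int:
    """Smallest n such that P(at least one positive sample) >= target."""
    n = 1
    while 1 - (1 - p) ** n < target:
        n += 1
    return n

# e.g., smallest_n(0.1) == 44, while bon_kl(44) ~ 2.81:
# each extra sample buys success probability at a growing KL cost.
```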



    $$\mathcal{L}_1(\theta) = \underbrace{\mathbb{E}_{s \sim \mathcal{D}^+}\left[-\log \pi_\theta(s)\right]}_{\text{Positive example alignment}} + \underbrace{\beta\, \mathbf{KL}(\pi_\theta \,\|\, \pi_{\text{old}})}_{\text{Policy constraint}}$$

    Encourages generating correct trajectories well, while not drifting too far from the old policy.

    💡

    A loss that updates the policy to follow the positive distribution induced by $\pi_{\text{BoN}}$.
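A minimal PyTorch sketch of this loss under my reading of the formula (not the paper's code; `logits` are per-token distributions from the current policy, `old_logits` from the frozen old policy, for one positive trajectory):

```python
import torch.nn.functional as F

def l1_loss(logits, old_logits, tokens, beta: float = 0.01):
    """NLL on a positive trajectory plus a KL penalty to the old policy."""
    nll = F.cross_entropy(logits, tokens)        # E[-log pi_theta(s)]
    kl = F.kl_div(                               # KL(pi_theta || pi_old)
        old_logits.log_softmax(-1),              # input: log pi_old
        logits.log_softmax(-1),                  # target: log pi_theta
        log_target=True, reduction="batchmean",
    )
    return nll + beta * kl
```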

  • Learning from negative sample

    $$\pi_{\text{BoN}}(s) = \pi(s)\left[R(s) \cdot \frac{1-(1-p)^n}{p} + (1-R(s)) \cdot (1-p)^{n-1}\right]$$

    = the probability of positive trajectories is amplified, while that of negative trajectories is suppressed.

    ⇒ Because the negative term carries an extra power of (1 − p), a gradient imbalance exists between positive and negative samples.



    $$\mathcal{L}_2(\theta) = \mathbb{E}_{s \sim \mathcal{S}^-}\left[F(1-p) \cdot \log \frac{\pi_\theta(s)}{\pi_{\text{old}}(s)}\right] + \beta\, \mathbf{KL}(\pi_\theta \,\|\, \pi_{\text{old}})$$

    💡

    A loss that updates the policy after correcting the BoN-induced positive distribution for negatives.
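A corresponding sketch (hedged: the exact functional form of the shaping coefficient F follows the paper's derivation; here it is an injected callable, and `log_pi_theta` / `log_pi_old` are trajectory-level log-probabilities for a batch of negative trajectories):

```python
def l2_loss(log_pi_theta, log_pi_old, p, shaping_fn, kl_term, beta=0.01):
    """Reward-shaped update on negative trajectories plus a KL penalty."""
    log_ratio = log_pi_theta - log_pi_old    # log pi_theta(s) / pi_old(s)
    coeff = shaping_fn(1.0 - p)              # F(1 - p), p = success rate
    return (coeff * log_ratio).mean() + beta * kl_term
```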

  • Dealing with Long Reasoning Chains

    ์ •๋‹ต์€ ๋งˆ์ง€๋ง‰์— ์•Œ ์ˆ˜ ์žˆ์ง€๋งŒ, ์ค‘๊ฐ„ ์ถ”๋ก  ๊ณผ์ •๋„ ๊ณ ๋ คํ•ด์„œ ํ•™์Šตํ•ด์•ผ ํ•œ๋‹ค.

    โ‡’ token ๋ณ„ ์ค‘์š”๋„๋ฅผ ์ถ”์ • (Reward๋ฅผ ์ถ”๋ก  ๊ณผ์ •์— ๋ถ„๋ฐฐ)


    $$Q^\pi(s_{<t}, \pi(s_t)) = V^\pi(s_{\leq t}) = \sum_{k=0}^{T-t} \gamma^k\, r(s_{t+k} \mid s_{<t})$$

    : viewing the action value Q as the state value V — the reward score to be collected from state t onward.



    $$A(s_{\leq t}) = V^\pi(s_{\leq t+1}) - V^\pi(s_{\leq t})$$

    : measures how much adding one more token changed the outcome.

    : when the same question produces both a correct and an incorrect answer, the per-token contribution differences of the two reasoning processes can be computed.

    In other words, the reward difference between a correct and an incorrect trajectory is the sum of the per-token contribution differences.
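This follows because the advantages telescope (a quick check, taking γ = 1):

```latex
\[
\sum_{t=0}^{T-1} A(s_{\leq t})
  = \sum_{t=0}^{T-1} \left( V^\pi(s_{\leq t+1}) - V^\pi(s_{\leq t}) \right)
  = V^\pi(s_{\leq T}) - V^\pi(s_{\leq 0}),
\]
```

so differencing the two trajectories' totals leaves exactly their final-value (reward) gap.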


    💡 Since the reward difference is expressed as the sum of per-token contribution differences, let's represent the trajectory reward itself as a sum over tokens from the start.

    (the difference of reward totals = the sum of per-token reward differences; the reward total = the sum of per-token rewards)

    • So the reward can be expressed as the total of token contributions:

      $$r^*(s) \triangleq \sum_{t=0}^{T} \gamma^t A(s_{\leq t})$$

      : the trajectory reward expressed as the sum of per-token contributions.

      $$\frac{1}{T} \sum_{t=0}^{T} w(s_{\leq t}) = r(s)$$ : the token contributions average to the trajectory reward.

      $w(s_{\leq t})$ : the token contribution predicted by the model.


    • ์ตœ์ข… Loss

      = (L1 Loss + L2 Loss) ์™€ (๊ฐ trajectory์˜ ํ† ํฐ๋ณ„ ๊ฐ€์ค‘์น˜)๋ฅผ ๊ฒฐํ•ฉํ•œ ํ˜•ํƒœ์ž„.

      • ์ •๋‹ต์„ ๋” ๋งŽ์ด ์ƒ์„ฑํ•˜๋˜, ์ค‘์š”ํ•œ token์— ๊ฐ€์ค‘์น˜๋ฅผ ๋ถ€์—ฌ
      • ์˜ค๋‹ต์ด ๋‚˜์˜ฌ ํ™•๋ฅ ์„ ๋ณด์ •ํ•˜๋ฉด์„œ, ํ‹€๋ฆฐ token์— ๊ฐ•ํ•œ ํŽ˜๋„ํ‹ฐ
      • KL constraint๋กœ policy ์ดํƒˆ ๋ฐฉ์ง€
      ๐Ÿ’ก

      ํ† ํฐ์˜ ์ค‘์š”๋„๋ฅผ ๋ฐ˜์˜ํ•ด์„œ, positive trajectory๋Š” ์ž˜ ์ƒ์„ฑํ•˜๊ณ , negative trajectory๋Š” ๋œ ์ƒ์„ฑํ•˜๋„๋ก plicy๋ฅผ ํ•™์Šตํ•จ. (policy update)


Implementation

Qwen2.5-7B and Qwen2.5-32B are used as the initial policy models, initialized via RFT.

Questions from the OpenDataLab dataset, Numina, and the MATH training set are fed into the initial model, and the generated answers are rewarded via exact match against the ground truth (1 correct, 0 incorrect).

The resulting (question, reward) pairs are then used to initialize the RFT model.


Datasets: Numina, MATH, AMC/AIME

  • ์œ„ ๋ฐ์ดํ„ฐ๋“ค์˜ ๊ฐ ๋ฌธ์ œ์— ๋Œ€ํ•ด RHF ๋ชจ๋ธ๋กœ 64๊ฐœ์˜ ๋ฐฐ์น˜(์งˆ๋ฌธ)์— ๋Œ€ํ•ด 16๊ฐœ์˜ trajectory(ํ’€์ด)๋ฅผ sampling. = (1024๊ฐœ์˜ trajectory)
  • ๊ฐ trajectory๋ฅผ Qwen2.5-72B-instruct์™€ rule-based-verifier๋ฅผ ํ†ตํ•ด

    ์ •๋‹ต(reward)๋ฅผ ๋งค๊น€ (์ •๋‹ต์ธ ์ถ”๋ก ์€ 1, ์˜ค๋‹ต์€ 0).

  • ๊ทธ๋ฆฌ๊ณ , ์ด ์ •๋‹ต๋ฅ ์ด 0~0.8 ์‚ฌ์ด์ธ ๋ฌธ์ œ๋งŒ ์‚ฌ์šฉํ•จ. (ํ•„ํ„ฐ๋ง)
  • ํ•„ํ„ฐ๋ง๋œ ๋ฌธ์ œ์˜ trajectory์— ๋Œ€ํ•ด์„œ, positive, negative pair๋ฅผ ์„ ํƒ
  • ์„ ํƒ๋œ pair๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ token๋ณ„ ๊ฐ€์ค‘์น˜๋ฅผ ํ•™์Šตํ•จ.


    $$\mathcal{L}_{\text{CE}} = -\,\mathbb{E}_{(s,r) \sim \mathcal{D}}\left[r \log p(s) + (1-r)\log(1-p(s))\right]$$

    where $p(s) = \sigma\!\left(\frac{1}{T} \sum_{t=0}^{T} w(s_{\leq t})\right)$.

    Among the 16 trajectories, comparing the weights of (positive, negative) pairs cancels out the shared parts and trains on the parts that differ.

    (token-level weights are learned as many trajectories are combined)

    ⇒ Which reasoning patterns lead to correct answers?
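A minimal sketch of this credit-model objective (assumed PyTorch, not the paper's code): the trajectory-level prediction is the sigmoid of the mean token weight, trained against the 0/1 outcome reward.

```python
import torch
import torch.nn.functional as F

def credit_loss(token_w: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """token_w: (batch, T) per-token weights; r: (batch,) 0/1 float rewards."""
    logits = token_w.mean(dim=-1)   # (1/T) * sum_t w(s_<=t); sigmoid inside BCE
    return F.binary_cross_entropy_with_logits(logits, r)
```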

  • Hyperparameter
    • Learning Rate = Policy(5e-7), reward(2e-6)
    • Warmup(10 step warmup)
    • Cosine Annealing
    • Optimizer : AdamW
    • KL coefficient : ฮฒ=0.01
    • ์ด 80์Šคํ… training์„ ํ•˜๊ณ , 10 step๋งˆ๋‹ค ํ‰๊ฐ€ ์ง„ํ–‰

      (1์Šคํ…๋งˆ๋‹ค policy์™€ weight๋ฅผ update)

    • ๋” ๋ณต์žกํ•œ ์ˆ˜ํ•™๋ฌธ์ œ(์‚ผ๊ฐํ•จ์ˆ˜, ํ™•๋ฅ  ํ†ต๊ณ„, ๊ธ‰์ˆ˜) ๊ฐ™์€ ๊ฒฝ์šฐ์—๋Š” ๊ฐ™์€ ์Šคํ‚ฌ์˜ ๋ฌธ์ œ๋ฅผ ๋” ์ˆ˜์ง‘ํ•˜์—ฌ RFT ๋‹จ๊ณ„์—์„œ ์žฌํ•™์Šต.

Experiments & Results
  • ํ‰๊ฐ€ ๋ฐ์ดํ„ฐ์…‹ : MATH-500, AIME2024, AIME2025 (Part1), LiveMathBench, OlympiadBench (์ˆ˜ํ•™ ๋ฌธ์ œ ๋ฐ์ดํ„ฐ์…‹)
  • OREAL-7B ๋ชจ๋ธ์ด RL ๋งŒ์œผ๋กœ ์ข‹์€ ์„ฑ๋Šฅ์„ ๋ƒ„. (์ž‘์€ ๋ชจ๋ธ์ž„์—๋„ ์ข‹์€ ์„ฑ๋Šฅ. Distillation์„ ์‚ฌ์šฉ์•ˆํ•จ)
  • ๊ธฐ์กด ์ตœ๊ณ  ๋ชจ๋ธ์ด์—ˆ๋˜ DeepSeek-R1-Distill-Qwen์— ์ ์šฉ์‹œ ์„ฑ๋Šฅ ํ–ฅ์ƒ
  • AIME ๋ฐ์ดํ„ฐ์…‹์—์„œ๋Š” ๋‚ฎ์€ ์„ฑ๋Šฅ์„ ๋ณด์ด์ง€๋งŒ, ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์˜ ํ’ˆ์งˆ ๋ฐ ์งˆ๋ฌธ์˜ ๋‚œ์ด๋„๊ฐ€ ์›์ธ์ด๋ผ๊ณ  ํŒ๋‹จ

7B ๋ชจ๋ธ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ๋ชจ๋“ˆ์„ ์ ์ง„์ ์œผ๋กœ ์ถ”๊ฐ€์‹œํ‚ค๋ฉด์„œ MATH-500์— ๋Œ€ํ•ด ์„ฑ๋Šฅ ํ‰๊ฐ€

  • Reward Shaping = L2
  • Behavior Cloning = L1
  • Importance Shaping = L_total

7B ๋ชจ๋ธ์—์„œ ๊ฐ ๋ชจ๋“ˆ์„ ์ถ”๊ฐ€ํ•จ์œผ๋กœ์จ ๊ธฐ์กด RL Baseline ์„ฑ๋Šฅ์„ ์™„ํ™”ํ•  ์ˆ˜ ์žˆ์Œ.

์ตœ์ข…์ ์œผ๋กœ๋Š” Importance Sampling์ด ๊ฐ€์žฅ ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ž„์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ.


  • ์ข‹์€ ์ดˆ๊ธฐ ๋ชจ๋ธ(Policy)์„ ์‚ฌ์šฉํ• ์ˆ˜๋ก ์ตœ์ข… ์„ฑ๋Šฅ์ด ๋†’์Œ์„ ํ™•์ธ

    โ‡’ OREAL ํ”„๋ ˆ์ž„์›Œํฌ๋Š” ์„ฑ๋Šฅ์„ ์˜ฌ๋ฆฌ๋Š” ์—ญํ• ์„ ํ•˜๋Š” ๊ฒƒ์„ ๋ณผ ์ˆ˜ ์žˆ๊ณ , ์ข‹์€ ์ดˆ๊ธฐ ๋ชจ๋ธ์ผ์ˆ˜๋ก ์„ฑ๋Šฅ ํ–ฅ์ƒ์ด ๋†’์Œ.


  • conclusion
    • OREAL ํ”„๋ ˆ์ž„์›Œํฌ๋Š” BoN ์ƒ˜ํ”Œ๋ง, / ํ† ํฐ๋ณ„ ๊ธฐ์—ฌ๋„ ํ•™์Šต ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•ด mathematical reasoning ์— ๋Œ€ํ•ด ์ž˜ ํ•  ์ˆ˜ ์žˆ์Œ.
    • ํ•˜์ง€๋งŒ, ์ด ์ ‘๊ทผ๋ฒ•๋“ค์€ ์ดˆ๊ธฐ policy model(base model)์ด ์ถฉ๋ถ„ํ•œ knowledge๋ฅผ ๊ฐ–๊ณ  ์žˆ๋‹ค๋Š” ์ „์ œ์— ์˜์กดํ•จ.

    โ‡’ Future work๋กœ data construction process๋ฅผ ์–ธ๊ธ‰ํ•˜๋ฉฐ, ๋ถ€์กฑํ•œ ๋ถ€๋ถ„์„ ๊ฐœ์„ .

    RL์„ ๊ฐ„์ ‘์ ์œผ๋กœ ๊ตฌํ˜„ํ–ˆ๋‹ค !

    = ์ง์ ‘์ ์œผ๋กœ RL์„ ์“ด ๊ฑด ์•„๋‹ˆ์ง€๋งŒ, Reward๋ฅผ ์ถ”๋ก  ๊ณผ์ •์— ๊ฐ€์ค‘์น˜ ํ˜•ํƒœ๋กœ ๋ถ„๋ฐฐํ•ด์„œ ๊ฐ„์ ‘์ ์ด๋ผ๋Š” ํ‘œํ˜„์„ ์‚ฌ์šฉํ•œ ๊ฒƒ.

Categories

CoT Mathematical Reasoning RL research