14 January 2026

S1: Simple Test-time Scaling


Review

Nickname | One-line review | Rating (0/5)
찰나 | I do wonder whether an LLM can really stick to a length. Still, telling it to "think for 100 sentences", like filling a page with rote writing, would make the reasoning steps easy to inspect. On the other hand, this might produce unnecessary reasoning, or push the model toward it. As a way to improve performance, I think it is a very good method! | 4.3
와사비꽃게랑 | The method is very simple and may not feel new to anyone familiar with CoT. But the point that how much you sample at test time, and when you cut it off, strongly affects performance seems widely usable in practice. | 3.7
메가커피 | Intuitively I would not expect performance to go up (if you did not study during exam season, does thinking harder during the exam raise your score?), so it is surprising that it does. | 3.7
요리괴물 | The academic contribution is not large, but the impact at the application stage seems huge. It is surprising that accuracy does not drop even when many "Wait" tokens are appended. Maybe because the problems are so hard to begin with? The citation count seems very high because everyone applies it in their experiments. | 4.0
새우깡 | I thought thinking for longer might cause confusion, but the positive effect seems to have been large, perhaps because the paper targets language models with reasoning ability above a certain level and difficult reasoning datasets. I am curious what the effect would be for language models that are only moderately capable. | 4
고구마맛도리 | Over the next 0.5 to 1 year, a lot of test-time scaling work will probably come out! After all, we have to get the best performance out of limited resources. That said, the method is so simple that it does not resonate with me! The findings in this paper probably will not be useful on other tasks. | 3.5
안성재 | A new approach to scaling is good, but intuitively the cost of inference at companies is probably far larger than the cost of LLM training. In that light, I question whether the impact is large. I am withholding judgment. | 3.3
스타벅스 | The idea of improving performance at the inference stage is fine in itself, but I doubt it is as effective as the paper claims. From that perspective, the methodology and problem definition do not resonate that much. | 3.8

TL; DR

💡

How can we improve performance at the inference stage rather than at the training stage?

⇒ For math and reasoning problems, at least, control the number of tokens

Summary

  • Authors
  • Citations: 819

Background & Motivation

  • What is test-time scaling?

    : Improving performance by adjusting the compute used at inference time (test time), especially the number of reasoning tokens, without increasing the model's parameter count or training data

  • LLMs have conventionally been developed through train-time scaling
    • More data, bigger models, more training steps, …
    • However, this incurs too much GPU and time cost

⇒ Let's try test-time scaling!

That is, keep the model fixed and raise performance during inference

  • OpenAI achieved performance gains through test-time scaling when developing the o1 model, but the method has not been disclosed
    • Prior work tried to reproduce it with MCTS and similar techniques, but did not succeed (high cost and data requirements)

⇒ Let's develop the simplest and most efficient test-time scaling method!

Contributions (What they’ve revealed)

  • Developed a method for building sample-efficient reasoning data (the s1K dataset) (Section 2)
    1. Collected 59,029 questions from 16 seed datasets, guided by the three criteria below
      • Seed data
        • Existing reasoning datasets such as NuminaMATH, AIME problems, OmniMath, SAT, and LSAT
        • Newly created data
          • s1-prob: the probability section of the Stanford University Statistics Department's PhD qualifying exams
          • s1-teasers: brain teasers commonly used in interviews for quantitative trading positions, restricted to difficulty level Hard

          ⇒ Reasoning traces and solutions were generated with the Google Gemini Flash Thinking API

      • Criteria
        1. Quality: Datasets should be high-quality
        2. Difficulty: Datasets should be challenging and require significant reasoning effort
        3. Diversity: Datasets should stem from various fields to cover different reasoning tasks
    2. Filtered down to only 1,000 samples using the same criteria (Quality, Difficulty, Diversity)
      • Why? To make the dataset as simple as possible
      • How to sample? Apply the three filters in order

        1) Quality

        1. Remove questions where an API error occurred
        2. Remove low-quality examples

          e.g. inconsistent question numbering, non-existent image references

        ⇒ 51,581 remain

        2) Difficulty

        Exclude every question that either Qwen2.5-7B-Instruct or Qwen2.5-32B-Instruct can answer correctly (removes questions that are too easy)

        ⇒ 24,496 remain

        3) Diversity

        1. Use Claude 3.5 Sonnet to classify each question by the American Mathematical Society's mathematics subject categories
          (e.g. geometry, biology, physics; 50 categories in total)
        2. Within each category, sample questions with long (= difficult) reasoning traces

        ⇒ 1,000 remain
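The three filtering stages above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Question` fields, the `BAD_PATTERNS` list, and the choice to weight sampling in proportion to trace length are all hypothetical stand-ins for "quality", "difficulty", and "prefer longer traces within each category".

```python
import random
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    domain: str          # hypothetical: category assigned by an LLM classifier
    trace_tokens: int    # hypothetical: length of the reasoning trace
    api_error: bool      # hypothetical: trace generation failed
    solved_by_7b: bool   # hypothetical: the smaller model answered correctly
    solved_by_32b: bool  # hypothetical: the larger model answered correctly


BAD_PATTERNS = ("[image]",)  # illustrative marker of a missing image reference


def quality_filter(pool):
    # Stage 1: drop API failures and questions with formatting red flags.
    return [q for q in pool
            if not q.api_error and not any(p in q.text for p in BAD_PATTERNS)]


def difficulty_filter(pool):
    # Stage 2: drop a question if either model already solves it.
    return [q for q in pool if not (q.solved_by_7b or q.solved_by_32b)]


def diversity_sample(pool, k, rng):
    # Stage 3: pick a domain uniformly, then a question from that domain
    # with probability proportional to its trace length (an assumption).
    by_domain = {}
    for q in pool:
        by_domain.setdefault(q.domain, []).append(q)
    chosen = []
    while len(chosen) < k and by_domain:
        domain = rng.choice(sorted(by_domain))
        candidates = by_domain[domain]
        weights = [max(q.trace_tokens, 1) for q in candidates]
        pick = rng.choices(candidates, weights=weights, k=1)[0]
        candidates.remove(pick)
        chosen.append(pick)
        if not candidates:
            del by_domain[domain]
    return chosen


def select(pool, k, seed=0):
    rng = random.Random(seed)
    return diversity_sample(difficulty_filter(quality_filter(pool)), k, rng)
```

Running the stages in this order means the cheap string checks shrink the pool before the more expensive model-based difficulty check would be needed.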

  • Developed a test-time scaling method
    • Of the two types of test-time scaling (sequential and parallel), the paper uses sequential scaling
      • Why? Intuitively, later computation can build on intermediate results, so it should be more efficient
      • An example of parallel scaling? Majority voting!
    • Budget forcing: a simple decoding-time intervention that constrains the maximum/minimum number of thinking tokens
      • Example
      • Applying budget forcing
        • Maximum token constraint

          When reasoning runs too long, append the end-of-thinking token delimiter to end the reasoning early

          ⇒ "Final Answer:" is appended at the end so that the model answers from its reasoning so far

        • Minimum token constraint

          When the model tries to answer after too little reasoning, feed it "Wait" as an extra signal, giving it a chance to review its output so far once more
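The two constraints above can be sketched as a small decoding loop. This is a hedged illustration only: `Generate`, `toy_generate`, the `END_THINK` string, and the `max_waits` parameter are hypothetical names introduced here, and a real implementation would work on token ids from an actual model rather than on strings.

```python
from typing import Callable, List

END_THINK = "<end_think>"  # hypothetical end-of-thinking delimiter
WAIT = "Wait"

# Hypothetical model interface: given the context so far and a cap on new
# tokens, return the newly generated tokens. A trailing END_THINK means the
# model wants to stop thinking.
Generate = Callable[[List[str], int], List[str]]


def budget_forced_thinking(
    generate: Generate,
    question: List[str],
    min_tokens: int,
    max_tokens: int,
    max_waits: int = 2,
) -> List[str]:
    thinking: List[str] = []
    waits = 0
    while len(thinking) < max_tokens:
        new = generate(question + thinking, max_tokens - len(thinking))
        if not new:
            break
        wants_to_stop = new[-1] == END_THINK
        if wants_to_stop:
            new = new[:-1]
        thinking.extend(new)
        if not wants_to_stop:
            continue
        # Minimum constraint: suppress the stop and append "Wait" instead.
        if len(thinking) < min_tokens and waits < max_waits:
            thinking.append(WAIT)
            waits += 1
            continue
        break
    # Maximum constraint: cut off and force the end of thinking.
    return thinking[:max_tokens] + [END_THINK, "Final Answer:"]


def toy_generate(context: List[str], max_new_tokens: int) -> List[str]:
    # Toy stand-in for a model: two steps, then it tries to stop.
    return ["step", "step", END_THINK][:max_new_tokens]


if __name__ == "__main__":
    print(budget_forced_thinking(toy_generate, ["q"], min_tokens=5, max_tokens=20))
```

Capping the number of "Wait" insertions with `max_waits` is a design choice of this sketch, so that a model that keeps trying to stop cannot be pushed on indefinitely.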

      • how to apply? (other ways to control the budget)
        • Conditional length-control methods: tell the model in the prompt how long it should generate
          • Token-conditional control: specify the maximum number of thinking tokens in the prompt
          • Step-conditional control: specify the maximum number of reasoning steps in the prompt
          • Class-conditional control: write prompts that tell the model to think for a short, medium, or long time
        • Rejection sampling: keep sampling until the generation fits the given budget
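Rejection sampling for a length budget can be sketched in a few lines. The names `rejection_sample`, `sample_fn`, and `max_tries` are hypothetical and exist only for this illustration; `sample_fn` stands in for one full generation from a model.

```python
from typing import Callable, List, Optional


def rejection_sample(
    sample_fn: Callable[[], List[str]],
    max_tokens: int,
    max_tries: int = 100,
) -> Optional[List[str]]:
    # Draw complete generations until one fits the token budget.
    for _ in range(max_tries):
        tokens = sample_fn()
        if len(tokens) <= max_tokens:
            return tokens
    return None  # nothing fit the budget within max_tries


def make_toy_sampler(lengths: List[int]) -> Callable[[], List[str]]:
    # Toy stand-in for a model: cycles through fixed generation lengths.
    state = {"i": 0}

    def sample() -> List[str]:
        n = lengths[state["i"] % len(lengths)]
        state["i"] += 1
        return ["tok"] * n

    return sample
```

Because whole generations are accepted or rejected, a small budget keeps only those generations that happened to be short from the start.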
  • Developed s1-32B
    • Setup
      • Fine-tune Qwen2.5-32B-Instruct on the s1K data ⇒ obtain s1-32B
      • Evaluation data
        • AIME24: problems from the American Invitational Mathematics Examination held from January 31 to February 1, 2024
        • MATH500: a benchmark of competition math problems of varying difficulty
        • GPQA Diamond: PhD-level science questions in biology, chemistry, and physics
      • Metrics used

        The same data is evaluated several times under different compute budgets

        • Control: the fraction of all runs whose compute falls inside the targeted minimum/maximum range
        • Scaling: how much accuracy increases as compute increases (average slope)
        • Performance: the maximum performance the method achieves
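The three metrics above can be computed from a handful of (compute, accuracy) measurements. This is a sketch under assumptions: the function names are hypothetical, and "average slope" is read here as the mean slope over all pairs of evaluation points, which requires distinct compute values.

```python
from itertools import combinations
from typing import List, Tuple


def control(actual_compute: List[float], lo: float, hi: float) -> float:
    # Fraction of runs whose measured compute lands inside [lo, hi].
    return sum(lo <= a <= hi for a in actual_compute) / len(actual_compute)


def scaling(points: List[Tuple[float, float]]) -> float:
    # Mean slope over all pairs of (compute, accuracy) points.
    # Assumes at least two points with distinct compute values.
    pts = sorted(points)
    slopes = [(fb - fa) / (b - a) for (a, fa), (b, fb) in combinations(pts, 2)]
    return sum(slopes) / len(slopes)


def performance(points: List[Tuple[float, float]]) -> float:
    # Best accuracy reached at any compute budget.
    return max(acc for _, acc in points)
```

A positive `scaling` value means accuracy tends to rise with more thinking tokens, while `control` near 1.0 means the method reliably hits the requested budget.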
    • Results
      • Performance as test-time compute (number of tokens) increases

        ⇒ More detailed results!

        • Performance rises with the number of tokens, but saturates at around 6x
          • Suppressing the end-of-thinking token delimiter too often sends the model into repetitive loops
      • Comparison with parallel scaling (majority voting)
        • No matter how far its test-time compute is scaled, it cannot catch up with the proposed method
      • Comparison with other models (e.g. QWEN r1)
        • The proposed scaling is the most efficient!!
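For reference, the parallel baseline mentioned above (majority voting) amounts to sampling several independent answers and returning the most frequent one. A minimal sketch, with the function name `majority_vote` chosen here for illustration:

```python
from collections import Counter
from typing import List


def majority_vote(answers: List[str]) -> str:
    # Return the most frequent final answer among independent samples.
    # Ties go to the answer that appeared first.
    return Counter(answers).most_common(1)[0][0]
```

Each sample here is generated independently, so none of them can build on another's intermediate reasoning, which is the contrast with sequential scaling drawn in the notes.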
    • Ablation studies
      • Data ablation: what happens when quality, difficulty, and diversity are not taken into account?
        • 1K-random: reasoning traces still come from Gemini, but the questions themselves are sampled at random
        • 1K-diverse: ignores difficulty and samples at random within each category
        • 1K-longest: considers difficulty only
        • 59K-full: uses the entire dataset
      • Which test-time compute control method works best? ⇒ Budget forcing is the best!
        • Rejection sampling: performance actually drops as the length increases
          • In other words, generations that are short from the start tend to give more accurate answers!
  • Limitations
    • Limits of test-time scaling via budget forcing (performance eventually saturates!)
    • Limited task coverage: the focus is on math, physics, and similar problems, so other tasks such as creative writing still need to be studied

Categories

research