07 January 2026

EvoLM: In Search of Lost Language Model Training Dynamics

Review

Nickname · One-line review · Rating (out of 5)

마스킹테이프 (4.2): Seems similar to another paper introduced this week. The idea that when and on which data you train matters has already appeared in curriculum learning and elsewhere, but research that verifies it theoretically and analytically is still helpful. Data is not unlimited, and good data does not exist in large quantities, so the next challenge is figuring out how to train well on the data we already have; this paper seems like a useful branch of work in that direction.
귤 (4): The research trend seems to be shifting from simply pushing model performance higher to reaching the same performance more efficiently with fewer resources. Worth thinking about with a focus on how to design the training process itself to be efficient.
동까스 (3.9): A paper worth consulting for its analysis of how scaling affects each method (pre-training, CPT, SFT, RL). Is finding the optimal scaling recipe this year's big trend?
수면장애 (4): The motivation alone reads like a list of obvious statements, but looking at the experiments and contributions, this is research someone really should have done earlier (around when the LLM boom started), and it feels genuinely NeurIPS-worthy!
이어폰 (3.9): Writing a paper centered on analysis experiments like these clearly takes a huge number of runs. It lays out the characteristics of each training stage and many performance-boosting settings, so it looks practically useful for model fine-tuning experiments.
사과 (3.7): The motivation itself points to an obvious conclusion, but the paper's value is in scaling while varying the conditions one at a time. The practice of training up to the saturation or peak point to reach best performance, for the sake of efficient experimentation, is a very useful reference.
7일위 (4.3): Where the temporal-dependence paper examined when data is learned within a single pipeline, this paper compares the pipelines themselves. It offers experimental insights such as: SFT is mainly about domain adaptation, while RL is worth using when output stability matters.

TL;DR

💡 For a language model's performance, the stage at which, the way, and the time when it is trained matter more than how large a dataset it was trained on for how long; CPT (Continued Pre-Training) determines the performance of the subsequent supervised fine-tuning and reinforcement learning.

Summary

Research team: Harvard, Stanford, CMU, and EPFL.

Motivation

  • ํ˜„์žฌ์˜ ์–ธ์–ด ๋ชจ๋ธ(Lauguage Model)์˜ ํ•™์Šต(Training) ๊ณผ์ •์€ ์—ฌ๋Ÿฌ ๋‹จ๊ณ„๋กœ ๋‚˜๋ˆ„์–ด์ ธ ์žˆ์–ด ๊ฐ๊ฐ์˜ ๋‹จ๊ณ„์—์„œ์˜ ์˜ํ–ฅ์„ ์•Œ๊ธฐ๊ฐ€ ์–ด๋ ค์›€.
    • Supervised Fine-tuning(SFT)์™€ Reinforcement Learning์ด ์–ฝํžˆ๋ฉด ๋”์šฑ ๊ฒฐ๊ณผ๊ฐ€ ๋ณต์žกํ•ด์ง.
  • ๋ชจ๋ธ์˜ ์–ธ์–ด ์ƒ์„ฑ ์ž์ฒด์˜ ๋Šฅ๋ ฅ๊ณผ Problem-Solving ๋Šฅ๋ ฅ์€ ๋ณ„๊ฐœ์˜ ๋ฌธ์ œ๋กœ์„œ, downstream performance improvement๊ฐ€ ๋ถ€๋“œ๋Ÿฝ์ง€ ์•Š์Œ.
    • ๊ณผ๋„ํ•œ Pre-training๊ณผ Post-training์„ ์กฐ์ •ํ•˜๊ณ , Continued Pre-training์„ ํ†ตํ•ด forgetting์„ ์ œ์–ดํ•  ์ˆ˜ ์žˆ์–ด์•ผ ํ•œ๋‹ค๋Š” ๊ฒƒ์ด ์š”์ .
  • ํˆฌ๋ช…ํ•˜์ง€ ์•Š์€ ์ฒดํฌํฌ์ธํŠธ, ๋ชจ๋ธ ์กฐ๊ฑด์œผ๋กœ ๊ณต์ •ํ•œ ๋น„๊ต ์•ˆ๋จ.
    • ๊ธฐ์กด์˜ ๋ชจ๋ธ Training์—์„œ Post-training ์—ฐ๊ตฌ๋ฅผ ์ง„ํ–‰ํ•  ๋•Œ, ๋ชจ๋ธ ํฌ๊ธฐ, Pre-training ๋ฐ์ดํ„ฐ ํฌ๊ธฐ, ๋ฐ์ดํ„ฐ ๊ตฌ์„ฑ์š”์†Œ๋ฅผ ์—„๊ฒฉํ•˜๊ฒŒ ํ†ต์ œํ•˜์ง€ ์•Š๋Š” ๋ฌธ์ œ
    • Incomplete learning rate decay๋กœ ์ธํ•ด ์ตœ์ ์ด ์•„๋‹ ์ˆ˜๋„ ์žˆ๋Š” ์ค‘๊ฐ„ ์ฒดํฌํฌ์ธํŠธ(checkpoint)๊ฐ€ ํ‰๊ฐ€์— ์ด์šฉ๋˜์–ด ๊ณต์ •ํ•œ ๋น„๊ต๋ฅผ ๋ฐฉํ•ดํ•˜๋Š” ๋ฌธ์ œ ๋ฐœ์ƒ.

Contribution

  • ์–ธ์–ด ๋ชจ๋ธ์˜ ๋Šฅ๋ ฅ์„ ์ฒด๊ณ„์ ์œผ๋กœ ์ฒ˜์Œ๋ถ€ํ„ฐ ๋๊นŒ์ง€ ๋ถ„์„
    • Pre-training๋ถ€ํ„ฐ Reinforcement Learning๊นŒ์ง€
    • ์–ธ์–ด ๋ชจ๋ธ ์ž์ฒด์˜ ๋Šฅ๋ ฅ(upstream task)์™€ ๋ฌธ์ œ ํ•ด๊ฒฐ ๋Šฅ๋ ฅ(downstream task)๋ฅผ ๋ชจ๋‘ ๋น„๊ตํ•˜๊ณ , in-domain๊ณผ out-of-domain์˜ ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ ๋น„๊ต
  • ์ฒ˜์Œ๋ถ€ํ„ฐ 1B, 4B ํŒŒ๋ผ๋ฏธํ„ฐ ๊ทœ๋ชจ๋กœ ํ•™์Šตํ•œ 100+ ์–ธ์–ด ๋ชจ๋ธ๊ณผ ํ•™์Šต ๋ฐ์ดํ„ฐ๋ฅผ ๊ณต๊ฐœ
  • Training Pipeline๊ณผ Evaluation Framework๋ฅผ ๊ณต๊ฐœํ•˜์—ฌ ๋ชจ๋ธ์˜ ํ•™์Šต ์กฐ๊ฑด๊ณผ ์–ธ์–ด, ๋ฌธ์ œํ•ด๊ฒฐ ๋Šฅ๋ ฅ์˜ ํ›„์† ์—ฐ๊ตฌ ๊ฐ€๋Šฅ

Experimental Settings
  1. Training Setup
    1. Pre-training data: FineWeb-Edu (an education-focused web dataset)
    2. Continued Pre-training data: FineMath (a math- and reasoning-focused dataset)
    3. Supervised Fine-tuning data: QA built on GSM8K and MATH
    4. RL data: same sources as SFT, but kept disjoint from the SFT split
  2. Evaluation Protocol
    • Upstream cloze tasks (raw language-modeling ability)
      • Evaluates pure next-token prediction and commonsense-reasoning ability, independent of conversational ability
      • Uses 0-shot accuracy, computed across several datasets
      • Datasets: HellaSwag, Winogrande, PIQA, OBQA, ARC-Easy, ARC-Challenge
    • Downstream generative tasks (generation-based problem solving)
      • The model must understand a question and generate a solution process that arrives at the answer (problem-solving ability)
      • ID (In-Domain): math- and reasoning-centric (GSM8K-Platinum, MATH)
      • OOD (Out-of-Domain):
        • CRUXEval: code reasoning
        • BGQA: logical reasoning
        • TabMWP: table-based reasoning
        • StrategyQA: commonsense and strategic reasoning
    1. ์ •ํ™•๋„ ํ‰๊ฐ€ ์ง€ํ‘œ
      1. Pass@1 (Greedy)
        1. Temperature = 0
        1. ๋‹จ์ผ ์ •๋‹ต์ด ๋งž์œผ๋ฉด ์ •๋‹ต์œผ๋กœ ๊ฐ„์ฃผ
      1. Maj@16
        1. Temperature = 1
        1. 16๊ฐœ ์ƒ˜ํ”Œ ์ƒ์„ฑ
        1. ๋‹ค์ˆ˜๊ฒฐ ๊ฒฐ๊ณผ๋กœ ์ •๋‹ต ํŒ๋‹จ
      1. RM@16

        16๊ฐœ ์ค‘ ORM ์ ์ˆ˜๊ฐ€ ๊ฐ€์žฅ ๋†’์€ ์‘๋‹ต ์„ ํƒ

        ORM์ ์ˆ˜: Skywork-Reward-Llama-3.1-88-v0.2์—์„œ ์ œ์‹œ๋˜์—ˆ์œผ๋ฉฐ ์ƒ์„ฑ๋œ ํ•ด๋‹ต์— ๋Œ€ํ•ด ์Šค์นผ๋ผ ์ ์ˆ˜ ๋ถ€์—ฌํ•˜์—ฌ ์ •๋‹ต ์—ฌ๋ถ€๋ฟ ์•„๋‹ˆ๋ผ ํ’€์ด์˜ ์ผ๊ด€์„ฑ ๋ฐ˜์˜

      1. Pass@16

        16๊ฐœ ์ค‘ ํ•˜๋‚˜๋ผ๋„ ๋งž์œผ๋ฉด ์„ฑ๊ณต

Scaling Studies Across Three Training Stages (Methods)

Scaling Up Pre-Training Compute
  • Pre-training์˜ ์–‘์ด ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์ด ๋ฏธ์น˜๋Š” ์˜ํ–ฅ์„ ์•Œ์•„๋ณด๊ธฐ ์œ„ํ•˜์—ฌ 0.5B, 1B, 4B๋ชจ๋ธ๋กœ token์˜ ์–‘์„ 10B๋ถ€ํ„ฐ 320B token๊นŒ์ง€ pre-trainํ•จ.
  • ์ฒ˜์Œ์—๋Š” ์ ์ฐจ ๋น„๋ก€ํ•˜์—ฌ ์ฆ๊ฐ€ํ•˜๋‹ค๊ฐ€, ๋ชจ๋ธ ํฌ๊ธฐ์˜ 80๋ฐฐ์—์„œ 160๋ฐฐ๊ฐ€ ๋˜๋Š” ์‹œ์ ์—์„œ Accuracy์˜ ์ฆ๊ฐ€ํญ์ด ์ ์ฐจ ๊ฐ์†Œ
  • SFT ๋ชจ๋ธ๊ณผ SFT-RL ๋ชจ๋ธ์„ ๋ชจ๋‘ ๋น„๊ตํ•˜์˜€์„ ๋•Œ, 80B Token๊นŒ์ง€๋Š” ๋šœ๋ ทํ•œ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋ณด์ด๋‹ค๊ฐ€ ๊ทธ ์ดํ›„์—๋Š” ๋šœ๋ ทํ•œ ๋ณ€ํ™” ์—†์Œ
    • ID Maj @ 16์˜ ๊ฒฝ์šฐ 20BT๊นŒ์ง€ 8%์—์„œ 15%๋กœ ๊ธ‰๊ฒฉํ•˜๊ฒŒ ์ƒ์Šนํ•˜๋‹ค๊ฐ€ ์ดํ›„ 320BT๊นŒ์ง€ 17%๋กœ ํฐ ๋ณ€ํ™” ์—†์Œ
    • ์ „์ฒด์ ์œผ๋กœ Reinforcement Learning (RL)์„ ์ถ”๊ฐ€ํ•˜์˜€์„ ๋•Œ, ์ถ”๊ฐ€ํ•˜์ง€ ์•Š์€ ๊ฒฝ์šฐ๋ณด๋‹ค ์„ฑ๋Šฅ์ด ๋†’์ง€๋งŒ ์ด ๊ฒฝ์šฐ ์—ญ์‹œ๋„ 80BT ์ดํ›„์— ๋šœ๋ ทํ•œ ์ƒ์Šน์„ ๋ณด์ด์ง€ ๋ชปํ•จ
  • Out-of-Domain (OOD)์˜ ๊ฒฝ์šฐ์—๋Š” 160B Token ์ดํ›„์— ์˜คํžˆ๋ ค Accuracy๊ฐ€ ๊ฐ์†Œํ•˜๋Š” ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ž„
    • Degradation์„ ์ผ์œผ์ผœ ์˜คํžˆ๋ ค ์ƒ์„ฑ ํ’ˆ์งˆ์ด ๋–จ์–ด์ง

โ‡’ ๊ฒฐ๋ก : General Model Pre-training์ด ๊ณผ๋„ํ•˜๋ฉด ์˜คํžˆ๋ ค ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง€๋Š” ๊ฒฐ๊ณผ๋ฅผ ์ดˆ๋ž˜ํ•˜๋ฉฐ, ํ•ญ์ƒ ๋งŽ์€ Pre-training์ด ์ข‹์€ ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋Š” ๊ฒƒ์€ ์•„๋‹˜
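As a back-of-envelope check of the 80x–160x band above (the helper function is mine; the ratios are the observation reported in this section):

```python
def saturation_band(params_billions, low_ratio=80, high_ratio=160):
    # Token budget (in billions) where accuracy gains begin to flatten,
    # using the reported 80x-160x tokens-per-parameter band.
    return params_billions * low_ratio, params_billions * high_ratio

# A 1B model flattens out roughly between 80B and 160B tokens, consistent
# with the 80B-token plateau reported for the SFT / SFT-RL comparisons.
```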

Scaling Up Continued Pre-training (CPT)

Compares a 1B model pre-trained on 160B tokens under CPT budgets ranging from no Continued Pre-training up to 50B tokens.

  • The more Continued Pre-training (CPT) is applied, the more upstream (general language) performance drops (catastrophic forgetting).
  • A replay strategy is used to counter this:
    • A small amount of pre-training data is randomly mixed back into the CPT stream.
  • Replaying 8B tokens gives better overall performance than no replay.
  • Too much replay (16B tokens), however, hurts performance.
  • On downstream tasks, performance rises sharply with CPT budget from 2B up to 32B tokens, both in-distribution (ID) and out-of-distribution (OOD).
  • Beyond 32B tokens, the gains from CPT become marginal.

⇒ Conclusion: domain-specific post-training must be backed by sufficient CPT to reach the desired performance, and growing the CPT data benefits both ID and OOD.
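The replay strategy described above can be sketched roughly as follows; the function, its arguments, and the token bookkeeping are illustrative assumptions, not the paper's actual data pipeline:

```python
import random

def build_cpt_stream(domain_docs, pretrain_docs, replay_token_budget, seed=0):
    # Mix a small, randomly drawn slice of general pre-training data
    # (the "replay" budget, e.g. 8B tokens in the paper) back into the
    # domain-specific CPT corpus to curb catastrophic forgetting.
    rng = random.Random(seed)
    replay, remaining = [], replay_token_budget
    for doc in rng.sample(pretrain_docs, len(pretrain_docs)):
        if remaining <= 0:
            break
        replay.append(doc)
        remaining -= doc["n_tokens"]
    # Shuffle the replay documents uniformly into the domain corpus.
    stream = list(domain_docs) + replay
    rng.shuffle(stream)
    return stream
```

The section's finding is that this budget has a sweet spot: 8B replay tokens helped, while 16B hurt.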

Scaling Up Supervised Fine-Tuning (SFT)

Varies the number of epochs and the dataset size to measure how SFT affects training and model performance.

  1. SFT epochs
    • Training for 1, 2, 4, 8, 16, and 32 epochs, the ID metrics rise steadily and then stall around 8 epochs.
    • OOD performance peaks at 2-4 epochs and then declines, indicating that over-specialization harms generalization.
    • Around 3 epochs of SFT works best; too many SFT epochs erase the later gains from Reinforcement Learning (RL).
  2. SFT dataset size
    • Experiments grow the SFT data from 50K through 100K, 150K, …, up to 400K examples.
    • ID performance keeps increasing as the number of examples grows.
    • OOD performance is erratic and sometimes even declines.
      • This can also constrain the gains available in the later Reinforcement Learning (RL) stage.

Scaling Up Reinforcement Learning (RL)

Examines how accuracy changes as the number of RL epochs and the RL dataset size are varied.

  1. RL Epoch์˜ ๋ณ€ํ™”

Greedy, Maj@16 , RM@16 ์„ฑ๋Šฅ์€ 8โ€“16 epoch์—์„œ peak ํ›„ ์ •์ฒด

  • Correct Ratio@16โ†’epoch๊ฐ€ ๋Š˜์–ด๋‚ ์ˆ˜๋ก ๊ณ„์† ์ฆ๊ฐ€
  • Pass@16โ†’4 epoch ์ดํ›„ ๊ธ‰๊ฒฉํžˆ ๊ฐ์†Œ
    • RL์€ epoch์ด ๊ณผ๋„ํ•˜๊ฒŒ ๋Š˜์–ด๋‚ ์ˆ˜๋ก ์ถœ๋ ฅ ๋‹ค์–‘์„ฑ์„ ๊ฐ์†Œ์‹œํ‚ค๊ณ , 1-2๊ฐœ์˜ ์ •๋‹ต๋งŒ ๊ณ„์† ์ƒ์„ฑํ•˜๊ฒŒ๋จ
  • Maj @16 vs Greedy
    • SFT-only ๋ชจ๋ธ์€ Maj@16์ด Greedy๋ณด๋‹ค ์„ฑ๋Šฅ์ด ์ €์กฐํ•œ ๊ฒฝ์šฐ๊ฐ€ ์žˆ์Œ
    • RL์„ ์ ์šฉํ•œ ๋ชจ๋“  ๊ฒฝ์šฐ์—์„œ Maj@16์ด Greedy๋ณด๋‹ค ์šฐ์ˆ˜ํ•œ ์„ฑ๋Šฅ
  1. RL Dataset Size์˜ ๋ณ€ํ™”
  • RL Epoch์„ 8๋กœ ๊ณ ์ •ํ•˜๊ณ , ๋ฐ์ดํ„ฐ์˜ ํฌ๊ธฐ๋ฅผ 0๋ถ€ํ„ฐ 400K๊นŒ์ง€ ์กฐ์ •ํ•ด ๊ฐ€๋ฉฐ ๊ด€์ฐฐ
  • ID์™€ OOD ๋ชจ๋‘์—์„œ Accuracy๊ฐ€ 150-200K๊นŒ์ง€ ์ฆ๊ฐ€ํ•˜๋‹ค๊ฐ€ ์ •์ฒด
  • Pass@K๋Š” ์˜คํžˆ๋ ค ์ผ์ฐ ์ •์ฒด๋˜๊ณ  ํ•˜๋ฝํ•˜๊ธฐ๊นŒ์ง€ ํ•จ
    • Pass@K: ์—ฌ๋Ÿฌ ์ƒ˜ํ”Œ ์ค‘ ํ•˜๋‚˜๋ผ๋„ ์ •๋‹ต์ด ์žˆ์œผ๋ฉด ๋งž๋Š” ๊ฒƒ์œผ๋กœ ์ธ์ •, ์ถœ๋ ฅ์˜ ๋‹ค์–‘์„ฑ์ด ์ค‘์š”
    • RL ๋ฐ์ดํ„ฐ๊ฐ€ ๋งŽ์•„์งˆ์ˆ˜๋ก ๋‹ค์–‘์„ฑ์ด ๊ฐ์†Œํ•˜์—ฌ ๊ฐ™์€ ๋‹ต๋งŒ ๊ณ„์† ๋‚ด๋†“๊ฒŒ ๋จ
  • ์ƒˆ๋กœ์šด ๋ฌธ์ œ๋ฅผ ๋งžํžˆ๋Š” ๋Šฅ๋ ฅ์ด ์ฆ๊ฐ€ํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹Œ, ์ด๋ฏธ ๋งžํž ์ˆ˜ ์žˆ์„ ๋งŒํ•œ ๋ฌธ์ œ๋ฅผ ๋” ์ •ํ™•ํ•˜๊ฒŒ ๋งžํžˆ๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šต์ด ๋˜๋Š” ๊ฒƒ
  • SFT์™€ RL์˜ ๋ฐ์ดํ„ฐ ๋ถ„ํ• 

    500K ๋ฐ์ดํ„ฐ ์ค‘ ๋žœ๋ค์œผ๋กœ ์ถ”์ถœํ•œ 100K์˜ ๋ฐ์ดํ„ฐ๋ฅผ (10 / 90, 30 / 70, 50 / 50, 70 / 30, 90 / 10)๋น„์œจ๋กœ ๋‚˜๋ˆ”

    100K๊ฐ€ peak๋กœ ์„ฑ๋Šฅ ์ •์ฒด๊ฐ€ ์‹œ์ž‘๋˜๋Š” ์‹œ์ ์ด๋ฏ€๋กœ ๋ฐ์ดํ„ฐ ํฌ๊ธฐ๋ฅผ 100K๋กœ ์„ ํƒ

    OOD ์„ฑ๋Šฅ์˜ ๊ฒฝ์šฐ SFT๊ฐ€ 70K์ผ ๋•Œ, ๊ฐ‘์ž๊ธฐ ๊ฐ์†Œํ•˜๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ์Œ. ์ด์— SFT๋ณด๋‹ค๋Š” RL์˜ ๋น„์ค‘์ด ์ค‘์š”ํ•˜๊ณ , RL์ด 90K์ธ ์‹œ์ ์—์„œ OOD์˜ ์„ฑ๋Šฅ์€ ๊ฐ€์žฅ ์šฐ์ˆ˜โ‡’OOD์˜ ์„ฑ๋Šฅ์€ RL์ด ๊ฒฐ์ •

    ์—ฌ๊ธฐ์„œ๋Š” ID์™€ OOD์˜ ์„ฑ๋Šฅ์ด trade-off๊ด€๊ณ„์ž„์„ ์•Œ ์ˆ˜ ์žˆ์Œ
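The Pass@K behavior above is usually quantified with the standard unbiased estimator from the code-generation literature (not something this paper introduces): generate n samples, count c correct, and estimate the chance that a random subset of k contains at least one correct answer.

```python
from math import comb

def pass_at_k(n, c, k):
    # Unbiased Pass@k estimator: 1 - C(n - c, k) / C(n, k).
    # With n = k = 16 (the paper's Pass@16), this reduces to
    # "any of the 16 samples correct", i.e. 1.0 iff c > 0.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Under this metric, RL that collapses all 16 samples onto one or two answers can raise Maj@16 while lowering Pass@16, which is exactly the divergence reported above.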

Conclusion

  1. Simply scaling up model training is not always the answer.
    1. Increasing token count, SFT dataset size, or RL improves performance at first, then plateaus past a certain point.
    2. Excess training only adds cost while performance stalls.
    3. In some cases performance even degrades.
  2. Domain-specific Continued Pre-training (CPT) is necessary, but must be applied in moderation.
    1. CPT is the key foundation for downstream (task-specific problem-solving) performance.
    2. Scaling CPT indiscriminately causes catastrophic forgetting.
    3. Replay data must be kept in balance.
  3. SFT and RL play different roles.
    1. SFT raises in-domain performance, but overdoing it hurts generalization.
    2. RL reinforces confidence in answers the model already gets right, and helps on OOD.
      1. Excessive RL has downsides: reduced output diversity and falling Pass@K.

Categories

research