26 March 2026

Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate


최민영
🥉

Review

Nickname | Strength & Weakness & Suggestions | Rating (0/5)
댓츠노노
• Strength: Models the human way of thinking to propose a fine-tuning method that beats SFT. Shows that the proposed method is highly efficient and useful. A very COLM-like paper!
• Weakness: By what principle does CFT optimize better than SFT? A theoretical justification, not only an empirical one, would have made it stronger
• Suggestion: How could the data-quality problem mentioned as a limitation be addressed?
Rating: 4
아이리스
• Strength: The motivation and idea match my own thinking very closely!! I think this work models the human way of thinking well.
• Weakness: It is not a truly human-like process of creating good data, verifying it, critiquing it from diverse angles, and debating it. It feels like only part of that is implemented?
• Suggestion: This is the direction I want to pursue: rather than learning alone, learning together by critiquing and debating better problems.
Rating: 4.5
핸드크림
• Strength: Trains on data that reflects GPT's critiquing ability. The model learns GPT-generated text and critiquing ability at the same time
• Weakness: Training-data quality needs to be guaranteed
• Suggestion: Compare performance against a distilled SFT model
Rating: 4.5
3월
• Strength: Unlike the conventional approach of training a model to copy correct answers, the shift in thinking here, training it to critique wrong answers much as humans learn, is excellent. Data efficiency is also very good
• Weakness: Why does it perform well even though the training objective differs from the inference objective? This left me curious
• Suggestion: For problems where the criterion for "wrong" is ambiguous, try training end-to-end, from the critique through to generating the correct answer
Rating: 4.4
화이트노이즈
• Strength: Base models have improved so much that I had started to think SFT alone was no longer enough, so I nodded along while reading the background. Many papers also seem dissatisfied with SFT's simple answer imitation
• Weakness & suggestion: It does well in the math domain, where the reason an answer is wrong is clear, but it is doubtful whether it would also do well in writing or commonsense reasoning, where answers are ambiguous or open-ended. I would like to see experiments on this
Rating: 4.1
에너지
• Strength: Proposes CFT, which uses (question, answer, explanation) instead of SFT's (question, answer). Since most post-training has used SFT, it made me rethink a paradigm I had taken for granted. I was not sure what to make of the title at first, but I found it a creative paper.
• Weakness: The approach is creative but depends too heavily on data quality
• Suggestion: When building the critique data, it would help to add some method for securing data quality, such as using several models to raise critique quality or selecting the top-k critiques.
Rating: 4.2
피즈치자
• Strength: The method is very simple yet yields a large performance gain. People have tried to apply the human reasoning process in many places, so why had nobody thought of applying it to SFT until now? Needing little data is also a big merit
• Weakness: Results may be at the mercy of the quality of the generated critiques
• Suggestion: Building the right/wrong sets more carefully with a recent LLM would probably improve performance a lot. I am curious how it performs when trained on a fully refined dataset
Rating: 4.2
제로콜라
• Strength: The idea that having the model analyze why an answer is wrong is more effective than having it memorize correct answers resembles how humans study, so it resonated with me.
• Weakness: Training is done by critiquing answers, yet at inference the model generates the answer directly. The explanation of why this training improves direct answer generation seems insufficient.
• Suggestion: Performance depends on the quality of the teacher model that generates the critiques, so try building critique data with a variety of models.
Rating: 4.3
창백카츄
• Strength: Performance goes up
• Weakness: It borrows the philosophy of contrastive learning as-is and is also close to the philosophy of CoT. The idea does not look original, so I did not feel it contributed to the field. In my view the methodology is just contrastive learning + CoT + distillation, and that is all
• Suggestion: How about bringing in a reinforcement-learning perspective and generating critiques of the policy at inference time?
Rating: 1.75

TL; DR

💡

Training a model to critique noisy answers is more effective for improving reasoning performance than SFT, which imitates the correct answer as-is!

  • Let's apply the way humans learn (critical thinking, analysis, understanding…) to model training

Summary

  • Author
  • Citation: 40

Introduction

Background

  • SFT (Supervised Fine-Tuning) is the standard method for LLM post-training
    • Supervised Fine-Tuning (SFT): given question-answer pairs, the model is trained to imitate the reference responses
    • Used especially often to strengthen specific abilities such as mathematical reasoning or code generation
  • Prior work has therefore focused on building high-quality SFT datasets
    • e.g., MetaMath, MAmmoTH, WizardCoder
  • However, when SFT is applied to an already strong base model, gains slow down even as the quantity and quality of the SFT data keep increasing
  • In addition, if the data quality is not good enough, SFT can actually lower performance (Fig. 1)

Motivation

  • Consider how humans learn (the human learning process)!
    • Humans do not simply memorize correct answers. They deepen their understanding by analyzing, critiquing, and refining answers

      → critical thinking, deeper analysis, and nuanced understanding …

    • SFT has so far ignored these elements and focused on imitating the correct answer as-is

So in this Paper…

Fig 1-b: Comparison between SFT and CFT dataset samples
  • Rather than having the model learn by simply imitating correct answers (imitation), have it learn by critiquing and reviewing a solution: why it is wrong, which part is incomplete, and how it should be fixed (critique)!
  • To do this, the model is trained to generate an annotated critique $c$ for a question-response pair $(x, y)$ (Fig. 1-b)
    • Use an objective that maximizes $P(c \mid [x; y])$

⇒ Learning to critique and verify imperfect answers is better suited to improving reasoning ability than 'imitating the correct answer'!!

Contribution

  • Proposes Critique Fine-Tuning (CFT): a new fine-tuning method that takes a query-response pair as input and learns to produce a critique, instead of simply imitating a response to a query
  • Builds critique datasets: uses GPT-4o to construct critique datasets for WebInstruct, MetaMathQA, and NuminaMath
  • Experiments
    • On three 7B base models, CFT scores about 4 to 10 points higher on average than the strongest SFT baseline
    • With only 50K samples and about 1 hour of training, it approaches the performance of strong models trained on 2M+ samples and of the RL-based SimpleRL (data/compute efficiency)

Method & Dataset

Datasets

WebInstruct

  • A dataset made up of math 65%, physics 8%, chemistry 4%, business 10%, humanities 4%, and so on
    • Broader in scope than math-centric data
  • Four subsets of 50K examples each are built:
    • WebInstruct-SFT: SFT data that uses the original answers as-is (error rate above 50%); a plain 50K sample from the original WebInstruct data
    • WebInstruct-verified: SFT data built by keeping only the original answers that GPT-4o-1120 judged correct
    • WebInstruct-GPT-4o: data in which GPT-4o-1120 writes new answers to the same questions as WebInstruct-SFT
    • WebInstruct-CFT (Ours): data in which GPT-4o-1120 generates critiques of the original noisy answers (WebInstruct-SFT). About 56% of them are judged 'correct' and the rest 'wrong'
      • Prompts & Generated Critique

      → That is, for pairs that are correct in the original data the critique explains why the answer is right, and for wrong pairs it explains why the answer is wrong. The key point is that the original noisy answers are used as-is (with only a critique attached)

  • Comparison between other SFT datasets
    • Covers a wider range of topics with far less data (50K)
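The difference between an SFT sample and a CFT sample described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the `Record` class, the two helper functions, and the prompt template ("Question / Solution / Critique") are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    question: str   # x
    response: str   # y, the original answer, possibly wrong (noisy)
    critique: str   # c, written by a teacher model


def to_sft_example(rec: Record) -> dict:
    """SFT: the prompt is the question and the target is the response itself."""
    return {"prompt": rec.question, "target": rec.response}


def to_cft_example(rec: Record) -> dict:
    """CFT: the prompt is [x; y] and the target is the critique c.

    The template below is an assumption made for illustration only.
    """
    prompt = (
        f"Question:\n{rec.question}\n\n"
        f"Solution:\n{rec.response}\n\n"
        "Critique:"
    )
    return {"prompt": prompt, "target": rec.critique}


# Toy record with a deliberately wrong answer, kept as-is on the CFT side.
rec = Record(
    question="What is 12 * 12?",
    response="12 * 12 = 124",
    critique="The multiplication is wrong: 12 * 12 = 144. Conclusion: wrong",
)
sft = to_sft_example(rec)
cft = to_cft_example(rec)
```

Note that the noisy response moves from the target side (SFT) to the input side (CFT), so the model is never asked to reproduce it.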

MetaMath & NuminaMath

  • 50K examples are sampled from each, and GPT-4o is used to create the critique data

Training Objective

  • Input: $[x; y]$, the concatenation of the question $x$ and the noisy response $y$
  • Output: the critique $c$ of the pair $[x; y]$
  • Training Objective: the model is trained to generate the critique $c$ by maximizing:

    $$\arg\max_{\theta} \log P(c \mid [x; y]; \theta)$$

    • $\theta$ denotes the model parameters

⇒ During training, the model is trained as an 'answer critic', not an 'answer generator'

  • At inference, the answer is generated directly, with no separate critique step
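Since the objective conditions on $[x; y]$ and scores only $c$, one common way to implement it is to compute the negative log-likelihood over critique tokens and ignore the input tokens. The sketch below assumes that masking scheme and a per-token average; both are assumptions of this example, and `critique_nll` is a hypothetical helper, not code from the paper.

```python
import math


def critique_nll(token_logprobs, is_critique_token):
    """Average negative log-likelihood over critique tokens only.

    token_logprobs[i]    : log P(token_i | previous tokens) under the model
    is_critique_token[i] : True for tokens of c, False for tokens of [x; y]
    Minimizing this value maximizes log P(c | [x; y]; theta), up to the
    per-token normalization assumed here.
    """
    picked = [lp for lp, keep in zip(token_logprobs, is_critique_token) if keep]
    if not picked:
        raise ValueError("no critique tokens to train on")
    return -sum(picked) / len(picked)


# Toy sequence: 3 input tokens ([x; y]) followed by 2 critique tokens (c).
logprobs = [math.log(0.9), math.log(0.8), math.log(0.7),
            math.log(0.5), math.log(0.25)]
mask = [False, False, False, True, True]
loss = critique_nll(logprobs, mask)  # only the last two tokens contribute
```

The input tokens still shape the prediction through the context, but they contribute nothing to the loss, which is what separates this from imitating $y$.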

Experiments

Setting

  • Evaluation Datasets
    • Mathematical reasoning benchmarks
      • MATH, Minerva-Math, GSM8K, AIME24, AMC23, OlympiadBench
    • STEM reasoning (Science, Technology, Engineering, Mathematics)
      • TheoremQA: mathematical theorem understanding
      • MMLU-Pro: physics, chemistry, mathematics
      • GPQA: complex questions that require scientific reasoning
  • Base Models
    • DeepSeek-Math-7B, Qwen2.5-7B, Qwen2.5-Math-7B
  • Training Details
    • SFT settings
      1. SFT: trains on the original dataset's responses as-is
      2. SFT-verified: trains only on responses verified by GPT-4o
      3. SFT-GPT-4o: trains on responses generated by GPT-4o
    • CFT settings
      • Trains on the CFT dataset built earlier

Results

  • Main Results (CFT vs. SFT)
    • Compares SFT and CFT on the three base models
    • The strongest base model is Qwen2.5-Math-7B
    • In some cases, training only on WebInstruct-SFT (the original dataset) lowers performance below that of the base model
    • With WebInstruct-CFT, every model reaches its highest overall performance, a 6.7% improvement over the best result achieved with SFT
  • Performance comparison of Ours vs. other Reasoning-specialized models
    • Compares the CFT-trained model (Qwen2.5-Math-7B-CFT) with other existing reasoning-specialized models
    • Qwen2.5-Math-7B-CFT achieves the highest performance among all 7B-scale models
      • Moreover, this is achieved with only 50K training examples
    • Even against larger (72B) models, it matches or surpasses them on most datasets with only about 1/10 of the parameters
  • Comparison with RL-based Method
    • Starting from Qwen2.5-Math-7B-base, CFT is compared with SimpleRL, an RL-based method
      • SimpleRL-Zero: pure RL-based training
      • SimpleRL: Distill+RL-based training
    • CFT performs on a par with the RL-based methods
    • The SimpleRL variants used 1152 H100 GPU hours, whereas CFT was trained in only 8 H100 GPU hours

      → Performance close to RL can be reached at far lower compute cost

  • Ablation Studies
    • (1) Data Source
      • Compares performance as the training dataset is switched among WebInstruct / MetaMathQA / NuminaMath
        • Characteristics of each dataset
          • WebInstruct: a broad but noisy web-based instruction dataset
          • MetaMathQA: a math-specialized dataset built by rewriting math problems in diverse ways
          • NuminaMath: a large-scale competition-style math CoT dataset
      • Under SFT, the math-specialized or more structured MetaMathQA/NuminaMath had the advantage, and the broad but noisy WebInstruct was at a disadvantage (lower performance)
      • Under CFT, however, WebInstruct performs best

        → This indicates that CFT is not at the mercy of having 'good data'; it builds reasoning ability by learning to critique. Varied dataset quality is turned into an advantage

    • (2) Response Source
      • Two sources for the solution $y$ used in CFT training are compared:
        1. Solutions generated by Qwen2.5-Math-7B itself
        2. Solutions originally included in the WebInstruct dataset
      • The model that produces critiques of these answers is the same in both cases
      • Performance does not differ much between the two
      • CFT does not depend on one particular kind of answer; it works with both the answers originally in the dataset and answers newly generated by a model

        → What matters in CFT is not 'who wrote the answer' but learning to read a solution, critique it, and identify its errors

    • (3) Teacher Critique Model
      • Examines how much the quality of the teacher model that produces the critique $c$ matters in CFT
        • i.e., the model that writes critiques for the $[x; y]$ pairs
      • Even with a relatively weak critique model such as GPT-4o-mini, CFT is far more effective than verified-SFT
      • A stronger critique teacher (GPT-4o-1120) improves performance further

        → CFT works well even with a weak critique model, and a stronger teacher critique model brings additional gains
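As a quick arithmetic check on the efficiency claims in the results above, the reported numbers can be turned into ratios. All four inputs are the values quoted in these notes; the variable names are made up for this example.

```python
# Figures quoted in the notes above.
simple_rl_gpu_hours = 1152   # H100 GPU hours reported for SimpleRL
cft_gpu_hours = 8            # H100 GPU hours reported for CFT
cft_params_b = 7             # Qwen2.5-Math-7B-CFT, billions of parameters
large_params_b = 72          # the larger comparison models

compute_ratio = simple_rl_gpu_hours / cft_gpu_hours   # 144.0
param_fraction = cft_params_b / large_params_b        # roughly 0.097

print(f"SimpleRL used {compute_ratio:.0f}x the GPU hours of CFT")
print(f"The CFT model has {param_fraction:.1%} of the 72B parameter count")
```

So "about 1/10 of the parameters" corresponds to 7/72, and the compute gap against SimpleRL is a factor of 144.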

Limitation & Conclusion

Limitation

  • The critique data is not perfect. When people checked 50 critiques produced by GPT-4o-1120, about 20% of them reportedly contained inaccuracies
  • Self-critique was added at inference, but it consistently underperformed direct inference
    • Self-critique inference: instead of answering directly at inference, the model generates an answer once → critiques that answer itself → regenerates if it judges the answer wrong → repeats
    • Every self-critique variant underperformed direct inference
      • This is attributed to inconsistent critique criteria, temperature sensitivity, and similar factors

      ⇒ Running a self-critique loop at inference only adds complexity and hurts performance. In other words, training on critiques and then using plain direct inference is the most effective combination
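The self-critique inference loop described above can be sketched as follows, next to direct inference for contrast. This is a hedged illustration of the generate → critique → regenerate procedure as summarized in these notes, not the paper's implementation: `generate`, `critique_says_wrong`, and `max_rounds` are hypothetical stand-ins for model calls and a stopping rule.

```python
from typing import Callable


def self_critique_inference(
    question: str,
    generate: Callable[[str], str],
    critique_says_wrong: Callable[[str, str], bool],
    max_rounds: int = 3,
) -> str:
    """Generate, self-critique, and regenerate until accepted or out of rounds."""
    answer = generate(question)
    for _ in range(max_rounds):
        if not critique_says_wrong(question, answer):
            break                     # the critique accepted the answer
        answer = generate(question)   # judged wrong: sample a new answer
    return answer


def direct_inference(question: str, generate: Callable[[str], str]) -> str:
    """The setting the notes report as most effective: a single generation."""
    return generate(question)
```

The loop's outcome depends entirely on how reliable `critique_says_wrong` is, which fits the notes' explanation that inconsistent critique criteria make the loop worse than a single direct generation.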

Conclusion

  • Imitating correct answers (SFT) is not necessarily the best way to build a model's reasoning ability
    • Having the model look at wrong or incomplete answers and analyze where and why they go wrong can be a stronger training signal
  • CFT achieves higher accuracy than conventional SFT and also gains in data efficiency and compute efficiency
  • Performance still depends on the quality of the teacher model that generates the critiques, which leaves room for improvement

Categories

SFT research