17 December 2025

Chain-of-Model Learning for Language Model

Review

| Nickname | One-line review | Rating (0/5) |
| --- | --- | --- |
| 월드콘 | The motivation and the methodology both offer good insight. Efficiency is not only about inference time or the number of calls but also about parameters: if a large model can use only part of its parameters, that is efficient. | 4 |
| 파비아노카루아나 | The experimental results are underwhelming, but the idea is truly excellent. Modeling how to reuse an existing model well, and then actually applying it, is very logical. | 5 |
| 키보드 | Fascinating. How does anyone come up with this? Research that changes the model structure itself is amazing. It is a pity that performance did not improve as much as expected, but being able to expand parameters without retraining from scratch is a clear advantage. | 5 |
| 우산안가져옴 | There seems to be a lot of research trying to overcome the limits caused by the Transformer's fixed structure. Coming up with the idea of building a chain of nested sub-models inside one model is impressive. | 4.5 |
| 꼬들목 | “When training the 8b model, the 3b model cannot be reused and training starts over from scratch.” This had personally bothered me, so I was moved to see it in the motivation. Performance is a bit disappointing, but it is encouraging research. I envy how smart they are !!@@@ | 4.5 |
| 육사시미 | So the Chain-of-XXX concept can be applied anywhere.. In particular, ‘how much to think’ could be controlled at the level of model size or chains. I expect many follow-up studies based on this idea. | 4.5 |
| 날씨:흐림 | The idea of decomposing one vector into several sub-representations and activating them layer by layer is fresh.. It feels like layer-wise decomposed learning with an emphasis on dependency..? | 4.8 |
| 마우스 | Presenting a new model structure from a mathematical viewpoint, together with a way to control existing models, seems highly novel. | 5 |

TL; DR

💡 Splitting a representation into sequential sub-representations lets you keep the existing model while training it further, and makes it both expandable and flexible!

Summary

Motivation

  • Thanks to the Transformer scaling laws, many companies are devoted to building ever larger models, but scaling the architecture has the following problems:
    • When scaling up, the existing scale cannot be kept and training must always start from scratch. People learn incrementally, but models do not.
      • e.g. after training LLaMA-3-3b, training the 8b model cannot reuse the 3b model and starts again from scratch.
    • Existing LLM architectures always use a fixed number of parameters and lack a mechanism to use them dynamically according to the capability the problem requires.
      • e.g. sending a simple instruction that a 3b model handles well to a 3000b model is inefficient.
      • Author's comment) Speculative decoding already addresses this.. but the fundamental motivation still seems valid?
        • Speculative decoding
          • Put simply: run a small model first, then verify with the large model!
            • e.g. before running a 175b model, generate a few tokens with a 3b model, then run the 175b LLM to verify whether the large model would have produced the same output.
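The speculative decoding idea in the comment above can be sketched as follows. This is a minimal greedy variant with toy stand-in models, not the method of any specific paper: `speculative_step`, `draft_next`, and `target_next` are hypothetical names, and real systems check all drafted tokens in one batched forward pass of the large model and usually use a probabilistic acceptance rule.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token (greedy)

def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int = 4) -> List[int]:
    """Draft k tokens with the small model, keep what the large model agrees with."""
    # 1) The small model proposes k tokens autoregressively.
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_next(prefix + draft))

    # 2) The large model checks each proposed position in order.
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch, then stop
            return accepted
        accepted.append(tok)

    # All drafts matched: the large model contributes one extra token.
    accepted.append(target_next(prefix + accepted))
    return accepted

# Toy models (assumptions for illustration): the target counts upward,
# the draft agrees except right after a multiple of 5.
def target_next(tokens: List[int]) -> int:
    return tokens[-1] + 1

def draft_next(tokens: List[int]) -> int:
    return tokens[-1] + (2 if tokens[-1] % 5 == 0 else 1)

all_accepted = speculative_step([1], draft_next, target_next, k=4)
early_stop = speculative_step([4], draft_next, target_next, k=4)
```

In this greedy form the returned tokens always equal what the large model alone would have generated; the saving comes from verifying several drafted tokens per large-model pass.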

Contribution

  • Proposes Chain-of-Representation (CoR), which generalizes the representation (hidden state)
    • View a representation as a combination of lower-dimensional sub-representations
    • Express knowledge (scale) through multiple features (chains)
  • Proposes Chain-of-Model to model CoR well
    • Incorporate causal dependencies across different scales
    • Every layer is built as a Chain-of-Layer (CoL)
      • CoL has the following properties
        • Generality: an ordinary Transformer layer is a CoL with a single chain!
        • Causality: to obtain the features of scale $i$, only the parameters of chains 1 through $i$ need to be activated
        • Compositionality: if two layers are CoLs, their composition is also a CoL
  • Performance is comparable to existing LLM frameworks, while extensibility and flexibility are better

Chain-of-Model Learning

  • Chain-of-Representation

    Any representation $x \in \mathbb{R}^D$ can always be written equivalently as the concatenation of $n$ sub-representations, denoted $\xi(x, n) = \{x_1, \ldots, x_n\}$, where $x_i \in \mathbb{R}^{d_i}$ and $\sum_{i=1}^{n} d_i = D$.

    This is defined as Chain-of-Representation (CoR).

    • Each chain corresponds to a sub-representation $x_i$ within the CoR.
    • Activating the first $i$ chains encodes the information for scale $i$.
      • That is, a CoR can encode $n$ different scales of information within a single representation
      • With $n = 1$, the CoR is identical to the original representation
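The CoR definition above amounts to slicing a vector into consecutive chunks. A minimal sketch in plain Python; the helper names `split_chains` and `activate` are made up for illustration and are not from the paper.

```python
from typing import List, Sequence

def split_chains(x: Sequence[float], dims: Sequence[int]) -> List[List[float]]:
    """xi(x, n): split x into n consecutive sub-representations of sizes dims."""
    if sum(dims) != len(x):
        raise ValueError("chain sizes must sum to the representation size D")
    chains, start = [], 0
    for d in dims:
        chains.append(list(x[start:start + d]))
        start += d
    return chains

def activate(x: Sequence[float], dims: Sequence[int], scale: int) -> List[float]:
    """Return the part of x visible at a given scale: the first `scale` chains."""
    chains = split_chains(x, dims)
    return [v for chain in chains[:scale] for v in chain]

x = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
dims = [2, 2, 4]                      # d_1 + d_2 + d_3 = D = 8
chains = split_chains(x, dims)
scale_2 = activate(x, dims, scale=2)  # first two chains only
```

With `dims = [8]` (a single chain) `activate(x, [8], 1)` returns the whole vector, which is the $n = 1$ case.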
  • Chain-of-Layer
    • The $i$-th scale may use only its own information and that of the preceding scales $1, \ldots, i-1$, never that of later scales
    • Chain-of-Layer is proposed to build this causal relationship into CoR

    For a layer $y = f_\theta(x)$, assume that the input $x$ and the output $y$ can both be expressed as CoRs, $\xi(x, n)$ and $\xi(y, n)$. If each $y_i$ depends only on $x_{\le i}$, then $f_\theta(\cdot)$ is defined as a Chain-of-Layer (CoL).

    • Author's comment) This seems similar to an RNN
    • Corollaries
      • Generality
        • An ordinary Transformer layer is the case with a single chain. → Every existing layer satisfies the CoL form!
        • Additional chains can be placed on top of the existing chains, so an existing model can be expanded
      • Causality
        • If a layer $y = f(x)$ satisfies CoL, its weights $\theta$ can be partitioned into independent weights $\{\theta_1, \ldots, \theta_n\}$, where each $\theta_i$ is used to compute $y_i$ from $x_{\le i}$. That is, to obtain the output $y_i$, only $\theta_{\le i}$ has to be evaluated, on $x_{\le i}$.
        • In this CoL design, computing the $i$-th scale integrates the information of the preceding scales, which helps prevent catastrophic forgetting. Since only $\theta_{\le i}$ is needed to obtain $y_i$, the parameters are used dynamically.
      • Compositionality
        • Suppose there are two layers $y = f_1(x)$ and $z = f_2(y)$, and $x$, $y$, $z$ can all be expressed as CoRs.
          If $f_1$ and $f_2$ are CoLs, the composition $z = f_2(f_1(x))$ also satisfies CoL, i.e. $z_i$ depends only on $x_{\le i}$
        • Stacking several CoLs still yields a CoL as a whole → this extends to an entire model
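The compositionality claim follows directly from the CoL definition. A short derivation, writing $f_{1,j}$ and $f_{2,i}$ for the $j$-th and $i$-th output chains of $f_1$ and $f_2$ (this component notation is introduced here for illustration and does not appear in the notes):

```latex
\begin{aligned}
y_j &= f_{1,j}(x_{\le j}), \qquad j = 1, \ldots, n \\
z_i &= f_{2,i}(y_{\le i})
     = f_{2,i}\bigl(f_{1,1}(x_{\le 1}), \ldots, f_{1,i}(x_{\le i})\bigr)
\end{aligned}
```

Every argument on the right-hand side uses only $x_{\le j}$ with $j \le i$, so $z_i$ depends only on $x_{\le i}$, and the same argument applies again for each additional stacked layer.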
  • Chain-of-Model

    For a model $\Phi$ with $L$ layers, if every layer is a CoL, the model is defined as a Chain-of-Model (CoM).

    • Like CoL, a CoM has generality and causality.
    • Every model is a CoM with $n = 1$; a single model can contain several sub-models of different scales, and it can be expanded starting from a base model. → extensibility and flexibility

Architecture

  • Now that the concepts are defined, let's apply them to an actual model!
  • Linear Layer
    • As in the figure below, a linear layer computes its output under a condition that makes it satisfy CoL
      • $y = \parallel_{i=1}^{n} (y_i) = \parallel_{i=1}^{n} (W_i x_{\le i} + b_i)$
      • A hyperparameter Chain, $\mathcal{C} = \{c_1, \ldots, c_n\}$, is introduced so that each chain is computed from itself and the preceding chains
      • An ordinary linear layer is the case $n = 1$
      • This is called Chain-of-Linear!
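The Chain-of-Linear formula $y_i = W_i x_{\le i} + b_i$ can be sketched in plain Python as below. This is an illustrative toy, not the authors' implementation: the class name `ChainLinear`, the deterministic weight values, and the `scale` argument are assumptions made for the example.

```python
from typing import List, Sequence

Matrix = List[List[float]]

class ChainLinear:
    """y_i = W_i x_{<=i} + b_i for chain sizes `in_dims` / `out_dims`."""

    def __init__(self, in_dims: Sequence[int], out_dims: Sequence[int]):
        assert len(in_dims) == len(out_dims)
        self.in_dims, self.out_dims = list(in_dims), list(out_dims)
        self.weights: List[Matrix] = []
        self.biases: List[List[float]] = []
        for i, d_out in enumerate(out_dims):
            fan_in = sum(in_dims[: i + 1])  # chain i sees x_{<=i} only
            # Deterministic toy values instead of random initialisation.
            self.weights.append([[0.01 * (r + c + i + 1) for c in range(fan_in)]
                                 for r in range(d_out)])
            self.biases.append([0.0] * d_out)

    def forward(self, x: Sequence[float], scale: int) -> List[float]:
        """Compute y_{<=scale} from x_{<=scale}; later chains are never touched."""
        y: List[float] = []
        for i in range(scale):
            fan_in = sum(self.in_dims[: i + 1])
            x_prefix = x[:fan_in]
            for row, b in zip(self.weights[i], self.biases[i]):
                y.append(sum(w * v for w, v in zip(row, x_prefix)) + b)
        return y

layer = ChainLinear(in_dims=[2, 2], out_dims=[2, 2])
x = [1.0, 2.0, 3.0, 4.0]
y_full = layer.forward(x, scale=2)        # both chains
y_small = layer.forward(x[:2], scale=1)   # first chain only
y_perturbed = layer.forward([1.0, 2.0, 9.0, 9.0], scale=2)
```

The first chain's output is the same whether or not the second chain is active, and it ignores changes to $x_2$. The weights form a block lower-triangular layout: here 4 + 8 = 12 weights versus 16 for a dense 4×4 layer.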
  • Transformer
    • Multi Head Attention
      • To support CoR in each embedding, the Key, Query, Value, and Output projection matrices are all replaced with Chain-of-Linear layers
      • In $\text{MHA}(x) = O\left(\parallel_{i=1}^{h} \text{Attention}(q_i, k_i, v_i)\right) = O\left(\parallel_{i=1}^{h} \text{softmax}\left(\frac{q_i k_i^T}{\sqrt{d_k}}\right) v_i\right)$,
        if a single head spans two or more chains, information from different chains gets mixed and the layer is no longer a CoL. So each head is made to compute only a specific chain!
        • e.g. heads 1~2 are assigned to one chain, heads 3~4 to another, heads 5~8 to another, and so on
      • This is called Chain-of-Attention
    • Feed-Forward Network
      • Simply replace each linear layer with a Chain-of-Linear, and use the same Chain hyperparameter as in Chain-of-Attention above (how the representation is split into chains, e.g. 2, 2, 4)!
    • Normalization
      • Normalization is applied separately to each chain
    • Embedding
      • When encoding at scale $i$, only the embedding slices belonging to chains 1 through $i$ are used
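Per-chain normalization matters because a statistic taken over the full vector would leak later chains into earlier ones. A minimal sketch, assuming an RMSNorm-style normalization without a learned gain (the notes only say that each chain is normalized separately; `chain_rms_norm` is a hypothetical helper):

```python
import math
from typing import List, Sequence

def chain_rms_norm(x: Sequence[float], dims: Sequence[int],
                   eps: float = 1e-6) -> List[float]:
    """Normalise each chain by its own RMS so chain i never sees later chains."""
    out: List[float] = []
    start = 0
    for d in dims:
        chunk = x[start:start + d]
        rms = math.sqrt(sum(v * v for v in chunk) / d + eps)
        out.extend(v / rms for v in chunk)
        start += d
    return out

# Same first chain, very different second chain.
a = chain_rms_norm([1.0, 2.0, 3.0, 4.0], dims=[2, 2])
b = chain_rms_norm([1.0, 2.0, 30.0, -40.0], dims=[2, 2])
```

The normalized first chain is identical in `a` and `b`, so the CoL property survives the normalization step.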
  • KV sharing
    • In attention, every chain has its own keys and values, so they do not align well when different scales are connected
      • e.g. when switching from inference with a small model to inference with the expanded model, the keys and values for the whole context must be recomputed
    • KV sharing solves this!
      • All keys and values are computed in the first chain and then shared by every chain
      • If there are fewer key/value heads than heads, the values are repeated to fill the gap
    • This lowers performance slightly, but prefilling becomes faster and generation can switch seamlessly between LMs of different scales!
    • This is called Chain-of-Language-Model Air, CoLM-Air (the version without sharing is CoLM)
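The "repeat the values to fill the gap" step can be sketched as below. The helper `repeat_kv` and the consecutive repeat order are assumptions for illustration, similar in spirit to how grouped-query attention expands its key/value heads; the notes do not specify the exact layout.

```python
from typing import List

def repeat_kv(kv_heads: List[List[float]], n_query_heads: int) -> List[List[float]]:
    """Repeat shared key (or value) heads so every query head has one to attend with."""
    n_kv = len(kv_heads)
    if n_query_heads % n_kv != 0:
        raise ValueError("number of query heads must be a multiple of the KV heads")
    reps = n_query_heads // n_kv
    # Each shared head is copied `reps` times in a row (assumed ordering).
    return [list(head) for head in kv_heads for _ in range(reps)]

# Keys produced by the first chain only (2 heads of size 2), shared by 4 query heads.
first_chain_keys = [[0.1, 0.2], [0.3, 0.4]]
shared = repeat_kv(first_chain_keys, n_query_heads=4)
```

Because the keys and values come from the first chain alone, a cache filled at a small scale stays valid after switching to a larger scale.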
  • Objective Function
    • The ordinary cross-entropy loss can be used as the objective, but multi-scale prediction needs a classification head (a representation→vocab matrix) for every scale
    • So a multi-chain cross-entropy loss, computed at each scale, is proposed
      • $\text{Loss}_i = \mathcal{L}(W^i x_{\le i})$
    • However, computing this loss is expensive, so it is used only during fine-tuning
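The per-scale loss $\text{Loss}_i = \mathcal{L}(W^i x_{\le i})$ can be sketched as follows. The function names and the tiny hand-picked matrices are assumptions for illustration; how the per-scale losses are combined is not stated in the notes, so the sketch only returns them as a list.

```python
import math
from typing import List, Sequence

def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def multi_chain_loss(x: Sequence[float], dims: Sequence[int],
                     heads: List[List[List[float]]], target: int) -> List[float]:
    """Loss_i = CE(W^i x_{<=i}, target) for every scale i."""
    losses = []
    for i, W in enumerate(heads):
        prefix = x[: sum(dims[: i + 1])]   # scale i sees only x_{<=i}
        logits = [sum(w * v for w, v in zip(row, prefix)) for row in W]
        losses.append(cross_entropy(logits, target))
    return losses

dims = [1, 1]
x = [1.0, 2.0]
heads = [
    [[0.0], [0.0]],            # W^1: vocab 2 x 1, gives uniform logits
    [[0.0, 0.0], [0.0, 1.0]],  # W^2: vocab 2 x 2, uses the second chain
]
losses = multi_chain_loss(x, dims, heads, target=1)
```

Each scale needs its own head $W^i$ and its own softmax over the vocabulary, which is where the extra cost mentioned above comes from.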

Experiments

  • Setup
    • Pre-trained on a 0.2T corpus
    • Uses 32 Nvidia A100 40GB GPUs (jealous)
    • The baseline model uses the $\mathcal{C} = \{32\}$ setting; everything else follows the Llama-3.2-1B configuration
    • The CoLM series uses $\mathcal{C} = \{16, 16\}$ and $\mathcal{C} = \{8, 8, 8, 8\}$
    • Chain-of-Linear takes fewer parameters, so the dimension is increased
    • Evaluated on commonsense tasks in a zero-shot setting
  • Results
    • KV sharing lowers performance slightly
    • Surprisingly, the {16, 16} setting does better than {8, 8, 8, 8}
      • However, using more chains provides more sub-models (8, 16, 24)
    • Every model is a CoLM with a single chain, so it can be expanded by attaching more chains (increasing the dimension)!
      • With the {32, 8} setting, 0.8B parameters are added
      • Existing knowledge is preserved and training is fast!
    • Dynamic inference
      • Smaller sub-models can be deployed!
    • With CoLM-Air, all keys and values are computed in the first chain, which makes prefilling very fast
    • MInference is an inference technique; applying it on top shows the same trend
    • During fine-tuning, only the later chains can be fine-tuned, as if the base model had been expanded
      • This prevents catastrophic forgetting!
      • Fine-tuning only part of the model still raises performance considerably
      • It is even compatible with LoRA
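Why tuning only the later chains leaves the base sub-model intact can be seen in a two-chain toy with scalar weights. This is an illustration under simplifying assumptions (hand-written gradient descent on a squared error, hypothetical names `forward` and `finetune_chain2`), not the paper's training procedure.

```python
from typing import Dict, Tuple

def forward(params: Dict[str, float], x1: float, x2: float) -> Tuple[float, float]:
    y1 = params["w11"] * x1                        # chain 1: depends on x1 only
    y2 = params["w21"] * x1 + params["w22"] * x2   # chain 2: depends on x1 and x2
    return y1, y2

def finetune_chain2(params: Dict[str, float], x1: float, x2: float,
                    target: float, lr: float = 0.05, steps: int = 20) -> Dict[str, float]:
    """Gradient descent on (y2 - target)^2, updating chain-2 weights only."""
    p = dict(params)
    for _ in range(steps):
        _, y2 = forward(p, x1, x2)
        grad_y2 = 2.0 * (y2 - target)
        p["w21"] -= lr * grad_y2 * x1   # chain-2 weights move
        p["w22"] -= lr * grad_y2 * x2
        # p["w11"] is frozen: the base (chain-1) sub-model is left untouched
    return p

base = {"w11": 0.5, "w21": 0.0, "w22": 0.0}
tuned = finetune_chain2(base, x1=1.0, x2=2.0, target=3.0)
y1_before, _ = forward(base, 1.0, 2.0)
y1_after, y2_after = forward(tuned, 1.0, 2.0)
```

The chain-1 output is bit-for-bit unchanged after tuning, because none of the weights it depends on were updated, while the chain-2 output moves toward the new target.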

Categories

research