07 January 2026

What Happens During the Loss Plateau? Understanding Abrupt Learning in Transformers

Review

Nickname | One-line review | Rating (out of 5)

마스킹테이프 | First time seeing the term "abrupt learning." I had run into the loss stalling and then suddenly dropping a few times while training other tasks, but this seems to be the first study I have seen that digs into why. Worth referencing for why repetitive tokens arise, how the attention map is probed and improved, and how the authors verified it. | 4.0
귤 | Curious whether a loss-plateau-like phenomenon also appears in other architectures such as Mamba. If the Transformer plateau is a bottleneck that arises while attention searches for the correct token alignment, models without attention might show a weaker plateau, or possibly none at all. | 3.7
동까스 | I did not know loss plateaus occur. It makes me think that ending training too early at a plateau could stop learning right before breaking through it (like quitting mining just in front of the diamond), which raises the question of how to time the end of training. | 3.7
수면장애석사 | I actually experienced repetition bias and a loss plateau while practicing code in my first semester of the master's program, so it was fascinating and fun to see them officially discussed in a paper. The loss plateau is likely a problem of the Transformer architecture itself!! So the trend may well shift to Mamba within the next two years! | 4
이어폰 | This connects to another paper from this week's study about how strongly training data influences the early phase of training (the training-data temporal-dependence paper). This paper focuses on revealing the phenomena, which makes me more curious about their causes. | 3.7
사과 | While experimenting with Transformer-based models I often found it strange when the loss suddenly increased; this paper let me recognize it as a loss plateau. In future experiments I can adjust when to inspect the loss. | 4.7
7일 | The experiments that intuitively probe the representation changes invisible on the surface during the plateau were impressive. I also learned of the MWS task; it seems usable for detecting the signal right before a specific task's performance suddenly collapses. Could it also reduce catastrophic forgetting? | 4.4

TL;DR

💡

Explores the abrupt learning phenomenon in Transformer training, where the drop in loss stalls in the early phase and then suddenly happens all at once.

Summary

Motivation

  • The abrupt learning phenomenon seen when training Transformers on mathematical or algorithmic tasks
    • Definition: the model's performance stays stagnant for a long time, then suddenly improves sharply
  • This paper aims to identify the universal characteristics and basic mechanism of this phenomenon during training

Contribution

  • Small Transformers are trained on a simple algorithmic task to investigate several phenomena associated with abrupt learning

    • Task used: moving-window-sum (MWS), window size 2 (a data-generation sketch follows after this block)
      • given a sequence $x_1, \dots, x_n$, the model must output $y_1, \dots, y_n$ after a $SEP$ token
        • each $x_i$ is one of the numbers 0, 1, 2, …, 17
      • $y_1$ is $x_1$ as-is; $y_2$ is the remainder of $(x_1 + x_2)$ divided by $p$ (= 17); $y_3 = (x_2 + x_3) \bmod p$; …
      • ⇒ a task whose ground truth is well known, so the model's training progress can be measured precisely
    • ๋ชจ๋ธ ์•„ํ‚คํ…์ณ: 1-layer, 1-head Transformer
      • ์ด ๊ตฌ์กฐ๋กœ๋„ ์ฃผ์–ด์ง„ ํƒœ์Šคํฌ ์™„๋ฒฝํžˆ ์ˆ˜ํ–‰ ๊ฐ€๋Šฅ
        • (s1,...,sL)(s_1, ..., s_L)๏ปฟ : ์ž…๋ ฅ ํ† ํฐ ์‹œํ€€์Šค
        • MLPMLP๏ปฟ : 2-layer NN
        • IdId๏ปฟ : residual connection
        • LMLM๏ปฟ : linear layer, mapping hidden state to logits
      • greedy decoding ์‚ฌ์šฉ
      • โ‡’ ์ž‘์€ ๋ชจ๋ธ์„ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ ๋‚ด๋ถ€ ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ์‰ฝ๊ฒŒ ๋ถ„์„ํ•˜๊ณ  ํ•ด์„ํ•  ์ˆ˜ ์žˆ์Œ
    • Training
      • minimize the next-token-prediction cross-entropy loss over the full sequence $(x_1, \dots, x_n, SEP, y_1, \dots, y_n)$
        • trained for 1 epoch with 256 training samples
      • accuracy: the average prediction accuracy over the $n$ tokens $y_1, \dots, y_n$ of the output part

    • abrupt learning
      • during training, the training loss stays at a sub-optimal value for a substantial number of steps, followed by a sharp accuracy increase and loss drop
      • ⇒ confirms the abrupt learning phenomenon, where the optimal solution is suddenly learned
      (Figure: training loss and accuracy curves)

    • attention map
      • for the experimental task, the optimal attention pattern is for each output token $y_i$ to attend only to the input tokens relevant to computing it
        • $y_1$ attends to $x_1$; $y_2$ attends to $x_1, x_2$; …
      • an attention progress measure (APM) is used to observe how the attention pattern changes (a plausible form is sketched after this block)
        • $A_{ij}$: the attention score assigned to the $j$-th token when computing the $i$-th output token
        • $\Omega$: the set of position pairs of the optimal attention map
      • → during training, the APM increases monotonically from 0 to about 0.8, on a more gradual curve than the loss/accuracy
        • the sharp loss drop occurs around step 150, but the APM has already increased substantially before that point
        • ⇒ the loss drop is abrupt, but the attention pattern is learned gradually before it
        (Figure: attention progress measure)
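
The summary keeps only the legend of the APM, not its formula. A natural reconstruction consistent with that legend, and with a measure that climbs from 0 to about 0.8, is the attention mass on the optimal position pairs averaged over those pairs; this is an assumption, not the paper's verbatim definition.

```latex
% Plausible reconstruction from the legend above (an assumption, not the paper's exact formula):
% average attention mass assigned to the position pairs of the optimal attention map.
\mathrm{APM} = \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} A_{ij}
```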

  • During the early loss plateau period of Transformer training, the model often learns a partial solution

    • e.g., in the moving-window-sum task, predicting $y_1$, which only requires outputting the first input token $x_1$ as-is, is learned quickly, but the overall loss remains high and accuracy on the later tokens is poor
      (Figure: partial solution accuracy)
      • partial solution accuracy, the prediction accuracy on the first output token, rises quickly (both accuracy measures are sketched after this block)
      • the overall loss, by contrast, only drops after many more training steps
      • ⇒ during the early plateau the overall loss barely decreases, but the model's learning of a partial solution is progressing
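
A minimal sketch of the two accuracy measures discussed above, assuming predictions and targets for the $n$ output positions are available as integer lists (the function and argument names are this sketch's own):

```python
def output_accuracies(pred: list[int], target: list[int]) -> tuple[float, float]:
    """pred/target: predicted and ground-truth tokens at the n output positions y_1..y_n.
    Returns (overall accuracy over all n tokens, partial-solution accuracy on y_1)."""
    n = len(target)
    overall = sum(p == t for p, t in zip(pred, target)) / n
    partial = float(pred[0] == target[0])   # the part of the task learned early in the plateau
    return overall, partial

print(output_accuracies([4, 13, 9, 0, 0], [4, 13, 9, 12, 15]))  # (0.6, 1.0)
```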

  • ์ •์ฒด๊ธฐ๋™์•ˆ ๋ชจ๋ธ์ด ๋ฐ˜๋ณต์  ํ† ํฐ์„ ์ถœ๋ ฅํ•˜๋Š” ๊ฒฝํ–ฅ์ธ repetition bias๊ฐ€ ๊ฐ•ํ•˜๊ฒŒ ๋‚˜ํƒ€๋‚จ

    • repetition frequency: repetition bias ์ •๋Ÿ‰ํ™” ์ง€ํ‘œ
      ๋‹ค์Œ ํ† ํฐ๊ณผ ๋™์ผํ•œ ํ† ํฐ ์ถœ๋ ฅํ•œ ๋นˆ๋„์ˆ˜
    • ๊ฒฐ๊ณผ
      repetition frequency
      • repetition frequency๊ฐ€ ํ›ˆ๋ จ ์‹œ์ž‘ ์‹œ ์ž‘์•˜๋‹ค๊ฐ€ ์ฒ˜์Œ 50 ์Šคํ…๋™์•ˆ 0.8๊นŒ์ง€ ์ƒ์Šน
      • โ‡’ ์ดˆ๊ธฐ ์ •์ฒด๊ธฐ์˜ ๊ฐ•ํ•œ repetition bias ํ™•์ธ
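
A minimal sketch of the repetition-frequency metric as described above, counting the fraction of adjacent output positions that carry the same token; the paper's exact windowing may differ.

```python
def repetition_frequency(tokens: list[int]) -> float:
    """Fraction of positions whose token equals the immediately following token."""
    if len(tokens) < 2:
        return 0.0
    repeats = sum(a == b for a, b in zip(tokens, tokens[1:]))
    return repeats / (len(tokens) - 1)

print(repetition_frequency([3, 3, 3, 7, 7]))  # 0.75: strong repetition bias
```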

  • The output repetition bias is accompanied by representation collapse, where the hidden representations of different tokens become nearly identical

    • pairwise cosine similarity between the hidden representations at output positions $i, j$: $\mathrm{sim}(h_i, h_j) = \frac{h_i \cdot h_j}{\lVert h_i \rVert \, \lVert h_j \rVert}$
      • $h_i$: hidden representation of the $i$-th token (just before the logit transformation)
    • the cosine similarity increases sharply in the early phase of training (a measurement sketch follows after this block)
      (Figure: cosine similarity)
      • ⇒ except for the first output position, which the partial solution already predicts correctly, the hidden representations of the output positions become nearly identical early in training
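
A minimal sketch of the collapse measure, averaging the pairwise cosine similarity of the hidden states $h_i$ at the output positions (PyTorch; the extraction point just before the logit projection follows the description above):

```python
import torch
import torch.nn.functional as F

def mean_pairwise_cosine(h: torch.Tensor) -> float:
    """h: (n_positions, d_model) hidden states taken just before the LM head.
    Returns the average cosine similarity over all position pairs i < j."""
    h = F.normalize(h, dim=-1)   # unit-normalize each hidden state
    sim = h @ h.T                # (n, n) cosine-similarity matrix
    iu = torch.triu_indices(h.shape[0], h.shape[0], offset=1)
    return sim[iu[0], iu[1]].mean().item()

# Nearly collapsed representations give a value close to 1.0:
base = torch.randn(1, 64)
print(mean_pairwise_cosine(base.repeat(6, 1) + 0.01 * torch.randn(6, 64)))
```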

  • Attention map learning is shown to also play an important role in forming the repetition, the representation collapse, and the loss plateau

    • bias the attention toward (or away from) the optimal attention map and check whether the repetition, representation collapse, and loss plateau shrink (or amplify); an intervention sketch follows after this block
    • for $(i, j) \in \Omega$, set the attention mask $M_{i,j} = c$; elsewhere $M_{i,j} = 1$
      • the original attention is modified by a Hadamard product with this attention mask, used in both training and inference
      • $c > 1$ biases toward the optimal attention map, and $0 < c < 1$ biases away from it

    • $c > 1$: the average cosine similarity between hidden states and the repetition both decrease, and the model converges faster
    • $0 < c < 1$: the model stays in the representation-collapse state longer and converges later; the repetition frequency also remains high throughout the plateau

    • ⇒ attention map learning plays an important role in forming the repetition, the representation collapse, and the loss plateau
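
A minimal sketch of the intervention on a single attention matrix. The summary does not say whether the Hadamard product is applied before or after the softmax, or whether rows are renormalized; this sketch assumes post-softmax weights with row renormalization.

```python
import torch

def biased_attention(attn: torch.Tensor, omega: list[tuple[int, int]], c: float) -> torch.Tensor:
    """attn: (L, L) post-softmax attention weights (an assumption of this sketch).
    omega: position pairs of the optimal attention map; c: bias strength."""
    mask = torch.ones_like(attn)
    for i, j in omega:
        mask[i, j] = c                    # M_ij = c on optimal pairs, 1 elsewhere
    biased = attn * mask                  # Hadamard product with the attention mask
    return biased / biased.sum(dim=-1, keepdim=True)  # renormalize rows (assumption)

# c > 1 pushes attention toward the optimal pairs; 0 < c < 1 pushes it away.
```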

  • Check whether the repetition bias and representation collapse also appear in the early pretraining stage of real LLMs, not just small Transformers (a checkpoint-probing sketch follows after this block)

    • LLMs: Pythia, OLMo-2 (open-source)
    • task: 100 questions randomly sampled from the ARC-Easy test data (grade-school-level multiple-choice science questions)
      • for each question, the model generates 8 tokens and the pairwise cosine similarity of their hidden representations is computed

    • repetition bias is found in the output sequences of the 14M, 1B, 1.4B, and 2.8B Pythia models at early training stages
      • at initialization the average cosine similarity is relatively low (0.4–0.65), but for all model sizes it jumps above 0.9 after only a few training steps
    • the OLMo-2 7B model also shows representation collapse similar to Pythia
      • at 150 steps of early training, the average cosine similarity of the representations is 0.93
      • by step 600 it falls to 0.43

    • ⇒ repetition bias and representation collapse really do occur in the early pretraining stage of LLMs
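
The Pythia suite publishes intermediate pretraining checkpoints as revisions on the Hugging Face Hub, so the measurement is easy to approximate. A rough sketch for one checkpoint follows; the revision name and the prompt are illustrative, and using the last-layer hidden state as the representation is this sketch's assumption.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Early-pretraining checkpoint; Pythia exposes training steps as Hub revisions.
name, rev = "EleutherAI/pythia-1b", "step1000"   # illustrative choice
tok = AutoTokenizer.from_pretrained(name, revision=rev)
model = AutoModelForCausalLM.from_pretrained(name, revision=rev)

prompt = "Which gas do plants absorb from the air?"  # ARC-Easy-style question (illustrative)
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=8, do_sample=False,
                     output_hidden_states=True, return_dict_in_generate=True)

# Last-layer hidden state at the position that produced each of the 8 generated tokens.
h = torch.cat([step[-1][0, -1:, :] for step in out.hidden_states], dim=0)  # (8, d_model)
h = torch.nn.functional.normalize(h, dim=-1)
sim = h @ h.T
iu = torch.triu_indices(h.shape[0], h.shape[0], offset=1)
print("mean pairwise cosine:", sim[iu[0], iu[1]].mean().item())
```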

Conclusion

  • Repetition bias and representation collapse occur in the early stage of Transformer training and are closely related to the loss plateau
    • the loss plateau may be the process of searching for the optimal attention map
  • Building on this paper's findings, future work on phenomena such as representation collapse, and on the causes of the slow learning of attention maps, looks promising

Categories

research