27 March 2026

How Do Transformers Learn to Associate Tokens: Gradient Leading Terms Bring Mechanistic Interpretability


Review

Nickname · One-line review · Rating (out of 5)
눈물 • Strength: Expresses the transformer's black-box behavior as an interpretable structure of statistical relations, and experimentally checks that the theoretical hypothesis approximates actual performance.
• Weakness: Only the early-gradient structure is analyzed; capturing the full picture is limited by the leading-term restriction. Then again, a real transformer is so complex that this may have been the only feasible way to experiment. (Even so, they unpack and interpret the early structure well.)
• Suggestion: Since interpreting only the early gradients was shown to approximate the actual structure, extending the interpretation to the full gradient might make the entire transformer interpretable..
3.3
피땀 • Strength: Solid research motivation: analyzing the patterns that emerge in the transformer's self-attention and how they arise. Experiments are run in a setting much closer to reality than prior work.
• Weakness & suggestion: The claim is said to hold only in the early training stage; experimental results from later in training would be nice to see.
4.1
웃으면서 보자 • Strength: Interprets the transformer training process well, both mathematically and experimentally. Token-level analyses seem to be appearing quite often lately, and this paper fits that line of work well. Well-timed research.
• Weakness: Tokens are bundled together in the experiments without considering token complexity, semantic complexity, knowledge specificity, and so on.
• Suggestion: Segmenting tokens further and analyzing the resulting trends would be meaningful.
3.7
thumbs-up
• Strength: Systematically analyzes the gradient behavior of transformers (effectively all LLMs). In particular, it uncovers how the model acquires meaning through combinations of basis functions.
• Weakness & suggestion: Why only early training? It does look like things saturate after a few gradient-descent steps, but showing that with one more experiment would have been nice!
4.2
독수리오형제 • Strength: Where prior work mostly focused on analyzing the results after training had finished, this paper goes after the generative principles at work during training.
• Weakness: The closed-form explanation focuses on how associations first form in early-stage training, so whether the same explanatory power holds once training has progressed seems to be a separate question.
• Suggestion: This paper deals with token-level associations, but it looks extensible to higher-level reasoning mechanisms as well.
3.7
삐질 • Strength: Where existing interpretability work analyzed what ended up being learned, this paper mathematically derives how learning begins.
• Weakness: Semantic association is interpreted mainly through simple statistics such as co-occurrence; it is unclear whether this framework extends to complex reasoning.
• Suggestion: Identify the threshold at which the higher-order terms take over as training progresses, and use it to interpret when complex reasoning emerges.
3.9
ํŒ์ฝ˜โ€ข ์žฅ์ : ์ตœ๋Œ€ํ•œ ์‹ค์ œ์™€ ์œ ์‚ฌํ•œ ์„ค์ •์—์„œ LLM ๋ถ„์„. 3๊ฐœ์˜ ์ฃผ์š” basis function์œผ๋กœ ํ•™์Šต ๊ฐ€์ค‘์น˜ ์„ค๋ช…
โ€ข ๋‹จ์  & ๋ณด์™„์ : Figure6 ํฌํ•จํ•ด์„œ, ๋ณด๋‹ค ํฐ ๋ชจ๋ธ ๋ถ„์„ํ•œ ๊ฒฐ๊ณผ๊ฐ€ ์žˆ์—ˆ์œผ๋ฉด ๋” ํ˜„์‹ค์ ์ด์—ˆ์„๋“ฏ
3.7
ํŒŒ์ด์–ด โ€ข ์žฅ์ : Transformer์˜ ๊ฐ€์ค‘์น˜ ๊ตฌ์กฐ๋ฅผ ํ•ด์„ ๊ฐ€๋Šฅํ•˜๋„๋ก ํ•˜์—ฌ ํ•™์Šต ๊ณผ์ • ์ค‘์˜ ์›๋ฆฌ๋ฅผ ๋ฐํ˜€๋‚ธ ๊ฒƒ.
โ€ข ๋‹จ์ : ํ•™์Šต ํ›„๋ฐ˜๋ถ€์— ๋Œ€ํ•œ ๊ฒฐ๊ณผ๊ฐ€ ์—†์œผ๋ฉฐ, Cosine Simliarity๊ฐ€ ๋‚ฎ์•„์ง€๋Š” ์ด์œ ๊ฐ€ ๋ฌด์—‡์ธ์ง€?
โ€ข ๋ณด์™„: ๋ณด๋‹ค ๋ณต์žกํ•œ ๋ชจ๋ธ์— ๋Œ€ํ•œ ์‹คํ—˜ ๊ฒฐ๊ณผ๋„ ์žˆ์–ด์•ผ ํ•จ.
3.4
멍쿠림보 • Very novel and sound, but I couldn't fully understand it ㅜㅜ. Interpreting how attention is learned is valuable, but also examining its relationship with the MLP layers might have helped our understanding of LLMs. The potential to build on these insights looks large!
4
초콜릿 • Strength: Mathematically derives, and experimentally confirms, that transformers form semantic relations early in training through a combination of three basis functions.
• Weakness: The analysis is limited to the early training stage, so we cannot tell whether the same structure persists after the model is fully trained.
• Suggestion: It would be interesting to segment the tokens and see on which token types each basis function shows up most strongly.
3.8

TL; DR

💡

Early in training, transformers imprint three kinds of statistical structure directly into their weights, and combinations of these alone form semantic relations and attention.

Summary

  • ์—ฐ๊ตฌ์ง„: ์œ„์Šค์ฝ˜์‹  ๋Œ€ํ•™, ์‹œ๋“œ๋‹ˆ ๊ณต๊ณผ๋Œ€ํ•™
  • ์ธ์šฉ์ˆ˜ : 2

์—ฐ๊ตฌ ๋™๊ธฐ

  • Self-attention ๊ธฐ๋ฐ˜ LLM์€ factual ์ง€์‹๊ณผ word knowledge๋ฅผ ๋ชจ๋‘ ํ•™์Šต

    โ‡’ ๋ชจ๋ธ ๋‚ด๋ถ€์—์„œ ์–ด๋–ค ๊ตฌ์กฐ๊ฐ€ ๋งŒ๋“ค์–ด์ง€๊ณ , ์–ด๋–ป๊ฒŒ ํ•™์Šต๋˜๋Š”๊ฑธ๊นŒ?
    โ‡’ ์•„๋ž˜์™€ ๊ฐ™์€ ํŒจํ„ด ๋ฐœ๊ฒฌ

    • induction heads (ํŒจํ„ด ๋ณต์‚ฌ)
    • linear semantic relations (์„ ํ˜• ์˜๋ฏธ ๊ด€๊ณ„)
    • topic clustering (์ฃผ์ œ๋ณ„ ๋ฌถ์ž„)
  • ๊ฒฐ๋ก : Semantic association (๋‹จ์–ด๋“ค ๊ฐ„ ์˜๋ฏธ์  ์—ฐ๊ฒฐ)์ด LLM์˜ ํ•ต์‹ฌ ๋Šฅ๋ ฅ์ด๋‹ค!
    • ์ •์˜: ํ† ํฐ๋“ค ์‚ฌ์ด์˜ ํ†ต๊ณ„์ (์–ผ๋งˆ๋‚˜ ์ž์ฃผ?) + ๊ธฐ๋Šฅ์ (๋ฌธ์žฅ ์•ˆ์—์„œ ๋น„์Šทํ•œ ์—ญํ• ์„ ํ•˜๋Š”์ง€?) ๊ด€๊ณ„
      • bird โ†” flew โ‡’ ๊ฐ™์ด ์ž์ฃผ ๋“ฑ์žฅ
      • car โ†” truck โ‡’ ์„œ๋กœ ๋Œ€์ฒดํ•˜๊ธฐ ์‰ฌ์›€
      • country โ†”capital โ‡’ ๋น„์Šทํ•œ ์˜๋ฏธ๋กœ ๋ฌถ์ž„

      โ†’ ์ด๋Ÿฐ ์˜๋ฏธ์  ์—ฐ๊ด€์„ฑ์ด ์žˆ์–ด์•ผ ๋ฌธ์žฅ ์ƒ์„ฑ๊ณผ ์ผ๋ฐ˜ํ™”๊ฐ€ ๊ฐ€๋Šฅ

"So how does a transformer learn semantic associations between words?"

⇒ They emerge naturally from gradient-descent optimization
⇒ A line of research has formed around figuring out how these structures arise during training

Limitations of Prior Work

Because the way transformers learn is highly complex, prior work adopted unrealistic assumptions such as:
  • Synthetic structured languages
    • toy languages with repetitive patterns or fully regular grammar
    • far removed from real natural language
  • Simplified architectures with positional encoding or residual connections removed
    • word order matters in natural language, but such models lose the ability to convey order information → they behave like bag-of-words
    • removing residual connections prevents early features from being preserved and breaks gradient flow
  • Unrealistic training schemes
    • component-wise training instead of end-to-end / freezing some of the weights
    • unlike how real LLMs are trained (gradients propagate through the entire network)

Proposed Idea

  • Interpret transformer weights mathematically by analyzing the leading term of the early-training gradient!
    • transformers do most of their semantic-relation learning in the early phase, and it is preserved afterwards
    • early on, the gradient is simple enough that the weights can be approximated in closed form

⇒ Finding: the learned weights are not mere co-occurrence counts; they are expressed as a combination of three basis functions!

  • What is a basis function?
    • the most elementary transformation unit out of which meaning is composed

      ⇒ a rule / function that maps a token to its relations with other tokens

      • e.g., "fish" → [pond, water, lake, swim, ...]
  • Bigram mapping: dependencies between adjacent tokens
  • Interchangeability mapping: token-level similarity (e.g., synonyms, grammatical roles)
  • Context mapping: higher-order semantic relations (long-range)

Theoretical Analysis

  • ์–ดํ…์…˜ ๊ธฐ๋ฐ˜ ํŠธ๋žœ์Šคํฌ๋จธ์˜ ๊ฐ€์ค‘์น˜๊ฐ€ gradient์˜ ๊ฐ€์žฅ ์ค‘์š”ํ•œ ๋ถ€๋ถ„(leading term)์œผ๋กœ ๋Œ€๋ถ€๋ถ„ ์„ค๋ช…๋จ
    • ์ž‘์€ ์ดˆ๊ธฐ๊ฐ’(Gaussian initialization)์—์„œ ์‹œ์ž‘ํ•ด์„œ, learning rate์™€ step ์ˆ˜๊ฐ€ ๋„ˆ๋ฌด ํฌ์ง€ ์•Š์€ ์ดˆ๊ธฐ ํ•™์Šต ๊ตฌ๊ฐ„ ๊ฐ€์ •
    • gradient descent๋ฅผ ๋ช‡ ๋ฒˆ ์ง„ํ–‰ํ•˜๋ฉด, ํŠธ๋žœ์Šคํฌ๋จธ์˜ ๊ฐ ๊ฐ€์ค‘์น˜๋“ค์ด ํŠน์ •ํ•œ ํ˜•ํƒœ์— ๊ฐ€๊น๊ฒŒ ๋จ

    โ‡’ ๋ชจ๋“  layer๊ฐ€ ๋™์ผํ•œ ๊ตฌ์กฐ๋ฅผ ๋ฐฐ์šด๋‹ค!

    • 3๊ฐ€์ง€ basis function
      • Bigram mapping Bห‰\bar{B}๏ปฟ
        • Bห‰ij=Pt(ei)Pt(ejโˆฃei)โˆ’Pt(ei)/โˆฃVโˆฃ\bar{B}_{ij}=\mathcal{P}_t(e_i)\mathcal{P}_t(e_j|e_i) - \mathcal{P}_t(e_i)/|\mathcal{V}|๏ปฟ
          • Pt(ei)\mathcal{P}_t(e_i)๏ปฟ: ํ† ํฐ eie_i๏ปฟ๊ฐ€ ์ „์ฒด ๋ฐ์ดํ„ฐ์—์„œ ์–ผ๋งˆ๋‚˜ ์ž์ฃผ ๋‚˜์˜ค๋Š”์ง€
          • ์ฒซ๋ฒˆ์งธ ํ•ญ : iโ†’j๋กœ์˜ ๋“ฑ์žฅ ํ™•๋ฅ 
          • ๋‘๋ฒˆ์งธ ํ•ญ: randomํ•˜๊ฒŒ ๋‹ค์Œ ํ† ํฐ์ด ๋“ฑ์žฅํ•  ํ™•๋ฅ 

        โ‡’ ํ† ํฐ eie_i๏ปฟ ๋’ค์— ํ† ํฐ eje_j๏ปฟ๊ฐ€ ์–ผ๋งˆ๋‚˜ ์ž์ฃผ ์˜ค๋Š”์ง€๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ๋‘ ํ† ํฐ ์‚ฌ์ด์˜ โ€œ์—ฐ๊ด€์„ฑโ€์„ ๋‚˜ํƒ€๋ƒ„

      • Interchangeability mapping โˆ‘Bห‰\sum_{\bar{B}}๏ปฟ
        • โˆ‘Bห‰=Bห‰TBห‰\sum_{\bar{B}}=\bar{B}^T\bar{B}๏ปฟ (ํ† ํฐ ๊ฐ„ ์ƒ๊ด€๊ด€๊ณ„ ํ–‰๋ ฌ)
          • ๋‘ ํ† ํฐ์ด ๋น„์Šทํ•œ ๋ฐฉ์‹์œผ๋กœ ๋“ฑ์žฅํ•˜๋Š”์ง€ ํŒŒ์•…
        • ์ด๋ฅผ ํ’€์–ด์“ฐ๋ฉด ์•„๋ž˜ ์ˆ˜์‹๊ณผ ๊ฐ™์Œ
        • ์ž์ฃผ ๋‚˜์˜ค๋Š” ํ† ํฐ์ผ์ˆ˜๋ก ์ค‘์š”ํ•˜๋ฉฐ, ๋‘ ํ† ํฐ์ด ๋™์ผํ•œ ์ด์ „ ํ† ํฐ์„ ๊ณต์œ ํ•˜๋Š” ์ง€ ์—ฌ๋ถ€๋ฅผ ๊ณ ๋ ค
          ํ† ํฐ์•ž์— ์˜ค๋Š” ํ† ํฐ
          dogthe, big
          catthe, small

        โ‡’ ์ด์ „์— ๋“ฑ์žฅํ•˜๋Š” ํ† ํฐ ๋ถ„ํฌ๊ฐ€ ๋น„์Šทํ•˜๋ฉด ๊ทธ ํ† ํฐ๋“ค์€ ๋น„์Šทํ•œ ๊ธฐ๋Šฅ/์—ญํ• ์„ ํ•จ

      • Context mapping ฯ•ห‰\bar{\phi}๏ปฟ
        • ํ† ํฐ eie_i๏ปฟ ์ฃผ๋ณ€ context์— ํ† ํฐ eje_j๏ปฟ๊ฐ€ ์–ผ๋งˆ๋‚˜ ์ž์ฃผ ๋“ฑ์žฅํ•˜๋Š” ์ง€?
          • ํ† ํฐ eie_i๏ปฟ์˜ ๋“ฑ์žฅ ์œ„์น˜๋ฅผ ๊ธฐ์ค€์œผ๋กœ, ์ด์ „ prefix๋ฅผ ๋ชจ๋‘ ์‚ดํŽด๋ด์„œ ๊ทธ ์ค‘ eje_j๏ปฟ๊ฐ€ ๋“ฑ์žฅํ•˜๋Š” ์ง€ ํ™•์ธ

        โ‡’ ์ด context ์ •๋ณด๊ฐ€ Attention & Value ํ–‰๋ ฌ์„ ์ƒ์„ฑํ•˜๋Š” ํ•ต์‹ฌ ์š”์†Œ๊ฐ€ ๋จ

    • ํŠธ๋žœ์Šคํฌ๋จธ ๊ฐ ๊ฐ€์ค‘์น˜ ํ–‰๋ ฌ์ด ๋ฌด์Šจ ์˜๋ฏธ๋ฅผ ๊ฐ€์ง€๋‚˜? (๋ฐ์ดํ„ฐ ํ†ต๊ณ„ ๊ด€์ )
      • Output matrix WOW_O๏ปฟ

        โ‡’ Bigram mapping์ด leading term! โ†’ output layer๋Š” bigram ๋ชจ๋ธ์ด๋‹ค

        โ‡’ ํŠธ๋žœ์Šคํฌ๋จธ๋Š” next-token ๋ถ„ํฌ๋ถ€ํ„ฐ ํ•™์Šต

      • Value matrix V(l)V^{(l)}๏ปฟ

        โ‡’ ํ† ํฐ์ด ์ฃผ์–ด์ง€๋ฉด context๋ฅผ ์š”์•ฝํ•œ ํ›„, next-token์— ๋ฐ˜์˜

        โ‡’ ๋‹จ์ˆœํžˆ ๊ฐ’์„ ์ „๋‹ฌํ•˜๋Š” ๊ฒƒ์ด ์•„๋‹Œ, ์˜๋ฏธ ํ‘œํ˜„์„ ์ƒ์„ฑ

      • Attention matrix W(l)W^{(l)}๏ปฟ
        • ํ† ํฐ ๊ฐ„ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๋‹ด๊ณ ์žˆ์Œ
        • query-key๊ฐ€ ํฌ๊ฒŒ ๋‚˜์˜จ๋‹ค๋Š” ๊ฑด ๋‹จ์ˆœํžˆ โ€œ๋น„์Šทํ•˜๋‹คโ€๊ฐ€ ์•„๋‹ˆ๋ผ ์ € ํ† ํฐ์„ ๋ณด๋ฉด ์ง€๊ธˆ ํ•„์š”ํ•œ ๋‹ค์Œ ๋‹จ์–ด ์˜ˆ์ธก์ด ๋” ์ž˜ ๋œ๋‹ค๋Š” ๋œป
        • ๊ทธ๋Ÿฌ๋ฉด Q๊ฐ€ ์–ด๋–ป๊ฒŒ ๋งŒ๋“ค์–ด์ง€๋‚˜?
          • ๋จผ์ € ๊ฐ ์ด์ „ ํ† ํฐ์ด ํ˜„์žฌ ๋ชฉํ‘œ ์ถœ๋ ฅ(next token)๊ณผ ์–ผ๋งˆ๋‚˜ ์˜๋ฏธ์ ์œผ๋กœ ์—ฐ๊ฒฐ๋˜๋Š”์ง€ score๋ฅผ ๋งค๊น€
          • Score์—์„œ ๋ฏธ๋ž˜ ํ† ํฐ์€ ์ œ๊ฑฐํ•˜๊ณ , ์œ„์น˜์— ๋”ฐ๋ผ ์ •๋ฆฌ
          • Score๋ฅผ โ€œ์ถœ๋ ฅ ํ† ํฐ๊ณผ์˜ ๊ด€๊ณ„โ€๊ฐ€ ์•„๋‹ˆ๋ผ ์–ดํ…์…˜์ด ์‹ค์ œ๋กœ ํ™œ์šฉํ•˜๋Š” โ€œ์ฟผ๋ฆฌ ํ† ํฐ๊ณผ์˜ ๊ด€๊ณ„โ€๋กœ ๋ณ€ํ™˜

  • ๊ฐ€์ค‘์น˜๋“ค์ด ์‹ค์ œ๋กœ ๊ฐ™์ด ์–ด๋–ป๊ฒŒ ์ž‘์šฉํ•˜๋Š”๊ฐ€?
    • ๊ธฐ์กด ํŠธ๋žœ์Šคํฌ๋จธ single-layer ์ˆ˜์‹
    • ์ด ๋…ผ๋ฌธ์˜ ๊ทผ์‚ฌ์น˜ ๋Œ€์ž…

    โ‡’ ํŠธ๋žœ์Šคํฌ๋จธ๋Š” ํ†ต๊ณ„์  ๊ด€๊ณ„(B, ฮฆ, ฮฃB)๋ฅผ ์กฐํ•ฉํ•˜๋Š” ๊ตฌ์กฐ๋‹ค

Experiments

  • ์‹คํ—˜ ๊ฐœ์š”
    • ์‹คํ—˜ ๋ชฉํ‘œ: ์ด๋ก ์ด ์‹ค์ œ๋กœ ๋งž๋Š”์ง€ ๊ฒ€์ฆํ•˜๊ณ , ๊ฐ€์ค‘์น˜ ์•ˆ์— ์˜๋ฏธ์  ๊ตฌ์กฐ๊ฐ€ ์‹ค์ œ๋กœ ์กด์žฌํ•˜๋Š”์ง€ ํ™•์ธ
      • ๊ฐ€์ค‘์น˜๊ฐ€ leading term์œผ๋กœ ๊ทผ์‚ฌ๊ฐ€ ๋˜๋Š”์ง€?
      • 3๊ฐ€์ง€ basis function ๊ตฌ์กฐ๊ฐ€ ์‹ค์ œ๋กœ ๋‚˜ํƒ€๋‚˜๋Š”์ง€?
    • ๋ฐ์ดํ„ฐ์…‹: TinyStories
      • ํ–‰๋ ฌ์ด ํฌ๋ฉด ๋ถ„์„์ด ์–ด๋ ต๊ธฐ์— 3000 ๋‹จ์–ด๋กœ ์ œํ•œ
  • ์‹คํ—˜ ๊ฒฐ๊ณผ 1: 3-Layer Transformer ์ด๋ก  ๊ฒ€์ฆ
    โ€œSGD๋กœ ํ•™์Šต๋œ ๊ฐ€์ค‘์น˜์™€ ์ด๋ก ์œผ๋กœ๋ถ€ํ„ฐ ์œ ๋„๋œ leading term ๊ฐ€์ค‘์น˜ ๊ฐ„ ์œ ์‚ฌ๋„ ๋น„๊ตโ€
    • ๋ชจ๋“  ํ–‰๋ ฌ์— ๋Œ€ํ•ด ๊ฑฐ์˜ ์™„๋ฒฝํ•˜๊ฒŒ ์ผ์น˜ํ•œ๋‹ค! โ†’ neural net์ด ์•„๋‹ˆ๋ผ ๋‹จ์ˆœํ•œ ํ†ต๊ณ„๋กœ๋„ ์„ค๋ช…์ด ๊ฐ€๋Šฅํ•จ
    โ€œํ•™์Šต epoch ๋ณ€ํ™”์— ๋”ฐ๋ผ์„œ๋Š” ์œ ์‚ฌ๋„๊ฐ€ ์–ด๋–ป๊ฒŒ ๋ณ€ํ• ๊นŒ?โ€
    • ์ดˆ๊ธฐ ํ•™์Šต์€ ์ด๋ก  ๊ทธ๋Œ€๋กœ ์ง„ํ–‰๋˜๋ฉฐ, ๋ชจ๋ธ์ด ๋งŽ์ด ํ•™์Šต๋œ ์ดํ›„์—๋„ ์–ด๋А์ •๋„ ์œ ์ง€
  • ์‹คํ—˜ ๊ฒฐ๊ณผ 2: 3-Layer Transformer ์˜๋ฏธ ํ•ด์„ ๊ฒ€์ฆ
    โ€œ๊ฐ ํ† ํฐ๋งˆ๋‹ค ๊ฐ€์žฅ ๊ด€๋ จ๋œ ํ† ํฐ top 30์„ ๋ฝ‘์•„์„œ ์‹ค์ œ๋กœ ์˜๋ฏธ๊ฐ€ ๋งž๋Š”์ง€ ํ™•์ธโ€
    • Bห‰\bar{B}๏ปฟ โ†’ ๋‹ค์Œ์— ๋‚˜์˜ฌ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์€ ๋‹จ์–ด๋“ค
    • โˆ‘Bห‰\sum_{\bar{B}}๏ปฟ โ†’ ๋น„์Šทํ•œ ์—ญํ• ์„ ํ•˜๋Š” ๋‹จ์–ด๋“ค
    • ฯ•ห‰\bar{\phi}๏ปฟ โ†’ ๊ฐ™์€ ๋ฌธ๋งฅ์—์„œ ๋“ฑ์žฅํ•˜๋Š” ๋‹จ์–ด๋“ค

  • ์‹คํ—˜ ๊ฒฐ๊ณผ 3: ์‹ค์ œ LLM์—์„œ๋„ ์ ์šฉ ๊ฐ€๋Šฅํ•œ ์ง€ ๊ฒ€์ฆ
    • Pythia-1.4B ํ™œ์šฉ (ํ•™์Šต ์ค‘๊ฐ„๋งˆ๋‹ค checkpoint ์ œ๊ณตํ•ด์„œ layer ๋ณ„ ๋ถ„์„ ๊ฐ€๋Šฅ)
    • ํ•˜์ง€๋งŒ ์‹ค์ œ LLM์€ MLP, multi-head attention ๋“ฑ ์ถ”๊ฐ€์ ์ธ component๋ฅผ ํฌํ•จํ•˜๊ธฐ์— ๊ฐ€์ค‘์น˜ ํ•ด์„์ด ๋ถˆ๊ฐ€

    โ‡’ ๊ฐ€์ค‘์น˜๋ฅผ ์ง์ ‘ ๋ณด์ง€ ๋ง๊ณ  ์ž„๋ฒ ๋”ฉ ๊ธฐ๋ฐ˜์œผ๋กœ ํ† ํฐ ๊ฐ„ ์ƒ๊ด€๊ด€๊ณ„๋ฅผ ๊ฐ„์ ‘์ ์œผ๋กœ ์ถ”์ถœํ•˜์ž!

    1. ํŠธ๋žœ์Šคํฌ๋จธ์— ๊ฐ ํ† ํฐ์„ input์œผ๋กœ ๋ถ€์—ฌ
    1. Layer ํ†ต๊ณผ ์ „ ์ž„๋ฒ ๋”ฉ, Layer ํ†ต๊ณผ ํ›„ ์ž„๋ฒ ๋”ฉ, ์–ดํ…์…˜ ํ†ต๊ณผ ํ›„ ์ž„๋ฒ ๋”ฉ ๊ฐ๊ฐ ๊ณ„์‚ฐ
    1. ์ด๋ฅผ ๊ธฐ๋ฐ˜์œผ๋กœ ์ž„๋ฒ ๋”ฉ ํ–‰๋ ฌ ๊ตฌ์„ฑ

    โ‡’ ๊ทผ๋ฐ ์ด๋ก ์—์„œ๋Š” ํ† ํฐ ๊ฐ„ ์ƒ๊ด€๊ด€๊ณ„์ธ๋ฐ, ์‹ค์ œ๋กœ๋Š” ํ† ํฐ โ†” ์ž„๋ฒ ๋”ฉ ๊ฐ„ ์ƒ๊ด€๊ด€๊ณ„๋ผ์„œ ์ง์ ‘ ๋น„๊ต๊ฐ€ ์–ด๋ ค์›€
    1. Leading term ๊ณ„์‚ฐ (OpenWebTest ๋ฐ์ดํ„ฐ์—์„œ ์‹ค์ œ ํ†ต๊ณ„ ๊ณ„์‚ฐ)

    1. ์ •๊ทœํ™” ํ›„ covariance ํ–‰๋ ฌ ๊ณ„์‚ฐ (์ž„๋ฒ ๋”ฉ์„ ๊ณต๋ถ„์‚ฐ์œผ๋กœ ๋ณ€ํ™˜ํ•˜๋ฉด ํ† ํฐ ํ–‰๋ ฌ๋กœ ๋ณ€ํ™˜๋จ)
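The normalization-and-covariance step above can be sketched as follows (a hypothetical embedding matrix; the vocabulary size and dimension are illustrative, not Pythia's actual values):

```python
import numpy as np

# Hypothetical setup: row i of E is the embedding of token i, taken
# before/after a layer as in the earlier steps.
rng = np.random.default_rng(1)
E = rng.normal(size=(3000, 64))  # |V| tokens x d-dimensional embeddings

# Center and normalize each embedding, then form the token-token
# correlation matrix, which is directly comparable with the
# token-token leading-term statistics.
E = E - E.mean(axis=0, keepdims=True)
E = E / np.linalg.norm(E, axis=1, keepdims=True)
C = E @ E.T  # (|V|, |V|) token correlation matrix
```

The key point is the shape change: a |V| × d embedding matrix becomes a |V| × |V| token matrix, the same shape as the theoretical statistics, so the two can be compared with cosine similarity.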

    Real LLMs, too, train almost exactly as the theory predicts early on,
    then build increasingly complex representations on top of this structure.

    • the first layer has little context to work with yet, so its cosine similarity for attention is low
      • the MLP layers serve to shape the embedding space
      • afterwards, the influence of attention grows
    • similarity rises as increasingly complex context is learned

    "How closely does each attention head resemble the theoretically computed structure?"
    • early layers learn semantic association late → cosine similarity still low
    • in the middle layers, heads begin to split into distinct roles
    • in the final layers, variance decreases → each head's role becomes clearly differentiated

Categories

Interpretability, research