21 January 2026

An Analysis for Reasoning Bias of Language Models with Small Initialization

🥈

Review

Nickname · one-line review · rating (out of 5)

  • Maenggu (4.2): Seems to start from the premise that memory and reasoning are different things. Another paper this week concluded, roughly, that LLMs ultimately do memory-based reasoning, and this paper seems to probe that in more detail. Since an LLM is ultimately a Transformer, it made me realize experiments like this are possible. It also made me think we don't necessarily have to fixate on the results and processes of large models!
  • Egg Sushi (4.3): A paper that brings reasoning bias into the open(?) and, to that end, explicitly separates and analyzes reasoning and memory! The separation criterion is clear, the experimental design is plausible and logical, and the tendencies visible at the embedding level are well presented.
  • Gukbap (4.4): Experimentally works out that whether an LLM reasons or memorizes is not just a matter of data or architecture. The embedding-separation experiment is intuitive, which I liked. What if every query came with a different KG? That seems to favor structural generalization over memorization, so small init would probably be the right fit.
  • Hamburger (4.4): Clearly shows that a model's reasoning/memorization bias can change with the initialization scale. Since the initial training setup can strongly steer the model's "disposition" and convergence direction later in training, initialization could be designed to suit the model's target characteristics.
  • Pizza (4.1): A meaningful study in that it uses the embedding space to show that a model's initial training setup, scale, and dataset strongly influence even the direction training takes.
  • Chicken (4.6): Lots of papers on scaling per training stage lately; ultimately it's about maximizing efficiency per resource within a fixed training budget, right? The experiments demonstrating the reasoning bias were clearly understandable, which I liked.
  • Febreze (4.2): One of the papers debating whether memory and reasoning are different things and, if so, which comes first. On the other hand, for this topic I wonder whether it wouldn't be better to train the largest possible models extensively and then experiment and discuss...

TL;DR

💡

In Transformer-based models there is a bias, depending on the initialization scale, toward learning reasoning first or memorization first!

Summary

Team: Duke University (USA), Shanghai Jiao Tong University (China)

Citations: 4


  1. Investigates how the parameter initialization scale affects training and the task preferences of Transformer-based language models
    1. At a small initialization scale, the model is encouraged to perform reasoning tasks well
    2. At a large initialization scale, the model's preference is steered toward memorization tasks

    ⇒ The model's learning bias changes with the initialization scale

  2. Verifies this tendency on real datasets and with anchor functions
  3. Analyzes the cause of the initialization-scale-dependent phenomenon through the embedding space and the self-attention mechanism
    1. The paper proposes a theoretical framework that explains, from the perspective of training dynamics, why the phenomenon arises
    2. A study that deepens our understanding of how LLM initialization affects model performance

Introduction

Motivation

  • LLM์˜ Reasoning Task์— ๋Œ€ํ•ด์„œ๋Š” RHO-1๊ณผ ๊ฐ™์€ Data-driven Approach๊ฐ€ ๋งŽ์ด ์ œ์‹œ๋˜์–ด ์žˆ์œผ๋‚˜, LLM์ด ์ง„์งœ logical rule์„ ์ดํ•ดํ•˜๊ณ , reasoning์„ ์ˆ˜ํ–‰ํ•˜๋Š”์ง€ ์•„๋‹ˆ๋ฉด ์ฃผ์–ด์ง„ ๊ทœ์น™์„ ๋‹จ์ˆœํžˆ ๋”ฐ๋ผ๋งŒ ํ•˜๋Š”์ง€์— ๋Œ€ํ•œ ์˜๋ฌธ
    • ์ž‘์€ initialization scale์—์„œ๋Š” ๋ชจ๋ธ์ด ์ž‘์€ ๋‹จ์œ„ ๋ ˆ๋ฒจ์˜ ๊ธฐ๋Šฅ๊ณผ ๋ณต์žกํ•œ ๊ทœ์น™์„ ํ•™์Šตํ•จ์œผ๋กœ์จ data์— fit๋˜๋„๋ก ์œ ๋„ํ•จ
      • Neuron condensation effect๊ฐ€ ํ•™์Šต ๊ณผ์ •์—์„œ ์ƒ๊ฒจ๋‚จ
        • Neuron Condenstation Effect: ๋™์ผํ•œ ๊ณ„์ธต์˜ ๋‰ด๋Ÿฐ๋“ค์ด ์œ ์‚ฌํ•œ ์ถœ๋ ฅ์„ ๋‚ด๋„๋ก ๋ญ‰์ณ์ง€๋Š” ํ˜„์ƒ
      • ๊ฐ™์€ ๋ ˆ์ด์–ด์˜ Neuron์ด ์œ ์‚ฌํ•œ ํŒจํ„ด์œผ๋กœ ๋งž์ถฐ์ง€๋Š” ํ˜„์ƒ์œผ๋กœ ์ธํ•ด ๋ฐœ์ƒํ•˜์—ฌ ์ตœ์†Œ ๋ณต์žก๋„๋กœ data fitting์ด ๋˜๋„๋ก ํ•จ
      • ํ‘œํ˜„๋ ฅ์€ ์ถฉ๋ถ„ํ•˜์ง€๋งŒ ์‹ค์งˆ์ ์œผ๋กœ ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ์ž์œ ๋„๊ฐ€ ์ ์–ด์ง
        • ๊ฐœ๋ณ„ ์ƒ˜ํ”Œ์„ ๋”ฐ๋กœ ์™ธ์šธ ์ˆ˜ ์žˆ๋Š” ํŒŒ๋ผ๋ฏธํ„ฐ ๋ถ„๋ฆฌ๊ฐ€ ์–ด๋ ค์›Œ ์•”๊ธฐ ์„ฑ๋Šฅ์€ ๋–จ์–ด์ง
      • ๊ณตํ†ต์œผ๋กœ ์ ์šฉ๋˜๋Š” ๊ฐ„๋‹จํ•œ ๊ทœ์น™์„ ์ฐพ๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šต์ด ์ง„ํ–‰
    • ํฐ initialization scale์—์„œ๋Š” ๋ชจ๋ธ์ด input-output ๋งคํ•‘์— ๋Œ€ํ•œ ๊ธฐ์–ต์„ ํ•˜๋„๋ก ์œ ๋„ํ•˜์—ฌ ์•”๊ธฐ ์„ฑ๋Šฅ์ด ์˜ฌ๋ผ๊ฐ
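The condensation effect described above can be reproduced in a minimal sketch (all hyperparameters here are illustrative choices, not taken from the paper): a two-layer tanh network with a very small initialization, trained by plain gradient descent on a toy regression task. Early in training, the input weight vectors of different hidden neurons collapse onto nearly one shared direction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: x in R^5, linear target y = sum(x).
n, d, m = 200, 5, 20                 # samples, input dim, hidden neurons
X = rng.uniform(-1.0, 1.0, (n, d))
y = X.sum(axis=1)

# Very small initialization scale (the regime where condensation appears).
W = rng.normal(0.0, 1e-3, (m, d))    # input -> hidden weights
a = rng.normal(0.0, 1e-3, m)         # hidden -> output weights

def mean_abs_cos(W):
    """Mean |cosine similarity| over all pairs of hidden weight vectors."""
    U = W / np.linalg.norm(W, axis=1, keepdims=True)
    C = np.abs(U @ U.T)
    return (C.sum() - len(W)) / (len(W) * (len(W) - 1))

cos_before = mean_abs_cos(W)

# Plain gradient descent on mean squared error.
lr = 0.2
for _ in range(300):
    H = np.tanh(X @ W.T)                                 # (n, m) hidden activations
    r = H @ a - y                                        # residuals
    a -= lr * (H.T @ r) / n
    W -= lr * ((1 - H**2) * a * r[:, None]).T @ X / n

cos_after = mean_abs_cos(W)
```

With a large initialization instead (e.g. std 1.0), the same loop tends to leave the neurons near their random, mutually distinct directions, which is the memorization-friendly regime this section contrasts against.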

Contribution

  • Experiments demonstrating the reasoning bias in a real natural-language training setup
  • A study explaining that the model's initialization scale has a substantial effect on its reasoning behavior (bias)

Experiments

  • To identify the reasoning bias, neural networks are trained at a small parameter scale on two datasets with different reasoning complexity, and the results are compared
  1. Results of training GPT-2 on a mixture of the two datasets
    • PrOntoQA dataset
      • A QA dataset that includes CoT
      • Explicitly spells out the reasoning needed to answer the question correctly
    • TinyStories dataset
      • Short synthetic stories made of words a 3-4-year-old child can understand (memory-oriented)
    • The rapid decrease of the loss on the PrOntoQA dataset shows that the model picks up the reasoning pattern more readily

  2. Why the reasoning task is acquired quickly early in training
    • Early in training, the embedding space tends to differentiate more
      • "Embeddings differentiate": vectors in the embedding space move toward different directions and positions
    • Since a token t is one-hot, after it is mapped to an embedding, that embedding is updated by accumulating the loss gradients of every sample in which the token appears, so which labels each token co-occurs with determines its embedding direction
      • Why embedding differentiation is fast for the reasoning task
        • Particular tokens are strongly associated with particular kinds of labels, so the per-token label distributions differ
        • Embeddings move in different directions from early in training
      • Why embedding differentiation is slow for the memory task
        • Different memory tokens have similar label distributions
        • Their gradient directions are similar, so the embeddings are not distinguished early on

Result

  • Transformer์—์„œ โ€˜์ž‘์€ ์ดˆ๊ธฐํ™”โ€™๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ์‹คํ—˜์„ ์ง„ํ–‰
  • Biased๋œ ํ˜„์ƒ์„ ์ž์„ธํžˆ ๊ด€์ฐฐํ•˜๊ธฐ ์œ„ํ•˜์—ฌ ์ž„๋ฒ ๋”ฉ ๋ ˆ์ด์–ด์™€ Multi-layer Perceptron์œผ๋กœ ๊ตฌ์„ฑ๋œ ๊ฐ„๋žตํ™”๋œ ๋ชจ๋ธ์„ ์ œ์•ˆ
  • ํ† ํฐ ์ž„๋ฒ ๋”ฉ์€ ํ•ด๋‹น ํ† ํฐ์ด ๋“ฑ์žฅํ•œ ์ƒ˜ํ”Œ๋“ค์˜ ๋ผ๋ฒจ ๋ถ„ํฌ์— ์˜ํ•ด ํ•™์Šต๋˜๋Š” ๊ฒƒ์„ ์ด์šฉํ•˜์—ฌ ์‹คํ—˜ ์„ค๊ณ„
    • Reasoning Anchor
      • ํ† ํฐ ์ž์ฒด๋งŒ์œผ๋กœ ์ •๋‹ต์ด ๊ฒฐ์ •๋˜์ง€ ์•Š๊ณ  ๋‹ค๋ฅธ ํ† ํฐ๋“ค๊ณผ์˜ Composition์— ๋”ฐ๋ผ ๋ผ๋ฒจ์ด ๋‹ฌ๋ผ์ง
      • Gradient์˜ ๋ถ„์‚ฐ์ด ํผ
      • ๋ผ๋ฒจ ๋ถ„ํฌ๊ฐ€ ๋‹ค์–‘ํ•˜๋ฏ€๋กœ ์ดˆ๊ธฐ ํ•™์Šต ๋‹จ๊ณ„์—์„œ ์ž„๋ฒ ๋”ฉ ๋ถ„ํ™”๊ฐ€ ๋น ๋ฆ„
    • Memory Anchor
      • ํŠน์ • ํ† ํฐ์ด ๊ฑฐ์˜ ๊ฐ™์€ ์ •๋‹ต(label)๊ณผ ์—ฐ๊ฒฐ๋จ
      • ๋ผ๋ฒจ ๋ถ„ํฌ์˜ ๋ถ„์‚ฐ์ด ์ž‘๊ณ  ์ž„๋ฒ ๋”ฉ ์—…๋ฐ์ดํŠธ๋‚˜ ๋ถ„ํ™”๊ฐ€ ์ž‘์Œ
Reasoning Bias in Transformer with Composite Anchor Functions
  • Varying the gamma value over 0.3, 0.5, and 0.8: the top row shows the training loss, the bottom row the prediction accuracy
  • Note: the larger gamma is, the smaller the initialization scale

  • The dataset is composed of Amem (tokens solvable by memorization), Arsn (tokens that require knowing the rule), Z (ordinary tokens), and M (rules meaningful only for specific reasoning-anchor pairs)
  • Dataset
    • 200,000 samples in total
    • Split into memorization-only data, reasoning training data, and reasoning test data
  • Model
    • Decoder-only Transformer (2 layers, 1 attention head)
    • Cross-entropy loss, optimized with AdamW
    • Initialization scales: 0.3, 0.5, 0.8
  • Large initialization (0.3)
    • On the training data, fits both the memorization data and the reasoning data well
    • On the reasoning test data, the loss barely decreases
    • Shows that the model is memorizing the training samples themselves
    • The loss on the memory data falls rather quickly
  • Small initialization (0.8)
    • The reasoning loss falls well on both the training and test data
    • The loss on the memory data falls somewhat slowly
    • The model learns the rule before resorting to plain memorization
    • A reasoning bias emerges

โ‡’ ๋ชจ๋ธ์˜ Learning Bias๊ฐ€ Initialization scale์— ์˜ํ–ฅ์„ ๋ฐ›์Œ

Simplified Model

To understand the bias better, experiments are run on a small 2-layer fully connected network.

  • Model definition

The 2-layer model is built with W(1) as the weight that maps input tokens to the hidden state and W(2) as the weight that maps the hidden state to output tokens, with a sigmoid activation function.

  • Comparing the patterns shown by the embedding space
  • Memory anchors consistently point in nearly the same direction even as training epochs increase, whereas reasoning anchors point in the same direction (high cosine similarity) when close to each other, with similarity decreasing continuously as the distance between them grows

  1. Looking at the cosine similarity of the memory-anchor and reasoning-anchor embeddings:
  • For reasoning anchors, cosine similarity decreases as the distance between anchors grows, showing that reasoning anchors quickly form a continuous, hierarchical structure in the embedding space
  • For memory anchors, all memory anchors point in nearly the same direction
    • The model's primitive-level mapping, i.e. the embeddings, ought to become more diverse
    • Complexity and diversity ought to increase further
    • ⇒ In practice, however, memory anchors end up less diverse than reasoning anchors, and they differentiate poorly

  • Target ๋ถ„ํฌ๊ฐ€ Embedding์„ ๊ฒฐ์ •ํ•˜๋Š” ์ด์œ 
    1. Assumption
    • ์ž‘์€ ์ž…๋ ฅ์—์„œ๋Š” ํ™œ์„ฑํ™”๊ฐ€ ๊ฑฐ์˜ ์„ ํ˜•์ด๊ณ , Gradient๊ฐ€ ํญ์ฃผํ•˜์ง€ ์•Š์Œ์„ ๊ฐ€์ •ํ•จ

      โ‡’ Small initialization์—์„œ๋Š” Emb-MLP (ํ•ฉ์„ฑ๋œ ๋ชจ๋ธ)๊ฐ€ ๊ฑฐ์˜ ์„ ํ˜• ๋ชจ๋ธ์ฒ˜๋Ÿผ ์ž‘๋™ํ•œ๋‹ค

    • Hidden Layer์˜ ๋น„์„ ํ˜•์„ฑ์ด ์‚ฌ๋ผ์ง€๋ฏ€๋กœ, Target Distribution๋งŒ ๋ณด๊ณ  Embedding์ด ์›€์ง์ž„

    • ํ† ํฐ s์— ๋Œ€ํ•œ Embedding์˜ Gradient๋Š” s๊ฐ€ ๋“ฑ์žฅํ•œ ๋ชจ๋“  ์ƒ˜ํ”Œ์˜ ์ •๋‹ต ๋ ˆ์ด๋ธ”๊ณผ uniform ๋ถ„ํฌ์˜ ์ฐจ์ด์— ์˜ํ•ด ๋ˆ„์ ๋จ

    1. Proposition
    • ๋žœ๋ค ๋ณ€์ˆ˜: ํ† ํฐ s๋ฅผ ํฌํ•จํ•œ ์ƒ˜ํ”Œ์„ ํ•˜๋‚˜ ๋ฌด์ž‘์œ„๋กœ ๋ฝ‘์•˜์„ ๋•Œ์˜ ์ •๋‹ต ๋ ˆ์ด๋ธ”

      ๋ถ„ํฌ: ํ† ํฐ s๊ฐ€ ์–ด๋–ค ๋ ˆ์ด๋ธ”๊ณผ ์–ผ๋งˆ๋‚˜ ์ž์ฃผ ํ•จ๊ป˜ ๋“ฑ์žฅํ•˜๋Š”๊ฐ€

    • Embedding์˜ ์ด๋™ ๋ฐฉํ–ฅ์€ ์ •๋‹ต ๋ถ„ํฌ P์™€ ์™„์ „ ๊ท ๋“ฑ ๋ถ„ํฌ์˜ ์ฐจ์ด์— ์˜ํ•ด ๊ฒฐ์ •๋จ

      ๋ชจ๋ธ ๊ตฌ์กฐ, ๋‹ค๋ฅธ ํ† ํฐ์€ ๊ฑฐ์˜ ๊ด€์—ฌํ•˜์ง€ ์•Š์Œ

    1. Results
    • Memory Anchor๊ฐ€ ๋ชจ๋‘๋‹ค ๊ฑฐ์˜ ๊ฐ™์€ ๋ฐฉํ–ฅ์œผ๋กœ Align ๋˜๋Š” ์ด์œ 
      • ์–ด๋–ค Memory Anchor๊ฐ€ ๋“ฑ์žฅํ•ด๋„ ์ •๋‹ต ๋ ˆ์ด๋ธ” ๋ถ„ํฌ๊ฐ€ ๋™์ผ(Uniform ๋ถ„ํฌ์™€ ์ฐจ์ด๊ฐ€ ๊ฑฐ์˜ ๋ฐœ์ƒํ•˜์ง€ ์•Š์Œ)
      • ๋ชจ๋“  Memory Anchor์˜ Gradient ๋ฐฉํ–ฅ์ด ๊ฐ™์Œ
      • โ‡’ ๋”ฐ๋ผ์„œ Embedding์ด ๊ฐ™์€ ๋ฐฉํ–ฅ์œผ๋กœ๋งŒ ๊ณ„์† ์›€์ง์ž„
    • Reasoning Anchor๊ฐ€ ๋ถ„ํ™”๋˜๋Š” ์ด์œ 
      • Reasoning Anchor s์— ๋Œ€ํ•˜์—ฌ ์ •๋‹ต ๋ ˆ์ด๋ธ”์ด ๋ชจ๋‘ ๋™์ผํ•˜์ง€ ์•Š์Œ
      • ํ‰๊ท (๊ธฐ๋Œ€๊ฐ’)์ด s๋งˆ๋‹ค ๋‹ค๋ฆ„
      • Embedding Gradient ๋ฐฉํ–ฅ์ด ๋‹ฌ๋ผ์ง
      • โ‡’ ์ดˆ๊ธฐ ๋‹จ๊ณ„์—์„œ๋ถ€ํ„ฐ ๋น ๋ฅด๊ฒŒ ๋ถ„ํ™”
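The proposition can be checked numerically in the near-linear regime (assuming a softmax output head, which is our modeling choice): at logits ≈ 0 the cross-entropy gradient for one sample is softmax(0) − onehot(y) = uniform − onehot(y), so the average over all samples containing a token s is exactly uniform − P. Anchors that share a label distribution therefore share one update direction, while anchors with different label distributions pull apart:

```python
import numpy as np

C = 10                                  # number of output labels
uniform = np.full(C, 1.0 / C)

def avg_logit_gradient(labels):
    """Average d(cross-entropy)/d(logits) at logits ~ 0 over observed labels.

    Per sample: softmax(0) - onehot(y) = uniform - onehot(y),
    so the average equals uniform - P (P = empirical label distribution).
    """
    return np.mean([uniform - np.eye(C)[y] for y in labels], axis=0)

# Memory anchor: always the same label -> one fixed update direction.
g_mem = avg_logit_gradient([3] * 100)

# Two reasoning anchors with disjoint label distributions.
g_rsn_a = avg_logit_gradient([0, 1, 2, 3, 4] * 20)
g_rsn_b = avg_logit_gradient([5, 6, 7, 8, 9] * 20)

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cos(g_rsn_a, g_rsn_b))   # ≈ -1: the two reasoning anchors pull apart
```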

์ผ๋ฐ˜์ ์ธ Task์—์„œ์˜ Transformer

Transformer์—์„œ์˜ Bias ์‹คํ—˜ ๊ฐœ์š”

  • MLP (Multi-layer Perceptron) ๋ชจ๋ธ์ด Noise Sequence์—์„œ ์‹คํŒจํ•œ๋‹ค๋Š” ์ ์—์„œ ์ด๋Ÿฌํ•œ ์‹คํŒจ๊ฐ€ ์ ์€ Transformer ๋ชจ๋ธ์ด ๋” ๋‚˜์€ ์ ์„ ๋ณด์—ฌ์ฃผ์ง€๋งŒ, ๊ทธ๋Ÿผ์—๋„ Reasoning bias๊ฐ€ ์œ ์ง€๋˜๋Š”์ง€ ๋ณด์—ฌ์ฃผ๋Š” ๋ถ€๋ถ„
  • Transformer์˜ ์ž„๋ฒ ๋”ฉ space๊ฐ€ Emb-MLP์—์„œ ๋ณด์ธ ๊ฒƒ๊ณผ ์œ ์‚ฌํ•œ ํ˜„์ƒ์„ ๋ณด์ด๋Š”์ง€ ์‚ดํŽด๋ณด๊ณ , ๋ชจ๋ธ์ด ์ž…๋ ฅ์œผ๋กœ๋ถ€ํ„ฐ ์ •๋ณด๋ฅผ ์–ด๋–ป๊ฒŒ ํฌ์ฐฉํ•˜๋Š”์ง€ ํ™•์ธํ•˜๋Š” ์‹คํ—˜

Experimental Results

  • A figure showing that the earlier phenomenon appears unchanged in an actual Transformer model, not just in the simplified model
  • B: comparison of the PCA distributions of memory anchors and reasoning anchors
  • C: cosine-similarity comparison between the actual Transformer embeddings and the simplified model
  • D: PCA of the theoretically constructed embeddings

  1. Embedding space
    • The Transformer's embedding space is nearly identical to the Emb-MLP's
    • Reasoning anchors show a hierarchical, continuous structure, with cosine similarity decreasing as distance grows
    • Memory anchors, in contrast, all show similar alignment
    • PCA (principal component analysis) is used to analyze the structural characteristics of the whole embedding space

      ⇒ This shows that attention does not create new embeddings but only amplifies the bias of the existing embeddings

      ⇒ The Transformer shows the same tendency as the Emb-MLP!

  2. First attention module
    • The output at the i-th token becomes the average of all preceding tokens
    • The query-key dot products gradually become identical
    • As a result, combined with the causal mask, the layer turns into a prefix-average operation
    • The first attention layer is not a device for choosing which tokens matter but one that accumulates the tokens seen so far
    • For W, the largest singular value is far larger than the rest
    • The corresponding singular vector aligns closely with the reasoning anchors but is nearly orthogonal to the memory anchors
      • The reasoning vector is captured almost entirely by W and propagated to all subsequent tokens
  3. Second attention module
    • Its role is to find where the important information sits and to gather the final information
  4. [Definition 2] One-layer Transformer
    • Layer normalization and the final projection layer are excluded

      (they are components that do not affect the result)

    • Consistent with the earlier observations, at a small initialization scale the attention A can be interpreted as an average
    • Small initialization ⇒ self-attention is nearly identical to a prefix average
    • When the scale of Q and K is small and the softmax input is nearly 0, the softmax output is nearly uniform
    • This sequential accumulation of information favors reasoning
    1. Proposition 2
    • Memory anchors' embedding directions become similar and they cluster in one place

    2. Proposition 3
    • Embedding updates of reasoning anchors
    • A reasoning token's label is determined in combination with the preceding tokens
    • The label distribution P is spread around s
    • The embeddings differentiate gradually
    3. Theorem 1
    • Approximate form of the reasoning embedding
    • As a result, it is proven mathematically that the reasoning bias also arises in the Transformer

Real Language Tasks

  • ∆L: a metric that quantifies the reasoning bias

    ∆L = (L(TinyStories) − L(PrOntoQA)) / L(PrOntoQA)

Numerator, L(TinyStories) − L(PrOntoQA): the loss gap between the two tasks

Denominator, L(PrOntoQA): normalizes by the loss of the reasoning task

A measure of how much harder the memory task is to learn relative to the reasoning task

  • What an increase in ∆L means
    • L(PrOntoQA) decreases relatively faster
    • The model is biased toward learning the reasoning task better
  • The cause of this phenomenon
    • GPT-2 is trained with a small initialization scale
    • The representations differentiated in the early training phase
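The ∆L metric itself is a one-liner; the loss values below are made-up numbers purely to illustrate how a faster-falling reasoning loss makes ∆L grow:

```python
def delta_L(loss_tinystories, loss_prontoqa):
    """Relative loss gap: how much harder the memory task is than the reasoning task."""
    return (loss_tinystories - loss_prontoqa) / loss_prontoqa

# Hypothetical checkpoints (L_tiny, L_pronto): the reasoning loss falls faster,
# so delta_L increases over training, i.e. the model is biased toward reasoning.
checkpoints = [(3.2, 3.1), (2.9, 2.4), (2.7, 1.9)]
curve = [delta_L(lt, lp) for lt, lp in checkpoints]
print(curve)   # strictly increasing
```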

Conclusion

  • ์ž‘์€ ์ดˆ๊ธฐํ™”๊ฐ€ ์ถ”๋ก  ์„ ํ˜ธ bias๋ฅผ ๋งŒ๋“ฆ
  • Label distribution์ด ์ž„๋ฒ ๋”ฉ ๊ณต๊ฐ„์„ ๋งŒ๋“œ๋Š”๋ฐ ํ•ต์‹ฌ ์—ญํ• ์„ ํ•˜๊ณ  ํ•™์Šต์— ์žˆ์–ด ๋™์—ญํ•™์  ์˜ํ–ฅ์„ ๋ฏธ์นจ
  • Next-token prediction training๊ณผ ๊ฐ™์€ ์œ ์‚ฌํ•œ task์— ํ™œ์šฉ ๊ฐ€๋Šฅ
  • ์‹คํ—˜์  ๊ด€์ฐฐ๊ณผ ์ด๋ก ์ ์ธ ์ˆ˜์‹์œผ๋กœ ์ฆ๋ช…ํ•จ

Categories

research