10 December 2025

Generalization or Hallucination? Understanding Out-of-Context Reasoning in Transformers


Review

| Nickname | One-line review | Rating (/5) |
| --- | --- | --- |
| 블랙프라이데이 | This is similar in spirit to what the professor told us last week (that LLMs quickly learn even knowledge built from arbitrarily generated counterfactuals)! I recall that paper was ontology-based. What is logical and clever here is that they unpack the same idea mathematically via a matrix-level decomposition and arrive at the conclusion that the hallucination issue can be mitigated through training. | 3.8 |
| 3시 | How far can gradient-based interpretation of LLMs go? This paper, too, evaluates knowledge only at a simple symbolic level; when a broader range of knowledge is involved, could the effect of causal reasoning be demonstrated through gradients? | 4.2 |
| 사이시옷 | That generalization ability and hallucination share the same root is a very convincing yet novel aha moment! The authors' skill is remarkable, going from a very simple synthetic experiment all the way to a mathematical proof to show it. | 4.5 |
| 밥 | Seems to show well that LLMs focus on surface-level cues. Rather than verifying whether a connection between facts is semantically logical, they tend to connect first; if the connection matches reality it becomes generalization, and if not, hallucination. A new perspective for me. | 4 |
| 6시 | Hallucination and generalization having the same cause... Given how LLMs somehow find and learn patterns even from a handful of examples, this reminds us once again how important data curation and pre-training are! | 4.3 |
| 프리바이오틱스는 유산균먹이 | First of all, it is hard. LLMs must hold an enormous number of internal associations; phrased as "is the way a model connects newly added knowledge to existing knowledge identical from the generalization and hallucination perspectives?", it sounds simple, but actually verifying it is hard. Approaching it through attention is a good idea. However far Transformers go, understanding and exploiting their intrinsic properties will always matter. | 4.5 |
| 고붕 | Viewing generalization and hallucination as the same mechanism seems like a new perspective. Out of context, a model will pick a plausible general pattern, which raises generalization but also raises the chance of hallucination. Rather than blindly suppressing hallucination, we should minimize it without hurting generalization. | 4.2 |
| 욘세이 | That generalization and hallucination share the same cause seems to be the paper's core finding. Then we also need countermeasures for cases where fixing hallucination hurts generalization, or conversely where hallucination arises during generalization. | 4.7 |

TL;DR

๐Ÿ’ก

Both generalization and hallucination are manifestations of out-of-context reasoning, and OCR becomes learnable when the Output matrix and the Value matrix are kept separate (factorized)!

  • Output matrix: the matrix applied after attention (K, Q, V) and before the FFN; it matches dimensions and, in multi-head attention, aggregates information across heads (a shape-level sketch follows below)
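For reference, a minimal NumPy sketch of where this Output matrix sits in a standard multi-head attention block. Shapes and names here are mine for illustration, not from the paper:

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """Single multi-head attention block; W_O is the Output matrix.

    X: (T, d_model) token representations
    W_Q, W_K, W_V: (n_heads, d_model, d_head) per-head projections
    W_O: (n_heads * d_head, d_model) -- applied after attention and
         before the FFN; maps the concatenated heads back to d_model
         and mixes information across heads.
    """
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # (T, d_head)
        scores = Q @ K.T / np.sqrt(K.shape[-1])          # (T, T)
        A = np.exp(scores - scores.max(-1, keepdims=True))
        A /= A.sum(-1, keepdims=True)                    # row-wise softmax
        heads.append(A @ V)                              # (T, d_head)
    concat = np.concatenate(heads, axis=-1)              # (T, n_heads*d_head)
    return concat @ W_O                                  # (T, d_model), fed to the FFN

# Example shapes: two heads
T, d_model, d_head = 4, 8, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))
W_Q, W_K, W_V = (rng.normal(size=(2, d_model, d_head)) for _ in range(3))
W_O = rng.normal(size=(2 * d_head, d_model))
out = multi_head_attention(X, W_Q, W_K, W_V, W_O)
```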

Summary

Motivation

  • Example
    • Generalization
      • Training: "Alice lives in France. Alice speaks French. Raul lives in France."
      • Test: What language does Raul speak? → French ✅
    • Hallucination
      • Training: "Alice lives in France. Alice codes in Java. Raul lives in France."
      • Test: What programming language does Raul use? → Java ❌
    • Training only tells the model that Raul lives in France; can it know what language he speaks, or what programming language he uses?
  • Research Question: Do generalization and hallucination on newly-injected factual knowledge arise from the same underlying mechanism?
    • When an LLM learns new knowledge, it generalizes well, yet hallucination clearly occurs too
    • Do these two have different causes, or the same one?
๐Ÿ’ก

Generalizability and hallucination are both framed as the same textual implication (entailment)!!

Contribution

  • Shows, with a mathematical derivation, that the generalization (natural-language reasoning) ability of LLMs (more precisely, of the attention mechanism) and hallucination share the same root cause
    • Even a single-layer, single-head attention transformer performs this kind of out-of-context reasoning,
      provided the Output matrix and the Value matrix are kept separate (the key and query matrices may stay merged as one $W_{KQ}$)

Out of Context Reasoning (OCR) in LLM

  • Implication

    An underlying rule $(s, r_1, b_i) \xrightarrow{\text{implies}} (s, r_2, c_i), \ \forall s \in \mathcal{S}$ means that any
    subject $s$ having relation $r_1$ with $b_i$ also has relation $r_2$ with $c_i$. For example, $(s, \text{lives in}, \text{Paris}) \xrightarrow{\text{implies}} (s, \text{speaks}, \text{French}), \ \forall s \in \mathcal{S}$ means "people who live in Paris speak French".

    • $(r_1, b_i)$: fact, $(r_2, c_i)$: implication
  • ํ•ฉ์„ฑ๋ฐ์ดํ„ฐ ๊ตฌ์„ฑ
    • ๊ฐ€์ƒ์˜ ์ด๋ฆ„ ๋ชฉ๋ก์œผ๋กœ ์ง‘ํ•ฉ ๊ตฌ์„ฑ SS๏ปฟ
    • ๊ฐ€์ƒ์˜ ์ด๋ฆ„์— ๋Œ€ํ•œ 5๊ฐ€์ง€ fact A1A_1๏ปฟ๊ณผ, 5๊ฐ€์ง€ implication A2A_2๏ปฟ ๋ฅผ ์ง์ง€์Œ
      • ๋„์‹œ-์–ธ์–ด, ๋„์‹œ-์–ธ์–ด(CounterFactual), ๊ตญ๊ฐ€-์ฝ”๋“œ, ์ง์—…-์ƒ‰๊น”, ์Šคํฌ์ธ -์Œ์•…
        • ๋„์‹œ-์–ธ์–ด๋Š” pre-training์„ ํ†ตํ•ด ํ•™์Šต๋˜์—ˆ์„ ๊ฐ€๋Šฅ์„ฑ์ด ๋†’์Œ
        • CounterFactual ์Œ์€ ๋„์‹œ์— ์˜ฌ๋ฐ”๋ฅด์ง€ ์•Š์€ ์–ธ์–ด ๋งคํ•‘
          • e.g. ํŒŒ๋ฆฌ-์ผ๋ณธ์–ด
    • SS๏ปฟ๋ฅผ 5๊ฐ€์ง€ ํ•˜์œ„์ง‘ํ•ฉ์œผ๋กœ ๋‚˜๋ˆ„์–ด ๊ฐ fact-implication์— ๋ถ„ํ• ํ•จ
      • e.g. S1S_1๏ปฟ : ๊ตญ๊ฐ€-์ฝ”๋“œ

  • ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ์˜ 20%๋งŒ ํ•™์Šต์‹œํ‚ค๊ณ , 80%๋กœ ํ…Œ์ŠคํŠธ (SS๏ปฟ๊ธฐ์ค€ 20%)
  • ํ•ฉ์„ฑ ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ LLM์—๊ฒŒ ์ง€์‹ ์ฃผ์ž… ํ›„ ์ผ๋ฐ˜ํ™”์™€ ํ™˜๊ฐ ์ธก์ •
    • LLMs: Gemma-2-9B, OLMo-7B, Qwen-2-7B, Mistral-7B-v0.3, Llama-3-8B
    • Metric: mean rank(์ •๋‹ต implication์˜ ํ‰๊ท  ์ˆœ์œ„, ๋‚ฎ์„์ˆ˜๋ก ์ข‹์Œ)
  • Experimental results
    • Models generalize well on causally valid implications, but they also learn to connect pairs that have no causal relation
    • Learning happens from very little data (even 4 examples per fact-implication type)
    • Generalization is stronger for pairs that resemble the pre-training data
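A minimal sketch of the data construction described above. The names, pair values, and relation labels are illustrative placeholders, not the paper's exact lists; only the fact/implication pairing and the 20/80 subject-level split follow the description:

```python
import random

random.seed(0)

# Illustrative fact -> implication pairs (the paper uses five types,
# including a counterfactual city-language mapping such as Paris -> Japanese).
PAIRS = {
    "city-language":    [("Paris", "French"), ("Berlin", "German")],
    "city-language-cf": [("Paris", "Japanese"), ("Berlin", "Korean")],  # counterfactual
    "country-code":     [("France", "+33"), ("Germany", "+49")],
    "occupation-color": [("baker", "blue"), ("pilot", "red")],
    "sport-music":      [("tennis", "jazz"), ("soccer", "rock")],
}

names = [f"Person{i}" for i in range(100)]   # fictitious subject set S
random.shuffle(names)
subsets = [names[i::5] for i in range(5)]    # split S across the 5 types

train, test = [], []
for subset, (ptype, pairs) in zip(subsets, PAIRS.items()):
    for j, s in enumerate(subset):
        b, c = random.choice(pairs)
        fact        = (s, f"{ptype}:fact", b)         # (s, r1, b_i)
        implication = (s, f"{ptype}:implication", c)  # (s, r2, c_i)
        # 20% of subjects contribute both fact and implication to training;
        # for the other 80%, the implication is held out for testing.
        if j < len(subset) // 5:
            train += [fact, implication]
        else:
            train += [fact]
            test += [implication]

print(len(train), "training triples,", len(test), "held-out implications")
```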

One-Layer Attention-Only Transformers can Do Symbolic OCR

  • Experiment with the synthetic data above on a very simple form of transformer
  • Synthetic-data notation
  • Put simply, the model is trained on sequences $z_{1:(T+1)} = [s, r, \langle\text{EOS}\rangle, a^*(s, r)]$ and, given $z_{1:T}$ for a test subject $s$, predicts $z_{T+1}$
  • Each token $z$ is embedded as a one-hot vector
  • The embedded input is $X = [e_{z_1}, \dots, e_{z_T}]^\top \in \mathbb{R}^{T \times M}$, where $x_i = e_{z_i}$ for $i \in [T]$
  • In a simple transformer where the Output and Value matrices are kept separate, the output vector is as follows (a runnable toy version of both models appears after this list)
    • Factorized model: $f_{\theta}(X) = W_O W_V^\top X^\top X W_{KQ} x_T \in \mathbb{R}^d$
  • The model with merged Output and Value matrices outputs
    • Non-factorized model: $f_{\theta}(X) = W_{OV} X^\top X W_{KQ} x_T \in \mathbb{R}^d$
  • The next-token prediction probability takes the familiar softmax form
    • $p_{\theta}(z \mid z_{1:T}) := \frac{\exp(e_z^\top f_{\theta}(X))}{\sum_{z' \in \mathcal{A}} \exp(e_{z'}^\top f_{\theta}(X))} = \frac{\exp(f_{\theta}(z_{1:T}, z))}{\sum_{z' \in \mathcal{A}} \exp(f_{\theta}(z_{1:T}, z'))}$
  • The training and test losses are
    • $\mathcal{L}_{\text{train}}(\theta) = \mathbb{E}_{z_{1:T+1} \sim \mathcal{D}_{\text{train}}} [-\log p_{\theta}(z_{T+1} \mid z_{1:T})]$
    • $\mathcal{L}_{\text{test}}(\theta) = \mathbb{E}_{z_{1:T+1} \sim \mathcal{D}_{\text{test}}} [-\log p_{\theta}(z_{T+1} \mid z_{1:T})]$
  • During training, both the factorized and non-factorized models reached zero training loss,
    but only the factorized model drove the test loss to zero as well
  • As in the left of Figure 2, on test implications the factorized model shows weight patterns similar to the training ones, while the non-factorized model can only memorize the training data
    • Hence, as on the right, for subject $s_2$ the factorized model already places weight on $b_2$ and on $c_2$ (never seen in training for $s_2$, but carried over because another subject $s_1$ exhibited it)
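A minimal PyTorch sketch of the two models and the training setup just described. The vocabulary layout, sizes, optimizer, and learning rate are assumptions for illustration; the forward pass implements $f_{\theta}(X) = W_O W_V^\top X^\top X W_{KQ} x_T$ (factorized) vs. $W_{OV} X^\top X W_{KQ} x_T$ (merged):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy symbolic vocabulary: N subjects, relations r1 (fact) / r2 (implication),
# <EOS>, and K attribute pairs (b_k, c_k). All sizes are assumptions.
N, K = 20, 2
R1, R2, EOS = N, N + 1, N + 2
B = [N + 3 + k for k in range(K)]
C = [N + 3 + K + k for k in range(K)]
M = N + 3 + 2 * K                      # vocabulary size

def onehot_seq(tokens):
    X = torch.zeros(len(tokens), M)
    X[torch.arange(len(tokens)), torch.tensor(tokens)] = 1.0
    return X

# Facts (s, r1) -> b_k are always trained; implications (s, r2) -> c_k are
# trained for 20% of subjects and held out for the remaining 80%.
train, test = [], []
for s in range(N):
    k = s % K
    train.append((onehot_seq([s, R1, EOS]), B[k]))
    (train if s < N // 5 else test).append((onehot_seq([s, R2, EOS]), C[k]))

class OneLayerAttn(torch.nn.Module):
    """Linear one-layer attention-only model, factorized or merged W_OV."""
    def __init__(self, factorized):
        super().__init__()
        self.factorized = factorized
        self.W_KQ = torch.nn.Parameter(0.01 * torch.randn(M, M))
        if factorized:
            self.W_O = torch.nn.Parameter(0.1 * torch.randn(M, M))
            self.W_V = torch.nn.Parameter(0.1 * torch.randn(M, M))
        else:
            self.W_OV = torch.nn.Parameter(0.01 * torch.randn(M, M))

    def forward(self, X):                           # X: (batch, T, M) one-hot
        x_T = X[:, -1, :]                           # last token as the query
        attn = torch.einsum('btm,mn,bn->bt', X, self.W_KQ, x_T)
        ctx = torch.einsum('bt,btm->bm', attn, X)   # X^T X W_KQ x_T
        W = self.W_O @ self.W_V.T if self.factorized else self.W_OV
        return ctx @ W.T                            # logits f_theta(X)

def run(model, steps=3000, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    Xtr = torch.stack([x for x, _ in train]); ytr = torch.tensor([y for _, y in train])
    Xte = torch.stack([x for x, _ in test]);  yte = torch.tensor([y for _, y in test])
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(model(Xtr), ytr)
        loss.backward(); opt.step()
    with torch.no_grad():
        return loss.item(), F.cross_entropy(model(Xte), yte).item()

# Expected qualitatively, per the paper's claim: both fit the training set,
# but only the factorized model drives the held-out implication loss down.
print('factorized  (train, test):', run(OneLayerAttn(True)))
print('merged W_OV (train, test):', run(OneLayerAttn(False)))
```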

Theoretical Results

  • A sketch of the mathematical proof (a tiny matrix-completion toy illustrating the two norms follows this list):
  • The factorized model converges to a solution that minimizes the nuclear norm; rather than filling the held-out weights with zeros, this yields a low-rank structure that fills in those values through their associations with the rest of the data
  • The non-factorized model minimizes the Frobenius norm during training, so it assigns zero weight to unseen data
  • Viewed through the max-margin (SVM) lens, the non-factorized model attains zero margin on newly injected knowledge, while the factorized model provably attains a positive margin

Categories

research