14 January 2026

Interpreting the Repeated Token Phenomenon in Large Language Models

최민영

Review

| Nickname | One-line review | Rating (0–5) |
| --- | --- | --- |
| 찰나 | Honestly, if I had walked in with this motivation and idea, I'd probably have been asked "why on earth is this needed?". It reminded me that how you package the narrative matters too. It is hard to see how this phenomenon could lead to diagnosing a structural problem of LLMs, which is exactly what makes it impressive. | 4.1 |
| 와사비꽃게랑 | I finally got a proper grip on the attention sink concept. The paper ties the attention sink to a negative phenomenon, but it actually emerged for model stability. The very approach of connecting such a completely different concept is a fresh idea. | 4.2 |
| 메가커피 | The preliminaries and repeated token divergence were new to me, and the paper is really fun. Seeing them show that the repeated token is an attention sink via the similarity between the BoS token's and the repeated token's distributions, I thought the experiments validating the motivation were designed really well. | 4.4 |
| 요리괴물 | So the attention score formula can be read this way..! It is surprising that adjusting a single neuron mitigates the sink. That said, it seems considered more for hardening model security than from an actual downstream-task perspective. | 4.3 |
| 새우깡 | The sink layer/neuron is intriguing, but where could it be used?! As proposed at the end, for blocking attacks it could matter to companies deploying models. From now on, whenever an LLM keeps repeating itself, I can think "it must be confused about something". | 4.1 |
| 안성재 | It shows a white-box interpretation of a kind of red teaming. The soundness is rich, and proposing a remedy lifts the completeness of the research another level. It survives. | 4.5 |
| 스타벅스 | Given that attention sinking can be exploited as a training-data leakage vulnerability, the direction of making the model unable to repeat felt novel. I had wondered why language models fail at the simple instruction to just repeat something, and this could be the answer. | 4.6 |
| 고구마맛 | I picked up many new concepts (attention sink, BoS token)! The fact that sink neuron IDs differ per model is fascinating. Before long there will be a map per layer and per neuron! | 4.8 |

TL;DR

💡

If you make an LLM repeat the same word over and over, at some point the model fails to keep repeating it and collapses; this happens because the neurons that create the attention sink mistake the repeated token for the 'beginning-of-sequence token (BoS)', so attention piles onto it.

Summary

  • Interpreting the Repeated Token Phenomenon in Large Language Models, ICML'25 | Link
  • Author
  • Citation: 3

Introduction

Preliminaries

  • Beginning of Sequence (BoS) token
    • Definition: the token that always sits in the first position of a sequence
      • e.g., "Once upon a time" → BoS: Once
  • Attention Sinks
    • Definition: the phenomenon where the model gives a disproportionate amount of attention to one token (a measurement sketch follows this list)
      • attention weights must always sum to 1, so when there is nowhere appropriate to put them, they pile onto a specific token (a 'sink' bucket into which leftover attention is dumped)

        → and that target is the initial token, since it is always accessible

      • this is due to the token's structural role, not its semantics
        • swapping it for "\n" produces the same pattern
    • Fig.
      • a heatmap of the mean attention logits over 256 sequences (16 tokens each) in Llama-2-7B
        • x-axis: key token position
        • y-axis: query token position
          • if (y = 10, x = 0) is red → the 10th token attends very strongly to the 0th token
      • from layer 2 onward, every query token gives very large attention to the initial token (token 0)
    • BoS sink:
      • Definition: the phenomenon where the BoS (Beginning of Sequence) token plays the attention-sink role (the BoS token serves as the 'sink' for leftover attention)
      • Effects
        • provides a reference point for sentence structure and acts as an anchor for interpreting the context
        • the BoS token is the canonical, benign attention sink
        • keeps attention from dispersing completely (it gives the sequence a reference point)
        • improves the model's fluency
  • Relationship between Attention Sinks & BoS
    • During training the model learns that 'the first token is the reference point of the context', so the BoS token is always trained to play the attention-sink role
    • Problems arise, however, when repeated tokens or tokens in particular patterns are treated like BoS (i.e., when there are 'multiple' attention sinks):
      • multiple BoS sinks appear, and
      • the sequence is perceived as restarting several times, so the model loses the context and diverges

Background

Repeated Token Divergence Phenomenon

  • LLM์€ ๋‹ค์–‘ํ•œ ์ž์—ฐ์–ด ํƒœ์Šคํฌ์—์„œ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์ด๊ณ  ์žˆ์Œ
  • ํ•˜์ง€๋งŒ, โ€˜ํ•˜๋‚˜์˜ ๋‹จ์–ด๋ฅผ ๋ฐ˜๋ณตํ•˜๋ผโ€™๋ผ๋Š” ๋‹จ์ˆœํ•œ ์ง€์‹œ๋ฅผ ์ œ๋Œ€๋กœ ์ˆ˜ํ–‰ํ•˜์ง€ ๋ชปํ•˜๋Š” ๊ฒฝ์šฐ๊ฐ€ ์žˆ์Œ

    โ‡’ ์ด๋ฅผ โ€œrepeated token divergenceโ€ phenomenon ๋ผ๊ณ  ํ•จ

    • e.g.,
      • GPT-3.5-turbo ๋กœ ์‹คํ—˜ ํ–ˆ์„ ๋•Œ, ๋‹จ์ผ token์„ ๋ฐ˜๋ณตํ•˜๋‹ค๊ฐ€ ๋‹ค๋ฅธ ๋ฌด๊ด€ํ•œ text๋ฅผ ์ถœ๋ ฅํ•˜๊ฒŒ ๋จ.
      • ์ด๋ ‡๊ฒŒ ์ถœ๋ ฅ๋˜๋Š” text๋Š” ๋ชจ๋ธ์ด ํ•™์Šต ๊ณผ์ •์—์„œ ์–ธ์  ๊ฐ„ ํ•œ๋ฒˆ ๋ณด์•˜๋˜ ๋ฐ์ดํ„ฐ๋ผ๊ณ  ํ•จ
  • โ€˜๋ฐ˜๋ณต ํ† ํฐ ํ˜„์ƒโ€™์„ ๋ถ„์„ํ•˜๋Š” ๊ธฐ์กด ์—ฐ๊ตฌ๋“ค:
    • ๋ฐ˜๋ณต ํ† ํฐ์ด ์ฒซ ํ† ํฐ(BoS)ํ‘œํ˜„์œผ๋กœ ์ˆ˜๋ ดํ•จ์„ ์ง๊ด€์ ์œผ๋กœ ๋ถ„์„ ๋ฐ ๊ด€์ฐฐ
    • ๋ฐ˜๋ณต ํ† ํฐ์ด ์ด๋ก ์ ์œผ๋กœ ๋ฌธ์ œ๊ฐ€ ๋  ์ˆ˜ ์žˆ๋‹ค๋Š” ๋ถ„์„

    โ‡’ ์™œ ์ด๋Ÿฐ ํ˜„์ƒ์ด ์ผ์–ด๋‚˜๋Š”์ง€์— ๋Œ€ํ•œ ๊ตฌ์ฒด์ ์ธ ๋ถ„์„์€ ์ถฉ๋ถ„ํ•˜์ง€ ์•Š์•˜์Œ

In this Paperโ€ฆ

  • Treats the 'repeated token divergence' phenomenon as related to the 'attention sink' phenomenon, and sets out to explain it through that mechanism
    • attention sink: the phenomenon where the first token of a sequence (BoS) receives abnormally high attention

    → "The repeated-token phenomenon arises when the very neural circuit that creates the attention sink misfires."

  • Analyzes the neural circuit that creates the attention sink, how that circuit misfires on repeated tokens, and, as a result, why the model collapses

    ⇒ The attention-sink mechanism that enables LLM fluency is, at the same time, the structural cause of the repeated-token vulnerability!

Contribution

  • First mechanism-level explanation of the Repeated Token Divergence Phenomenon
    • where prior work only observed the 'phenomenon', this paper analyzes it in a white-box setting
  • Identifies the sink neurons that cause the attention sink, and verifies them causally
    • selects the few neurons in the sink layer that produce the norm, and shows the phenomenon disappears when those neurons are ablated
  • Uncovers the first-token detector neuron structure by which the model recognizes the first token
    • proves that, after the first attention layer, first tokens and non-first tokens are linearly separable

      → the LLM encodes positional information explicitly in a single neuron!

  • Proposes a repeated-token attack method and a mitigation

Mechanistic Analysis of Repeated Token Divergence

  • Aims to explain the repeated token divergence phenomenon from the perspective of neural-circuit mechanisms
  • Flow
    1. Observe that the attention of repeated tokens resembles the BoS attention (the attention-sink phenomenon)
    2. Identify the mechanism that creates the attention sink
    3. Identify how this mechanism re-fires on repeated tokens and induces divergence
  • Experiments are conducted with LLaMA-2

Large Attention Scores of Repeated Tokens

Q: Is the attention-score pattern of repeated tokens similar to the attention-score pattern of an attention sink?
  • Setting
    • Top panel: a normal sentence
    • Bottom panel: repeating 'the …'
    • x-axis: key token position
    • y-axis: query token position
  • Top panel (normal sentence): the first token receives the highest attention → attention sink
  • Bottom panel ('the' repeated): the "the" tokens at multiple positions receive nearly the same attention score as the first token

    → meaning the repeated tokens are being treated like the beginning-of-sequence token (BoS)

⇒ The attention-score distribution over the repeated "the" tokens is similar to the attention distribution the first token receives in a normal sentence!

  • This suggests that the repeated-token phenomenon and the attention-sink mechanism are connected

The Attention-Sink Mechanism

Q: Why do repeated tokens end up receiving high attention scores?

A: The more a token is repeated, the larger its hidden-state norm grows, and a token with a large norm automatically receives high attention.

  • During training, Transformers learn to assign strong structural meaning to the sequence-start position (BoS), and in the process the BoS hidden norm always ends up large
    • a large hidden-state norm → that token's representation is strongly activated in the layer
  • This hidden-state norm also affects the attention score

    → if a token's hidden-state norm is large, that token automatically receives high attention

  • The authors therefore check whether, when a particular token is repeated, its hidden-state norm profile becomes similar to the BoS token's
    • the paper defines the layer where the attention-sink phenomenon first forms strongly as the sink layer, and runs the experiment at that layer (= 1)

  • Setting
    • Sink Layer: 1
      • sink layer: the layer where the attention-sink phenomenon first forms strongly
  • For every token tested, the hidden-state norm keeps growing as the repetition count increases, eventually converging to the BoS token's norm level
    • all three tokens (the, one, es) differ in speed but move in the same direction

⇒ In the model's internal representation space, repeated tokens drift toward a state nearly identical to the BoS token
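A small sketch of this measurement; SINK_LAYER = 1 follows the paper's LLaMA-2 setting, while the prompt and repetition counts are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumption: gated weights are available
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

SINK_LAYER = 1
for n in (1, 4, 16, 64):
    ids = tok("the " * n, return_tensors="pt")  # BoS is prepended automatically
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    hs = out.hidden_states[SINK_LAYER + 1][0]  # residual stream after the sink layer
    bos_norm = hs[0].norm().item()             # BoS token norm (the reference)
    last_norm = hs[-1].norm().item()           # norm of the latest repetition
    print(f"{n:3d} repeats: last-token norm {last_norm:8.1f} vs BoS {bos_norm:8.1f}")
```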

Q: Which neurons induce the attention-sink phenomenon?
  • Hypothesize that specific neurons cause this phenomenon, and define them as 'sink neurons'
  • Procedure for identifying sink neurons
    1. Feed in the BoS token and shortlist the top-K neurons whose MLP outputs contribute most to the residual-stream norm
    2. Ablate the candidate neurons one at a time
      • check: does removing this neuron meaningfully reduce the repeated-token norm?

    → neurons that satisfy both 1 and 2 are confirmed as sink neurons

  • Zero-ablate the identified sink neurons, then check how the norm changes
  • Setting
    • top plot: sink neurons not removed
    • bottom plot: sink neurons removed
      (the y-axis scale differs from the top plot!)
    • x-axis: repetition count
    • y-axis: residual-stream activation norm

  • Top plot: the norm rises sharply as the repetition position increases
  • Bottom plot: the norm drops drastically across the board

    → the repeated token no longer behaves like an attention sink

⇒ The norm explosion of repeated tokens does not occur without the sink neurons!

  • sink neuron IDs per model

Q: Earlier it was said that repeated tokens end up treated like the BoS (first token); so when does the model recognize the first token?
A: The first attention layer distinguishes the first token of the sequence from all subsequent tokens
  • Project the representations after the first attention layer onto a one-dimensional axis and check how well they separate

  • The distributions of first tokens and subsequent tokens are clearly separated → the first token and all other tokens are 'linearly separable'

    → the first attention layer plays the key role in identifying the first token and internally 'marking' it

  • Additionally, they find that a single neuron on its own separates the two perfectly
    • LLaMA2: MLP0 gate neuron 912 → the single neuron at dimension 912 of the layer-0 MLP's gate projection
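A minimal sketch that reads this detector out per position. Whether the paper measures the pre- or post-activation value is not stated here, so taking the raw gate_proj output is an assumption.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumption: gated weights are available
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

NEURON = 912  # the first-token detector reported for LLaMA2 above
gate = model.model.layers[0].mlp.gate_proj
store = {}
h = gate.register_forward_hook(
    lambda mod, args, out: store.__setitem__("gate", out.detach()))
with torch.no_grad():
    model(**tok("Once upon a time there was a model", return_tensors="pt"))
h.remove()

acts = store["gate"][0, :, NEURON]  # one activation per token position
for pos, a in enumerate(acts.tolist()):
    print(f"pos {pos}: gate[{NEURON}] = {a:+.2f}")  # position 0 should stand apart
```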

Attack and Mitigation

Attack

  • Repeated tokens can be exploited as a vulnerability
    • Prior work showed that repeated-token inputs confuse the model and, as a result, can be abused for training-data leakage attacks
    • They also push the model off instruction following and expose memorized training data
      • e.g., repeating 'as' 50 times to Pythia-12B makes the model emit text resembling a 3D-printing description, and that output was confirmed to restate sentences from an actual website (text presumed to be in the training data)
    • Attack example (Before Patching)
  • Surface-level mitigations exist that detect and block long runs of repeated tokens, but they do not address the root cause

    → the model can be attacked even without repeated tokens

    • Attack detail
      • cluster the attention head's projection space into several natural groups, rather than just two (first token / the rest)
      • interleaving tokens from the same cluster produces representations that get treated like BoS, and the model diverges

        → even without repeating an identical token, gathering similar tokens bypasses the defense and attacks the model

Mitigation

  • <see Fig. 3> Forcing the activation of the sink-inducing neurons into the 'no-sink' state blocks the repeated-token attack
    • Fig 3
    • Mitigation (After Patching)
  • Shows that, on LLaMA2, a repeat prompt can no longer make the model diverge once patched
  • <Table 2> Additionally, they check whether the patch damages the model's base capabilities → even with the patch applied, performance was unaffected
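A hedged sketch of the patch idea: freeze the sink neurons' activation with a forward pre-hook. The neuron id 2533 and the 'no-sink' value 0.0 are hypothetical placeholders; the paper pins the activation to a measured non-BoS value instead.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumption: gated weights are available
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

SINK_LAYER, SINK_NEURONS = 1, [2533]  # hypothetical id; see the per-model table above
mlp = model.model.layers[SINK_LAYER].mlp

def patch(no_sink_value: float):
    def hook_fn(mod, args):
        x = args[0].clone()
        x[..., SINK_NEURONS] = no_sink_value  # clamp the sink neurons
        return (x,)
    return mlp.down_proj.register_forward_pre_hook(hook_fn)

h = patch(0.0)  # 0.0 stands in for the measured 'no-sink' activation
ids = tok("Repeat the word 'the' forever: " + "the " * 64, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40, do_sample=False)
print(tok.decode(out[0][ids["input_ids"].shape[1]:]))
h.remove()
```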

Categories

research