26 November 2025

On the Role of Attention Heads in Large Language Model Safety

๐Ÿฅˆ

Review

| Nickname | One-line review | Rating (out of 5) |
| --- | --- | --- |
| MNG | Safety collapsing from differences in a few attention heads seems to be an inherent property of LLMs. The head-analysis method would be worth referencing in other studies too! | 3 |
| 오차즈케 | Other work argues that certain heads cause confusion in the output and should be deactivated, while this paper says there exist heads responsible for safety; the claims conflict head-on, which is interesting. As in that case, more analysis of what side effects turning off specific heads has on capabilities unrelated to safety would be welcome. | 4 |
| 방어냥냥 | Whoever invented attention is a genius, right? It made me think that beyond safety, all sorts of language-modeling issues could be explored through attention heads. | 3.8 |
| 42REN | Interesting that safety and interpretability can be evaluated through attention heads. On the other hand, they say ablating a head degrades token selection; isn't that partly just the model losing its ability to pick out important tokens? | 4.5 |
| 야키토리 | I had never considered that safety might be concentrated in a few attention heads, but on reflection it makes sense given how transformers work! The experiments showing the vulnerability that manipulating only a few heads easily breaks safety are memorable. | 4.5 |
| 텀블러 | Good insight as an attention-head study, but from an adversarial-attack standpoint the attack only works on open-source models, so broad applicability is limited. Attacking Llama is fine, but there is a reason prompt-based red teaming that can also hit ChatGPT or Claude remains mainstream. Novelty is excellent, though (the kind of study everyone considers once but never actually tries). | 4.3 |
| 감자 | Explores attention heads from a safety angle, with a good variety of experiments. The method does not look safety-specific, so it seems usable elsewhere too. | 4 |
| 새우깡 | Head removal is empirical, but the motivation itself is good. Proposing a way to find the minimal group of heads that actually preserve safety is impressive. | 4 |

TL;DR

๐Ÿ’ก

LLM ์•ˆ์ „์„ฑ์€ ์‚ฌ์‹ค ์†Œ์ˆ˜์˜ attention head ์— ์ง‘์ค‘๋˜์–ด ์žˆ์–ด์„œ, ๊ทธ head๋“ค๋งŒ ์‚ด์ง ๊บผ๋„ ๐Ÿšจ ์•ˆ์ •์„ฑ์ด ๋ฐ”๋กœ ๋ฌด๋„ˆ์ง„๋‹ค๋Š” ๊ฑธ ๋ฐํž˜ ๐Ÿ” ShipsยทSahara๋กœ ์–ด๋–ค head๊ฐ€ ์ง„์งœ safety ๋‹ด๋‹น์ธ์ง€ ์ฐพ์•„๋‚ด๋Š” ๋ฐฉ๋ฒ•์„ ์ œ์•ˆํ•จ โš™๏ธ๐Ÿ”ฅ

Summary

  • Authors: Alibaba, University of Science and Technology of China, Tsinghua University, Nanyang Technological University

Main Idea

Find the connection between the standard attention mechanism and safety capability, and explore interpretability for safety!

Background & Motivation

  • LLM์˜ safety
    • harmful query์— ๋Œ€ํ•ด ๋‹ต๋ณ€์„ ๊ฑฐ์ ˆํ•˜๋„๋ก alignment ๋˜์–ด ์žˆ์Œ (๊ทธ๋ฆผ ์™ผ์ชฝ)
      • e.g. โ€˜I cannotโ€™ or โ€˜As a responsible AI assistantโ€™ ๋“ฑ์˜ rejection token ์‚ฌ์šฉ
    • but, ํŠน์ • token์˜ ํ™•๋ฅ ๋ถ„ํฌ๋ฅผ ์กฐ์ •ํ•˜๋ฉด (Jailbreak Attack) safety์— ์ทจ์•ฝํ•ด์ ธ์„œ harmful query์—๋„ ๋‹ต๋ณ€ํ•˜๊ฒŒ ๋จ (๊ทธ๋ฆผ ์˜ค๋ฅธ์ชฝ)
      • โ€˜I cannotโ€™, โ€˜As a responsible AI assistantโ€™ ๋“ฑ์˜ rejection token์„ ๋‚ฎ์ถ”๊ฑฐ๋‚˜
      • โ€œSureโ€, โ€œHere isโ€ฆโ€ ๋“ฑ์˜ affirmative tokens์„ ๋†’ํžˆ๋Š” ๊ฒƒ
  • ๊ธฐ์กด LLM safety ๊ด€๋ จ ๋…ผ๋ฌธ์€ ์ฃผ๋กœ features, neurons, layers, parameters ๊ด€์ ์—์„œ ์ˆ˜ํ–‰๋จ
    • e.g. ์–ด๋–ค neuron์ด safety๋ฅผ ๋‹ด๋‹นํ•˜๋Š”์ง€
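
To make the jailbreak idea above concrete, here is a toy Python sketch (my own illustration, not from the paper) that biases the next-token logits away from rejection tokens and toward affirmative ones; the token strings, the `bias` value, and the Hugging Face-style `tokenizer` are assumptions:

```python
import torch

def bias_safety_tokens(logits, tokenizer, bias=10.0):
    """Toy jailbreak: shift probability mass away from rejection
    tokens and toward affirmative ones before sampling."""
    # Phrases and bias value are illustrative assumptions.
    for phrase, sign in [("I cannot", -1.0), ("Sure", 1.0), ("Here is", 1.0)]:
        for tok_id in tokenizer.encode(phrase, add_special_tokens=False):
            logits[tok_id] += sign * bias  # raise or lower that token
    return torch.softmax(logits, dim=-1)
```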

⇒ Let's analyze safety from the multi-head attention (MHA) perspective!
In other words, let's quantitatively find the heads with the largest influence on safety (= safety parameters)!

  • Why MHA?
    : Because it plays a central role in capturing features from the input sequence.

Contributions (What theyโ€™ve revealed)

  • ⚙️ Settings
    • Uses a modified multi-head attention formulation that makes head ablation easy
      • annotations
        • W_q, W_k, W_v: Query, Key, and Value matrices
        • h_i: the i-th attention head
      • Intuitively, head ablation == setting that head's output to 0, but the paper performs head ablation with two methods (see the sketch after this settings list)
        • Undifferentiated Attention: multiply Q or K (or both) by a very small coefficient ε, so that the head's attention weights become nearly uniform (averaged) over all tokens

          ⇒ i.e., it degrades the head's token selection, its ability to judge which tokens matter

          (In the paper's attention-map figure, every token ends up almost the same color!)

        • Scaling Contribution: keep the attention weights as they are, and shrink the head output V by ε, so that only this head's contribution becomes very weak

          ⇒ i.e., it suppresses that head's output

    • backbone LLMs: Llama-2-7b-chat, Vicuna-7b-v1.5
    • datasets: AdvBench, JailbreakBench, MaliciousInstruct
    • metric: attack success rate (ASR)

      : higher means less safe

    • LLM input settings
      • template: put the query into a chat template and generate the answer
      • direct: feed the query alone, without a template
    • decoding settings
      • vanilla: original decoding
      • greedy: greedy decoding
      • top5: top-5 sampling decoding
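
To make the two ablation modes concrete, below is a minimal single-head PyTorch sketch (my own illustration under assumed shapes, not the authors' code); `head_with_ablation` is a hypothetical helper and `eps` plays the role of the small coefficient ε:

```python
import torch
import torch.nn.functional as F

def head_with_ablation(x, W_q, W_k, W_v, mode=None, eps=1e-4):
    """One attention head with the paper's two ablation modes.

    x: (seq_len, d_model); W_q, W_k, W_v: (d_model, d_head).
    mode: None (intact head), "undifferentiated", or "scaling".
    """
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_head = Q.shape[-1]

    if mode == "undifferentiated":
        # Shrink Q: every pre-softmax score approaches 0, the softmax
        # becomes ~uniform over tokens, and the head loses its
        # token-selection ability.
        Q = Q * eps

    weights = F.softmax(Q @ K.T / d_head**0.5, dim=-1)
    out = weights @ V

    if mode == "scaling":
        # Keep the attention pattern intact but scale the output,
        # so this head's contribution becomes negligible.
        out = out * eps
    return out
```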
  • 1. For safety interpretability research, reveals for the first time that safety-specific attention heads exist

    : results of removing the safety-specific attention head (i.e., the head with the highest Ships score) with the two methods (Undifferentiated Attention, Scaling Contribution):

    • Llama-2-7b-chat: removing layer 2, head 26 raised the average ASR from 0.04 to 0.64 (16×)
    • Vicuna-7b-v1.5: removing a few heads raised the average ASR from 0.27 to 0.55 (2×)
    • Undifferentiated Attention weakens safety far more than Scaling Contribution.

  • 2. To show the safety impact of attention heads, proposes Ships (Safety Head ImPortant Score) & Sahara (Safety Attention Head AttRibution Algorithm)
    • What is the Ships score?
      • A score indicating how important each attention head is for safety
      • annotations
        • q_H: a harmful query that should be refused
        • θ_{h_i^l}: the target attention head (head i in layer l)
        • θ_O: the original model parameters
        • D_KL: the Kullback-Leibler divergence
        • θ_O \ θ_{h_i^l}: the original model with the target attention head removed
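
Putting these annotations together, the per-query score plausibly compares the output distributions of the original and head-ablated models on q_H, along these lines (my reconstruction; see the paper for the exact definition):

```latex
\mathrm{Ships}(h_i^l)
  = D_{KL}\big(
      P(\,\cdot \mid q_H;\ \theta_O)
      \;\big\|\;
      P(\,\cdot \mid q_H;\ \theta_O \backslash \theta_{h_i^l})
    \big)
```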

      ⇒ A generalized version of the Ships score is used to find safety-critical attention heads

      That is, attention-head ablation is evaluated at the dataset level

      • How is it computed? (a code sketch follows the annotations below)
        1. Stack the top-layer activations a for the harmful-query dataset Q_H into a matrix M
        2. Obtain the left singular matrix U_θ via Singular Value Decomposition, SVD(M) = UΣV^T
        3. Obtain U_A from the model with attention head h_i^l ablated

        ⇒ Based on the observation that the larger the principal angles between the subspaces from steps 2 and 3, the more that head is involved in safety

      • annotations
        • σ_r: the r-th singular value
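
A rough PyTorch sketch of this dataset-level score (steps 1-3 plus the principal-angle comparison); the function name, the number of kept directions `k`, and the final aggregation (mean angle) are my assumptions:

```python
import torch

def generalized_ships(acts_original, acts_ablated, k=10):
    """Compare top-layer activation subspaces on the harmful-query
    set Q_H before vs. after ablating one head.

    acts_*: (num_queries, d_model) activation matrices (the rows of M).
    """
    # Steps 1-2: left singular vectors of M for the original model.
    U_theta, _, _ = torch.linalg.svd(acts_original.T, full_matrices=False)
    # Step 3: same for the head-ablated model.
    U_A, _, _ = torch.linalg.svd(acts_ablated.T, full_matrices=False)

    # The singular values sigma_r of U_theta^T @ U_A (restricted to the
    # top-k directions) are the cosines of the principal angles.
    sigma = torch.linalg.svdvals(U_theta[:, :k].T @ U_A[:, :k])
    angles = torch.arccos(sigma.clamp(-1.0, 1.0))

    # Larger angles => ablating the head rotated the safety-relevant
    # subspace more => the head matters more for safety.
    return angles.mean()
```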
    • What is Sahara?
      • A heuristic algorithm for finding head groups that induce safety degradation
        • Based on experiments, the head-group size is capped below 5

          ⇒ more fine-grained & efficient than previous work

        • Algorithm summary: ablate each candidate group and check the resulting ASR (a greedy sketch follows below)
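
As a rough illustration of that search, here is a greedy Python sketch; `ablate_and_score` (a callable returning an ASR-style degradation score) and the greedy strategy itself are my assumptions, since the paper's actual procedure is Ships-guided:

```python
def sahara_sketch(all_heads, ablate_and_score, max_group_size=4):
    """Greedily grow a head group whose ablation maximally degrades
    safety, stopping at the paper's size cap (< 5).

    all_heads: list of (layer, head) index pairs.
    ablate_and_score: assumed harness that ablates a group and
    returns e.g. the resulting ASR (higher = less safe).
    """
    group, best = [], 0.0
    for _ in range(max_group_size):
        remaining = [h for h in all_heads if h not in group]
        candidate = max(remaining, key=lambda h: ablate_and_score(group + [h]))
        score = ablate_and_score(group + [candidate])
        if score <= best:
            break  # no remaining head degrades safety further
        group.append(candidate)
        best = score
    return group, best
```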
    • Results
      • Undifferentiated Attention > Scaling Contribution
      • Removing the attention heads found by Sahara increases ASR
      • Safety functionality is not spread evenly across the model; it is concentrated in specific layers and heads
  • 3. By analyzing the importance of the standard multi-head attention mechanism from the LLM-safety perspective, helps mitigate concerns about LLM risk and contributes to transparency
    • How much do the safety heads of Vicuna and Llama overlap?

      ⇒ Safety heads are already largely formed during pre-training and are preserved through subsequent chat-tuning (Vicuna-style instruction tuning)

    • What about the helpfulness-harmlessness trade-off?
      • In zero-shot experiments on Llama-2-7b-chat with its safety heads ablated,

        ASR increases sharply (safety collapses), while the zero-shot performance drop is comparable to or smaller than that of general pruning methods (SparseGPT, Wanda).

      ⇒ In other words, safety heads function "mainly for safety," and their superposition with general language ability appears relatively small.

Categories

research