21 January 2026

LLMs Encode Harmfulness and Refusal Separately

💡LLMs encode an instruction's harmfulness and the decision to refuse it in different latent spaces!

Review

Nickname | One-line review | Rating (out of 5)
Egg Sushi | From problem statement to hypothesis to experiments, the paper is logical and thorough. It was a fun read throughout! The two latent spaces are clearly distinct! What other pairs of different-but-related roles are encoded in separate spaces like this? Which space would be tied to factuality? | 4.5
Maenggu | Did anyone designing LLMs expect a result like this? Lately the field feels like it is becoming a science of observing phenomena and interpreting their causes. Building LLMs was engineering, but recent work seems to start from "why?". It made me want to do research with that mindset. As in this paper's result, people also seem to treat "is it harmful" and "should I refuse" as different questions. Human intuition and value judgment seem higher-level than I thought. | 4.3
Hamburger | When reading jailbreak or attack papers I took it for granted that harmfulness and refusal were one concept, so separating them is new to me. As in other papers, this one also observes that steering works better at middle layers, so there really does seem to be an optimal layer for a given purpose. | 4.4
Pizza | The novelty lies in systematically quantifying harmfulness and refusal when looking at LLM jailbreaks. Beyond whether a jailbreak succeeds or fails, splitting it into harmfulness and refusal and analyzing both through hidden states and vector space is the impressive part. | 4.6
Chicken | As time goes on, more and more seems to get elicited. In about five years there will probably be many concepts that make us ask "why did we think that back then?" Also, I personally think the Contribution figure is very well drawn. | 4.6
Febreze | The reply-inversion experiment convinced me that the model perceives harmfulness and refusal differently. The experimental design is especially clean and seems to resolve most of the questions a reader might have. | 4.3
Gukbap | The interpretation that some jailbreaks work not by "making the model mistakenly believe the request is harmless" but by "lowering only the refusal signal" is refreshing. Until now I assumed a successful jailbreak meant the model was fooled into thinking the request was harmless, but internally it already carries a signal that the request is dangerous! | 4.5

TL;DR

💡

LLMs encode an instruction's harmfulness and the decision to refuse it in different latent spaces!

Authors: Northeastern University, Stanford University

Summary

Motivation

  • In LLM safety, even when a model is trained to refuse harmful instructions, it can still be broken through (jailbreaking) or refuse too much (over-refusal).
    • Why does this happen? Does the LLM actually know that an instruction is harmful?
  • Prior work showed that LLMs decide whether to refuse in a specific latent space, but did not study whether that decision is merged with the instruction's harmfulness or separate from it.
    • The common assumption was: if the model refused, it must have refused because the request was bad.

Contribution

  • Shows that, given an instruction, the model encodes harmfulness and the refusal decision separately
    • Harmfulness is determined at the last token of the instruction; refusal is determined at the last token of the full input sequence
  • Proposes Latent Guard, which uses the harmfulness direction to block jailbreaks
    • Without any fine-tuning, it outperforms the fine-tuned Llama Guard

Experimental Setup

  • Setup for the experiments that probe harmfulness and refusal
  • Model: the instruction-tuned models Llama-2-chat-7B, Llama3-Instruct-8B, Qwen-2-Instruct-7B
  • Prompt: instruct models use a special instruction template (e.g. [INST]{user instruction}[/INST])
    • [/INST] is called the post-inst token
  • Hidden state: analyze the hidden states at $t_{inst}$, the last position of the user instruction, and at $t_{post-inst}$, the last position of the input sequence
    • Refusal is usually decided at $t_{post-inst}$
  • Dataset: AdvBench for harmful instructions that get refused, Alpaca for harmless instructions that get accepted, and XSTest for over-refusal, where harmless instructions are treated as harmful
  • Jailbreak method: adversarial suffixes, persuasion, and adversarial prompting templates
  • Refusal rate: a response counts as a refusal if the model generates specific phrases such as "Sorry, I cannot"
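
The phrase-matching refusal rate described above can be sketched as follows. This is a minimal illustration, not the paper's evaluation code: the marker list `REFUSAL_MARKERS` and the function names are assumptions made for the example, since the notes only mention "Sorry, I cannot" as one such phrase.

```python
# Hypothetical marker list; the notes only give "Sorry, I cannot" as an example.
REFUSAL_MARKERS = ("sorry", "i cannot", "i can't", "i am unable")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker (case-insensitive)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: list[str]) -> float:
    """Fraction of responses classified as refusals; 0.0 for an empty list."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


if __name__ == "__main__":
    sample = ["Sorry, I cannot help with that.", "Sure, here is a recipe."]
    print(refusal_rate(sample))
```

Substring matching is cheap but brittle: a response that apologizes and then complies would still be counted as a refusal under this sketch.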

Decoupling Harmfulness from Refusal

Removing post-instruction tokens weakens refusal abilities

  • Removing the post-instruction tokens ($t_{post-inst}$) lowers the refusal rate substantially
    • The refusal signal may be weak before these tokens
    • Refusal depends strongly on $t_{post-inst}$!
  • Then what is encoded at $t_{inst}$? Let's analyze it
    • Hypothesis: $t_{inst}$ encodes harmfulness, and $t_{post-inst}$ encodes the refusal decision!

Hidden states cluster by harmfulness at $t_{inst}$, and by refusal at $t_{post-inst}$

  • For harmful and harmless instructions, look at which clusters the hidden states at $t_{inst}$ and $t_{post-inst}$ form
    • Four cases are analyzed: harmful instructions that are refused or accepted, and harmless instructions that are refused or accepted
  • Average the hidden states of refused harmful instructions to get $C_{refused\ harmful}$,
    and average the hidden states of accepted harmless instructions to get $C_{accepted\ harmless}$
  • Then use cosine similarity to decide whether the hidden states of accepted harmful instructions and of refused harmless instructions are closer to $C_{refused\ harmful}$ or to $C_{accepted\ harmless}$
    • Same harmfulness, different refusal behavior, yet the same cluster → the position encodes harmfulness!
    • Different harmfulness, same refusal behavior, and the same cluster → the position encodes the refusal decision!
  • Across all models and all layers, harmfulness tends to be more decisive for clustering at $t_{inst}$, and the refusal decision more decisive at $t_{post-inst}$
  • $t_{inst}$ encodes harmfulness, and $t_{post-inst}$ encodes the refusal decision! (hypothesis confirmed)
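
The cluster test above can be sketched in plain Python. This is a toy illustration under stated assumptions: the 2-dimensional vectors stand in for real hidden states, and the helper names (`mean_vector`, `cosine`, `nearest_cluster`) are hypothetical, not taken from the paper.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (the cluster center)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_cluster(h, c_refused_harmful, c_accepted_harmless):
    """Name the center that hidden state h is closer to by cosine similarity."""
    if cosine(h, c_refused_harmful) > cosine(h, c_accepted_harmless):
        return "refused_harmful"
    return "accepted_harmless"


if __name__ == "__main__":
    # Toy stand-ins for hidden states at one layer and one token position.
    c_rh = mean_vector([[1.0, 0.1], [0.9, 0.0]])
    c_ah = mean_vector([[0.0, 1.0], [0.1, 0.9]])
    # A hypothetical "accepted harmful" hidden state.
    print(nearest_cluster([0.8, 0.2], c_rh, c_ah))
```

If accepted harmful instructions land near the refused harmful center at a given position, that position is tracking harmfulness rather than the refusal outcome.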

Correlation between beliefs of harmfulness and refusal

  • Cluster the $t_{inst}$ hidden states of harmful and harmless instructions, define the cluster centers as $\mu_{harmful}^{l, t_{inst}}$ and $\mu_{harmless}^{l, t_{inst}}$, and define $\Delta_{harmful}$ as a measure, taken across all layers, of whether a hidden state is closer to the harmful or to the harmless cluster
  • Likewise, define $\Delta_{refuse}$ from the $t_{post-inst}$ hidden states of refused and accepted instructions
  • $\Delta_{harmful}$ and $\Delta_{refuse}$ are the model's beliefs about harmfulness and about refusal!
  • Testing with instructions from each category of the datasets shows that this works well in practice
    • Refused instructions have $\Delta_{refuse}$ above 0, and harmless instructions have $\Delta_{harmful}$ below 0
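
One way to turn the definitions above into a number is sketched below. The exact formula is an assumption: here the belief score is the per-layer difference of cosine similarities to the two centers, averaged over layers, which matches the description "closer to one cluster or the other, across all layers" but may differ from the paper in detail. The name `belief_score` is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def belief_score(hidden_by_layer, mu_pos_by_layer, mu_neg_by_layer):
    """Assumed form of Delta: mean over layers of
    cos(h^l, mu_pos^l) - cos(h^l, mu_neg^l).

    Positive means the hidden states sit closer to the positive cluster
    (harmful, or refused); negative means closer to the other cluster.
    """
    diffs = [
        cosine(h, mu_pos) - cosine(h, mu_neg)
        for h, mu_pos, mu_neg in zip(hidden_by_layer, mu_pos_by_layer, mu_neg_by_layer)
    ]
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    # Two toy layers; centers would come from t_inst hidden states for Delta_harmful
    # and from t_post-inst hidden states for Delta_refuse.
    mu_harmful = [[1.0, 0.0], [1.0, 0.0]]
    mu_harmless = [[0.0, 1.0], [0.0, 1.0]]
    print(belief_score([[0.9, 0.1], [0.7, 0.3]], mu_harmful, mu_harmless))
```

The same function serves for both scores; only the token position and the pair of cluster centers change.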

Eliciting refusal with harmfulness directions

  • In the vector space, the harmfulness direction is obtained as the difference between the cluster centers
    • $v_{harmful}^{l} = \mu_{harmful}^{l, t_{inst}} - \mu_{harmless}^{l, t_{inst}}$
  • The refusal direction is extracted the same way
    • $v_{refuse}^{l} = \mu_{refusal}^{l, t_{post-inst}} - \mu_{accept}^{l, t_{post-inst}}$
  • At each layer, add the harmfulness and refusal vectors at $t_{inst}$ and $t_{post-inst}$ (steering) and observe how the model's behavior changes
  • For harmless instructions, adding either harmfulness or refusal raises the refusal rate, and the effect is strongest at middle layers
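
The difference-of-means direction and the additive steering step can be sketched as below. This is a toy illustration: the scaling factor `alpha` and the function names are assumptions for the example, and a real experiment would apply the update to a model's hidden states at a chosen layer and token position rather than to small lists.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def difference_of_means(positive, negative):
    """Direction v = mean(positive) - mean(negative), e.g. harmful minus harmless."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - n for p, n in zip(mu_pos, mu_neg)]


def steer(hidden, direction, alpha=1.0):
    """Additive steering: h' = h + alpha * v.

    alpha is a hypothetical strength knob; a negative alpha steers in the
    reverse direction.
    """
    return [h + alpha * d for h, d in zip(hidden, direction)]


if __name__ == "__main__":
    harmful_states = [[2.0, 2.0], [4.0, 4.0]]
    harmless_states = [[1.0, 0.0], [1.0, 2.0]]
    v_harmful = difference_of_means(harmful_states, harmless_states)
    print(v_harmful)                           # [2.0, 2.0]
    print(steer([0.0, 0.0], v_harmful, 0.5))   # [1.0, 1.0]
```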

Causally separating the harmfulness direction and the refusal direction

  • To find out what the model internally believes when harmfulness or refusal is added, run a reply-inversion task in which the model must answer No to harmless instructions
    • The goal is to tell whether the model simply says No (refusal) or first judges harmfulness and then answers No or Yes!
  • In this task, answering No (the model judges the instruction harmless) is counted toward the refusal rate
    • High refusal rate → "this is harmless!" (the model's belief)
  • (a)
    • Steering along the harmfulness direction makes the model believe the instruction is harmful! (orange)
    • Raising refusal makes the model say No more often, and lowering it makes the model say Certainly more often
  • (b)
    • Steering against the harmfulness direction increases the rate at which the model says No (orange)
    • Lowering refusal makes the model say only Certainly (blue)
  • With the replies inverted (saying No to harmless instructions), steering toward harmfulness has an effect similar to steering toward acceptance!
    • The model represents harmfulness and refusal separately, and the refusal direction only decides between No and Yes

Analyzing Jailbreak via Harmfulness

  • Analyze $\Delta_{harmful}$ and $\Delta_{refuse}$ for each jailbreak method
  • The attacks lower the refusal belief, but templates and some persuasion prompts fail to fool the harmfulness belief
    • Well-crafted persuasion seems to be the truly dangerous one..?

Developing a Latent Guard Model with Harmfulness Representations

  • Proposes Latent Guard, a simple classifier that accepts when $\Delta_{harmful}$ is negative and refuses when it is positive
    • It is very simple and gives its verdict before generation, because it judges from the internal hidden state
  • It outperforms the fine-tuned Llama Guard 3
    • Qwen 3 seems not to have learned harmfulness properly for template attacks?
  • Models break easily when fine-tuned on a small amount of adversarial data, but this has little effect on the internal harmfulness representation
  • Because Latent Guard relies on the model's internal belief about harmfulness, it stays robust even under such fine-tuning attacks!
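
The decision rule described above can be sketched as a small class. This is an illustration under assumptions, not the paper's implementation: the class name `LatentGuard`, the layer-averaged cosine score, and the choice to accept when the score is exactly 0 are all made up for the example; the notes only specify "negative → accept, positive → refuse".

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class LatentGuard:
    """Toy guard: refuse when the harmfulness belief score is positive."""

    def __init__(self, mu_harmful_by_layer, mu_harmless_by_layer, threshold=0.0):
        self.mu_harmful = mu_harmful_by_layer
        self.mu_harmless = mu_harmless_by_layer
        self.threshold = threshold  # 0.0 mirrors the sign rule in the notes

    def score(self, hidden_by_layer):
        """Assumed Delta_harmful: layer-averaged cosine difference at t_inst."""
        diffs = [
            cosine(h, mu_h) - cosine(h, mu_l)
            for h, mu_h, mu_l in zip(hidden_by_layer, self.mu_harmful, self.mu_harmless)
        ]
        return sum(diffs) / len(diffs)

    def decide(self, hidden_by_layer):
        """Return 'refuse' for a positive score, otherwise 'accept' (ties accept)."""
        return "refuse" if self.score(hidden_by_layer) > self.threshold else "accept"


if __name__ == "__main__":
    guard = LatentGuard([[1.0, 0.0]], [[0.0, 1.0]])
    print(guard.decide([[0.9, 0.1]]))  # refuse
    print(guard.decide([[0.1, 0.9]]))  # accept
```

Since the score needs only the hidden states of the prompt, the verdict is available before any token is generated, which is the property the notes highlight.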
