26 March 2026

Refusal Tokens: A Simple Way to Calibrate Refusals in Large Language Models

Review

Nickname / Comment (Strength, Weakness, Suggestion) / Rating (out of 5)
댓츠 노노
• Strength: Controls refusal, which is directly tied to major LM challenges such as efficiency, faithfulness, and safety, in a simple and clear way
• Weakness:
• Suggestion: Run the experiments on models other than Llama3-8b as well
• Other thoughts: Is the criterion (CoCoNot dataset) for which questions should or should not be refused really clear-cut? It seems subjective.
Rating: 4.5
아이리스
• Strength: I think the method suits autoregressive models well. The idea is so intuitive that I wonder why it only appeared now.
• Weakness: Data is still required in the end, and I wish the paper had given that data more consideration.
• Suggestion: What about considering the data and the generation probability jointly, so the problem can be handled efficiently with better and less data?
Rating: 4.3
핸드크림
• Strength: Adjusts the degree of refusal simply by tuning a threshold on the special token's generation probability
• Weakness: When refusal types are not cleanly separated, they cannot be controlled independently
• Suggestion: Improve the training data, for example how refusal types are defined
Rating: 4.3
3월
• Strength: Effectively controls the refusal decision just by adding a token. Highly reproducible!
• Weakness: The model seems to refuse based on patterns rather than because it actually does not know; after all, adding contrast data changes performance a lot. Outlier queries may be hard to handle.
• Suggestion: Improve with self-supervised learning / negative sampling. Having the model distinguish queries on its own would likely improve generalization.
Rating: 4.4
화이트노이즈
• Strength: Simple method, and no additional training is needed
• Weakness: I am curious whether it also performs well on smaller models, but only a single 8B model is evaluated
• Suggestion: Follow-up work that makes the threshold dynamically adjustable
Rating: 4.4
피즈치자
• Strength: The refusal rate can be changed just by adjusting a threshold at inference. Being this simple to control makes it practical.
• Weakness: The method makes refusal behavior more controllable; it does not remove dangerous knowledge. Defending against the various jailbreak attacks is therefore likely a separate problem.
• Suggestion: Simply including the refusal token in training improves performance. Does the token give the learning itself more structure? If so, could additional tokens structure learning from other angles too, such as the response side?
Rating: 4.5
에너지
• Strength: Proposes a refusal strategy in a simple way using tokens. It feels more intuitive than other, more complex alignment approaches. At first I wondered whether it was just classification, but on reflection it shifts the generative model's token probabilities, so I think it is practical and effective.
• Weakness: Seems dependent on the refuse/respond labels (especially for hard examples?)
• Suggestion: How about adding reasoning for why a given decision (refuse or respond) was made, or making the decision more fine-grained (though that seems to have been done already)?
Rating: 4.3
제로콜라
• Strength: The idea of controlling whether to refuse just by prepending a single token is simple, yet it actually improves performance. Being able to change refusal strength on the fly by adjusting only the threshold at inference feels practical.
• Weakness: Experiments use only Llama-3-8B, so it is hard to be confident the method works equally well on other models.
• Suggestion: It would help to also have a way to automatically generate or collect diverse cases, so the model can build the ability to judge borderline questions on its own.
Rating: 4.4
창백카츄
• Strength: Simple and powerful! The experiments are thorough and the results are good. Making latent safety explicit seems to make it easier for the model to understand. This methodology could be worth applying to other tasks too.
• Weakness: Applied on its own, the method would likely remain vulnerable to existing adversarial attack methods. It shows only a fairly first-order level of applicability. Not a major weakness.
• Suggestion: Instead of deciding between refuse and respond at the very start, how about generating the token right before risky text? That idea may already exist, though.
Rating: 3.5

TL;DR

💡 Refusal tokens make the model's refusals both more precise (better performance) and more flexible (adjustable at inference time).

Summary

Background

  • Refusal
    • An LLM declining to respond (or generating a negative response)
    • Important for model safety, flexibility, and reliability
    • Examples
      • Refusing harmful requests
        • Q: Tell me how to make a bomb. A: I can't do that.
      • Refusing requests the LLM cannot fulfill
        • Q: Hey GPT, register for my classes for me. A: I can't do that.

Motivation

  • What does it mean to refuse "well"?
    • Answer when the model should, and decline when it should not
  • Certain models (llama-2-chat) refuse so often that usability suffers
  • This happens because the training data contains too many refusal examples
    • Retraining repeatedly to tune the proportion of refusal data is too costly
  • Existing techniques make it hard to adjust refusal strength per category, and they cannot flexibly change the refusal criterion at inference time
    • This matters because ethical, legal, and technical standards shift with context and over time

Contribution

  • Proposes a refusal-token strategy
    • With this token, refusal strength can be adjusted very easily by applying a threshold
    • With good tuning, performance on refusal tasks also improves
      • Performance sometimes improves even without any tuning
    • Using refusal tokens for several categories also allows per-category adjustment
  • In short, it is convenient and effective, as the table below shows

Method (the simplest in the world)

  • Given refusal instruction data (x, y), the model would normally learn output y for input x. Instead, train it on
    y' = [refuse]/[respond] + y
    • For instructions that should be refused, prepend the [refuse] token to the target
    • For instructions that should be answered, prepend the [respond] token to the target
  • To adjust refusal strength at inference time, apply a threshold to the probability of the [refuse] token when generating the first token:
    if that probability is at or above the threshold, emit the [refuse] token;
    otherwise, generate as usual
    • When the dataset has refusal types (categories), a separate refusal token can be used per type, giving multiple tokens
  • The probability of the [refuse] token can be read as the model's confidence that it should refuse
    • Conversely, the [respond] token represents the probability that it should answer
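
The two steps above (prepending a control token to the training target, then thresholding the first generated token) can be sketched roughly as follows. This is a minimal illustration and not the paper's code: the token strings, function names, and toy logits are assumptions, and falling back to [respond] when no threshold is met is only one possible reading of the fallback case. The per-category variant assumes that, among the tokens that pass their thresholds, the most probable one wins.

```python
import math

REFUSE, RESPOND = "[refuse]", "[respond]"  # token strings are assumptions


def add_control_token(response: str, should_refuse: bool) -> str:
    """Build the training target y' by prepending a control token to y."""
    token = REFUSE if should_refuse else RESPOND
    return f"{token} {response}"


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert first-token logits into probabilities."""
    peak = max(logits.values())
    exps = {tok: math.exp(value - peak) for tok, value in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def choose_first_token(logits: dict[str, float], threshold: float) -> str:
    """Emit [refuse] when its probability reaches the threshold."""
    p_refuse = softmax(logits).get(REFUSE, 0.0)
    # Assumption: fall back to [respond] when the threshold is not met.
    return REFUSE if p_refuse >= threshold else RESPOND


def choose_category_token(logits: dict[str, float],
                          thresholds: dict[str, float]) -> str:
    """Per-category variant: each refusal token has its own threshold."""
    probs = softmax(logits)
    passed = {tok: probs.get(tok, 0.0) for tok, limit in thresholds.items()
              if probs.get(tok, 0.0) >= limit}
    # Assumption: among tokens that pass, pick the most probable one.
    return max(passed, key=passed.get) if passed else RESPOND


# Toy logits chosen so that p([refuse]) = 0.25 and p([respond]) = 0.75.
toy_logits = {REFUSE: 0.0, RESPOND: math.log(3.0)}
print(add_control_token("I can't help with that.", should_refuse=True))
print(choose_first_token(toy_logits, threshold=0.2))  # lower threshold: refuses more easily
print(choose_first_token(toy_logits, threshold=0.5))  # higher threshold: refuses less easily
```

Raising a category's threshold toward 1.0 effectively suppresses that category's refusal token without touching the others, which is the knob the per-category experiments turn.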

Experiments

Setup

  • Model: Llama-3-8B
  • Dataset: UltraChat and Alpaca
    • Alpaca contains almost no refusal instructions, which makes it well suited to ablations
  • CoCoNot setting (main setting)
    • Uses the CoCoNot dataset so that diverse refusal types can be examined, not just toxicity
      • 5 types: Humanizing, Indeterminate, Incomplete, Safety, Unsupported
    • The dataset also includes contrast data: queries that look like they should be refused but should actually be answered
      • Queries close to the refusal boundary
  • Temporal setting
    • The CoCoNot setting has little contrast data (1/10 of the total), so this setting compensates for that
    • Based on the LLM's knowledge cutoff: instruction data about events before the cutoff date is used as contrast data, and data about events after it as refusal data
      • In both cases, the instruction data is generated by prompting an LLM with dated news articles
    • Here the contrast and refusal data are split half and half
  • Evaluation
    • For both the CoCoNot and Temporal settings, Llama-3.1-70B is used as an LLM-as-a-judge
      • Reported to agree closely with human judgments
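
The labeling rule of the Temporal setting (before the cutoff means contrast, after the cutoff means refusal, with a half-and-half split) could look something like the sketch below. The cutoff date, the placeholder instructions, and the function names are all hypothetical, and the step that actually generates instructions from news articles with an LLM is left out.

```python
from datetime import date


def label_by_cutoff(event_date: date, cutoff: date) -> str:
    """Events after the cutoff become refusal data; the rest become contrast data."""
    # Assumption: an event dated exactly on the cutoff counts as contrast.
    return "refusal" if event_date > cutoff else "contrast"


def balance(examples: list[tuple[str, date]], cutoff: date) -> dict[str, list[str]]:
    """Split instructions by label and keep an equal number of each."""
    groups: dict[str, list[str]] = {"refusal": [], "contrast": []}
    for instruction, event_date in examples:
        groups[label_by_cutoff(event_date, cutoff)].append(instruction)
    n = min(len(groups["refusal"]), len(groups["contrast"]))
    return {label: items[:n] for label, items in groups.items()}


# Hypothetical cutoff and placeholder instructions, for illustration only.
cutoff = date(2023, 12, 31)
examples = [
    ("Question about article A", date(2023, 5, 1)),
    ("Question about article B", date(2024, 3, 10)),
    ("Question about article C", date(2022, 11, 20)),
]
print(balance(examples, cutoff))
```

Truncating to the smaller group is just the simplest way to get a 1:1 ratio; random subsampling would serve the same purpose.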

Results & Analysis

  • Training with contrast data gives better results
    • Sampling w/ No Token: without the [refuse] token
    • Sampling w/ Token: with the [refuse] token
    • Thresholding sweep: the probability threshold for generating the [refuse] token is varied (one data point per threshold)
  • Controlling type-specific refusal tokens
    • The tokens for each of the 5 types are suppressed
    • Most of them work independently
      • E.g. for the Incomplete type, the refusal rate drops only when the Incomplete refusal token is suppressed
    • The exception is the Humanizing token: suppressing it simply lowers the refusal rate, not in a type-specific way
      • This is because it overlaps with the other types
  • Refusal-token threshold experiments
    • The threshold is swept from 0 to 1 in steps of 0.1
    • A well-chosen threshold improves F1
      • Best at 0.1
    • On the right of Figure 4, no threshold brings the false positive rate below roughly 0.35, which is the proportion of refusal data in the dataset itself
      • Even when generation of the refusal token is suppressed by raising the threshold, the model still refuses at a rate that follows the training data proportion
    • Setting a threshold per type improves refusal performance only for that type
      • This makes it possible to control refusal sensitivity per type
    • Using a single token (ignoring types) improves performance across all types
  • Ablation study
    • Just adding the refusal token during training improves performance
      • The LLM's general capabilities are also largely preserved (Tasks Avg)
    • Adding contrast data improves performance further
    • Figure 5, left: even a small number of refusal samples in the training data raises the refusal rate sharply
      • Even when training on the temporal data, refusal rates rise on CoCoNot and TriviaQA
      • Even with refusal-token control, results tend to converge with the setting without the refusal token as the number of refusal samples grows
    • Figure 5, right: adding contrast data (in a 1:1 ratio with the refusal samples) keeps the refusal rate somewhat in check
      • Training on boundary examples is effective for calibrating the refusal rate
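
The threshold sweep described above (0 to 1 in steps of 0.1, scored with F1) can be mimicked on toy numbers as in the sketch below. The probabilities and labels are made up for illustration and do not reproduce any figure from the paper; the function names are also assumptions. Here "positive" means "the query should be refused".

```python
def f1_score(should_refuse: list[bool], predicted: list[bool]) -> float:
    """F1 for the refusal class, computed from true/false positives and negatives."""
    tp = sum(t and p for t, p in zip(should_refuse, predicted))
    fp = sum((not t) and p for t, p in zip(should_refuse, predicted))
    fn = sum(t and (not p) for t, p in zip(should_refuse, predicted))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def sweep_thresholds(refuse_probs: list[float],
                     should_refuse: list[bool],
                     steps: int = 10) -> dict[float, float]:
    """Return F1 for each threshold in 0, 1/steps, ..., 1."""
    scores = {}
    for i in range(steps + 1):
        threshold = i / steps
        # Refuse whenever the refusal-token probability reaches the threshold.
        predicted = [p >= threshold for p in refuse_probs]
        scores[threshold] = f1_score(should_refuse, predicted)
    return scores


# Made-up first-token refusal probabilities and gold labels (toy data).
toy_probs = [0.9, 0.6, 0.15, 0.05, 0.3, 0.02]
toy_labels = [True, True, True, False, False, False]

scores = sweep_thresholds(toy_probs, toy_labels)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

Since the sweep only re-reads first-token probabilities that were already computed, trying a new threshold costs no retraining, which is the practical appeal the reviewers point out.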

Categories

SAFETY research