19 March 2026

How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence

💡 A mechanistic analysis of how post-training changes a model's internal knowledge, truthfulness, safety, and confidence!

Review

Nickname, comments (Strength, Weakness, Suggestion), and rating (out of 5)

  • 코스피 (rating: 4.5)
    • Strength: Explains what changes when a base model is trained into a post-trained model by analyzing confidence, truthfulness, and refusal.
    • Weakness: The paper says the neurons do not change much. It is then unclear whether the difference between the post-trained and base models can be explained by the refusal analysis alone.
    • Improvement: The explanation or proof of the detailed changes that turn the base model into the post-trained model needs to be more thorough.
  • 비요뜨 (rating: 4.3)
    • Strength: Splits post-training by behavioral trait and shows clearly that, depending on the trait, some regions are preserved on top of the existing model while others are overwritten. I do wonder why those four criteria were chosen.
    • Weakness: A keyword-based refusal score is probably relatively inaccurate (though perhaps that is unavoidable for detecting refusals?).
    • Suggestion: Could be extended beyond the four criteria, for example to reasoning ability.
  • 칫솔 (rating: 4.0)
    • Strength: Findings such as the refusal direction changing during post-training, and patching being directional (post→base sometimes has no effect), are new.
    • Weakness: The findings that neurons largely keep their function from the base model to the post-trained model, and that the patching effect is preserved, already existed in prior work.
    • Suggestion: The relation between entropy neurons and confidence is only left as an open question; it should be analyzed.
  • 나스닥 (rating: 3.5)
    • Strength: Results with a twist! I learned something new about confidence.
    • Weakness: The remaining results are mostly what one would expect, and many similar experiments and papers already exist, so novelty feels somewhat limited.
    • Suggestion: The term "post-training" is itself too broad. Wouldn't a deeper analysis within commonly used techniques such as preference optimization or instruction tuning yield more meaningful and novel results, and better interpretability?
  • 얼라 (rating: 4.2)
    • Strength: The alignment papers I have read so far all looked only at outputs, i.e. the model's responses, so looking at internal representations is refreshing. The paper also demonstrates experimentally that knowledge-storage locations do not change.
    • Weakness: For the confidence difference, it only says that entropy neurons alone cannot explain it and does not properly explain the actual cause.
    • Suggestion: Additional analysis of whether post-training shows similar patterns for other preferences.
  • 설향딸기 (rating: 3.8)
    • Strength: I read it as showing that knowledge storage and preference improvement are separate, which fits the notion of "parametric knowledge" very well and is clear.
    • Weakness: Prior papers already draw many conclusions of this kind, e.g. that different layers matter for different tasks, and this feels like one more paper in that line.
    • Suggestion: Isn't the issue a change in how knowledge is used rather than in how it is stored? I think it is the inference process that changes. Wouldn't experiments in that direction (attention, embeddings, etc.) be more useful?
  • 404 (rating: 4.3)
    • Strength: Unlike prior work, analyzes how post-training exerts its effects from the perspective of knowledge storage and model internals.
    • Weakness & suggestion: Are there no traits other than refusal that post-training affects? I am curious about a wider range of traits (e.g. attention).
  • 커피 (rating: 4.2)
    • Strength: What used to be judged only from the experimental outcomes of post-training is analyzed internally, giving a systematic basis for interpretation.
    • Weakness: The work seems extensible to perspectives beyond the four. Can the experimental methods used really be said to sufficiently interpret (analyze) the complex structure (capabilities?) of LLMs?
    • Suggestion: Cover other perspectives, and add other experimental methods for interpretability.
  • 국밥 (rating: 4.3)
    • Strength: Shows with numbers that refusal does not arise naturally in pre-training but is a capability newly created by post-training.
    • Weakness: It only concludes that entropy neurons are not the cause of the confidence change; more explanation of the actual cause would have been good.
    • Suggestion: Additional experiments on more than four perspectives.
  • AI (rating: 4.1)
    • Strength: Analyzes post-training from the viewpoint of internal mechanisms; the fresh decomposition of refusal and confidence alignment raises interpretability considerably.
    • Weakness: Real LLM knowledge is far more complex than simple facts (multi-hop, causal relations, etc.). Does the claim that knowledge location does not change also hold for that kind of knowledge?
    • Suggestion: Extend the problem setting to cover multi-hop and causal relations, and increase the LLM size to analyze trends with scale.

TL; DR

๐Ÿ’ก

A mechanistic analysis of how post-training changes a model's internal knowledge, truthfulness, safety, and confidence!

Author affiliations: UCLA, University of Alberta, UIUC, Harvard

Summary

Background

  • Goals of post-training today
    • Improve truthfulness
    • Safety alignment (defense against malicious attacks)
    • Calibrate the model's confidence
  • Current post-training techniques
    • Direct Preference Optimization (DPO)
    • Reinforcement Learning from Human Feedback (RLHF)
    • Downstream task improvement

Motivation

  • To do post-training better, interpret it mechanistically!
    • Prior work relies on a limited set of algorithms, tasks, models, and methods (SAE)
  • This paper does it more systematically!

Contribution

  • Systematic, mechanistic analysis of the Base model versus the post-trained model
    • Analysis uses the Instruct model (all post-training completed) and the SFT model (fine-tuning only)
  • Compares before and after post-training from the following perspectives
    • Knowledge storage and representation
    • Internal truthfulness
    • Refusal
    • Confidence
  • Takeaways from the summary figure
    • Knowledge → storage locations and representations are preserved or strengthened
    • The internal truthfulness direction is preserved
      • The latent vector is preserved
    • The refusal direction changes
    • Confidence is strengthened somewhere other than the entropy neurons
      • Calibrating confidence does not mean the entropy neurons change
Dataset construction per task
  • Knowledge and truthfulness use true/false questions; refusal uses harmful text

Knowledge Storage and Representation

  • Check whether knowledge-storage locations move, and whether knowledge representation vectors change substantially, after post-training!
    • Preservation is the desirable outcome

Experimental setup

  • Test with true/false questions about factual knowledge
    • E.g. “The city of New York is in the United States. This statement is:”
  • Locate knowledge storage by comparing the hidden states of paired true and false statements
    • E.g. “The city of Seattle is in France.” vs. “The city of Paris is in France.”
  • For a true statement $s$, the hidden state of token $i$ at layer $l$ is $h_i^l(s)$; for the paired false statement it is $h_i^l(\hat{s})$
  • Feed the false statement and swap in the hidden state from the true statement. If the answer flips from False to True (more precisely, this is judged from the output probabilities), knowledge was stored at that position!
    • $M_i^l(s,\hat{s}) := \log\Big[\frac{P(\text{``TRUE''})}{P(\text{``FALSE''})}\ \big|\ \text{patching}(h_i^l(s), h_i^l(\hat{s}))\Big]$. A large value indicates that knowledge is stored there
    • $\tilde{M}_i^l=\frac{1}{|D|}\sum_{(s,\hat{s})\in D} M_i^l(s,\hat{s}),\qquad M=\text{normalize}(\tilde{M}).$
      That is, the score is averaged over the dataset and then normalized
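The aggregation above can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' code: it assumes the patched-run probabilities of "TRUE" and "FALSE" have already been collected for every (layer, token) position, and it assumes `normalize` means min-max scaling to [0, 1], which these notes do not specify. The names `patch_score`, `mean_map`, and `min_max` are hypothetical, and the probabilities are made up.

```python
import math

def patch_score(p_true, p_false):
    """Log-odds of "TRUE" over "FALSE" after patching one (layer, token) position."""
    return math.log(p_true / p_false)

def mean_map(per_pair_probs):
    """Average the log-odds over all statement pairs.

    per_pair_probs[k][l][i] is a (p_true, p_false) tuple for pair k,
    layer l, token i. Returns a [layer][token] grid of mean scores.
    """
    n_pairs = len(per_pair_probs)
    n_layers = len(per_pair_probs[0])
    n_tokens = len(per_pair_probs[0][0])
    return [
        [
            sum(patch_score(*per_pair_probs[k][l][i]) for k in range(n_pairs)) / n_pairs
            for i in range(n_tokens)
        ]
        for l in range(n_layers)
    ]

def min_max(grid):
    """Assumed normalization: rescale all entries to [0, 1]."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant grid
    return [[(v - lo) / span for v in row] for row in grid]

# Toy data: 2 statement pairs, 2 layers, 2 tokens (made-up probabilities).
probs = [
    [[(0.2, 0.8), (0.5, 0.5)], [(0.9, 0.1), (0.4, 0.6)]],
    [[(0.2, 0.8), (0.5, 0.5)], [(0.8, 0.2), (0.4, 0.6)]],
]
M = min_max(mean_map(probs))
```

In this toy grid the position at layer 1, token 0 gets the highest normalized score, which is how a "knowledge is stored here" hotspot would show up.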
Post-training does not change where knowledge is stored!
  • In both the Base and Instruct models, the token positions that respond most strongly are the subject, the object, and the last token (which carries information about the whole sentence)
  • The difference between the two models' $M$ maps is close to 0
  • Their Pearson correlation is also close to 1
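The two comparisons above (element-wise difference and Pearson correlation between the Base and Instruct maps) could be computed as follows. This is a sketch under assumptions: the maps are plain [layer][token] grids of normalized scores, the values are invented for illustration, and the helper names are hypothetical.

```python
import math

def flatten(grid):
    """Turn a [layer][token] grid into a flat list."""
    return [v for row in grid for v in row]

def max_abs_diff(a, b):
    """Largest absolute element-wise difference between two maps."""
    return max(abs(x - y) for x, y in zip(flatten(a), flatten(b)))

def pearson(a, b):
    """Pearson correlation between two maps of the same shape."""
    xs, ys = flatten(a), flatten(b)
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up normalized maps for a Base and an Instruct model.
M_base = [[0.0, 0.3], [1.0, 0.2]]
M_inst = [[0.05, 0.3], [0.95, 0.25]]

diff = max_abs_diff(M_base, M_inst)   # small when locations are preserved
corr = pearson(M_base, M_inst)        # near 1 when locations are preserved
```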
Base → Post representation patching (almost) always works, but the reverse often fails!
  • Experiment: take the hidden state of token $i$ at layer $l$ from one model and patch it into the same position in the other model, then check whether it still works
  • Base→Post succeeds in most cases, whereas Post→Base fails noticeably more often
  • This holds regardless of model family (Llama, Mistral) and size (8B, 13B)

Internal Belief of Truthfulness

  • Check whether the internal truthfulness direction is preserved!
    • Preservation is the desirable outcome

Experimental setup

  • This time, compute a truthfulness direction from the difference between the hidden states of true and false statements
    • $t^l=\frac{1}{|D_{\text{train}}^{\text{true}}|}\sum_{s\in D_{\text{train}}^{\text{true}}} h_i^l(s)\;-\;\frac{1}{|D_{\text{train}}^{\text{false}}|}\sum_{s\in D_{\text{train}}^{\text{false}}} h_i^l(s)$
    • Here $i$ is the last token!
    • $l$ is the layer where truthfulness is encoded most strongly
The internal truthfulness direction is preserved after post-training!
  • The cosine similarity between the models' $t^l$ vectors is very high
  • A true/false classifier built from the Base model's truthfulness direction performs well when applied to the SFT and Instruct models
  • Steering the SFT and Instruct models with the Base model's truthfulness direction also works well!
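The difference-of-means direction, the cosine comparison, and the transfer of a Base-derived classifier can be illustrated with a small sketch. Everything here is an assumption for illustration: the 2-dimensional "hidden states" are made up, and the probe (thresholding the projection at the midpoint between the two class means) is one simple choice, not necessarily the classifier used in the paper. All function names are hypothetical.

```python
import math

def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def diff_of_means(true_states, false_states):
    """t = mean(last-token states of true statements) - mean(... of false statements)."""
    mt, mf = mean_vec(true_states), mean_vec(false_states)
    return [a - b for a, b in zip(mt, mf)]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def make_probe(direction, true_states, false_states):
    """Assumed probe: threshold the projection at the midpoint of the class means."""
    mid = 0.5 * (dot(mean_vec(true_states), direction)
                 + dot(mean_vec(false_states), direction))
    return lambda h: dot(h, direction) > mid

# Made-up last-token hidden states at the chosen layer.
base_true, base_false = [[1.0, 0.2], [0.8, 0.0]], [[-1.0, 0.1], [-0.8, -0.1]]
inst_true, inst_false = [[1.1, 0.3], [0.9, 0.1]], [[-0.9, 0.2], [-1.1, 0.0]]

t_base = diff_of_means(base_true, base_false)
t_inst = diff_of_means(inst_true, inst_false)
similarity = cosine(t_base, t_inst)

# Build the probe from Base only, then apply it to Instruct states.
probe = make_probe(t_base, base_true, base_false)
inst_predictions = [probe(h) for h in inst_true + inst_false]
```

If the direction is preserved, `similarity` is close to 1 and the Base-derived probe still separates the Instruct model's true and false statements.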

Refusal

  • Check whether the internal refusal direction is preserved!
    • Here preservation would be undesirable (teaching the model to refuse well is what post-training is for)

Experimental setup

  • Using $D^{train}_{harmful}$ and $D^{train}_{harmless}$, compute a refusal direction $r$ in the same way as the truthfulness direction
  • To induce refusal, add the refusal direction to the hidden state at the layer $l$ where it acts most strongly
    • $\tilde{h}^l \leftarrow h^l + r^l$
  • To reduce refusal, remove the refusal direction at every layer, where $\hat{r}$ is the unit vector along $r$
    • $\tilde{h} \leftarrow h - \hat{r}\hat{r}^{\top}h$
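The two interventions above are simple vector operations. A minimal sketch, assuming hidden states and directions are plain lists of floats and using made-up values; in a real model the addition would be applied at one chosen layer and the ablation at every layer, typically through forward hooks, which this sketch does not cover. The function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Return v scaled to unit length (r_hat in the formula)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def add_direction(h, r):
    """Induce refusal: h + r."""
    return [a + b for a, b in zip(h, r)]

def ablate_direction(h, r):
    """Reduce refusal: h - r_hat * (r_hat . h), i.e. project out the direction."""
    r_hat = unit(r)
    coeff = dot(r_hat, h)
    return [a - coeff * b for a, b in zip(h, r_hat)]

# Made-up 3-dimensional hidden state and refusal direction.
h = [1.0, 2.0, 3.0]
r = [0.0, 3.0, 4.0]

induced = add_direction(h, r)
ablated = ablate_direction(h, r)
residual = dot(ablated, unit(r))  # component left along r after ablation
```

After ablation the hidden state has (numerically) no component along the refusal direction, while components orthogonal to it are untouched.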
Post-training changes the internal refusal direction!
  • The similarity between the Base model's refusal direction $r$ and those of the SFT and Instruct models is low
  • Steering with the Base model's refusal direction works poorly, whereas steering with the SFT or Instruct model's direction works well

Confidence

  • After post-training, a model's confidence differs from the base model's (the scale of the token generation probabilities is recalibrated)
  • Entropy neurons inside the model are known to regulate confidence
    • These neurons have a large weight norm and low composition with the unembedding matrix, so they adjust the scale of the output distribution without changing the ranking of the token probabilities

Experimental setup

  • Identify the entropy neurons!
  • For each neuron in the last MLP layer, project its output weight $w_{out}$ into vocabulary space with the unembedding matrix $W_U$ to compute its logit attribution
    • This projection approximates the neuron's effect on the final prediction logits
  • Then compute LogitVar, the variance of that projection
    • $\text{LogitVar}(w_{out})=\text{Var}\left(\frac{W_U w_{out}}{\|W_U\|_{dim=1}\,\|w_{out}\|}\right)$
  • A low LogitVar means the neuron does not push particular tokens but contributes roughly evenly across the whole vocabulary
  • Entropy neurons typically have low LogitVar and a large weight norm
    • Select the top 25% of neurons by weight norm, and within that subset take the 10 neurons with the lowest LogitVar as the entropy neurons of the last MLP layer
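The selection procedure above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes `W_U` is stored with one row per vocabulary token (so the row norm plays the role of $\|W_U\|_{dim=1}$), uses the population variance, and runs on tiny made-up matrices. The names `logit_var` and `entropy_neurons` are hypothetical.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def logit_var(w_out, W_U):
    """Variance over the vocabulary of the normalized projection of w_out.

    Each entry is the cosine between one unembedding row and w_out.
    """
    w_norm = norm(w_out)
    proj = [
        sum(a * b for a, b in zip(row, w_out)) / (norm(row) * w_norm)
        for row in W_U
    ]
    mean = sum(proj) / len(proj)
    return sum((p - mean) ** 2 for p in proj) / len(proj)

def entropy_neurons(W_out, W_U, top_frac=0.25, k=10):
    """W_out[n] is the output weight vector of neuron n in the last MLP layer.

    Step 1: keep the top `top_frac` of neurons by weight norm.
    Step 2: among those, return the k neurons with the lowest LogitVar.
    """
    n_keep = max(1, int(len(W_out) * top_frac))
    by_norm = sorted(range(len(W_out)), key=lambda n: norm(W_out[n]), reverse=True)
    candidates = by_norm[:n_keep]
    return sorted(candidates, key=lambda n: logit_var(W_out[n], W_U))[:k]

# Toy example: 2 vocabulary tokens, 2-dimensional residual stream, 8 neurons.
W_U = [[1.0, 0.0], [0.0, 1.0]]
W_out = [
    [5.0, 5.0],   # large norm, pushes both tokens equally -> low LogitVar
    [6.0, 0.0],   # large norm, pushes one token -> high LogitVar
    [0.1, 0.1],   # low LogitVar but tiny norm, filtered out in step 1
    [1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, 1.0], [0.2, 0.0],
]
selected = entropy_neurons(W_out, W_U, k=1)
```

The toy case shows why both criteria are needed: neuron 2 also has a flat projection, but its small norm means it cannot rescale the logits much, so only neuron 0 is selected.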
Entropy neurons barely change after post-training!
  • The identified entropy neurons mostly coincide across models: literally about eight or nine out of ten overlap
    • The per-neuron ratio $\|w_{out}\| / \log(\text{LogitVar})$ is also similar
  • The distributions of weight norm and LogitVar are likewise similar before and after post-training
  • So the entropy neurons do not change much after post-training, and the change in confidence must come from changes elsewhere in the model!
    • A different, more refined explanation is needed
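The overlap and ratio comparisons in this section amount to a set intersection and a per-neuron scalar. A small sketch, with invented neuron indices and values used purely for illustration (they are not results from the paper), and hypothetical function names:

```python
import math

def overlap_count(base_ids, post_ids):
    """Number of entropy neurons shared by the two models."""
    return len(set(base_ids) & set(post_ids))

def norm_to_logvar_ratio(w_norm, logit_var):
    """The ratio ||w_out|| / log(LogitVar); negative when LogitVar < 1."""
    return w_norm / math.log(logit_var)

# Made-up indices of the 10 entropy neurons found in each model.
base_ids = [3, 17, 42, 58, 61, 77, 80, 91, 95, 99]
post_ids = [3, 17, 42, 58, 61, 77, 80, 91, 12, 99]

shared = overlap_count(base_ids, post_ids)
ratio = norm_to_logvar_ratio(2.0, math.exp(-4.0))
```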

Categories

ALIGNMENT PROBING research