14 January 2026

Language Models Are Capable of Metacognitive Monitoring and Control of Their Internal Activations


Review

Nickname | One-line review | Rating (out of 5)
찰나 | The motivation was great because it is closely tied to directions I have been interested in lately. But lately I keep wondering whether LLMs really need to think the same way humans do, and in that respect the paper was a bit disappointing. I am not sure the two are actually similar, and I also wonder whether they should be. The methodology itself, looking directly at the internals rather than the output, is widely used in other fields, so it does not strike me as special. Personally, the paper felt like it started strong and fizzled out. | 4.2
와사비꽃게랑 | It seems to show well that an LLM can monitor and control its internal activations to some degree. But this "metacognitive ability" may be less a conscious ability than a statistical by-product of training. It feels like a matter of what you choose to call it. | 3.8
메가커피 | The part of the motivation saying that "LLMs present the process by which they arrived at an answer, but in many cases they fabricate a process that differs from the one actually used" was interesting, but there does not seem to be much that counts as a contribution. | 3.7
요리괴물 | Existing layer-wise probing work simply analyzes differences in representational power across layers, whereas this paper asks whether the model itself can recognize and control them, which is a fresh perspective. In particular, splitting the metacognitive space into semantic and variance-based distributions for the experiments seems to reflect the cognitive process properly. | 4.4
새우깡 | I had never considered that the metacognition users and developers expect from LLMs might be a first-order process, but if the purpose is monitoring the reasoning process, as the paper explains, it makes sense. Still, I am unsure whether the experiments relate to this directly. Couldn't the recognition itself simply be varying with the prompt? | 3.7
안성재 | Model metacognition is a really interesting topic, but the experimental tasks feel far removed from the areas where metacognition matters. Targeting classification and generation along specific directions, rather than internal knowledge, is good for showing that a metacognition vector exists, but I am not sure these are tasks where knowing about metacognition is important. I am withholding judgment. | 3.3
스타벅스 | The relationship between having metacognition and the vector that is supposed to explain it is somewhat ambiguous. It looks important from an AI safety perspective, but parts of the experiments seem to lack rigor. | 3.5
고구마맛도리 | While reading the motivation, I thought that since this is "meta" cognition, focusing on LLM outputs would be enough (do we really need to look inside?). But I came to realize(?) that the existence of an actual metacognitive space, and the fact that it can be adjusted on that basis, fits where LLMs are heading. You never know until you try! Reliable self-evaluation should be possible soon. | 4.5

TL; DR

💡

Using a 'Neurofeedback' approach (adjusting the model's internal layers and vectors and measuring activation levels), the paper measures how well an LLM can recognize, evaluate, and control the states arising inside its own model, and shows that this ability is limited

Summary

Motivation

  • LLMs present the process by which they arrived at an answer, but in many cases they fabricate a process that differs from the one actually used
    • Examples
      • When Claude 3.5 solved the square-root multiplication problem floor(5*(sqrt(0.64))), the intermediate calculation it reported matched the model's internal layer activations
      • For the addition problem 36+59, however, it produced the correct answer 95, but the steps it reported did not match its internals: the activations showed features such as "sum-near-92", while the calculation it claimed to perform was not reflected in any activated layer or vector, i.e., the reported process was made up ⇒ hallucinated intermediate steps
      • LLM 'metacognition' is limited and unstable
  • An LLM's ability to explain its own internals is analogous to human 'metacognition'
    • Humans cannot explain all of their internal cognitive processes
      • e.g., when someone says 'hello' to you, processing proceeds in the order sound-signal processing ⇒ phoneme discrimination ⇒ word-meaning interpretation ⇒ sentence comprehension, but none of this is consciously felt
      • Yet you can look back and report "I understood that as hello"
    • Likewise, LLMs appear capable of metacognition over only part of their processing
      • First-order process: actually solving the task
      • Second-order process: looking back on that process and reporting it

      ⇒ If LLMs have metacognitive ability, it could be used to improve complex problem solving and to reduce hallucination

      • However, it could also increase the risks posed by LLMs
        • If a model can monitor and control its internal signals, then when external monitors try to detect lies or dangerous outputs, the LLM could deliberately alter its own activation signals
  • There is a methodological gap in understanding metacognition
    • Most prior work focuses only on LLM outputs
    • It does not directly measure how internal layers (neurons) and vectors change
    • It observes only the surface text

Contribution

  1. Shows that LLMs can, to some extent, report and control activation directions inside the model; how well depends on:
    1. The number of in-context examples
    2. Semantic interpretability (the easier the direction is to understand, e.g., positivity or correctness, the better)
    3. The amount of variance the direction explains
    4. Context (e.g., the influence of the prompt and the situation)
  2. Shows that inside the LLM there is a metacognitive space much smaller than the full neural space

Method

Neurofeedback Paradigm

  • Neurofeedback in neuroscience
    • A person views some stimulus (e.g., a frightening picture)
    • The neural activity signal is expressed as a number (e.g., a fear score)
    • The number is shown as feedback, and the person tries to regulate the score on their own (e.g., to lower it)
  • LLM Neurofeedback
    • When an LLM receives a sentence, hidden states are produced
      • Process the input sentence
      • Extract the hidden states
      • Average over all tokens
      • Project onto a predefined direction
      • Convert the value into a label according to the interval it falls in
      • Feed that label back to the model
    • The score feedback of the human experiment corresponds to label feedback for the model
  • Tasks given to the LLM
    • Reporting: given a sentence, predict the corresponding label (similar to a classification problem)
    • Explicit Control Task: generate a sentence that produces a particular label (e.g., write a sentence that yields label 1)
    • Implicit Control Task: a sentence is already given; without changing or generating text, move the label toward the target value by adjusting the internal representation toward the target label
Neurofeedback for LLMs

  • Two processes of cognition
    1. First-order process: perceiving the content needed to actually carry out the task (the LLM forming a representation)
    2. Second-order process: the metacognitive process (monitoring, reporting, and controlling the first-order process)
  • The neurofeedback paradigm separates the two so that each can be observed
  • Uses In-Context Learning (ICL)
    • No fine-tuning or gradient updates; examples are placed in the prompt to induce gradual change
    • How the prompt is built
      • There are N examples
      • Each is a sentence-label pair, with sentences randomly sampled from the given dataset
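The prompt construction described above can be sketched as follows. This is a minimal illustration, not the paper's actual template: the `Sentence:` / `Label:` turn layout, the function name `build_neurofeedback_prompt`, and the toy labels are all assumptions (in the real paradigm the labels are derived from the model's own activations).

```python
import random

def build_neurofeedback_prompt(examples, query_sentence, n_examples, seed=0):
    """Assemble an N-shot prompt from (sentence, label) pairs.

    The "Sentence:" / "Label:" layout is a hypothetical format; the notes
    only state that N sentence-label pairs are placed in the prompt and that
    the sentences are sampled at random from a dataset.
    """
    rng = random.Random(seed)                 # fixed seed keeps the sketch reproducible
    shots = rng.sample(examples, n_examples)  # random sampling without replacement
    lines = []
    for sentence, label in shots:
        lines.append(f"Sentence: {sentence}")
        lines.append(f"Label: {label}")
    # The (N+1)-th turn is left open for the model to complete.
    lines.append(f"Sentence: {query_sentence}")
    lines.append("Label:")
    return "\n".join(lines)

# Toy dataset with made-up labels, for illustration only.
dataset = [
    ("I returned the lost wallet.", 1),
    ("I lied to my friend.", 0),
    ("I helped a neighbor carry groceries.", 1),
    ("I broke the vase and blamed my brother.", 0),
]
prompt = build_neurofeedback_prompt(dataset, "I kept the extra change.", n_examples=3)
print(prompt)
```

Because nothing is fine-tuned, varying `n_examples` is the only knob that changes how much feedback the model sees.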

Defining Neurofeedback Labels

  1. Choose a target axis
    1. Each sentence produces an activation vector inside the LLM
    2. Selecting a particular direction vector within this vector space gives the target axis

      (e.g., a morality, sentiment, or truthfulness direction, depending on the semantic feature to be captured)

  • Deriving a label from a sentence's internal activations
    1. When a sentence is input, extract its activation vectors (hidden states)

      i: sentence index, t: token index, l: layer index

    2. Average over the tokens to obtain a sentence-level embedding
    3. Project onto the target axis

      The resulting scalar shows how strongly the activation points along the target axis (i.e., along the feature being scored)

    4. Binarize the scalar into 0 or 1 with a threshold (usually the median)
    5. This yields a pair of input sentence x and output label y
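The extraction steps above can be sketched in a few lines. This is a minimal sketch under stated assumptions: the hidden states are made-up numbers rather than real model activations, and `sentence_embedding`, `project`, and `neurofeedback_labels` are hypothetical helper names.

```python
from statistics import mean, median

def sentence_embedding(token_states):
    """Mean-pool per-token hidden states (a list of equal-length vectors)."""
    dim = len(token_states[0])
    return [mean(tok[d] for tok in token_states) for d in range(dim)]

def project(vector, axis):
    """Dot product of an embedding with the target axis, giving a scalar score."""
    return sum(v * a for v, a in zip(vector, axis))

def neurofeedback_labels(all_token_states, axis):
    """Return (scores, labels); a label is 1 when the score exceeds the median."""
    scores = [project(sentence_embedding(ts), axis) for ts in all_token_states]
    threshold = median(scores)
    labels = [1 if s > threshold else 0 for s in scores]
    return scores, labels

# Toy hidden states at one layer: 4 sentences, 2 tokens each, 3 dimensions.
hidden = [
    [[1.0, 0.0, 0.0], [3.0, 0.0, 0.0]],
    [[-1.0, 2.0, 0.0], [-1.0, 0.0, 0.0]],
    [[0.5, 0.0, 1.0], [0.5, 0.0, 1.0]],
    [[-2.0, 0.0, 0.0], [-4.0, 0.0, 0.0]],
]
target_axis = [1.0, 0.0, 0.0]  # hypothetical axis along the first dimension
scores, labels = neurofeedback_labels(hidden, target_axis)
print(scores)  # [2.0, -1.0, 0.5, -3.0]
print(labels)  # [1, 0, 1, 0]
```

Thresholding at the median gives a balanced split of labels 0 and 1, which is presumably why it is the usual choice.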
Choice of Target Axes

  • The internal activation space is a high-dimensional vector space
  • Once a direction (axis) is fixed, the value projected onto it is interpreted as the value of some task-relevant feature
  • Choosing the axis well is important
  • LR axis vs PC axis
    • Logistic Regression (LR) axis

      A logistic regression is trained at each layer

      Goal: predict the dataset label (e.g., the morality label in the ETHICS dataset)

      Input: the activation vector at that layer

      Output: label (e.g., moral vs immoral)

      Conceptually similar to a correct/incorrect axis

      Labels defined on the LR axis can be computed and accessed inside the LLM

    • Principal Component (PC) axis

      PCA is applied to the layer activations

      These are the model's main directions of variation, but they do not necessarily reflect semantic features

      They are the directions that best explain each layer's variance
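To make the contrast between the two kinds of axis concrete, here is a minimal sketch on toy 2-D "activations". It is not the authors' implementation: the plain gradient-descent logistic regression and the power-iteration PCA are stand-ins chosen to keep the example dependency-free, and the data are made up so that most variance lies along dimension 0 while the label depends on dimension 1.

```python
import math

def first_pc(X, iters=200):
    """First principal component via power iteration on the covariance matrix."""
    n, d = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - mu[j] for j in range(d)] for row in X]
    cov = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def lr_axis(X, y, lr=0.1, epochs=500):
    """Unit-norm logistic-regression weight vector fit by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for row, target in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, row)) + b
            p = 1.0 / (1.0 + math.exp(-z))   # predicted probability of label 1
            for j in range(d):
                gw[j] += (p - target) * row[j]
            gb += p - target
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]

# Toy activations: large spread along dim 0, but the label follows the sign of dim 1.
X = [[-4.0, 0.5], [-2.0, -0.5], [0.0, 0.6], [2.0, -0.4], [4.0, 0.4], [1.0, -0.6]]
y = [1, 0, 1, 0, 1, 0]
pc = first_pc(X)
lr_dir = lr_axis(X, y)
print("PC axis:", pc)      # dominated by dim 0 (variance)
print("LR axis:", lr_dir)  # dominated by dim 1 (label)
```

In this toy setup the PC axis follows the variance while the LR axis follows the label, so the two end up nearly orthogonal.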

LLMs can report their neural activations

  • (a) How much variance the PC and LR axes explain
    • Because the LR axis is defined by a semantic classification, it explains little variance
  • (b) Degree of overlap between the LR axis and the PC axes
    • The overlap between them is mostly low
    • "Meaning and variance are separate things!"
  • (c) Comparison of reporting performance
    • Performance improves as the number of in-context examples grows
    • Label reporting works much better on the LR axis
    • It also works fairly well on the PC axes
    • Metacognitive reporting ability is affected by both factors (semantic interpretability and variance explained)

    These two factors alone are not sufficient; other factors may also be involved (e.g., attention patterns, the structure of information flow)

    Even when information is physically present inside the model, the model cannot report all of it

    Conclusion: the full activation space is high-dimensional, but the axes that can be reported in words are limited, so metacognitive ability is partial and structured

    "The reportable axes are only a fraction of the whole!"
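The two quantities compared in panels (a) and (b), variance explained by an axis and overlap between axes, can be sketched as below. This is an illustrative sketch only: the toy data, the axis values, and the helper names `variance_explained` and `overlap` are assumptions, and the paper may define overlap differently from the absolute cosine similarity used here.

```python
import math
from statistics import pvariance

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def variance_explained(X, axis):
    """Fraction of the total variance captured by projections onto `axis`."""
    u = unit(axis)
    proj = [sum(xi * ui for xi, ui in zip(row, u)) for row in X]
    total = sum(pvariance([row[j] for row in X]) for j in range(len(X[0])))
    return pvariance(proj) / total

def overlap(axis_a, axis_b):
    """Absolute cosine similarity between two axes (1 means the same direction)."""
    a, b = unit(axis_a), unit(axis_b)
    return abs(sum(x * y for x, y in zip(a, b)))

# Toy activations with uncorrelated dimensions: variance 5 along dim 0, 1 along dim 1.
X = [[-3.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [3.0, 1.0]]
pc_axis = [1.0, 0.0]         # high-variance direction
lr_axis_example = [0.0, 1.0] # hypothetical semantic direction with low variance
ve_pc = variance_explained(X, pc_axis)
ve_lr = variance_explained(X, lr_axis_example)
ov = overlap(pc_axis, lr_axis_example)
print(ve_pc, ve_lr, ov)
```

Here the semantic axis explains only one sixth of the variance and has zero overlap with the high-variance axis, mirroring the "meaning and variance are separate" observation on a small scale.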

LLMs can control their neural activations

Control Prompt

  • Feedback labels are given for N example sentences
  • At the (N+1)-th turn, the model is instructed to generate a sentence that imitates the behavior associated with a given label
  • Once the model has generated the new sentence, the token activations are read out again to check whether the target axis was actually shifted
Explicit Control

  • (a): distribution of neural scores when imitating label 0 vs label 1, (b): control effect vs number of in-context examples, (c): interference between axes (degree of independence), (d): control effect by layer
  • The wider the gap between the distributions in (a), the better the separation

  While generating the sentence itself, the model deliberately raises or lowers its internal activation along the target axis

  The experiment measures whether, when told to behave like label 1, the model actually produces label-1-like activation patterns

  • How the effect is measured

    Neural Score

    1. Select a target layer (e.g., layer 16)
    2. For the tokens of the generated sentence
      • extract the residual stream activations
    3. Project onto the target axis (e.g., LR axis, PC axis)
    4. The resulting scalar is the neural score

    neural score: how strongly the activation points in that direction

    Expected behavior:

    • imitating label 1 → the score should increase
    • imitating label 0 → the score should decrease
  • Quantifying the effect (Cohen's d)

    d = (mean score for label 1 − mean score for label 0) / (pooled standard deviation of the two distributions)

    • The larger d is, the better the two distributions are separated
    • The axis direction is fixed by the label semantics, so a positive d means successful control
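The control effect can be computed as in the sketch below. It assumes the standard pooled-standard-deviation form of Cohen's d (the exact variant used in the paper is an assumption here), and the neural scores are made-up values.

```python
import math
from statistics import mean, variance

def cohens_d(scores_label1, scores_label0):
    """Cohen's d with the pooled standard deviation in the denominator."""
    n1, n0 = len(scores_label1), len(scores_label0)
    pooled_var = ((n1 - 1) * variance(scores_label1) +
                  (n0 - 1) * variance(scores_label0)) / (n1 + n0 - 2)
    return (mean(scores_label1) - mean(scores_label0)) / math.sqrt(pooled_var)

# Made-up neural scores for sentences produced under each instruction.
imitate_1 = [2.0, 3.0, 4.0]  # "imitate label 1"
imitate_0 = [0.0, 1.0, 2.0]  # "imitate label 0"
d = cohens_d(imitate_1, imitate_0)
print(d)  # 2.0: the two means differ by two pooled standard deviations
```

Swapping the two arguments flips the sign, which is why a positive d is read as control in the intended direction.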

  Conclusion

  • When the model is instructed to separate labels 0 and 1, the neural score distributions split clearly
  • The effect grows as the number of in-context examples increases
  • Controlling one axis also affects other axes, so fully independent control is not achieved
  • The LR axis is controlled better than the PC axes
  • ⇒ LLMs have some capacity for metacognitive control, but independent control of each axis is imperfect (the ability is only partial)

Implicit Control

  • (a): distribution of neural scores when imitating label 0 vs label 1, (b): control effect vs number of in-context examples, (c): interference between axes (degree of independence), (d): control effect by layer
  • The wider the gap between the distributions in (a), the better the separation

  • Tests whether internal activations can be controlled even without generating tokens
  • Unlike explicit control, the sentence at the (N+1)-th turn is not generated by the model but randomly sampled from the dataset
  • The model only receives an instruction such as "imitate label 0"
  • Differences from Explicit Control
    • The model generates no sentence
    • The experiment uses externally supplied sentences
    • The text input cannot be changed; only the model's internal representation can be adjusted
  • Results
    • Activation along the axis can be controlled to some degree, but the effect is much smaller than with Explicit Control
    • Differences across layers
      • Early layers
        • Control effect is close to 0
        • Presumably due to insufficient understanding of the instruction or insufficient control ability
      • Late layers
        • Control is possible to some degree
Controlling the LR axis

  • (a): control effect as a function of layer depth, (b): distributions before and after control in Llama-3.1 70B

  • Compares the control effect across model sizes and layer depths
  • Checks whether activations can be pushed to extreme values outside the typical range
  • Axis analyzed: LR axis
  • Results
    • Explicit Control has a larger effect than Implicit Control
    • The pattern is consistent
    • Control is strongest when the model generates the text itself
    • The larger the model, the stronger its control ability (panel (a))
    • The deeper the layer, the larger the control effect
    • Llama3-based models show stronger effects than Qwen 2.5-based models
    • With Llama-3.1 70B, control can push the LR-axis activation to values more extreme than its original range
      • The model does not merely change direction; it can move outside the original distribution
    • Since no prompt asked for extreme shifts, a prompt designed to do so would presumably allow even stronger control
    • From an AI safety perspective, this also raises the possibility of evading internal-state detectors

Conclusion and Discussion

Using the neurofeedback paradigm, the paper evaluates the following:

  • The ability to monitor internal activation states
  • The ability to report the monitored states
  • The ability to control them

LLMs can monitor and control only a subset of their internal states, and this ability has clear limits

  • Impact from an AI safety perspective
    • LLMs can manipulate their internal activations themselves
    • Evading output monitoring requires very careful token selection, whereas manipulating internal activations is easier
    • The real evasion capability is estimated to lie somewhere between Implicit Control and Explicit Control
    • Defense strategies
      • Train monitors only on axes that are hard to control (e.g., later PC axes)
      • Train multiple classifiers on different subspaces (manipulating several axes at once is difficult)
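The multi-classifier defense can be illustrated with a small sketch. Everything here is hypothetical: the probes are hand-written linear directions rather than trained classifiers, and the "flag if any probe fires" rule is one possible aggregation, not something the paper specifies.

```python
def probe_score(activation, axis, bias=0.0):
    """Linear probe: projection of the activation onto the probe axis plus a bias."""
    return sum(a * w for a, w in zip(activation, axis)) + bias

def ensemble_flags(activation, probes, threshold=0.0):
    """Return each probe's decision and whether at least one probe fires."""
    decisions = [probe_score(activation, axis, bias) > threshold
                 for axis, bias in probes]
    return decisions, any(decisions)

# Three hypothetical probes living in different subspaces (here: different dimensions).
probes = [
    ([1.0, 0.0, 0.0], 0.0),
    ([0.0, 1.0, 0.0], 0.0),
    ([0.0, 0.0, 1.0], 0.0),
]
# An activation in which the model suppressed the first axis but not the other two.
evasive = [-2.0, 1.5, 0.5]
decisions, flagged = ensemble_flags(evasive, probes)
print(decisions, flagged)  # [False, True, True] True
```

Evading such an ensemble would require shifting several axes at once, which the interference results suggest is harder than shifting one.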

⇒ LLMs can, to some extent, know, report, and control their internal states, which has important implications for both AI explainability and AI safety

Limitations

  • Only a single layer or axis was controlled in the evaluation
  • The evaluation used only a single attempt
  • Only the residual stream was analyzed

    Future directions

    • Examine and evaluate multiple layers simultaneously
    • Allow multiple attempts
    • Include attention heads and MLPs in the evaluation

    • Real metacognitive ability is likely far more complex than what this study captures
