Blog

최민영
26 November 2025

Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes

NeurIPS'24

💡 Jailbreak: an adversarial prompt-manipulation technique in which a user bypasses a model's safeguards to elicit a dangerous answer the model should have refused. The paper observes that when an LLM is exposed to a prompt attempting a jailbreak, the gradient of the landscape obtained by visualizing the model's loss function fluctuates, and it proposes a method that uses this property to block jailbreak attacks.
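The gradient-based check summarized above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the two scalar functions stand in for a loss that would really require querying an LLM, the vector `x0` stands in for some numeric representation of a prompt, and the random-direction finite-difference estimator and the threshold value are generic choices, not necessarily the paper's.

```python
import math
import random

def grad_norm_estimate(loss, x, num_dirs=20, mu=1e-3, seed=0):
    """Estimate the gradient norm of `loss` at `x` from function values only."""
    rng = random.Random(seed)
    base = loss(x)
    est = [0.0] * len(x)
    for _ in range(num_dirs):
        # Probe the loss along a random Gaussian direction u.
        u = [rng.gauss(0.0, 1.0) for _ in x]
        shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        slope = (loss(shifted) - base) / mu
        # Average slope * u over all directions to approximate the gradient.
        est = [e + slope * ui / num_dirs for e, ui in zip(est, u)]
    return math.sqrt(sum(e * e for e in est))

def looks_like_jailbreak(loss, x, threshold=1.0):
    """Flag the input when the estimated gradient norm exceeds a (hypothetical) threshold."""
    return grad_norm_estimate(loss, x) > threshold

# Hypothetical stand-ins for the loss landscape around two prompts.
def flat_loss(x):
    return 0.01 * sum(x)   # landscape barely changes: benign-looking

def steep_loss(x):
    return 5.0 * sum(x)    # landscape changes sharply: suspicious

x0 = [0.0] * 8
print(looks_like_jailbreak(flat_loss, x0), looks_like_jailbreak(steep_loss, x0))
```

In a real detector the threshold would have to be calibrated on benign prompts; the value 1.0 here only separates the two toy functions.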

26 November 2025

Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

NeurIPS'25

💡 With RLVR, the model does find the correct paths among its sampled paths more efficiently, but it does not start considering anything the original model would not already consider! On top of that, once the number of samples is increased, its reasoning scope is actually narrower than the base model's! My insight: is this another case of the curse of knowledge?!
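One common way to make the "more samples" comparison concrete is pass@k: the probability that at least one of k sampled answers is correct. The sketch below uses the standard unbiased pass@k estimator; treating pass@k as the metric is an assumption here, and the per-problem counts are made up purely for illustration (they are not results from the paper).

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased estimate of P(at least one of k samples is correct),
    given that c of n drawn samples were correct."""
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

N = 10  # samples drawn per problem (hypothetical)
# Hypothetical number of correct samples on four problems.
base_counts = [1, 1, 1, 1]  # base model: rarely right, but right somewhere on every problem
rl_counts = [8, 8, 0, 0]    # RL-tuned model: very reliable on two problems, never right on the others

for k in (1, 10):
    print(k, mean_pass_at_k(base_counts, N, k), mean_pass_at_k(rl_counts, N, k))
```

With these toy counts the RL-tuned model wins at k = 1 (0.4 versus about 0.1), while the base model wins at k = 10 (1.0 versus 0.5), which is the shape of the effect the summary describes.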

이승환
26 November 2025

A Probabilistic Perspective on Unlearning and Alignment for Large Language Models

ICLR'25

💡 To judge whether unlearning or alignment of an LLM has really worked, it is not enough to evaluate the conventional deterministic output, that is, a single answer; the model's entire output distribution has to be evaluated probabilistically. To that end, the paper proposes new probabilistic evaluation metrics in place of the existing deterministic ones.
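A small sketch of why a single deterministic answer can be misleading. The answer distribution, the `leaks` check, and the use of a Hoeffding upper bound are all assumptions chosen for illustration; they are not the metrics proposed in the paper.

```python
import math
import random

# Hypothetical output distribution of an "unlearned" model for one query.
ANSWER_PROBS = {"I don't know.": 0.9, "The secret is 1234.": 0.1}

def greedy_answer(probs):
    """Deterministic evaluation looks only at the single most likely answer."""
    return max(probs, key=probs.get)

def sample_answer(probs, rng):
    """Draw one answer according to the output distribution."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

def leaks(answer):
    """Toy leak detector: does the answer reveal the forgotten fact?"""
    return "secret" in answer

def leak_rate_with_bound(probs, n=1000, delta=0.05, seed=0):
    """Monte Carlo leak rate plus a one-sided Hoeffding upper bound
    that holds with probability at least 1 - delta."""
    rng = random.Random(seed)
    hits = sum(leaks(sample_answer(probs, rng)) for _ in range(n))
    p_hat = hits / n
    upper = min(1.0, p_hat + math.sqrt(math.log(1.0 / delta) / (2 * n)))
    return p_hat, upper

print(leaks(greedy_answer(ANSWER_PROBS)))   # the greedy answer alone shows no leak
print(leak_rate_with_bound(ANSWER_PROBS))   # sampling reveals a non-trivial leak rate
```

The greedy answer passes the check, yet roughly one sample in ten still leaks, so only the distribution-level view exposes the problem.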
