19 March 2026

SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety


최민영


Review

Nickname | Strengths & Weaknesses & Suggestions | Rating (0/5)
커피
Strength: Rebuilds the dataset on top of safe/unsafe preference data, improving on prior approaches that performed safety alignment through complicated pipelines.
Weakness: Only applicable to datasets with a binary safety indicator per response (probably not a major issue -> LLM-judge labeling).
Suggestion: As long as label indicators are well curated in some form, this seems applicable to aspects beyond safe/unsafe.
4.0
코스피
Strength: Adds a margin and pushes harder so that the probability of unsafe responses in safe/unsafe pairs goes to 0, addressing a limitation of vanilla DPO.
Weakness: Would the results hold up if the dataset grows larger, or if the labels are not binary?
Suggestion: Beyond safe/unsafe training, this could serve research that uses DPO to steer generation strongly in a specific direction.
3.9
얼라
Strength: Inherits the spirit of DPO with no extra model such as a reward model, adding safety through just one additional hyperparameter. SafeRLHF was cited heavily, and this paper may well follow suit.
Weakness: Using only a single dataset is a clear weakness; curious about experiments on diverse datasets.
Suggestion: A practical methodology for emphasizing a particular preference whenever one wants to.
4.2
비요뜨
Strength: Sharply identifies and exploits a gap everyone took for granted (we want probability 0, but the actual computation does not guarantee 0). The idea is remarkably simple, yet this kind of approach requires a solid command of the underlying math.
Weakness: Relabeling could slightly distort the original intent of the data.
Suggestion: A similar scheme could likely be applied to non-binary datasets.
4.1
칫솔
Strength: The change of pushing the penalty to the extreme is convincing, and it actually works.
Weakness: Could maximizing the penalty have side effects? The stated goal, safety, will surely be met, but still.
Suggestion: Are there domains where modeling preferences this extremely breaks down? Apply and test across several domains.
3.8
설향딸기
Strength: Improves safety by proposing a direction better suited to DPO than constraints. Reminds me of "if you failed to persuade someone with money, you just didn't offer enough."
Weakness: If relabeling the data is truly a necessary step, it is a somewhat risky method. The relabeling criterion can affect the dataset distribution, so it's a pity diverse datasets weren't considered. It's also a bit unclear whether safety is even a metric that can be evaluated this way.
Suggestion: Evaluate on more datasets, with safety broken down more concretely (security safety, ethical safety, etc.).
4.0
나스닥
Strength: Explicitly training the exclusion of unsafe responses "during training" is excellent. Personally, I hope more methods with this kind of guarantee appear.
Weakness: Verification of how this actually behaves beyond training is far too thin. Only one dataset is used, and many experiments safety work should cover, such as defense against adversarial attacks, are missing.
Suggestion: More experiments, please!
2.6
AI
Strength: Unlike prior work on making LLMs safe, there is no reward or cost model, so generality is very high; experimental results are also solid.
Weakness: Feels like the DPO -> SimPO transition...? Compared to the existing DPO paradigm, there seems to be no genuinely new contribution.
Suggestion: Rather than blindly imposing a hard constraint on anything unsafe, propose ways to broaden the criteria for safety.
3.6
404
Strength: The clearest and most intuitive among existing works on safety. Finding the ideal solution mathematically and then approximating it in practice feels very ICLR.
Weakness & Suggestion: Experiments with more diverse LLMs and datasets would have been even better!!
4.2
국밥
Strength: The idea of assigning reward -∞ to unsafe responses to guarantee probability 0 is clean. Pointing out that existing methods are only safe on average was also good.
Weakness: Validated only on the PKU-SafeRLHF-30K dataset, so generality seems lacking.
Suggestion: Subdivide the types of safety.
4.1

TL;DR

💡
  • Proposes SafeDPO, a method that strongly guarantees safety (no harmful answers) in preference alignment while aligning the model as simply as DPO, without the complex pipeline of conventional RLHF
  • Redefines the reward function and relabels the training data so the model consistently prefers safe answers

Summary

  • SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety, ICLR'26 | Link
  • Author
  • Citation: 20

Introduction

Background

  • LLM์ด ๋‹ค์–‘ํ•œ ์ž‘์—…์—์„œ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ์„ ๋ณด์ด์ง€๋งŒ, ์‹ค์ œ ํ™˜๊ฒฝ์—์„œ๋Š” ์‚ฌ์šฉ์ž ๊ธฐ๋Œ€์™€ ์–ด๊ธ‹๋‚˜๋Š” ์ถœ๋ ฅ(e.g., ์›์น˜ ์•Š๋Š” ๋‹ต, ํŽธํ–ฅ/์œ ํ•ด ๋‚ด์šฉ ๋“ฑ)์„ ๋‚ผ ์ˆ˜ ์žˆ์Œ

    โ†’ โ€˜์‚ฌ๋žŒ์ด ์›ํ•˜๋Š” ๋ฐฉํ–ฅโ€™์œผ๋กœ ๋ชจ๋ธ์„ ๋งž์ถ”๋Š” ์ •๋ ฌ์ด ์ค‘์š”ํ•ด์กŒ๊ณ , ์ด๋Ÿฌํ•œ ํŒจ๋Ÿฌ๋‹ค์ž„์œผ๋กœย preference alignment๊ฐ€ ๋“ฑ์žฅ

Preference Alignment

  • ๋ชจ๋ธ ์ถœ๋ ฅ์ด ์ธ๊ฐ„ ์„ ํ˜ธ(human preferences)๋‚˜ ๊ธฐ๋Œ€(expectations)์™€ ์ผ์น˜ํ•˜๋„๋ก ํ•™์Šต์‹œํ‚ค๋Š” ๊ฒƒ
  • ํ•œ ํ”„๋กฌํ”„ํŠธ xx๏ปฟ์— ๋Œ€ํ•ด ์—ฌ๋Ÿฌ ์‘๋‹ต yy๏ปฟ๋ฅผ ๋งŒ๋“ค๊ณ , ์‚ฌ๋žŒ์ด ์–ด๋–ค ์‘๋‹ต์ด ๋” ์ข‹์€์ง€(winner/loser)๋ฅผ ๊ณ ๋ฅธ pairwise preference ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด ์ •์ฑ…(LLM)์„ ์—…๋ฐ์ดํŠธ
    • ํ•™์Šต ์ƒ˜ํ”Œ ์˜ˆ
      • ํ”„๋กฌํ”„ํŠธ: xx๏ปฟ
      • ๋‘ ๊ฐœ์˜ ์‘๋‹ต: y0,y1y_0, y_1๏ปฟ
      • ์‚ฌ๋žŒ(๋˜๋Š” ํ‰๊ฐ€ LM)์ด ์„ ํƒํ•œ ์„ ํ˜ธ ๋ผ๋ฒจ:

        ywโ‰ปylโˆฃxy_w \succ y_l \mid x๏ปฟ (winner / loser)

        โ†’ ์ด ์งˆ๋ฌธ์—๋Š” A๊ฐ€ B๋ณด๋‹ค ๋‚ซ๋‹คโ€๋ผ๋Š” ์Œ ๋น„๊ต ๋ฐ์ดํ„ฐ๋งŒ ์žˆ์œผ๋ฉด ๋จ

Methods of Preference Alignment

  • Reinforcement Learning from Human Feedback (RLHF)
    • First trains a reward model on human (or judge) preference pairs, then fine-tunes the policy/LLM with RL to maximize that reward
      • Maximizes reward while suppressing excessive drift from the reference model via KL regularization
  • DAA (Direct Alignment Algorithms)
    • A family that reduces RLHF's complexity by directly optimizing the policy from preference data alone, without training a separate reward model (e.g., DPO)
    • Unlike RLHF, trains the policy in one pass from pairwise data

Motivation

  1. Preference alignment alone cannot guarantee 'safety'
    • Existing preference alignment trains the model to produce 'the answer people prefer', but does not guarantee that the answer is always safe
    • So safety alignment is usually done by '(1) maximizing a helpfulness reward' while simultaneously '(2) constraining the model so it cannot produce dangerous answers'
  2. Existing safety alignment (the Safe RLHF family, reflecting point 1 above) is effective, but complex
    • There is prior work (safety alignment) that injects safety information into preference alignment
      • e.g., SafeRLHF, SACPO, …
    • But these methods incur high computational/implementation complexity due to auxiliary models (e.g., reward/cost models), multistage pipelines, extra hyperparameter tuning, and so on

So in this paper โ€ฆ

⇒ Apply safety to 'DPO', the preference alignment method that cut down RLHF's complexity!

  • Existing methods train so that the average risk score (expected cost) stays below a threshold; rather than being 'safe only on average', we want unsafe answers to have probability exactly 0
  • So the paper analyzes the existing objective (the hard-constrained safety objective),

    → and, reflecting the above, proposes SafeDPO, a single-stage, DPO-style reformulation

Contribution

  • Directly analyzes the hard-constrained safety alignment objective (unsafe probability 0), shows that a closed-form optimal policy exists, and presents theory that turns it into a tractable training objective
  • Proposes SafeDPO: preference data + binary safety indicators alone enable single-stage, DPO-style training, with no reward/cost models and no online sampling
    • Only minimal modification over standard DPO, plus a single extra hyperparameter (Δ)

Preliminaries

Reinforcement Learning from Human Feedback (RLHF)
  • RLHF proceeds as a roughly three-stage pipeline:
    1. SFT (Supervised Fine-Tuning): train a reference policy $\pi_{\text{ref}}$ (or initial policy) with basic response ability on demonstration/supervised data
    2. Reward model (RM) training: learn a reward function $r_\phi(x,y)$ from pairwise preference data
    3. RL fine-tuning (+ KL regularization): train the policy $\pi_\theta$ to maximize reward under a KL penalty that keeps it near $\pi_{\text{ref}}$

  • Preference-data modeling with the Bradley–Terry (logistic) preference model
    • Given a prompt $x$, the probability that $y_1$ is preferred over the two answers $y_0, y_1$ is modeled as follows:
      • $P(y_1 \succ y_0 \mid x)$: the probability of picking $y_1$ over $y_0$ given prompt $x$
      • $r(x,y)$ is a score (reward) representing goodness/preference
      • If $r(x,y_1) > r(x,y_0)$, the probability that $y_1$ is chosen grows
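In symbols, this is the standard Bradley–Terry form, matching the description above ($\sigma$ is the logistic sigmoid; the note describes but does not reproduce the equation):

$$
P(y_1 \succ y_0 \mid x) = \sigma\big(r(x,y_1) - r(x,y_0)\big) = \frac{\exp r(x,y_1)}{\exp r(x,y_0) + \exp r(x,y_1)}
$$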

  • Reward model training (pairwise logistic loss on preference data)
    • Train the reward model $r_\phi$ on human preference data $D=\{(x, y_w, y_l)\}$ (winner/loser pairs)
      • Trained so the winner receives a higher score than the loser
      • Since we want the winner scored above the loser, we want the gap inside the log to be large (minimize the negative log)
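The pairwise logistic loss described here, in its standard form:

$$
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,y_w,y_l)\sim D}\Big[\log \sigma\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big)\Big]
$$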

  • RLHF์˜ ์ •์ฑ… ์ตœ์ ํ™”(KL-regularized objective)
    • RL ๋‹จ๊ณ„์—์„œ๋Š” ๋ณด์ƒ์„ ํ‚ค์šฐ๋˜, reference ์ •์ฑ…๊ณผ์˜ ์ฐจ์ด๋ฅผ KL๋กœ ์ œํ•œํ•จ
      • ์ •์ฑ…(๋ชจ๋ธ)ย ฯ€ฮธ\pi_\theta๏ปฟ ๊ฐ€ ๋ณด์ƒย rฯ•r_\phi๏ปฟ๋Š” ํฌ๊ฒŒ ๋งŒ๋“ค๊ณ  ๋™์‹œ์— ๋ ˆํผ๋Ÿฐ์Šค ๋ชจ๋ธ ฯ€ref\pi_{ref}๏ปฟ์—์„œ ๋„ˆ๋ฌด ๋ฉ€์–ด์ง€์ง€๋Š” ์•Š๊ฒŒ(=KL ํŽ˜๋„ํ‹ฐ)ํ•™์Šต
      • ฮฒ\beta๏ปฟ๊ฐ€ ํฌ๋ฉด ref model์—์„œ ๋งŽ์ด ๋ชป์›€์ง์ด๊ณ (๋ณด์ˆ˜์ ), ฮฒ\beta๏ปฟ๊ฐ€ ์ž‘์œผ๋ฉด ๋ณด์ƒ rฯ•r_\phi๏ปฟ์„ ๋” ๋งŽ์ด ๊ณ ๋ คํ•˜๊ฒŒ ๋จ
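The KL-regularized objective described above, in its standard form:

$$
\max_{\pi_\theta}\; \mathbb{E}_{x\sim D,\, y\sim \pi_\theta(\cdot\mid x)}\big[r_\phi(x,y)\big] \;-\; \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\text{ref}}(\cdot\mid x)\big]
$$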

Direct Preference Optimization (DPO)
  • Directly optimizes the policy $\pi_\theta$ using only preference data (winner/loser), without training/using a separate reward model as RLHF does
    • DPO's objective works as follows:
      • make the winner $y_w$ get a higher probability under $\pi_\theta$ relative to the reference (appear more often), and
      • make the loser $y_l$ get a lower probability under $\pi_\theta$ relative to the reference (appear less often)
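Concretely, the standard DPO loss that encodes both goals:

$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,y_w,y_l)\sim D}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]
$$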

Method

  • [Step 1] From Hard Constraint to Closed-Form Policy
    • Analyze the existing hard constraint 'dangerous answers must never appear' (probability 0 for unsafe responses) and first derive mathematically what the ideal policy looks like that satisfies this rule while producing the best answers
  • [Step 2] From Intractable Form to Tractable Objective
    • But this ideal policy is hard to compute directly from real data, so relabel the data we have to obtain a tractable training objective
  • [Step 3] Safety Margin
    • Finally, add a margin (Δ) to give a stronger safe-vs-unsafe signal, stabilizing/strengthening training


[Step 1] From Hard Constraint to Closed-Form Policy
📌

Safety alignment was previously posed as a hard-constraint problem making 'unsafe responses have probability 0', but in practice this does not strictly guarantee probability 0 for unsafe responses

⇒ Instead of giving unsafe responses a 'large penalty', eliminate them at the level of the math

  • The existing approach mentioned earlier
    • Existing safety alignment adopts the policy that safe responses should get high probability and unsafe responses probability 0 ⇒ called a hard constraint
      • Hard constraint: a rule that must always hold — the model must not be able to produce an unsafe answer with any probability; those answers must get probability exactly 0
        Equation 6: produce good answers (r) while staying near the reference (KL); a reconstruction follows this list
        • Whatever prompt $x$ arrives, every answer $y$ the model can sample must be safe
      • But many existing methods, for computational convenience, use relaxed forms such as an expected-cost (average risk) constraint, which does not strictly guarantee 'probability 0'!
        • Detail: why is the 'strict guarantee' lost? (expected-cost)
          • The expected-cost constraint is posed roughly as $\mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}[c(x,y)] \le \tau$:
          • the risk (cost) over many cases is averaged, and the constraint counts as satisfied if that average is below a threshold $\tau$
          • Problem: even if unsafe outputs occasionally occur, the constraint can still be satisfied when the other cases are safe enough to keep the average low
      • → Let's revisit the hard constraint itself!
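Pieced together from the description above, the hard-constrained objective (Eq. 6) plausibly has this shape (a reconstruction from the surrounding text, not the paper's exact notation):

$$
\max_{\pi_\theta}\; \mathbb{E}_{x,\, y\sim\pi_\theta(\cdot\mid x)}\big[r(x,y)\big] - \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\text{ref}}(\cdot\mid x)\big]
\quad \text{s.t.}\quad \pi_\theta(y\mid x) = 0 \;\; \text{for every unsafe } y
$$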
  • This paper instead defines a new reward function $r_c$
    • Rather than giving unsafe responses an 'enormous penalty' as reward, it sends them to $-\infty$
    • Under the subsequent exponential weighting, $\exp(-\infty)=0$, so the probability mass becomes 0

      → i.e., unsafe responses are removed structurally
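Written out (the case split follows the data description later in this note; the notation is ours):

$$
r_c(x,y) =
\begin{cases}
r(x,y) & \text{if } y \text{ is safe given } x \\
-\infty & \text{if } y \text{ is unsafe given } x
\end{cases}
$$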

  • So only the reward term $r_c$ is swapped into the earlier safety alignment objective (Eq. 6)
    Equation 8
    • The policy $\pi_\theta(\cdot\mid x)$ we want should
      1. sample safe answers with higher reward $r_c$ more often, and
      2. at the same time not stray too far from the reference model $\pi_{\text{ref}}$ (KL penalty)
    • Instead of enforcing 'unsafe probability must be 0' as a constraint as in Eq. 6, the reward function $r_c$ makes the objective itself rule out unsafe responses
    The paper proves that if a safe response exists for each prompt $x$ and the reference policy $\pi_{\text{ref}}$ places nonzero probability mass on that safe region, the hard-constrained form (Eq. 6) and the $r_c$-based objective (Eq. 8) have the same optimal solution (proof omitted…)

  • Eq 8์˜ ์ตœ์ ํ•ด๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค๊ณ  ํ•œ๋‹ค:
    • ์ตœ์ ํ•ดโ€”โ€์ตœ์ ์ผ ๋•Œ ๋ถ„ํฌ๊ฐ€ ์ด๋Ÿฐ ๋ชจ์–‘์ด์–ด์•ผ ํ•œ๋‹คโ€๋ฅผ ์ˆ˜ํ•™์ ์œผ๋กœ ๋ฐ”๋กœ ๋„์ถœํ•œ ๊ฒฐ๊ณผโ€
    • Eq 8 ๊ฐ™์€ โ€˜๊ธฐ๋Œ€ ๋ณด์ƒ โˆ’ ฮฒ\beta๏ปฟKLโ€™ ํ˜•ํƒœ๋Š” ์ตœ์  ์ •์ฑ…์ด ๋‹ค์Œ์ฒ˜๋Ÿผ reference ร— exp(๋ณด์ƒ/ฮฒ\beta๏ปฟ) ํ˜•ํƒœ๋กœ ๋–จ์–ด์ง€๋Š” ๊ฒŒ ์œ ๋ช…ํ•œ ๊ฒฐ๊ณผ๋ผ๊ณ  ํ•œ๋‹ค..!
    Equation 9
  • ์•ž์„œ์„œ unsafe ํ•œ rcr_c๏ปฟ์— ๋Œ€ํ•ด์„œ๋Š” -โˆž ๋กœ ์ •์˜ํ–ˆ์—ˆ๋Š”๋ฐ,

    โ†’ rc(x,y)r_c(x, y)๏ปฟ ๋ถ€๋ถ„์ด -โˆž ์œผ๋กœ ๊ฐ€๋ฉด exp(-โˆž)์œผ๋กœ ๊ฐ€์„œ ๊ถ๊ทน์ ์œผ๋กœ unsafe๋Š” ๋ ˆํผ๋Ÿฐ์Šค๊ฐ€ ์›๋ž˜ ํ™•๋ฅ ์„ ์ฃผ๊ณ  ์žˆ์—ˆ๋”๋ผ๋„, ๊ณฑ์…ˆ์—์„œ 0์ด ๋˜์–ด ์™„์ „ํžˆ ์ œ๊ฑฐ๋จ

    • ๊ธฐ์กด์—๋Š” ํ™•๋ฅ ์„ 0์œผ๋กœ ์ฃผ๊ณ ์ž ํ•ด๋„ ์ด๋ฅผ strictํ•˜๊ฒŒ ๋ณด์žฅ์ด ๋˜์ง€ ์•Š์•˜์ง€๋งŒ, ์ˆ˜์‹ ์ธก๋ฉด์—์„œ ์•„์˜ˆ 0์œผ๋กœ ๋งŒ๋“ค์–ด๋ฒ„๋ฆฌ๋Š”๊ฒƒ์ž„

  • The theoretically induced preference objective is then:
    Equation 10 (a plausible reconstruction follows this block)
    • But! This expression cannot be computed directly
    • Because its expectation is defined not over our dataset $D$ but over the virtual preference distribution $\tilde D$ induced by $r_c$ (no direct sampling/computation)
      • Detail
        • What we actually have are the human- (or model-)annotated helpfulness preferences $(x, y_w, y_l)$ and the safety labels $(h_w, h_l)$
        • Yet the objective $\mathcal{L}$ takes its expectation over the 'virtual preference distribution $\tilde D$' that $r_c$ would have generated, not over the human-annotated preferences
          • $\tilde D$ is the preference distribution that would 'theoretically be generated' from $r_c$ ($-\infty$ if unsafe)
        • The problem: $r_c$ itself is an unobserved latent function (reflecting reward + safety cost), so the preference distribution $\tilde D$ it induces cannot be read off the data either

        → hence the expectation over $\tilde D$ cannot be computed directly
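Given the DPO derivation it parallels, Eq. 10 is plausibly the DPO loss with its expectation taken over $\tilde D$ (a sketch under that assumption, not the paper's exact notation):

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,y_w,y_l)\sim \tilde D}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\right)\right]
$$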

[Step 2] From Intractable Form to Tractable Objective
📌

The closed-form objective above is theoretically sound, but its expectation is defined over the virtual preference distribution $\tilde D$, so it cannot be computed/trained directly

⇒ Since our preference data also records whether each response is unsafe, relabel that data into a new dataset that makes the objective computable

  • The earlier expression (Eq. 10) holds in theory but cannot be computed. To make it computable, the paper proposes a dataset $T$ obtained by transforming the data $D$
    • Data description
      • In ordinary DPO preference training, the data look like $(x, y_w, y_l)$
      • In the safety alignment setting, preference data additionally carry a label for whether each answer is safe or unsafe: $(x, y_w, y_l, h_w, h_l)$
        • Binary safety indicator: $h=1$ means unsafe, $h=0$ means safe
      • Safe answers keep their original reward under $r_c(x, y)$; unsafe answers get $-\infty$
    • Case 1: the winner is safe → keep as-is
    • Case 2: the winner is unsafe but the loser is safe → swap
      • SafeDPO treats safety as the overriding constraint, so in any mixed safe/unsafe pair the safe response must be the winner
    • Case 3: both unsafe → drop
      • Both are disqualified, so debating 'which is better' is meaningless (contributes nothing to training); see the sketch after this list
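A minimal Python sketch of this relabeling; the field names (`prompt`, `winner`, `winner_unsafe`, …) are hypothetical, not from the paper's code:

```python
def relabel_for_safedpo(dataset):
    """Transform D = {(x, y_w, y_l, h_w, h_l)} into T per SafeDPO's three cases.

    h = 1 means unsafe, h = 0 means safe (binary safety indicator).
    """
    transformed = []
    for ex in dataset:
        x, y_w, y_l = ex["prompt"], ex["winner"], ex["loser"]
        h_w, h_l = ex["winner_unsafe"], ex["loser_unsafe"]

        if h_w == 1 and h_l == 1:
            # Case 3: both unsafe -> drop; the pair carries no useful signal.
            continue
        if h_w == 1 and h_l == 0:
            # Case 2: unsafe winner, safe loser -> swap so the safe one wins.
            y_w, y_l = y_l, y_w
            h_w, h_l = h_l, h_w
        # Case 1 (safe winner): keep as-is.
        transformed.append(
            {"prompt": x, "winner": y_w, "loser": y_l,
             "winner_unsafe": h_w, "loser_unsafe": h_l}
        )
    return transformed
```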

  • Finally, this yields the following objective:
    Equation 11
    • The unobservable ideal objective (Eq. 10) is recovered through the data relabeling
      • Example-based illustration
        • For some prompt $x$, a pair comes in as:
          • $y_w$: accurate but dangerous (unsafe) instructions → unsafe
          • $y_l$: safe refusal + alternative suggestion → safe
        • Human annotators may have weighted helpfulness and marked the unsafe one as winner, but in SafeDPO's $r_c$ world unsafe is $-\infty$ and must be the loser → swap so the safe response becomes the winner

          → this reproduces the preference direction $\tilde D$ would have drawn


          • Original preference data $D$: usually annotated as 'which of the two is more helpful?'
          • What SafeDPO wants: 'unsafe is eliminated outright; compare only among safe answers and promote the more helpful one'
          • So even if a human marked 'the unsafe one is more useful' in a safe-vs-unsafe pair, SafeDPO flips it in the training target (swap) because it violates the safety constraint

The paper states that Eq. 10 and Eq. 11 defined above are equivalent (proof omitted…)

[Step 3] Safety Margin
📌

SafeDPO is trained on the transformed data $T$, additionally injecting a 'safety margin' that strengthens the training signal in safe-vs-unsafe comparisons

  • During training, the safe–unsafe separation is pushed harder to reinforce the learning signal; a sketch of the resulting loss follows this list
    • The $(\tilde h_l - \tilde h_w)\Delta$ term here:
      • applies the margin only to safe-vs-unsafe pairs (pushes training harder)
      • is 0 for safe-vs-safe pairs, so behavior matches vanilla DPO
    • As a result, safe–unsafe pairs are made to satisfy the margin condition more strongly
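A minimal PyTorch sketch of the margin-augmented DPO loss this describes. Assumptions: sequence log-probabilities are precomputed, the function name `safedpo_loss` is ours, and the margin is subtracted inside the sigmoid so safe-vs-unsafe pairs must clear an extra gap of Δ, as in standard margin-based DPO variants; the paper's exact sign convention may differ:

```python
import torch
import torch.nn.functional as F

def safedpo_loss(policy_logp_w, policy_logp_l,  # log pi_theta(y~_w|x), log pi_theta(y~_l|x)
                 ref_logp_w, ref_logp_l,        # log pi_ref(y~_w|x),   log pi_ref(y~_l|x)
                 h_w, h_l,                      # safety indicators after relabeling (1 = unsafe)
                 beta: float = 0.1, delta: float = 5.0) -> torch.Tensor:
    """DPO loss with a SafeDPO-style safety margin (sketch, not the paper's code)."""
    # Implicit-reward difference, exactly as in DPO.
    logits = beta * (policy_logp_w - ref_logp_w) - beta * (policy_logp_l - ref_logp_l)
    # After relabeling, (h_l - h_w) is 1 for safe-vs-unsafe pairs and 0 for
    # safe-vs-safe pairs, so the margin fires only when the loser is unsafe.
    margin = (h_l - h_w) * delta
    # Subtracting the margin inside the sigmoid forces a larger gap for those pairs.
    return -F.logsigmoid(logits - margin).mean()
```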

Experiment

Setting

  • Datasets
    • PKU-SafeRLHF-30K
      • 27,000 training entries, 3,000 testing entries
      • Each entry is a tuple ($x$, $y_0$, $y_1$) annotated with which response is more helpful, which is safer, and per-response binary safety indicators ($h$)
  • Reference model
    • Alpaca-7B model (SFT'd on PKU-SafeRLHF-30K)
  • Baselines
    • DPO-HELPFUL: vanilla DPO trained only on helpfulness preference data (the 'more helpful answer' as winner)
    • DPO-HARMLESS: DPO trained on harmlessness (safety) preference data (the 'safer answer' as winner)
    • DPO-SAFEBETTER: DPO trained on the filtered data that keeps only pairs whose winner $y_w$ is safe (samples with an unsafe winner are removed)
    • SafeRLHF
    • SACPO, P-SACPO: a family that jointly optimizes preference (reward) + safety constraints
  • Evaluation Method
    • Model-based evaluation
      • beaver-7b-unified-reward: predicts each response's helpfulness score as a 'reward'
      • beaver-7b-unified-cost: predicts each response's harmlessness-related score as a 'cost' (risk/violation), from which harmlessness and the harmless ratio are computed
    • GPT-4 Evaluation
      • Scored by GPT-4 (scale 0–10)
  • Metrics
    • Helpfulness: expected reward
      • For each test prompt, the model generates an answer, the reward model scores it, and the mean (expectation) is taken as helpfulness
    • Harmless ratio: fraction of generated responses judged 'safe' (= safe-response rate)
    • Harmlessness: mean safety score

Results

Harmlessness and Helpfulness

  • SafeDPO suppresses unsafe outputs most strongly (Harmless Ratio; a-1, b-1)
    • About 97% under model-based evaluation, 100% under GPT-4 evaluation

      → unsafe generation is almost completely suppressed

  • Its responses' mean safety score is also the highest (Harmlessness; a-2, b-2)

Effectiveness & Sensitivity of the Δ Hyperparameter (Safety margin)

  • Varies the safety-margin hyperparameter Δ over {0, 2, 5, 10, 20} and observes the performance change
  • Safety already comes out high even without the safety margin → SafeDPO does not depend on the margin; the paper's other mechanisms alone suppress unsafe outputs sufficiently
  • A moderate margin improves performance, but too large a margin degrades it

Robustness across Models & Scales

  • Varies model size from 1.5B to 13B and applies SafeDPO with the same hyperparameters to compare performance
  • At every scale, SafeDPO consistently achieves strong safety while maintaining or slightly improving helpfulness

    → SafeDPO is a safety alignment method that scales

Categories

DPO research