14 January 2026

Advancing Expert Specialization for Better MoE

Review

Nickname | One-line review | Rating (0 to 5)
찰나 | My first thought on reading this paper was that, if someone asked me to explain how concepts such as MoE and agents actually differ, I probably could not do it clearly. That is somewhat beside the point of the paper, but it prompted the thought and made me study, so it was a good paper for me. MoE itself is a fairly old method, but I think it is closely tied to real-world use cases, and this is a good study to reference for performance-improvement directions and related work. | 4.3
와사비꽃게랑 | The idea of "split one problem across several perspectives and roles" feels similar across many areas: attention heads, MoE experts, multi-agent systems, and so on. What seems to matter is not simply increasing the number of components but explicitly assigning each one a clearly separated role. | 4
메가커피 | A study aimed at preserving the essence (?) of MoE. It only adds two terms to the loss function (orthogonality loss, variance loss), yet the experiments show a substantial performance gain. A clean paper from motivation through results. | 4.2
요리괴물 | The paper is very hard... Overall it proposes two losses, but I am not sure that forcing each expert to have a distinct representation helps on every task. What about multilingual tasks, or tasks that depend on semantic similarity? | 4.2
새우깡 | Introducing new training objectives is meaningful in itself, but the paper also demonstrates well, both theoretically and experimentally, that they improve performance without undermining the benefits of the existing objective. It does not stop at proposing a method; it puts real effort into persuading the reader. | 4
안성재 | It addresses the limitations of MoE in a genuinely technical way. With this level of technical skill, you can do research that not just anyone can do, even without being especially creative, and that is a differentiator in its own right. It survives. | 4
스타벅스 | A study that analyzes the weaknesses of MoE systematically and mathematically. It shows that systematic analysis of representations matters more than simply increasing the number and size of components. | 4.5
고구마맛도리 | A clean, no-frills, good study that clearly defines the limitations of existing MoE and implements the fix well as objectives! In particular, getting good performance without touching the MoE architecture is, I think, the paper's greatest strength. | 4.2

TL; DR

💡

The Mixture-of-Experts training loss includes an objective term for efficient routing across experts

  • However, this term has the side effect of hindering each expert's specialization
  • ⇒ Add objectives that promote expert specialization without interfering with the routing-efficiency goal

Summary

1. Introduction

Background

  • As LLMs scale up, inference cost rises sharply, which hinders practical deployment and efficiency
  • The Mixture-of-Experts (MoE) architecture mitigates this by activating only a subset of experts depending on the input
    • More on MoE
      • A given layer or operation (e.g., linear layer, MLP, attention projection) is split into multiple "expert" subnetworks
        • Each expert subnetwork computes independently, and the results are combined into the final output of the MoE layer
        • For a given input, either all experts are used (dense experts) or only a subset made up of the top-k experts (sparse experts)
          • This paper uses the sparse setting
    • Compute cost does not grow in proportion to model size, so larger models become usable
    • MoE pre-training typically uses a load balancing objective that spreads tokens more evenly across experts to maximize parameter utilization
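
The sparse top-k setting described above can be sketched in a few lines of plain Python. This is an illustration only, not the paper's implementation: the names `top_k_route` and `moe_forward`, the toy scaling "experts", and the choice to renormalize the selected weights are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are the softmax scores of the selected experts, renormalized
    to sum to 1 (one common convention; some systems skip this step).
    """
    scores = softmax(router_logits)
    chosen = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
    norm = sum(scores[j] for j in chosen)
    return {j: scores[j] / norm for j in chosen}

def moe_forward(x, router_logits, experts, k=2):
    """Sparse MoE output: weighted sum over only the k selected experts."""
    routing = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for j, w in routing.items():
        y = experts[j](x)  # unselected experts are never evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, routing

# Four toy "experts": each scales its input by a different constant.
experts = [lambda x, c=c: [c * v for v in x] for c in (1.0, 2.0, 3.0, 4.0)]
output, routing = moe_forward([1.0, -1.0], [0.1, 2.0, 0.3, 1.5], experts, k=2)
```

Only experts 1 and 3 run for this token, which is why compute stays roughly constant as the number of experts grows.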

Motivation

  • The load balancing objective is effective at preventing experts from going unused during pre-training, but it blocks effective adaptation during post-training for downstream tasks
    • It pushes routing toward uniformity regardless of the input, so token distributions frequently overlap across experts
    • This overlap makes expert representations resemble one another, hindering functional specialization of each expert
    • The lack of specialization degrades performance when the model is fine-tuned on downstream tasks
  • Problems the load balancing objective causes from the expert and routing perspectives
    • Expert perspective: it hinders each expert from developing its own distinct behavior
    • Router perspective: as expert specialization weakens, differences between experts shrink → token-to-expert assignment becomes increasingly uniform
    • → Reduced specialization and uniform routing reinforce each other, degrading both expert representations and routing quality
    • ⇒ Expert specialization must be decoupled from the uniformity constraint that the auxiliary loss imposes in MoE training

Contribution

  • Proposes a framework that promotes expert specialization and routing diversification while keeping the load balancing of the auxiliary loss, by introducing two complementary objectives
    • objective(1) expert specialization: makes each expert specialize in processing different tokens, promoting distinct representations across experts
    • objective(2) routing diversification: increases routing variance to induce differentiated routing decisions, improving the precision of token-to-expert assignment
    • ⇒ Jointly optimizing these objectives eases the trade-off between model performance and routing efficiency in MoE training
  • Results achieved with the proposed framework
    • enhanced expert-routing synergy: the joint objectives reduce expert overlap by up to 45% and increase routing score variance by 150% → clearer expert specialization and more differentiated expert routing
    • stable load balancing: despite the new objectives, load balancing stays on par with the baseline, with RMSE below 8.63 on every model
    • improved downstream performance: without modifying the MoE architecture, a 23.79% relative improvement across 11 benchmarks, outperforming all baselines on 92.42% of tasks

2. Motivation

Preliminaries of MoE

  • MoE layer (notations)
    • $n$ experts
    • input token sequence, $X = \{x_1, ..., x_N\}$
    • routing score matrix, $S$: the score matrix used to assign each token to its top-k experts
      • $s_{ij}$ : routing weight of the $j$-th expert for the $i$-th token
    • $F = \{f_1, ..., f_n\}$ : fraction of tokens assigned to each expert
      • $f_j$ : number of tokens assigned to the $j$-th expert, expressed as a fraction of all tokens
    • total loss function, $L$
      • main task loss, $L_h$ : loss computed from the output of the MoE layer
      • auxiliary loss, $L_{aux}$
        • $\alpha$ : coefficient of the auxiliary loss
      • $p_j$ : total routing score for the $j$-th expert
        • i.e., the sum of the routing weights of all tokens assigned to the $j$-th expert
      auxiliary loss, $L_{aux}$ : minimized when the number of tokens assigned to each expert ($f_j$) is even
      → load-balancing objective term (makes each expert receive a similar number of tokens)
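
To make the notation concrete, here is a minimal sketch that computes $f_j$, $p_j$ and the unscaled auxiliary term $\sum_j f_j p_j$ from a routing score matrix. It assumes the common product form of the load-balancing loss and follows the definitions above ($p_j$ sums the weights of tokens routed to expert $j$); the function names and the toy score matrices are hypothetical.

```python
def aux_loss_terms(scores, k):
    """scores[i][j] is the routing weight s_ij of expert j for token i.

    Returns (f, p): f[j] is the fraction of top-k assignments that go to
    expert j, and p[j] is the summed routing weight of the tokens routed
    to expert j.
    """
    n_tokens, n_experts = len(scores), len(scores[0])
    f = [0.0] * n_experts
    p = [0.0] * n_experts
    for row in scores:
        top = sorted(range(n_experts), key=lambda j: row[j], reverse=True)[:k]
        for j in top:
            f[j] += 1.0 / (n_tokens * k)
            p[j] += row[j]
    return f, p

def aux_loss(scores, k):
    """Unscaled load-balancing term sum_j f_j * p_j (alpha applied elsewhere)."""
    f, p = aux_loss_terms(scores, k)
    return sum(fj * pj for fj, pj in zip(f, p))

# Same per-token confidence, different load: balanced vs. collapsed routing.
balanced = [[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]
collapsed = [[0.9, 0.1]] * 4
loss_balanced = aux_loss(balanced, k=1)    # about 1.8
loss_collapsed = aux_loss(collapsed, k=1)  # about 3.6
```

With identical per-token scores, the term is smaller when tokens are spread evenly, which is the pressure toward uniform $f_j$ that the notes describe.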

Observations

  • obs(1) expert overlap: introducing the auxiliary loss evens out the token distribution across experts, which reduces how distinguishable the experts are
    • The auxiliary loss is independent of the expert parameters $\theta_{E_j}$ → the gradient for the $j$-th expert comes only from the task loss:
      • $y_h$ : MoE layer output
      • → The input tokens $x_i$ enter the gradient that the total loss sends into the expert parameters
        • The auxiliary loss, which enforces load-balanced routing, induces an even token distribution across experts during training
        • → Input tokens can be assigned to experts that are only weakly related to them, which drives gradient flow into unintended experts
  • obs(2) routing uniformity: as training proceeds, the routing output gradually becomes uniform, and the distribution of expert weights evens out
    • The routing output is the score matrix $S\ (s_{ij})$ → the gradient with respect to the routing parameters $\theta_R$ depends on the following:
      • $x_i \cdot \theta_{E_j}$ : output of expert $j$ for token $x_i$
      • $f_j$ : how often expert $j$ is selected
      • → The routing gradient is driven mainly by the expert outputs and the token distribution across experts
    • $L_{aux}$ is a loss that encourages balanced token assignment by pushing $f_j$ toward uniformity, but $f_j$ is not differentiable, so it is hard to optimize directly
      • Instead, the gradient of the routing network is computed through $p_j$ (the total routing score of expert $j$), which is differentiable and positively correlated with $f_j$
      • → Optimizing $L_{aux}$ promotes uniformity of $p_j$, which in turn leads to uniformity of $f_j$
      • ⇒ As seen in obs(1), assigning tokens to ill-suited experts makes gradients overlap across experts, which increases the similarity between expert outputs $x_i \cdot \theta_{E_j}$
  • obs(3) expert-routing interaction: obs(1) concerns expert specialization and obs(2) concerns routing uniformity → these two phenomena interact during training and drive model performance down
    • The expert-side interference observed in obs(1) produces blurred specialization
      • This makes the token distribution more uniform and triggers gradients that further reduce the distinction between experts
    • Expert similarity in turn affects routing (obs(2))
      • As expert outputs grow more alike, the routing network finds it harder to pick up a signal that differentiates the experts
      • As a result, top-k experts are selected more and more randomly, and tokens fail to align with their best-suited experts

3. Method

  • ⇒ Design a loss function $L$ that mitigates both expert overlap and routing uniformity
    • $L_{aux}$ : the existing auxiliary loss
    • $L_o, L_v$ : the newly introduced orthogonality loss and variance loss
    • $\alpha, \beta, \gamma$ : coefficients
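
Reading the terms above together, the combined objective is presumably a weighted sum of the four losses. The placement of the coefficients below is an assumption inferred from the list ($\alpha$ for $L_{aux}$, $\beta$ for $L_o$, $\gamma$ for $L_v$), with $L_v$ defined so that minimizing it increases routing variance.

```latex
L = L_h + \alpha\, L_{aux} + \beta\, L_o + \gamma\, L_v
```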

Implementations of losses $L_o$ and $L_v$

  • expert specialization: the orthogonalization objective $L_o$ promotes independent representations across experts
    • $\tilde{x}_{ij}$ : output of expert $j$ for token $x_i$ after top-k routing
    • → Minimizes the sum of projections between the outputs of different experts for the same input token (orthogonalize)
      • This makes each expert hold a representation distinct from the others
  • routing diversification: the variance-based loss $L_v$ encourages more diverse routing decisions and expert specialization
    • $\bar{s}_j$ : average routing score of expert $j$ over the data batch
    • → Maximizes the variance of routing scores so that token-to-expert assignment does not become uniform
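
A minimal sketch of both losses as described above, in plain Python. It is not the paper's exact formulation: the orthogonality term here sums squared projection lengths over ordered pairs of selected experts, the variance term is the negative mean squared deviation from $\bar{s}_j$, and the function names and `eps` stabilizer are assumptions for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def orthogonality_loss(expert_outputs, eps=1e-8):
    """expert_outputs[i][j] is the output vector of selected expert j for token i.

    For every token and every ordered pair of distinct selected experts
    (j, k), add the squared length of the projection of output j onto
    output k. The sum is zero when the outputs are mutually orthogonal.
    """
    total = 0.0
    for outs in expert_outputs:
        for j, a in enumerate(outs):
            for k, b in enumerate(outs):
                if j != k:
                    total += dot(a, b) ** 2 / (dot(b, b) + eps)
    return total

def variance_loss(scores):
    """Negative mean squared deviation of s_ij from the batch mean of expert j.

    Minimizing this value maximizes the variance of the routing scores.
    """
    n_tokens, n_experts = len(scores), len(scores[0])
    means = [sum(row[j] for row in scores) / n_tokens for j in range(n_experts)]
    sq = sum((row[j] - means[j]) ** 2 for row in scores for j in range(n_experts))
    return -sq / (n_tokens * n_experts)

lo_orthogonal = orthogonality_loss([[[1.0, 0.0], [0.0, 1.0]]])  # orthogonal outputs
lo_parallel = orthogonality_loss([[[1.0, 0.0], [2.0, 0.0]]])    # parallel outputs
lv_uniform = variance_loss([[0.5, 0.5], [0.5, 0.5]])            # identical scores
lv_peaked = variance_loss([[0.9, 0.1], [0.1, 0.9]])             # token-dependent scores
```

Orthogonal expert outputs give a zero orthogonality term while parallel ones are penalized, and token-dependent routing scores give a lower (more negative) variance loss than uniform ones.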

Compatibility of multi-objective optimization

  • Shows that the two losses are compatible from both the expert and the routing perspective
    • expert perspective
      • The auxiliary loss $L_{aux}$ and the variance loss $L_v$ do not contribute directly to the expert parameters $\theta_{E_j}$ → only the task loss $L_h$ and the orthogonality loss $L_o$ enter the gradient of the total loss with respect to the expert parameters:
        • $g_{y_i} = \nabla_{y_i} L_h$ : gradient of the task loss with respect to the model output
        • → Influenced by the routing score $s_{ij}$ and the expert representation $\tilde{x}_{ij}$
        • ⇒ As training proceeds, the variance of expert weights increases, and the gradients increasingly favor a different direction for each token
    • routing perspective
      • $L_o$ does not contribute directly to the gradient of the routing parameters $\theta_R$ → the gradient of the total loss with respect to the routing parameters is influenced by the expert representation $\tilde{x}_{ij}$, the expert load $f_j$, and the routing weights $s_{ij}$:
        • → As training proceeds, expert load becomes balanced and routing-weight variance increases
          • Orthogonalizing expert representations leads to orthogonalized routing gradients and increases routing-weight variance
    • ⇒ Among the new losses, the expert parameters $\theta_{E_j}$ are affected only by the gradient of $L_o$, while the routing parameters $\theta_R$ are affected by both $L_o$ (indirectly, through the expert representations) and $L_v$, but the goals of the two losses do not conflict (orthogonalizing expert representations and diversifying routing scores)
      • The two objectives can be jointly optimized without conflict

4. Experiments

Experimental Setup

  • datasets
    • Training: the training sets of Numina, GLUE, and the FLAN collection
    • Test
      • math: GSM8K, MATH500, Numina
      • multi-domain tasks: MMLU, MMLU-pro, BBH, GLUE, LiveBench, GPQA
      • code generation: HumanEval, MBPP
  • baselines (MoE training strategies)
    • Aux Loss, GShard, ST-MoE, Loss-Free Balancing
  • metrics
    • accuracy
    • expert load balancing (MaxVio_global)
    • clustering quality (Silhouette Coefficient)
    • expert specialization (Expert Overlap)
    • routing stability (Routing Variance)
  • setup
    • Trained for 3 epochs (~550 steps)
    • LoRA-based fine-tuning (LoRA modules are applied to both the router layer and the expert layers, which are optimized jointly)

Performance in Downstream Tasks

  • The proposed method induces expert specialization and delivers clearly improved performance on downstream tasks
  • ⇒ Confirms that expert orthogonality and routing output diversification have a positive effect on downstream task performance

Load Balancing

MaxVio_global: measures the degree of load balancing. Lower is better
RMSE: measures the difference between two curves
  • The load balancing trend is almost identical across only aux (uses only $L_{aux}$), w/o lv (adds only $L_o$), and w/o lo (adds only $L_v$)
    • The RMSE between the curves is also below 0.03, so they are very similar
  • ⇒ Shows that $L_v$ and $L_o$ do not affect the load balancing achieved by $L_{aux}$
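
For reference, the two metrics can be sketched as follows. The MaxVio definition used here, (max load − mean load) / mean load, is the one commonly used in the load-balancing literature and is assumed to match the paper; the "global" variant is assumed to count loads over the whole evaluation set rather than per batch. The function names and example numbers are hypothetical.

```python
import math

def max_vio(loads):
    """Maximal violation of balance: 0 means every expert got the mean load."""
    mean = sum(loads) / len(loads)
    return (max(loads) - mean) / mean

def rmse(curve_a, curve_b):
    """Root-mean-square error between two equally sampled curves."""
    sq = sum((a - b) ** 2 for a, b in zip(curve_a, curve_b))
    return math.sqrt(sq / len(curve_a))

vio_even = max_vio([25, 25, 25, 25])    # perfectly balanced token counts
vio_skewed = max_vio([40, 20, 20, 20])  # one overloaded expert
gap = rmse([0.30, 0.20, 0.10], [0.30, 0.20, 0.10])  # identical curves
```

A small RMSE between the MaxVio_global curves of two training runs means their load-balancing behavior is nearly the same over training, which is how the comparison above is read.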

Behaviors of Experts and Routing

  • ์ฒ˜์Œ ๋‘๊ฐœ ๊ทธ๋ž˜ํ”„๋Š” ์ „๋ฌธ๊ฐ€ ์ง๊ต์„ฑ, ๋งˆ์ง€๋ง‰ ๊ทธ๋ž˜ํ”„๋Š” ๋ผ์šฐํŒ… ์ถœ๋ ฅ์˜ ๋‹ค์–‘์„ฑ ๋‚˜ํƒ€๋ƒ„
    • ์ฒ˜์Œ ๋‘ ๊ทธ๋ž˜ํ”„ โ†’ LoL_o๏ปฟ๊ฐ€ ์ „๋ฌธ๊ฐ€ ์ง๊ต์„ฑ์„ ์ง์ ‘ ์ด‰์ง„ํ•˜๋ฉฐ, LvL_v๏ปฟ๋„ ์ด์— ๊ธฐ์—ฌํ•จ
    • ๋งˆ์ง€๋ง‰ ๊ทธ๋ž˜ํ”„ โ†’ LvL_v๏ปฟ๊ฐ€ ๋ผ์šฐํŒ… ์ถœ๋ ฅ ๋‹ค์–‘์„ฑ์„ ์ง์ ‘ ํ–ฅ์ƒํ•˜๋ฉฐ, LoL_o๏ปฟ๋„ ์ด์— ๊ธฐ์—ฌํ•จ
  • โ‡’ Lv,LoL_v, L_o๏ปฟ๊ฐ€ ์ „๋ฌธ๊ฐ€ ์ง๊ต์„ฑ, ๋ผ์šฐํŒ… ์ ์ˆ˜ ๋‹ค์–‘ํ™”๋ฅผ ๊ณต๋™ ์ด‰์ง„ํ•  ์ˆ˜ ์žˆ์Œ์„ ๋ณด์ž„

Ablation among Losses

  • Combining $L_o$ and $L_v$ substantially improves model performance on downstream tasks
    • Each loss also improves performance when introduced on its own
  • ⇒ Shows that on downstream tasks both $L_o$ and $L_v$ improve model performance, and that combining them produces a synergy in which each strengthens the effect of the other

Categories

research