30 December 2025

Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems

Review

Nickname | One-line review | Rating (0/5)
밤 | Later on, LLMs might automatically analyze LLM errors, although accuracy is still quite low and they struggle when logs get long. I'm curious why all-at-once and step-by-step differ so much in performance. | 3.9
리틀 | Perhaps because the trend is moving from single-agent systems to multi-agent ones, I see many papers tackling multi-agent problems. All three approaches still perform poorly, but if we could pinpoint exactly which agent went wrong, vibe coding would become truly complete. | 4.5
5시 | This paper seems important in that it proposes a framework(?) for automating LLM error analysis. It is a pity that errors are still frequent on longer logs and that even strong reasoning models perform poorly, but if follow-up work solves this, I think systems may be able to identify their own errors. | 4.5
3일전 | A paper that shows how failure-cause analysis should be done with different log configurations depending on the situation. If the Hybrid method is optimized, step-level accuracy could probably stay fairly high as well. I wonder how this differs from summarization. | 4.5
커튼콜 | A task that so explicitly and directly asks "who caused the failure, and when?" seems novel. Where prior work mainly focused on raising the overall success rate, this paper instead puts failure cases at the center of the analysis, which reads as an attempt to address fundamental weaknesses of multi-agent systems. | 4.2
노트북노션로딩안돼서폰으로 | I think reliability comes less from doing well than from not making mistakes, and from fixing them when they happen. Rather than focusing only on how high the ceiling is, isn't it also important to raise the floor? Seems like a good paper as related work. | 4.5
동글동글 | LLMs are gradually evolving into intelligence itself... what will be left for us to do? | 4.5
빠스 | Part of me wonders whether this really needs a benchmark, but it does seem necessary for full automation. | 3.8

TL;DR

💡

When an LLM multi-agent system fails, let's automatically work out which agent made the error and when!
Proposes a benchmark and evaluates how well current LLMs perform on it.

Summary

Motivation

  • LLM multi-agent systems are used in many areas such as coding and research, but finding the cause when a system fails is still manual and time-consuming
    • e.g., when vibe coding does not behave as intended, a person ends up reading through the code by hand
  • Failure analysis means finding, inside a long log, which agent went wrong at which point!

    → Let's have an LLM do this automatically

Contribution

  • Problem Definition
    • Poses the problem of identifying when and where an LLM multi-agent system went wrong
  • Benchmark: Who&When
    • Builds a benchmark of analyzed agentic-system failures
  • Can LLMs help identify When and Which agent causes task failures?
    • Evaluates and analyzes how well LLMs perform automated failure analysis

Problem Definition

When an LLM multi-agent system fails, find the earliest of the decisive errors,
and determine when it happened and which agent made it!

  • LLM Multi agent system
    • $M = (\mathcal{N}, S, A, P, \phi)$
      • $\mathcal{N}$: the set of agents
      • $S$: the set of states
      • $A$: the set of actions
        • each agent $i$ can act from a subset $A_i \subseteq A$ of the action set
      • $P(s_{t+1} \mid s_t, a_t, \phi(t))$: the state-transition probability given that only $\phi(t)$ acts at time $t$
      • $\phi(t)$: the agent that acts at time step $t$

  • Trajectory
    • $\tau = (s_0, a_0, s_1, a_1, \ldots, s_T)$
    • A mistake within a trajectory is denoted $(i, t)$
      • i.e., the action $a_t$ taken by agent $i$ at time $t$ is erroneous
  • Trajectory result function
    • $Z(\tau) = \begin{cases} 1, & \text{if the system ultimately fails,} \\ 0, & \text{otherwise.} \end{cases}$
    • 1 if the system fails, 0 otherwise
  • Decisive error
    • The trajectory in which agent $i$'s action at time $t$ has been corrected
      • $\tau^{(i,t)} = \mathcal{I}_{(i,t)}(\tau)$
    • If the correction resolves the failure, $\Delta_{i,t}(\tau) = 1$; otherwise 0
      • $\Delta_{i,t}(\tau) = \begin{cases} 1, & \text{if } Z(\tau)=1 \text{ and } Z(\tau^{(i,t)})=0 \\ 0, & \text{otherwise} \end{cases}$
    • The pairs $(i, t)$ that satisfy $\Delta_{i,t}(\tau) = 1$ are the decisive errors!
      • $C(\tau) = \{(i, t) \mid \Delta_{i,t}(\tau) = 1\}$
  • Problem
    • Find the decisive error that occurred earliest
    • $(\hat{i}, \hat{t}) = \underset{(i,t) \in C(\tau)}{\arg\min} \ t$
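The definitions above can be sketched in a few lines of Python. This is only an illustration: the toy failure rule in `Z` ("the system tolerates at most one bad step"), the `intervene` helper, and the agent names are assumptions made up for the example, not something taken from the paper.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Step = Tuple[str, bool]      # (agent name, whether the action was correct)
Trajectory = List[Step]
Mistake = Tuple[str, int]    # (agent i, time step t)


def earliest_decisive_error(
    tau: Trajectory,
    candidates: Iterable[Mistake],
    Z: Callable[[Trajectory], int],
    intervene: Callable[[Trajectory, str, int], Trajectory],
) -> Optional[Mistake]:
    """Return argmin over t of the decisive-error set C(tau), or None if it is empty."""
    if Z(tau) == 0:
        return None  # Delta is 0 everywhere when the original run succeeds
    # C(tau): pairs whose correction turns the failure into a success
    decisive = [(i, t) for (i, t) in candidates if Z(intervene(tau, i, t)) == 0]
    if not decisive:
        return None
    return min(decisive, key=lambda pair: pair[1])


# --- toy instantiation (hypothetical) ---
def Z(tau: Trajectory) -> int:
    # Assumed rule: the run fails once two or more steps are wrong.
    return 1 if sum(1 for _, ok in tau if not ok) >= 2 else 0


def intervene(tau: Trajectory, i: str, t: int) -> Trajectory:
    # I_(i,t): replace agent i's action at step t with a correct one.
    fixed = list(tau)
    agent, _ = fixed[t]
    assert agent == i, "step t must belong to agent i"
    fixed[t] = (agent, True)
    return fixed


tau = [("planner", True), ("coder", False), ("planner", True), ("tester", False)]
candidates = [(agent, t) for t, (agent, _) in enumerate(tau)]
result = earliest_decisive_error(tau, candidates, Z, intervene)
print(result)  # ('coder', 1)
```

In this toy run, correcting either step 1 or step 3 rescues the task, so both belong to $C(\tau)$, and the $\arg\min$ picks step 1.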

Benchmark: Who&When

  • Proposes a benchmark for identifying who (Who) made the error and when (When)
    in cases where an LLM multi-agent system fails
    • Includes logs collected from 127 LLM multi-agent systems
      • Two kinds of agentic systems are used
        • Algorithm-Generated Agentic Systems
          • CaptainAgent algorithm: builds an agent team tailored to the given task (GAIA, AssistantBench) and assigns suitable agent names, prompts, and the necessary tools
          • For each query, only the final multi-agent configuration, which represents the optimized solution, and its execution record are selected
          • Only failed cases are included in the benchmark
        • Hand-Crafted Agentic Systems
          • Magentic-One: composed of 5 carefully crafted agents, each specialized in a distinct capability such as operating a web browser or navigating local files
          • Magentic-One is evaluated on GAIA and AssistantBench, and its failure logs are included in the benchmark
    • Consists of 184 failure annotation tasks
      • 3 agent experts carried out multi-round annotation
        • round 1: all failure logs are distributed among the experts, who annotate when and by whom the error was made, along with the reasoning after the error, and label each annotation as certain or uncertain
        • round 2: consensus is reached on the uncertain ones (until unanimous)
        • round 3: the annotations left by each expert are cross-validated
        • (a) is the time each expert spent: 30.9, 30.2, and 23.2 hours
        • (b) is the ratio of certain to uncertain annotations; picking out the most significant error is hard, so a fair share of annotations are uncertain
        • (c) is the disagreement rate between individuals when voting on each other's uncertain data; there is some variation between individuals
    • Each data instance contains a query, a failure log, agentic system information, and annotations
      • Query: a real-world question taken from a benchmark
      • Failure log: the full agent conversation log from when a given system failed to solve the query
      • Agentic system information: for Algorithm-Generated Agentic Systems, includes the system prompts, tools, agent names, etc. tailored to the query
      • Annotations: the agent responsible for the task failure, the step at which the failure occurred, and a plain-language explanation of the failure reason
      • Data example
  • Metric
    • Agent-level accuracy: correctly identifying who went wrong
    • Step-level accuracy: correctly identifying when it went wrong
    • Step-level accuracy with tolerance: identifying when it went wrong, with some slack allowed
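As a rough sketch of how the three metrics could be computed: the `Attribution` record, the agent names, and the numbers below are made up for illustration, and treating "tolerance" as an absolute step difference of at most `tolerance` is an assumption of this sketch rather than the paper's exact definition.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Attribution:
    """A (who, when) pair: the responsible agent and the decisive error step."""
    agent: str
    step: int


def agent_level_accuracy(preds: Sequence[Attribution], golds: Sequence[Attribution]) -> float:
    # Fraction of failure logs where the predicted agent matches the annotation.
    return sum(p.agent == g.agent for p, g in zip(preds, golds)) / len(golds)


def step_level_accuracy(
    preds: Sequence[Attribution], golds: Sequence[Attribution], tolerance: int = 0
) -> float:
    # Fraction of logs where the predicted step is within `tolerance` steps of
    # the annotated step; tolerance=0 is the strict step-level accuracy.
    hits = sum(abs(p.step - g.step) <= tolerance for p, g in zip(preds, golds))
    return hits / len(golds)


# Hypothetical annotations and predictions for four failure logs.
golds = [
    Attribution("WebSurfer", 5),
    Attribution("Orchestrator", 12),
    Attribution("Coder", 3),
    Attribution("WebSurfer", 20),
]
preds = [
    Attribution("WebSurfer", 5),
    Attribution("Orchestrator", 10),
    Attribution("WebSurfer", 3),
    Attribution("Coder", 27),
]

print(agent_level_accuracy(preds, golds))            # 0.5
print(step_level_accuracy(preds, golds))             # 0.5
print(step_level_accuracy(preds, golds, tolerance=2))  # 0.75
```

Loosening the tolerance from 0 to 2 counts the second prediction (step 10 vs. 12) as a hit, which is why the score rises from 0.5 to 0.75.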

Can LLMs help identify When and Which agent causes task failures?

  • Try automated failure analysis using an LLM (GPT-4o)
    • 3 approaches
      • All-at-once: give the LLM the query and the entire failure log, and have it identify who went wrong and when
      • Step-by-step: give the LLM the query and present the failure log one step at a time. The LLM judges whether or not the current step is wrong
      • Binary search: starting from the full failure log, the LLM decides whether the failure lies in the first half or the second half; the chosen half is split in two again, and this repeats until a single step remains
    • 2 settings
      • With Ground Truth (Oracle): the setting where the final correct answer to the query is known
      • Without Ground Truth: the setting where the failure is located without knowing the answer to the query
  • Results
    • Given the whole log, the LLM is good at telling who went wrong; given the log in fine-grained pieces, it is good at telling when
      • The Oracle setting is quite useful

    • Performance drops as the log gets longer
      • Error-step prediction degrades faster
    • Comparison of the distance from the ground truth in error-step prediction
    • The larger the allowed tolerance, the more All-at-once has the advantage
    • Looking at the aggregate rather than a single data point makes it easier to see who is frequently wrong
      • Agent 3 is the troublemaker, and statistically agent 3 is the one most often named as the problem
    • A hybrid (use All-at-once to decide who made the error, then step-by-step to determine where it occurred) does better
      • but at a higher computational cost
    • A stronger reasoning model is not always better; not very practical
      • Instead, asking for reasoning in the prompt helps
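The control flow of the step-by-step and binary-search approaches described above can be sketched as follows. The two judge callables stand in for LLM calls and are hypothetical: their names, signatures, and the oracle stubs used here are assumptions for illustration, not the prompts or code used in the paper.

```python
from typing import Callable, Optional, Sequence

Log = Sequence[str]  # one entry per agent step


def step_by_step(log: Log, step_is_error: Callable[[Log], bool]) -> Optional[int]:
    """Reveal the log one step at a time; stop at the first step judged wrong."""
    for t in range(len(log)):
        # The judge sees the history up to and including step t.
        if step_is_error(log[: t + 1]):
            return t
    return None


def binary_search(log: Log, error_in_first_half: Callable[[Log, int, int, int], bool]) -> int:
    """Halve the candidate range [lo, hi) until a single step remains."""
    if not log:
        raise ValueError("log must contain at least one step")
    lo, hi = 0, len(log)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # The judge decides whether the error lies in [lo, mid) or [mid, hi).
        if error_in_first_half(log, lo, mid, hi):
            hi = mid
        else:
            lo = mid
    return lo


# --- stub judges that "know" the error is at step 5 (stand-ins for an LLM) ---
TRUE_ERROR_STEP = 5
calls = {"step_by_step": 0, "binary_search": 0}


def oracle_step_judge(prefix: Log) -> bool:
    calls["step_by_step"] += 1
    return len(prefix) - 1 == TRUE_ERROR_STEP


def oracle_half_judge(log: Log, lo: int, mid: int, hi: int) -> bool:
    calls["binary_search"] += 1
    return lo <= TRUE_ERROR_STEP < mid


log = [f"step {t}" for t in range(9)]
found_sbs = step_by_step(log, oracle_step_judge)
found_bs = binary_search(log, oracle_half_judge)
print(found_sbs, found_bs, calls)
```

With perfect judges both strategies land on the same step; the difference is in the number of judge calls, which grows linearly with the error position for step-by-step and logarithmically with log length for binary search. A real LLM judge can err at any call, so this sketch says nothing about the accuracy figures reported in the notes.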
