SEAL: Steerable Reasoning Calibration of Large Language Models for Free

Review

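Nickname	Strengths & Weaknesses & Suggestions	Rating (0/5)
댓츠노노	• Strength: analyzes the reasoning process in detail and proposes a method for clear, concise reasoning • Weakness: limited technical impact • Suggestion: add an analysis of why the preferred intervention layer differs across models	3.3
아이리스	Strength: the idea is intuitive, and the motivation is good. Weakness: the flow from the motivation into the method feels somewhat abrupt. Is the computation speed or efficiency acceptable? Suggestion: would forcing certain tokens to be generated be a worse option? Something like cutting in once in a while mid-generation, with no extra computation.	3.5
핸드크림	• Strength: mitigates LRM overthinking by reducing everything except execution; the solution is both efficient and effective • Weakness: is reducing reflection/transition always beneficial? If so, shouldn't the LRM itself be modified? Does the causal claim hold that answers were wrong because there was too much reflection/transition? • Suggestion: experiments on harder benchmarks that are likely to need more reflection/transition	3.2
3월	• Strength: the motivation of splitting model reasoning into minimal units and the visualization of the experiments are well done • Weakness: the same steering vector S is always applied at inference, but couldn't the decision to keep or remove reflection differ per problem? For example, checking for calculation errors needs reflection, whereas in counting problems isn't transition far more dominant than reflection? • Suggestion: adaptive steering through automatic problem-type classification	3.4
에너지	• Strength: classifies the stages(?) of an LRM into execution, reflection, and transition, and uses vector properties to improve reasoning. The flow from per-pattern analysis to vector computation is very intuitive and logical! • Weakness: is vector arithmetic enough to adjust the space? • Suggestion: adjusting the vector direction does give good experimental results, but a method for controlling the space more precisely could probably be proposed.	4.0
화이트노이즈	• Strength: pinpoints the LRM problems of redundant verification loops and reasoning detours, so the motivation is convincing • Weakness: side effects of reducing reflection • Suggestion: follow-up work that dynamically decides how much reflection to perform depending on the task	3.2
피즈치자	• Strength: defining the reasoning process (thought types in this paper) in order to interpret "which reasoning is the problem" is a good perspective. How these types are defined could itself become a research standard • Limitation: it does feel strongly like a combination of several existing methods • Suggestion: an additional conditional analysis by problem type or difficulty would help	4.0
제로콜라	• Strength: simply adding a steering vector to the hidden state, with no additional training, reduces unnecessary reflection and transition, which seems simple yet effective. • Weakness: execution / reflection / transition are classified by keywords when computing the steering vector, but in practice some thoughts probably belong to a category without containing its keywords. • Suggestion: add a scheme that automatically varies the steering strength by problem type or difficulty	3.5
창백카츄	Strength: the method is training-free, which sets it apart from other papers with the same motivation. Showing that reasoning stages can be explicitly classified is also excellent. Weakness: it is unclear on what basis the length is adjusted appropriately. Suggestion: measuring problem difficulty with confidence or another method and steering based on that would be good!	3.5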

TL; DR

💡

Let's curb the tendency toward overly long and complicated reasoning!

⇒ Classify the reasoning process into three categories and analyze which of them should be reduced

Summary

Authors

github: https://github.com/VITA-Group/SEAL

Citations: 40

Background & Motivation

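LLMs show strong reasoning ability
- Progress has been steady since Chain-of-Thought (CoT) prompting
- Large reasoning models (LRMs) such as o1 and R1, which mimic the stages of human cognition, have been developed

However, LRMs have limitations
- Cost issues such as memory
- Even after securing the core reasoning needed for the answer fairly early, the model keeps generating unnecessary thoughts
  ⇒ It can fall into a redundant verification loop or a reasoning detour
  What is a redundant verification loop?
  The model continues its reasoning process even though the initial solution already gave the correct answer (in about 92% of cases!); later solutions tend to re-check the earlier one or repeat a similar approach rather than offer a new reasoning strategy
  Reference: Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs
  What is a reasoning detour?
  Even when an early thought is heading in the right direction, the model does not follow it through and keeps switching to other strategies
  Reference: Thoughts Are All Over the Place: On the Underthinking of o1-Like LLMs
- Lengthy reasoning is not always necessary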

** Main motivation

Can we identify and calibrate the flawed reasoning pathways in current LLMs?

Contributions (What they’ve revealed)

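Analyzes o1/R1-like LLMs by dividing their reasoning into three categories (execution / reflection / transition) and studies those categories in latent space
- Recognizing Reasoning Patterns in LLMs
  The model output O tends to be separated by “\n\n” ⇒ each chunk is denoted $T_n$
  thought sequence $O = (T_1, T_2, \ldots, T_N)$
  Each chunk is classified into one of three types
  execution : the model works through the problem step by step
  reflection : the model pauses its progress to verify (e.g. "let me review" / "let me check")
  transition : the model switches its line of reasoning and reinterprets the problem from another perspective
  Classification example
  Analysis results for DeepSeek-R1-Distill-Qwen-1.5B on the Math-500 task
  The harder the problem, the more tokens are generated
  ⇒ Unsurprising when compared with how humans think
  At the same difficulty, incorrect answers use more tokens
  That is, excessive reasoning steps negatively affect performance
  In particular, reflection and transition thoughts increase, which strongly tends to lengthen the whole output
  ⇒ Efficiency & Effectiveness Issue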
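The segmentation and three-way labelling above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the keyword lists and the function names `split_thoughts`, `classify_thought`, and `count_patterns` are hypothetical, and keyword matching is assumed as the labelling rule.

```python
from collections import Counter

# Hypothetical keyword lists, for illustration only (not the paper's exact lists).
REFLECTION_KEYWORDS = ("wait", "verify", "double-check", "let me check")
TRANSITION_KEYWORDS = ("alternatively", "another approach", "think differently")


def split_thoughts(output):
    # Split a model output O into thoughts T_1..T_N on the blank-line delimiter.
    return [chunk.strip() for chunk in output.split("\n\n") if chunk.strip()]


def classify_thought(thought):
    # Label one thought; anything without a matching keyword counts as execution.
    text = thought.lower()
    if any(keyword in text for keyword in TRANSITION_KEYWORDS):
        return "transition"
    if any(keyword in text for keyword in REFLECTION_KEYWORDS):
        return "reflection"
    return "execution"


def count_patterns(output):
    # Count how many thoughts of each category appear in one output.
    return Counter(classify_thought(t) for t in split_thoughts(output))


sample = (
    "First, compute 2 + 3 = 5.\n\n"
    "Wait, let me verify that sum.\n\n"
    "Alternatively, count on a number line.\n\n"
    "So the answer is 5."
)
print(count_patterns(sample))
```

Counting categories per output in this way is what makes it possible to compare, for example, how many reflection and transition thoughts appear in correct versus incorrect answers.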
- Analysis of the mechanisms behind each reasoning pattern
  Characterizing the patterns in latent space
  Why latent space? The tokens inside a thought vary too much for characteristics to be found from embeddings and the like
  ⇒ So observe the layer-wise representations!
  How?
  1. Run reasoning with DeepSeek-R1-Distill-Qwen-1.5B on the Math-500 task
  2. From the outputs of step 1, collect the representation at each “\n\n” from every layer i
  3. Project the representations from step 2 onto 2D with t-distributed Stochastic Neighbor Embedding (t-SNE)
  Findings
  execution is clearly separated from reflection & transition (e.g. layer 20)
  The deeper the layer, the more clearly the reasoning patterns separate
  Shallow layers capture low-level features
  Consistent with prior work, deep layers encode abstract concepts & semantic knowledge
  References: https://aclanthology.org/2024.findings-acl.866/ https://aclanthology.org/2025.coling-main.37/
  reflection & transition resemble each other
  Unlike execution, both reconsider or revise the reasoning from earlier steps

Building on this analysis, the paper proposes SEAL (Steerable rEAsoning caLibration), a training-free strategy for improving the reasoning process
💡
Find a steering vector that can adjust the proportion of reflection & transition thoughts,
and prevent unnecessary token generation!
1. Extraction of the reasoning steering vector (the core idea!)
  Collecting Reasoning Processes
  Obtain reasoning processes by running each target model on 1,000 training examples from the Math dataset
  Classify the thoughts from the previous step into execution / reflection / transition based on keywords
  A thought with no reflection or transition keyword is labeled execution
  Calculating the Steering Vector
  For each thought j, take the representation of its “\n\n” at the i-th transformer block : $H_i^j$
  Compute the average representation for each reasoning category
  Compute the reasoning steering vector S
  That is, S points toward the execution mean and away from the reflection & transition mean
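A minimal sketch of the vector computation follows. It assumes S is the execution mean minus the mean over reflection and transition thoughts, which matches the description above but is not copied from the paper. Pooling reflection and transition into a single mean is also an assumption, and `mean_vector` and `steering_vector` are hypothetical helper names. Plain lists stand in for the hidden states $H_i^j$.

```python
def mean_vector(vectors):
    # Element-wise average of a non-empty list of equal-length vectors.
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def steering_vector(reps_by_label):
    # Assumed form: S = mean(execution) - mean(reflection and transition pooled).
    exec_mean = mean_vector(reps_by_label["execution"])
    rt_mean = mean_vector(reps_by_label["reflection"] + reps_by_label["transition"])
    return [e - r for e, r in zip(exec_mean, rt_mean)]


# Toy 2-dimensional representations, made up for illustration.
reps = {
    "execution": [[1.0, 0.0], [3.0, 2.0]],
    "reflection": [[0.0, 4.0]],
    "transition": [[2.0, 0.0]],
}
print(steering_vector(reps))
```

In practice each vector would be the “\n\n” representation taken at the chosen transformer block, one per thought in the collected reasoning processes.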
2. Decoding with Latent Space Intervention
  At the end of every thought, i.e. on the “\n\n” token representation, apply the following operation
  $\tilde{H} = H + \alpha S$
  $\alpha$ (= 1): hyperparameter that controls the steering strength
  The intervention layer differs per model and is chosen through ablation
  20 for DeepSeek-R1-Distill-Qwen-1.5B & DeepSeek-R1-Distill-Qwen-7B
  55 for QwQ-32B-Preview
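The intervention $\tilde{H} = H + \alpha S$ can be illustrated with plain Python lists. This is a sketch under stated assumptions rather than the released implementation: `steer_hidden_states` is a hypothetical function, the boundary token id 271 is made up, and a real model would apply the update to the output of the chosen layer during decoding (for example through a framework-specific hook).

```python
def steer_hidden_states(hidden_states, token_ids, boundary_id, steering, alpha=1.0):
    # Add alpha * S to the hidden state of every thought-boundary token;
    # all other positions are copied unchanged.
    steered = []
    for hidden, token in zip(hidden_states, token_ids):
        if token == boundary_id:
            steered.append([h + alpha * s for h, s in zip(hidden, steering)])
        else:
            steered.append(list(hidden))
    return steered


BOUNDARY_ID = 271  # hypothetical id of the "\n\n" token
hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
tokens = [5, BOUNDARY_ID, 7]
S = [1.0, -1.0]

print(steer_hidden_states(hidden, tokens, BOUNDARY_ID, S, alpha=1.0))
```

Only the boundary position is shifted, which is why the extra cost per generated thought is a single vector addition.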

Demonstrates the advantage of SEAL through experiments on a range of LLMs and benchmarks
- Setting
  LLM: DeepSeek-R1-Distill-Qwen-1.5B, DeepSeek-R1-Distill-Qwen-7B, QwQ-32B-Preview
  benchmark: Math500, GSM8k, LiveCodeBench
  Math500 Hard: 500 problems of difficulty 4 or 5 from Math500
  metrics: Acc, #Tokens
  baseline: Logit Penalty (a training-free method)
  https://arxiv.org/abs/2501.18585
  TL;DR an inference-time control method that artificially lowers the logits of thought-triggering tokens so that those tokens are less likely to be generated
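For contrast with latent-space steering, the token-space baseline can be sketched as below. The exact form of the penalty is an assumption (a constant subtracted from selected logits), and `apply_logit_penalty`, the toy vocabulary, and the penalty value are hypothetical.

```python
def apply_logit_penalty(logits, penalized_ids, penalty):
    # Subtract a fixed penalty from the logits of the thought-triggering tokens.
    return [
        logit - penalty if index in penalized_ids else logit
        for index, logit in enumerate(logits)
    ]


def argmax(values):
    # Index of the largest value (greedy decoding picks this token).
    return max(range(len(values)), key=lambda i: values[i])


# Toy vocabulary of three tokens; token 0 plays the role of a trigger such as "Wait".
logits = [2.0, 1.0, 0.5]
penalized = apply_logit_penalty(logits, {0}, penalty=3.0)
print(argmax(logits), argmax(penalized))
```

The penalty acts only on the listed surface tokens, so a model can still begin a reflection or transition with a token that is not on the list.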
- Main Results
  Improves both Acc and #Tokens over the baseline
  Math500 results
  Results of applying the vector extracted on Math500 → GSM8k, LiveCodeBench
  token-space adjustment (Logit Penalty) < latent-space calibration (SEAL)
  Logit Penalty actually induces an increase in reflection / transition
- Quantitative Evaluation of Efficiency
  Inference efficiency improves as well as quality
  The extra cost of adding the steering vector to the hidden state is almost negligible
  Shorter responses in fact reduce the overall response time
- Ablation Study
  Ablation Study about the Steering Type
  Suppressing reflection & transition matters most
  Ablation Study about the Steering Layer
  Basis for choosing each intervention layer
  Mid-to-late layers have a larger effect than early layers
  Ablation Study about the Steering Strength
  Basis for choosing alpha
Yonsei Univ. ICL
