🎯 Know What You Don't Know

Uncertainty Calibration of Process Reward Models

Young-Jin Park, Kristjan Greenewald, Kaveh Alim, Hao Wang, Navid Azizan

MIT & MIT-IBM Watson AI Lab & Red Hat AI Innovation | NeurIPS 2025

📝 arXiv 💻 Code 📊 Dataset

The Problem

AI spends equal effort on easy and hard problems

💡 Our Solution

Teach AI to know when to think harder

⚡ The Result

Up to 75% reduction in computation, with comparable performance

NeurIPS 2025 Presentation

Why This Matters

Imagine a student who spends 30 minutes on every test question, whether it's "2+2" or a complex calculus proof. Current AI systems often work this way.

Current approach: Fixed computation regardless of difficulty

We taught them to "think smarter, not harder".

Easy vs Difficult Problem Comparison

Our approach: Adaptive computation based on problem difficulty

The Calibration Challenge

Off-the-shelf PRMs lack calibration guarantees because their training is inherently tied to a specific policy model. They learn from reasoning paths generated by one particular model (e.g., Qwen-Math-72B-Instruct), binding their accuracy to that model's unique capabilities.

Strong Model (often used for PRM training) vs. Weak Model (often used for test-time scaling)

🔍 The Root Problem

The issue stems from the autoregressive nature of language models, where sequence generation is policy-dependent. The probability of future tokens π_θ(x_{t+1:T} | x_{0:t}) is conditioned on the model's parameters θ. Consequently, a PRM is only calibrated to the statistical patterns of the policy it was trained with.
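Written out, the quantity a PRM implicitly estimates for a partial solution x_{0:t} is the success probability under the policy it was trained with. The following display is our paraphrase of the setup above in standard notation, not a formula copied from the paper:

\[
p_\theta(x_{0:t}) \;=\; \mathbb{E}_{x_{t+1:T} \sim \pi_\theta(\cdot \mid x_{0:t})}\left[\mathbf{1}\{\text{the final answer in } x_{0:T} \text{ is correct}\}\right]
\]

Replacing π_θ with a weaker completion policy changes this expectation, so a score calibrated for one policy need not be calibrated for another.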

⚠️ Distributional Mismatch

When a PRM trained on a highly capable 72B model is paired with a weaker 1B model, it overestimates success probability. The PRM expects the stronger model's performance and cannot account for the weaker model's higher tendency to make errors on complex logical steps. This results in inflated and unreliable confidence scores.

QwenPRM-7B calibration error histogram on MATH500

QwenPRM-7B on MATH500: Histogram shows estimation errors between PRM predictions and ground-truth success rates. Positive values (right side) indicate overestimation. Distribution skews right, particularly for weaker models like Llama-1B.

QwenPRM-72B calibration error histogram on AIME

QwenPRM-7B on AIME24-25: The PRM shows further miscalibration on challenging problems. The peak near 1.0 indicates systematic overconfidence, especially for weaker completion models.

Key Insight: PRMs consistently overestimate success probabilities. Miscalibration is particularly severe for (1) weaker models and (2) challenging, out-of-distribution problems.
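As a concrete illustration of how such estimation errors can be computed, here is a minimal sketch (our own illustration, not the paper's evaluation code): the ground-truth success rate of a partial solution is approximated by Monte Carlo rollouts from the completion policy, and the error is the PRM score minus that empirical rate. The names prm_score, sample_completion, and is_correct are hypothetical placeholders.

def estimation_error(prefix, prm_score, sample_completion, is_correct, n_rollouts=32):
    """PRM prediction minus the empirical success rate under the completion policy."""
    # Ground-truth success rate: fraction of policy rollouts reaching a correct answer.
    successes = sum(is_correct(sample_completion(prefix)) for _ in range(n_rollouts))
    empirical_rate = successes / n_rollouts
    # Positive values mean overestimation (the rightward skew in the histograms above).
    return prm_score(prefix) - empirical_rate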

💡 Why Calibrate PRMs?

While calibration isn't required for simple ranking tasks (where any monotonic transformation preserves order), calibrated probabilities are essential for:

  • Interpretability: Understanding true confidence levels
  • Safety Monitoring: Systematic detection of uncertain predictions
  • Efficient Budget Allocation: Enabling instance-adaptive scaling

Instance-Adaptive Scaling with Calibrated PRMs

📊 Calibrated Confidence

We use quantile regression to correct PRM overconfidence, producing calibrated success-probability estimates together with confidence intervals.

PRM Calibration Process

From Overconfident to Calibrated: Our method transforms existing PRMs (which may output an overconfident reward such as 0.95) into calibrated models that quantify uncertainty through quantile predictions (e.g., the 10%, 50%, and 90% quantiles).
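A minimal sketch of the quantile-regression idea, under our own simplifying assumptions (a small, hypothetical calibration head trained on top of raw PRM scores with the standard pinball loss; the released implementation may differ):

import torch

QUANTILES = torch.tensor([0.1, 0.5, 0.9])  # lower / median / upper estimates

def pinball_loss(pred, target, quantiles=QUANTILES):
    """Standard quantile (pinball) loss, averaged over examples and quantiles."""
    # pred: (batch, n_quantiles) predicted success-rate quantiles
    # target: (batch,) empirical success rates in [0, 1]
    diff = target.unsqueeze(1) - pred
    return torch.mean(torch.maximum(quantiles * diff, (quantiles - 1.0) * diff))

# Hypothetical calibration head mapping a raw PRM score to three quantiles.
head = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.ReLU(), torch.nn.Linear(32, 3))
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

def calibration_step(raw_scores, empirical_rates):
    """One gradient step on (raw PRM score, rollout success rate) pairs."""
    pred = torch.sigmoid(head(raw_scores.unsqueeze(1)))  # keep quantile outputs in [0, 1]
    loss = pinball_loss(pred, empirical_rates)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

The head architecture and training setup here are illustrative; the key ingredient is the pinball loss, which makes each output estimate a specific quantile of the success rate rather than a single overconfident point score.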

🧠 Smart Allocation

IAS dynamically adjusts computational effort with mathematical guarantees, allocating resources based on the estimated likelihood that a partial reasoning trajectory will yield a correct final answer.

Easy Problem โ†’ Quick Decision

Minimal computation for straightforward problems

Hard Problem โ†’ Extended Reasoning

More computation allocated to complex problems
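One simple way to turn a calibrated success probability into an adaptive budget (a sketch of the general idea; the exact rule and its guarantees are in the paper): if a conservative, calibrated lower-quantile estimate says a single sampled solution is correct with probability p, then n independent samples contain at least one correct solution with probability 1 - (1 - p)^n, and the smallest n meeting a target level can be solved for in closed form.

import math

def adaptive_budget(p_lower, target=0.9, n_max=64):
    """Smallest n with 1 - (1 - p_lower)**n >= target, capped at n_max.

    p_lower: calibrated lower-quantile estimate of single-sample success probability.
    target:  desired probability that at least one sample is correct.
    n_max:   hard cap on compute. All values here are illustrative, not the paper's settings.
    """
    if p_lower <= 0.0:
        return n_max   # no usable signal: spend the full budget
    if p_lower >= target:
        return 1       # confident enough that a single sample suffices
    n = math.ceil(math.log(1.0 - target) / math.log(1.0 - p_lower))
    return max(1, min(n, n_max))

# Example: an easy problem (p_lower = 0.8) gets 2 samples; a hard one (p_lower = 0.1) gets 22.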

Key Results

Calibration Method Comparison

Our quantile regression (QR) approach outperforms traditional calibration baselines (temperature scaling, isotonic regression, histogram binning) across both in-distribution (MATH500) and out-of-distribution (AIME24-25) datasets.

Llama-3.2-1B calibration comparison
Llama-3.1-8B calibration comparison
Qwen2.5-Math-7B calibration comparison
DeepSeek-R1-Distill-Qwen-7B calibration comparison

Brier score comparison across calibration methods on MATH500 (in-distribution) and AIME24-25 (out-of-distribution). Lower is better. Our QR method (purple) consistently achieves the lowest calibration error, with particularly strong improvements for weaker models and on challenging out-of-distribution problems.
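For reference, the Brier score reported above is simply the mean squared error between predicted success probabilities and observed outcomes (a small helper of our own, not the evaluation script):

import numpy as np

def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted probabilities and outcomes in [0, 1]."""
    predicted_probs, outcomes = np.asarray(predicted_probs), np.asarray(outcomes)
    return float(np.mean((predicted_probs - outcomes) ** 2))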

Calibration Enables Instance-Adaptive Scaling

Calibration Results

The proposed IAS strategy reduces inference cost while maintaining final-answer accuracy, spending less compute on problems where the calibrated PRM is more confident, as desired.

Citation

@article{park2025prmcalibration,
  title={Know What You Don't Know: Uncertainty Calibration of Process Reward Models},
  author={Park, Young-Jin and Greenewald, Kristjan and Alim, Kaveh and Wang, Hao and Azizan, Navid},
  journal={arXiv preprint arXiv:2506.09338},
  year={2025}
}