🎯 Know What You Don't Know

Uncertainty Calibration of Process Reward Models

Young-Jin Park, Kristjan Greenewald, Kaveh Alim, Hao Wang, Navid Azizan

MIT & MIT-IBM Watson AI Lab & Red Hat AI Innovation | NeurIPS 2025

📝 arXiv 💻 Code 📊 Dataset

The Problem

AI spends equal effort on easy and hard problems

💡 Our Solution

Teach AI to know when to think harder

⚡ The Result

Up to 75% reduction in computation, with comparable performance

NeurIPS 2025 Presentation

Why This Matters

Imagine a student who spends 30 minutes on every test question, whether it's "2+2" or a complex calculus proof. Current AI systems often work this way.

Current approach: Fixed computation regardless of difficulty

We taught them to "think smarter, not harder".

Easy vs Difficult Problem Comparison

Our approach: Adaptive computation based on problem difficulty

The Calibration Challenge

Off-the-shelf PRMs lack calibration guarantees because their training is inherently tied to a specific policy model. They learn from reasoning paths generated by one particular model (e.g., Qwen-Math-72B-Instruct), binding their accuracy to that model's unique capabilities.

Strong Model (often used for PRM training) vs. Weak Model (often used for test-time scaling)

🔍 The Root Problem

The issue stems from the autoregressive nature of language models, where sequence generation is policy-dependent. The probability of future tokens π_θ(x_{t+1:T} | x_{0:t}) is conditioned on the model's parameters θ. Consequently, a PRM is only calibrated to the statistical patterns of the policy it was trained with.
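Written out, the quantity a PRM implicitly estimates for a partial solution x_{0:t} is the success probability under the policy it was trained with. The following display is our paraphrase of the setup above in standard notation, not a formula copied from the paper:

\[
p_\theta(x_{0:t}) \;=\; \mathbb{E}_{x_{t+1:T} \sim \pi_\theta(\cdot \mid x_{0:t})}\left[\mathbf{1}\{\text{the final answer in } x_{0:T} \text{ is correct}\}\right]
\]

Replacing π_θ with a weaker completion policy changes this expectation, so a score calibrated for one policy need not be calibrated for another.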

⚠️ Distributional Mismatch

When a PRM trained on a highly capable 72B model is paired with a weaker 1B model, it overestimates success probability. The PRM expects the stronger model's performance and cannot account for the weaker model's higher tendency to make errors on complex logical steps. This results in inflated and unreliable confidence scores.

QwenPRM-7B calibration error histogram on MATH500

QwenPRM-7B on MATH500: Histogram shows estimation errors between PRM predictions and ground-truth success rates. Positive values (right side) indicate overestimation. Distribution skews right, particularly for weaker models like Llama-1B.

QwenPRM-72B calibration error histogram on AIME

QwenPRM-7B on AIME24-25: The PRM shows further miscalibration on challenging problems. The peak near 1.0 indicates systematic overconfidence, especially for weaker completion models.

Key Insight: PRMs consistently overestimate success probabilities. Miscalibration is particularly severe for (1) weaker models and (2) challenging, out-of-distribution problems.
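As a concrete illustration of how such estimation errors can be computed, here is a minimal sketch (our own illustration, not the paper's evaluation code): the ground-truth success rate of a partial solution is approximated by Monte Carlo rollouts from the completion policy, and the error is the PRM score minus that empirical rate. The names prm_score, sample_completion, and is_correct are hypothetical placeholders.

def estimation_error(prefix, prm_score, sample_completion, is_correct, n_rollouts=32):
    """PRM prediction minus the empirical success rate under the completion policy."""
    # Ground-truth success rate: fraction of policy rollouts reaching a correct answer.
    successes = sum(is_correct(sample_completion(prefix)) for _ in range(n_rollouts))
    empirical_rate = successes / n_rollouts
    # Positive values mean overestimation (the rightward skew in the histograms above).
    return prm_score(prefix) - empirical_rate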

💡 Why Calibrate PRMs?

While calibration isn't required for simple ranking tasks (where any monotonic transformation preserves order), calibrated probabilities are essential for:

  • Interpretability: Understanding true confidence levels
  • Safety Monitoring: Systematic detection of uncertain predictions
  • Efficient Budget Allocation: Enabling instance-adaptive scaling

Instance-Adaptive Scaling with Calibrated PRMs

📊 Calibrated Confidence

We use quantile regression to correct PRM overconfidence, producing calibrated success-probability estimates together with confidence intervals.

PRM Calibration Process

From Overconfident to Calibrated: Our method transforms existing PRMs (which may output an overconfident reward such as 0.95) into calibrated models that quantify uncertainty through quantile predictions (e.g., the 10%, 50%, and 90% quantiles).
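A minimal sketch of the quantile-regression idea, under our own simplifying assumptions (a small, hypothetical calibration head trained on top of raw PRM scores with the standard pinball loss; the released implementation may differ):

import torch

QUANTILES = torch.tensor([0.1, 0.5, 0.9])  # lower / median / upper estimates

def pinball_loss(pred, target, quantiles=QUANTILES):
    """Standard quantile (pinball) loss, averaged over examples and quantiles."""
    # pred: (batch, n_quantiles) predicted success-rate quantiles
    # target: (batch,) empirical success rates in [0, 1]
    diff = target.unsqueeze(1) - pred
    return torch.mean(torch.maximum(quantiles * diff, (quantiles - 1.0) * diff))

# Hypothetical calibration head mapping a raw PRM score to three quantiles.
head = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.ReLU(), torch.nn.Linear(32, 3))
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

def calibration_step(raw_scores, empirical_rates):
    """One gradient step on (raw PRM score, rollout success rate) pairs."""
    pred = torch.sigmoid(head(raw_scores.unsqueeze(1)))  # keep quantile outputs in [0, 1]
    loss = pinball_loss(pred, empirical_rates)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

The head architecture and training setup here are illustrative; the key ingredient is the pinball loss, which makes each output estimate a specific quantile of the success rate rather than a single overconfident point score.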

🧠 Smart Allocation

IAS dynamically adjusts computational effort with mathematical guarantees, allocating resources based on the estimated likelihood that a partial reasoning trajectory will yield a correct final answer.

Easy Problem โ†’ Quick Decision

Minimal computation for straightforward problems

Hard Problem โ†’ Extended Reasoning

More computation allocated to complex problems
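One simple way to turn a calibrated success probability into an adaptive budget (a sketch of the general idea; the exact rule and its guarantees are in the paper): if a conservative, calibrated lower-quantile estimate says a single sampled solution is correct with probability p, then n independent samples contain at least one correct solution with probability 1 - (1 - p)^n, and the smallest n meeting a target level can be solved for in closed form.

import math

def adaptive_budget(p_lower, target=0.9, n_max=64):
    """Smallest n with 1 - (1 - p_lower)**n >= target, capped at n_max.

    p_lower: calibrated lower-quantile estimate of single-sample success probability.
    target:  desired probability that at least one sample is correct.
    n_max:   hard cap on compute. All values here are illustrative, not the paper's settings.
    """
    if p_lower <= 0.0:
        return n_max   # no usable signal: spend the full budget
    if p_lower >= target:
        return 1       # confident enough that a single sample suffices
    n = math.ceil(math.log(1.0 - target) / math.log(1.0 - p_lower))
    return max(1, min(n, n_max))

# Example: an easy problem (p_lower = 0.8) gets 2 samples; a hard one (p_lower = 0.1) gets 22.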

Key Results

Calibration Method Comparison

Our quantile regression (QR) approach outperforms traditional calibration baselines (temperature scaling, isotonic regression, histogram binning) across both in-distribution (MATH500) and out-of-distribution (AIME24-25) datasets.

Llama-3.2-1B calibration comparison
Llama-3.1-8B calibration comparison
Qwen2.5-Math-7B calibration comparison
DeepSeek-R1-Distill-Qwen-7B calibration comparison

Brier score comparison across calibration methods on MATH500 (in-distribution) and AIME24-25 (out-of-distribution). Lower is better. Our QR method (purple) consistently achieves the lowest calibration error, with particularly strong improvements for weaker models and on challenging out-of-distribution problems.
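For reference, the Brier score reported above is simply the mean squared error between predicted success probabilities and observed outcomes (a small helper of our own, not the evaluation script):

import numpy as np

def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted probabilities and outcomes in [0, 1]."""
    predicted_probs, outcomes = np.asarray(predicted_probs), np.asarray(outcomes)
    return float(np.mean((predicted_probs - outcomes) ** 2))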

Calibration Enables Instance-Adaptive Scaling

Calibration Results

The proposed IAS strategy reduces inference cost while maintaining final-answer accuracy, spending less compute on problems where the calibrated PRM is more confident, as desired.

Citation

@article{park2025prmcalibration,
  title={Know What You Don't Know: Uncertainty Calibration of Process Reward Models},
  author={Park, Young-Jin and Greenewald, Kristjan and Alim, Kaveh and Wang, Hao and Azizan, Navid},
  journal={arXiv preprint arXiv:2506.09338},
  year={2025}
}