The Problem
AI spends equal effort on easy and hard problems
💡 Our Solution
Teach AI to know when to think harder
⚡ The Result
Up to 75% reduction in computation, with comparable performance
NeurIPS 2025 Presentation
Why This Matters
Imagine a student who spends 30 minutes on every test question, whether it's "2+2" or a complex calculus proof. Current AI systems often work this way.
Current approach: Fixed computation regardless of difficulty
We taught them to "think smarter, not harder."
Our approach: Adaptive computation based on problem difficulty
The Calibration Challenge
Off-the-shelf PRMs lack calibration guarantees because their training is inherently tied to a specific policy model. They learn from reasoning paths generated by one particular model (e.g., Qwen-Math-72B-Instruct), binding their accuracy to that model's unique capabilities.
Strong Model
Often used for PRM Training
Weak Model
Often used for Test-Time Scaling
🔍 The Root Problem
The issue stems from the autoregressive nature of language models, where sequence generation is policy-dependent. The probability of future tokens π_θ(x_{t+1:T} | x_{0:t}) is conditioned on the model's parameters θ. Consequently, a PRM is only calibrated to the statistical patterns of the policy it was trained with.
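To make this dependence explicit, the quantity a PRM is meant to estimate can be written out as follows (the notation p* here is our shorthand for illustration, not the paper's):

```latex
% Illustrative only: p^* denotes the policy-dependent target a PRM estimates.
% The completion is drawn from the same policy, so the target itself changes
% whenever the completion policy changes.
p^{\star}(x_{0:t}; \pi_\theta)
  = \Pr_{x_{t+1:T} \sim \pi_\theta(\cdot \mid x_{0:t})}\left[\text{final answer is correct}\right]
  = \mathbb{E}_{x_{t+1:T} \sim \pi_\theta(\cdot \mid x_{0:t})}\left[\mathbf{1}\{\text{correct}\}\right]
```

A PRM fit to trajectories from a strong policy therefore approximates p*( · ; π_strong), which in general differs from p*( · ; π_weak) when a weaker model produces the completions.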
⚠️ Distributional Mismatch
When a PRM trained on a highly capable 72B model is paired with a weaker 1B model, it overestimates success probability. The PRM expects the stronger model's performance and cannot account for the weaker model's higher tendency to make errors on complex logical steps. This results in inflated and unreliable confidence scores.
QwenPRM-7B on MATH500: Histogram shows estimation errors between PRM predictions and ground-truth success rates. Positive values (right side) indicate overestimation. Distribution skews right, particularly for weaker models like Llama-1B.
QwenPRM-7B on AIME24-25: PRMs show further miscalibration on challenging problems. The peak near 1.0 indicates systematic overconfidence, especially for weaker completion models.
Key Insight: PRMs consistently overestimate success probabilities. Miscalibration is particularly severe for (1) weaker models and (2) challenging, out-of-distribution problems.
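For concreteness, here is a minimal sketch of how such estimation errors can be measured: sample several completions from the policy model at a given prefix, take the empirical success rate as ground truth, and compare it with the PRM's prediction. The function names `prm_score`, `sample_completion`, and `is_correct` are placeholders, not the paper's code.

```python
import statistics

def estimation_error(prefix, prm_score, sample_completion, is_correct, n_rollouts=16):
    """Estimate how far a PRM's prediction is from the policy's true success rate.

    prefix:            partial reasoning trajectory x_{0:t}
    prm_score:         callable returning the PRM's predicted success probability
    sample_completion: callable drawing one completion x_{t+1:T} from the policy model
    is_correct:        callable checking whether a completed solution is correct
    """
    # Monte-Carlo estimate of the policy-dependent success probability.
    successes = [is_correct(prefix + sample_completion(prefix)) for _ in range(n_rollouts)]
    empirical_success = statistics.mean(successes)

    # Positive error = the PRM overestimates the policy's chance of success.
    return prm_score(prefix) - empirical_success
```

Aggregating these errors over many prefixes yields histograms like the ones described above; a right-skewed distribution corresponds to systematic overestimation.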
💡 Why Calibrate PRMs?
While calibration isn't required for simple ranking tasks (where any monotonic transformation preserves order), calibrated probabilities are essential for:
- Interpretability: Understanding true confidence levels
- Safety Monitoring: Systematic detection of uncertain predictions
- Efficient Budget Allocation: Enabling instance-adaptive scaling
Instance-Adaptive Scaling with Calibrated PRMs
📊 Calibrated Confidence
We use quantile regression to fix AI overconfidence, providing reliable uncertainty estimates with confidence intervals.
From Overconfident to Calibrated: Our method transforms existing PRMs (which may assign an overconfident reward such as 0.95) into calibrated models that quantify uncertainty through quantile predictions (10th, 50th, and 90th percentiles).
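As a rough illustration of the mechanism (a sketch under our assumptions, not the paper's implementation; `feature_dim`, the single linear layer, and the success-probability targets are placeholders), quantile regression fits a small head on top of PRM features by minimizing the pinball loss at several quantile levels:

```python
import torch

class QuantileHead(torch.nn.Module):
    """Maps PRM features to predicted quantiles of the success probability."""
    def __init__(self, feature_dim, quantiles=(0.1, 0.5, 0.9)):
        super().__init__()
        self.quantiles = quantiles
        self.linear = torch.nn.Linear(feature_dim, len(quantiles))

    def forward(self, features):
        # Squash into [0, 1] since the targets are success probabilities.
        return torch.sigmoid(self.linear(features))

def pinball_loss(pred, target, tau):
    """Asymmetric (pinball) loss whose minimizer is the tau-th quantile of the target."""
    diff = target - pred
    return torch.mean(torch.maximum(tau * diff, (tau - 1.0) * diff))

def quantile_loss(predictions, target, quantiles=(0.1, 0.5, 0.9)):
    # Sum the pinball loss over all quantile levels (predictions: [batch, n_quantiles]).
    return sum(pinball_loss(predictions[:, i], target, tau)
               for i, tau in enumerate(quantiles))
```

One natural way to use such a head is to train it on rollouts from the same model that will be scaled at test time, so the predicted quantiles reflect that model's actual success rates.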
🧠 Smart Allocation
IAS dynamically adjusts computational effort with mathematical guarantees, allocating resources based on the estimated likelihood that a partial reasoning trajectory will yield a correct final answer (a simplified sketch follows the examples below).
Easy Problem → Quick Decision
Minimal computation for straightforward problems
Hard Problem → Extended Reasoning
More computation allocated to complex problems
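As a simplified sketch of how a calibrated success estimate can translate into a per-problem budget (the stopping rule below, its independence assumption, and the parameter names are our illustration, not the exact IAS criterion):

```python
import math

def instance_adaptive_budget(p_success_lower, target_risk=0.05, max_samples=64):
    """Pick how many solutions to sample for one problem.

    p_success_lower: calibrated lower-quantile estimate (e.g., the 10% quantile)
                     that a single sampled solution will be correct.
    target_risk:     acceptable probability that every sample is wrong.
    """
    p = min(max(p_success_lower, 1e-6), 1.0 - 1e-6)
    # If samples were independent, all n would fail with probability (1 - p)^n.
    # Requiring (1 - p)^n <= target_risk gives n >= log(target_risk) / log(1 - p).
    n = math.ceil(math.log(target_risk) / math.log(1.0 - p))
    return max(1, min(n, max_samples))
```

With these numbers, a confident instance (p ≈ 0.9) would get 2 samples while an uncertain one (p ≈ 0.2) would get 14, which is the sense in which compute shifts toward harder problems.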
Key Results
Calibration Method Comparison
Our quantile regression (QR) approach outperforms traditional calibration baselines (temperature scaling, isotonic regression, histogram binning) across both in-distribution (MATH500) and out-of-distribution (AIME24-25) datasets.
Brier score comparison across calibration methods on MATH500 (in-distribution) and AIME24-25 (out-of-distribution). Lower is better. Our QR method (purple) consistently achieves the lowest calibration error, with particularly strong improvements for weaker models and on challenging out-of-distribution problems.
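For reference, the Brier score used in this comparison is simply the mean squared error between predicted success probabilities and observed binary outcomes (a minimal sketch using NumPy; the example numbers are made up for illustration):

```python
import numpy as np

def brier_score(predicted_probs, outcomes):
    """Mean squared difference between predicted success probabilities
    and observed binary outcomes (1 = correct, 0 = incorrect). Lower is better."""
    predicted_probs = np.asarray(predicted_probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((predicted_probs - outcomes) ** 2))

# Example: an overconfident PRM vs. a better-calibrated one on the same outcomes.
outcomes = [1, 0, 0, 1]
print(brier_score([0.95, 0.9, 0.85, 0.9], outcomes))  # ~0.39, high error
print(brier_score([0.7, 0.3, 0.2, 0.8], outcomes))    # ~0.07, lower error
```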
Calibration Enables Instance-Adaptive Scaling
The proposed IAS strategy reduces inference costs while maintaining final answer accuracy, utilizing less compute on more confident problems as desired.
Citation
@article{park2025prmcalibration,
  title={Know What You Don't Know: Uncertainty Calibration of Process Reward Models},
  author={Park, Young-Jin and Greenewald, Kristjan and Alim, Kaveh and Wang, Hao and Azizan, Navid},
  journal={arXiv preprint arXiv:2506.09338},
  year={2025}
}