Object-level Calibration Error (OCE)
uq_detr.oce(detections, ground_truths, ...)
OCE is the key metric introduced in our paper. Unlike prediction-level metrics (D-ECE, LA-ECE), OCE aggregates predictions per ground-truth object, making it suitable for jointly evaluating a model and its post-processing scheme.
Definition
For each ground-truth object \(y_{l,j}\) with class \(c_{l,j}\), OCE computes the Brier score between the aggregated prediction and a one-hot target:
\[\text{Brier}(y_{l,j}) = \sum_{k} \left(\bar{p}_k - \mathbb{1}[k = c_{l,j}]\right)^2\]
where the sum runs over classes \(k\) and \(\bar{p}\) is the average predicted distribution over all detections with \(\text{IoU} \geq \tau\) to the ground-truth object. If no detection reaches that IoU with the object, the Brier score defaults to 1.0 (maximum penalty).
The final OCE averages over IoU thresholds \(\tau \in \{0.5, 0.75\}\).
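The per-object computation described above can be sketched in plain NumPy. This is an illustrative sketch, not the library's implementation: it assumes the Brier score is the squared error summed over classes (the form consistent with the binary approximation later on this page), and the function names `object_brier` and `object_oce` and the array layout are hypothetical.

```python
import numpy as np

def object_brier(probs, ious, true_class, tau):
    """Brier score for one ground-truth object at IoU threshold tau.

    probs: (N, K) class distributions of N detections (assumed layout)
    ious:  (N,) IoU of each detection with the ground-truth object
    """
    probs = np.asarray(probs, dtype=float)
    ious = np.asarray(ious, dtype=float)
    matched = ious >= tau
    if not matched.any():
        return 1.0  # no matched detection: default penalty from the text
    p_bar = probs[matched].mean(axis=0)  # "mean" aggregation
    target = np.zeros(probs.shape[1])
    target[true_class] = 1.0  # one-hot target
    return float(np.sum((p_bar - target) ** 2))

def object_oce(probs, ious, true_class, taus=(0.5, 0.75)):
    """Average the per-object Brier score over the IoU thresholds."""
    return float(np.mean([object_brier(probs, ious, true_class, t) for t in taus]))

# Toy example: two detections near one object of class 0.
probs = [[0.8, 0.1, 0.1], [0.6, 0.3, 0.1]]
ious = [0.8, 0.6]
print(object_oce(probs, ious, true_class=0))
```

At \(\tau = 0.5\) both detections match and are averaged; at \(\tau = 0.75\) only the first one does. Averaging these per-object values over all objects in the dataset would give the dataset-level score.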
Why OCE over D-ECE?
OCE addresses two pitfalls of prediction-level metrics:
- D-ECE favors discarding predictions. Since D-ECE doesn't penalize missed objects, setting a very high confidence threshold yields artificially low D-ECE.
- AP favors retaining everything. AP doesn't penalize low-confidence false positives, so the optimal AP threshold is near 0.
Plotted against the confidence threshold, OCE follows a U-shaped curve whose minimum identifies the sweet spot: the threshold that best recovers the well-calibrated primary predictions.
Usage
```python
import uq_detr

result = uq_detr.oce(detections, ground_truths)
print(result.score)        # aggregated OCE
print(result.per_element)  # per-object Brier scores
```
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `detections` | `list[Detections]` | required | Post-processed predictions per image |
| `ground_truths` | `list[GroundTruth]` | required | Ground-truth annotations per image |
| `iou_thresholds` | `list[float]` | `[0.5, 0.75]` | IoU thresholds; final OCE is their average |
| `aggregation` | `str` | `"mean"` | How to aggregate predictions matched to the same object |
Aggregation Methods
| Value | Description |
|---|---|
| `"mean"` | Average predicted distribution over all matched detections |
| `"max_iou"` | Use only the detection with the highest IoU |
| `"iou_weighted"` | IoU-weighted average of matched detections |
All three are highly correlated (see paper, Section V-C4).
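To make the difference between the three options concrete, here is a minimal NumPy sketch of each aggregation applied to the detections already matched to one object. The helper name `aggregate` and the array layout are assumptions for illustration; the library's internals may differ.

```python
import numpy as np

def aggregate(probs, ious, method="mean"):
    """Combine the class distributions of detections matched to one object.

    probs: (N, K) class distributions, ious: (N,) IoU with the object.
    """
    probs = np.asarray(probs, dtype=float)
    ious = np.asarray(ious, dtype=float)
    if method == "mean":
        return probs.mean(axis=0)
    if method == "max_iou":
        return probs[np.argmax(ious)]
    if method == "iou_weighted":
        # Weighted mean over detections, with IoU as the weight.
        return np.average(probs, axis=0, weights=ious)
    raise ValueError(f"unknown aggregation method: {method!r}")

# Toy example: two matched detections, two classes.
probs = [[0.8, 0.2], [0.4, 0.6]]
ious = [0.9, 0.6]
for method in ("mean", "max_iou", "iou_weighted"):
    print(method, aggregate(probs, ious, method))
```

In this toy case the IoU-weighted result falls between the plain mean and the max-IoU detection, since the higher-IoU detection receives more weight.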
Binary Approximation
When `scores` is 1-D (max-confidence only), OCE uses a binary Brier approximation:
- Correct class (\(\hat{c} = c^*\)): \(\text{Brier} = 2(1-p)^2\)
- Wrong class (\(\hat{c} \neq c^*\)): \(\text{Brier} = 2p^2\)
This enables OCE computation with outputs from frameworks like supervision that only provide max-confidence scores.
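The two cases above translate directly into code. One way to read the factor of 2, offered here as an interpretation rather than something the text states, is that the remaining mass \(1 - p\) is treated as sitting on a single other class, so the two-class Brier score has two equal squared-error terms. The function names below are hypothetical.

```python
def binary_brier(p, correct):
    """Binary Brier approximation from a max-confidence score p."""
    return 2.0 * (1.0 - p) ** 2 if correct else 2.0 * p ** 2

def two_class_brier(p, correct):
    """Same quantity, computed as an explicit two-class Brier score.

    Assumption: the predicted class gets p and one other class gets 1 - p;
    when the prediction is wrong, that other class is taken to be the true one.
    """
    pred = (p, 1.0 - p)
    target = (1.0, 0.0) if correct else (0.0, 1.0)
    return sum((q - t) ** 2 for q, t in zip(pred, target))

for p in (0.9, 0.5):
    for correct in (True, False):
        print(p, correct, binary_brier(p, correct), two_class_brier(p, correct))
```

Under that assumption both functions agree for every `p`, which is why a confident correct prediction scores near 0 and a confident wrong one scores near 2.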
Threshold Sweep Example
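```python
import numpy as np

import uq_detr
from uq_detr import select

# all_queries and ground_truths are per-image lists prepared beforehand.
for thr in np.arange(0.1, 0.9, 0.1):
    # Apply threshold-based selection to each image's predictions.
    filtered = [select(q, method="threshold", param=thr) for q in all_queries]
    score = uq_detr.oce(filtered, ground_truths).score
    print(f"threshold={thr:.1f} OCE={score:.4f}")
```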
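If the sweep results are collected instead of printed, the operating point is simply the threshold with the lowest OCE. The sketch below assumes the scores are stored as `(threshold, score)` pairs; the helper `best_threshold` is hypothetical and the numbers are made up purely to show the shape of the data.

```python
def best_threshold(sweep):
    """Return the (threshold, score) pair with the lowest OCE."""
    return min(sweep, key=lambda pair: pair[1])

# Made-up scores for illustration only, not measured results.
sweep = [(0.1, 0.42), (0.3, 0.31), (0.5, 0.35), (0.7, 0.48)]
thr, score = best_threshold(sweep)
print(f"best threshold={thr:.1f} OCE={score:.4f}")
```

A finer grid around the selected value can then be swept to refine the choice.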