Making predictions reliable and trustworthy
SuNaAI Lab
Technical Guide Series
Ensuring predictions match confidence
A model is well-calibrated if its predicted probabilities match the true likelihood of outcomes. In LLMs, confidence scores often don't reflect actual uncertainty, leading to overconfident or underconfident predictions.
In critical applications, being wrong with high confidence is dangerous, so proper calibration is essential for trustworthy decision-making. Concretely: if a model makes 100 predictions at 0.9 confidence, ideally 90 of them should be correct. In a well-calibrated model, confidence scores match empirical accuracy.
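This can be sanity-checked with a quick simulation (a toy sketch; the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a well-calibrated model: each prediction is correct
# with exactly its stated probability.
confidences = np.full(10_000, 0.9)
correct = rng.random(10_000) < confidences

empirical_accuracy = correct.mean()  # close to 0.9
```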
Why LLMs are often miscalibrated
Overconfidence: the model assigns high confidence to incorrect predictions.
Example: an LLM says "This sentence is grammatical" with 0.95 confidence, but the sentence contains a syntax error.
Underconfidence: the model is less confident than it should be, even when correct.
Example: an LLM predicts correctly but gives a low confidence score (0.3), signaling uncertainty that doesn't match its accuracy.
Distribution miscalibration: the predicted probability distribution doesn't reflect true likelihoods.
Example: an LLM assigns equal probability to all candidates when one is clearly most likely.
Softmax outputs aren't naturally calibrated, especially for out-of-distribution samples.
Minimizing cross-entropy keeps pushing logit magnitudes up even after accuracy saturates, leading to overconfident models.
Models calibrated on training data may be miscalibrated on different domains or tasks.
During training, models learn to overestimate confidence on familiar patterns.
Metrics for calibration quality
Expected Calibration Error (ECE): measures the average gap between predicted confidence and empirical accuracy.

ECE = Σ_{m=1}^{M} (|B_m| / n) × |acc(B_m) − conf(B_m)|

Where:
- B_m: bin m, containing predictions with confidence in ((m−1)/M, m/M]
- acc(B_m): accuracy within the bin
- conf(B_m): average confidence within the bin
- n: total number of predictions
- M: number of bins
Maximum Calibration Error (MCE): the maximum deviation between confidence and accuracy across all bins.
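The ECE and MCE definitions can be implemented in a few lines of NumPy (equal-width bins; the function name is illustrative):

```python
import numpy as np

def ece_mce(confidences, correct, n_bins=10):
    """Expected and Maximum Calibration Error with equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce, n = 0.0, 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += gap * mask.sum() / n   # bin-weighted average gap
        mce = max(mce, gap)           # worst-case bin gap
    return ece, mce

ece, mce = ece_mce([0.95, 0.95, 0.55, 0.55], [1, 1, 1, 0])  # both ≈ 0.05
```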
Brier Score: mean squared error between predicted probabilities and actual outcomes.

Brier Score = (1/n) Σ_{i=1}^{n} (p_i − y_i)²

Where:
- p_i: predicted probability
- y_i: actual outcome (0 or 1)
- n: number of predictions
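The formula translates directly into NumPy (function name illustrative):

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return np.mean((probs - outcomes) ** 2)

perfect = brier_score([1.0, 0.0], [1, 0])      # 0.0
uninformed = brier_score([0.5, 0.5], [1, 0])   # 0.25
```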
Reliability Diagram: a visual tool plotting predicted confidence against empirical accuracy. Perfect calibration follows the diagonal.
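The bin-wise points of a reliability diagram can be computed as below (a sketch; plotting the returned pairs against the y = x diagonal gives the diagram):

```python
import numpy as np

def reliability_curve(confidences, correct, n_bins=10):
    """Per-bin (mean confidence, accuracy) pairs for a reliability diagram."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    points = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            points.append((confidences[mask].mean(), correct[mask].mean()))
    return points

pts = reliability_curve([0.95, 0.95], [1, 0])  # one bin: conf 0.95, acc 0.5
```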
Post-hoc calibration techniques
Post-hoc calibration methods adjust model outputs after training to improve calibration without retraining. These are popular because they're fast and preserve model performance.
| Method | Parameters | Flexibility | Speed |
|---|---|---|---|
| Temperature Scaling | 1 (T) | Low | Very Fast |
| Platt Scaling | 2 | Medium | Fast |
| Isotonic Regression | N | High | Medium |
The simplest and most effective method
Temperature scaling is a single-parameter post-hoc calibration method that applies a temperature parameter to the logits before softmax. It's simple, preserves rank orderings, and works well for neural networks.
P(y|x) = softmax(z / T)

Where:
- z: the model's logits
- T: temperature parameter (learned)
- T > 1: makes distributions softer (less confident)
- T < 1: makes distributions sharper (more confident)
- T = 1: no change (original predictions)
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureScaling(nn.Module):
    def __init__(self):
        super().__init__()
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, logits):
        return logits / self.temperature

# Initialize
temp_scaling = TemperatureScaling()

# Learn the optimal T on a held-out validation set
# (val_logits, val_labels are precomputed model outputs and targets)
optimizer = torch.optim.LBFGS([temp_scaling.temperature])
loss_fn = nn.CrossEntropyLoss()

# Minimize NLL on the validation set
def closure():
    optimizer.zero_grad()
    loss = loss_fn(temp_scaling(val_logits), val_labels)
    loss.backward()
    return loss

optimizer.step(closure)

# Apply the learned temperature to precomputed test-set logits
calibrated_probs = F.softmax(temp_scaling(test_logits), dim=1)
```

Logistic regression on model outputs
Platt scaling fits a logistic regression model on the model's confidence scores to obtain calibrated probabilities. It's more flexible than temperature scaling but more prone to overfitting.
P(y=1|x) = σ(A · f(x) + B)

Where:
- f(x): original model score
- σ: sigmoid function
- A, B: learned parameters
- A > 0 ensures monotonicity
```python
from sklearn.linear_model import LogisticRegression
import numpy as np

# Get model predictions on the validation set
probs_val = model.predict_proba(X_val)
scores_val = probs_val[:, 1]  # probability of the positive class

# Fit Platt scaling on the validation set
platt = LogisticRegression()
platt.fit(scores_val.reshape(-1, 1), y_val)

# Apply to the test set
probs_test = model.predict_proba(X_test)
scores_test = probs_test[:, 1]
calibrated_probs = platt.predict_proba(scores_test.reshape(-1, 1))[:, 1]
```
Non-parametric calibration
Isotonic regression is a non-parametric method that learns piecewise-constant calibrated probabilities. It's more flexible than previous methods and can fit arbitrary shapes but requires more data to avoid overfitting.
```python
from sklearn.isotonic import IsotonicRegression

# Fit isotonic regression on validation-set probabilities
iso_reg = IsotonicRegression(out_of_bounds='clip')
iso_reg.fit(probs_val, y_val)

# Apply calibration to test-set probabilities
calibrated_probs = iso_reg.transform(probs_test)
```
Temperature scaling: simple, fast, requires the least data.
Platt scaling: more flexible, 2 parameters, prone to overfitting.
Isotonic regression: most flexible, many parameters, needs lots of data.
Step-by-step calibration guide
1. Calculate ECE and MCE and visualize a reliability diagram to assess baseline calibration.
2. Set aside a separate validation set for learning calibration parameters (don't use the test set).
3. Start with temperature scaling as a baseline—it's simple and effective.
4. Measure calibration metrics and accuracy on a held-out test set.
5. If temperature scaling isn't sufficient, try Platt scaling or isotonic regression.
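This workflow can be sketched end to end in plain NumPy; here a grid search over T stands in for LBFGS, and all names are illustrative:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Negative log-likelihood of the labels under temperature T."""
    p = softmax(logits / T)
    return -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(val_logits, val_labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the temperature that minimizes validation NLL."""
    return min(grid, key=lambda T: nll(val_logits, val_labels, T))

# Toy overconfident model: always predicts class 0 with large margin,
# but is right only 80% of the time -> the fitted T should exceed 1.
logits = np.tile([5.0, 0.0], (100, 1))
labels = np.array([0] * 80 + [1] * 20)
T = fit_temperature(logits, labels)
```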
Start with Temperature Scaling - it works well in most cases.
Use Platt Scaling when you have binary classification or need slightly more flexibility.
Try Isotonic Regression when you have lots of validation data and complex calibration patterns.
Consider ensemble methods for combining multiple calibration approaches.
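One simple ensemble is a weighted average of the probabilities produced by several calibrators (a sketch with hypothetical calibration maps, not a standard library API):

```python
import numpy as np

def temper(p, T=1.5):
    """Hypothetical calibrator: temperature-scale a binary probability."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    a, b = p ** (1 / T), (1 - p) ** (1 / T)
    return a / (a + b)

def ensemble_calibrate(scores, calibrators, weights=None):
    """Average the calibrated probabilities from several calibration maps."""
    outputs = np.stack([c(np.asarray(scores, dtype=float)) for c in calibrators])
    return np.average(outputs, axis=0, weights=weights)

# Combine the tempering map with an identity (uncalibrated) map.
probs = ensemble_calibrate([0.9, 0.6], [temper, lambda s: s])
```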