Model Calibration in LLMs

Making predictions reliable and trustworthy

SuNaAI Lab

Technical Guide Series


Chapter 1: The Calibration Challenge

Ensuring predictions match confidence

A model is well-calibrated if its predicted probabilities match the true likelihood of outcomes. In LLMs, confidence scores often don't reflect actual uncertainty, leading to overconfident or underconfident predictions.

Why Calibration Matters

In critical applications, being wrong with high confidence is dangerous. Proper calibration enables:

  • Better decision-making under uncertainty
  • Reliable risk assessment
  • Proper confidence intervals
  • Trust in model predictions

Perfect Calibration Example

If a model makes 100 predictions each with 0.9 confidence, ideally 90 of them should be correct. In a well-calibrated model this holds at every confidence level: confidence scores match empirical accuracy.
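This ideal can be checked with a quick simulation. The sketch below assumes a hypothetical, perfectly calibrated model whose 0.9-confidence predictions really are correct 90% of the time; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical perfectly calibrated model: every prediction is made with
# 0.9 confidence, and each one is independently correct with probability 0.9.
confidence = 0.9
n = 100_000  # large sample so empirical accuracy converges to the confidence
correct = rng.random(n) < confidence

empirical_accuracy = correct.mean()
print(f"confidence: {confidence:.2f}, empirical accuracy: {empirical_accuracy:.3f}")
```

With enough predictions, empirical accuracy tracks the stated confidence; miscalibration is precisely a persistent gap between these two numbers.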

Chapter 2: Understanding Calibration Problems

Why LLMs are often miscalibrated

Common Issues

📈 Overconfidence

Model assigns high confidence to incorrect predictions.

Example: LLM says "This sentence is grammatical" with 0.95 confidence, but it contains a syntax error.

📉 Underconfidence

Model is less confident than it should be, even when correct.

Example: LLM predicts correctly but gives low confidence score (0.3), indicating uncertainty that doesn't match the accuracy.

🎯 Distribution Mismatch

Probability distribution doesn't reflect true likelihood.

Example: LLM assigns equal probabilities to all candidates when one is clearly most likely.

Causes in LLMs

🔤 Softmax Calibration

Softmax outputs aren't naturally calibrated, especially for out-of-distribution samples.

🎲 Training Objectives

Cross-entropy loss rewards ever-more-confident predictions on training examples; overparameterized models tend to overfit it, becoming overconfident.

🔄 Domain Shift

Models calibrated on training data may be miscalibrated on different domains or tasks.

📊 Overfitting to Confidence

During training, models learn to overestimate confidence on familiar patterns.

Chapter 3: Evaluating Calibration

Metrics for calibration quality

Calibration Metrics

1. Expected Calibration Error (ECE)

Measures the difference between predicted and empirical probabilities.

ECE = Σ |acc(B_m) - conf(B_m)| × |B_m| / n

Where:
- B_m: bin m, containing predictions with confidence in ((m-1)/M, m/M]
- acc(B_m): accuracy in bin
- conf(B_m): average confidence in bin
- n: total predictions
- M: number of bins
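The formula translates directly into a few lines of NumPy. This is a minimal sketch; the function name and the equal-width binning (M = 10 by default) are our illustrative choices.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins ((m-1)/M, m/M]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = confidences.size
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)  # bin B_m
        if mask.any():
            acc = correct[mask].mean()       # acc(B_m)
            conf = confidences[mask].mean()  # conf(B_m)
            ece += abs(acc - conf) * mask.sum() / n
    return ece

# Overconfident model: 95% stated confidence, but only 3 of 5 correct
print(expected_calibration_error([0.95] * 5, [1, 1, 1, 0, 0]))
```

A perfectly calibrated batch yields an ECE of 0; the gap grows as confidence and accuracy drift apart.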

2. Maximum Calibration Error (MCE)

Maximum deviation between confidence and accuracy across all bins.

3. Brier Score

Mean squared error between predicted probabilities and actual outcomes.

Brier Score = (1/n) Σ(p_i - y_i)²

Where:
- p_i: predicted probability
- y_i: actual outcome (0 or 1)
- n: number of predictions
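As a sketch (the helper name is ours; scikit-learn ships an equivalent as `sklearn.metrics.brier_score_loss`):

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

# A confident correct prediction scores near 0; a confident wrong one near 1.
print(brier_score([0.9], [1]))  # small: (0.9 - 1)^2
print(brier_score([0.9], [0]))  # large: (0.9 - 0)^2
```

Unlike ECE, the Brier score mixes calibration with accuracy, so it can reward a sharper model even at equal calibration.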

4. Reliability Diagrams

Visual tool showing the relationship between predicted confidence and empirical accuracy, plotted per bin. Perfect calibration appears as the diagonal line y = x.

Chapter 4: Calibration Methods Overview

Post-hoc calibration techniques

Post-hoc calibration methods adjust model outputs after training to improve calibration without retraining. These are popular because they're fast and preserve model performance.

Method Comparison

Method               Parameters   Flexibility   Speed
Temperature Scaling  1 (T)        Low           Very Fast
Platt Scaling        2            Medium        Fast
Isotonic Regression  N            High          Medium

Chapter 5: Temperature Scaling

The simplest and most effective method

Temperature scaling is a single-parameter post-hoc calibration method that applies a temperature parameter to the logits before softmax. It's simple, preserves rank orderings, and works well for neural networks.

Temperature Scaling Formula
P(y|x) = softmax(logits / T)

Where:
- T: temperature parameter (learned)
- T > 1: makes distributions softer (less confident)
- T < 1: makes distributions sharper (more confident)
- T = 1: no change (original predictions)
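A tiny NumPy example makes the effect of T concrete (the logits are arbitrary). Note that the argmax, and hence the predicted class, is unchanged for every T:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])  # arbitrary example logits

probs = {T: softmax(logits / T) for T in (0.5, 1.0, 2.0)}
for T, p in probs.items():
    print(f"T={T}: max prob = {p.max():.3f}, argmax = {p.argmax()}")
```

Lower T sharpens the distribution (higher top probability), higher T flattens it, and the ranking of classes is preserved throughout.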

Implementation

Temperature Scaling Code
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureScaling(nn.Module):
    def __init__(self):
        super().__init__()
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, logits):
        return logits / self.temperature

# Initialize
temp_scaling = TemperatureScaling()

# Learn the optimal T on a held-out validation set (val_logits, val_labels)
optimizer = torch.optim.LBFGS([temp_scaling.temperature], max_iter=50)

# Loss function: cross-entropy (negative log-likelihood) on the scaled logits
loss_fn = nn.CrossEntropyLoss()

# Minimize NLL on the validation set
def closure():
    optimizer.zero_grad()
    loss = loss_fn(temp_scaling(val_logits), val_labels)
    loss.backward()
    return loss

optimizer.step(closure)

# Apply the learned temperature to test-set logits
calibrated_probs = F.softmax(test_logits / temp_scaling.temperature, dim=-1)

Temperature Scaling Properties

✅ Advantages

  • Preserves rank order of classes
  • Single parameter, fast to learn
  • Works well across many models
  • Doesn't affect model weights
  • Highly effective for neural networks

❌ Limitations

  • Assumes same temperature for all classes
  • May not work for heavily miscalibrated models
  • Cannot fix multimodal calibration issues
  • Requires validation set

Chapter 6: Platt Scaling

Logistic regression on model outputs

Platt scaling fits a logistic regression model on the model's confidence scores to obtain calibrated probabilities. It's more flexible than temperature scaling but more prone to overfitting.

Platt Scaling Formula
P(y=1|x) = σ(A * f(x) + B)

Where:
- f(x): original model score
- σ: sigmoid function
- A, B: learned parameters
- A > 0 ensures monotonicity

Implementation

Platt Scaling Implementation
from sklearn.linear_model import LogisticRegression
import numpy as np

# Get model predictions on validation set
probs_val = model.predict_proba(X_val)
scores_val = probs_val[:, 1]  # Probability of positive class

# Fit Platt scaling on validation set
platt = LogisticRegression()
platt.fit(scores_val.reshape(-1, 1), y_val)

# Apply to test set
probs_test = model.predict_proba(X_test)
scores_test = probs_test[:, 1]
calibrated_probs = platt.predict_proba(scores_test.reshape(-1, 1))[:, 1]

Chapter 7: Isotonic Regression

Non-parametric calibration

Isotonic regression is a non-parametric method that learns piecewise-constant calibrated probabilities. It's more flexible than previous methods and can fit arbitrary shapes but requires more data to avoid overfitting.

Isotonic Regression
from sklearn.isotonic import IsotonicRegression

# Positive-class scores from the base model on the validation set
probs_val = model.predict_proba(X_val)[:, 1]

# Fit isotonic regression; clip test scores outside the training range
iso_reg = IsotonicRegression(out_of_bounds='clip')
iso_reg.fit(probs_val, y_val)

# Apply calibration to test-set scores
probs_test = model.predict_proba(X_test)[:, 1]
calibrated_probs = iso_reg.transform(probs_test)

Comparison

Temperature

Simple, fast, requires least data

Best for: Most common case

Platt

More flexible, 2 parameters, prone to overfitting

Best for: Binary classification

Isotonic

Most flexible, many parameters, needs lots of data

Best for: Complex calibration needs

Chapter 8: Practical Implementation

Step-by-step calibration guide

Recommended Workflow

Step 1: Evaluate Calibration

Calculate ECE, MCE, and visualize reliability diagram to assess calibration.

Step 2: Split Data

Use separate validation set for learning calibration parameters (don't use test set).

Step 3: Apply Temperature Scaling

Start with temperature scaling as baseline—it's simple and effective.

Step 4: Evaluate Improvements

Measure calibration metrics and accuracy on held-out test set.

Step 5: Try Other Methods

If temperature scaling isn't sufficient, try Platt scaling or isotonic regression.

When to Use Each Method

Start with Temperature Scaling - it works well in most cases.

Use Platt Scaling when you have binary classification or need slightly more flexibility.

Try Isotonic Regression when you have lots of validation data and complex calibration patterns.

Consider ensemble methods for combining multiple calibration approaches.
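To make the workflow concrete, here is an end-to-end sketch on synthetic data: a deliberately overconfident "model" (its logits are twice too sharp), calibrated with Platt scaling and isotonic regression, scored with ECE before and after. The sharpening factor, sample sizes, and helper names are all illustrative assumptions, not a prescription.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def ece(conf, correct, n_bins=10):
    """Expected Calibration Error with equal-width bins."""
    conf, correct = np.asarray(conf), np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += abs(correct[mask].mean() - conf[mask].mean()) * mask.sum() / conf.size
    return total

# Synthetic overconfident model: it reports sigmoid(z), but the true
# positive-class probability is sigmoid(z / 2) (logits twice too sharp).
z = rng.normal(0.0, 2.0, 20_000)
reported = 1 / (1 + np.exp(-z))
y = (rng.random(z.size) < 1 / (1 + np.exp(-z / 2))).astype(int)

val, test = slice(0, 10_000), slice(10_000, None)

# Platt scaling: logistic regression on the model's logit z
platt = LogisticRegression().fit(z[val, None], y[val])
platt_probs = platt.predict_proba(z[test, None])[:, 1]

# Isotonic regression directly on the reported probabilities
iso = IsotonicRegression(out_of_bounds="clip").fit(reported[val], y[val])
iso_probs = iso.transform(reported[test])

results = {}
for name, p in [("raw", reported[test]), ("platt", platt_probs), ("isotonic", iso_probs)]:
    conf = np.maximum(p, 1 - p)  # confidence in the predicted class
    correct = ((p > 0.5).astype(int) == y[test]).astype(float)
    results[name] = ece(conf, correct)
    print(f"{name:9s} ECE = {results[name]:.3f}")
```

Both calibrators shrink the ECE of the raw scores here; on real LLM outputs, start with temperature scaling and fall back to these only if it proves insufficient.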