Model Calibration in LLMs

Making predictions reliable and trustworthy

SuNaAI Lab

Technical Guide Series


Chapter 1: The Calibration Challenge

Ensuring predictions match confidence

A model is well-calibrated if its predicted probabilities match the true likelihood of outcomes. In LLMs, confidence scores often don't reflect actual uncertainty, leading to overconfident or underconfident predictions.

Why Calibration Matters

In critical applications, being wrong with high confidence is dangerous. Proper calibration enables:

  • Better decision-making under uncertainty
  • Reliable risk assessment
  • Proper confidence intervals
  • Trust in model predictions

Perfect Calibration Example

If a model makes 100 predictions each with 0.9 confidence, ideally 90 of them should be correct. In a well-calibrated model this holds at every confidence level: confidence scores match empirical accuracy.
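This ideal can be checked with a quick simulation. The sketch below assumes a hypothetical, perfectly calibrated model whose 0.9-confidence predictions really are correct 90% of the time; the sample size and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical perfectly calibrated model: every prediction is made with
# 0.9 confidence, and each one is independently correct with probability 0.9.
confidence = 0.9
n = 100_000  # large sample so empirical accuracy converges to the confidence
correct = rng.random(n) < confidence

empirical_accuracy = correct.mean()
print(f"confidence: {confidence:.2f}, empirical accuracy: {empirical_accuracy:.3f}")
```

With enough predictions, empirical accuracy tracks the stated confidence; miscalibration is precisely a persistent gap between these two numbers.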

Chapter 2: Understanding Calibration Problems

Why LLMs are often miscalibrated

Common Issues

📈 Overconfidence

Model assigns high confidence to incorrect predictions.

Example: LLM says "This sentence is grammatical" with 0.95 confidence, but it contains a syntax error.

📉 Underconfidence

Model is less confident than it should be, even when correct.

Example: LLM predicts correctly but gives low confidence score (0.3), indicating uncertainty that doesn't match the accuracy.

🎯 Distribution Mismatch

Probability distribution doesn't reflect true likelihood.

Example: LLM assigns equal probabilities to all candidates when one is clearly most likely.

Causes in LLMs

🔤 Softmax Calibration

Softmax outputs aren't naturally calibrated, especially for out-of-distribution samples.

🎲 Training Objectives

Cross-entropy loss rewards ever-more-confident predictions on training examples; overparameterized models tend to overfit it, becoming overconfident.

🔄 Domain Shift

Models calibrated on training data may be miscalibrated on different domains or tasks.

📊 Overfitting to Confidence

During training, models learn to overestimate confidence on familiar patterns.

Chapter 3: Evaluating Calibration

Metrics for calibration quality

Calibration Metrics

1. Expected Calibration Error (ECE)

Measures the difference between predicted and empirical probabilities.

ECE = Σ |acc(B_m) - conf(B_m)| × |B_m| / n

Where:
- B_m: bin m, containing predictions with confidence in ((m-1)/M, m/M]
- acc(B_m): accuracy in bin
- conf(B_m): average confidence in bin
- n: total predictions
- M: number of bins
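The formula translates directly into a few lines of NumPy. This is a minimal sketch; the function name and the equal-width binning (M = 10 by default) are our illustrative choices.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins ((m-1)/M, m/M]."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = confidences.size
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)  # bin B_m
        if mask.any():
            acc = correct[mask].mean()       # acc(B_m)
            conf = confidences[mask].mean()  # conf(B_m)
            ece += abs(acc - conf) * mask.sum() / n
    return ece

# Overconfident model: 95% stated confidence, but only 3 of 5 correct
print(expected_calibration_error([0.95] * 5, [1, 1, 1, 0, 0]))
```

A perfectly calibrated batch yields an ECE of 0; the gap grows as confidence and accuracy drift apart.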

2. Maximum Calibration Error (MCE)

Maximum deviation between confidence and accuracy across all bins.

3. Brier Score

Mean squared error between predicted probabilities and actual outcomes.

Brier Score = (1/n) Σ(p_i - y_i)²

Where:
- p_i: predicted probability
- y_i: actual outcome (0 or 1)
- n: number of predictions
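As a sketch (the helper name is ours; scikit-learn ships an equivalent as `sklearn.metrics.brier_score_loss`):

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

# A confident correct prediction scores near 0; a confident wrong one near 1.
print(brier_score([0.9], [1]))  # small: (0.9 - 1)^2
print(brier_score([0.9], [0]))  # large: (0.9 - 0)^2
```

Unlike ECE, the Brier score mixes calibration with accuracy, so it can reward a sharper model even at equal calibration.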

4. Reliability Diagrams

Visual tool showing the relationship between predicted confidence and empirical accuracy, plotted per bin. Perfect calibration appears as the diagonal line y = x.

Chapter 4: Calibration Methods Overview

Post-hoc calibration techniques

Post-hoc calibration methods adjust model outputs after training to improve calibration without retraining. These are popular because they're fast and preserve model performance.

Method Comparison

Method               Parameters   Flexibility   Speed
Temperature Scaling  1 (T)        Low           Very Fast
Platt Scaling        2            Medium        Fast
Isotonic Regression  N            High          Medium

Chapter 5: Temperature Scaling

The simplest and most effective method

Temperature scaling is a single-parameter post-hoc calibration method that applies a temperature parameter to the logits before softmax. It's simple, preserves rank orderings, and works well for neural networks.

Temperature Scaling Formula
P(y|x) = softmax(logits / T)

Where:
- T: temperature parameter (learned)
- T > 1: makes distributions softer (less confident)
- T < 1: makes distributions sharper (more confident)
- T = 1: no change (original predictions)
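A tiny NumPy example makes the effect of T concrete (the logits are arbitrary). Note that the argmax, and hence the predicted class, is unchanged for every T:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])  # arbitrary example logits

probs = {T: softmax(logits / T) for T in (0.5, 1.0, 2.0)}
for T, p in probs.items():
    print(f"T={T}: max prob = {p.max():.3f}, argmax = {p.argmax()}")
```

Lower T sharpens the distribution (higher top probability), higher T flattens it, and the ranking of classes is preserved throughout.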

Implementation

Temperature Scaling Code
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureScaling(nn.Module):
    def __init__(self):
        super().__init__()
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, logits):
        return logits / self.temperature

# Initialize
temp_scaling = TemperatureScaling()

# Learn the optimal T on a held-out validation set (val_logits, val_labels)
optimizer = torch.optim.LBFGS([temp_scaling.temperature], max_iter=50)

# Loss function: cross-entropy (negative log-likelihood) on the scaled logits
loss_fn = nn.CrossEntropyLoss()

# Minimize NLL on the validation set
def closure():
    optimizer.zero_grad()
    loss = loss_fn(temp_scaling(val_logits), val_labels)
    loss.backward()
    return loss

optimizer.step(closure)

# Apply the learned temperature to test-set logits
calibrated_probs = F.softmax(test_logits / temp_scaling.temperature, dim=-1)

Temperature Scaling Properties

✅ Advantages

  • Preserves rank order of classes
  • Single parameter, fast to learn
  • Works well across many models
  • Doesn't affect model weights
  • Highly effective for neural networks

❌ Limitations

  • Assumes same temperature for all classes
  • May not work for heavily miscalibrated models
  • Cannot fix multimodal calibration issues
  • Requires validation set

Chapter 6: Platt Scaling

Logistic regression on model outputs

Platt scaling fits a logistic regression model on the model's confidence scores to obtain calibrated probabilities. It's more flexible than temperature scaling but more prone to overfitting.

Platt Scaling Formula
P(y=1|x) = σ(A * f(x) + B)

Where:
- f(x): original model score
- σ: sigmoid function
- A, B: learned parameters
- A > 0 ensures monotonicity

Implementation

Platt Scaling Implementation
from sklearn.linear_model import LogisticRegression
import numpy as np

# Get model predictions on validation set
probs_val = model.predict_proba(X_val)
scores_val = probs_val[:, 1]  # Probability of positive class

# Fit Platt scaling on validation set
platt = LogisticRegression()
platt.fit(scores_val.reshape(-1, 1), y_val)

# Apply to test set
probs_test = model.predict_proba(X_test)
scores_test = probs_test[:, 1]
calibrated_probs = platt.predict_proba(scores_test.reshape(-1, 1))[:, 1]

Chapter 7: Isotonic Regression

Non-parametric calibration

Isotonic regression is a non-parametric method that learns piecewise-constant calibrated probabilities. It's more flexible than previous methods and can fit arbitrary shapes but requires more data to avoid overfitting.

Isotonic Regression
from sklearn.isotonic import IsotonicRegression

# Positive-class scores from the base model on the validation set
probs_val = model.predict_proba(X_val)[:, 1]

# Fit isotonic regression; clip test scores outside the training range
iso_reg = IsotonicRegression(out_of_bounds='clip')
iso_reg.fit(probs_val, y_val)

# Apply calibration to test-set scores
probs_test = model.predict_proba(X_test)[:, 1]
calibrated_probs = iso_reg.transform(probs_test)

Comparison

Temperature

Simple, fast, requires least data

Best for: Most common case

Platt

More flexible, 2 parameters, prone to overfitting

Best for: Binary classification

Isotonic

Most flexible, many parameters, needs lots of data

Best for: Complex calibration needs

Chapter 8: Practical Implementation

Step-by-step calibration guide

Recommended Workflow

Step 1: Evaluate Calibration

Calculate ECE, MCE, and visualize reliability diagram to assess calibration.

Step 2: Split Data

Use separate validation set for learning calibration parameters (don't use test set).

Step 3: Apply Temperature Scaling

Start with temperature scaling as baseline—it's simple and effective.

Step 4: Evaluate Improvements

Measure calibration metrics and accuracy on held-out test set.

Step 5: Try Other Methods

If temperature scaling isn't sufficient, try Platt scaling or isotonic regression.

When to Use Each Method

Start with Temperature Scaling - it works well in most cases.

Use Platt Scaling when you have binary classification or need slightly more flexibility.

Try Isotonic Regression when you have lots of validation data and complex calibration patterns.

Consider ensemble methods for combining multiple calibration approaches.
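To make the workflow concrete, here is an end-to-end sketch on synthetic data: a deliberately overconfident "model" (its logits are twice too sharp), calibrated with Platt scaling and isotonic regression, scored with ECE before and after. The sharpening factor, sample sizes, and helper names are all illustrative assumptions, not a prescription.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def ece(conf, correct, n_bins=10):
    """Expected Calibration Error with equal-width bins."""
    conf, correct = np.asarray(conf), np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            total += abs(correct[mask].mean() - conf[mask].mean()) * mask.sum() / conf.size
    return total

# Synthetic overconfident model: it reports sigmoid(z), but the true
# positive-class probability is sigmoid(z / 2) (logits twice too sharp).
z = rng.normal(0.0, 2.0, 20_000)
reported = 1 / (1 + np.exp(-z))
y = (rng.random(z.size) < 1 / (1 + np.exp(-z / 2))).astype(int)

val, test = slice(0, 10_000), slice(10_000, None)

# Platt scaling: logistic regression on the model's logit z
platt = LogisticRegression().fit(z[val, None], y[val])
platt_probs = platt.predict_proba(z[test, None])[:, 1]

# Isotonic regression directly on the reported probabilities
iso = IsotonicRegression(out_of_bounds="clip").fit(reported[val], y[val])
iso_probs = iso.transform(reported[test])

results = {}
for name, p in [("raw", reported[test]), ("platt", platt_probs), ("isotonic", iso_probs)]:
    conf = np.maximum(p, 1 - p)  # confidence in the predicted class
    correct = ((p > 0.5).astype(int) == y[test]).astype(float)
    results[name] = ece(conf, correct)
    print(f"{name:9s} ECE = {results[name]:.3f}")
```

Both calibrators shrink the ECE of the raw scores here; on real LLM outputs, start with temperature scaling and fall back to these only if it proves insufficient.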