LIME, SHAP, and Integrated Gradients
SuNaAI Lab
Technical Guide Series
Demystifying the black box: How to understand what your language models are really thinking
As language models become more powerful and ubiquitous, understanding their decision-making process becomes crucial. Explainability in NLP isn't just about satisfying curiosity—it's about building trust, ensuring fairness, and enabling better model development.
Modern NLP models contain millions or billions of parameters and rely on complex attention mechanisms. Traditional debugging approaches don't work here; we need specialized techniques to understand what is happening inside these systems.
The critical importance of understanding AI decisions in real-world applications
Explainability isn't just a nice-to-have feature—it's becoming a regulatory requirement and business necessity. From healthcare to finance, stakeholders need to understand and trust AI decisions.
Doctors need to understand why an AI system recommends a specific treatment. Explainability enables trust and better clinical decision-making.
Loan approval systems must explain their decisions to comply with regulations and ensure fair lending practices.
Legal document analysis systems need to highlight relevant passages and explain their reasoning for case preparation.
Understanding model behavior helps researchers identify biases, improve architectures, and develop better training strategies.
Local Explanations
Explain individual predictions (e.g., "This word contributed +0.3 to the sentiment score")
Global Explanations
Understand overall model behavior (e.g., "The model focuses on negation words for sentiment")
Counterfactual Explanations
Show what would change the prediction (e.g., "Removing 'not' would flip the sentiment")
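A minimal counterfactual check, sketched under the assumption of a generic predict_proba(texts) wrapper that returns [P(Negative), P(Positive)] for each text (the same wrapper used in the LIME example later in this guide):
# Compare the original text with a counterfactual that drops the negation
original = "This movie is not bad at all"
counterfactual = "This movie is bad"   # "not ... at all" removed
p_original = predict_proba([original])[0][1]               # P(Positive) for the original
p_counterfactual = predict_proba([counterfactual])[0][1]   # P(Positive) for the counterfactual
# A large drop in P(Positive) supports the claim that the negation drives the prediction
print(f"P(Positive): {p_original:.2f} -> {p_counterfactual:.2f}")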
Understanding individual predictions through local linear approximations
LIME is a model-agnostic method that explains individual predictions by approximating the model locally with an interpretable model. It's particularly powerful for NLP because it can highlight which words or phrases are most important for a prediction.
1. Select an instance to explain
2. Generate perturbed versions of the instance
3. Get predictions for the perturbed instances
4. Weight instances by proximity to the original
5. Train an interpretable model on the weighted data
6. Extract feature importance from the interpretable model
Text: "This movie is not bad at all"
Prediction: Positive (0.85)
LIME Explanation (generated by the code below):
from lime import lime_text
from lime.lime_text import LimeTextExplainer
import numpy as np
# Initialize LIME explainer
explainer = LimeTextExplainer(class_names=['Negative', 'Positive'])
# Define prediction function
def predict_proba(texts):
    # Your model's prediction function
    return model.predict_proba(texts)
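# Illustrative note (an assumption, not part of the original guide): the body of
# predict_proba depends on your model. A scikit-learn pipeline exposing
# predict_proba works as-is above; for a hypothetical Hugging Face classifier
# (assumed names: tokenizer, hf_model) it could look roughly like:
#     enc = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
#     with torch.no_grad():
#         logits = hf_model(**enc).logits
#     return torch.softmax(logits, dim=-1).numpy()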
# Explain a single instance
text = "This movie is not bad at all"
explanation = explainer.explain_instance(
    text,
    predict_proba,
    num_features=10
)
# Display explanation
explanation.show_in_notebook(text=True)
Game-theoretic approach to feature attribution with theoretical guarantees
SHAP provides a unified framework for explaining model outputs using concepts from cooperative game theory. It satisfies several desirable properties and provides both local and global explanations.
Fast, exact SHAP values for tree-based models (XGBoost, LightGBM, etc.)
shap.TreeExplainer(model)
Approximate SHAP values for deep learning models
shap.DeepExplainer(model, background)
Model-agnostic SHAP using sampling (slower but universal)
shap.KernelExplainer(model.predict, background)
Auto-selects the best explainer based on model type
shap.Explainer(model)
import shap
import numpy as np
# Initialize SHAP explainer
explainer = shap.Explainer(model)
# Calculate SHAP values
shap_values = explainer(texts)
# Visualize explanations
shap.plots.text(shap_values[0])
# Summary plot for global understanding
shap.summary_plot(shap_values, texts)
# Waterfall plot for individual prediction
shap.waterfall_plot(shap_values[0])
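The snippet above assumes `model` can already score raw text. One common concrete setup, sketched here as an assumption rather than as the guide's prescribed stack, wraps a Hugging Face transformers sentiment pipeline, which shap.Explainer accepts directly:
import shap
from transformers import pipeline
# Assumption: a standard Hugging Face sentiment-analysis pipeline that
# returns scores for every class (needed for per-class SHAP values)
clf = pipeline("sentiment-analysis", top_k=None)
explainer = shap.Explainer(clf)
shap_values = explainer(["This movie is not bad at all"])
# Token-level attributions for the first (and only) example
shap.plots.text(shap_values[0])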
| Aspect | SHAP | LIME |
|---|---|---|
| Theoretical Foundation | Game theory (Shapley values) | Local linear approximation |
| Consistency | Theoretically guaranteed | Can vary between runs |
| Computational Cost | Higher (exact Shapley values are exponential; approximations are used in practice) | Lower (a fixed number of perturbation samples) |
| Global Explanations | Yes (summary plots) | No (local only) |
Gradient-based attribution with axiomatic guarantees
Integrated Gradients is a gradient-based attribution method that satisfies the axioms of sensitivity and implementation invariance, making it particularly well suited to neural networks. Attributions are computed per input; global insight comes from aggregating them across many examples.
For an input x and baseline x', the Integrated Gradients attribution for feature i is:
IG_i(x) = (x_i - x'_i) * ∫_0^1 ∂F(x' + α(x - x')) / ∂x_i dα
where F is the model's score for the target class. In practice the integral is approximated by averaging gradients over a fixed number of interpolation steps between x' and x, as in the code below.
import torch
import numpy as np
def integrated_gradients(model, input_tensor, baseline_tensor,
                         target_class, steps=50):
    """
    Compute Integrated Gradients attribution
    """
    # Generate interpolated inputs
    alphas = torch.linspace(0, 1, steps + 1)
    interpolated_inputs = []
    for alpha in alphas:
        interpolated = baseline_tensor + alpha * (input_tensor - baseline_tensor)
        interpolated_inputs.append(interpolated)
    interpolated_inputs = torch.stack(interpolated_inputs)
    interpolated_inputs.requires_grad_(True)
    # Compute gradients
    outputs = model(interpolated_inputs)
    target_outputs = outputs[:, target_class]
    gradients = torch.autograd.grad(
        outputs=target_outputs.sum(),
        inputs=interpolated_inputs,
        create_graph=True
    )[0]
    # Average gradients and multiply by input difference
    avg_gradients = gradients.mean(dim=0)
    attributions = (input_tensor - baseline_tensor) * avg_gradients
    return attributions
Zero Baseline
All zeros - good for general attribution
Mean Baseline
Dataset mean - shows deviation from average
Random Baseline
Random noise - attributions are measured relative to a random input; often averaged over several random baselines to reduce baseline dependence
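A minimal usage sketch for the integrated_gradients function above with a zero baseline. The toy classifier over token embeddings is an assumption made only so the example is self-contained; in practice you would pass real token embeddings and your own model.
import torch
import torch.nn as nn
# Toy stand-in for a classifier over token embeddings (assumption for this sketch):
# mean-pool the embeddings, then map to two classes with a linear layer.
class ToyClassifier(nn.Module):
    def __init__(self, emb_dim=768, num_classes=2):
        super().__init__()
        self.fc = nn.Linear(emb_dim, num_classes)
    def forward(self, embeds):               # embeds: (batch, seq_len, emb_dim)
        return self.fc(embeds.mean(dim=1))   # logits: (batch, num_classes)
toy_model = ToyClassifier()
input_embeds = torch.randn(12, 768)              # one sentence, 12 token embeddings
zero_baseline = torch.zeros_like(input_embeds)   # zero baseline
attributions = integrated_gradients(toy_model, input_embeds, zero_baseline,
                                    target_class=1, steps=50)
token_scores = attributions.sum(dim=-1)          # per-token importance scores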
When to use which explainability method
| Method | Best For | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| LIME | Quick local explanations | Fast | Good | Debugging, exploration |
| SHAP | Theoretically sound explanations | Medium | Excellent | Production, research |
| Integrated Gradients | Neural network attribution | Fast | Excellent | Deep learning, differentiable models |
Need quick explanations?
→ Use LIME or Integrated Gradients
Need theoretical guarantees?
→ Use SHAP or Integrated Gradients
Working with neural networks?
→ Use Integrated Gradients
Need global understanding?
→ Use SHAP summary plots
Real-world strategies for implementing explainability in production
Sanity Checks
Verify explanations make intuitive sense
Consistency Tests
Check that similar inputs get similar explanations
Ablation Studies
Remove the most important features and verify that the prediction changes (see the sketch after this list)
Expert Review
Have domain experts validate explanations
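A minimal ablation check for text classifiers, sketched under the assumption that the explanation, predict_proba, and text objects from the LIME example above are available: remove the highest-attributed token, re-score the text, and confirm the predicted probability moves as the explanation suggests.
# Assumptions: `explanation`, `predict_proba`, and `text` come from the
# LIME example earlier in this guide; class order is [Negative, Positive].
weights = dict(explanation.as_list())                  # token -> attribution weight
top_token = max(weights, key=lambda t: abs(weights[t]))
ablated = " ".join(w for w in text.split() if w != top_token)
p_before = predict_proba([text])[0][1]                 # P(Positive) before ablation
p_after = predict_proba([ablated])[0][1]               # P(Positive) after ablation
print(f"Removed '{top_token}': P(Positive) {p_before:.2f} -> {p_after:.2f}")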
Guidelines for effective explainability implementation
Different baselines can lead to very different explanations. Always document your baseline choice and consider its impact on interpretation.
High attribution doesn't mean the feature causes the prediction. Consider confounding variables and causal relationships.
Small attribution values may just be noise. Focus on features with large-magnitude attributions and check that they remain stable across repeated runs before drawing conclusions.
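One way to run that stability check, sketched under the assumption that the LIME explainer, predict_proba, and text from the earlier example are available: re-run the explainer several times and keep only tokens whose mean weight clearly exceeds its spread across runs.
import numpy as np
from collections import defaultdict
# Assumptions: `explainer`, `predict_proba`, and `text` are the objects
# defined in the LIME example earlier in this guide.
weights_by_token = defaultdict(list)
for _ in range(10):
    exp = explainer.explain_instance(text, predict_proba, num_features=10)
    for token, weight in exp.as_list():
        weights_by_token[token].append(weight)
# Keep tokens that appear in most runs and whose mean weight clearly exceeds its spread
stable = {t: float(np.mean(w)) for t, w in weights_by_token.items()
          if len(w) >= 8 and abs(np.mean(w)) > 2 * np.std(w)}
print("Stable attributions:", stable)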
Essential libraries and resources for NLP explainability