LIME, SHAP, and Integrated Gradients
SuNaAI Lab
Technical Guide Series
Demystifying the black box: How to understand what your language models are really thinking
As language models become more powerful and ubiquitous, understanding their decision-making process becomes crucial. Explainability in NLP isn't just about satisfying curiosity—it's about building trust, ensuring fairness, and enabling better model development.
Modern NLP models contain millions or billions of parameters and rely on complex attention mechanisms. Traditional debugging approaches don't work here; we need specialized techniques to understand what is happening inside these systems.
The critical importance of understanding AI decisions in real-world applications
Explainability isn't just a nice-to-have feature—it's becoming a regulatory requirement and business necessity. From healthcare to finance, stakeholders need to understand and trust AI decisions.
Doctors need to understand why an AI system recommends a specific treatment. Explainability enables trust and better clinical decision-making.
Loan approval systems must explain their decisions to comply with regulations and ensure fair lending practices.
Legal document analysis systems need to highlight relevant passages and explain their reasoning for case preparation.
Understanding model behavior helps researchers identify biases, improve architectures, and develop better training strategies.
Local Explanations
Explain individual predictions (e.g., "This word contributed +0.3 to the sentiment score")
Global Explanations
Understand overall model behavior (e.g., "The model focuses on negation words for sentiment")
Counterfactual Explanations
Show what would change the prediction (e.g., "Removing 'not' would flip the sentiment")
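A minimal counterfactual check, sketched under the assumption of a generic predict_proba(texts) wrapper that returns [P(Negative), P(Positive)] for each text (the same wrapper used in the LIME example later in this guide):
# Compare the original text with a counterfactual that drops the negation
original = "This movie is not bad at all"
counterfactual = "This movie is bad"   # "not ... at all" removed
p_original = predict_proba([original])[0][1]               # P(Positive) for the original
p_counterfactual = predict_proba([counterfactual])[0][1]   # P(Positive) for the counterfactual
# A large drop in P(Positive) supports the claim that the negation drives the prediction
print(f"P(Positive): {p_original:.2f} -> {p_counterfactual:.2f}")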
Understanding individual predictions through local linear approximations
LIME is a model-agnostic method that explains individual predictions by approximating the model locally with an interpretable model. It's particularly powerful for NLP because it can highlight which words or phrases are most important for a prediction.
1. Select an instance to explain
2. Generate perturbed versions of the instance
3. Get predictions for the perturbed instances
4. Weight instances by proximity to the original
5. Train an interpretable model on the weighted data
6. Extract feature importance from the interpretable model
Text: "This movie is not bad at all"
Prediction: Positive (0.85)
LIME Explanation (generated by the code below):
from lime import lime_text
from lime.lime_text import LimeTextExplainer
import numpy as np
# Initialize LIME explainer
explainer = LimeTextExplainer(class_names=['Negative', 'Positive'])
# Define prediction function
def predict_proba(texts):
    # Your model's prediction function
    return model.predict_proba(texts)
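# Illustrative note (an assumption, not part of the original guide): the body of
# predict_proba depends on your model. A scikit-learn pipeline exposing
# predict_proba works as-is above; for a hypothetical Hugging Face classifier
# (assumed names: tokenizer, hf_model) it could look roughly like:
#     enc = tokenizer(list(texts), padding=True, truncation=True, return_tensors="pt")
#     with torch.no_grad():
#         logits = hf_model(**enc).logits
#     return torch.softmax(logits, dim=-1).numpy()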
# Explain a single instance
text = "This movie is not bad at all"
explanation = explainer.explain_instance(
    text,
    predict_proba,
    num_features=10
)
# Display explanation
explanation.show_in_notebook(text=True)
Game-theoretic approach to feature attribution with theoretical guarantees
SHAP provides a unified framework for explaining model outputs using concepts from cooperative game theory. It satisfies several desirable properties and provides both local and global explanations.
Fast, exact SHAP values for tree-based models (XGBoost, LightGBM, etc.)
shap.TreeExplainer(model)
Approximate SHAP values for deep learning models
shap.DeepExplainer(model, background)
Model-agnostic SHAP using sampling (slower but universal)
shap.KernelExplainer(model.predict, background)
Auto-selects the best explainer based on model type
shap.Explainer(model)
import shap
import numpy as np
# Initialize SHAP explainer
explainer = shap.Explainer(model)
# Calculate SHAP values
shap_values = explainer(texts)
# Visualize explanations
shap.plots.text(shap_values[0])
# Summary plot for global understanding
shap.summary_plot(shap_values, texts)
# Waterfall plot for individual prediction
shap.waterfall_plot(shap_values[0])
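The snippet above assumes `model` can already score raw text. One common concrete setup, sketched here as an assumption rather than as the guide's prescribed stack, wraps a Hugging Face transformers sentiment pipeline, which shap.Explainer accepts directly:
import shap
from transformers import pipeline
# Assumption: a standard Hugging Face sentiment-analysis pipeline that
# returns scores for every class (needed for per-class SHAP values)
clf = pipeline("sentiment-analysis", top_k=None)
explainer = shap.Explainer(clf)
shap_values = explainer(["This movie is not bad at all"])
# Token-level attributions for the first (and only) example
shap.plots.text(shap_values[0])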
| Aspect | SHAP | LIME |
|---|---|---|
| Theoretical Foundation | Game theory (Shapley values) | Local linear approximation |
| Consistency | Theoretically guaranteed | Can vary between runs |
| Computational Cost | Higher (exact Shapley values are exponential; approximations are used in practice) | Lower (a fixed number of perturbation samples) |
| Global Explanations | Yes (summary plots) | No (local only) |
Gradient-based attribution with axiomatic guarantees
Integrated Gradients is a gradient-based attribution method that satisfies the axioms of sensitivity and implementation invariance, making it particularly well suited to neural networks. Attributions are computed per input; global insight comes from aggregating them across many examples.
For an input x and baseline x', the Integrated Gradients attribution for feature i is:
IG_i(x) = (x_i - x'_i) * ∫_0^1 ∂F(x' + α(x - x')) / ∂x_i dα
where F is the model's score for the target class. In practice the integral is approximated by averaging gradients over a fixed number of interpolation steps between x' and x, as in the code below.
import torch
import numpy as np
def integrated_gradients(model, input_tensor, baseline_tensor,
                         target_class, steps=50):
    """
    Compute Integrated Gradients attribution
    """
    # Generate interpolated inputs
    alphas = torch.linspace(0, 1, steps + 1)
    interpolated_inputs = []
    for alpha in alphas:
        interpolated = baseline_tensor + alpha * (input_tensor - baseline_tensor)
        interpolated_inputs.append(interpolated)
    interpolated_inputs = torch.stack(interpolated_inputs)
    interpolated_inputs.requires_grad_(True)
    # Compute gradients
    outputs = model(interpolated_inputs)
    target_outputs = outputs[:, target_class]
    gradients = torch.autograd.grad(
        outputs=target_outputs.sum(),
        inputs=interpolated_inputs,
        create_graph=True
    )[0]
    # Average gradients and multiply by input difference
    avg_gradients = gradients.mean(dim=0)
    attributions = (input_tensor - baseline_tensor) * avg_gradients
    return attributions
Zero Baseline
All zeros - good for general attribution
Mean Baseline
Dataset mean - shows deviation from average
Random Baseline
Random noise - attributions are measured relative to a random input; often averaged over several random baselines to reduce baseline dependence
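A minimal usage sketch for the integrated_gradients function above with a zero baseline. The toy classifier over token embeddings is an assumption made only so the example is self-contained; in practice you would pass real token embeddings and your own model.
import torch
import torch.nn as nn
# Toy stand-in for a classifier over token embeddings (assumption for this sketch):
# mean-pool the embeddings, then map to two classes with a linear layer.
class ToyClassifier(nn.Module):
    def __init__(self, emb_dim=768, num_classes=2):
        super().__init__()
        self.fc = nn.Linear(emb_dim, num_classes)
    def forward(self, embeds):               # embeds: (batch, seq_len, emb_dim)
        return self.fc(embeds.mean(dim=1))   # logits: (batch, num_classes)
toy_model = ToyClassifier()
input_embeds = torch.randn(12, 768)              # one sentence, 12 token embeddings
zero_baseline = torch.zeros_like(input_embeds)   # zero baseline
attributions = integrated_gradients(toy_model, input_embeds, zero_baseline,
                                    target_class=1, steps=50)
token_scores = attributions.sum(dim=-1)          # per-token importance scores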
When to use which explainability method
| Method | Best For | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| LIME | Quick local explanations | Fast | Good | Debugging, exploration |
| SHAP | Theoretically sound explanations | Medium | Excellent | Production, research |
| Integrated Gradients | Neural network attribution | Fast | Excellent | Deep learning, differentiable models |
Need quick explanations?
→ Use LIME or Integrated Gradients
Need theoretical guarantees?
→ Use SHAP or Integrated Gradients
Working with neural networks?
→ Use Integrated Gradients
Need global understanding?
→ Use SHAP summary plots
Real-world strategies for implementing explainability in production
Sanity Checks
Verify explanations make intuitive sense
Consistency Tests
Check that similar inputs get similar explanations
Ablation Studies
Remove the most important features and verify that the prediction changes (see the sketch after this list)
Expert Review
Have domain experts validate explanations
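A minimal ablation check for text classifiers, sketched under the assumption that the explanation, predict_proba, and text objects from the LIME example above are available: remove the highest-attributed token, re-score the text, and confirm the predicted probability moves as the explanation suggests.
# Assumptions: `explanation`, `predict_proba`, and `text` come from the
# LIME example earlier in this guide; class order is [Negative, Positive].
weights = dict(explanation.as_list())                  # token -> attribution weight
top_token = max(weights, key=lambda t: abs(weights[t]))
ablated = " ".join(w for w in text.split() if w != top_token)
p_before = predict_proba([text])[0][1]                 # P(Positive) before ablation
p_after = predict_proba([ablated])[0][1]               # P(Positive) after ablation
print(f"Removed '{top_token}': P(Positive) {p_before:.2f} -> {p_after:.2f}")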
Guidelines for effective explainability implementation
Different baselines can lead to very different explanations. Always document your baseline choice and consider its impact on interpretation.
High attribution doesn't mean the feature causes the prediction. Consider confounding variables and causal relationships.
Small attribution values may just be noise. Focus on features with large-magnitude attributions and check that they remain stable across repeated runs before drawing conclusions.
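One way to run that stability check, sketched under the assumption that the LIME explainer, predict_proba, and text from the earlier example are available: re-run the explainer several times and keep only tokens whose mean weight clearly exceeds its spread across runs.
import numpy as np
from collections import defaultdict
# Assumptions: `explainer`, `predict_proba`, and `text` are the objects
# defined in the LIME example earlier in this guide.
weights_by_token = defaultdict(list)
for _ in range(10):
    exp = explainer.explain_instance(text, predict_proba, num_features=10)
    for token, weight in exp.as_list():
        weights_by_token[token].append(weight)
# Keep tokens that appear in most runs and whose mean weight clearly exceeds its spread
stable = {t: float(np.mean(w)) for t, w in weights_by_token.items()
          if len(w) >= 8 and abs(np.mean(w)) > 2 * np.std(w)}
print("Stable attributions:", stable)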
Essential libraries and resources for NLP explainability