NLP Explainability

LIME, SHAP, and Integrated Gradients

SuNaAI Lab

Technical Guide Series

Chapter 1: Understanding NLP Explainability

Demystifying the black box: How to understand what your language models are really thinking

As language models become more powerful and ubiquitous, understanding their decision-making process becomes crucial. Explainability in NLP isn't just about satisfying curiosity—it's about building trust, ensuring fairness, and enabling better model development.

The Explainability Challenge

Modern NLP models combine millions (and often billions) of parameters with complex attention mechanisms. Traditional debugging approaches don't work; we need specialized techniques to understand what's happening inside these systems.

What You'll Learn

  • How LIME works and when to use it for local explanations
  • SHAP's game-theoretic approach to feature attribution
  • Integrated Gradients for gradient-based explanations
  • Practical implementation strategies for each method
  • Best practices for interpreting and validating explanations
  • Common pitfalls and how to avoid them

Chapter 2: Why Explainability Matters

The critical importance of understanding AI decisions in real-world applications

Explainability isn't just a nice-to-have feature—it's becoming a regulatory requirement and business necessity. From healthcare to finance, stakeholders need to understand and trust AI decisions.

🏥 Healthcare

Doctors need to understand why an AI system recommends a specific treatment. Explainability enables trust and better clinical decision-making.

🏦 Finance

Loan approval systems must explain their decisions to comply with regulations and ensure fair lending practices.

⚖️ Legal

Legal document analysis systems need to highlight relevant passages and explain their reasoning for case preparation.

🔍 Research

Understanding model behavior helps researchers identify biases, improve architectures, and develop better training strategies.

Types of Explanations

1. Local Explanations: Explain individual predictions (e.g., "This word contributed +0.3 to the sentiment score")

2. Global Explanations: Understand overall model behavior (e.g., "The model focuses on negation words for sentiment")

3. Counterfactual Explanations: Show what would change the prediction (e.g., "Removing 'not' would flip the sentiment")
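
To make the counterfactual idea concrete, the minimal check below compares a model's prediction on the original sentence against a version with the negation removed. The `model` object and its `predict_proba` method are assumptions for illustration, not part of any particular library.

# Counterfactual check (sketch): does removing "not" flip the sentiment?
# `model` is a hypothetical classifier exposing predict_proba(list_of_texts).
original = "This movie is not bad at all"
counterfactual = original.replace("not ", "")   # "This movie is bad at all"

p_original = model.predict_proba([original])[0]
p_counterfactual = model.predict_proba([counterfactual])[0]

print("original:", p_original)             # expected: mostly positive
print("without 'not':", p_counterfactual)  # expected: shifted toward negative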

Chapter 3: LIME (Local Interpretable Model-agnostic Explanations)

Understanding individual predictions through local linear approximations

LIME is a model-agnostic method that explains individual predictions by approximating the model locally with an interpretable model. It's particularly powerful for NLP because it can highlight which words or phrases are most important for a prediction.

LIME Algorithm
1. Select an instance to explain
2. Generate perturbed versions of the instance
3. Get predictions for perturbed instances
4. Weight instances by proximity to original
5. Train interpretable model on weighted data
6. Extract feature importance from interpretable model
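
The six steps above translate almost directly into code. Below is a simplified from-scratch sketch for text (word-dropping perturbations, an exponential proximity kernel, and a weighted linear surrogate); the `predict_proba` callable is assumed to wrap your classifier, and the real `lime` package handles sampling, kernels, and feature selection far more carefully.

import numpy as np
from sklearn.linear_model import Ridge

def lime_text_sketch(text, predict_proba, num_samples=500, kernel_width=0.75):
    """Toy LIME for text: perturb by dropping words, weight samples by
    proximity to the original, and fit a weighted linear surrogate."""
    words = text.split()
    n = len(words)

    # Steps 1-2: sample binary masks; 1 keeps a word, 0 drops it
    masks = np.random.randint(0, 2, size=(num_samples, n))
    masks[0] = 1  # keep the unperturbed instance in the sample
    perturbed = [" ".join(w for w, keep in zip(words, m) if keep) for m in masks]

    # Step 3: black-box predictions for the perturbed texts (positive class)
    preds = predict_proba(perturbed)[:, 1]

    # Step 4: proximity weights; fewer dropped words means a higher weight
    distances = 1.0 - masks.mean(axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)

    # Steps 5-6: weighted interpretable model; coefficients = word importances
    surrogate = Ridge(alpha=1.0).fit(masks, preds, sample_weight=weights)
    return sorted(zip(words, surrogate.coef_), key=lambda t: -abs(t[1]))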

LIME for Text Classification

Example: Sentiment Analysis

Text: "This movie is not bad at all"

Prediction: Positive (0.85)

LIME Explanation:

  • "not bad" → +0.4 (strong positive indicator)
  • "at all" → +0.2 (reinforces positive sentiment)
  • "This movie" → +0.1 (neutral context)

Implementation with Python

LIME Implementation
from lime.lime_text import LimeTextExplainer

# Initialize the LIME explainer with human-readable class names
explainer = LimeTextExplainer(class_names=['Negative', 'Positive'])

# Prediction function: takes a list of texts and returns an array of
# class probabilities with shape (n_samples, n_classes)
def predict_proba(texts):
    return model.predict_proba(texts)  # `model` is your trained classifier

# Explain a single instance
text = "This movie is not bad at all"
explanation = explainer.explain_instance(
    text,
    predict_proba,
    num_features=10
)

# Display explanation
explanation.show_in_notebook(text=True)
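
Outside a notebook, the explanation object can also be inspected programmatically or exported as a standalone report:

# Raw (word, weight) pairs and an HTML export of the explanation
print(explanation.as_list())
explanation.save_to_file("lime_explanation.html")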

LIME Advantages & Limitations

✅ Advantages:

  • Model-agnostic (works with any model)
  • Intuitive word-level explanations
  • Easy to implement and understand
  • Good for debugging individual predictions
  • Works well with text data

❌ Limitations:

  • Only local explanations (not global)
  • Can be unstable (different runs may vary)
  • Computationally expensive
  • May not capture complex interactions
  • Sensitive to the perturbation method

Chapter 4: SHAP (SHapley Additive exPlanations)

Game-theoretic approach to feature attribution with theoretical guarantees

SHAP provides a unified framework for explaining model outputs using concepts from cooperative game theory. It satisfies several desirable properties and provides both local and global explanations.

SHAP Properties

  • Efficiency: The SHAP values plus the base value (the expected model output) sum exactly to the model's output for the instance
  • Symmetry: Features that contribute identically receive equal SHAP values
  • Dummy: Features that never affect the output receive zero SHAP value
  • Additivity: For a sum of two models, the SHAP values are the sum of each model's SHAP values
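
The efficiency property is easy to check numerically. The sketch below uses a small tree-based regressor purely as an illustration; the synthetic data and model are stand-ins, and the same check works for any explainer via its base (expected) value.

import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy regression model; any tree ensemble works with TreeExplainer
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model, feature_perturbation="tree_path_dependent")
shap_values = explainer.shap_values(X[:10])

# Efficiency: base value + sum of SHAP values reproduces each prediction
reconstructed = explainer.expected_value + shap_values.sum(axis=1)
print(np.allclose(reconstructed, model.predict(X[:10])))  # True (up to float error)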

SHAP Variants

TreeExplainer

Fast, exact SHAP values for tree-based models (XGBoost, LightGBM, etc.)

shap.TreeExplainer(model)

DeepExplainer

Approximate SHAP values for deep learning models

shap.DeepExplainer(model, background)

KernelExplainer

Model-agnostic SHAP using sampling (slower but universal; see the sketch after this list)

shap.KernelExplainer(model.predict, background)

Explainer

Auto-selects best explainer based on model type

shap.Explainer(model)
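
Because KernelExplainer works on numeric feature vectors rather than raw strings, text usually has to be vectorized first. A minimal sketch, assuming a TF-IDF vectorizer with a logistic-regression classifier and pre-existing `train_texts` / `train_labels`:

import shap
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: train_texts (list of str), train_labels (list of int)
vectorizer = TfidfVectorizer(max_features=500)
X_train = vectorizer.fit_transform(train_texts).toarray()
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)

# KernelExplainer needs a numeric background dataset, not raw strings
background = shap.sample(X_train, 50)
explainer = shap.KernelExplainer(clf.predict_proba, background)

# Explain one vectorized document (slow: cost grows with the number of features)
x = vectorizer.transform(["This movie is not bad at all"]).toarray()
shap_values = explainer.shap_values(x)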

SHAP Implementation

SHAP for Text Classification
import shap

# Initialize the SHAP explainer; for raw text, `model` is typically a
# Hugging Face pipeline or a callable paired with a text masker
explainer = shap.Explainer(model)

# Calculate SHAP values for a batch of input texts
shap_values = explainer(texts)

# Visualize the explanation for a single text
shap.plots.text(shap_values[0])

# Summary plot for global understanding
shap.summary_plot(shap_values, texts)

# Waterfall plot for an individual prediction
shap.waterfall_plot(shap_values[0])
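
For the snippet above to work on raw text, `model` has to be something SHAP can both call and mask at the token level. One well-supported setup (shown here as an example, not the only option) is a Hugging Face pipeline that returns scores for every class:

import shap
from transformers import pipeline

# Example sentiment model; top_k=None returns scores for all classes
# (use return_all_scores=True on older transformers versions)
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    top_k=None,
)

explainer = shap.Explainer(classifier)  # SHAP infers a text masker from the tokenizer
shap_values = explainer(["This movie is not bad at all"])
shap.plots.text(shap_values[0])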

SHAP vs LIME

Aspect | SHAP | LIME
Theoretical Foundation | Game theory (Shapley values) | Local linear approximation
Consistency | Theoretically guaranteed | Can vary between runs
Computational Cost | Higher (exponential in worst case) | Lower (polynomial)
Global Explanations | Yes (summary plots) | No (local only)

Chapter 5: Integrated Gradients

Gradient-based attribution with axiomatic guarantees

Integrated Gradients is a gradient-based attribution method that satisfies the axioms of sensitivity and implementation invariance. It is particularly well suited to neural networks, producing per-prediction attributions that can be aggregated into a global picture of model behavior.

Key Axioms

  • Sensitivity: If the input and baseline differ in a single feature and the predictions differ, that feature receives a non-zero attribution
  • Implementation Invariance: Functionally equivalent models receive identical attributions
  • Completeness: Attributions sum to the difference between the model's output at the input and at the baseline

Mathematical Foundation

For an input x and baseline x', the Integrated Gradients attribution for feature i is:

IG_i(x) = (x_i - x'_i) × ∫[α=0 to 1] ∂F(x' + α(x - x'))/∂x_i dα

Implementation

Integrated Gradients Implementation
import torch

def integrated_gradients(model, input_tensor, baseline_tensor,
                         target_class, steps=50):
    """
    Approximate Integrated Gradients attributions with a Riemann sum
    over the straight-line path from the baseline to the input.
    """
    # Generate interpolated inputs along the path
    alphas = torch.linspace(0, 1, steps + 1)
    interpolated_inputs = []

    for alpha in alphas:
        interpolated = baseline_tensor + alpha * (input_tensor - baseline_tensor)
        interpolated_inputs.append(interpolated)

    interpolated_inputs = torch.stack(interpolated_inputs).detach()
    interpolated_inputs.requires_grad_(True)

    # Forward pass; gradients of the target-class score w.r.t. the inputs
    outputs = model(interpolated_inputs)
    target_outputs = outputs[:, target_class]

    gradients = torch.autograd.grad(
        outputs=target_outputs.sum(),
        inputs=interpolated_inputs
    )[0]

    # Average the gradients along the path and scale by the input difference
    avg_gradients = gradients.mean(dim=0)
    attributions = (input_tensor - baseline_tensor) * avg_gradients

    return attributions
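
In practice you rarely need to hand-roll this loop: Captum (listed in Chapter 9) ships an IntegratedGradients implementation with batching and a convergence check. A sketch, assuming the same `model`, embedded input, baseline, and `target_class` as above:

from captum.attr import IntegratedGradients

# `model`, `input_tensor`, `baseline_tensor`, `target_class` as defined above;
# Captum expects a leading batch dimension, hence unsqueeze(0)
ig = IntegratedGradients(model)
attributions, delta = ig.attribute(
    input_tensor.unsqueeze(0),
    baselines=baseline_tensor.unsqueeze(0),
    target=target_class,
    n_steps=50,
    return_convergence_delta=True,  # should be near zero when completeness holds
)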

Choosing a Baseline

1. Zero Baseline: all zeros; a good general-purpose reference point

2. Mean Baseline: the dataset mean; attributions show deviation from the average input

3. Random Baseline: random noise; attributions show what remains relative to an uninformative reference
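
In code, the three options differ only in how the baseline tensor is constructed. A sketch, assuming an embedded input `input_tensor` (as in the implementation above) and access to the model's embedding matrix `embedding_weights`:

import torch

# 1. Zero baseline: the "absence of signal" reference point
zero_baseline = torch.zeros_like(input_tensor)

# 2. Mean baseline: attributions measure deviation from the average embedding
#    (embedding_weights is the model's embedding matrix, assumed available)
mean_baseline = embedding_weights.mean(dim=0).expand_as(input_tensor)

# 3. Random baseline: attributions relative to noise; often averaged over
#    several random draws to reduce variance
random_baseline = torch.randn_like(input_tensor)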

Chapter 6: Method Comparison

When to use which explainability method

Method | Best For | Speed | Accuracy | Use Case
LIME | Quick local explanations | Fast | Good | Debugging, exploration
SHAP | Theoretically sound explanations | Medium | Excellent | Production, research
Integrated Gradients | Neural network attribution | Fast | Excellent | Deep learning, gradients

Decision Framework

1. Need quick explanations? → Use LIME or Integrated Gradients

2. Need theoretical guarantees? → Use SHAP or Integrated Gradients

3. Working with neural networks? → Use Integrated Gradients

4. Need global understanding? → Use SHAP summary plots

Chapter 7: Practical Implementation

Real-world strategies for implementing explainability in production

Production Considerations

⚡ Performance

  • Cache explanations for common inputs (see the sketch after this list)
  • Use approximate methods for speed
  • Implement async explanation generation
  • Consider explanation quality vs. speed trade-offs
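
A sketch of the caching point above, assuming explanations are deterministic for a given text and model version; `explain_text` is a hypothetical wrapper around whichever method (LIME, SHAP, or IG) you deploy:

from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_explanation(text: str) -> tuple:
    # explain_text is a hypothetical, possibly slow explanation routine;
    # the result is converted to a hashable tuple of (token, score) pairs
    return tuple(explain_text(text))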

🔒 Security

  • Validate explanation inputs
  • Prevent explanation-based attacks
  • Sanitize sensitive information
  • Implement access controls

📊 Monitoring

  • Track explanation quality metrics
  • Monitor explanation consistency
  • Alert on unusual patterns
  • Log explanation generation times

🎯 User Experience

  • Present explanations clearly
  • Provide context and interpretation
  • Allow users to explore details
  • Offer multiple explanation formats

Validation Strategies

Sanity Checks

Verify explanations make intuitive sense

Consistency Tests

Check that similar inputs get similar explanations

Ablation Studies

Remove important features and verify the prediction changes (see the sketch below)

Expert Review

Have domain experts validate explanations
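
A sketch of the ablation check referenced above: drop the highest-attributed word and confirm the prediction moves in the expected direction. The `model` object and `attributions` (a list of (word, score) pairs, e.g. from LIME's as_list()) are assumed inputs.

def ablation_check(text, attributions, model, target_class=1):
    """Remove the top-attributed word and measure the prediction shift."""
    top_word = max(attributions, key=lambda t: abs(t[1]))[0]
    ablated = " ".join(w for w in text.split() if w != top_word)

    before = model.predict_proba([text])[0][target_class]
    after = model.predict_proba([ablated])[0][target_class]

    # A large |before - after| supports the attribution; a negligible change
    # suggests the explanation overstates that word's importance.
    return top_word, before - after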

Chapter 8: Best Practices

Guidelines for effective explainability implementation

Do's and Don'ts

✅ Do:

  • Use multiple explanation methods
  • Validate explanations with domain experts
  • Consider your audience's technical level
  • Document explanation methodology
  • Test explanations on edge cases
  • Monitor explanation quality over time
  • Provide context for explanations
  • Use appropriate baselines

❌ Don't:

  • Trust explanations blindly
  • Use only one explanation method
  • Ignore computational costs
  • Present explanations without context
  • Assume explanations are always correct
  • Overinterpret small attribution values
  • Ignore baseline choice impact
  • Skip validation steps

Common Pitfalls

Pitfall 1: Baseline Sensitivity

Different baselines can lead to very different explanations. Always document your baseline choice and consider its impact on interpretation.

Pitfall 2: Correlation vs Causation

High attribution doesn't mean the feature causes the prediction. Consider confounding variables and causal relationships.

Pitfall 3: Overinterpretation

Small attribution values might be noise. Focus on features with significantly high attributions and consider statistical significance.

Chapter 9: Tools & Resources

Essential libraries and resources for NLP explainability

Python Libraries

  • SHAP - SHapley Additive exPlanations
  • LIME - Local Interpretable Model-agnostic Explanations
  • Captum - PyTorch attribution library
  • Alibi - model-agnostic and model-specific explanation algorithms
  • Transformers Interpret - explanations for Hugging Face Transformers models

Research Papers

  • "Why Should I Trust You?": Explaining the Predictions of Any Classifier (Ribeiro, Singh & Guestrin, 2016) - LIME
  • A Unified Approach to Interpreting Model Predictions (Lundberg & Lee, 2017) - SHAP
  • Axiomatic Attribution for Deep Networks (Sundararajan, Taly & Yan, 2017) - Integrated Gradients

Benchmarks & Datasets

  • SST-2 - Stanford Sentiment Treebank
  • IMDB - Movie review sentiment
  • AG News - News classification
  • SNLI - Natural language inference
  • SQuAD - Question answering
