eazyware
Playbook·April 17, 2023·11 min read

AI explainability techniques in 2026

SHAP, LIME, integrated gradients, attention visualization, counterfactuals. The techniques for making AI systems interpretable and when each works.

Kushal R.
Engineering lead

Explainability for AI systems matters for different reasons to different audiences: debugging for engineers, justification for affected users, accountability for regulators, understanding for researchers. The technique set has matured but remains fragmented. This post covers the explainability toolkit in 2026 and offers guidance on what to use when.

Three categories
Figure: Explainability techniques. Feature attribution (SHAP, LIME, integrated gradients): what drove this output. Counterfactuals (minimal perturbation): what would change the result; actionable for users. Mechanistic (circuits and features in models): interpretability research at Anthropic, DeepMind, OpenAI.

Feature attribution

SHAP (SHapley Additive exPlanations). Based on Shapley values from cooperative game theory; distributes the gap between a prediction and a baseline expectation across features. Mature tools; additive outputs are easy to read.
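A minimal sketch of the idea behind SHAP, assuming a hypothetical three-feature toy model and an all-zero baseline (neither comes from the SHAP library). It computes exact Shapley values by enumerating every coalition, which is only feasible for a handful of features; the SHAP library relies on approximations and model-specific algorithms instead.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one input by enumerating all feature coalitions.

    Features outside a coalition are replaced by their baseline value,
    which is one simplifying way to represent a "missing" feature.
    """
    n = len(x)

    def value(coalition):
        # Coalition features come from x, all others from the baseline.
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight shared by every coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                phi[i] += weight * (value(members | {i}) - value(members))
    return phi


def toy_model(f):
    # Hypothetical model with an interaction between features 0 and 1.
    return 2.0 * f[0] + 1.0 * f[1] + 0.5 * f[0] * f[1] - 1.0 * f[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The attributions sum to the gap between the prediction for x and the prediction for the baseline, and the interaction term is split evenly between features 0 and 1.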

LIME (Local Interpretable Model-agnostic Explanations). Approximates the model around a single input with a simple interpretable model, typically a sparse linear one. Explains specific predictions, not global behavior.
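A rough sketch of the local-surrogate idea behind LIME, not the LIME library itself. It assumes Gaussian perturbations around the input, an exponential proximity kernel, and a plain weighted least-squares fit; the black-box function, the sampling scale, and the kernel width are all hypothetical choices made for illustration.

```python
import math
import random


def solve_linear_system(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (m[r][n] - tail) / m[r][r]
    return solution


def local_surrogate(predict, x, n_samples=500, scale=0.1, kernel_width=0.25, seed=0):
    """Fit a weighted linear model to predict() in a neighbourhood of x."""
    rng = random.Random(seed)
    rows, targets, weights = [], [], []
    for _ in range(n_samples):
        z = [xi + rng.gauss(0.0, scale) for xi in x]
        offsets = [zi - xi for zi, xi in zip(z, x)]
        dist2 = sum(o * o for o in offsets)
        rows.append([1.0] + offsets)  # leading 1.0 is the intercept column
        targets.append(predict(z))
        weights.append(math.exp(-dist2 / kernel_width ** 2))  # closer samples count more
    k = len(x) + 1
    # Weighted normal equations: (X^T W X) coef = X^T W y
    xtwx = [[sum(w * r[i] * r[j] for w, r in zip(weights, rows)) for j in range(k)]
            for i in range(k)]
    xtwy = [sum(w * r[i] * t for w, r, t in zip(weights, rows, targets)) for i in range(k)]
    coef = solve_linear_system(xtwx, xtwy)
    return coef[0], coef[1:]


def black_box(z):
    # Hypothetical nonlinear model; its true local slopes at [2, 1] are [4, 3].
    return z[0] ** 2 + 3.0 * z[1]


intercept, slopes = local_surrogate(black_box, [2.0, 1.0])
print(intercept, slopes)
```

The fitted slopes are the explanation: they describe how the model responds near this one input and say nothing about its behavior elsewhere.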

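Integrated gradients. For differentiable models such as neural networks; attributes a prediction by accumulating gradients along a straight path from a baseline input to the actual input.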
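A small sketch of integrated gradients under stated assumptions: a hypothetical one-unit toy network, an all-zero baseline, finite-difference gradients in place of autodiff, and a midpoint Riemann sum for the path integral. Libraries such as Captum compute the gradients with autodiff instead.

```python
import math


def numerical_gradient(f, x, eps=1e-5):
    """Central-difference gradient; a stand-in for autodiff in this sketch."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=200):
    """Average the gradient along the straight path baseline -> x, then scale by (x - baseline)."""
    n = len(x)
    total = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of each sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        for i in range(n):
            total[i] += grad[i]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def toy_network(v):
    # Hypothetical network: one tanh hidden unit plus a linear skip term.
    hidden = math.tanh(1.5 * v[0] - 0.5 * v[1])
    return 2.0 * hidden + 0.3 * v[1]


x = [1.0, 2.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(toy_network, x, baseline)
print(attributions, sum(attributions), toy_network(x) - toy_network(baseline))
```

The attributions should sum, up to numerical error, to the difference between the output at x and the output at the baseline. The choice of baseline is an assumption that changes the result.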

What you get. 'Feature X contributed +0.3 to this prediction; feature Y contributed -0.1.' Numerical, additive.

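Limitations. Attributions describe what the model relies on, not what causes outcomes in the real world; local explanations do not describe global behavior; sampling-based methods can give different attributions across runs.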

Counterfactuals

What minimal change to input would change the output? 'If your income were $5K higher, loan would be approved.'

Actionable for users. Tells affected person what to change. Regulatory fit for adverse action notices.

Minimal perturbation. Change the least to get different result. Various algorithms (DiCE, FACE, etc.).

Multiple counterfactuals. 'Here are three ways to get approved.' Gives options.

Challenges. Immutable or non-actionable features (race, age) must be held fixed, never offered as the thing to change. Realistic counterfactuals are harder to find than arbitrary ones.
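The counterfactual search described in this section can be sketched as a brute-force scan, assuming a hypothetical integer scoring rule in place of a trained model and hand-picked change costs. This is not how DiCE or FACE work internally; it only illustrates minimal perturbation, multiple options, and holding immutable features fixed.

```python
from itertools import product


def approve(applicant):
    """Hypothetical scoring rule standing in for a trained credit model."""
    score = (4 * applicant["income_k"]
             - 50 * applicant["open_debts"]
             + 2 * applicant["age"])
    return score >= 300


def find_counterfactuals(predict, applicant, candidate_changes, unit_cost, top_k=3):
    """Return the top_k cheapest changes that flip the decision to approved.

    Only features listed in candidate_changes are ever modified, so anything
    left out (here: age) is held fixed.
    """
    names = list(candidate_changes)
    found = []
    for deltas in product(*(candidate_changes[name] for name in names)):
        if all(d == 0 for d in deltas):
            continue  # no change, not a counterfactual
        changed = dict(applicant)
        for name, delta in zip(names, deltas):
            changed[name] += delta
        if predict(changed):
            cost = sum(abs(d) * unit_cost[name] for name, d in zip(names, deltas))
            found.append((cost, dict(zip(names, deltas))))
    found.sort(key=lambda item: item[0])
    return found[:top_k]


applicant = {"income_k": 50, "open_debts": 2, "age": 40}
candidate_changes = {
    "income_k": [0, 5, 10, 15, 20, 25, 30],  # income increases, in $K
    "open_debts": [0, -1, -2],               # debts paid off
}
unit_cost = {"income_k": 1, "open_debts": 5}  # assumed effort per unit of change

options = find_counterfactuals(approve, applicant, candidate_changes, unit_cost)
for cost, deltas in options:
    print(cost, deltas)
```

The cost function is where realism enters: a search that treats every change as equally easy will suggest options that real applicants cannot act on. Production tools also add a diversity term so the options differ from one another, which this sketch does not do.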

Mechanistic interpretability

Research area. Understand what specific circuits in neural networks compute.

Feature discovery. What concepts are represented in model weights? Sparse autoencoders, probing classifiers.

Circuit analysis. How do computations flow through model layers?

Anthropic's work. Induction heads and circuits in transformers, sparse-autoencoder features in LLMs; builds on earlier circuits research in vision models.

DeepMind, OpenAI similarly active. Field growing rapidly.

Mostly research stage. Production use cases emerging slowly (safety monitoring, debugging).

What to use when

Tabular ML models. SHAP dominant. Mature tools, stakeholder-friendly visualizations.

User-facing explanations. Counterfactuals often most useful. Actionable, concrete.

Regulatory compliance. Depends on jurisdiction; in US credit, adverse action notices must state the principal reasons for a denial.

LLM behavior. Natural language self-explanation is the primary tool in practice, though it needs checking for faithfulness; mechanistic interpretability for research/safety.

Debugging. Feature attribution plus analysis of failures; find patterns in errors.

LLM-specific techniques

Chain of thought. Model explains reasoning; often post-hoc rationalization but sometimes faithful.

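Attention visualization. Which tokens did the model attend to? Attention weights show where information could flow, not how much each token contributed to the output. Partial information; not faithful attribution.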

Probing classifiers. Train classifiers on hidden representations to detect concepts.
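A toy sketch of a probing classifier, assuming synthetic "hidden representations" in which one dimension encodes a binary concept and the rest are noise. The data, the logistic-regression probe, and the shuffled-label control are all illustrative assumptions; with a real model the representations would come from a chosen layer.

```python
import math
import random


def train_probe(features, labels, epochs=200, lr=0.1):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for h, y in zip(features, labels):
            logit = sum(wi * hi for wi, hi in zip(w, h)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            for i in range(dim):
                grad_w[i] += (p - y) * h[i]
            grad_b += p - y
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, features, labels):
    correct = 0
    for h, y in zip(features, labels):
        pred = 1 if sum(wi * hi for wi, hi in zip(w, h)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(labels)


rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in range(400)]
# Synthetic representations: dimension 0 carries the concept, dimensions 1-3 are noise.
hidden = [[(1.0 if y else -1.0) + rng.gauss(0, 0.3)] + [rng.gauss(0, 1) for _ in range(3)]
          for y in labels]
train_h, test_h = hidden[:300], hidden[300:]

w, b = train_probe(train_h, labels[:300])
probe_acc = accuracy(w, b, test_h, labels[300:])

# Control: the same probe trained on shuffled labels should stay near chance.
shuffled = list(labels)
rng.shuffle(shuffled)
w_c, b_c = train_probe(train_h, shuffled[:300])
control_acc = accuracy(w_c, b_c, test_h, shuffled[300:])
print(probe_acc, control_acc)
```

The control matters because a high probe accuracy only shows the concept is decodable from the representation, not that the model uses it. Comparing against a shuffled-label probe at least rules out the probe memorizing noise.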

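Activation patching. Replace internal activations with those recorded from a different run (for example, a clean prompt versus a corrupted one); observe how the output changes. A causal intervention, not just a correlation.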
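A minimal sketch of activation patching on a hypothetical two-unit network with hand-set weights. Activations from a clean run are copied, one hidden unit at a time, into a corrupted run, and the fraction of the clean output that each patch restores is recorded. Real use applies the same idea to attention heads or residual-stream positions in a transformer.

```python
def relu(v):
    return max(0.0, v)


# Hypothetical weights: two inputs -> two hidden units -> one output.
W_IN = [[1.0, -0.5], [0.5, 1.0]]
W_OUT = [2.0, 0.1]


def forward(x, patch=None):
    """Run the toy network; `patch` maps hidden index -> activation to overwrite."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W_IN]
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value
    output = sum(w * h for w, h in zip(W_OUT, hidden))
    return output, hidden


clean_input = [2.0, 1.0]
corrupted_input = [0.0, 1.0]  # the first input is knocked out

clean_out, clean_hidden = forward(clean_input)
corrupted_out, _ = forward(corrupted_input)

recovery = {}
for idx in range(len(clean_hidden)):
    # Re-run the corrupted input with one hidden unit restored to its clean value.
    patched_out, _ = forward(corrupted_input, patch={idx: clean_hidden[idx]})
    recovery[idx] = (patched_out - corrupted_out) / (clean_out - corrupted_out)

print(recovery)
```

A unit whose patch restores most of the clean output is causally important for the difference between the two inputs. In this toy case that is hidden unit 0.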

Sparse autoencoders. Used in Anthropic's feature extraction work, among others. Decompose model activations into sparse features that are easier to interpret.

Explanation fidelity

Plausible vs faithful. Explanations can sound good but not reflect actual model behavior.

Faithfulness testing. Perturb or remove the features an explanation marks as important; check that the prediction changes as much as the explanation implies.
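One simple way to run such a check is a deletion test, sketched here with a hypothetical linear model and two hand-written explanations (one matching the model, one merely plausible). Features are reset to a baseline in the order the explanation ranks them, and the size of the first change in output is compared.

```python
def deletion_outputs(predict, x, baseline, attributions):
    """Ablate features from most to least important (by |attribution|).

    Returns the model output before any ablation and after each step.
    """
    order = sorted(range(len(x)), key=lambda i: abs(attributions[i]), reverse=True)
    current = list(x)
    outputs = [predict(current)]
    for i in order:
        current[i] = baseline[i]
        outputs.append(predict(current))
    return outputs


def first_step_change(outputs):
    """How far the prediction moves when the top-ranked feature is removed."""
    return abs(outputs[0] - outputs[1])


def toy_model(f):
    # Hypothetical model: feature 0 dominates, feature 1 barely matters.
    return 3.0 * f[0] + 0.2 * f[1] - 1.5 * f[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

faithful = [3.0, 0.2, -1.5]             # matches what the model actually does
plausible_but_wrong = [0.2, 3.0, -1.5]  # claims feature 1 matters most

faithful_change = first_step_change(deletion_outputs(toy_model, x, baseline, faithful))
wrong_change = first_step_change(deletion_outputs(toy_model, x, baseline, plausible_but_wrong))
print(faithful_change, wrong_change)
```

If removing the supposedly most important feature barely moves the prediction, the explanation is not faithful, however reasonable it sounds. Ablating to a baseline is itself an assumption and can push inputs off-distribution.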

Limits. Some behavior may not be interpretable by humans regardless of technique.

Tools

SHAP library. Python; de facto standard for tree and tabular models.

Captum (PyTorch). Feature attribution for neural networks.

InterpretML (Microsoft). Various techniques unified.

DiCE. Counterfactual explanations.

Specialized. NeuroSAE for LLMs (emerging); LLM observability tools with explanation features.

Governance role

Internal review. Explanations support model review, bias assessment, debugging.

External disclosure. Regulators, customers, affected individuals receive explanations.

Documentation. Model cards, impact assessments should include explanation methodology.

Accountability structures. Who's responsible when explanations reveal problems? Clear ownership matters.
