Explainability for AI systems matters for different reasons to different audiences: debugging for engineers, justification for affected users, accountability for regulators, understanding for researchers. The technique set has matured but remains fragmented. This post surveys the explainability toolkit as of 2026 and gives guidance on what to use when.
Feature attribution
SHAP (SHapley Additive exPlanations). Based on Shapley values from cooperative game theory; distributes a prediction's credit across features. Mature tools, interpretable.
LIME (Local Interpretable Model-agnostic Explanations). Approximates model locally with interpretable simple model. Explains specific predictions.
Integrated gradients. For neural networks; attributes a prediction by accumulating gradients along a path from a baseline input to the actual input.
What you get. 'Feature X contributed +0.3 to this prediction; feature Y contributed -0.1.' Numerical, additive.
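Limitations. Attributions describe what the model relies on, not what causes outcomes in the world. A local explanation for one prediction may not describe global behavior. Sampling-based methods (LIME, KernelSHAP) can give different results across runs.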
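To make the "numerical, additive" property concrete, here is a minimal sketch that computes exact Shapley values by enumerating feature subsets. It does not use the SHAP library. The toy model, the feature names, and the all-zeros baseline are assumptions chosen for illustration, and exact enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one input; 'absent' features take baseline values."""
    n = len(x)

    def value(present):
        # Features in `present` keep their real value; the rest use the baseline.
        mixed = [x[i] if i in present else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        contribution = 0.0
        for size in range(n):
            # Standard Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                contribution += weight * (value(present | {i}) - value(present))
        phi.append(contribution)
    return phi


def toy_model(features):
    # Hypothetical scoring function with an income * years interaction.
    income, debt, years = features
    return 0.5 * income - 0.3 * debt + 0.2 * income * years


x = [2.0, 1.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)

print("attributions:", [round(p, 3) for p in phi])
print("sum of attributions:", round(sum(phi), 3))
print("prediction minus baseline:", round(toy_model(x) - toy_model(baseline), 3))
```

The attributions sum to the gap between the prediction and the baseline prediction, and the interaction term is split evenly between the two features involved. The choice of baseline changes every number, which is one reason attributions need careful reading.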
Counterfactuals
What minimal change to input would change the output? 'If your income were $5K higher, loan would be approved.'
Actionable for users. Tells affected person what to change. Regulatory fit for adverse action notices.
Minimal perturbation. Change the least to get different result. Various algorithms (DiCE, FACE, etc.).
Multiple counterfactuals. 'Here are three ways to get approved.' Gives options.
Challenges. Immutable or protected features (race, age) must be held fixed rather than offered as changes. Realistic counterfactuals are harder to find than arbitrary ones.
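The ideas above (minimal perturbation, multiple options, immutable features held fixed) can be sketched with a brute-force search. This is not how DiCE or FACE work internally. The scoring rule, feature names, step sizes, and the plain sum-of-changes cost are all hypothetical, picked to keep the example small.

```python
from itertools import product


def approve(applicant):
    # Hypothetical integer scoring rule standing in for a trained model.
    score = 4 * applicant["income"] - 5 * applicant["debt"] + applicant["age"]
    return score >= 200


def find_counterfactuals(applicant, steps, immutable, k=3):
    """Grid-search changes to mutable features; return the k cheapest that flip the decision."""
    mutable = [name for name in steps if name not in immutable]
    candidates = []
    for deltas in product(*(steps[name] for name in mutable)):
        changed = dict(applicant)
        for name, delta in zip(mutable, deltas):
            changed[name] += delta
        if approve(changed):
            # Cost: total absolute change (income and debt are both in $K here).
            cost = sum(abs(d) for d in deltas)
            candidates.append((cost, dict(zip(mutable, deltas))))
    candidates.sort(key=lambda c: c[0])
    return candidates[:k]


applicant = {"income": 40, "debt": 10, "age": 30}  # amounts in $K
steps = {
    "income": [0, 5, 10, 15, 20],
    "debt": [0, -5, -10],
    "age": [0, 5, 10],  # listed, but excluded below as immutable
}

results = find_counterfactuals(applicant, steps, immutable={"age"})
for cost, change in results:
    print(f"total change {cost}K: {change}")
```

A real system would also need a plausibility check, since a grid search will happily propose changes no applicant could make.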
Mechanistic interpretability
Research area. Understand what specific circuits in neural networks compute.
Feature discovery. What concepts are represented in a model's internal activations? Sparse autoencoders, probing classifiers.
Circuit analysis. How do computations flow through model layers?
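Anthropic's work. Induction heads, sparse-autoencoder features in LLMs. The earlier circuits work on vision models came out of OpenAI, from researchers who later co-founded Anthropic.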
DeepMind, OpenAI similarly active. Field growing rapidly.
Mostly research stage. Production use cases emerging slowly (safety monitoring, debugging).
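The probing idea mentioned under feature discovery can be sketched in a few lines: train a small linear classifier on hidden activations and see whether a concept can be read off them. The "activations" here are synthetic, generated with an assumed concept direction, since a real probe would read them from an actual model layer. The dimensions, noise level, and training settings are illustrative assumptions.

```python
import math
import random

random.seed(0)
DIM = 4
CONCEPT_DIRECTION = [1.0, -1.0, 0.5, 0.0]  # assumed encoding of the concept


def fake_activation(label):
    """Synthetic hidden state: Gaussian noise plus a shift along the concept direction."""
    sign = 1.0 if label == 1 else -1.0
    return [random.gauss(0.0, 1.0) + sign * 1.5 * d for d in CONCEPT_DIRECTION]


def make_dataset(n):
    labels = [random.randint(0, 1) for _ in range(n)]
    return [(fake_activation(y), y) for y in labels]


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(w, b, h):
    return sum(wi * hi for wi, hi in zip(w, h)) + b


def train_probe(data, epochs=50, lr=0.1):
    """Logistic-regression probe trained with plain stochastic gradient descent."""
    w, b = [0.0] * DIM, 0.0
    for _ in range(epochs):
        for h, y in data:
            err = sigmoid(logit(w, b, h)) - y  # gradient of log loss w.r.t. the logit
            w = [wi - lr * err * hi for wi, hi in zip(w, h)]
            b -= lr * err
    return w, b


def accuracy(w, b, data):
    correct = 0
    for h, y in data:
        predicted = 1 if logit(w, b, h) > 0 else 0
        correct += predicted == y
    return correct / len(data)


train, test = make_dataset(200), make_dataset(100)
w, b = train_probe(train)
test_accuracy = accuracy(w, b, test)
print("held-out probe accuracy:", test_accuracy)
```

High probe accuracy shows the concept is linearly decodable from the activations. It does not show the model actually uses that information, which is why probes are usually paired with causal interventions.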
What to use when
Tabular ML models. SHAP dominant. Mature tools, stakeholder-friendly visualizations.
User-facing explanations. Counterfactuals often most useful. Actionable, concrete.
Regulatory compliance. Depends on jurisdiction; in US credit, adverse action notices must state the principal reasons for a denial.
LLM behavior. Natural language self-explanation primary; mechanistic interpretability for research/safety.
Debugging. Feature attribution plus analysis of failures; find patterns in errors.
LLM-specific techniques
Chain of thought. Model explains reasoning; often post-hoc rationalization but sometimes faithful.
Attention visualization. Which tokens did model attend to? Partial information; not faithful attribution.
Probing classifiers. Train classifiers on hidden representations to detect concepts.
Activation patching. Systematically perturb internal activations; observe behavior changes. Causal intervention.
Sparse autoencoders. Central to Anthropic's feature extraction work and also used by other labs. Identify interpretable features in model activations.
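Activation patching, described above as a causal intervention, can be illustrated on a toy network. The sketch below runs a "clean" and a "corrupted" input, then copies one hidden activation at a time from the clean run into the corrupted run and measures how much of the clean output is recovered. The two-unit network, its weights, and the inputs are hypothetical; on a real model the same idea is usually implemented with framework hooks.

```python
def relu(v):
    return max(0.0, v)


# Hypothetical weights for a tiny 2-input, 2-hidden-unit, 1-output network.
W1 = [[1.0, 1.0], [1.0, -1.0]]
W2 = [1.0, 0.5]


def forward(x, patch=None):
    """Run the toy network; `patch` maps hidden-unit index to an activation to overwrite."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    if patch:
        for idx, val in patch.items():
            hidden[idx] = val
    output = sum(w * h for w, h in zip(W2, hidden))
    return output, hidden


clean_input = [2.0, 1.0]
corrupted_input = [0.0, 1.0]

clean_out, clean_hidden = forward(clean_input)
corrupted_out, _ = forward(corrupted_input)

recovered = {}
for unit in range(len(clean_hidden)):
    # Corrupted run with a single hidden unit restored to its clean value.
    patched_out, _ = forward(corrupted_input, patch={unit: clean_hidden[unit]})
    recovered[unit] = (patched_out - corrupted_out) / (clean_out - corrupted_out)

for unit, frac in recovered.items():
    print(f"hidden unit {unit}: recovers {frac:.0%} of the clean output")
```

Units whose patch recovers most of the clean behavior are the ones carrying the information that differs between the two inputs. Because this toy output layer is linear, the recovered fractions add up to one. That will not generally hold in deeper networks.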
Explanation fidelity
Plausible vs faithful. Explanations can sound good but not reflect actual model behavior.
Faithfulness testing. Perturb the features an explanation marks as important; check whether the prediction changes as much as the explanation implies.
Limits. Some behavior may not be interpretable by humans regardless of technique.
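One simple form of the faithfulness test above is a deletion check: reset each feature to a baseline value and see whether the size of the prediction change follows the ranking the explanation gives. The model, the baseline, and both candidate explanations below are hypothetical. Single-feature deletion is only one of several possible checks and can mislead when features interact strongly.

```python
def model(features):
    # Hypothetical model: relies heavily on feature 0 and barely on feature 2.
    return 3.0 * features[0] + 1.0 * features[1] + 0.1 * features[2]


def deletion_drop(predict, x, baseline, feature):
    """Absolute prediction change when one feature is reset to its baseline value."""
    perturbed = list(x)
    perturbed[feature] = baseline[feature]
    return abs(predict(x) - predict(perturbed))


def passes_deletion_check(predict, x, baseline, attributions):
    """True if single-feature deletion effects follow the attribution ranking."""
    ranked = sorted(range(len(x)), key=lambda i: abs(attributions[i]), reverse=True)
    drops = [deletion_drop(predict, x, baseline, i) for i in ranked]
    return all(drops[i] >= drops[i + 1] for i in range(len(drops) - 1))


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]

faithful_explanation = [3.0, 1.0, 0.1]   # matches what the model actually uses
plausible_explanation = [0.1, 1.0, 3.0]  # sounds reasonable, ranks features backwards

faithful_ok = passes_deletion_check(model, x, baseline, faithful_explanation)
plausible_ok = passes_deletion_check(model, x, baseline, plausible_explanation)
print("faithful explanation passes:", faithful_ok)
print("plausible explanation passes:", plausible_ok)
```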
Tools
SHAP library. Python; de facto standard for tree and tabular models.
Captum (PyTorch). Feature attribution for neural networks.
InterpretML (Microsoft). Various techniques unified.
DiCE. Counterfactual explanations.
Specialized. SAELens and Neuronpedia for sparse-autoencoder work on LLMs (emerging); LLM observability tools with explanation features.
Governance role
Internal review. Explanations support model review, bias assessment, debugging.
External disclosure. Regulators, customers, affected individuals receive explanations.
Documentation. Model cards, impact assessments should include explanation methodology.
Accountability structures. Who's responsible when explanations reveal problems? Clear ownership matters.