Explainability for AI systems matters for different reasons to different audiences: debugging for engineers, justification for affected users, accountability for regulators, understanding for researchers. The technique set has matured but remains fragmented. This post covers the explainability toolkit in 2026 and offers guidance on what to use when.

Three categories

Feature attribution: SHAP, LIME, integrated gradients. Counterfactuals: what would change the result. Mechanistic: circuits and features inside models.

Feature attribution

SHAP (SHapley Additive exPlanations). Based on game theory; distributes prediction credit across features. Mature tools, interpretable.
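To make the game-theory idea concrete, here is a minimal sketch of exact Shapley values for a toy model. The model, feature names, and the single all-zeros baseline are assumptions for illustration; the SHAP library uses a background dataset and faster approximations rather than this brute-force enumeration.

```python
from itertools import permutations

def model(x):
    # Hypothetical toy scoring model with an interaction term.
    return 0.5 * x["income"] + 0.2 * x["tenure"] + 0.1 * x["income"] * x["tenure"]

def shapley_values(f, x, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every ordering in which features can be switched on."""
    names = list(x)
    phi = {n: 0.0 for n in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)      # start from the baseline input
        prev = f(current)
        for n in order:
            current[n] = x[n]         # switch feature n to its actual value
            new = f(current)
            phi[n] += new - prev      # marginal contribution in this ordering
            prev = new
    return {n: total / len(orderings) for n, total in phi.items()}

x = {"income": 2.0, "tenure": 3.0}
baseline = {"income": 0.0, "tenure": 0.0}
phi = shapley_values(model, x, baseline)

# Additivity: the attributions sum to f(x) - f(baseline).
gap = model(x) - model(baseline)
print(phi, sum(phi.values()), gap)
```

The interaction term is split evenly between the two features, which is the "fair credit" property Shapley values are built on. Enumeration costs n! model calls, so this only works for a handful of features.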

LIME (Local Interpretable Model-agnostic Explanations). Approximates model locally with interpretable simple model. Explains specific predictions.
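A rough sketch of the local-surrogate idea, under stated assumptions: `black_box` is a hypothetical model, perturbations are Gaussian, and proximity weights use an exponential kernel. The real LIME library adds interpretable feature representations and feature selection that are not shown here.

```python
import math
import random

def black_box(x):
    # Hypothetical nonlinear model standing in for the system being explained.
    return x[0] ** 2 + 3.0 * x[1]

def solve(A, b):
    """Solve A w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    w = [0.0] * n
    for i in reversed(range(n)):
        w[i] = (M[i][n] - sum(M[i][j] * w[j] for j in range(i + 1, n))) / M[i][i]
    return w

def local_surrogate(f, x0, n_samples=500, sigma=0.1, kernel_width=0.2, seed=0):
    """Fit a proximity-weighted linear model around x0; returns (intercept, slopes)."""
    rng = random.Random(seed)
    d = len(x0)
    # Accumulate the weighted normal equations (X^T W X) w = X^T W y.
    A = [[0.0] * (d + 1) for _ in range(d + 1)]
    b = [0.0] * (d + 1)
    for _ in range(n_samples):
        delta = [rng.gauss(0.0, sigma) for _ in range(d)]
        z = [xi + di for xi, di in zip(x0, delta)]
        weight = math.exp(-sum(di * di for di in delta) / kernel_width ** 2)
        row = [1.0] + delta           # intercept term plus offsets from x0
        y = f(z)
        for i in range(d + 1):
            b[i] += weight * row[i] * y
            for j in range(d + 1):
                A[i][j] += weight * row[i] * row[j]
    coef = solve(A, b)
    return coef[0], coef[1:]

intercept, slopes = local_surrogate(black_box, [1.0, 2.0])
print(intercept, slopes)
```

Near the point (1, 2) the slopes should land close to the model's local gradient, roughly 2 and 3. Changing the seed or sample count shifts the numbers slightly, which is the run-to-run instability often raised against sampling-based explanations.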

Integrated gradients. For neural networks; gradient-based attribution that accumulates gradients along a path from a baseline input to the actual input.
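A small sketch of integrated gradients on a hypothetical two-input model. It uses finite-difference gradients and a midpoint Riemann sum so it runs without a deep learning framework; in practice you would use autodiff, for example through Captum. The model and the zero baseline are assumptions.

```python
import math

def model(x):
    # Hypothetical differentiable model: a sigmoid over a small polynomial.
    z = 1.5 * x[0] - 2.0 * x[1] + 0.5 * x[0] * x[1]
    return 1.0 / (1.0 + math.exp(-z))

def gradient(f, x, eps=1e-6):
    """Central-difference gradient; a real setup would use autodiff instead."""
    grads = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        grads.append((f(hi) - f(lo)) / (2 * eps))
    return grads

def integrated_gradients(f, x, baseline, steps=200):
    """Approximate the path integral of gradients along the straight line
    from baseline to x, then scale by (x - baseline)."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps     # midpoint of each sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = gradient(f, point)
        total = [t + gi for t, gi in zip(total, g)]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 0.5]
baseline = [0.0, 0.0]
attributions = integrated_gradients(model, x, baseline)

# Completeness: attributions sum (approximately) to f(x) - f(baseline).
print(attributions, sum(attributions), model(x) - model(baseline))
```

The completeness check at the end is a useful sanity test: if the sum drifts far from the output difference, the step count is too low or the baseline is poorly chosen.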

What you get. 'Feature X contributed +0.3 to this prediction; feature Y contributed -0.1.' Numerical, additive.

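Limitations. Attributions describe what the model relies on, not what causes outcomes in the world; a local explanation may not reflect global behavior; sampling-based methods (LIME, Kernel SHAP) can give different results across runs.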

Counterfactuals

What minimal change to the input would change the output? 'If your income were $5K higher, the loan would be approved.'

Actionable for users. Tells affected person what to change. Regulatory fit for adverse action notices.

Minimal perturbation. Find the smallest change that yields a different result. Various algorithms (DiCE, FACE, etc.).

Multiple counterfactuals. 'Here are three ways to get approved.' Gives options.

Challenges. Features a person cannot act on (race, age) must be held fixed rather than offered as changes. Realistic counterfactuals are harder to generate than arbitrary ones.
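The ideas in this section can be sketched with a brute-force search. This is not DiCE or FACE; it is a grid search over a hypothetical integer scoring rule, with assumed step sizes and effort costs. It holds age fixed, skips unrealistic candidates, and returns several options ranked by cost.

```python
from itertools import product

def approve(applicant):
    # Hypothetical points-based rule standing in for a trained credit model.
    score = 4 * applicant["income_k"] - 50 * applicant["open_debts"] + applicant["age"]
    return score >= 300

# Candidate changes per actionable feature. Age is deliberately absent,
# so the search never proposes changing it.
ACTIONABLE_STEPS = {
    "income_k": [0, 5, 10, 15, 20],   # raise income by N thousand
    "open_debts": [0, -1, -2],        # close N debts
}
# Assumed relative effort per unit of change, used to rank candidates.
COST = {"income_k": 1.0, "open_debts": 4.0}

def counterfactuals(applicant, k=3):
    """Return up to k cheapest changes that flip a denial into an approval."""
    if approve(applicant):
        return []
    found = []
    names = list(ACTIONABLE_STEPS)
    for deltas in product(*(ACTIONABLE_STEPS[n] for n in names)):
        candidate = dict(applicant)
        for n, d in zip(names, deltas):
            candidate[n] += d
        if candidate["open_debts"] < 0:
            continue                  # skip unrealistic candidates
        if approve(candidate):
            cost = sum(COST[n] * abs(d) for n, d in zip(names, deltas))
            changes = {n: d for n, d in zip(names, deltas) if d != 0}
            found.append((cost, changes))
    found.sort(key=lambda item: item[0])
    return found[:k]

applicant = {"income_k": 60, "open_debts": 2, "age": 40}
options = counterfactuals(applicant)
for cost, changes in options:
    print(cost, changes)
```

The cost table is where most of the judgment lives: it encodes how hard each change is for a real person, and different weights produce different "minimal" counterfactuals.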

Mechanistic interpretability

Research area. Understand what specific circuits in neural networks compute.

Feature discovery. What concepts are represented in model weights? Sparse autoencoders, probing classifiers.

Circuit analysis. How do computations flow through model layers?

Anthropic's work. Transformer circuits, induction heads, features in LLMs. Builds on the earlier circuits work on vision models published on Distill.

DeepMind, OpenAI similarly active. Field growing rapidly.

Mostly research stage. Production use cases emerging slowly (safety monitoring, debugging).

What to use when

Tabular ML models. SHAP dominant. Mature tools, stakeholder-friendly visualizations.

User-facing explanations. Counterfactuals often most useful. Actionable, concrete.

Regulatory compliance. Depends on jurisdiction; adverse action notices (US credit, under ECOA/Regulation B) require stating the specific reasons for a denial.

LLM behavior. Natural language self-explanation primary; mechanistic interpretability for research/safety.

Debugging. Feature attribution plus analysis of failures; find patterns in errors.

LLM-specific techniques

Chain of thought. Model explains reasoning; often post-hoc rationalization but sometimes faithful.

Attention visualization. Which tokens did model attend to? Partial information; not faithful attribution.

Probing classifiers. Train classifiers on hidden representations to detect concepts.
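A minimal probing sketch, assuming synthetic data: `fake_hidden_state` is a hypothetical stand-in for activations captured from a real model, with the concept planted along two dimensions. The probe is plain logistic regression trained by gradient descent.

```python
import math
import random

rng = random.Random(0)

def fake_hidden_state():
    # Synthetic stand-in for a layer activation. By construction the
    # "concept" is linearly encoded along dimensions 0 and 2.
    h = [rng.gauss(0.0, 1.0) for _ in range(4)]
    label = 1 if h[0] + 0.5 * h[2] > 0 else 0
    return h, label

def sigmoid(z):
    # Numerically stable for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_probe(data, epochs=200, lr=0.5):
    """Logistic-regression probe trained with full-batch gradient descent."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * len(w), 0.0
        for h, y in data:
            err = sigmoid(sum(wi * hi for wi, hi in zip(w, h)) + b) - y
            gw = [g + err * hi for g, hi in zip(gw, h)]
            gb += err
        w = [wi - lr * g / len(data) for wi, g in zip(w, gw)]
        b -= lr * gb / len(data)
    return w, b

def accuracy(w, b, data):
    hits = sum(
        (sum(wi * hi for wi, hi in zip(w, h)) + b > 0) == (y == 1)
        for h, y in data
    )
    return hits / len(data)

train = [fake_hidden_state() for _ in range(200)]
test = [fake_hidden_state() for _ in range(100)]
w, b = train_probe(train)
probe_accuracy = accuracy(w, b, test)
print(probe_accuracy, w)
```

High held-out accuracy shows the concept is linearly decodable from the representation. It does not show the model uses that information, which is why probing is usually paired with causal methods.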

Activation patching. Systematically perturb internal activations; observe behavior changes. Causal intervention.
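A toy illustration of activation patching, assuming a hand-written two-layer network with fixed weights (everything here is hypothetical). Run the model on a clean and a corrupted input, copy one hidden activation from the clean run into the corrupted run, and measure how much of the clean output is recovered.

```python
# Hypothetical fixed weights: three hidden ReLU units, one linear output.
W1 = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
W2 = [2.0, 1.0, -1.0]

def hidden_layer(x):
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]

def output_layer(h):
    return sum(w * hi for w, hi in zip(W2, h))

def run(x, patch=None):
    """Forward pass. `patch` maps hidden-unit index -> activation to force."""
    h = hidden_layer(x)
    for i, value in (patch or {}).items():
        h[i] = value
    return output_layer(h)

clean = [1.0, 2.0]
corrupted = [0.0, 2.0]                # only the first input differs

clean_hidden = hidden_layer(clean)
clean_out = run(clean)
corrupted_out = run(corrupted)

# Patch each hidden unit in turn and record the fraction of the
# clean-vs-corrupted output gap that the patch restores.
recovery = {}
for unit in range(len(clean_hidden)):
    patched_out = run(corrupted, patch={unit: clean_hidden[unit]})
    recovery[unit] = (patched_out - corrupted_out) / (clean_out - corrupted_out)

print(recovery)
```

Unit 0 restores most of the gap, unit 1 a little, and unit 2 none, since its activation is identical in both runs. In a real transformer the same loop runs over attention heads or residual-stream positions instead of three scalar units.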

Sparse autoencoders. Anthropic's feature extraction work. Identifies interpretable features in model activations.

Explanation fidelity

Plausible vs faithful. Explanations can sound good but not reflect actual model behavior.

Faithfulness testing. Perturb or remove the features an explanation ranks highest; check that the prediction changes as the explanation implies.
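One simple way to run that check is a deletion test. The sketch below assumes a hypothetical linear model and two hand-written attribution vectors, one that matches the model and one that merely looks plausible; replacing features with a zero baseline is also an assumption.

```python
def model(x):
    # Hypothetical model that relies mostly on features 0 and 3.
    weights = [4.0, 0.1, 0.2, 3.0, 0.05]
    return sum(w * xi for w, xi in zip(weights, x))

def deletion_drop(f, x, attributions, k, baseline=0.0):
    """Replace the k features with the largest |attribution| by a baseline
    value and return how far the prediction moves."""
    ranked = sorted(range(len(x)), key=lambda i: abs(attributions[i]), reverse=True)
    perturbed = list(x)
    for i in ranked[:k]:
        perturbed[i] = baseline
    return abs(f(x) - f(perturbed))

x = [1.0, 1.0, 1.0, 1.0, 1.0]
faithful = [4.0, 0.1, 0.2, 3.0, 0.05]    # matches what the model uses
plausible = [0.1, 3.0, 4.0, 0.2, 0.05]   # reads well, points at the wrong features

drop_faithful = deletion_drop(model, x, faithful, k=2)
drop_plausible = deletion_drop(model, x, plausible, k=2)
print(drop_faithful, drop_plausible)
```

Deleting the features the faithful explanation names moves the prediction by about 7.0; deleting the ones the plausible explanation names moves it by about 0.3. Comparing against deletion of randomly chosen features gives a useful reference point.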

Limits. Some behavior may not be interpretable by humans regardless of technique.

Tools

SHAP library. Python; de facto standard for tree and tabular models.

Captum (PyTorch). Feature attribution for neural networks.

InterpretML (Microsoft). Various techniques unified.

DiCE. Counterfactual explanations.

Specialized. NeuroSAE for LLMs (emerging); LLM observability tools with explanation features.

Governance role

Internal review. Explanations support model review, bias assessment, debugging.

External disclosure. Regulators, customers, affected individuals receive explanations.

Documentation. Model cards, impact assessments should include explanation methodology.

Accountability structures. Who's responsible when explanations reveal problems? Clear ownership matters.
