AI safety research in 2026 spans alignment, interpretability, evals, and governance, with significant activity at frontier labs, academic institutions, and newly established government AI safety institutes. This post maps what is being worked on, who is doing it, and where the field has made progress versus where it has stagnated.

Research areas

Alignment: RLHF, Constitutional AI, scalable oversight. Interpretability: mechanistic analysis, feature discovery, circuits. Evals: capability benchmarks, dangerous-capability evals, jailbreak research.

Alignment research

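RLHF and successors. Reinforcement learning from human feedback (RLHF) is the foundation of safety tuning: a reward model is trained on human preference comparisons, and the policy is then optimized against it. RLAIF (reinforcement learning from AI feedback) extends this by replacing some human labels with model-generated ones.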
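The reward-modeling step can be illustrated with the pairwise preference loss commonly used for it. This is a minimal sketch, assuming a Bradley-Terry preference model; the function name `preference_loss` and the scalar rewards are illustrative, not any lab's actual training code.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Under a Bradley-Terry model this is -log(sigmoid(r_chosen - r_rejected)).
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on sign to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_preference_loss(pairs: list[tuple[float, float]]) -> float:
    """Average loss over (reward_chosen, reward_rejected) pairs."""
    return sum(preference_loss(c, r) for c, r in pairs) / len(pairs)
```

The loss shrinks as the reward model scores the human-preferred response further above the rejected one, which is the signal the policy is later optimized against.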

Constitutional AI. Anthropic's approach: the model critiques and revises its own outputs against a written set of principles, and RLAIF is then applied using those principles. See the constitutional approaches post.

Scalable oversight. As models become more capable, humans are harder pressed to evaluate their outputs. Research here focuses on hierarchical or AI-assisted oversight.

Debate, recursive reward modeling, amplification. Approaches to scalable oversight proposed by Paul Christiano and others.

Interpretability

Mechanistic interpretability. Understanding the computations inside neural networks at the level of circuits.

Feature discovery. Sparse autoencoders and probing classifiers are used to identify interpretable features in activations.
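To make the sparse autoencoder idea concrete, here is a minimal sketch of the forward pass and training loss in plain Python. It assumes the common formulation (ReLU encoder, linear decoder, reconstruction error plus an L1 sparsity penalty); the names `sae_forward`, `sae_loss`, and `l1_coeff` are illustrative, and real implementations run on tensors with learned weights.

```python
def matvec(matrix: list[list[float]], vec: list[float]) -> list[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def sae_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode an activation vector into features, then reconstruct it.

    w_enc has shape (n_features, d_model); w_dec has shape (d_model, n_features).
    """
    pre = [a + b for a, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    recon = [a + b for a, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon


def sae_loss(x, features, recon, l1_coeff=1e-3):
    """Mean squared reconstruction error plus an L1 penalty on feature activations."""
    mse = sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)
    sparsity = sum(abs(f) for f in features)
    return mse + l1_coeff * sparsity
```

The L1 term pushes most features to zero on any given input, which is what makes the surviving active features candidates for human interpretation.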

Circuit analysis. How do features combine and compose across layers?

Steering. Can we causally intervene on specific features? This has applications for safety.

Anthropic's interpretability papers on Claude 3 Sonnet and later Claude models are pushing the frontier. Google DeepMind and OpenAI are similarly active.
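One simple form of steering is to add a scaled feature direction to an activation vector before the forward pass continues. The sketch below assumes the direction has already been found (for example, a sparse autoencoder decoder column); `steer` and its parameters are hypothetical names for illustration.

```python
import math


def steer(activation: list[float], direction: list[float], strength: float) -> list[float]:
    """Shift an activation along a unit-normalized feature direction.

    Positive strength amplifies the feature; negative strength suppresses it.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [a + strength * d / norm for a, d in zip(activation, direction)]
```

Whether the model's behavior changes in the expected way after such an intervention is the causal test of whether the feature means what its label says.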

Evaluations and red-teaming

Capability benchmarks. MMLU, BIG-bench, HumanEval, and SWE-bench are standard capability measurements.

Dangerous capability evals. These cover bio, chem, cyber, and persuasion capabilities. Labs evaluate models pre-deployment; the UK and US AISIs provide independent evaluation.
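As an example of how such a benchmark is scored, code benchmarks like HumanEval typically report pass@k: the probability that at least one of k sampled solutions passes the tests. A sketch of the standard unbiased estimator, given n samples per problem of which c passed (the function name is ours):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for the problem, c: samples that passed, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 10 samples and 3 passes, `pass_at_k(10, 3, 1)` is roughly 0.3; the benchmark score is the mean of this value across problems.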

Jailbreak research. Finding ways to bypass safety measures, as part of broader adversarial robustness research.

Agentic evals. As models take multi-step actions, evals increasingly cover agent behavior, planning, and tool use.

Responsible Scaling Policies. Anthropic's Responsible Scaling Policy, OpenAI's Preparedness Framework, and Google DeepMind's Frontier Safety Framework each pre-commit to running evals and applying mitigations at defined capability thresholds.
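Jailbreak results are usually summarized as an attack success rate per attack type. A minimal sketch, assuming each red-team attempt has already been labeled as bypassing the safeguard or not; the category names and the `attack_success_rate` helper are made up for illustration.

```python
from collections import defaultdict


def attack_success_rate(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Fraction of attempts per attack category that bypassed the safeguard.

    results: (attack_category, bypassed) pairs, one per attempt.
    """
    totals: dict[str, int] = defaultdict(int)
    successes: dict[str, int] = defaultdict(int)
    for category, bypassed in results:
        totals[category] += 1
        if bypassed:
            successes[category] += 1
    return {category: successes[category] / totals[category] for category in totals}


# Hypothetical labeled attempts.
attempts = [("roleplay", True), ("roleplay", False), ("encoding", False), ("encoding", False)]
rates = attack_success_rate(attempts)
```

Breaking the rate out by category matters because a low overall rate can hide one attack family that works reliably.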
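The pre-commitment logic can be sketched as a lookup from eval scores to required mitigations. Everything here is hypothetical: the eval names, the threshold values, and the mitigation labels are placeholders, not taken from any published framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    eval_name: str            # hypothetical eval identifier
    trigger_score: float      # score at or above which the commitment applies
    required_mitigation: str  # what the developer pre-committed to do


def required_mitigations(scores: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return the mitigations triggered by a model's eval scores."""
    triggered = []
    for threshold in thresholds:
        if threshold.eval_name not in scores:
            # A missing eval is treated as an error rather than a pass.
            raise KeyError(f"no score for required eval: {threshold.eval_name}")
        if scores[threshold.eval_name] >= threshold.trigger_score:
            triggered.append(threshold.required_mitigation)
    return triggered
```

Treating a missing eval as an error reflects the point of pre-commitment: the check is defined before the results are known, so skipping it cannot count as passing.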

Who is doing this work

Frontier labs. Anthropic, OpenAI, and Google DeepMind have dedicated safety teams. A rough estimate is that 10-20% of staff focus on safety, though the share varies by lab and by how safety work is defined.

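Independent labs. METR (Model Evaluation and Threat Research), Apollo Research, and others run pre-deployment evals.

Academic and academic-adjacent. MIT CSAIL, Stanford HAI, Berkeley CHAI, and groups at NYU are active, as is GovAI (founded at Oxford, now independent). Redwood Research is an independent nonprofit rather than a university lab, and MATS is a standalone training program (see Funding and talent).

Government. UK AISI (created 2023, renamed the AI Security Institute in 2025), US AISI (2024, reorganized in 2025 as the Center for AI Standards and Innovation), and the EU AI Office are building capacity for independent evaluation.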

Nonprofits. MIRI (older, reduced activity), ARC (Alignment Research Center), FAR AI.

Progress vs stagnation

Progress. Interpretability has advanced meaningfully; we can now identify features in models. Evals are more sophisticated; RSPs formalize capability thresholds.

Stagnation. The core alignment problem (ensuring models pursue intended objectives) remains unsolved. Scalable oversight is still nascent.

Evaluation of alignment itself. This remains limited. We measure what we can measure; harder-to-define qualities evade measurement.

Gap between capabilities and safety research. Capabilities are advancing faster than safety understanding.

Policy coupling

Research informs policy. AISI evals feed regulatory frameworks.

EU AI Act. Risk-based regulation; high-risk systems require safety measures. Operationalization ongoing.

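US Executive Orders. The 2023 order on Safe, Secure, and Trustworthy AI was rescinded in January 2025 and replaced by an order with a different emphasis. Federal policy is still evolving.

International coordination. The Bletchley Declaration, the Seoul Declaration, and Singapore's IMDA initiatives. Still nascent.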

Open problems

Deceptive alignment. Can models appear aligned during training but misbehave when deployed? Active research.

Goal misgeneralization. Models pursuing proxies rather than true objectives. Documented; solutions unclear.

Steganography. Models communicating hidden information. Detection limited.

Emergent capabilities. Unexpected capabilities appearing with scale. Hard to predict, and therefore hard to test for safety in advance.

Training-deployment gap. Behavior during safety evals vs real deployment may differ.
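One way to check for such a gap is to compare the rate of a flagged behavior in eval transcripts against the rate in sampled deployment traffic. The sketch below uses a standard two-proportion z-statistic; the function name and the assumption that both samples are independently labeled are ours.

```python
import math


def behavior_rate_gap(eval_flagged: int, eval_total: int,
                      deploy_flagged: int, deploy_total: int) -> tuple[float, float]:
    """Return (rate difference, z-statistic) for deployment minus eval.

    A large positive z suggests the behavior is more common in deployment
    than the safety evals indicated.
    """
    eval_rate = eval_flagged / eval_total
    deploy_rate = deploy_flagged / deploy_total
    pooled = (eval_flagged + deploy_flagged) / (eval_total + deploy_total)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / eval_total + 1 / deploy_total))
    gap = deploy_rate - eval_rate
    z = gap / std_err if std_err > 0 else 0.0
    return gap, z
```

A statistic like this only detects a gap in behaviors that are already being labeled; it says nothing about differences nobody thought to measure.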

Funding and talent

Funding expanded. AI labs have increased safety spending, philanthropic funders (Open Philanthropy and others) continue to give, and governments are funding AISIs.

Talent gap. Alignment researchers scarce. Strong demand; limited training pipelines.

MATS, SERI, Astra fellowships. Training programs for junior safety researchers.

Practical implications

For AI companies. Safety research informs deployment decisions, RSPs, mitigations.

For policymakers. Science-based policy becomes possible when safety research produces concrete findings.

For the public. Transparency about known limitations and risks increasingly expected.

For engineers deploying AI. Understand what safety means in your deployment and which mitigations are reasonable.

AI safety research today: what's happening, what matters


Continue the thread.

AI alignment in practice: from research to production

Constitutional AI approaches to alignment

Measuring AI bias: frameworks, metrics, practical steps
