RESOURCES

Get started

A collection of resources to get you started with mechanistic interpretability and AI red teaming. The list is updated regularly. Work down it in order, from orientation to hands-on practice.

Request access to the Discord community

Roadmaps

AI Red Teaming roadmap - roadmap.sh

Courses

ARENA Chapter 1: Transformer Interpretability - Callum McDougall
Introduction to Red Teaming AI - Hack The Box
Learn Mechanistic Interpretability - Cat McGee
Wiki OffSec ML - offsecml.com

Blogs

How to become a mechanistic interpretability researcher - Neel Nanda, Head of Alignment, Google DeepMind
A Mathematical Framework for Transformer Circuits - Anthropic
Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance - Alignment Forum
An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 - Neel Nanda
Interpretability Dreams - Chris Olah
How to hack AI apps - Joseph Thacker

Papers

Open Problems in Mechanistic Interpretability - arXiv
A Primer on the Inner Workings of Transformer-based Language Models - arXiv
Bypassing the Safety Training of Open-Source LLMs with Priming Attacks - arXiv

Books

Red Teaming AI: Strategies & Techniques - Philip A. Dursey

YouTube

What is LLM Red Teaming? How Generative AI Safety Testing Works
Agentic AI Red Teaming: The Hottest Cyber Skill of 2026

GitHub

TransformerLens - Mechanistic interpretability of generative language models
L1B3RT4S - Jailbreaking prompts
CL4R1T4S - Leaked system prompts
Spiritual-Spell-Red-Teaming (ENI) - Jailbreaking prompts
VibeLearning: AI-sec - Eman Herawy

Tools

Promptfoo - Automated testing for AI risk in development
nnsight - Interpret and manipulate the internals of deep learning models
Averta - AI red teaming and guardrails

Low-refusal models

Gemma-4-12B Obliterated - Abliterated
GLM-5.2 - Ultra-low refusal rates
Qwen3.6-27B Obliterated - Abliterated

Platforms

Gandalf: Agent Breaker - Break the AI agent
Gray Swan Arena - Break AI. Win prizes. Get discovered.