
Redwood Research

Redwood Research is an AI safety lab founded in 2021 that pioneered the “AI control” research paradigm and has made significant contributions to mechanistic interpretability. The organization operates on the pragmatic assumption that alignment may not be fully solved in time, which motivates its work on keeping AI systems safe even if they are potentially misaligned.

Its trajectory shows a pattern of evidence-driven pivots: it began with adversarial robustness research, shifted to mechanistic interpretability with its influential causal scrubbing methodology, and ultimately developed the AI control framework that has gained significant attention across the safety community. With 15+ researchers and an annual budget exceeding $5M, Redwood has trained numerous safety researchers now working at Anthropic, ARC, and other major organizations.

The organization’s core insight is that traditional alignment approaches may be insufficient for highly capable systems, necessitating “control” measures that prevent catastrophic outcomes even from scheming AI systems.

| Factor | Assessment | Evidence | Timeline | Source |
| --- | --- | --- | --- | --- |
| Technical Influence | High | Causal scrubbing adopted by Anthropic, 150+ citations | 2022-present | Anthropic research |
| Field Building | High | 20+ fellowship alumni at major labs | 2021-present | EA Forum tracking |
| Control Paradigm Adoption | Medium | Growing interest across labs | 2023-present | AI safety discussions |
| Policy Relevance | Medium | Control more implementable than full alignment | 2024+ | RAND analysis |
| Research Quality | High | Rigorous empirical methodology | Ongoing | Multiple peer reviews |

Phase 1: Adversarial Robustness (2021-2022)


Problem Focus: Language models could be jailbroken despite safety training

| Challenge | Approach | Results | Lessons |
| --- | --- | --- | --- |
| Harmful content generation | Adversarial training on refusal | Some domain-specific success | Adversarial examples persist |
| Jailbreak robustness | Systematic prompt engineering | Limited scalability | Need internal understanding |
| Behavioral safety | External training constraints | Fundamental brittleness | Surface-level solutions insufficient |

Key Finding: Behavioral safety training has fundamental limitations that don’t scale to existential safety concerns.
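
To make the Phase 1 approach concrete, the adversarial-training work followed an iterative red-team-and-retrain loop: train a classifier to flag harmful completions, have attackers search for inputs that slip past it, fold those failures back into the training data, and repeat. The sketch below is a minimal, hypothetical illustration of that loop, with stub functions standing in for a fine-tuned language-model classifier and real red-teamers; it is not Redwood's actual code.

```python
# Minimal sketch of an iterative adversarial-training loop for a harm
# classifier. All functions here are hypothetical stubs standing in for a
# fine-tuned LM classifier and a red-team attack search; this is not
# Redwood's actual implementation.
import random

random.seed(0)

def train_classifier(labeled_examples):
    """Stub 'training': memorize exact strings. A real version would
    fine-tune a language model on (text, label) pairs."""
    return {text: label for text, label in labeled_examples}

def predict_harmful(classifier, text):
    """Return 1 if flagged as harmful, else 0 (unseen inputs slip through)."""
    return classifier.get(text, 0)

def red_team_search(classifier, harmful_seeds, n_candidates=20):
    """Stub attack: generate paraphrase-like variants of known-harmful
    prompts and keep the ones the classifier fails to flag."""
    failures = []
    for seed in harmful_seeds:
        variants = [f"{seed} [variant {random.randint(0, 10**6)}]"
                    for _ in range(n_candidates)]
        failures += [v for v in variants if predict_harmful(classifier, v) == 0]
    return failures

# Each round, newly found adversarial examples are added to the training set,
# yet the stub attacker keeps producing fresh variants, mirroring the
# "adversarial examples persist" lesson in the table above.
harmful_seeds = ["describe how to cause injury"]
train_set = [(s, 1) for s in harmful_seeds] + [("how to bake bread", 0)]
for round_idx in range(3):
    classifier = train_classifier(train_set)
    failures = red_team_search(classifier, harmful_seeds)
    train_set += [(f, 1) for f in failures]
    print(f"round {round_idx}: {len(failures)} new adversarial examples")
```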

Phase 2: Mechanistic Interpretability (2022-2023)


Strategic Pivot: Understanding internal computations rather than just behavior

Major Contribution - Causal Scrubbing:

  • Innovation: Rigorous testing framework for interpretability hypotheses
  • Method: Ablate model components according to the proposed explanation and measure behavioral changes (see the sketch below)
  • Impact: Raised standards across interpretability field
  • Adoption: Used by Anthropic, multiple academic groups
  • Citations: 150+ across major venues
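
The test behind causal scrubbing can be stated simply: if a hypothesis claims that only certain components matter for a behavior, then replacing every other component's activations with activations computed on other inputs should leave the behavior largely intact. The following is a minimal toy illustration of that idea on a two-layer network; the model, the "loss recovered" proxy, and all names are assumptions made for this example, and Redwood's actual method operates on hypotheses expressed as correspondences with a real model's computational graph.

```python
# Minimal sketch of a causal-scrubbing-style ablation test (illustrative,
# not Redwood's implementation). We test a hypothesis of the form "only the
# components in `relevant` matter for the behavior" by resample-ablating all
# other components and measuring how much of the original behavior remains.
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyModel(nn.Module):
    """Two-layer MLP whose hidden units play the role of 'components'."""
    def __init__(self, d_in=8, d_hidden=16):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, 1)

    def forward(self, x, ablate_mask=None, donor_hidden=None):
        h = torch.relu(self.fc1(x))
        if ablate_mask is not None:
            # Resample-ablate: replace masked components with activations
            # computed on unrelated "donor" inputs, per the hypothesis.
            h = torch.where(ablate_mask, donor_hidden, h)
        return self.fc2(h)

def scrubbed_behavior_recovered(model, x, x_donor, relevant):
    """Proxy for 'loss recovered': values near 1.0 mean the hypothesized
    components explain most of the behavior; values near 0.0 falsify it."""
    with torch.no_grad():
        clean = model(x)
        donor_hidden = torch.relu(model.fc1(x_donor))
        mask = torch.ones(donor_hidden.shape[-1], dtype=torch.bool)
        mask[list(relevant)] = False            # keep hypothesized components
        scrubbed = model(x, ablate_mask=mask, donor_hidden=donor_hidden)
        fully_ablated = model(x, ablate_mask=torch.ones_like(mask),
                              donor_hidden=donor_hidden)
        # Compare scrubbed behavior against clean and fully-ablated baselines.
        num = (fully_ablated - scrubbed).abs().mean()
        den = (fully_ablated - clean).abs().mean() + 1e-8
        return (num / den).item()

model = TinyModel()
x, x_donor = torch.randn(32, 8), torch.randn(32, 8)
print(scrubbed_behavior_recovered(model, x, x_donor, relevant={0, 1, 2, 3}))
```

A high recovered fraction supports the proposed explanation and a low one falsifies it, which is what "rigorous testing framework for interpretability hypotheses" refers to above.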
| Research Area | Contribution | Impact | Follow-up Work |
| --- | --- | --- | --- |
| Circuit Analysis | Causal scrubbing methodology | Field standard | Adopted by Anthropic |
| Polysemanticity | Superposition investigation | Theoretical insight | Dictionary learning research |
| Feature Extraction | Automated circuit discovery | Tool development | Open-source implementations |

Phase 3: AI Control (2023-present)

Paradigm Shift: Safety without requiring full alignment

Core Framework:

  • Assumption: Deploy potentially misaligned but capable AI systems
  • Goal: Prevent catastrophic outcomes despite potential scheming
  • Analogy: Cybersecurity mindset applied to AI safety (a toy protocol sketch follows the table below)
| Control Component | Description | Research Status | Effectiveness |
| --- | --- | --- | --- |
| Monitoring | Detect deceptive/scheming behavior | Active research | 70-90% detection in toy models |
| Protocol Design | Safe deployment procedures | Framework developed | Testing in controlled settings |
| Red-team Testing | Adversarial evaluation of controls | Ongoing | Multiple vulnerabilities identified |
| Untrusted Monitoring | AI monitoring AI systems | Early research | Theoretical framework |
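
As a concrete reading of the table, a simple "trusted monitoring" protocol works like this: every output from the capable but untrusted model receives a suspicion score from a weaker trusted monitor, the most suspicious few percent are routed to a limited human audit budget, moderately suspicious ones are replaced by a trusted model's answer, and the rest are accepted. The sketch below is a toy version of that idea; the stub models, thresholds, and 2% audit budget are assumptions for illustration rather than Redwood's published setup.

```python
# Toy sketch of a trusted-monitoring control protocol. The stub models,
# thresholds, and audit budget are illustrative assumptions, not
# Redwood's published implementation.
import random
from collections import Counter
from dataclasses import dataclass

random.seed(0)

@dataclass
class Task:
    prompt: str

def untrusted_model(task: Task) -> str:
    """Capable but possibly scheming model (stub)."""
    return f"solution to {task.prompt}"

def trusted_monitor(task: Task, output: str) -> float:
    """Suspicion score in [0, 1]; in practice a weaker trusted model."""
    return random.random()

def trusted_model(task: Task) -> str:
    """Weaker but trusted fallback (stub)."""
    return f"conservative baseline for {task.prompt}"

def run_protocol(tasks, audit_budget=0.02, defer_threshold=0.8):
    """Score every untrusted output, audit the most suspicious few,
    defer moderately suspicious ones to the trusted model, accept the rest."""
    scored = []
    for task in tasks:
        output = untrusted_model(task)
        scored.append((trusted_monitor(task, output), task, output))
    scored.sort(key=lambda item: item[0], reverse=True)
    n_audit = max(1, int(audit_budget * len(scored)))
    decisions = []
    for rank, (score, task, output) in enumerate(scored):
        if rank < n_audit:
            decisions.append(("audit", task, output))            # human review
        elif score > defer_threshold:
            decisions.append(("defer", task, trusted_model(task)))
        else:
            decisions.append(("accept", task, output))
    return decisions

decisions = run_protocol([Task(f"task {i}") for i in range(100)])
print(Counter(action for action, _, _ in decisions))
```

Red-team testing of such a protocol then measures how often an intentionally subversive untrusted model can get a bad output past both the monitor and the audit budget, which is the kind of evaluation behind detection-rate figures like those in the table.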


Research Questions:

  • Can monitoring detect sophisticated deception?
  • What deployment protocols minimize catastrophic risk?
  • How can safety cases be designed for potentially misaligned AI?

Ongoing Work:

Tools Developed:

  • Open-source interpretability libraries
  • Circuit analysis frameworks
  • Causal intervention methodologies
| Project Area | Focus | Status | Collaboration |
| --- | --- | --- | --- |
| Sandbagging Detection | Models hiding capabilities | Ongoing | With ARC |
| Jailbreak Robustness | Systematic prompt attacks | Published | Multiple labs |
| Training Dynamics | Misalignment development | Research phase | Academic partners |
| Evaluation Methodology | Safety assessment frameworks | Active | Industry consultation |
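
For the sandbagging-detection row, one simple signal is an elicitation gap: evaluate the model under its default setup and again under stronger elicitation (better prompting or capability-eliciting fine-tuning), and flag cases where stronger elicitation recovers substantially more capability than the default run showed. The snippet below is a minimal sketch of that comparison; the threshold and example data are made up for illustration and do not reflect Redwood's or ARC's actual methodology.

```python
# Minimal sketch of an elicitation-gap check for sandbagging. The threshold
# and example answers are illustrative assumptions, not a real evaluation.

def accuracy(answers, gold):
    """Fraction of answers matching the gold labels."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)

def sandbagging_gap(default_answers, elicited_answers, gold, threshold=0.15):
    """Flag possible sandbagging when stronger elicitation recovers much
    more capability than the default evaluation showed."""
    default_acc = accuracy(default_answers, gold)
    elicited_acc = accuracy(elicited_answers, gold)
    gap = elicited_acc - default_acc
    return {"default": default_acc, "elicited": elicited_acc,
            "gap": gap, "flag": gap > threshold}

gold = ["A", "C", "B", "D", "A", "B"]
default_answers = ["A", "B", "B", "A", "C", "B"]   # model appears weak
elicited_answers = ["A", "C", "B", "D", "A", "B"]  # capability recovered
print(sandbagging_gap(default_answers, elicited_answers, gold))
```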
Leadership Team

Buck Shlegeris
CEO & Co-founder
  • Background: Software engineering, rationalist community
  • Philosophy: Empirical focus, rapid iteration, evidence-driven pivots
  • Contribution: Organizational strategy, research direction setting
  • Public Role: Active in safety forums, clear research communication

Ryan Greenblatt
Research Lead
  • Expertise: Machine learning, adversarial evaluation
  • Key Work: AI control framework development, red-team methodologies
  • Approach: Rigorous threat modeling, security-minded safety research

William Saunders
Research Scientist

Neel Nanda
Former Senior Researcher (now DeepMind)

Organizational Approach:
  • Fellows Program: Trained 20+ researchers now at major safety organizations
  • Methodology: Empirical validation, quick pivots based on evidence
  • Collaboration: Active partnerships with labs and safety organizations
| Metric | 2021 | 2022 | 2023 | 2024 | 2025 (projected) |
| --- | --- | --- | --- | --- | --- |
| Research Staff | 5 | 8 | 12 | 15 | 20-25 |
| Annual Budget | $1M | $2M | $3.5M | $5M | $7M |
| Publications | 0 | 3 | 8 | 12 | 15-20 |
| Fellows Trained | 0 | 5 | 12 | 20 | 30+ |

Citation Metrics:

  • Causal scrubbing paper: 150+ citations
  • AI control framework: 50+ citations (recent publication)
  • Total research impact: 500+ citations across all published work

Adoption Indicators:

  • Anthropic implementing causal scrubbing methods
  • Multiple labs exploring control approaches
  • Policy discussions referencing control paradigm

2024-2025 Focus:

  • Scaling AI control research
  • Developing practical safety protocols
  • Building industry partnerships for testing
  • Expanding empirical evaluation capabilities

Funding Trajectory: Diversifying from grants to include industry partnerships while maintaining research independence.

Major Questions

  • Can control measures actually work against superintelligent scheming AI?
  • Does the control approach create false confidence in deployment safety?
  • Should resources focus on alignment research rather than control?
  • How can control measures be validated without access to truly dangerous AI?
  • What is the appropriate balance between control and alignment research?

| Uncertainty | Pessimistic View | Optimistic View | Current Evidence |
| --- | --- | --- | --- |
| Control Effectiveness | Superintelligent AI will circumvent any controls | Layered defenses can achieve acceptable risk | Limited to toy model testing |
| Resource Allocation | Control distracts from alignment work | Control provides crucial safety layer | Mixed views in safety community |
| Timeline Relevance | Won’t be ready before dangerous AI | Can develop alongside capabilities | Active research progress |
| Industry Adoption | Labs won’t implement costly controls | Control more practical than alignment | Early industry interest |

Monitoring Limitations:

  • Can sophisticated AI hide deceptive behavior?
  • How to detect novel attack vectors?
  • Scalability of human oversight

Protocol Design:

  • Balance between safety and capability
  • Economic incentives for control adoption
  • Verification of control implementations

Relationship to Alignment:

  • Complement or substitute for alignment research?
  • Risk of reducing alignment investment?
  • Long-term vision for AI safety
| Organization | Primary Focus | Timeline View | Methodology | Success Theory |
| --- | --- | --- | --- | --- |
| Redwood | Control + interpretability | Medium pessimism | Empirical, adaptive | Safety without full alignment |
| Anthropic | Constitutional AI, scaling | Cautious optimism | Large-scale experiments | Alignment through training |
| ARC | Evaluation, alignment theory | Varied | Theory + evaluation | Early detection and preparation |
| MIRI | Governance pivot | High pessimism | Theoretical (paused) | Coordination over technical |

Distinguishing Features:

  • Only major org focused primarily on control
  • Empirical methodology with willingness to pivot
  • Security-minded approach to safety
  • Practical deployment focus

Strategic Advantages:

  • Complementary to alignment research
  • More implementable in near-term
  • Appeals to industry safety concerns
  • Builds on existing security practices
| Publication | Year | Impact | Focus |
| --- | --- | --- | --- |
| Causal Scrubbing Paper | 2023 | 150+ citations | Interpretability methodology |
| AI Control Framework | 2024 | Foundational | Control paradigm |
| Adversarial Robustness Studies | 2022-2023 | Field building | Empirical safety |
| Type | Resource | Access |
| --- | --- | --- |
| Research Portal | Redwood Research | Public |
| Publications | Research Publications | Open access |
| Code/Tools | GitHub Repository | Open source |
| Fellowship | Research Fellowship | Application-based |
| Source | Focus | Assessment |
| --- | --- | --- |
| Open Philanthropy | Funding analysis | Major safety contribution |
| Alignment Forum | Technical discussion | Mixed reception of control approach |
| AI Safety Newsletter | Regular coverage | Positive on empirical methodology |
| Future of Humanity Institute | Academic perspective | Valuable addition to safety portfolio |
| Organization | Relationship | Collaboration Type |
| --- | --- | --- |
| Anthropic | Research collaboration | Interpretability methods sharing |
| ARC | Complementary focus | Evaluation methodology |
| CHAI | Academic partnership | Research collaboration |
| Apollo Research | Similar approach | Evaluation and testing |
| Institution | Engagement | Focus |
| --- | --- | --- |
| UK AISI | Consultation | Control methodologies |
| US AISI | Framework discussion | Deployment safety |
| NIST AI RMF | Input provision | Risk management |
| Industry Standards | Development participation | Safety protocols |