Redwood Research
Overview
Redwood Research is an AI safety lab founded in 2021 that has pioneered the “AI control” research paradigm while making significant contributions to mechanistic interpretability. The organization operates on the pragmatic assumption that alignment may not be fully solved in time, which motivates its work on ensuring safety even when deploying potentially misaligned AI systems.
Its trajectory shows notable adaptability: beginning with adversarial robustness research, pivoting to mechanistic interpretability with the influential causal scrubbing methodology, and ultimately developing the AI control framework that has gained significant attention across the safety community. With 15+ researchers and a multi-million-dollar annual budget, Redwood has trained numerous safety researchers now working at Anthropic, ARC, and other major organizations.
The organization’s core insight is that traditional alignment approaches may be insufficient for highly capable systems, necessitating “control” measures that prevent catastrophic outcomes even from scheming AI systems.
Risk and Impact Assessment
| Factor | Assessment | Evidence | Timeline | Source |
|---|---|---|---|---|
| Technical Influence | High | Causal scrubbing adopted by Anthropic, 150+ citations | 2022-present | Anthropic research↗ |
| Field Building | High | 20+ fellowship alumni at major labs | 2021-present | EA Forum tracking↗ |
| Control Paradigm Adoption | Medium | Growing interest across labs | 2023-present | AI safety discussions↗ |
| Policy Relevance | Medium | Control more implementable than full alignment | 2024+ | RAND analysis↗ |
| Research Quality | High | Rigorous empirical methodology | Ongoing | Multiple peer reviews |
Key Research Evolution
Phase 1: Adversarial Robustness (2021-2022)
Problem Focus: Language models could be jailbroken despite safety training
| Challenge | Approach | Results | Lessons |
|---|---|---|---|
| Harmful content generation | Adversarial training on refusal | Some domain-specific success | Adversarial examples persist |
| Jailbreak robustness | Systematic prompt engineering | Limited scalability | Need internal understanding |
| Behavioral safety | External training constraints | Fundamental brittleness | Surface-level solutions insufficient |
Key Finding: Behavioral safety training has fundamental limitations and does not, on its own, scale to existential safety concerns.
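To make that brittleness concrete, here is a minimal, purely illustrative sketch of an adversarial-training loop of the kind summarized in the table above. The keyword-based “classifier,” the paraphrase list, and the function names are hypothetical stand-ins, not Redwood’s actual pipeline; the point is that each round only blocks the specific attacks found so far.

```python
import random

# Toy stand-ins: the "classifier" is a set of blocked phrases, and "training"
# just adds newly discovered attack phrasings to that set. Real adversarial
# training fine-tunes a learned classifier on human- and model-generated attacks.
PARAPHRASES = [
    "how to pick a lock",
    "steps for picking a lock",
    "bypassing a lock without the key",
    "opening a lock you lost the key to",
]

def classifier_flags(prompt: str, blocked: set) -> bool:
    """Flag the prompt if it contains any currently blocked phrase."""
    return any(phrase in prompt for phrase in blocked)

def red_team_search(blocked: set):
    """Search known paraphrases for one that slips past the classifier."""
    candidates = PARAPHRASES[:]
    random.shuffle(candidates)
    for candidate in candidates:
        if not classifier_flags(candidate, blocked):
            return candidate
    return None

def adversarial_training(rounds: int = 10) -> set:
    blocked = {PARAPHRASES[0]}          # initial "safety training"
    for _ in range(rounds):
        attack = red_team_search(blocked)
        if attack is None:              # no known paraphrase gets through
            break
        blocked.add(attack)             # "retrain" on the discovered failure
    return blocked

print(adversarial_training())           # only ever covers attacks seen so far
```

Phrasings outside the searched set still evade the filter, which mirrors the surface-level brittleness noted above.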
Phase 2: Mechanistic Interpretability (2022-2023)
Strategic Pivot: Understanding internal computations rather than just behavior
Major Contribution - Causal Scrubbing:
- Innovation: Rigorous testing framework for interpretability hypotheses
- Method: Resample or ablate the activations a proposed explanation treats as irrelevant, then check whether behavior is preserved
- Impact: Raised standards across interpretability field
- Adoption: Used by Anthropic, multiple academic groups
- Citations: 150+ across major venues
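A heavily simplified sketch of the core idea, assuming a toy model whose inputs stand in for internal components: resample everything the hypothesis claims is irrelevant and measure whether behavior is preserved. Real causal scrubbing works over a correspondence between a hypothesis graph and the model’s internal computation graph; the names and data below are illustrative only.

```python
import random
import statistics

# Toy "model": the output depends only on x[0] and x[1]; x[2] is genuinely unused.
def model(x):
    return 2.0 * x[0] + x[1]

# Hypothesis under test: "only components 0 and 1 matter for this behavior."
CLAIMED_RELEVANT = {0, 1}

def scrub(x, dataset):
    """Resample every claimed-irrelevant component from a randomly chosen sample."""
    donor = random.choice(dataset)
    return [x[i] if i in CLAIMED_RELEVANT else donor[i] for i in range(len(x))]

def scrubbed_behavior_change(dataset, trials: int = 200) -> float:
    """Mean absolute change in output under scrubbing; values near 0 support the hypothesis."""
    diffs = []
    for _ in range(trials):
        x = random.choice(dataset)
        diffs.append(abs(model(x) - model(scrub(x, dataset))))
    return statistics.mean(diffs)

data = [[random.gauss(0, 1) for _ in range(3)] for _ in range(100)]
print(f"behavior change under scrubbing: {scrubbed_behavior_change(data):.4f}")  # ~0 here
```

If the hypothesis wrongly labeled a genuinely relevant component as irrelevant, the measured change would be large, which is how the test rejects bad explanations.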
| Research Area | Contribution | Impact | Follow-up Work |
|---|---|---|---|
| Circuit Analysis | Causal scrubbing methodology | Field standard | Adopted by Anthropic |
| Polysemanticity | Superposition investigation | Theoretical insight | Dictionary learning research |
| Feature Extraction | Automated circuit discovery | Tool development | Open-source implementations |
Phase 3: AI Control (2023-Present)
Core Paradigm: Safety without requiring full alignment
Core Framework:
- Assumption: Capable but potentially misaligned AI systems will be deployed
- Goal: Prevent catastrophic outcomes despite potential scheming
- Analogy: Cybersecurity mindset applied to AI safety
| Control Component | Description | Research Status | Effectiveness |
|---|---|---|---|
| Monitoring | Detect deceptive/scheming behavior | Active research | 70-90% detection in toy models |
| Protocol Design | Safe deployment procedures | Framework developed | Testing in controlled settings |
| Red-team Testing | Adversarial evaluation of controls | Ongoing | Multiple vulnerabilities identified |
| Untrusted Monitoring | AI monitoring AI systems | Early research | Theoretical framework |
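As a rough illustration of how the monitoring and protocol-design pieces fit together, here is a simplified sketch loosely modeled on a trusted-monitoring setup: a weaker trusted monitor scores each output from the untrusted model, and only the most suspicious fraction (the audit budget) goes to human review. The scoring function, audit budget, and sabotage rate are assumptions for illustration, not reported figures.

```python
import random
from dataclasses import dataclass

@dataclass
class Output:
    content: str
    is_sabotaged: bool          # ground truth, unknown to the protocol

AUDIT_BUDGET = 0.02             # assumed: fraction of outputs humans can review

def trusted_monitor_score(output: Output) -> float:
    """Stand-in for a weaker trusted model rating suspiciousness in [0, 1].
    Sabotaged outputs tend to score higher, but imperfectly."""
    base = 0.7 if output.is_sabotaged else 0.3
    return min(1.0, max(0.0, random.gauss(base, 0.15)))

def run_protocol(outputs):
    """Audit the most suspicious outputs up to the budget; the rest ship unreviewed."""
    ranked = sorted(outputs, key=trusted_monitor_score, reverse=True)
    n_audited = max(1, int(AUDIT_BUDGET * len(outputs)))
    audited, shipped = ranked[:n_audited], ranked[n_audited:]
    return {
        "sabotage_caught": sum(o.is_sabotaged for o in audited),
        "sabotage_shipped": sum(o.is_sabotaged for o in shipped),
    }

outputs = [Output("...", is_sabotaged=(random.random() < 0.01)) for _ in range(5000)]
print(run_protocol(outputs))
```

Raising the audit budget or sharpening the monitor’s signal both reduce shipped sabotage; red-team testing probes exactly these trade-offs.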
Current Research Portfolio
AI Control Research
Key Publications:
- “AI Control: Improving Safety Despite Intentional Subversion”↗ (2024)
- Control evaluation methodologies
- Red-team protocol development
Research Questions:
- Can monitoring detect sophisticated deception?
- What deployment protocols minimize catastrophic risk?
- How to design safety cases for potentially misaligned AI?
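The control-evaluation methodology listed above can be caricatured as a Monte Carlo estimate: substitute a red-team attack policy for the untrusted model, simulate many deployments under a given protocol, and report the fraction that end without catastrophe. All rates below are made-up placeholders, not results from Redwood’s evaluations.

```python
import random

EPISODES_PER_DEPLOYMENT = 1000
ATTACK_RATE = 0.002           # assumed: chance the attack policy strikes on an episode
MONITOR_CATCH_PROB = 0.8      # assumed: chance the monitor flags an attack attempt
AUDIT_CATCH_PROB = 0.95       # assumed: chance human audit confirms a flagged attack

def deployment_is_safe() -> bool:
    """Simulate one deployment; it is safe if no attack succeeds before being caught."""
    for _ in range(EPISODES_PER_DEPLOYMENT):
        if random.random() < ATTACK_RATE:                  # attack attempted
            flagged = random.random() < MONITOR_CATCH_PROB
            caught = flagged and random.random() < AUDIT_CATCH_PROB
            if caught:
                return True     # attack caught and escalated: deployment shut down safely
            return False        # attack evades monitoring or audit: catastrophe
    return True                 # deployment ends with no successful attack

def estimate_safety(trials: int = 2000) -> float:
    """Fraction of simulated deployments that end without catastrophe."""
    return sum(deployment_is_safe() for _ in range(trials)) / trials

print(f"estimated safety under this protocol: {estimate_safety():.3f}")
```

The published framework also tracks usefulness (performance when the untrusted model is in fact benign); this sketch covers only the safety side.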
Interpretability Methods
Ongoing Work:
- Activation steering techniques
- Automated circuit discovery
- Deceptive alignment detection
- Understanding situational awareness
Tools Developed:
- Open-source interpretability libraries
- Circuit analysis frameworks
- Causal intervention methodologies
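As a generic example of the activation-steering line of work listed above (not a specific Redwood technique), one common recipe derives a steering vector from the difference of mean activations between two prompt sets and adds it to hidden states at inference time. The shapes and data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM = 16

# Synthetic activations standing in for hidden states collected at one layer
# from prompts with and without some property (e.g., refusal vs. compliance).
acts_with_property = rng.normal(loc=1.0, size=(32, HIDDEN_DIM))
acts_without_property = rng.normal(loc=0.0, size=(32, HIDDEN_DIM))

# Mean-difference steering vector: points from "without" toward "with".
steering_vector = acts_with_property.mean(axis=0) - acts_without_property.mean(axis=0)

def steer(hidden_state: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Add the scaled steering vector to a hidden state; in a real model this
    would be a hook on the residual stream at the chosen layer."""
    return hidden_state + alpha * steering_vector

h = rng.normal(size=HIDDEN_DIM)
shift = float(steer(h) @ steering_vector - h @ steering_vector)
print(f"projection onto the steered direction increased by {shift:.2f}")
```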
Empirical Safety Research
| Project Area | Focus | Status | Collaboration |
|---|---|---|---|
| Sandbagging Detection | Models hiding capabilities | Ongoing | With ARC |
| Jailbreak Robustness | Systematic prompt attacks | Published | Multiple labs |
| Training Dynamics | Misalignment development | Research phase | Academic partners |
| Evaluation Methodology | Safety assessment frameworks | Active | Industry consultation |
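For the sandbagging-detection row above, one simple illustrative check (not necessarily the methodology used in that collaboration) compares accuracy under standard prompting against accuracy after stronger elicitation and asks whether the gap exceeds sampling noise. The accuracies and question count are invented for the example.

```python
import math

def two_proportion_z(acc_standard: float, acc_elicited: float, n: int) -> float:
    """z-statistic for the gap between two accuracies measured on n questions each."""
    pooled = (acc_standard + acc_elicited) / 2
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return (acc_elicited - acc_standard) / se

standard_accuracy = 0.55   # model under ordinary evaluation prompting (invented)
elicited_accuracy = 0.78   # same model after stronger elicitation (invented)
n_questions = 400

z = two_proportion_z(standard_accuracy, elicited_accuracy, n_questions)
if z > 3:
    print(f"z = {z:.2f}: gap far exceeds noise, consistent with hidden capability")
else:
    print(f"z = {z:.2f}: gap is within sampling noise")
```

A large gap is evidence rather than proof, since strong elicitation can teach the skill rather than reveal it; real evaluations control for that.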
Key Personnel and Expertise
Buck Shlegeris (CEO)
- Background: Software engineering, rationalist community
- Philosophy: Empirical focus, rapid iteration, evidence-driven pivots
- Contribution: Organizational strategy, research direction setting
- Public Role: Active in safety forums, clear research communication
Ryan Greenblatt
- Expertise: Machine learning, adversarial evaluation
- Key Work: AI control framework development, red-team methodologies
- Approach: Rigorous threat modeling, security-minded safety research
Research Culture
- Fellows Program: Trained 20+ researchers now at major safety organizations
- Methodology: Empirical validation, quick pivots based on evidence
- Collaboration: Active partnerships with labs and safety organizations
Current State and Trajectory
Organizational Metrics
| Metric | 2021 | 2022 | 2023 | 2024 | Projection 2025 |
|---|---|---|---|---|---|
| Research Staff | 5 | 8 | 12 | 15 | 20-25 |
| Annual Budget | $1M | $2M | $3.5M | $5M | $7M |
| Publications | 0 | 3 | 8 | 12 | 15-20 |
| Fellows Trained | 0 | 5 | 12 | 20 | 30+ |
Research Impact
Citation Metrics:
- Causal scrubbing paper: 150+ citations
- AI control framework: 50+ citations (recent publication)
- Total research impact: 500+ citations across work
Adoption Indicators:
- Anthropic implementing causal scrubbing methods
- Multiple labs exploring control approaches
- Policy discussions referencing control paradigm
Strategic Positioning
2024-2025 Focus:
- Scaling AI control research
- Developing practical safety protocols
- Building industry partnerships for testing
- Expanding empirical evaluation capabilities
Funding Trajectory: Diversifying from grants to include industry partnerships while maintaining research independence.
Key Uncertainties and Debates
The Control Paradigm
Key Questions
| Uncertainty | Pessimistic View | Optimistic View | Current Evidence |
|---|---|---|---|
| Control Effectiveness | Superintelligent AI will circumvent any controls | Layered defenses can achieve acceptable risk | Limited to toy model testing |
| Resource Allocation | Control distracts from alignment work | Control provides crucial safety layer | Mixed views in safety community |
| Timeline Relevance | Won’t be ready before dangerous AI | Can develop alongside capabilities | Active research progress |
| Industry Adoption | Labs won’t implement costly controls | Control more practical than alignment | Early industry interest |
Technical Challenges
Monitoring Limitations:
- Can sophisticated AI hide deceptive behavior?
- How to detect novel attack vectors?
- Scalability of human oversight
Protocol Design:
- Balance between safety and capability
- Economic incentives for control adoption
- Verification of control implementations
Philosophical Questions
Relationship to Alignment:
- Complement or substitute for alignment research?
- Risk of reducing alignment investment?
- Long-term vision for AI safety
Comparative Analysis
Research Approach Comparison
| Organization | Primary Focus | Timeline View | Methodology | Success Theory |
|---|---|---|---|---|
| Redwood | Control + interpretability | Medium pessimism | Empirical, adaptive | Safety without full alignment |
| Anthropic | Constitutional AI, scaling | Cautious optimism | Large-scale experiments | Alignment through training |
| ARC | Evaluation, alignment theory | Varied | Theory + evaluation | Early detection and preparation |
| MIRI | Governance pivot | High pessimism | Theoretical (paused) | Coordination over technical |
Unique Positioning
Distinguishing Features:
- Only major org focused primarily on control
- Empirical methodology with willingness to pivot
- Security-minded approach to safety
- Practical deployment focus
Strategic Advantages:
- Complementary to alignment research
- More implementable in near-term
- Appeals to industry safety concerns
- Builds on existing security practices
Sources and Resources
Primary Research Publications
| Publication | Year | Impact | Focus |
|---|---|---|---|
| Causal Scrubbing Paper↗ | 2023 | 150+ citations | Interpretability methodology |
| AI Control Framework↗ | 2024 | Foundational | Control paradigm |
| Adversarial Robustness Studies↗ | 2022-2023 | Field building | Empirical safety |
Organizational Resources
| Type | Resource | Access |
|---|---|---|
| Research Portal | Redwood Research↗ | Public |
| Publications | Research Publications↗ | Open access |
| Code/Tools | GitHub Repository↗ | Open source |
| Fellowship | Research Fellowship↗ | Application-based |
External Analysis
| Source | Focus | Assessment |
|---|---|---|
| Open Philanthropy↗ | Funding analysis | Major safety contribution |
| Alignment Forum↗ | Technical discussion | Mixed reception of control approach |
| AI Safety Newsletter↗ | Regular coverage | Positive on empirical methodology |
| Future of Humanity Institute↗ | Academic perspective | Valuable addition to safety portfolio |
Related Safety Organizations
| Organization | Relationship | Collaboration Type |
|---|---|---|
| Anthropic | Research collaboration | Interpretability methods sharing |
| ARC | Complementary focus | Evaluation methodology |
| CHAI | Academic partnership | Research collaboration |
| Apollo Research | Similar approach | Evaluation and testing |
Policy and Governance Context
| Institution | Engagement | Focus |
|---|---|---|
| UK AISI | Consultation | Control methodologies |
| US AISI | Framework discussion | Deployment safety |
| NIST AI RMF | Input provision | Risk management |
| Industry Standards | Development participation | Safety protocols |