Jan Leike
Background
Jan Leike is a leading AI alignment researcher currently serving as Head of Alignment at Anthropic. He holds a PhD from the Australian National University, where he worked on AI safety under Marcus Hutter.
His career has been defined by practical, empirical approaches to alignment:
- Early work on safe exploration in reinforcement learning
- Pioneering research on learning from human feedback
- Leadership of alignment teams at DeepMind, OpenAI, and now Anthropic
- Focus on scalable methods that can work with current ML paradigms
Career Trajectory
DeepMind (2017-2021)
Worked on some of the first implementations of learning from human feedback, including:
- Safe exploration methods
- Reward modeling
- Scalable agent alignment
OpenAI (2021-2024)
- Joined to lead alignment research
- Co-led the Superalignment team (announced July 2023)
- Secured a commitment of 20% of OpenAI's compute for alignment research
- Departed May 2024 over disagreements about safety prioritization
Anthropic (2024-present)
- Joined as Head of Alignment
- Reunited with former OpenAI colleagues
- Leading alignment research on Claude and future systems
Key Contributions
RLHF Pioneer
Jan was one of the early researchers to demonstrate that reinforcement learning from human feedback (RLHF) could work at scale (a minimal sketch of the core preference-learning step follows the list below):
- Co-authored seminal papers on reward learning
- Showed how to train language models to be helpful and harmless
- The methods he helped develop became standard across the industry
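The core technical move in that line of work is the reward-modeling step: fit a scalar reward model to human preference comparisons, then optimize the policy against it. The PyTorch sketch below is a minimal, illustrative rendering of that idea; the architecture, data, and hyperparameters are placeholder assumptions, not code from any of these papers.

```python
# Minimal sketch of the reward-modeling step at the heart of RLHF:
# fit a scalar reward model so that human-preferred outputs score higher.
# Everything here (model size, data, names) is illustrative, not production code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, n_features: int):
        super().__init__()
        # Stand-in for "a language model with a scalar head".
        self.score = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.score(x).squeeze(-1)  # one scalar reward per example

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: maximize the log-probability that the
    # human-preferred response receives the higher reward.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy training loop on random "feature vectors" standing in for response embeddings.
model = RewardModel(n_features=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    chosen = torch.randn(32, 16)    # embeddings of preferred responses (fake data)
    rejected = torch.randn(32, 16)  # embeddings of dispreferred responses (fake data)
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the full pipeline, the learned reward model then serves as the objective for a reinforcement-learning step (e.g., PPO) over the language model policy.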
Scalable Oversight Research
A core focus of his research is how to supervise AI systems more capable than humans (a toy sketch of the last item, weak-to-strong generalization, follows the list):
- Recursive reward modeling
- AI-assisted human evaluation
- Process supervision vs. outcome supervision
- Weak-to-strong generalization
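Weak-to-strong generalization asks whether a strong model trained only on labels from a weaker supervisor can recover performance closer to its own ceiling. The toy experiment below sketches that setup with small scikit-learn models; the dataset, model choices, and "performance gap recovered" computation are illustrative assumptions rather than the published methodology.

```python
# Toy sketch of the weak-to-strong generalization setup: can a strong student
# trained only on a weak supervisor's labels approach the strong ceiling?
# All details here are illustrative; this is not the published experimental code.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=20, n_informative=5, random_state=0)
X_sup, X_test, y_sup, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# "Weak supervisor": a small, under-trained model, standing in for limited human oversight.
weak = LogisticRegression(max_iter=200).fit(X_sup[:500], y_sup[:500])
weak_labels = weak.predict(X_sup)  # noisy labels produced by the weak supervisor

# "Strong student" trained only on weak labels vs. a ceiling trained on ground truth.
strong_on_weak = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(X_sup, weak_labels)
strong_ceiling = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0).fit(X_sup, y_sup)

acc_weak = accuracy_score(y_test, weak.predict(X_test))
acc_w2s = accuracy_score(y_test, strong_on_weak.predict(X_test))
acc_ceiling = accuracy_score(y_test, strong_ceiling.predict(X_test))

# "Performance gap recovered": how much of the weak-to-ceiling gap the student closes.
# In this toy setting the gap may be small or noisy; the point is the experimental structure.
pgr = (acc_w2s - acc_weak) / (acc_ceiling - acc_weak)
print(f"weak={acc_weak:.3f}  weak-to-strong={acc_w2s:.3f}  ceiling={acc_ceiling:.3f}  PGR={pgr:.2f}")
```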
Superalignment Vision
At OpenAI, co-led (with Ilya Sutskever) the Superalignment team, which aimed to:
- Solve alignment for superintelligent systems
- Use AI to help align even more capable AI
- Achieve this within four years
- Dedicate significant compute resources to the problem
Views on Key Cruxes
Based on public statements and research priorities:
| Crux | Assessment | Date |
|---|---|---|
| Alignment urgency | Very high | 2024 |
| Timeline pressure | Next 3-5 years critical | 2024 |
| Technical tractability | Difficult but solvable | 2024 |
- Alignment urgency: Left OpenAI over concerns about insufficient safety prioritization
- Timeline pressure: Emphasized the need to solve alignment soon
- Technical tractability: Optimistic about scalable oversight approaches
Core Beliefs
- Alignment is urgent: We have limited time to solve this before transformative AI arrives
- Scalable oversight is key: Central challenge is supervising superhuman AI
- Empirical work is essential: Need to test alignment techniques on increasingly capable systems
- Safety must be prioritized: Cannot let capability research consistently outpace safety
- Current methods are insufficient: RLHF and similar techniques won't scale to superintelligence without major improvements
Why He Left OpenAI
In May 2024, Jan departed OpenAI and posted on X (Twitter):
- "Building smarter-than-human machines is an inherently dangerous endeavor"
- "Over the past years, safety culture and processes have taken a backseat to shiny products"
- Expressed concern about compute and priority allocation for safety
This departure, along with Ilya Sutskever's, raised significant questions about OpenAI's commitment to safety research.
Research Focus
Section titled âResearch FocusâCurrent Priorities at Anthropic
- Weak-to-strong generalization: How can weaker systems (including humans) effectively supervise stronger ones?
- Scalable oversight techniques: Making human feedback work for superhuman systems
- Honest AI systems: Ensuring AI systems accurately report their reasoning and limitations
- Automated alignment research: Using AI to help solve alignment
Key Technical Problems
Jan has identified several crucial challenges (a toy illustration of reward hacking follows the list):
- Reward hacking: Systems optimizing proxies rather than true objectives
- Distributional shift: Maintaining alignment in novel situations
- Deceptive alignment: Preventing systems from appearing aligned while pursuing other goals
- Superalignment: Aligning systems smarter than humans
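Reward hacking is easy to demonstrate in miniature: an optimizer pushed hard against an imperfect proxy ends up scoring well on the proxy while doing worse on the objective the proxy was meant to track. The sketch below uses an invented two-dimensional example; none of the specifics come from Leike's work.

```python
# Toy illustration of reward hacking: optimizing a proxy reward drifts away
# from the true objective it was meant to track. The proxy, true reward, and
# setup are invented for illustration only.
import numpy as np

rng = np.random.default_rng(0)

def true_reward(x: np.ndarray) -> float:
    # What we actually care about: stay close to the target point (1, 1).
    return -np.sum((x - 1.0) ** 2)

def proxy_reward(x: np.ndarray) -> float:
    # An imperfect, measurable proxy: tracks the true reward near the target,
    # but also rewards simply making x[0] larger.
    return -np.sum((x - 1.0) ** 2) + 0.5 * x[0]

# Naive hill climbing against the proxy.
x = np.zeros(2)
for _ in range(2000):
    candidate = x + 0.05 * rng.normal(size=2)
    if proxy_reward(candidate) > proxy_reward(x):
        x = candidate

print("final x:", np.round(x, 2))
print("proxy reward:", round(proxy_reward(x), 3), "| true reward:", round(true_reward(x), 3))
# The optimizer settles near x = (1.25, 1.0): proxy reward exceeds its value at
# the true optimum (1, 1), while the true reward is strictly worse there.
```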
Public Communication
Jan is known for:
- Clear, technical communication about alignment challenges
- Willingness to raise concerns publicly
- Engagement on Twitter/X about safety issues
- Focus on concrete, actionable research directions
His departure from OpenAI sparked significant public discussion about AI safety prioritization at major labs.
Strategic Views
Section titled âStrategic ViewsâOn AI Development
- Safety must keep pace: Capability advances should be matched by safety advances
- Need serious compute: Alignment research requires significant computational resources
- Coordination is important: Labs should share safety insights
- Race dynamics are dangerous: Competition that sacrifices safety is unacceptable
On Research Approach
- Empirical and theoretical: Need both practical testing and conceptual work
- Learn from current systems: Can make progress by studying existing models
- Prepare for qualitative jumps: Current techniques may not suffice for superintelligence
- Automate alignment work: Use AI to scale up alignment research itself
Influence and Impact
Section titled âInfluence and ImpactâResearch Impact
- RLHF work influenced every major language model deployment
- Scalable oversight framework guides significant research programs
- Superalignment vision shaped discourse on superintelligence alignment
Field Building
- Mentored numerous alignment researchers
- Built and led multiple alignment teams
- Raised profile of alignment research within major labs
Institutional Influence
- Secured major compute allocation for alignment at OpenAI
- Helped shape Anthropicâs research priorities
- Demonstrated importance of independent safety research
Key Publications
- "Deep Reinforcement Learning from Human Preferences" (2017) - Early RLHF paper
- "Scalable agent alignment via reward modeling" (2018) - Reward learning framework
- "Recursively Summarizing Books with Human Feedback" (2021) - Demonstrating RLHF scaling
- Various blog posts on alignment challenges and approaches
Current Challenges
At Anthropic, Jan faces several key challenges:
- Time pressure: Transformative AI may arrive soon, requiring rapid progress
- Scaling RLHF: Current techniques may not work for superintelligent systems
- Evaluation: How to know if alignment techniques actually work
- Automation: Using AI to help solve alignment before AI systems become too capable to supervise reliably
- Coordination: Ensuring insights are shared across the safety community
Related Pages
- Anthropic (lab)
- OpenAI (lab)
- Dario Amodei (researcher)
- Ilya Sutskever (researcher)
- Paul Christiano (researcher)