AI Capabilities Metrics


This page tracks concrete, measurable indicators of AI capabilities across multiple domains through systematic benchmark analysis. Understanding capability trajectories is critical for forecasting transformative AI timelines, anticipating safety challenges, and evaluating whether alignment techniques scale with emerging capabilities.

Key Finding: Most benchmarks show exponential progress from 2023-2025, with frontier models achieving 86-96% performance on language understanding (MMLU), coding (HumanEval), and math (GSM8K). However, significant gaps persist in robustness, adversarial resistance, and sustained multi-day task completion, indicating a disconnect between benchmark performance and production reliability.

Safety Implication: Rapid capability advancement outpaces development of reliable evaluation methods, creating blind spots in AI risk assessment as models approach human-level performance on narrow benchmarks while exhibiting unpredictable behaviors in real-world deployment. This evaluation-reality gap poses critical challenges for alignment research and safety validation.

The recent o3 release achieved 87.5% on ARC-AGI, roughly a 3.5x improvement over o1-preview, bringing frontier models to the threshold of AGI capability markers on several benchmarks simultaneously.


Risk: Capability Progress vs. Safety Evaluation


Language Understanding & General Knowledge


MMLU (Massive Multitask Language Understanding) consists of 15,908 multiple-choice questions spanning 57 subjects, from STEM fields to humanities, serving as the primary benchmark for general knowledge assessment.

| Model | Release Date | MMLU Score | Performance Gap | Progress Rate |
|---|---|---|---|---|
| GPT-3 (175B) | June 2020 | 43.9% | -45.9% vs human | Baseline |
| GPT-4 | March 2023 | 86.4% | -3.4% vs human | +42.5% |
| Gemini 1.0 Ultra | Dec 2023 | 90.0% | +0.2% vs human | 5-shot evaluation |
| Claude 3.5 Sonnet | June 2024 | 88.3% | -1.5% vs human | Near saturation |
| OpenAI o1 | Sept 2024 | 92.3% | +2.5% vs human | Clear saturation |
| OpenAI o3 | Dec 2024 | 96.7% | +6.9% vs human | Super-human |

Human Expert Baseline: 89.8% (established through expert evaluation by Hendrycks et al.)
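
The scores above are simple answer-choice accuracies. A minimal sketch of how a multiple-choice benchmark like MMLU is typically scored, assuming a hypothetical `ask_model` function standing in for any chat/completion API:

```python
from typing import Callable

def score_mcq_benchmark(
    questions: list[dict],            # each: {"question", "choices", "answer"}
    ask_model: Callable[[str], str],  # hypothetical model call returning a letter, e.g. "B"
) -> float:
    """Return accuracy: the fraction of questions whose predicted letter matches the key."""
    correct = 0
    for q in questions:
        letters = "ABCD"[: len(q["choices"])]
        prompt = (
            q["question"] + "\n"
            + "\n".join(f"{letter}. {choice}" for letter, choice in zip(letters, q["choices"]))
            + "\nAnswer with a single letter."
        )
        prediction = ask_model(prompt).strip()[:1].upper()
        correct += prediction == q["answer"]
    return correct / len(questions)
```

Reported numbers differ mainly in prompting details (e.g., the 5-shot evaluation noted for Gemini 1.0 Ultra), not in the scoring itself.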

Critical Observations:

MMLU-Pro: Introduced as a harder variant to address saturation, featuring more complex reasoning requirements and a reduced guessing advantage.

| Model | MMLU-Pro Score | Performance vs. MMLU | Saturation Status |
|---|---|---|---|
| GPT-4o | 72.6% | -13.8% difficulty gap | Moderate headroom |
| Claude 3.5 Sonnet | 78.0% | -10.3% difficulty gap | Approaching saturation |
| OpenAI o3 | 92.1% | -4.6% gap | Near saturation |

Evaluation Evolution: MMLU-Pro approached saturation within 18 months of its introduction, demonstrating the accelerating pace of capability advancement. SimpleQA and other fact-grounded benchmarks now serve as primary discriminators.


ARC-AGI (Abstraction and Reasoning Corpus) contains 800+ visual reasoning tasks designed to test general intelligence through pattern recognition and abstraction, widely considered the most reliable AGI capability indicator.

| Model | ARC-AGI Score | Human Performance | AGI Assessment | Breakthrough Factor |
|---|---|---|---|---|
| Human Baseline | 85% | Reference standard | AGI threshold | |
| GPT-4o (2024) | 9.2% | Far below threshold | Not AGI | |
| Claude 3.5 Sonnet | 14.7% | Below threshold | Not AGI | |
| OpenAI o1-preview | 25% | Approaching threshold | Early AGI signals | 2.7x improvement |
| OpenAI o3 | 87.5% | Exceeds human baseline | AGI capability achieved | 3.5x over o1 |

Critical Breakthrough: o3’s 87.5% ARC-AGI score makes it the first model to exceed the 85% human baseline on this benchmark, marking a potential AGI milestone per François Chollet and Mike Knoop’s analysis.

Validation Concerns:

  • Test-time compute scaling: o3 required $10,000+ per task using massive inference compute
  • Efficiency gap: 1000x more expensive than human-equivalent performance (see the back-of-envelope estimate after this list)
  • Generalization uncertainty: Performance on holdout sets vs. public benchmarks unknown
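
A back-of-envelope check of the efficiency-gap figure above, with the $10,000-per-task cost taken from the text and the human-side numbers as illustrative assumptions:

```python
# Illustrative arithmetic only; the o3 figure comes from the text above, while the
# human time and hourly rate are assumed placeholders.
o3_cost_per_task = 10_000        # USD, high-compute ARC-AGI configuration
human_minutes_per_task = 10      # assumed time for a typical ARC puzzle
human_hourly_rate = 60           # assumed USD/hour for a skilled solver

human_cost_per_task = human_hourly_rate * human_minutes_per_task / 60  # = $10
efficiency_gap = o3_cost_per_task / human_cost_per_task                # = 1000x
print(f"~{efficiency_gap:.0f}x more expensive than a human solver")
```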

MATH Dataset Performance Evolution:

| Model Type | MATH Score | Improvement vs 2023 | Human Baseline Comparison |
|---|---|---|---|
| GPT-4 (2023) | 42.5% | Baseline | Human competitive (40%) |
| OpenAI o1 | 83.3% | +40.8% | 2.1x human performance |
| OpenAI o3 | 96.7% | +54.2% | 2.4x human performance |

Competition Mathematics (AIME 2024):

  • o1-preview: Scored in the 83rd percentile among human competitors
  • o3: Achieved 96.7% accuracy, surpassing 99% of human mathematical competition participants

Implication for Scientific Research: Mathematical breakthrough capability suggests potential for automated theorem proving and advanced scientific reasoning, though formal verification gaps persist.


HumanEval contains 164 Python programming problems testing code generation from natural language specifications, serving as the standard coding benchmark.

| Model | HumanEval Score | EvalPlus Score | Robustness Gap | Progress Notes |
|---|---|---|---|---|
| OpenAI o3 | 96.3% | 89.2% | -7.1% | Near-perfect |
| OpenAI o1 | 92.1% | 89.0% | -3.1% | Strong performance |
| Claude 3.5 Sonnet | 92.0% | 87.3% | -4.7% | Balanced |
| Qwen2.5-Coder-32B | 89.5% | 87.2% | -2.3% | Specialized model |
| Historical (2021) | Codex: 28.8% | | | Initial baseline |

Progress Rate: 28.8% → 96.3% in 3.5 years (2021-2024), representing the fastest benchmark progression observed across all domains.

Robustness Analysis: The persistent 3-7% gap between HumanEval and EvalPlus (which augments each problem with additional test cases) reveals ongoing reliability challenges and raises concerns for software safety applications.
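
HumanEval and EvalPlus scores like those above are conventionally reported as pass@k (the table values are typically pass@1, the strictest setting). A minimal implementation of the standard unbiased pass@k estimator introduced alongside HumanEval (Chen et al., 2021); EvalPlus applies the same metric over an expanded test suite, which is where the 3-7% robustness gap appears:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples generated for a problem
    c: number of samples passing all unit tests
    k: sample budget being scored
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one passing sample
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples, 12 passing -> pass@1 = 0.06, pass@10 ≈ 0.47
print(pass_at_k(200, 12, 1), pass_at_k(200, 12, 10))
```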

SWE-bench: Contains 2,294 real GitHub issues from open-source repositories, testing actual software engineering capabilities.

| Benchmark Version | 2024 Best | 2025 Current | Model | Improvement Factor |
|---|---|---|---|---|
| SWE-bench Full | 12.3% (Devin) | 48.9% (o3) | OpenAI o3 | 4.0x |
| SWE-bench Lite | 43.0% (Multiple) | 71.7% (o3) | OpenAI o3 | 1.7x |
| SWE-bench Verified | 33.2% (Claude) | 71.2% (o3) | OpenAI o3 | 2.1x |

Key Insights:

  • Capability leap: o3 represents a 4x improvement over 2024’s best autonomous coding systems
  • Remaining gaps: Even 71% success on curated problems indicates significant real-world deployment limitations
  • Agent orchestration: Best results achieved through sophisticated multi-agent workflows rather than single model inference

Competitive Programming Performance:

| Competition | o3 Performance | Human Baseline | Competitive Ranking |
|---|---|---|---|
| Codeforces | 2727 Elo rating | ~1500 Elo (average) | ~99.8th percentile (top ~175 globally) |
| IOI (International Olympiad) | Gold medal equivalent | Variable by year | Elite competitive level |
| USACO | Advanced division | Beginner-Advanced | Top tier |

Significance: Programming competition success demonstrates sophisticated algorithmic thinking and optimization capabilities relevant to self-improvement and autonomous research.


Autonomous Task Performance & Agent Capabilities


50%-task-completion time horizon: The duration of tasks that AI systems can complete with 50% reliability, serving as a key metric for practical autonomous capability.

| Model | Release Date | 50% Time Horizon | Doubling Period | Performance Notes |
|---|---|---|---|---|
| Early models | 2019 | ~5 minutes | | Basic task completion |
| GPT-4 | March 2023 | ~15 minutes | 7 months (historical) | Reasoning breakthrough |
| Claude 3.5 Sonnet | June 2024 | ~45 minutes | 5-6 months trend | Planning advancement |
| OpenAI o1 | Sept 2024 | ~90 minutes | Accelerating | Reasoning models |
| Projected (o3 class) | Dec 2024 | ~3 hours | 4-month doubling | Agent workflows |

Projections Based on Current Trends (the doubling arithmetic is sketched after this list):

  • Conservative (6-month doubling): Day-long autonomous tasks by late 2027
  • Accelerated (4-month trend): Day-long tasks by mid-2026
  • Critical threshold: Week-long reliable task completion by 2027-2029
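
A minimal sketch of the doubling model behind these projections; the ~3-hour starting point and the 4- and 6-month doubling periods come from the table and list above, everything else is illustrative:

```python
# Exponential time-horizon model:
#   horizon(t) = horizon_0 * 2 ** (months_elapsed / doubling_period_months)

def projected_horizon(start_hours: float, months_elapsed: float, doubling_months: float) -> float:
    """50% time horizon after a given number of months, assuming a constant doubling period."""
    return start_hours * 2 ** (months_elapsed / doubling_months)

# Growth implied by the two trend lines above:
for doubling_months in (6, 4):
    print(f"{doubling_months}-month doubling: "
          f"{2 ** (12 / doubling_months):.0f}x horizon growth per year")
```

A 6-month doubling period implies roughly 4x growth in the time horizon per year; a 4-month period implies roughly 8x.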

SWE-Agent Autonomous Success Rates:

| Task Category | 6-month Success Rate | 12-month Success Rate | Key Limitations | Safety Implications |
|---|---|---|---|---|
| Code Generation | 85% | 95% | Requirements clarity | Automated vulnerability introduction |
| Bug Fixes | 65% | 78% | Legacy system complexity | Critical system modification |
| Feature Implementation | 45% | 67% | Cross-component integration | Unintended behavioral changes |
| System Architecture | 15% | 23% | Long-term consequences | Fundamental security design flaws |

Critical Finding: Success rates show clear inverse correlation with task complexity and planning horizon, indicating fundamental limitations in long-horizon planning despite benchmark achievements.

| Domain | Current Success | 2025 Projection | Critical Success Factors | Safety Concerns |
|---|---|---|---|---|
| Software Development | 67% | 85% | Clear specifications | Code security, backdoors |
| Research Analysis | 52% | 72% | Data access, validation | Biased conclusions, fabrication |
| Financial Analysis | 23% | 35% | Regulatory compliance | Market manipulation potential |
| Administrative Tasks | 8% | 15% | Human relationship management | Privacy, compliance violations |
| Creative Content | 91% | 97% | Quality evaluation metrics | Misinformation, copyright |

Context Window Expansion:

| Model Family | 2022 | 2023 | 2024 | 2025 | Growth Factor |
|---|---|---|---|---|---|
| GPT | 4K | 8K | 128K | 2M (o3) | 500x |
| Claude | | 100K | 200K | 200K | 2x |
| Gemini | | | 1M | 2M | 2x |

Effective Context Utilization:

| Model | Advertised Capacity | Effective Utilization | Performance at Limits | Validation Method |
|---|---|---|---|---|
| o1/o3 class | 2M tokens | ~1.8M tokens | <10% degradation | Chain-of-thought maintained |
| Claude 3.5 Sonnet | 200K tokens | ~190K tokens | <5% degradation | Needle-in-haystack |
| Gemini 2.0 Flash | 2M tokens | ~1.5M tokens | 15% degradation | Community testing |
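
The "Needle-in-haystack" validation listed above can be sketched roughly as follows; `query_model` is a hypothetical stand-in for a long-context model call, and the filler and needle text are illustrative:

```python
import random
from typing import Callable

FILLER = "The quick brown fox jumps over the lazy dog. "   # illustrative padding
NEEDLE = "The secret launch code is 7401-alpha."           # illustrative fact to retrieve
QUESTION = "\n\nWhat is the secret launch code? Answer with the code only."

def needle_trial(query_model: Callable[[str], str], approx_tokens: int, depth: float) -> bool:
    """Bury the needle at a relative depth (0.0-1.0) inside ~approx_tokens of filler
    and check whether the model retrieves it."""
    n_sentences = max(1, approx_tokens // 10)   # crude estimate: ~10 tokens per filler sentence
    sentences = [FILLER] * n_sentences
    sentences.insert(int(depth * n_sentences), NEEDLE)
    return "7401-alpha" in query_model("".join(sentences) + QUESTION)

def retrieval_rate(query_model: Callable[[str], str], approx_tokens: int, trials: int = 20) -> float:
    """Fraction of random-depth trials in which the needle is retrieved."""
    hits = sum(needle_trial(query_model, approx_tokens, random.random()) for _ in range(trials))
    return hits / trials
```

Degradation figures like those in the table correspond to how this retrieval rate (or a similar score) falls off as the context length approaches the advertised limit.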

Safety Implications of Massive Context:

  • Enhanced situational awareness through comprehensive information integration
  • Ability to process entire software projects for vulnerability analysis
  • Comprehensive document analysis enabling sophisticated deceptive alignment strategies
  • Long-term conversation memory enabling persistent relationship manipulation

Native Multimodal Integration:

| Model | Native Integration |
|---|---|
| Gemini 2.0 Flash | Unified architecture |
| GPT-4o | Real-time processing |
| OpenAI o1 | Limited (text-vision focus) |
| Claude 3.5 Sonnet | Vision-language only |

MMMU (Massive Multi-discipline Multimodal Understanding): College-level tasks requiring integration of text, images, and diagrams.

| Model | MMMU Score | Gap to Human Expert (82.6%) | Annual Progress | Saturation Timeline |
|---|---|---|---|---|
| OpenAI o1 (2024) | 78.2% | -4.4% | Near-human | 6 months to parity |
| Google Gemini 1.5 | 62.4% | -20.2% | | 12-18 months |
| Annual Improvement | +15.8% | 76% gap closed | Accelerating | |

Real-Time Processing Capabilities:

| Application | Latency Requirement | Current Best | Model | Deployment Readiness |
|---|---|---|---|---|
| Voice Conversation | <300ms | 180ms | Gemini 2.0 Flash | Production ready |
| Video Analysis | <1 second/frame | 200ms | GPT-4o | Beta deployment |
| AR/VR Integration | <20ms | 50ms | Specialized models | Research phase |
| Robotics Control | <10ms | 100ms | Not achieved | Development needed |

Scientific Discovery & Research Capabilities


AlphaFold Global Scientific Impact (2020-2025):

| Impact Metric | 2024 Value | 2025 Projection | Global Significance | Transformation Factor |
|---|---|---|---|---|
| Active Researchers | 3.2 million | 4.5 million | Universal adoption | 150x access increase |
| Academic Citations | 47,000+ | 65,000+ | Most cited AI work | 15x normal impact |
| Drug Discovery Programs | 1,200+ | 2,000+ | Pharmaceutical industry | 3x traditional methods |
| Clinical Trial Drugs | 6 | 15-20 | Direct medical impact | First AI-designed medicines |

Molecular Interaction Modeling:

| Interaction Type | Prediction Accuracy | Previous Methods | Improvement Factor | Applications |
|---|---|---|---|---|
| Protein-Ligand | 76% | ~40% | 1.9x | Drug design |
| Protein-DNA | 72% | ~45% | 1.6x | Gene regulation |
| Protein-RNA | 69% | ~30% | 2.3x | Therapeutic RNA |
| Complex Assembly | 67% | ~25% | 2.7x | Vaccine development |

FunSearch Mathematical Discoveries (2024):

  • Cap Set Problem: Found new constructions improving 20-year-old bounds
  • Bin Packing: Discovered novel online algorithms exceeding previous best practices
  • Combinatorial Optimization: Generated proofs for previously unknown mathematical relationships

Research Paper Generation:

| Capability | Current Performance | Human Comparison | Reliability Assessment | Domain Limitations |
|---|---|---|---|---|
| Literature Review | 85% quality | Competitive | High accuracy | Requires fact-checking |
| Hypothesis Generation | 72% novelty | Above average | Medium reliability | Domain knowledge gaps |
| Experimental Design | 45% feasibility | Below expert | Low reliability | Context limitations |
| Data Analysis | 81% accuracy | Near expert | High reliability | Statistical validity |

Critical Assessment: While AI accelerates research processes, breakthrough discoveries remain primarily human-driven with AI assistance rather than AI-originated insights, indicating gaps in creative scientific research capabilities.


Adversarial Attack Resistance (2025 Assessment)


Scale AI Adversarial Robustness: 1,500+ human-designed attacks across risk categories.

| Attack Category | Success Rate vs. Best Models | Defense Effectiveness | Research Priority | Progress Trend |
|---|---|---|---|---|
| Prompt Injection | 75-90% | Minimal | Critical | Worsening |
| Multimodal Jailbreaking | 80-95% | Largely undefended | Critical | New vulnerability |
| Chain-of-Thought Manipulation | 60-75% | Moderate | High | Emerging |
| Adversarial Examples | 45-65% | Some progress | Medium | Arms race |
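
The figures above are attack success rates (ASR): the fraction of adversarial prompts that elicit a policy-violating response. A minimal, hypothetical harness for computing per-category ASR; `query_model` and `violates_policy` are assumed stand-ins for a model client and a violation judge, and the attack prompts would come from a red-team set such as the 1,500+ human-designed attacks described above:

```python
from collections import defaultdict
from typing import Callable

def attack_success_rates(
    attacks: list[dict],                      # each: {"category": str, "prompt": str}
    query_model: Callable[[str], str],        # hypothetical model client
    violates_policy: Callable[[str], bool],   # hypothetical judge: did the attack succeed?
) -> dict[str, float]:
    """Compute the fraction of successful attacks per attack category."""
    successes: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for attack in attacks:
        response = query_model(attack["prompt"])
        totals[attack["category"]] += 1
        successes[attack["category"]] += violates_policy(response)
    return {category: successes[category] / totals[category] for category in totals}
```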

Constitutional AI Defense Analysis:

| Defense Method | Robustness Gain | Capability Loss | Implementation Cost | Adoption Rate |
|---|---|---|---|---|
| Constitutional Training | +32% | -8% | High | 60% (major labs) |
| Adversarial Fine-tuning | +45% | -15% | Very High | 25% (research) |
| Input Filtering | +25% | -3% | Medium | 85% (production) |
| Output Monitoring | +18% | <1% | Low | 95% (deployed) |

Persistent Vulnerabilities & Attack Evolution


Automated Jailbreaking Systems:

| Attack System | Success Rate 2024 | Success Rate 2025 | Evolution Factor | Defense Response |
|---|---|---|---|---|
| GCG | 65% | 78% | Automated scaling | Minimal |
| AutoDAN | 52% | 71% | LLM-generated attacks | Reactive patching |
| Multimodal Injection | 89% | 94% | Image-text fusion | No systematic defense |
| Gradient-based Methods | 43% | 67% | White-box optimization | Research countermeasures |

Critical Security Gap: Attack sophistication is advancing faster than defense mechanisms, particularly for multimodal and chain-of-thought systems, creating escalating risks for the safe deployment of AI systems.


Capability Threshold Milestones:

| Capability Threshold | Achievement Date | Model | Significance Level | Next Barrier |
|---|---|---|---|---|
| AGI Reasoning (ARC-AGI 85%+) | Dec 2024 | OpenAI o3 | AGI milestone | Efficiency scaling |
| Human-Level Programming | Sept 2024 | OpenAI o1 | Professional capability | Real-world deployment |
| PhD-Level Mathematics | Sept 2024 | OpenAI o1 | Academic expertise | Theorem proving |
| Multimodal Integration | June 2024 | GPT-4o | Practical applications | Real-time robotics |
| Expert-Level Knowledge | March 2023 | GPT-4 | General competence | Specialized domains |

Chinchilla Scaling vs. Observed Performance:

| Model Generation | Predicted Performance | Actual Performance | Scaling Factor | Explanation |
|---|---|---|---|---|
| GPT-4 | 85% MMLU (predicted) | 86.4% MMLU | 1.02x | On-trend |
| o1-class | 87% MMLU (predicted) | 92.3% MMLU | 1.06x | Reasoning breakthrough |
| o3-class | 93% MMLU (predicted) | 96.7% MMLU | 1.04x | Test-time compute |
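
The "predicted" column presumably derives from Chinchilla-style scaling fits. For reference, a sketch of the parametric loss law from Hoffmann et al. (2022), using approximately the constants fitted in that paper; note that it predicts pretraining loss, and translating loss into an MMLU score requires an additional benchmark-specific calibration, which is exactly where reasoning training and test-time compute break the trend:

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Parametric scaling law L(N, D) = E + A / N**alpha + B / D**beta
    (Hoffmann et al., 2022); constants are approximately the fitted values from the paper."""
    E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Example: a 70B-parameter model trained on 1.4T tokens (Chinchilla's own budget)
print(chinchilla_loss(70e9, 1.4e12))   # ≈ 1.94
```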

Post-Training Enhancement Impact:

  • Reinforcement Learning from Human Feedback (RLHF): 5-15% capability improvements
  • Constitutional AI: 8-20% safety improvements with minimal capability loss
  • Test-time Compute Scaling: 15-40% improvements on reasoning tasks
  • Multi-agent Orchestration: 25-60% improvements on complex tasks

Saturated Evaluation Targets:

| Benchmark | Saturation Date | Best Performance | Replacement Evaluation | Status |
|---|---|---|---|---|
| MMLU | Dec 2024 | 96.7% (o3) | MMLU-Pro, domain-specific | Exhausted |
| GSM8K | Early 2024 | 97%+ (multiple) | Competition math | Exhausted |
| HumanEval | Mid 2024 | 96.3% (o3) | SWE-bench, real systems | Exhausted |
| HellaSwag | Late 2023 | 95%+ (multiple) | Situational reasoning | Exhausted |

Emerging Evaluation Priorities:

  • Real-world deployment reliability: Multi-day task success rates
  • Adversarial robustness: Systematic attack resistance
  • Safety alignment: Value preservation under capability scaling
  • Efficiency metrics: Performance per FLOP and per dollar cost

| Uncertainty Domain | Impact on AI Safety | Current Evidence Quality | Research Priority |
|---|---|---|---|
| Deployment Reliability | Critical | Poor (limited studies) | Urgent |
| Emergent Capability Prediction | High | Medium (scaling laws) | High |
| Long-term Behavior Consistency | High | Poor (no systematic tracking) | High |
| Adversarial Robustness Scaling | Medium | Poor (conflicting results) | Medium |

AGI Timeline Assessments Post-o3:

| Expert Category | AGI Timeline | Key Evidence | Confidence Level |
|---|---|---|---|
| Optimistic Researchers | 2025-2027 | o3 ARC-AGI breakthrough | High |
| Cautious Researchers | 2028-2032 | Efficiency and deployment gaps | Medium |
| Conservative Researchers | 2030+ | Real-world deployment limitations | Low |

Crux: Test-Time Compute Scaling:

  • Scaling optimists: Test-time compute represents sustainable capability advancement
  • Scaling pessimists: Cost prohibitive for practical deployment ($10K per difficult task)
  • Evidence: o3 required 172x more compute than o1 for ARC-AGI performance gains

Alignment Research Prioritization Debates:

| Research Direction | Current Investment | Capability-Safety Gap | Priority Assessment |
|---|---|---|---|
| Interpretability | High | Widening | Critical |
| Robustness Training | Medium | Stable | High |
| AI Alignment Theory | Medium | Widening | |