Algorithms (AI Capabilities): Research Report

| Finding | Key Data | Implication |
| --- | --- | --- |
| Efficiency doubling time | Compute needed for fixed performance halves every 8 months (95% CI: 5-14 months) | Faster than Moore’s Law; major contributor to AI progress |
| Scale-dependent progress | 91% of gains from Transformers + Chinchilla when extrapolated to the 2025 frontier | Limits to compute scaling may slow algorithmic innovation |
| Software efficiency gains | 23x improvement from architecture + MoE + speculative decoding + KV caching | Outperforms hardware improvements by an order of magnitude |
| Attribution analysis | 60-95% of progress from compute/data scaling; only 5-40% from novel algorithms | Compute scaling historically more important than algorithms |
| Recent acceleration | Post-training methods (RLHF, distillation) add 3-16x efficiency gains | Catch-up progress may be 16-60x/year when including post-training |
| Governance challenge | Algorithms diffuse instantly through papers/code; cannot be physically controlled | Shifts governance focus to compute, data, and evaluation protocols |

Algorithmic progress in AI refers to improvements in methods, architectures, and techniques that enable more efficient conversion of computational resources into capabilities. Research measuring algorithmic efficiency in language models finds that the compute required to reach fixed performance has halved roughly every 8 months from 2012-2023, significantly faster than Moore’s Law. However, recent analysis reveals that much of this progress is scale-dependent: innovations like the Transformer architecture and Chinchilla scaling laws account for 91% of efficiency gains when extrapolated to frontier compute scales, but provide minimal benefits at smaller scales.

Attribution analysis suggests that compute and data scaling have contributed 60-95% of performance improvements, with novel algorithms responsible for only 5-40%. Software optimizations deliver substantial gains—23x improvement through architectural enhancements, Mixture-of-Experts approaches, speculative decoding, and KV caching—outperforming hardware efficiency improvements by an order of magnitude. Recent developments in post-training methods add another 3-16x efficiency gains, suggesting catch-up algorithmic progress may be 16-60x per year when combining pre-training and post-training innovations.

Key architectural advances include Chinchilla’s compute-optimal 20:1 token-to-parameter ratio, Grouped-Query Attention for memory efficiency, DeepSeekMoE’s fine-grained expert segmentation achieving comparable performance with 40% of computations, and state space models like Mamba offering linear-time complexity for long sequences. However, the coupling between algorithmic progress and compute investment suggests that limits to compute scaling—from energy constraints, semiconductor supply chains, or regulatory restrictions—may substantially slow AI algorithmic innovation. Unlike compute, algorithms diffuse instantly through publications and code, making direct governance nearly impossible and shifting focus to controlling compute, data, and establishing evaluation protocols.


Algorithmic progress encompasses the methods, architectures, and training techniques that determine how efficiently AI systems convert computational resources into capabilities. While compute and data are essential inputs, algorithmic innovations can deliver equivalent capability improvements without requiring proportional increases in hardware or datasets. A more efficient algorithm can achieve the same performance with dramatically less compute, or significantly greater capabilities with the same resources.

The fundamental question for AI governance is how to measure and predict algorithmic progress. If algorithmic efficiency improvements follow predictable trends, they can be incorporated into capability forecasts. If they exhibit sudden breakthroughs or are fundamentally coupled to compute investment, this has significant implications for governance strategies and risk timelines.


Measuring Algorithmic Efficiency: The 8-Month Doubling Time

The most comprehensive measurement of algorithmic progress comes from analysis of over 200 language model evaluations from 2012-2023 on benchmarks like Wikitext and Penn Treebank. This research finds that the compute required to reach a fixed performance level halves roughly every 8 months (95% confidence interval: 5-14 months).

This rate significantly exceeds Moore’s Law (doubling every 18-24 months), indicating that algorithmic improvements have been a major driver of AI capability growth. However, recent analysis challenges the magnitude of these gains when scrutinized at the component level.
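To make this rate concrete, a halving time of d months is equivalent to multiplying the effective compute available for a fixed task by 2^(12/d) each year. A minimal sketch of that conversion, using the central estimate and CI bounds quoted above (the function name is illustrative):

```python
# Convert an efficiency halving time into an annualized multiplier: if the
# compute needed for fixed performance halves every d months, effective
# compute grows by a factor of 2**(12/d) per year from algorithms alone.
def annual_multiplier(halving_months: float) -> float:
    return 2 ** (12 / halving_months)

for d in (5, 8, 14):  # central estimate and 95% CI bounds from the text
    print(f"{d:>2} months -> {annual_multiplier(d):.2f}x effective compute per year")
```

At the central 8-month estimate this is roughly 2.8x per year, compared with roughly 1.4-1.6x per year for hardware doubling every 18-24 months.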

Scale-Dependent vs. Scale-Independent Progress

A critical finding from recent research is that algorithmic progress is highly scale-dependent. Analysis of innovations from 2012-2023 reveals:

| Innovation Type | Measured Impact | Scale Dependence |
| --- | --- | --- |
| Small-scale ablations | Less than 10x total gains | Many innovations provide minimal benefit at small scales |
| LSTMs → Transformers | Major contributor to 91% of gains at frontier | Strong scale dependence; minimal benefit at small scales |
| Kaplan → Chinchilla rebalancing | Major contributor to 91% of gains at frontier | Strong scale dependence |
| Other documented innovations | Less than 10x additional gains | Variable scale dependence |
| Total estimated (2012-2023) | Less than 100x when measured at component level | Challenges earlier 22,000x estimates |

Multiple attribution analyses consistently find that compute and data scaling have contributed more to AI progress than algorithmic innovations:

| Study | Time Period | Compute/Data Contribution | Algorithmic Contribution |
| --- | --- | --- | --- |
| Epoch AI (language models) | 2012-2023 | 60-95% | 5-40% |
| Ho et al. (2024) | 2014-2024 | ~65% (compute scaling 2x as important) | ~35% (efficiency improvements) |
| Epoch AI (computer vision) | Past decade | ~45% compute, ~10% data | ~45% algorithms |

The relative importance of algorithmic improvements has decreased over time in language modeling, suggesting that as the field matures, continued progress may rely more heavily on scaling compute rather than breakthrough innovations.

Recent analysis of production AI systems from May 2024 to May 2025 reveals that software optimizations deliver 23x efficiency improvements, significantly outperforming hardware advances:

| Optimization Category | Contribution | Key Techniques |
| --- | --- | --- |
| Model architecture | 23x improvement | MoE, MLA, quantization, architectural innovations |
| Better utilization | 1.4x improvement | Batching, scheduling, load balancing |
| Hardware improvements | 30% annual cost reduction, 40% energy efficiency | AI accelerators (an order of magnitude slower than software gains) |
| Combined effect | 33x energy reduction per prompt | Software efficiency dominates hardware gains |
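The "combined effect" row is approximately the product of the individual factors, since gains at different layers of the serving stack multiply rather than add:

```python
# Efficiency gains at different layers of the stack compound multiplicatively.
software, utilization = 23, 1.4
print(f"combined ≈ {software * utilization:.0f}x")  # ≈ 32x, in line with the ~33x reported above
```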

Scaling Laws Evolution: From Kaplan to Chinchilla

The evolution of scaling laws represents a major algorithmic advance with significant resource implications:

Kaplan et al. (2020): The Original Scaling Laws

The original scaling laws established that language model performance follows power laws with respect to model size (N), dataset size (D), and compute (C). The authors found that model size was the most important factor, suggesting that larger models trained on relatively fixed datasets would continue to improve.

Hoffmann et al. (2022): Chinchilla Scaling Laws

The Chinchilla paper fundamentally challenged prevailing assumptions by proposing that model size and training tokens must be scaled equally for compute-optimal training:

| Model | Parameters | Training Tokens | Token-to-Parameter Ratio | Assessment |
| --- | --- | --- | --- | --- |
| GPT-3 | 175B | 300B | 1.7:1 | Significantly undertrained |
| Gopher | 280B | 300B | 1.1:1 | Significantly undertrained |
| Chinchilla | 70B | 1.4T | 20:1 | Compute-optimal |
| LLaMA | 7-65B | Scaled appropriately | Follows Chinchilla | Post-Chinchilla architecture |

Impact on the field: Models like LLaMA (Meta, 2023) were explicitly designed following compute-optimal principles, with sizes ranging from 7B to 65B parameters trained on significantly more data than previous models of similar size. The immediate effect was that frontier labs began training smaller, more data-efficient models that achieved competitive or superior performance.
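As a back-of-the-envelope illustration of how the 20:1 rule translates into an allocation decision, the sketch below splits a training-compute budget using the common C ≈ 6·N·D approximation for transformer training FLOPs. This is only the rule of thumb from the table, not the full Chinchilla parametric fit, and the function name is illustrative:

```python
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training budget using C ≈ 6*N*D with D ≈ 20*N (the Chinchilla rule of thumb)."""
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's own budget: 6 * 70e9 params * 1.4e12 tokens ≈ 5.9e23 FLOPs
n, d = chinchilla_optimal(5.9e23)
print(f"params ≈ {n / 1e9:.0f}B, tokens ≈ {d / 1e12:.1f}T")  # ≈ 70B params, 1.4T tokens
```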

Recent critiques: Replication attempts have found issues with the original parametric estimates. Some recent work (like MiniCPM from Tsinghua University) indicates the optimal data size should be 192x larger than model size on average, rather than 20x. Additionally, Chinchilla optimality was defined for training compute, whereas production systems must also consider inference costs—“overtraining” during training can yield better inference performance.

Architectural Innovations: The Efficiency Toolkit

Modern transformer architectures incorporate numerous efficiency improvements beyond the original 2017 design:

| Innovation | Benefit | Adoption |
| --- | --- | --- |
| Multi-Query Attention (MQA) | Shares key/value projections across query heads; significant memory reduction | Early models |
| Grouped-Query Attention (GQA) | Balances quality and efficiency; enables scaling to 100B+ parameters | GPT-4, Llama 3, modern large models |
| Multi-Head Latent Attention (MLA) | Low-rank joint projection; 93% KV cache reduction vs. 67B dense model | DeepSeek-V2/V3 |
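The toy NumPy sketch below illustrates the key/value-sharing idea behind MQA and GQA: several query heads attend against the same key/value head, so the KV cache shrinks by the ratio of query heads to KV heads. It is an illustration of the mechanism, not any production model's implementation, and all shapes are illustrative:

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Toy GQA: q has n_q heads, k/v have only n_kv_heads, and each group of
    n_q // n_kv_heads query heads shares one K/V head.
    Shapes: q (n_q, seq, d); k, v (n_kv_heads, seq, d)."""
    n_q, seq, d = q.shape
    group = n_q // n_kv_heads
    out = np.empty_like(q)
    for h in range(n_q):
        kv = h // group                                  # which shared K/V head this query head uses
        scores = q[h] @ k[kv].T / np.sqrt(d)             # (seq, seq) attention scores
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
        out[h] = weights @ v[kv]
    return out

# 8 query heads sharing 2 K/V heads -> the KV cache is 4x smaller than full multi-head attention
q, k, v = (np.random.randn(*s) for s in [(8, 16, 32), (2, 16, 32), (2, 16, 32)])
print(grouped_query_attention(q, k, v, n_kv_heads=2).shape)  # (8, 16, 32)
```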
| Innovation | Benefit | Technical Detail |
| --- | --- | --- |
| Pre-normalization | Improved gradient flow in deep networks | Normalization before attention/FFN instead of after |
| RMSNorm | Faster computation, reduced parameters | Normalizes only magnitudes, not centering |
| Rotary Positional Embeddings (RoPE) | Better generalization to longer sequences | Encodes relative positions by rotating Q/K vectors |

Mixture-of-Experts: Sparse Activation for Efficiency

Mixture-of-Experts (MoE) architectures represent a fundamental shift in how models scale: instead of running every parameter on every token, a routing network activates only a small subset of expert sub-networks per token, decoupling total parameter count from per-token compute.

DeepSeek’s innovations address traditional MoE challenges around expert specialization:

| Component | Innovation | Result |
| --- | --- | --- |
| Fine-grained expert segmentation | More experts activated at constant compute | More flexible expert combinations; higher specialization |
| Shared expert isolation | Dedicated experts for common knowledge | Reduced redundancy in routed experts |
| Scale demonstration | DeepSeekMoE 16B vs. LLaMA2 7B | Comparable performance with only 40% of computations |
| Frontier scaling | DeepSeekMoE 145B vs. DeepSeek 67B | Comparable performance with 28.5% (possibly 18.2%) of computations |
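A toy sketch of the sparse-activation mechanism that all MoE variants share: a router scores every expert for each token and only the top-k experts are evaluated, so per-token compute is a small fraction of total parameters. This shows the generic mechanism, not DeepSeekMoE's specific fine-grained and shared-expert design; all names and sizes are illustrative:

```python
import numpy as np

def moe_forward(x, gate_w, experts, top_k=2):
    """Toy MoE layer: the gate scores every expert per token, only the top_k
    experts are run, and their outputs are combined with renormalized gate weights.
    Shapes: x (tokens, d); gate_w (d, n_experts); experts: list of (d, d) matrices."""
    logits = x @ gate_w                                # (tokens, n_experts) router scores
    top = np.argsort(logits, axis=-1)[:, -top_k:]      # indices of the chosen experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, top[t]]
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                       # softmax over the chosen experts only
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ experts[e])          # only top_k experts do any work
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
gate_w = rng.normal(size=(8, 16))
experts = [rng.normal(size=(8, 8)) for _ in range(16)]
print(moe_forward(x, gate_w, experts).shape)  # (4, 8); only 2 of 16 experts used per token
```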

DeepSeek-V3 demonstrates MoE efficiency at frontier scale:

| Metric | Value | Comparison |
| --- | --- | --- |
| Total parameters | 671B | - |
| Activated per token | 37B | ~5.5% activation rate |
| Training cost | $5.6M (2.788M H800 GPU-hours) | Order of magnitude cheaper than comparable models |
| Training innovation | First large-scale validation of FP8 mixed precision | 8-bit training for large-scale LLMs |
| Performance | Comparable to leading closed-source models | Outperforms open-source alternatives |
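The headline figures in the table are mutually consistent, as a quick check shows; the per-GPU-hour price below is simply the ratio of the two reported numbers, not an independently sourced figure:

```python
# Sanity-check the DeepSeek-V3 figures quoted above.
total_params, active_params = 671e9, 37e9
print(f"activation rate ≈ {active_params / total_params:.1%}")                # ≈ 5.5%

gpu_hours, reported_cost = 2.788e6, 5.6e6
print(f"implied price ≈ ${reported_cost / gpu_hours:.2f} per H800 GPU-hour")  # ≈ $2.01
```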

Alternative Architectures: Beyond Transformers

While transformers dominate, alternative architectures address specific limitations:

Mamba is based on structured state space sequence (S4) models and addresses transformer limitations in processing long sequences:

| Feature | Transformer | Mamba (SSM) |
| --- | --- | --- |
| Computational complexity | O(n²), quadratic in sequence length | O(n), linear in sequence length |
| Long-range dependencies | Limited by attention window | Efficient handling of arbitrary lengths |
| Selective attention | Built in via the attention mechanism | Added via the selective state space mechanism |
| Hardware efficiency | Mature optimization (CUDA kernels) | Hardware-aware parallel scan for GPUs |
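The sketch below shows the recurrence that gives state space models their linear-time behavior: a fixed-size hidden state is updated once per token, so cost grows with sequence length rather than its square. Mamba additionally makes the transition input-dependent (the "selective" part) and computes the recurrence with a hardware-aware parallel scan; this toy, non-selective version omits both, and all shapes are illustrative:

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal state space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    One pass over the sequence with a fixed-size state gives O(seq_len) cost,
    unlike attention's O(seq_len^2) pairwise scores.
    Shapes: x (seq, d_in); A (d_state, d_state); B (d_state, d_in); C (d_out, d_state)."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                  # one state update per token: linear time
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 4))
A = 0.9 * np.eye(8)                # stable state transition
B = rng.normal(size=(8, 4)) * 0.1
C = rng.normal(size=(2, 8))
print(ssm_scan(x, A, B, C).shape)  # (1024, 2)
```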

Mamba-2 and State Space Duality (SSD): Mamba-2 introduces a mathematical bridge between SSMs and Transformers’ attention mechanism, preserving efficient FLOP counts while dramatically speeding up training via matrix multiplications.

Hybrid approaches: Research from NVIDIA (2024) demonstrates that hybrid models combining transformers and Mamba can outperform pure implementations of either. Jamba, for example, is a hybrid Transformer-Mamba MoE architecture that combines the efficiency of Mamba with transformers’ in-context learning capabilities.

Current limitations: SSMs face challenges with copying long sequences, in-context learning, induction heads, and visual tasks requiring both local and global features. These limitations suggest SSMs are complementary rather than replacement architectures.

Recent analysis suggests that post-training methods add 3-16x efficiency gains beyond pre-training algorithmic progress:

| Method | Estimated Gain | Application |
| --- | --- | --- |
| Pre-training efficiency | 3x/year | Architecture, scaling laws, training techniques |
| Post-training (RLHF, distillation) | 3x/year (informal Anthropic estimate) | Alignment, instruction-following, reasoning |
| Combined algorithmic progress | ~9x/year | Pre-training + post-training |
| Catch-up progress estimate | 16-60x/year | Including all post-2023 innovations |
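These annual multipliers can be translated back into the doubling-time framing used earlier in the report; a minimal sketch of the conversion (the inverse of the 2^(12/d) relation above):

```python
import math

def doubling_time_months(annual_multiplier: float) -> float:
    """Months needed to double effective compute, given an annual efficiency multiplier."""
    return 12 / math.log2(annual_multiplier)

for label, m in [("pre-training only", 3), ("combined", 9), ("catch-up low", 16), ("catch-up high", 60)]:
    print(f"{label:>17}: {m}x/year ≈ doubling every {doubling_time_months(m):.1f} months")
```

Note that the 3x/year pre-training rate corresponds to a doubling time of roughly 7.6 months, consistent with the 8-month estimate discussed above.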

Quantization: Precision-Efficiency Tradeoffs

Quantization reduces model precision to decrease memory and computational requirements:

| Quantization Level | Accuracy Recovery | Deployment Benefit |
| --- | --- | --- |
| 8-bit (INT8) | 99.9% of full precision | Widely adopted; minimal quality loss |
| 4-bit (INT4) | 98.9% of full precision | Significant memory reduction; some quality tradeoff |
| 2-bit (Apple) | Context-dependent | On-device deployment of a 3B model with KV-cache sharing |

Rigorous evaluation across more than 500,000 benchmark evaluations demonstrates that, when properly implemented with appropriate hyperparameter tuning, quantization delivers substantial resource savings without discernible quality degradation. DeepSeek-V3 represents the first large-scale validation of FP8 (8-bit floating point) mixed-precision training for LLMs, suggesting quantization can be applied during training rather than only after it.
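For intuition, the sketch below implements the simplest useful scheme, symmetric per-tensor INT8 weight quantization. Production pipelines typically add per-channel or per-group scales, calibration data, and activation quantization, and FP8 mixed-precision training (as in DeepSeek-V3) is a different mechanism again; this is only an illustration of the memory/precision trade:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: map [-max|w|, +max|w|] onto
    [-127, 127] with a single scale factor."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096, 4096).astype(np.float32)   # a toy weight matrix
q, scale = quantize_int8(w)
err = np.abs(w - dequantize(q, scale)).mean()
print(f"memory: {w.nbytes / 1e6:.0f} MB -> {q.nbytes / 1e6:.0f} MB, mean abs error {err:.4f}")
```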

The combined effect of algorithmic improvements has led to dramatic cost reductions:

| Metric | Time Period | Improvement | Driver |
| --- | --- | --- | --- |
| Inference cost | 24 months | 280x reduction for GPT-3.5-equivalent performance | Primarily software/algorithmic |
| Energy per prompt | May 2024 - May 2025 | 33x reduction | 23x software, 1.4x utilization |
| AI accelerator efficiency | Annual | 30% cost reduction, 40% energy efficiency | Hardware improvements (slower than software) |

Two key trends emerged in 2024-2025:

  1. Test-time compute: Models like OpenAI’s o1 and DeepSeek’s r1 allocate significant computation during inference to improve reasoning, shifting resources from training (one-time cost) to deployment (ongoing cost).

  2. New modeling paradigms: Real-time video models and decentralized training represent architectural innovations extending beyond text.

The test-time compute trend suggests that future algorithmic progress may focus as much on inference-time algorithms (search, verification, self-correction) as on training-time innovations.
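The simplest concrete form of test-time compute is best-of-n sampling against a scoring function: inference cost scales with n while the trained model itself is unchanged. Reasoning models like o1 and r1 use far more sophisticated machinery (long chains of thought, search, verification), so the sketch below is only a minimal illustration of trading inference compute for output quality; generate and score are hypothetical stand-ins:

```python
import random

def best_of_n(generate, score, n=8):
    """Sample n candidate answers and keep the highest-scoring one.
    Inference cost grows with n; the underlying model is unchanged."""
    candidates = [generate() for _ in range(n)]
    return max(candidates, key=score)

# Toy stand-ins for a model and a verifier (hypothetical placeholders).
random.seed(0)
generate = lambda: random.gauss(0.0, 1.0)   # pretend "answer quality" samples
score = lambda ans: ans                     # verifier prefers larger values
print(f"best of 8: {best_of_n(generate, score, n=8):.2f}")
```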


The following factors influence algorithmic progress in AI, organized by strength of influence. This analysis is designed to inform future cause-effect diagram creation for the AI Transition Model.

| Factor | Direction | Type | Evidence | Confidence |
| --- | --- | --- | --- | --- |
| Compute Availability | ↑ Algorithmic Progress | cause | 91% of gains from scale-dependent innovations (Transformers, Chinchilla) | High |
| Academic Research Infrastructure | ↑ Algorithmic Progress | intermediate | Transformer breakthroughs from well-funded labs; pre-training progress 3x/year | High |
| Scaling Law Insights | ↑ Training Efficiency | intermediate | Chinchilla 20:1 rule enabled smaller, better-trained models | High |
| Open Research Culture | ↑ Diffusion Speed | leaf | Papers/code shared within months; instant replication possible | High |
| Hardware Constraints | ↓ Algorithmic Innovation | cause | Scale-dependent progress suggests compute limits slow algorithmic advances | Medium-High |

| Factor | Direction | Type | Evidence | Confidence |
| --- | --- | --- | --- | --- |
| Competition Dynamics | ↑ Innovation Rate | leaf | Frontier labs racing to achieve efficiency gains; open-source pressure | Medium |
| Post-Training Methods | ↑ Practical Capabilities | cause | 3-16x gains from RLHF, distillation; catch-up progress 16-60x/year | Medium |
| Alternative Architectures | Mixed effect | intermediate | Mamba/SSMs offer efficiency for specific tasks but don’t outscale Transformers | Medium |
| Quantization Techniques | ↑ Deployment Efficiency | intermediate | 8-bit: 99.9% accuracy, 4-bit: 98.9%; FP8 training validated at scale | Medium |
| MoE Architectures | ↑ Parameter Efficiency | cause | DeepSeek: 40% of compute for comparable performance; 5.5% activation rate at 671B scale | Medium |

| Factor | Direction | Type | Evidence | Confidence |
| --- | --- | --- | --- | --- |
| Benchmarking Standards | ↑ Measurable Progress | intermediate | Wikitext, Penn Treebank enable progress tracking; may incentivize benchmark-specific optimization | Low-Medium |
| Interdisciplinary Transfer | ↑ Novel Approaches | leaf | Neuroscience, physics inspire architectures (attention from cognitive science, SSMs from control theory) | Low |
| Regulatory Pressure | ↓ Open Publication | leaf | Concerns about capabilities disclosure may slow diffusion; limited evidence | Low |
| Hardware Co-Design | ↑ Practical Efficiency | cause | Mamba’s hardware-aware scan; Transformer optimizations for CUDA | Low-Medium |

| Challenge | Mechanism | Governance Implication |
| --- | --- | --- |
| Instant diffusion | Papers published on arXiv; code on GitHub; implementations within weeks | Cannot create chokepoints in algorithm distribution |
| Independent discovery | Multiple groups discover similar innovations (e.g., attention mechanisms, scaling laws) | Export controls ineffective for fundamental insights |
| Zero marginal cost | Algorithms are information; copying requires no physical resources | Cannot limit access through resource scarcity |
| Embedded in trained models | Model weights implicitly encode algorithmic innovations | Reverse engineering can recover techniques |
| Subjective metrics | Training objectives, loss functions, and optimization choices introduce biases/values | Technical decisions have ethical implications |

AI algorithms are designed using subjective metrics, deterministic models, and probabilistic reasoning, each introducing unique governance challenges:

  • Black box problem: Many AI models function as “black boxes,” making it difficult to interpret decisions and identify bias or unfairness
  • Centralized training data: Reliance on datasets like Wikipedia raises concerns about representational bias and exclusion of diverse perspectives
  • Inconsistent behavior: AI systems may produce unpredictable results due to biased data or flawed algorithms

Algorithmic improvements that reduce compute requirements have dual-use implications:

  • Democratization vs. proliferation: More efficient algorithms lower barriers to entry for beneficial applications but also enable resource-constrained malicious actors
  • Capability surprise: Sudden algorithmic breakthroughs can create capability jumps that outpace safety research and governance preparation
  • Diffusion control: Once published, efficient algorithms cannot be “contained” to authorized users

Given the difficulty of directly controlling algorithms, governance strategies focus on adjacent leverage points:

| Approach | Mechanism | Limitations |
| --- | --- | --- |
| Compute governance | Control hardware access; implement usage thresholds | Scale-dependent algorithms may work around compute limits |
| Data governance | Limit access to high-quality training data | Data can be accumulated over time; synthetic data alternatives |
| Evaluation protocols | Mandatory testing before deployment | Requires cooperation; cannot prevent private development |
| Publication norms | Responsible disclosure; staged release | Voluntary; competitive pressure undermines norms |
| Transparency requirements | Algorithmic auditing; explainability standards | Black-box models limit effectiveness |

| Question | Why It Matters | Current State |
| --- | --- | --- |
| Will algorithmic progress decouple from compute scaling? | Scale-dependent innovations suggest coupling; decoupling would invalidate compute-based governance | 91% of gains scale-dependent; unclear if this generalizes to future innovations |
| What is the ceiling for software efficiency gains? | Determines whether algorithmic progress can continue if compute scaling plateaus | 23x recent gains suggest substantial headroom; theoretical limits unknown |
| How predictable are algorithmic breakthroughs? | Unpredictable breakthroughs create capability surprise; predictable trends enable governance preparation | Transformer (2017) was transformative but not predicted; unclear if an outlier |
| Will post-training methods continue accelerating? | 16-60x/year catch-up progress suggests major gains; unclear if sustainable | Recent trend; insufficient data to establish a long-term rate |
| Can alternative architectures outscale Transformers? | Mamba/SSMs offer efficiency; unclear if competitive at frontier scale | Transformers consistently scale better despite local advantages of alternatives |
| How will test-time compute change the landscape? | Shifts resources from training to inference; changes governance focus | o1/r1 demonstrate importance; implications for compute thresholds uncertain |
| What role does synthetic data play? | Could bypass data governance if high-quality synthetic data enables training | Active research area; quality parity with human data unclear |
| Will quantization reach fundamental limits? | 2-4 bit quantization approaches theoretical minimums | 8-bit validated at scale; 4-bit shows 98.9% recovery; further reduction questionable |

Algorithmic progress could evolve along several distinct pathways with different implications for AI safety:

| Scenario | Mechanism | Timeline | Warning Signs | Governance Implications |
| --- | --- | --- | --- | --- |
| Algorithmic Plateau | Diminishing returns as the field matures; scale-dependent innovations exhaust the design space | 3-7 years | Slowing efficiency gains; architectural innovations providing less than 2x improvements | Capability growth limited by compute/data scaling; governance window extends |
| Breakthrough Discovery | Transformative architectural innovation comparable to Transformers (2017) | Unpredictable | Novel mathematical framework; orders-of-magnitude efficiency gains | Rapid capability jump; potential governance surprise; safety research lags behind |
| Post-Training Dominance | Pre-training efficiency plateaus; post-training methods become the primary driver | 2-5 years | Catch-up progress sustained at 16-60x/year; frontier labs focus on RLHF/reasoning | Deployment compute becomes the critical governance variable |
| Decoupling from Scale | Scale-independent innovations emerge that provide benefits at all compute levels | 3-8 years | Small models approaching larger-model capabilities; efficiency gains at low compute | Compute-based governance becomes less effective; broader access to capabilities |
| Hardware Co-Evolution | Specialized accelerators and algorithms co-designed; hardware-specific advantages | 5-10 years | Mamba-style hardware-aware designs; architecture-specific chips | Fragmentation of AI development; reduced transferability across platforms |


| Model Element | Relationship | Key Insights |
| --- | --- | --- |
| AI Capabilities (Compute) | Multiplicative effect | Algorithmic efficiency acts as a multiplier on compute; 91% of gains are scale-dependent, coupling progress to compute investment |
| AI Capabilities (Adoption) | Enabling | Efficiency improvements (280x inference cost reduction) make deployment economically viable at scale |
| AI Ownership (Companies) | Concentrating | Scale-dependent innovations favor well-resourced labs; but efficiency gains enable smaller players (DeepSeek’s $5.6M training run) |
| AI Ownership (Countries) | Mixed effect | Open publication accelerates global diffusion; but frontier innovations require substantial compute (scale-dependent) |
| Misalignment Potential (AI Governance) | Governance challenge | Cannot directly control algorithm diffusion; must govern via compute, data, evaluations |
| Misalignment Potential (Technical AI Safety) | Bidirectional | Safety research benefits from efficiency (more experimentation); but rapid capability gains can outpace safety |
| Misuse Potential | Dual use | Efficiency improvements democratize access (beneficial for researchers, concerning for misuse) |
| Transition Turbulence (Racing) | Accelerator | Algorithmic breakthroughs create competitive pressure and capability surprise |
| Civilizational Competence (Epistemics) | Measurement challenge | Difficulty measuring/predicting algorithmic progress creates forecasting uncertainty |

The research reveals several strategic considerations for the AI transition:

  1. Scale-dependence brittleness: If 91% of algorithmic efficiency gains depend on frontier compute scales, then limits to compute scaling (energy, supply chain, regulation) may substantially slow algorithmic progress. This makes capability trajectories more brittle than models assuming independent algorithmic and compute progress.

  2. Governance indirection: Direct algorithmic governance is infeasible due to instant diffusion and zero marginal cost replication. Effective governance must work through adjacent leverage points: compute access, data availability, evaluation protocols, and publication norms.

  3. Efficiency paradox: Software optimizations (23x improvement) dramatically outpace hardware advances (30-40% annually), suggesting that even with compute constraints, efficiency gains could continue. This creates uncertainty about whether hardware-based governance can effectively limit capability growth.

  4. Post-training acceleration: The 16-60x/year catch-up progress estimate (including post-training methods) is dramatically higher than the 3x/year pre-training rate. If this continues, post-training innovations may drive capability growth as much or more than pre-training algorithmic advances, shifting governance focus to deployment compute and inference optimization.

  5. Capability surprise risk: The Transformer architecture (2017) was transformative but not predicted by the research community beforehand. If future algorithmic breakthroughs follow similar patterns, capability forecasting based on continuous trends may systematically underestimate discontinuous jumps.

  6. Democratization vs. concentration: Efficiency improvements have contradictory effects. On one hand, they lower barriers to entry (DeepSeek’s $5.6M training cost vs. hundreds of millions for comparable models). On the other hand, scale-dependent innovations favor organizations with access to frontier compute. The net effect on concentration depends on which force dominates.

  7. Test-time compute shift: Models like o1 and r1 demonstrate that significant inference-time computation can improve reasoning capabilities. This shifts resources from training (one-time, governable via thresholds) to deployment (ongoing, harder to monitor). Governance frameworks may need to adapt to regulate deployment compute in addition to training compute.

The algorithmic landscape exhibits rapid change across multiple dimensions (architecture, training methods, post-training techniques, deployment optimization), creating substantial uncertainty for AI governance and safety. The coupling between algorithmic progress and compute investment suggests that compute governance remains the most tractable leverage point, but efficiency gains may provide an “escape valve” that limits the effectiveness of compute-based interventions.