Algorithms (AI Capabilities): Research Report
Executive Summary
| Finding | Key Data | Implication |
|---|---|---|
| Efficiency doubling time | 8 months (95% CI: 5-14 months) for same performance with half the compute | Faster than Moore’s Law; major contributor to AI progress |
| Scale-dependent progress | 91% of gains from Transformers + Chinchilla when extrapolated to 2025 frontier | Limits to compute scaling may slow algorithmic innovation |
| Software efficiency gains | 23x improvement from architecture + MoE + speculative decoding + KV caching | Outperforms hardware improvements by order of magnitude |
| Attribution analysis | 60-95% progress from compute/data scaling; only 5-40% from novel algorithms | Compute scaling historically more important than algorithms |
| Recent acceleration | Post-training methods (RLHF, distillation) add 3-16x efficiency gains | Catch-up progress may be 16-60x/year when including post-training |
| Governance challenge | Algorithms diffuse instantly through papers/code; cannot be physically controlled | Shifts governance focus to compute, data, and evaluation protocols |
Research Summary
Algorithmic progress in AI refers to improvements in methods, architectures, and techniques that enable more efficient conversion of computational resources into capabilities. Research measuring algorithmic efficiency in language models finds that the compute required to reach fixed performance has halved roughly every 8 months from 2012-2023, significantly faster than Moore’s Law. However, recent analysis reveals that much of this progress is scale-dependent: innovations like the Transformer architecture and Chinchilla scaling laws account for 91% of efficiency gains when extrapolated to frontier compute scales, but provide minimal benefits at smaller scales.
Attribution analysis suggests that compute and data scaling have contributed 60-95% of performance improvements, with novel algorithms responsible for only 5-40%. Software optimizations deliver substantial gains—23x improvement through architectural enhancements, Mixture-of-Experts approaches, speculative decoding, and KV caching—outperforming hardware efficiency improvements by an order of magnitude. Recent developments in post-training methods add another 3-16x efficiency gains, suggesting catch-up algorithmic progress may be 16-60x per year when combining pre-training and post-training innovations.
Key architectural advances include Chinchilla’s compute-optimal 20:1 token-to-parameter ratio, Grouped-Query Attention for memory efficiency, DeepSeekMoE’s fine-grained expert segmentation achieving comparable performance with 40% of computations, and state space models like Mamba offering linear-time complexity for long sequences. However, the coupling between algorithmic progress and compute investment suggests that limits to compute scaling—from energy constraints, semiconductor supply chains, or regulatory restrictions—may substantially slow AI algorithmic innovation. Unlike compute, algorithms diffuse instantly through publications and code, making direct governance nearly impossible and shifting focus to controlling compute, data, and establishing evaluation protocols.
Background
Algorithmic progress encompasses the methods, architectures, and training techniques that determine how efficiently AI systems convert computational resources into capabilities. While compute and data are essential inputs, algorithmic innovations can deliver equivalent capability improvements without requiring proportional increases in hardware or datasets. A more efficient algorithm can achieve the same performance with dramatically less compute, or significantly greater capabilities with the same resources.
The fundamental question for AI governance is how to measure and predict algorithmic progress. If algorithmic efficiency improvements follow predictable trends, they can be incorporated into capability forecasts. If they exhibit sudden breakthroughs or are fundamentally coupled to compute investment, this has significant implications for governance strategies and risk timelines.
Key Findings
Measuring Algorithmic Efficiency: The 8-Month Doubling Time
The most comprehensive measurement of algorithmic progress comes from analysis of over 200 language model evaluations from 2012-2023 on benchmarks like Wikitext and Penn Treebank. This research finds that the compute required to reach a fixed performance level halves roughly every 8 months (95% confidence interval: 5-14 months).
This rate significantly exceeds Moore’s Law (doubling every 18-24 months), indicating that algorithmic improvements have been a major driver of AI capability growth. However, recent analysis challenges the magnitude of these gains when scrutinized at the component level.
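A back-of-the-envelope illustration of what an 8-month halving time implies, as a minimal sketch assuming the rate stays constant over the whole period (the function name and the 48-month horizon are illustrative choices, not figures from the cited study):

```python
def efficiency_gain(months: float, doubling_time_months: float = 8.0) -> float:
    """Factor by which the compute needed to reach a fixed performance level
    shrinks, assuming a constant halving time for required compute."""
    return 2 ** (months / doubling_time_months)

# Over four years at the central 8-month estimate, and at the CI bounds:
print(efficiency_gain(48))      # ~64x  (8-month halving)
print(efficiency_gain(48, 5))   # ~776x (fast end of the 5-14 month CI)
print(efficiency_gain(48, 14))  # ~11x  (slow end of the CI)
```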
Scale-Dependent vs. Scale-Independent Progress
A critical finding from recent research is that algorithmic progress is highly scale-dependent. Analysis of innovations from 2012-2023 reveals:
| Innovation Type | Measured Impact | Scale Dependence |
|---|---|---|
| Small-scale ablations | Less than 10x total gains | Many innovations provide minimal benefit at small scales |
| LSTMs → Transformers | Major contributor to 91% of gains at frontier | Strong scale dependence; minimal benefit at small scales |
| Kaplan → Chinchilla rebalancing | Major contributor to 91% of gains at frontier | Strong scale dependence |
| Other documented innovations | Less than 10x additional gains | Variable scale dependence |
| Total estimated (2012-2023) | Less than 100x when measured at component level | Challenges earlier 22,000x estimates |
Attribution: Compute vs. Algorithms
Multiple attribution analyses consistently find that compute and data scaling have contributed more to AI progress than algorithmic innovations:
| Study | Time Period | Compute/Data Contribution | Algorithmic Contribution |
|---|---|---|---|
| Epoch AI (language models) | 2012-2023 | 60-95% | 5-40% |
| Ho et al. (2024) | 2014-2024 | ~65% (compute scaling 2x as important) | ~35% (efficiency improvements) |
| Epoch AI (computer vision) | Past decade | ~45% compute, ~10% data | ~45% algorithms |
The relative importance of algorithmic improvements has decreased over time in language modeling, suggesting that as the field matures, continued progress may rely more heavily on scaling compute rather than breakthrough innovations.
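Attribution analyses of this kind typically split growth in “effective compute” between physical compute and algorithmic efficiency in log space. A minimal sketch of that accounting follows; the growth rates plugged in are illustrative assumptions, not figures from the studies cited above:

```python
import math

def attribute_progress(compute_growth_per_year: float, algo_efficiency_per_year: float):
    """Share of effective-compute growth attributable to compute scaling vs.
    algorithmic efficiency, measured in log space (so the shares sum to 1)."""
    log_c = math.log(compute_growth_per_year)
    log_a = math.log(algo_efficiency_per_year)
    total = log_c + log_a
    return log_c / total, log_a / total

# Illustrative inputs: 4x/year physical compute growth and an 8-month
# efficiency halving time (about 2.8x/year).
c_share, a_share = attribute_progress(4.0, 2 ** (12 / 8))
print(f"compute ~{c_share:.0%}, algorithms ~{a_share:.0%}")  # roughly 57% vs. 43%
```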
Software Optimization: The 23x Multiplier
Recent analysis of production AI systems from May 2024 to May 2025 reveals that software optimizations deliver 23x efficiency improvements, significantly outperforming hardware advances:
| Optimization Category | Contribution | Key Techniques |
|---|---|---|
| Model architecture | 23x improvement | MoE, MLA, quantization, architectural innovations |
| Better utilization | 1.4x improvement | Batching, scheduling, load balancing |
| Hardware improvements | ~30% annual cost reduction; ~40% energy-efficiency improvement | AI accelerators (an order of magnitude slower than software gains) |
| Combined effect | 33x energy reduction per prompt | Software efficiency dominates hardware gains |
Scaling Laws Evolution: From Kaplan to Chinchilla
The evolution of scaling laws represents a major algorithmic advance with significant resource implications:
Kaplan et al. (2020): The Original Scaling Laws
The original scaling laws established that language model performance follows power laws with respect to model size (N), dataset size (D), and compute (C). The authors found that model size was the most important factor, suggesting that larger models trained on relatively fixed datasets would continue to improve.
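In compact form, Kaplan et al. fit independent power laws in each input; the exponent values below are the approximate published fits and are quoted here only for orientation:

$$
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C_{\min}) = \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C}
$$

with fitted exponents of roughly $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, and $\alpha_C \approx 0.050$. On the compute-optimal frontier, the paper’s fits implied allocating most of a growing budget to model size (approximately $N \propto C^{0.73}$, $D \propto C^{0.27}$), the recommendation that Chinchilla later revised.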
Hoffmann et al. (2022): Chinchilla Scaling Laws
The Chinchilla paper fundamentally challenged prevailing assumptions by proposing that model size and training tokens must be scaled equally for compute-optimal training:
| Model | Parameters | Training Tokens | Token-to-Parameter Ratio | Assessment |
|---|---|---|---|---|
| GPT-3 | 175B | 300B | 1.7:1 | Significantly undertrained |
| Gopher | 280B | 300B | 1.1:1 | Significantly undertrained |
| Chinchilla | 70B | 1.4T | 20:1 | Compute-optimal |
| LLaMA | 7-65B | 1-1.4T | ~22:1 to ~143:1 | Trained at or beyond the compute-optimal ratio |
Impact on the field: Models like LLaMA (Meta, 2023) were explicitly designed following compute-optimal principles, with sizes ranging from 7B to 65B parameters trained on significantly more data than previous models of similar size. The immediate effect was that frontier labs began training smaller, more data-efficient models that achieved competitive or superior performance.
Recent critiques: Replication attempts have found issues with the original parametric estimates. Some recent work (such as MiniCPM from Tsinghua University) suggests the optimal token-to-parameter ratio may be closer to 192:1 than 20:1. Additionally, Chinchilla optimality was defined for training compute only, whereas production systems must also consider inference costs: “overtraining” a smaller model on more data than is compute-optimal yields cheaper inference at comparable quality.
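A minimal sketch of how the 20:1 rule translates a compute budget into a model/data split, using the standard approximation that training FLOPs $C \approx 6ND$ (both the constant 6 and the fixed 20:1 ratio are simplifications; the full Chinchilla analysis fits the ratio rather than assuming it):

```python
def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training-compute budget into parameters N and tokens D using
    C ≈ 6·N·D and a fixed D/N ratio (20:1 is the commonly quoted rule of thumb)."""
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's training budget of ~5.8e23 FLOPs:
n, d = chinchilla_allocation(5.8e23)
print(f"N ≈ {n/1e9:.0f}B parameters, D ≈ {d/1e12:.1f}T tokens")  # ~70B, ~1.4T
```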
Architectural Innovations: The Efficiency Toolkit
Modern transformer architectures incorporate numerous efficiency improvements beyond the original 2017 design:
Attention Mechanism Optimizations
| Innovation | Benefit | Adoption |
|---|---|---|
| Multi-Query Attention (MQA) | Shares key/value projections across query heads; significant memory reduction | Early models |
| Grouped-Query Attention (GQA) | Balances quality and efficiency; enables scaling to 100B+ parameters | GPT-4, Llama 3, modern large models |
| Multi-Head Latent Attention (MLA) | Low-rank joint projection; 93% KV cache reduction vs. 67B dense model | DeepSeek-v2/v3 |
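The memory savings in the table come from shrinking the number of key/value heads that must be cached during generation. A rough sketch of the arithmetic, using an illustrative Llama-2-70B-like configuration (80 layers, 64 query heads, head dimension 128, FP16 cache); the specific numbers are assumptions for illustration only:

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, bytes_per_elem: int = 2) -> float:
    """KV-cache size for one sequence: a key and a value tensor per layer,
    each of shape [n_kv_heads, seq_len, head_dim]."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

layers, head_dim, ctx = 80, 128, 8192
print(kv_cache_gib(layers, 64, head_dim, ctx))  # MHA, 64 KV heads: ~20 GiB
print(kv_cache_gib(layers, 8,  head_dim, ctx))  # GQA,  8 KV heads: ~2.5 GiB
print(kv_cache_gib(layers, 1,  head_dim, ctx))  # MQA,  1 KV head:  ~0.3 GiB
```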
Normalization and Positional Encoding
| Innovation | Benefit | Technical Detail |
|---|---|---|
| Pre-normalization | Improved gradient flow in deep networks | Normalization before attention/FFN instead of after |
| RMSNorm | Faster computation, fewer parameters | Rescales by the root-mean-square only; omits mean-centering |
| Rotary Positional Embeddings (RoPE) | Better generalization to longer sequences | Encodes relative positions by rotating Q/K vectors |
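A minimal RoPE sketch (interleaved-pair convention, plain NumPy, not any particular library’s implementation): each pair of features in a query or key vector is rotated by an angle proportional to the token’s position, so the query-key dot product ends up depending on the relative offset between positions.

```python
import numpy as np

def apply_rope(x: np.ndarray, position: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive feature pairs of a query/key vector by position-dependent
    angles; x has shape [head_dim] with an even head_dim."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per feature pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

q = apply_rope(np.random.randn(64), position=5)   # query at position 5
k = apply_rope(np.random.randn(64), position=9)   # key at position 9
```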
Mixture-of-Experts: Sparse Activation for Efficiency
Mixture-of-Experts (MoE) architectures represent a fundamental shift in how models scale:
DeepSeekMoE Architecture
DeepSeek’s innovations address traditional MoE challenges around expert specialization:
| Component | Innovation | Result |
|---|---|---|
| Fine-grained expert segmentation | More experts activated at constant compute | More flexible expert combinations; higher specialization |
| Shared expert isolation | Dedicated experts for common knowledge | Reduced redundancy in routed experts |
| Scale demonstration | DeepSeekMoE 16B vs. LLaMA2 7B | Comparable performance with only 40% of computations |
| Frontier scaling | DeepSeekMoE 145B vs. DeepSeek 67B | Comparable performance with 28.5% (possibly 18.2%) of computations |
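A minimal single-token sketch of the routing pattern these innovations build on (illustrative sizes, plain NumPy; DeepSeek’s actual router, load-balancing losses, and expert dimensions are not reproduced here): each token is sent to its top-k routed experts, optionally plus always-on shared experts.

```python
import numpy as np

def moe_token(x, routed_experts, router, top_k=4, shared_expert=None):
    """Route one token vector x to its top-k experts and mix their outputs by
    renormalized router probabilities; a shared expert (if given) is always applied."""
    logits = x @ router                               # [n_experts] routing scores
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    chosen = np.argsort(probs)[-top_k:]               # indices of the top-k experts
    gate = probs[chosen] / probs[chosen].sum()
    out = sum(g * (x @ routed_experts[i]) for g, i in zip(gate, chosen))
    if shared_expert is not None:
        out = out + x @ shared_expert
    return out

d, n_experts = 64, 16                                  # illustrative sizes only
x = np.random.randn(d)
y = moe_token(x, np.random.randn(n_experts, d, d), np.random.randn(d, n_experts),
              top_k=4, shared_expert=np.random.randn(d, d))
# Activated compute scales with top_k (plus shared experts), not with n_experts.
```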
DeepSeek-V3: Production-Scale MoE
DeepSeek-V3 demonstrates MoE efficiency at frontier scale:
| Metric | Value | Comparison |
|---|---|---|
| Total parameters | 671B | - |
| Activated per token | 37B | ~5.5% activation rate |
| Training cost | $5.6M (2.788M H800 GPU hours) | Order of magnitude cheaper than comparable models |
| Training innovation | First large-scale validation of FP8 mixed precision | 8-bit training for large-scale LLMs |
| Performance | Comparable to leading closed-source models | Outperforms open-source alternatives |
Alternative Architectures: Beyond Transformers
While transformers dominate, alternative architectures address specific limitations:
Mamba and State Space Models
Mamba is based on structured state space sequence (S4) models and addresses transformer limitations in processing long sequences:
| Feature | Transformer | Mamba (SSM) |
|---|---|---|
| Computational complexity | O(n²) quadratic in sequence length | O(n) linear in sequence length |
| Long-range dependencies | Limited by attention window | Efficient handling of arbitrary lengths |
| Selective attention | Built-in via attention mechanism | Added via selective state space model |
| Hardware efficiency | Mature optimization (CUDA kernels) | Hardware-aware parallel scan for GPUs |
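A rough order-of-magnitude sketch of why the complexity difference matters at long context (constant factors, projections, and the hardware-aware scan are all ignored; the state dimension of 16 is an illustrative assumption):

```python
def attention_cost(seq_len: int, d_model: int) -> float:
    """Per-layer attention cost: QK^T scores plus the value mix, each ~n^2 * d."""
    return 2 * seq_len**2 * d_model

def ssm_scan_cost(seq_len: int, d_model: int, state_dim: int = 16) -> float:
    """Per-layer selective-scan cost: linear in sequence length."""
    return seq_len * d_model * state_dim

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7}: attention/SSM ≈ {attention_cost(n, 4096) / ssm_scan_cost(n, 4096):,.0f}x")
# The ratio grows linearly with n (125x, 1,250x, 12,500x here): the practical
# meaning of O(n^2) vs. O(n) in the table above.
```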
Mamba-2 and State Space Duality (SSD): Mamba-2 introduces a mathematical bridge between SSMs and Transformers’ attention mechanism, preserving efficient FLOP counts while dramatically speeding up training via matrix multiplications.
Hybrid approaches: Research from NVIDIA (2024) demonstrates that hybrid models combining transformers and Mamba can outperform pure implementations of either. Jamba, for example, is a hybrid Transformer-Mamba MoE architecture that combines the efficiency of Mamba with transformers’ in-context learning capabilities.
Current limitations: SSMs face challenges with copying long sequences, in-context learning, induction heads, and visual tasks requiring both local and global features. These limitations suggest SSMs are complementary rather than replacement architectures.
Post-Training Algorithmic Progress
Recent analysis suggests that post-training methods add 3-16x efficiency gains beyond pre-training algorithmic progress:
| Method | Estimated Gain | Application |
|---|---|---|
| Pre-training efficiency | 3x/year | Architecture, scaling laws, training techniques |
| Post-training (RLHF, distillation) | 3x/year (informal Anthropic estimate) | Alignment, instruction-following, reasoning |
| Combined algorithmic progress | ~9x/year | Pre-training + post-training |
| Catch-up progress estimate | 16-60x/year | Including all post-2023 innovations |
Quantization: Precision-Efficiency Tradeoffs
Quantization reduces model precision to decrease memory and computational requirements:
| Quantization Level | Accuracy Recovery | Deployment Benefit |
|---|---|---|
| 8-bit (INT8) | 99.9% of full precision | Widely adopted; minimal quality loss |
| 4-bit (INT4) | 98.9% of full precision | Significant memory reduction; some quality tradeoff |
| 2-bit (Apple) | Context-dependent | On-device deployment for 3B model with KV-cache sharing |
Rigorous evaluation across more than 500,000 individual benchmark runs demonstrates that when properly implemented with appropriate hyperparameter tuning, quantization delivers substantial resource savings without discernible quality degradation. DeepSeek-V3 represents the first large-scale validation of FP8 (8-bit floating point) mixed precision training for LLMs, suggesting quantization can be applied during training rather than only post-training.
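The memory side of the tradeoff is simple arithmetic; a minimal sketch (weights only, ignoring activations, KV cache, and the small overhead of quantization scales):

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory footprint of model weights alone."""
    return n_params * bits_per_param / 8 / 1e9

for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4), ("2-bit", 2)]:
    print(f"{label:>5}: {weight_memory_gb(70e9, bits):6.1f} GB for a 70B-parameter model")
# FP16 140 GB -> INT4 35 GB: roughly the difference between multi-GPU serving
# and fitting on a single accelerator.
```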
Cost and Inference Efficiency Trends
The combined effect of algorithmic improvements has led to dramatic cost reductions:
| Metric | Time Period | Improvement | Driver |
|---|---|---|---|
| Inference cost | 24 months | 280x reduction for GPT-3.5-equivalent performance | Primarily software/algorithmic |
| Energy per prompt | May 2024 - May 2025 | 33x reduction | 23x software, 1.4x utilization |
| AI accelerator efficiency | Annual | 30% cost reduction, 40% energy efficiency | Hardware improvements (slower than software) |
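Because the table mixes different time windows, a small helper can put the figures on a common per-year footing (this assumes a constant exponential rate over each window, which is a simplification):

```python
def annualized(total_factor: float, months: float) -> float:
    """Convert an 'N-fold improvement over M months' into an equivalent per-year factor."""
    return total_factor ** (12 / months)

print(annualized(280, 24))  # ~16.7x/year for GPT-3.5-equivalent inference cost
print(annualized(33, 12))   # 33x/year energy per prompt (already a 12-month window)
```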
Emerging Trends: Test-Time Compute
Two key trends emerged in 2024-2025:
- Test-time compute: Models like OpenAI’s o1 and DeepSeek’s r1 allocate significant computation during inference to improve reasoning, shifting resources from training (one-time cost) to deployment (ongoing cost).
- New modeling paradigms: Real-time video models and decentralized training represent architectural innovations extending beyond text.
The test-time compute trend suggests that future algorithmic progress may focus as much on inference-time algorithms (search, verification, self-correction) as on training-time innovations.
Causal Factors
The following factors influence algorithmic progress in AI, organized by strength of influence. This analysis is designed to inform future cause-effect diagram creation for the AI Transition Model.
Primary Factors (Strong Influence)
| Factor | Direction | Type | Evidence | Confidence |
|---|---|---|---|---|
| Compute Availability | ↑ Algorithmic Progress | cause | 91% of gains from scale-dependent innovations (Transformers, Chinchilla) | High |
| Academic Research Infrastructure | ↑ Algorithmic Progress | intermediate | Transformer breakthroughs from well-funded labs; pre-training progress 3x/year | High |
| Scaling Law Insights | ↑ Training Efficiency | intermediate | Chinchilla 20:1 rule enabled smaller, better-trained models | High |
| Open Research Culture | ↑ Diffusion Speed | leaf | Papers/code shared within months; instant replication possible | High |
| Hardware Constraints | ↓ Algorithmic Innovation | cause | Scale-dependent progress suggests compute limits slow algorithmic advances | Medium-High |
Secondary Factors (Medium Influence)
| Factor | Direction | Type | Evidence | Confidence |
|---|---|---|---|---|
| Competition Dynamics | ↑ Innovation Rate | leaf | Frontier labs racing to achieve efficiency gains; open-source pressure | Medium |
| Post-Training Methods | ↑ Practical Capabilities | cause | 3-16x gains from RLHF, distillation; catch-up progress 16-60x/year | Medium |
| Alternative Architectures | Mixed Effect | intermediate | Mamba/SSMs offer efficiency for specific tasks but don’t outscale Transformers | Medium |
| Quantization Techniques | ↑ Deployment Efficiency | intermediate | 8-bit: 99.9% accuracy, 4-bit: 98.9%; FP8 training validated at scale | Medium |
| MoE Architectures | ↑ Parameter Efficiency | cause | DeepSeek 40% compute for comparable performance; 5.5% activation rate at 671B scale | Medium |
Minor Factors (Weak Influence)
| Factor | Direction | Type | Evidence | Confidence |
|---|---|---|---|---|
| Benchmarking Standards | ↑ Measurable Progress | intermediate | Wikitext, Penn Treebank enable progress tracking; may incentivize benchmark-specific optimization | Low-Medium |
| Interdisciplinary Transfer | ↑ Novel Approaches | leaf | Neuroscience, physics inspire architectures (attention from cognitive science, SSMs from control theory) | Low |
| Regulatory Pressure | ↓ Open Publication | leaf | Concerns about capabilities disclosure may slow diffusion; limited evidence | Low |
| Hardware Co-Design | ↑ Practical Efficiency | cause | Mamba’s hardware-aware scan; Transformer optimizations for CUDA | Low-Medium |
Governance Challenges
Fundamental Challenges
| Challenge | Mechanism | Governance Implication |
|---|---|---|
| Instant diffusion | Papers published on arXiv; code on GitHub; implementations within weeks | Cannot create chokepoints in algorithm distribution |
| Independent discovery | Multiple groups discover similar innovations (e.g., attention mechanism, scaling laws) | Export controls ineffective for fundamental insights |
| Zero marginal cost | Algorithms are information; copying requires no physical resources | Cannot limit access through resource scarcity |
| Embedded in trained models | Model weights implicitly encode algorithmic innovations | Reverse engineering can recover techniques |
| Subjective metrics | Training objectives, loss functions, optimization choices introduce bias/values | Technical decisions have ethical implications |
Specific Governance Issues
Bias and Transparency
AI algorithms are designed using subjective metrics, deterministic models, and probabilistic reasoning, each introducing unique governance challenges:
- Black box problem: Many AI models function as “black boxes,” making it difficult to interpret decisions and identify bias or unfairness
- Centralized training data: Reliance on datasets like Wikipedia raises concerns about representational bias and exclusion of diverse perspectives
- Inconsistent behavior: AI systems may produce unpredictable results due to biased data or flawed algorithms
Dual Use and Misuse
Algorithmic improvements that reduce compute requirements have dual-use implications:
- Democratization vs. proliferation: More efficient algorithms lower barriers to entry for beneficial applications but also enable resource-constrained malicious actors
- Capability surprise: Sudden algorithmic breakthroughs can create capability jumps that outpace safety research and governance preparation
- Diffusion control: Once published, efficient algorithms cannot be “contained” to authorized users
Strategic Governance Approaches
Given the difficulty of directly controlling algorithms, governance strategies focus on adjacent leverage points:
| Approach | Mechanism | Limitations |
|---|---|---|
| Compute governance | Control hardware access; implement usage thresholds | Scale-dependent algorithms may work around compute limits |
| Data governance | Limit access to high-quality training data | Data can be accumulated over time; synthetic data alternatives |
| Evaluation protocols | Mandatory testing before deployment | Requires cooperation; cannot prevent private development |
| Publication norms | Responsible disclosure; staged release | Voluntary; competitive pressure undermines norms |
| Transparency requirements | Algorithmic auditing; explainability standards | Black box models limit effectiveness |
Open Questions
| Question | Why It Matters | Current State |
|---|---|---|
| Will algorithmic progress decouple from compute scaling? | Scale-dependent innovations suggest coupling; decoupling would invalidate compute-based governance | 91% of gains scale-dependent; unclear if this generalizes to future innovations |
| What is the ceiling for software efficiency gains? | Determines whether algorithmic progress can continue if compute scaling plateaus | 23x recent gains suggest substantial headroom; theoretical limits unknown |
| How predictable are algorithmic breakthroughs? | Unpredictable breakthroughs create capability surprise; predictable trends enable governance preparation | Transformer (2017) was transformative but not predicted; unclear if outlier |
| Will post-training methods continue accelerating? | 16-60x/year catch-up progress suggests major gains; unclear if sustainable | Recent trend; insufficient data to establish long-term rate |
| Can alternative architectures outscale Transformers? | Mamba/SSMs offer efficiency; unclear if competitive at frontier scale | Transformers consistently scale better despite local advantages of alternatives |
| How will test-time compute change the landscape? | Shifts resources from training to inference; changes governance focus | o1/r1 demonstrate importance; implications for compute thresholds uncertain |
| What role does synthetic data play? | Could bypass data governance if high-quality synthetic data enables training | Active research area; quality parity with human data unclear |
| Will quantization reach fundamental limits? | 2-4 bit quantization approaches theoretical minimums | 8-bit validated at scale; 4-bit shows 98.9% recovery; further reduction questionable |
Scenario Variants
Algorithmic progress could evolve along several distinct pathways with different implications for AI safety:
| Scenario | Mechanism | Timeline | Warning Signs | Governance Implications |
|---|---|---|---|---|
| Algorithmic Plateau | Diminishing returns as field matures; scale-dependent innovations exhaust design space | 3-7 years | Slowing efficiency gains; architectural innovations providing less than 2x improvements | Capability growth limited by compute/data scaling; governance window extends |
| Breakthrough Discovery | Transformative architectural innovation comparable to Transformers (2017) | Unpredictable | Novel mathematical framework; orders-of-magnitude efficiency gains | Rapid capability jump; potential governance surprise; safety research behind |
| Post-Training Dominance | Pre-training efficiency plateaus; post-training methods become primary driver | 2-5 years | Catch-up progress sustained at 16-60x/year; frontier labs focus on RLHF/reasoning | Deployment compute becomes critical governance variable |
| Decoupling from Scale | Scale-independent innovations emerge that provide benefits at all compute levels | 3-8 years | Small models approaching larger model capabilities; efficiency gains at low compute | Compute-based governance becomes less effective; broader access to capabilities |
| Hardware Co-Evolution | Specialized accelerators and algorithms co-designed; hardware-specific advantages | 5-10 years | Mamba-style hardware-aware designs; architecture-specific chips | Fragmentation of AI development; reduced transferability across platforms |
Sources
Academic Papers & Preprints
- arXiv (2024). “On the Origin of Algorithmic Progress in AI” - Scale-dependent innovations; 91% of gains from Transformers + Chinchilla
- Epoch AI (2024). “Algorithmic progress in language models” - 8-month doubling time; 5-14 month CI
- Epoch AI (2024). “Revisiting algorithmic progress” - Attribution analysis methodology
- arXiv (2024). “Understanding Scaling Laws with Statistical and Approximation Theory for Transformer Neural Networks” - Intrinsic dimension affects scaling
- arXiv (2024). “Scaling Laws for Diffusion Transformers” - DiT scalability and efficiency
- arXiv (2024). “Beyond Scaling Laws: Understanding Transformer Performance with Associative Memory” - Hopfield networks explain attention
- arXiv (2024). “Towards Precise Scaling Laws for Video Diffusion Transformers” - Video models more sensitive to hyperparameters
- OpenAI (2023). “GPT-4 Technical Report” - Official GPT-4 specifications (limited architectural details)
- Hoffmann et al. (2022). “Training Compute-Optimal Large Language Models” (Chinchilla) - 20:1 token-to-parameter ratio
- Epoch AI (2024). “Chinchilla Scaling: A replication attempt” - Replication issues with original estimates
Architectural Innovations
- arXiv (2024). “DeepSeekMoE: Towards Ultimate Expert Specialization” - Fine-grained experts; 40% compute for comparable performance
- DeepSeek-V3 (Hugging Face) - 671B parameters, 37B activated; $5.6M training cost
- GitHub - DeepSeek-MoE - Open source implementation
- ACL Anthology (2024). “DeepSeekMoE” - Conference paper
- Medium - DeepSeek Technical Analysis - MoE efficiency breakdown
- Tri Dao (2024). “State Space Duality (Mamba-2)” - SSM-attention bridge
- Wikipedia - Mamba - Overview of SSM architecture
- arXiv (2024). “Mamba-360: Survey of state space models” - Comprehensive SSM survey
- Maarten Grootendorst (2024). “A Visual Guide to Mamba and State Space Models” - Accessible explanation
Efficiency and Optimization
- Arcade.dev (2025). “AI Compute Optimization & Cost Efficiency Analysis” - 23x software improvements; 33x energy reduction
- Neptune.ai (2025). “State of Foundation Model Training Report” - Industry efficiency trends
- Gradient Flow (2025). “Foundation Models: What’s Next for 2025 and Beyond” - MoE trends; test-time compute
- Apple ML Research (2025). “Apple Foundation Models 2025 Updates” - 3B model; 2-bit quantization
- ACM Computing Surveys (2024). “Resource-efficient Algorithms and Systems of Foundation Models” - Comprehensive survey
- Restack.io (2024). “Transformer Models Innovations 2024” - GQA, RMSNorm, RoPE
- Medium (2024). “The Evolution of Transformer Architecture: From 2017 to 2024” - Historical overview
Scaling Laws and Measurement
- Analytics Vidhya (2024). “What is the Chinchilla Scaling Law?” - Accessible explanation
- Life Architect. “Chinchilla data-optimal scaling laws: In plain English” - 20:1 ratio details
- Michael Brenndoerfer. “Chinchilla Scaling Laws” - Resource allocation analysis
- MIT CSAIL. “From recurrent networks to GPT-4: Measuring algorithmic progress” - Progress measurement methodology
- LessWrong (2024). “Catch-Up Algorithmic Progress Might Actually be 60× per year” - Post-training efficiency gains
Governance and Policy
- United Nations University. “The Algorithmic Problem in Artificial Intelligence Governance” - Fundamental governance challenges
- Stanford FSI. “Regulating Under Uncertainty: Governance Options for Generative AI” - Policy frameworks
- Oxford Academic (2024). “Governance of Generative AI” - Scholarly analysis
- OECD. “Governing with Artificial Intelligence” - Government implementation challenges
- World Economic Forum (2024). “How to balance innovation and governance in the age of AI” - 360° governance framework
Industry and Trends
- Epoch AI. “Machine Learning Trends” - Database of ML progress
- Epoch AI. “Quantifying the algorithmic improvement from reasoning models” - Test-time compute analysis
- Epoch AI. “AI capabilities progress has sped up” - Acceleration around April 2024
- Latent Space (2024). “2024 in Post-Transformers Architectures” - SSM, RWKV developments
- NVIDIA Blog. “Mixture of Experts Powers the Most Intelligent Frontier AI Models” - MoE production deployment
- IBM. “What Is A Mamba Model?” - Industry perspective on SSMs
AI Transition Model Context
Connections to Other Model Elements
| Model Element | Relationship | Key Insights |
|---|---|---|
| AI Capabilities (Compute) | Multiplicative effect | Algorithmic efficiency acts as a multiplier on compute; 91% of gains are scale-dependent, coupling progress to compute investment |
| AI Capabilities (Adoption) | Enabling | Efficiency improvements (280x inference cost reduction) make deployment economically viable at scale |
| AI Ownership (Companies) | Concentrating | Scale-dependent innovations favor well-resourced labs; but efficiency gains enable smaller players (DeepSeek $5.6M training) |
| AI Ownership (Countries) | Mixed effect | Open publication accelerates global diffusion; but frontier innovations require substantial compute (scale-dependent) |
| Misalignment Potential (AI Governance) | Governance challenge | Cannot directly control algorithm diffusion; must govern via compute, data, evaluations |
| Misalignment Potential (Technical AI Safety) | Bidirectional | Safety research benefits from efficiency (more experimentation); but rapid capability gains can outpace safety |
| Misuse Potential | Dual use | Efficiency improvements democratize access (beneficial for researchers, concerning for misuse) |
| Transition Turbulence (Racing) | Accelerator | Algorithmic breakthroughs create competitive pressure and capability surprise |
| Civilizational Competence (Epistemics) | Measurement challenge | Difficulty measuring/predicting algorithmic progress creates forecasting uncertainty |
Strategic Implications
The research reveals several strategic considerations for the AI transition:
- Scale-dependence brittleness: If 91% of algorithmic efficiency gains depend on frontier compute scales, then limits to compute scaling (energy, supply chain, regulation) may substantially slow algorithmic progress. This makes capability trajectories more brittle than models assuming independent algorithmic and compute progress.
- Governance indirection: Direct algorithmic governance is infeasible due to instant diffusion and zero marginal cost replication. Effective governance must work through adjacent leverage points: compute access, data availability, evaluation protocols, and publication norms.
- Efficiency paradox: Software optimizations (23x improvement) dramatically outpace hardware advances (30-40% annually), suggesting that even with compute constraints, efficiency gains could continue. This creates uncertainty about whether hardware-based governance can effectively limit capability growth.
- Post-training acceleration: The 16-60x/year catch-up progress estimate (including post-training methods) is dramatically higher than the 3x/year pre-training rate. If this continues, post-training innovations may drive capability growth as much or more than pre-training algorithmic advances, shifting governance focus to deployment compute and inference optimization.
- Capability surprise risk: The Transformer architecture (2017) was transformative but not predicted by the research community beforehand. If future algorithmic breakthroughs follow similar patterns, capability forecasting based on continuous trends may systematically underestimate discontinuous jumps.
- Democratization vs. concentration: Efficiency improvements have contradictory effects. On one hand, they lower barriers to entry (DeepSeek’s $5.6M training cost vs. hundreds of millions for comparable models). On the other hand, scale-dependent innovations favor organizations with access to frontier compute. The net effect on concentration depends on which force dominates.
- Test-time compute shift: Models like o1 and r1 demonstrate that significant inference-time computation can improve reasoning capabilities. This shifts resources from training (one-time, governable via thresholds) to deployment (ongoing, harder to monitor). Governance frameworks may need to adapt to regulate deployment compute in addition to training compute.
The algorithmic landscape exhibits rapid change across multiple dimensions (architecture, training methods, post-training techniques, deployment optimization), creating substantial uncertainty for AI governance and safety. The coupling between algorithmic progress and compute investment suggests that compute governance remains the most tractable leverage point, but efficiency gains may provide an “escape valve” that limits the effectiveness of compute-based interventions.