
Compute (AI Capabilities): Research Report

| Finding | Key Data | Implication |
|---|---|---|
| Training costs escalating | 2.4x/year growth; GPT-4 cost $78M, Gemini Ultra $191M | Billion-dollar training runs by 2027; only well-funded orgs can compete |
| Supply chain concentration | TSMC: 92% of advanced chips; ASML: 100% of EUV machines | Single points of failure create governance opportunities and geopolitical risk |
| Hardware bottlenecks easing | H100 lead times dropped from 11 months (2023) to 8-12 weeks (2024) | Memory (HBM) now the binding constraint through 2027 |
| Compute governance emerging | US: 10²⁶ FLOPs; EU: 10²⁵ FLOPs reporting thresholds | Compute is measurable, concentrated, physical—ideal governance lever |
| Energy demand doubling | Data centers: 415 TWh (2024) → 945 TWh (2030); AI: 35-50% of DC load | Infrastructure growth outpacing electricity supply; nuclear partnerships emerging |

Compute has emerged as the most governable input to AI development because it is measurable, concentrated, and physical. Training costs for frontier AI models have grown at 2.4× per year since 2016, with GPT-4 costing approximately $78 million and Gemini Ultra around $191 million—projecting to billion-dollar training runs by 2027. The supply chain exhibits extreme concentration: TSMC manufactures 92% of advanced chips, ASML holds a monopoly on EUV lithography equipment, and NVIDIA controls roughly 80% of the AI accelerator market.

These chokepoints create natural governance leverage. The US, EU, and California have all implemented compute-based regulatory thresholds (10²⁵-10²⁶ FLOPs) that trigger reporting requirements. Hardware bottlenecks have shifted from GPU availability to high-bandwidth memory (HBM), with lead times dropping from 11 months to 8-12 weeks. However, energy constraints are intensifying: AI-driven data center demand is projected to consume 945 TWh by 2030, prompting major tech companies to pursue nuclear partnerships. The concentration of advanced chip manufacturing in Taiwan (TSMC) and lithography equipment in the Netherlands (ASML) creates significant geopolitical risk, driving US efforts to reshore semiconductor production through the CHIPS Act.


Compute—the hardware resources required to train and run AI systems—has emerged as perhaps the most tractable lever for AI governance. Unlike algorithms (which can be shared instantly) or data (which is hard to track), compute is measurable (FLOPs, GPU-hours), concentrated (few chokepoints like TSMC, ASML, NVIDIA), and physical (can be tracked, controlled, and metered).

Training frontier models now costs tens to hundreds of millions of dollars in compute alone. Anthropic CEO Dario Amodei has stated that frontier AI developers are likely to spend close to a billion dollars on a single training run in 2025, with up to ten billion-dollar training runs expected in the next two years. This concentration of resources creates natural governance chokepoints.


Training Compute Costs: Exponential Growth


The cost trajectory for training frontier AI models has followed a remarkably consistent exponential trend:

| Model | Year | Training Compute Cost | Notes |
|---|---|---|---|
| GPT-3 | 2020 | $4-12M | Established LLM scaling paradigm |
| GPT-4 | 2023 | $78M | Per Stanford AI Index 2024 |
| Gemini Ultra | 2024 | $191M | Google’s flagship model |
| Projected | 2027 | >$1B | If 2.4x/year growth continues |

Epoch AI’s analysis found that the amortized hardware and energy cost for the final training run of frontier models has grown at a rate of 2.4x per year since 2016 (95% CI: 2.0x to 3.1x). This rate significantly exceeds Moore’s Law and suggests that “given that total model development costs at the frontier are already over $100 million, these advances may only be accessible to the largest companies and government institutions.”
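
The cost trajectory follows directly from compounding the 2.4x annual growth rate. The sketch below extrapolates from the Gemini Ultra anchor; the growth rate and anchor cost come from the figures above, and the projection is illustrative, not a forecast:

```python
GROWTH_PER_YEAR = 2.4      # Epoch AI central estimate (95% CI: 2.0x-3.1x)
ANCHOR_COST_USD = 191e6    # Gemini Ultra, 2024
ANCHOR_YEAR = 2024

def projected_cost(year: int) -> float:
    """Amortized hardware + energy cost of a frontier training run in `year`,
    extrapolated from the Gemini Ultra anchor at 2.4x/year."""
    return ANCHOR_COST_USD * GROWTH_PER_YEAR ** (year - ANCHOR_YEAR)

for year in range(2024, 2028):
    print(year, f"${projected_cost(year) / 1e6:,.0f}M")
```

Under this extrapolation a single run crosses $1 billion around 2026 and reaches roughly $2.6 billion by 2027, consistent with the report's "billion-dollar training runs by 2027" framing.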

OpenAI’s 2024 compute spending illustrates the scale: $3 billion on training compute, $1.8 billion on inference compute, and $1 billion on research compute amortized over multiple years (Epoch AI, 2024).

While traditional scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022/Chinchilla) focused on training compute, recent research has expanded to inference scaling laws:

| Scaling Dimension | Key Finding | Implication |
|---|---|---|
| Training | Performance ∝ compute^α (α ≈ 0.5-0.7) | Predictable capability growth with compute |
| Inference | Test-time compute can be more efficient than parameter scaling | Smaller models + advanced inference may be Pareto-optimal |
| Architecture | MLP-to-attention ratio, GQA affect inference cost | Conditional scaling laws needed for deployment |
| Efficiency | “Densing law”: capability density doubles every 3.5 months | Same capability with exponentially fewer parameters over time |

The “densing law” published in Nature Machine Intelligence states that capability density (capability per parameter) doubles approximately every 3.5 months, indicating that equivalent model performance can be achieved with exponentially fewer parameters over time. This has significant implications for compute efficiency and deployment costs.
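
Two back-of-envelope consequences of these relations can be checked directly. The exponent range (α ≈ 0.5-0.7) and the 3.5-month doubling period are taken from the figures above; the arithmetic below is illustrative:

```python
def compute_multiplier_to_double(alpha: float) -> float:
    """If performance scales as compute**alpha, doubling performance
    requires a compute factor of 2**(1/alpha)."""
    return 2 ** (1 / alpha)

def params_fraction(months: float, doubling_months: float = 3.5) -> float:
    """Densing law: fraction of parameters needed for the same capability
    after `months`, if capability density doubles every `doubling_months`."""
    return 0.5 ** (months / doubling_months)

print(compute_multiplier_to_double(0.5))            # 4.0: doubling performance takes 4x compute
print(round(compute_multiplier_to_double(0.7), 2))  # ~2.69x at the optimistic end
print(round(params_fraction(12), 3))                # ~0.093: ~9% of parameters after one year
```

At α = 0.5, doubling performance takes 4x the compute; meanwhile, under the densing law, one year of efficiency progress cuts the parameter count needed for fixed capability by roughly an order of magnitude.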

Hardware Supply Chain: Critical Chokepoints


The AI compute supply chain exhibits extreme concentration at multiple levels:

**TSMC:**

| Metric | Value | Source |
|---|---|---|
| Market share (advanced chips) | 92% | AILAB Blog (2025) |
| Geographic concentration | Single island (Taiwan), 13,976 sq mi | Global Taiwan Institute |
| Economic impact of disruption | $10 trillion (10% of global GDP) | Verdantix (2025) |
| Key customers | Apple, NVIDIA, Qualcomm, Samsung, AMD | Industry analysis |

TSMC’s CoWoS (Chip-on-Wafer-on-Substrate) packaging has emerged as the specific bottleneck for AI chips: the process cannot scale fast enough to meet accelerator demand. Additionally, the TRX5090 substrate—a critical component binding the GPU core to its high-bandwidth memory—is in extremely limited supply, with only a handful of manufacturers in Japan and Taiwan able to produce it at the required precision and volume.

TSMC is diversifying with plans for six semiconductor fabs in Arizona, plus expansion in Japan and Germany. However, diversification is not feasible in the short term due to high reshoring costs and talent acquisition challenges (NPR, 2025).

**ASML:**

| Metric | Value | Source |
|---|---|---|
| EUV market share | 100% | CNBC (2022) |
| Machine cost | $150-400M (High-NA: $370M+) | Industry reports |
| R&D investment | $9B over 30 years | xLight analysis |
| Parts per machine | >100,000 | Technical specifications |
| Suppliers | 800+ globally | Supply chain analysis |

ASML builds 100% of the world’s extreme ultraviolet (EUV) lithography machines, without which cutting-edge chips cannot be made. The Dutch company won a 30-year race to become the sole supplier of the tool essential for fabricating leading-edge semiconductors (Strange VC, 2025).

Potential competition: Pat Gelsinger, the ousted Intel CEO, serves as executive chairman of xLight, a startup founded in 2024 that is developing free-electron lasers driven by compact particle accelerators as an alternative to ASML’s laser-produced plasma. The Trump administration injected up to $150 million into xLight from the 2022 CHIPS and Science Act, though this remains early-stage R&D with an uncertain timeline (24/7 Wall St., 2025).

GPU shortages dominated 2022-2023, but the situation has evolved:

| Period | H100 Lead Time | Binding Constraint |
|---|---|---|
| 2023 (peak shortage) | 11 months | GPU chip supply |
| Early 2024 | 3-4 months | CoWoS packaging |
| Late 2024 | 8-12 weeks | High-bandwidth memory (HBM) |
| 2025-2027 | Variable | Memory supply (SK Hynix, Samsung, Micron) |

NVIDIA’s market dominance is reflected in financials: Q3 fiscal 2026 total revenue reached $57.0 billion, with data center operations accounting for $51.2 billion—90% of the company’s entire business. An NVIDIA H100 AI accelerator sells for $25,000-40,000, giving the company unusual pricing power during shortage periods (BattleforgePC, 2025).

Compute thresholds have emerged as a central mechanism for AI governance across multiple jurisdictions:

| Jurisdiction | Threshold | Reporting Requirements | Status |
|---|---|---|---|
| US (EO 14110) | 10²⁶ FLOPs (10²³ for bio) | Notify government, report security measures, share red-team results | Revoked by Trump EO 14148; rules proposed |
| EU AI Act | 10²⁵ FLOPs | Notify Commission, perform evaluations, assess systemic risks, report incidents | Active; affects 5-15 companies |
| California (SB 53) | 10²⁶ FLOPs + $500M revenue | Disclose “frontier AI framework,” report catastrophic risk assessments quarterly | Enacted 2025 |
| New York (RAISE Act) | Revenue-based (removed compute threshold) | Report critical incidents within 72 hours | Signed into law 2025 |

Rationale: Training compute can serve as a proxy for the capabilities of AI models. A compute threshold operates as a regulatory trigger, identifying which models might possess more powerful and dangerous capabilities that warrant greater scrutiny (GovAI Research Paper).
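
The standard back-of-envelope for where a model lands against these thresholds is the ≈6 × parameters × tokens approximation for dense transformer training FLOPs. The model size and token count below are hypothetical, chosen only to illustrate how a model can cross the EU trigger while staying under the US/California one:

```python
EU_THRESHOLD = 1e25     # EU AI Act reporting threshold
US_CA_THRESHOLD = 1e26  # US EO 14110 (revoked) and California SB 53

def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute for a dense transformer
    (~6 FLOPs per parameter per training token)."""
    return 6 * params * tokens

# Hypothetical 400B-parameter model trained on 15T tokens:
flops = training_flops(400e9, 15e12)
print(f"{flops:.2e}")             # 3.60e+25
print(flops >= EU_THRESHOLD)      # True: EU notification triggered
print(flops >= US_CA_THRESHOLD)   # False: below the US/CA threshold
```

This gap between the two thresholds is exactly where jurisdictional divergence matters: the same model faces obligations in the EU but none under the (revoked) US framework.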

Limitations: The debate has centered mostly on a single training compute threshold, but governments could adopt a pluralistic, risk-adjusted approach by introducing multiple compute thresholds that trigger different measures according to the degree or nature of risk. Some proposals recommend a tiered approach that would create fewer obligations for models trained on less compute (Institute for Law & AI, 2024).

The US has implemented increasingly strict export controls on advanced computing chips to China, with significant implications:

| Policy Action | Target | Impact | Assessment |
|---|---|---|---|
| October 2022 | Advanced chips (A100, H100) to China | Created incentive for compute-efficient algorithms | Accelerated Chinese innovation |
| October 2023 | Expanded chip restrictions | Cloud computing loopholes remain | Limited effectiveness |
| December 2024 | High-bandwidth memory (HBM) restrictions | Targets deployment compute for reasoning models | Addresses inference scaling |
| AI Diffusion Framework | Three-tier country system | Byzantine limits for ~150 middle-tier countries | Critiqued as overreach |

Hardware-Enabled Governance Mechanisms (HEMs): RAND researchers introduced the concept of HEMs, which could be installed on chips otherwise prohibited from export. HEMs could provide “some level of confidence that they could not be misused” through technical enforcement (RAND, 2024). However, HEMs face significant threat vectors and would require robust protection measures.

Cloud Computing Alternative: Providing compute as a service offers superior governance opportunities compared to chip export controls. Unlike AI chips accumulated over time, cloud access provides point-in-time computing power that can be restricted or shut off as needed—making it a more precise tool for oversight (Brookings, 2024).

Energy Consumption: The Infrastructure Challenge


AI’s compute demands translate directly into energy consumption at unprecedented scale:

| Metric | 2024 | 2030 (Projected) | Growth Rate |
|---|---|---|---|
| Global data center electricity | 415 TWh (1.5% of global) | 945 TWh (3% of global) | 15%/year |
| US data center electricity | 183 TWh (4% of US total) | ~320 TWh (8.6% of US total by 2035) | Doubling by 2035 |
| AI share of DC power | 5-15% | 35-50% | AI-specific servers: 30%/year |
| AI-specific servers | 53-76 TWh | 165-326 TWh | 4.3x growth |
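
The global figures above are internally consistent: compounding 415 TWh at 15%/year lands close to the 945 TWh projection for 2030. A quick check (base value and growth rate from the figures above; smooth compounding is an assumption):

```python
BASE_TWH = 415   # global data center electricity, 2024
GROWTH = 1.15    # ~15%/year

def projected_demand(year: int, base_year: int = 2024) -> float:
    """Global data center electricity (TWh) under constant 15%/year growth."""
    return BASE_TWH * GROWTH ** (year - base_year)

print(round(projected_demand(2030)))   # 960 -- close to the 945 TWh projection
```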

Energy sources (US, 2024): Natural gas supplied over 40% of electricity for data centers, renewables (wind/solar) about 24%, nuclear around 20%, and coal around 15% (Pew Research, 2025).

Environmental footprint: Company-wide metrics from environmental disclosures suggest that AI systems may have a carbon footprint equivalent to that of New York City in 2025. In 2023, US data centers directly consumed about 17 billion gallons of water, with hyperscale facilities expected to consume 16-33 billion gallons annually by 2028 (ScienceDirect, 2025).

Infrastructure investment: The data center real estate build-out has reached record levels based on select major hyperscalers’ capital expenditure, trending at roughly $200 billion as of 2024 and estimated to exceed $220 billion by 2025 (Deloitte, 2025).


The following factors influence AI compute availability, cost, and governance effectiveness. This analysis is designed to inform future cause-effect diagram creation for the AI Transition Model.

| Factor | Direction | Type | Evidence | Confidence |
|---|---|---|---|---|
| Scaling Laws | ↑ Compute Demand | cause | 2.4x/year cost growth; predictable capability returns | High |
| Supply Chain Concentration | ↑ Governance Tractability | intermediate | TSMC 92%, ASML 100% create chokepoints | High |
| Training Cost Escalation | ↓ Actor Diversity | intermediate | $100M+ limits to well-funded orgs; billion-dollar runs by 2027 | High |
| Hardware Bottlenecks | ↓ Capability Growth Rate | leaf | Memory (HBM) shortages through 2027 | High |
| Energy Infrastructure | ↓ Deployment Speed | leaf | 15%/year DC growth exceeds grid capacity in some regions | High |

| Factor | Direction | Type | Evidence | Confidence |
|---|---|---|---|---|
| Compute Thresholds | ↑ Regulatory Oversight | intermediate | US/EU/CA frameworks at 10²⁵-10²⁶ FLOPs | Medium |
| Export Controls | Mixed Effect | leaf | May accelerate efficient algorithms (DeepSeek example) | Medium |
| Algorithmic Efficiency | ↓ Compute Demand | cause | “Densing law”: 2x capability density every 3.5 months | Medium |
| Inference Scaling | ↑ Deployment Compute | cause | Test-time compute increasingly important (o1, r1 models) | Medium |
| Geopolitical Tensions | ↑ Supply Risk | leaf | Taiwan Strait conflict would disrupt 92% of advanced chips | Medium |

| Factor | Direction | Type | Evidence | Confidence |
|---|---|---|---|---|
| Cloud Governance | ↑ Oversight Precision | intermediate | Point-in-time control vs. chip accumulation | Low |
| Hardware-Enabled Mechanisms | ↑ Export Flexibility | intermediate | RAND proposal; unproven threat model | Low |
| ASML Competition | ↓ Monopoly Risk | leaf | xLight startup; early stage ($150M funding) | Low |
| TSMC Diversification | ↓ Geographic Risk | leaf | Arizona/Japan/Germany fabs; long timeline | Low |

Compute dynamics could evolve along several distinct pathways with different implications for AI safety:

| Scenario | Mechanism | Timeline | Warning Signs | Governance Implications |
|---|---|---|---|---|
| Compute Overhang | Algorithmic breakthroughs make existing compute far more capable | 2-5 years | Efficiency gains exceed hardware scaling; DeepSeek-style innovations | Thresholds become unreliable proxies |
| Hardware Plateau | Physical limits (energy, memory, lithography) constrain scaling | 5-10 years | Slowing Moore’s Law; energy grid bottlenecks | Increased focus on algorithmic safety |
| Geopolitical Disruption | Taiwan conflict disrupts TSMC; China controls advanced chips | 1-10 years | Escalating Taiwan Strait tensions | Western AI capabilities severely degraded |
| Decentralized Compute | Distributed training across many small GPUs becomes viable | 3-8 years | Successful federated learning for frontier models | Governance via chokepoints becomes infeasible |
| Energy Bottleneck | Grid capacity limits data center expansion before compute saturation | 3-7 years | Brownouts near mega-clusters; nuclear partnerships stall | Natural brake on capability growth |

| Question | Why It Matters | Current State |
|---|---|---|
| How robust are compute thresholds to algorithmic progress? | DeepSeek achieved competitive results with less than 10²⁵ FLOPs | Thresholds are static; efficiency gains accelerating |
| What is the energy ceiling for AI? | May be binding constraint before chip supply | Projections vary widely; grid capacity unclear |
| Can TSMC diversification succeed in time? | Arizona fabs won’t reach volume until late 2020s | Geopolitical risk timeline uncertain |
| Will inference scaling change governance calculus? | Shifts compute from training (one-time) to deployment (ongoing) | Reasoning models (o1, r1) show importance; December 2024 HBM controls respond |
| How effective are export controls? | May accelerate rather than impede Chinese AI progress | DeepSeek case study suggests efficiency paradox |
| What are the limits of hardware-enabled governance? | Could enable export of otherwise-restricted chips | Threat model unproven; RAND early-stage research |
| Will memory (HBM) remain the bottleneck? | Determines whether GPU shortages return | SK Hynix: shortages through late 2027 |
| Can cloud-based governance scale globally? | More precise than chip controls but requires infrastructure | Loopholes exist; university research concerns |


| Model Element | Relationship | Key Insights |
|---|---|---|
| AI Capabilities (Algorithms) | Complementary | Algorithmic efficiency (densing law) reduces compute requirements; may undermine threshold-based governance |
| AI Capabilities (Adoption) | Enabling | Training costs ($100M+) create barriers to entry; energy infrastructure limits deployment speed |
| AI Ownership (Companies) | Concentrating | Only well-funded organizations can afford frontier models; drives consolidation |
| AI Ownership (Countries) | Geopolitical lever | Export controls and supply chain chokepoints (TSMC, ASML) create interstate competition |
| Misalignment Potential (AI Governance) | Primary intervention point | Compute thresholds enable reporting, evaluations, audits before deployment |
| Misuse Potential | Limiting factor | High training costs reduce rogue actor capabilities (though inference compute different) |
| Transition Turbulence (Racing) | Accelerator | Shortages and strategic competition increase pressure for rapid deployment |
| Civilizational Competence (Governance) | Test case | Effectiveness of compute governance indicates broader governance capacity |

The research reveals several strategic considerations for the AI transition:

  1. Governance window closing: If algorithmic efficiency continues to double capability density every 3.5 months (densing law), compute thresholds will become less reliable proxies for risk within 2-3 years. This suggests urgency in establishing complementary governance mechanisms.

  2. Energy as natural brake: Infrastructure constraints (15%/year data center growth vs. slower grid expansion) may limit capability growth independent of policy choices. This could provide additional time for governance development but may also increase racing incentives.

  3. Geopolitical fragility: 92% concentration in Taiwan creates catastrophic downside risk. TSMC diversification timeline (late 2020s for volume production) may not align with AI capabilities timeline (potentially transformative AI by 2027-2030).

  4. Efficiency paradox: Export controls may accelerate development of compute-efficient algorithms, potentially making frontier capabilities accessible to a wider range of actors. This suggests limits to supply-side interventions.

  5. Inference shift: Growing importance of deployment/inference compute (o1, r1 reasoning models) changes governance focus from one-time training runs to ongoing operational compute. Cloud-based governance may be more effective for this regime.

The compute landscape is evolving rapidly, with multiple uncertainty dimensions (algorithmic efficiency, hardware bottlenecks, energy constraints, geopolitical shocks) that could significantly alter the AI transition trajectory. Adaptive governance mechanisms that can respond to these shifts will be critical.