Compute Thresholds

Approach: Define capability boundaries via compute
Status: Established in US and EU policy

Compute thresholds represent one of the most concrete regulatory approaches to AI governance implemented to date, using training compute as a measurable trigger for safety and transparency requirements. Unlike export controls that restrict access or monitoring systems that provide ongoing visibility, thresholds create a simple binary rule: if you train a model above X floating-point operations (FLOP), you must comply with specific regulatory obligations.

This approach has gained traction because compute is both measurable and correlates with model capabilities, albeit imperfectly. The European Union’s AI Act established a 10^25 FLOP threshold in 2024, while the US Executive Order on AI set a 10^26 FLOP trigger in October 2023. These implementations represent the first large-scale attempt to regulate AI development based on resource consumption rather than demonstrated capabilities or actor identity.

However, compute thresholds face a fundamental challenge: algorithmic efficiency improvements of approximately 2x per year are decoupling compute requirements from capabilities. A model requiring 10^25 FLOP in 2023 might achieve equivalent performance with only 10^24 FLOP by 2026, potentially making static thresholds obsolete within 3-5 years. This creates an ongoing tension between the tractability of compute-based triggers and their diminishing relevance as a proxy for AI capabilities and associated risks.

| Dimension | Assessment | Notes |
|---|---|---|
| Tractability | High | Already implemented in major jurisdictions |
| Mechanism | Trigger-based | Cross threshold → face requirements |
| Current Status | Active | EU AI Act, US EO both use compute thresholds |
| Main Challenge | Algorithmic efficiency | Same capabilities with less compute over time |
| Time Horizon | 3-5 years | Before efficiency gains make current thresholds irrelevant |

| Risk | Mechanism | Effectiveness |
|---|---|---|
| Racing Dynamics | Forces safety testing before deployment | Medium |
| Bioweapons | Lower thresholds for bio-sequence models | Medium |
| Deceptive Alignment | Requires evaluation before deployment | Low-Medium |

The following table compares compute threshold implementations across major jurisdictions, revealing significant variation in both threshold levels and triggered requirements:

| Jurisdiction | Threshold | Scope | Key Requirements | Status | Source |
|---|---|---|---|---|---|
| EU AI Act | 10^25 FLOP | GPAI with systemic risk | Transparency, risk evaluation, incident reporting, adversarial testing | Effective Aug 2025 | EC Guidelines |
| US EO 14110 | 10^26 FLOP | General AI systems | Pre-training notification, safety testing, security measures | Active (Oct 2023) | Commerce reporting |
| US EO 14110 | 10^23 FLOP | Biological sequence models | Same as above, lower threshold for bio-risk | Active (Oct 2023) | Commerce reporting |
| China Draft AI Law | TBD (compute + parameters) | “Critical AI” systems | Assessment and approval before market deployment | Draft stage | Asia Society |
| UK AISI | Capability-based | Frontier models | Voluntary evaluation, no formal threshold | Monitoring only | AISI Framework |

The 1000x difference between the US biological threshold (10^23) and general threshold (10^26) reflects assessment that biological capabilities may emerge at much smaller model scales. The EU’s 10^25 threshold sits between these extremes, calibrated to capture approximately GPT-4-scale models while excluding smaller systems.
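To make the comparison concrete, the following is a minimal sketch of how a single training run maps onto these regimes. The threshold values come from the table above; the example compute figures are purely illustrative, and real compliance determinations involve far more than a numeric comparison.

```python
# Minimal sketch: checking a training run against the threshold regimes in the table above.
# Threshold values are from the table; the example compute figures are illustrative only.

THRESHOLDS_FLOP = {
    "EU AI Act (GPAI with systemic risk)": 1e25,
    "US EO 14110 (general)": 1e26,
    "US EO 14110 (biological sequence models)": 1e23,
}

def triggered_regimes(training_flop: float, bio_model: bool = False) -> list[str]:
    """Return the regimes whose threshold a training run would cross."""
    hits = []
    for regime, threshold in THRESHOLDS_FLOP.items():
        if "biological" in regime and not bio_model:
            continue  # the 10^23 trigger applies only to bio-sequence models
        if training_flop > threshold:
            hits.append(regime)
    return hits

print(triggered_regimes(4e25))                   # crosses the EU line, not the US general line
print(triggered_regimes(5e23, bio_model=True))   # small bio-sequence model still triggers the US bio rule
```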

EU AI Act Foundation Models Regulation (2024)

The EU AI Act, which entered into force in August 2024, establishes the most comprehensive compute threshold regime to date. Models trained with more than 10^25 FLOP are classified as General Purpose AI (GPAI) systems with systemic risk, triggering substantial obligations including transparency requirements about training data and processes, systemic risk evaluations, mandatory incident reporting, adversarial testing requirements, and comprehensive documentation and compliance obligations.

The 10^25 FLOP threshold was calibrated to capture models at roughly GPT-4’s training scale, which required approximately 2-5 × 10^25 FLOP based on available estimates. This places the threshold at the current frontier while avoiding over-regulation of smaller models. The EU’s approach focuses heavily on transparency and risk assessment rather than prohibition, reflecting a regulatory philosophy of managed deployment rather than restriction.
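This calibration can be sanity-checked with the standard back-of-envelope rule that training a dense transformer costs roughly 6 FLOP per parameter per training token (C ≈ 6ND). The sketch below applies that approximation; the parameter and token counts are rough public estimates or assumptions, not reported figures.

```python
# Back-of-envelope training-compute estimate, C ~= 6 * N * D
# (roughly 6 FLOP per parameter per training token for a dense transformer).
# Parameter and token counts below are rough public estimates / assumptions.

def training_flop(params: float, tokens: float) -> float:
    """Approximate total training compute in FLOP."""
    return 6 * params * tokens

examples = {
    "~GPT-3 scale (175B params, 300B tokens)": (175e9, 300e9),
    "~Llama 3 405B scale (405B params, ~15T tokens)": (405e9, 15e12),
}

EU_THRESHOLD_FLOP = 1e25

for name, (n, d) in examples.items():
    c = training_flop(n, d)
    side = "above" if c > EU_THRESHOLD_FLOP else "below"
    print(f"{name}: ~{c:.1e} FLOP ({side} the EU 10^25 threshold)")
```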

The implementation timeline is aggressive, with full compliance required by August 2025. Early evidence suggests major AI developers are already adapting their processes to meet EU requirements, with OpenAI, Anthropic, and Google all announcing compliance programs. However, the threshold’s static nature has drawn criticism from researchers who argue that algorithmic improvements will rapidly make it obsolete.

The United States took a different approach with Executive Order 14110, setting a higher threshold of 10^26 FLOP for general AI systems while establishing a much lower 10^23 FLOP threshold specifically for models trained primarily on biological sequence data. Models exceeding these thresholds must report training plans to the Department of Commerce before beginning training, share safety test results with the government, and implement security measures for model weights and training infrastructure.

The dual-threshold approach reflects differentiated risk assessment, with the biological threshold set at roughly GPT-3 scale (10^23 FLOP) to capture potential bioweapon development risks at lower capability levels. This represents approximately 1,000x less compute than the general threshold, acknowledging that dangerous biological capabilities may emerge at smaller scales than general intelligence capabilities.

Notable implementations include Meta’s reporting of Llama 3 training (estimated ~4 × 10^25 FLOP) and OpenAI’s compliance with pre-training notification requirements for models approaching the 10^26 threshold. The Department of Commerce has established preliminary reporting mechanisms, though full regulatory infrastructure is still under development.

The UK has taken a more cautious approach, with the Frontier AI Taskforce (now AI Safety Institute) monitoring compute thresholds without establishing formal regulatory triggers. China’s approach remains opaque, though draft regulations suggest consideration of compute-based measures alongside capability assessments. The result is a fragmented global landscape where companies must navigate multiple threshold regimes with different requirements and measurement standards.

Compute thresholds operate through a multi-stage regulatory pipeline that begins before training commences. The typical sequence involves threshold definition by regulators, pre-training notification by AI developers, threshold crossing triggering specific requirements, mandatory evaluation and testing phases, implementation of required safeguards, and finally authorized deployment under ongoing monitoring.
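A compact sketch of that sequence follows, written as ordered stages with a simple gate on the compute trigger. The stage names are a paraphrase of the process described above, not the text of either regulation.

```python
# Simplified sketch of the threshold-triggered regulatory pipeline described above.
# Stage names are a paraphrase of the process, not the text of any regulation.

PIPELINE_STAGES = [
    "pre-training notification",
    "threshold crossing -> obligations attach",
    "evaluation and testing (red-teaming, capability tests)",
    "implementation of required safeguards",
    "authorized deployment under ongoing monitoring",
]

THRESHOLD_FLOP = 1e25  # example trigger (EU AI Act level)

def required_stages(planned_training_flop: float) -> list[str]:
    """Return the stages a planned run must pass through under this sketch."""
    if planned_training_flop <= THRESHOLD_FLOP:
        return []  # below the trigger: no threshold-based obligations attach
    return PIPELINE_STAGES

print(required_stages(3e25))  # above threshold -> full pipeline
print(required_stages(5e24))  # below threshold -> []
```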

This pipeline structure is designed to provide regulatory visibility into AI development before capabilities emerge, rather than reacting after deployment. However, implementation varies significantly between jurisdictions, with the EU emphasizing post-training compliance verification while the US focuses on pre-training notification and ongoing cooperation.

Pre-training requirements typically include notification of training intent, security measures for training infrastructure, and preliminary risk assessments. Pre-deployment obligations encompass comprehensive safety evaluations including red-teaming exercises, capability testing across multiple domains, detailed risk assessments, and extensive documentation of training processes and data sources.

Ongoing requirements extend throughout the model lifecycle, including incident reporting for safety failures or misuse, monitoring systems for detecting problematic applications, cooperation with regulatory investigations, and periodic compliance audits. The breadth of these requirements reflects the challenge of governing AI systems whose capabilities and risks may emerge or change after initial deployment.

The most fundamental challenge facing compute thresholds is the pace of change on both sides of the compute-capability relationship, which threatens to make static thresholds increasingly irrelevant. Research by Epoch AI documents that training compute of frontier AI models has grown by 4-5x per year since 2020, while hardware efficiency improvements (roughly 12x over the past decade) and lower-precision formats (an approximately 8x gain) are simultaneously reducing the compute needed to achieve equivalent capabilities.

| Trend | Rate | Implication for Thresholds |
|---|---|---|
| Frontier compute growth | 4-5x/year | More models will exceed thresholds |
| Hardware efficiency (FLOP/W) | 1.28x/year | Same compute costs less |
| Training cost growth | 2.4x/year | Frontier models now cost hundreds of millions USD |
| Capability improvement | ~15 points/year (2024) | Nearly doubled from ~8 points/year |

This creates a dual challenge: on one hand, if frontier compute growth continues, today’s thresholds will capture far more models than regulators are resourced to oversee. On the other hand, if algorithmic efficiency improves faster than expected, equivalent capabilities could be achieved with 10-100x less compute, allowing dangerous models to evade oversight entirely. The GovAI research on training compute thresholds explicitly notes that “training compute is an imperfect proxy for risk” and should be used to “detect potentially risky GPAI models that warrant regulatory oversight” rather than as a standalone regulatory mechanism.
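A rough projection makes the time pressure concrete: if the ~2x per year algorithmic-efficiency figure cited earlier holds, the compute needed to match a fixed capability level halves each year, so a capability sitting exactly at the threshold today drops below it within a few years. The numbers below are an illustrative extrapolation, not a forecast.

```python
# Illustrative extrapolation: compute needed to match a fixed capability level,
# assuming algorithmic efficiency improves ~2x per year (a rough trend, not a forecast).

EFFICIENCY_GAIN_PER_YEAR = 2.0
EU_THRESHOLD_FLOP = 1e25

compute_needed = 1e25  # capability level that takes ~10^25 FLOP to reach in 2023
for year in range(2023, 2029):
    side = "at/above" if compute_needed >= EU_THRESHOLD_FLOP else "below"
    print(f"{year}: ~{compute_needed:.2e} FLOP to match ({side} the 10^25 threshold)")
    compute_needed /= EFFICIENCY_GAIN_PER_YEAR
```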

The problem is compounded by the uneven nature of efficiency improvements, which vary significantly across model architectures and training paradigms. Language models, multimodal systems, and specialized scientific models each follow different efficiency trajectories, making it difficult to set universal thresholds that remain relevant across domains. The EU AI Act acknowledges this by including Article 51(3) provisions for the Commission to “amend the thresholds… in light of evolving technological developments, such as algorithmic improvements or increased hardware efficiency.”

Sophisticated actors have multiple strategies for evading compute thresholds while achieving equivalent model performance. The following table summarizes key evasion vectors identified in governance research:

| Evasion Strategy | Mechanism | Difficulty | Potential Countermeasure |
|---|---|---|---|
| Training run splitting | Multiple sub-threshold runs combined via fine-tuning or merging | Medium | Cumulative compute tracking across related runs |
| Model distillation | Train large teacher model privately, distill to smaller student | High | Teacher model reporting requirements |
| Jurisdictional arbitrage | Train in unregulated jurisdiction, deploy globally | Low | Deployment-based jurisdiction rules |
| Creative accounting | Exclude fine-tuning, inference, or multi-stage compute | Medium | Standardized compute definitions |
| Distributed training | Split training across jurisdictions/entities | Medium | Consolidated reporting requirements |
| Inference-time scaling | Use test-time compute instead of training compute | Low (emerging) | Include inference thresholds |

The distillation loophole is particularly concerning: as noted by governance researchers, “a company might use greater than 10^25 FLOPs to train a teacher model that is never marketed or used in the EU, then use that teacher model to train a smaller student model that is nearly as capable but trained using less than 10^25 FLOPs.” This allows regulatory evasion while achieving equivalent model performance.

International arbitrage allows organizations to conduct high-compute training in jurisdictions without established thresholds, then deploy globally. This creates competitive pressure for regulatory harmonization while potentially undermining the effectiveness of unilateral threshold implementations. The GovAI Know-Your-Customer proposal suggests that compute providers could help close these loopholes by identifying and reporting potentially problematic training runs.
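As a concrete illustration of the first countermeasure in the table, cumulative compute tracking would aggregate compute across runs attributed to the same model lineage before comparing against the threshold. The run records and lineage key below are hypothetical.

```python
from collections import defaultdict

# Sketch of cumulative compute tracking across related runs (the countermeasure listed
# above for training-run splitting). Run records and the lineage key are hypothetical.

runs = [
    {"lineage": "model-A", "stage": "pretrain shard 1", "flop": 6e24},
    {"lineage": "model-A", "stage": "pretrain shard 2", "flop": 6e24},
    {"lineage": "model-A", "stage": "fine-tune",        "flop": 5e23},
    {"lineage": "model-B", "stage": "pretrain",         "flop": 3e24},
]

EU_THRESHOLD_FLOP = 1e25

totals: dict[str, float] = defaultdict(float)
for run in runs:
    totals[run["lineage"]] += run["flop"]

for lineage, total in totals.items():
    # Each individual run above is below 10^25 FLOP, but the combined lineage may not be.
    flag = "exceeds" if total > EU_THRESHOLD_FLOP else "within"
    print(f"{lineage}: cumulative ~{total:.1e} FLOP ({flag} the 10^25 threshold)")
```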

Current threshold regimes rely primarily on self-reporting by AI developers, creating significant verification challenges. While major companies have generally complied in good faith with existing requirements, the absence of technical verification mechanisms creates enforcement vulnerabilities. Hardware-level monitoring could provide more reliable compute measurement, but raises significant privacy and trade secret concerns for AI developers.

Definitional ambiguities compound measurement challenges, particularly around edge cases like multi-stage training, transfer learning, and inference-time computation. The emergence of techniques like chain-of-thought reasoning and test-time training blur traditional boundaries between training and inference, potentially creating new categories of compute that existing thresholds don’t address.

Cloud computing platforms could provide third-party verification of compute usage, but this would require standardized reporting mechanisms and could expose competitively sensitive information about training methodologies and resource allocation strategies.

A particularly significant emerging challenge is the shift from training-time to inference-time compute scaling. As GovAI research on inference scaling warns, “the shift from scaling up pre-training compute to inference compute may have profound effects on AI governance. Rapid scaling of inference-at-deployment could potentially undermine AI governance measures that rely on training-compute thresholds.”

Models like OpenAI’s o1 demonstrate that substantial capability improvements can come from inference-time computation rather than training compute. This creates a fundamental gap in current threshold regimes:

| Compute Type | Current Coverage | Governance Challenge |
|---|---|---|
| Training compute | Covered by EU/US thresholds | Well-defined, measurable |
| Fine-tuning compute | Ambiguous coverage | May be excluded from calculations |
| Inference compute (deployment) | Not covered | Grows with usage, hard to predict |
| Test-time training | Not covered | Blurs training/inference boundary |

As inference-time scaling becomes more prevalent, a model trained with below-threshold compute could achieve above-threshold capabilities through extensive inference-time computation, completely evading current regulatory frameworks.
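A back-of-envelope comparison shows why this matters. Using the common approximation of roughly 2 FLOP per parameter per generated token for dense-model inference, a heavily used deployment can accumulate compute on the same order as training. All figures below are illustrative assumptions, not measurements of any real system.

```python
# Back-of-envelope comparison of training vs. lifetime inference compute, using the rough
# ~2 * N FLOP per generated token approximation for a dense model with N parameters.
# All figures are illustrative assumptions, not measurements of any deployed system.

params = 70e9                  # 70B-parameter model (assumption)
training_tokens = 10e12        # 10T training tokens (assumption)
training_flop = 6 * params * training_tokens

tokens_per_query = 2_000       # long chain-of-thought style responses (assumption)
queries_per_day = 50_000_000   # assumed usage volume
days = 365

inference_flop = 2 * params * tokens_per_query * queries_per_day * days

print(f"training:  ~{training_flop:.1e} FLOP")
print(f"inference: ~{inference_flop:.1e} FLOP over one year of deployment")
print(f"ratio:     ~{inference_flop / training_flop:.1f}x")
```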

Compute thresholds provide several valuable safety benefits despite their limitations. They create predictable regulatory entry points that allow companies to plan safety investments and compliance strategies in advance, rather than reacting to post-deployment requirements. The transparency requirements triggered by thresholds generate valuable information about frontier AI development that enables better risk assessment and policy development.

Threshold systems also establish precedents for AI-specific regulation that can evolve toward more sophisticated approaches over time. They provide regulatory agencies with initial experience governing AI development while building institutional capacity for more complex oversight mechanisms. The international coordination emerging around threshold harmonization creates foundations for broader AI governance cooperation.

From an industry perspective, thresholds provide regulatory certainty that enables long-term investment in safety infrastructure while creating level playing fields where all frontier developers face similar requirements.

However, compute thresholds exhibit significant safety limitations that could create false confidence in regulatory coverage. They may miss dangerous capabilities that emerge at lower compute levels, particularly in specialized domains like biotechnology or cybersecurity where domain-specific training data matters more than raw computational scale.

The static nature of current thresholds creates growing blind spots as algorithmic efficiency improves, potentially allowing increasingly capable systems to evade oversight. Threshold evasion strategies could enable bad actors to develop dangerous capabilities while avoiding regulatory scrutiny, particularly if enforcement mechanisms remain weak.

Perhaps most concerning, compute thresholds may distract from more direct capability-based assessments that could provide better safety coverage. The focus on computational inputs rather than capability outputs could lead to regulatory frameworks that miss the most important risk factors while imposing compliance burdens on relatively safe high-compute applications.

The immediate future will see operationalization of existing threshold regimes, with EU AI Act requirements becoming fully effective in August 2025 and US Executive Order provisions being codified into formal regulations. This period will provide crucial empirical data about threshold effectiveness, compliance costs, and gaming strategies that will inform future policy development.

According to GovAI forecasts on frontier model counts, the number of models exceeding absolute compute thresholds will increase superlinearly, while thresholds defined relative to the largest training run see a more stable trend of 14-16 models captured annually from 2025-2028. This suggests static absolute thresholds like the current EU and US implementations will capture an increasing number of models over time, potentially requiring significant regulatory scaling.

| Year | Models Exceeding 10^25 FLOP (Estimate) | Models Exceeding Relative Threshold | Regulatory Implication |
|---|---|---|---|
| 2024 | 5-10 | 14-16 | Current capacity adequate |
| 2025 | 15-25 | 14-16 | EU compliance begins |
| 2026 | 30-50 | 14-16 | May need threshold adjustment |
| 2027 | 60-100 | 14-16 | Scaling challenges |
| 2028 | 100-200 | 14-16 | Potential capacity crisis |

International harmonization discussions are likely to intensify as the compliance burden of divergent threshold regimes becomes apparent to global AI developers. The OECD and UK AI Safety Institute collaborative session on thresholds at the February 2025 AI Action Summit exemplifies growing international coordination efforts. Technical standards development will accelerate, particularly around compute measurement methodologies and verification mechanisms.

The medium-term trajectory will likely see significant evolution away from purely static thresholds toward more sophisticated triggering mechanisms. Algorithmic efficiency improvements will force either frequent threshold updates or adoption of alternative approaches that maintain regulatory relevance despite efficiency gains.

Capability-based triggers are expected to emerge as a complement to or replacement for compute thresholds, using standardized benchmark evaluations to determine regulatory requirements based on demonstrated abilities rather than resource consumption. GovAI research on risk thresholds recommends that “companies define risk thresholds to provide a principled foundation for their decision-making, use these to help set capability thresholds, and then primarily rely on capability thresholds.”

| Threshold Type | Advantages | Disadvantages | Best Use Case |
|---|---|---|---|
| Compute-based (absolute) | Simple, measurable, predictable | Becomes obsolete with efficiency gains | Initial screening, pre-training notification |
| Compute-based (relative) | Adapts to frontier advances | Requires ongoing calibration | Capturing only true frontier models |
| Capability-based | Directly measures risk-relevant properties | Hard to evaluate, may miss novel capabilities | Post-training safety assessment |
| Risk-based | Most principled approach | Most difficult to evaluate reliably | Strategic decision frameworks |
| Hybrid | Balances predictability with relevance | Complex to implement | Long-term regulatory evolution |

International regime development will likely produce multilateral frameworks for threshold coordination, potentially through new international organizations or expanded mandates for existing bodies like the OECD or UN. These frameworks will need to address both threshold harmonization and enforcement cooperation to be effective.

The long-term future of compute thresholds depends critically on the pace of algorithmic efficiency improvements and the development of alternative governance mechanisms. If efficiency gains continue at current rates, compute-based triggers may become obsolete entirely, requiring wholesale transition to capability-based or other approaches.

Alternatively, threshold evolution could incorporate dynamic adjustment mechanisms that automatically update based on efficiency benchmarks or capability correlations, maintaining relevance despite technological change. This would require sophisticated measurement systems and potentially automated regulatory frameworks.
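In principle, such an indexing rule could be very simple: lower the FLOP trigger in proportion to a measured algorithmic-efficiency index so that the capability level it captures stays roughly constant. The sketch below is a hypothetical illustration, not a mechanism proposed in either the EU or US regime.

```python
# Hypothetical dynamically indexed threshold: lower the FLOP trigger in proportion to a
# measured algorithmic-efficiency index so the captured capability level stays constant.

BASE_YEAR = 2024
BASE_THRESHOLD_FLOP = 1e25

def indexed_threshold(year: int, efficiency_gain_per_year: float = 2.0) -> float:
    """Threshold adjusted downward as less compute is needed for the same capability."""
    return BASE_THRESHOLD_FLOP / (efficiency_gain_per_year ** (year - BASE_YEAR))

for year in (2024, 2026, 2028):
    print(f"{year}: trigger at ~{indexed_threshold(year):.1e} FLOP")
```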

The emergence of novel AI architectures like neuromorphic computing or quantum-classical hybrid systems could fundamentally alter the compute-capability relationship, potentially making current FLOP-based measurements irrelevant and requiring entirely new regulatory metrics.

Several critical uncertainties will determine the future effectiveness of compute threshold approaches. The pace and trajectory of algorithmic efficiency improvements remains unpredictable, with potential for breakthrough innovations that dramatically decouple compute from capabilities. Current trend extrapolation suggests 2x annual improvements, but this could accelerate or plateau depending on fundamental algorithmic advances.

The correlation between compute and dangerous capabilities is empirically understudied, particularly for specialized risks like bioweapons development or deceptive alignment. Better understanding these relationships is crucial for calibrating threshold levels and determining when capability-based triggers might be more appropriate.

Enforcement mechanisms remain largely theoretical, with limited real-world testing of verification systems or consequences for non-compliance. The willingness and ability of regulatory agencies to detect and respond to threshold evasion will ultimately determine system effectiveness.

International coordination dynamics are highly uncertain, particularly regarding participation by major AI powers like China and cooperation between democratic and authoritarian governance systems. The success of threshold regimes may depend critically on achieving sufficient global coverage to prevent regulatory arbitrage.

The development of standardized capability evaluation systems presents both technical and political challenges that could determine whether hybrid threshold-capability approaches become feasible. Progress on evaluation methodology, benchmark development, and international standards will shape the evolution of regulatory frameworks beyond pure compute triggers.


The following research organizations have produced foundational work on compute threshold governance:

| Organization | Key Contribution | Focus Area |
|---|---|---|
| GovAI | Training Compute Thresholds, Inference Scaling Governance, Risk Thresholds | Threshold design, alternative approaches |
| CSET Georgetown | AI Governance at the Frontier, preparedness frameworks | Policy implementation, US context |
| Epoch AI | Compute trends, training cost analysis | Empirical compute data, forecasting |
| UK AI Security Institute | Frontier AI Trends Report, capability evaluations | Empirical capability assessment |
| OECD | Thresholds for Frontier AI sessions | International coordination, standards |



Compute thresholds improve the AI Transition Model through Civilizational Competence:

| Factor | Parameter | Impact |
|---|---|---|
| Civilizational Competence | Regulatory Capacity | Objective triggers enable automated enforcement of safety requirements |
| Civilizational Competence | Institutional Quality | Clear thresholds reduce regulatory discretion and political capture |

Threshold effectiveness depends on keeping pace with algorithmic efficiency improvements; static thresholds become obsolete within 3-5 years.