Compute Thresholds
Overview
Compute thresholds represent one of the most concrete regulatory approaches to AI governance implemented to date, using training compute as a measurable trigger for safety and transparency requirements. Unlike export controls that restrict access or monitoring systems that provide ongoing visibility, thresholds create a simple binary rule: if you train a model above X floating-point operations (FLOP), you must comply with specific regulatory obligations.
This approach has gained traction because compute is both measurable and correlates with model capabilities, albeit imperfectly. The European Union’s AI Act established a 10^25 FLOP threshold in 2024, while the US Executive Order on AI set a 10^26 FLOP trigger in October 2023. These implementations represent the first large-scale attempt to regulate AI development based on resource consumption rather than demonstrated capabilities or actor identity.
However, compute thresholds face a fundamental challenge: algorithmic efficiency improvements of approximately 2x per year are decoupling compute requirements from capabilities. A model requiring 10^25 FLOP in 2023 might achieve equivalent performance with only 10^24 FLOP by 2026, potentially making static thresholds obsolete within 3-5 years. This creates an ongoing tension between the tractability of compute-based triggers and their diminishing relevance as a proxy for AI capabilities and associated risks.
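To make the arithmetic behind this concern concrete, the sketch below computes how long it takes for capabilities at one compute level to become reachable at a lower level, assuming the ~2x/year efficiency trend cited above holds (that rate is an extrapolated assumption, not a guarantee):

```python
import math

# Assumed trend: compute needed for fixed capability halves each year.
EFFICIENCY_GAIN_PER_YEAR = 2.0

def years_until_equivalent(original_flop: float, reduced_flop: float) -> float:
    """Years until capabilities first reached at `original_flop` become
    reachable at `reduced_flop`, under the assumed efficiency trend."""
    ratio = original_flop / reduced_flop
    return math.log(ratio) / math.log(EFFICIENCY_GAIN_PER_YEAR)

# Capabilities of a 10^25 FLOP model reachable at 10^24 FLOP:
print(years_until_equivalent(1e25, 1e24))  # ~3.3 years, i.e. 2023 -> late 2026
```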
Quick Assessment
| Dimension | Assessment | Notes |
|---|---|---|
| Tractability | High | Already implemented in major jurisdictions |
| Mechanism | Trigger-based | Cross threshold → face requirements |
| Current Status | Active | EU AI Act, US EO both use compute thresholds |
| Main Challenge | Algorithmic efficiency | Same capabilities with less compute over time |
| Time Horizon | 3-5 years | Before efficiency gains make current thresholds irrelevant |
Risks Addressed
| Risk | Mechanism | Effectiveness |
|---|---|---|
| Racing Dynamics | Forces safety testing before deployment | Medium |
| Bioweapons | Lower thresholds for bio-sequence models | Medium |
| Deceptive Alignment | Requires evaluation before deployment | Low-Medium |
Global Compute Threshold Comparison
The following table compares compute threshold implementations across major jurisdictions, revealing significant variation in both threshold levels and triggered requirements:
| Jurisdiction | Threshold | Scope | Key Requirements | Status | Source |
|---|---|---|---|---|---|
| EU AI Act | 10^25 FLOP | GPAI with systemic risk | Transparency, risk evaluation, incident reporting, adversarial testing | Effective Aug 2025 | EC Guidelines |
| US EO 14110 | 10^26 FLOP | General AI systems | Pre-training notification, safety testing, security measures | Active (Oct 2023) | Commerce reporting |
| US EO 14110 | 10^23 FLOP | Biological sequence models | Same as above, lower threshold for bio-risk | Active (Oct 2023) | Commerce reporting |
| China Draft AI Law | TBD (compute + parameters) | “Critical AI” systems | Assessment and approval before market deployment | Draft stage | Asia Society |
| UK AISI | Capability-based | Frontier models | Voluntary evaluation, no formal threshold | Monitoring only | AISI Framework |
The 1000x difference between the US biological threshold (10^23) and general threshold (10^26) reflects the assessment that biological capabilities may emerge at much smaller model scales. The EU’s 10^25 threshold sits between these extremes, calibrated to capture approximately GPT-4-scale models while excluding smaller systems.
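As an illustration of how these trigger levels interact, the hypothetical sketch below classifies a training run against the three numeric thresholds in the table. The scope logic is deliberately simplified; real rules for what counts as a biological sequence model are far more detailed:

```python
# Thresholds from the comparison table above (FLOP of training compute).
THRESHOLDS_FLOP = {
    "EU AI Act (GPAI systemic risk)": 1e25,
    "US EO 14110 (general)": 1e26,
    "US EO 14110 (biological)": 1e23,
}

def triggered_regimes(training_flop: float, bio_sequence_model: bool) -> list[str]:
    """Return which regimes a run would trigger, under simplified scope rules."""
    triggered = []
    for regime, threshold in THRESHOLDS_FLOP.items():
        if "biological" in regime and not bio_sequence_model:
            continue  # the 10^23 trigger applies only to bio-sequence models
        if training_flop > threshold:
            triggered.append(regime)
    return triggered

# A GPT-4-scale run (~2e25 FLOP) crosses the EU trigger but not the
# higher US general trigger:
print(triggered_regimes(2e25, bio_sequence_model=False))
```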
Current Implementations and Evidence
EU AI Act Foundation Models Regulation (2024)
The EU AI Act, which entered into force in August 2024, establishes the most comprehensive compute threshold regime to date. Models trained with more than 10^25 FLOP are classified as General Purpose AI (GPAI) systems with systemic risk, triggering substantial obligations including transparency requirements about training data and processes, systemic risk evaluations, mandatory incident reporting, adversarial testing requirements, and comprehensive documentation and compliance obligations.
The 10^25 FLOP threshold was calibrated to capture models at roughly GPT-4’s training scale, which required approximately 2-5 × 10^25 FLOP based on available estimates. This places the threshold at the current frontier while avoiding over-regulation of smaller models. The EU’s approach focuses heavily on transparency and risk assessment rather than prohibition, reflecting a regulatory philosophy of managed deployment rather than restriction.
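A standard back-of-the-envelope for dense transformer training compute is roughly 6 FLOP per parameter per training token (covering forward and backward passes). The sketch below applies it; the parameter and token counts are illustrative assumptions, chosen to land near the Llama 3 estimate cited later in this section:

```python
def estimate_training_flop(n_params: float, n_tokens: float) -> float:
    """Dense-transformer rule of thumb: ~6 FLOP per parameter per token."""
    return 6.0 * n_params * n_tokens

# Illustrative: a 405B-parameter model trained on ~15T tokens comes out
# around 3.6e25 FLOP, above the EU 1e25 trigger but below the US 1e26 one.
print(f"{estimate_training_flop(4.05e11, 1.5e13):.2e}")  # ~3.64e+25
```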
The implementation timeline is aggressive, with full compliance required by August 2025. Early evidence suggests major AI developers are already adapting their processes to meet EU requirements, with OpenAI, Anthropic, and Google all announcing compliance programs. However, the threshold’s static nature has drawn criticism from researchers who argue that algorithmic improvements will rapidly make it obsolete.
US Executive Order 14110 (October 2023)
The United States took a different approach with Executive Order 14110, setting a higher threshold of 10^26 FLOP for general AI systems while establishing a much lower 10^23 FLOP threshold specifically for models trained primarily on biological sequence data. Models exceeding these thresholds must report training plans to the Department of Commerce before beginning training, share safety test results with the government, and implement security measures for model weights and training infrastructure.
The dual-threshold approach reflects differentiated risk assessment, with the biological threshold set at roughly GPT-3 scale (10^23 FLOP) to capture potential bioweapon development risks at lower capability levels. This represents approximately 1000x lower compute than the general threshold, acknowledging that biological capabilities may emerge at smaller scales than general intelligence capabilities.
Notable implementations include Meta’s reporting of Llama 3 training (estimated ~4 × 10^25 FLOP) and OpenAI’s compliance with pre-training notification requirements for models approaching the 10^26 threshold. The Department of Commerce has established preliminary reporting mechanisms, though full regulatory infrastructure is still under development.
Comparative International Approaches
The UK has taken a more cautious approach, with the Frontier AI Taskforce (now AI Safety Institute) monitoring compute thresholds without establishing formal regulatory triggers. China’s approach remains opaque, though draft regulations suggest consideration of compute-based measures alongside capability assessments. The result is a fragmented global landscape where companies must navigate multiple threshold regimes with different requirements and measurement standards.
Threshold Mechanisms and Implementation
Regulatory Pipeline Architecture
Compute thresholds operate through a multi-stage regulatory pipeline that begins before training commences. The typical sequence involves threshold definition by regulators, pre-training notification by AI developers, threshold crossing triggering specific requirements, mandatory evaluation and testing phases, implementation of required safeguards, and finally authorized deployment under ongoing monitoring.
This pipeline structure is designed to provide regulatory visibility into AI development before capabilities emerge, rather than reacting after deployment. However, implementation varies significantly between jurisdictions, with the EU emphasizing post-training compliance verification while the US focuses on pre-training notification and ongoing cooperation.
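A minimal sketch of this pipeline as an ordered sequence of stages; the stage names are illustrative shorthand, not statutory terms:

```python
from enum import Enum, auto

class Stage(Enum):
    THRESHOLD_DEFINITION = auto()       # regulator sets the FLOP trigger
    PRE_TRAINING_NOTIFICATION = auto()  # developer reports the planned run
    THRESHOLD_CROSSING = auto()         # run exceeds the trigger
    EVALUATION_AND_TESTING = auto()     # red-teaming, capability testing
    SAFEGUARD_IMPLEMENTATION = auto()   # security measures, documentation
    MONITORED_DEPLOYMENT = auto()       # incident reporting, audits

# Enum declaration order preserves the regulatory sequence:
PIPELINE = list(Stage)
```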
Triggered Requirements Spectrum
Pre-training requirements typically include notification of training intent, security measures for training infrastructure, and preliminary risk assessments. Pre-deployment obligations encompass comprehensive safety evaluations including red-teaming exercises, capability testing across multiple domains, detailed risk assessments, and extensive documentation of training processes and data sources.
Ongoing requirements extend throughout the model lifecycle, including incident reporting for safety failures or misuse, monitoring systems for detecting problematic applications, cooperation with regulatory investigations, and periodic compliance audits. The breadth of these requirements reflects the challenge of governing AI systems whose capabilities and risks may emerge or change after initial deployment.
Core Challenges and Limitations
The Algorithmic Efficiency Problem
The most fundamental challenge facing compute thresholds is that the mapping from compute to capability is a moving target, which threatens to make static thresholds increasingly irrelevant. Research by Epoch AI documents that training compute of frontier AI models has grown by 4-5x per year since 2020, while hardware efficiency improvements (roughly 12x over the past decade) and lower-precision formats (an estimated 8x improvement) are simultaneously reducing the compute needed to achieve equivalent capabilities.
| Trend | Rate | Implication for Thresholds |
|---|---|---|
| Frontier compute growth | 4-5x/year | More models will exceed thresholds |
| Hardware efficiency (FLOP/W) | 1.28x/year | Same compute costs less |
| Training cost growth | 2.4x/year | Frontier models now cost hundreds of millions USD |
| Capability improvement | ~15 points/year (2024) | Nearly doubled from ~8 points/year |
This creates a dual challenge: on one hand, as frontier compute grows 4-5x per year, static thresholds will capture far more models than intended, straining regulatory capacity. On the other hand, if algorithmic efficiency improves faster than expected, equivalent capabilities could be achieved with 10-100x less compute, allowing dangerous models to evade oversight. The GovAI research on training compute thresholds explicitly notes that “training compute is an imperfect proxy for risk” and should be used to “detect potentially risky GPAI models that warrant regulatory oversight” rather than as a standalone regulatory mechanism.
The problem is compounded by the uneven nature of efficiency improvements, which vary significantly across model architectures and training paradigms. Language models, multimodal systems, and specialized scientific models each follow different efficiency trajectories, making it difficult to set universal thresholds that remain relevant across domains. The EU AI Act acknowledges this by including Article 51(3) provisions for the Commission to “amend the thresholds… in light of evolving technological developments, such as algorithmic improvements or increased hardware efficiency.”
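One way to read the Article 51(3) provision is as a capability-anchored threshold that decays as efficiency improves. The sketch below assumes the ~2x/year trend from earlier in this section; it is an illustration of the idea, not a mechanism in the Act itself:

```python
def efficiency_adjusted_threshold(base_threshold_flop: float,
                                  years_elapsed: float,
                                  efficiency_gain_per_year: float = 2.0) -> float:
    """If the compute needed for a fixed capability falls by the efficiency
    factor each year, a threshold anchored to that capability should too."""
    return base_threshold_flop / (efficiency_gain_per_year ** years_elapsed)

# The EU's 10^25 trigger, anchored to 2024-era capabilities, would need
# to fall to ~2.5e24 two years later to keep covering the same level:
print(f"{efficiency_adjusted_threshold(1e25, 2):.1e}")  # 2.5e+24
```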
Gaming and Evasion Strategies
Sophisticated actors have multiple strategies for evading compute thresholds while achieving equivalent model performance. The following table summarizes key evasion vectors identified in governance research:
| Evasion Strategy | Mechanism | Difficulty | Potential Countermeasure |
|---|---|---|---|
| Training run splitting | Multiple sub-threshold runs combined via fine-tuning or merging | Medium | Cumulative compute tracking across related runs |
| Model distillation | Train large teacher model privately, distill to smaller student | High | Teacher model reporting requirements |
| Jurisdictional arbitrage | Train in unregulated jurisdiction, deploy globally | Low | Deployment-based jurisdiction rules |
| Creative accounting | Exclude fine-tuning, inference, or multi-stage compute | Medium | Standardized compute definitions |
| Distributed training | Split training across jurisdictions/entities | Medium | Consolidated reporting requirements |
| Inference-time scaling | Use test-time compute instead of training compute | Low (emerging) | Include inference thresholds |
The distillation loophole is particularly concerning: as noted by governance researchers, “a company might use greater than 10^25 FLOPs to train a teacher model that is never marketed or used in the EU, then use that teacher model to train a smaller student model that is nearly as capable but trained using less than 10^25 FLOPs.” This allows regulatory evasion while achieving equivalent model performance.
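The cumulative-tracking countermeasure listed in the table above could, in principle, close this loophole by attributing teacher and ancestor compute to downstream models. A hypothetical sketch, with invented field names:

```python
from dataclasses import dataclass, field

@dataclass
class TrainingRun:
    name: str
    flop: float
    # Upstream runs this one depends on: distillation teachers,
    # merged checkpoints, earlier stages of a split run.
    parents: list["TrainingRun"] = field(default_factory=list)

def attributable_flop(run: TrainingRun, seen: set[str] | None = None) -> float:
    """A run's own FLOP plus all ancestor FLOP, each ancestor counted once."""
    seen = seen if seen is not None else set()
    if run.name in seen:
        return 0.0
    seen.add(run.name)
    return run.flop + sum(attributable_flop(p, seen) for p in run.parents)

teacher = TrainingRun("teacher", 3e25)             # private, above-threshold
student = TrainingRun("student", 5e24, [teacher])  # marketed, sub-threshold alone
print(f"{attributable_flop(student):.1e}")         # 3.5e+25 -> trigger applies
```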
International arbitrage allows organizations to conduct high-compute training in jurisdictions without established thresholds, then deploy globally. This creates competitive pressure for regulatory harmonization while potentially undermining the effectiveness of unilateral threshold implementations. The GovAI Know-Your-Customer proposal suggests that compute providers could help close these loopholes by identifying and reporting potentially problematic training runs.
Measurement and Verification Challenges
Current threshold regimes rely primarily on self-reporting by AI developers, creating significant verification challenges. While major companies have generally complied in good faith with existing requirements, the absence of technical verification mechanisms creates enforcement vulnerabilities. Hardware-level monitoring could provide more reliable compute measurement, but raises significant privacy and trade secret concerns for AI developers.
Definitional ambiguities compound measurement challenges, particularly around edge cases like multi-stage training, transfer learning, and inference-time computation. The emergence of techniques like chain-of-thought reasoning and test-time training blur traditional boundaries between training and inference, potentially creating new categories of compute that existing thresholds don’t address.
Cloud computing platforms could provide third-party verification of compute usage, but this would require standardized reporting mechanisms and could potentially compromise competitively sensitive information about training methodologies and resource allocation strategies.
The Inference Scaling Challenge
A particularly significant emerging challenge is the shift from training-time to inference-time compute scaling. As GovAI research on inference scaling warns, “the shift from scaling up pre-training compute to inference compute may have profound effects on AI governance. Rapid scaling of inference-at-deployment could potentially undermine AI governance measures that rely on training-compute thresholds.”
Models like OpenAI’s o1 demonstrate that substantial capability improvements can come from inference-time computation rather than training compute. This creates a fundamental gap in current threshold regimes:
| Compute Type | Current Coverage | Governance Challenge |
|---|---|---|
| Training compute | Covered by EU/US thresholds | Well-defined, measurable |
| Fine-tuning compute | Ambiguous coverage | May be excluded from calculations |
| Inference compute (deployment) | Not covered | Grows with usage, hard to predict |
| Test-time training | Not covered | Blurs training/inference boundary |
As inference-time scaling becomes more prevalent, a model trained with below-threshold compute could achieve above-threshold capabilities through extensive inference-time computation, completely evading current regulatory frameworks.
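A rough illustration of the size of this gap, using assumed (not measured) per-query figures:

```python
# All numbers below are assumptions chosen for illustration.
training_flop = 5e24      # below the EU 1e25 trigger
flop_per_query = 1e15     # heavy chain-of-thought style inference
queries_per_day = 1e7     # a widely deployed service

deployment_flop_per_year = flop_per_query * queries_per_day * 365
print(f"{deployment_flop_per_year:.1e}")  # ~3.7e+24 per year, rivaling training

# Current thresholds count only training_flop, so this deployment compute
# is invisible to the regimes in the table above.
```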
Safety Implications and Risk Assessment
Promising Aspects
Compute thresholds provide several valuable safety benefits despite their limitations. They create predictable regulatory entry points that allow companies to plan safety investments and compliance strategies in advance, rather than reacting to post-deployment requirements. The transparency requirements triggered by thresholds generate valuable information about frontier AI development that enables better risk assessment and policy development.
Threshold systems also establish precedents for AI-specific regulation that can evolve toward more sophisticated approaches over time. They provide regulatory agencies with initial experience governing AI development while building institutional capacity for more complex oversight mechanisms. The international coordination emerging around threshold harmonization creates foundations for broader AI governance cooperation.
From an industry perspective, thresholds provide regulatory certainty that enables long-term investment in safety infrastructure while creating level playing fields where all frontier developers face similar requirements.
Concerning Limitations
However, compute thresholds exhibit significant safety limitations that could create false confidence in regulatory coverage. They may miss dangerous capabilities that emerge at lower compute levels, particularly in specialized domains like biotechnology or cybersecurity where domain-specific training data matters more than raw computational scale.
The static nature of current thresholds creates growing blind spots as algorithmic efficiency improves, potentially allowing increasingly capable systems to evade oversight. Threshold evasion strategies could enable bad actors to develop dangerous capabilities while avoiding regulatory scrutiny, particularly if enforcement mechanisms remain weak.
Perhaps most concerning, compute thresholds may distract from more direct capability-based assessments that could provide better safety coverage. The focus on computational inputs rather than capability outputs could lead to regulatory frameworks that miss the most important risk factors while imposing compliance burdens on relatively safe high-compute applications.
Future Trajectory and Evolution
Short-term Developments (1-2 years)
The immediate future will see operationalization of existing threshold regimes, with EU AI Act requirements becoming fully effective in August 2025 and US Executive Order provisions being codified into formal regulations. This period will provide crucial empirical data about threshold effectiveness, compliance costs, and gaming strategies that will inform future policy development.
According to GovAI forecasts on frontier model counts, the number of models exceeding absolute compute thresholds will increase superlinearly, while thresholds defined relative to the largest training run see a more stable trend of 14-16 models captured annually from 2025-2028. This suggests static absolute thresholds like the current EU and US implementations will capture an increasing number of models over time, potentially requiring significant regulatory scaling.
| Year | Models Exceeding 10^25 FLOP (Estimate) | Models Exceeding Relative Threshold | Regulatory Implication |
|---|---|---|---|
| 2024 | 5-10 | 14-16 | Current capacity adequate |
| 2025 | 15-25 | 14-16 | EU compliance begins |
| 2026 | 30-50 | 14-16 | May need threshold adjustment |
| 2027 | 60-100 | 14-16 | Scaling challenges |
| 2028 | 100-200 | 14-16 | Potential capacity crisis |
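The sketch below contrasts the two trigger styles underlying this forecast; the 10x relative margin is an illustrative choice, not drawn from any specific proposal:

```python
def exceeds_absolute(run_flop: float, threshold: float = 1e25) -> bool:
    """Static trigger: captures ever more models as frontier compute grows."""
    return run_flop > threshold

def exceeds_relative(run_flop: float, largest_known_run: float,
                     margin: float = 10.0) -> bool:
    """Relative trigger: captures only runs within `margin` of the frontier."""
    return run_flop > largest_known_run / margin

# By 2028, a 5e25 run might exceed the static trigger yet sit far behind
# a hypothetical 1e27 frontier run, so the relative rule ignores it:
print(exceeds_absolute(5e25), exceeds_relative(5e25, 1e27))  # True False
```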
International harmonization discussions are likely to intensify as the compliance burden of divergent threshold regimes becomes apparent to global AI developers. The OECD and UK AI Safety Institute collaborative session on thresholds at the February 2025 AI Action Summit exemplifies growing international coordination efforts. Technical standards development will accelerate, particularly around compute measurement methodologies and verification mechanisms.
Medium-term Evolution (3-5 years)
The medium-term trajectory will likely see significant evolution away from purely static thresholds toward more sophisticated triggering mechanisms. Algorithmic efficiency improvements will force either frequent threshold updates or adoption of alternative approaches that maintain regulatory relevance despite efficiency gains.
Capability-based triggers are expected to emerge as a complement to or replacement for compute thresholds, using standardized benchmark evaluations to determine regulatory requirements based on demonstrated abilities rather than resource consumption. GovAI research on risk thresholds recommends that “companies define risk thresholds to provide a principled foundation for their decision-making, use these to help set capability thresholds, and then primarily rely on capability thresholds.”
| Threshold Type | Advantages | Disadvantages | Best Use Case |
|---|---|---|---|
| Compute-based (absolute) | Simple, measurable, predictable | Becomes obsolete with efficiency gains | Initial screening, pre-training notification |
| Compute-based (relative) | Adapts to frontier advances | Requires ongoing calibration | Capturing only true frontier models |
| Capability-based | Directly measures risk-relevant properties | Hard to evaluate, may miss novel capabilities | Post-training safety assessment |
| Risk-based | Most principled approach | Most difficult to evaluate reliably | Strategic decision frameworks |
| Hybrid | Balances predictability with relevance | Complex to implement | Long-term regulatory evolution |
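A minimal sketch of how the hybrid row above could compose a compute screen with capability evaluations; the tier names, evaluation interface, and cutoff score are all hypothetical:

```python
def regulatory_tier(training_flop: float,
                    eval_score: float | None,
                    compute_screen: float = 1e25,
                    capability_cutoff: float = 0.8) -> str:
    """Compute acts as a cheap initial screen; capability evals decide."""
    if training_flop <= compute_screen:
        return "baseline obligations"
    if eval_score is None:
        return "evaluation required"        # compute screen triggers evals
    if eval_score >= capability_cutoff:
        return "systemic-risk obligations"  # capabilities confirm the risk
    return "transparency obligations only"  # high compute, modest capability

print(regulatory_tier(3e25, eval_score=None))  # evaluation required
print(regulatory_tier(3e25, eval_score=0.9))   # systemic-risk obligations
```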
International regime development will likely produce multilateral frameworks for threshold coordination, potentially through new international organizations or expanded mandates for existing bodies like the OECD or UN. These frameworks will need to address both threshold harmonization and enforcement cooperation to be effective.
Long-term Uncertainty (5+ years)
The long-term future of compute thresholds depends critically on the pace of algorithmic efficiency improvements and the development of alternative governance mechanisms. If efficiency gains continue at current rates, compute-based triggers may become obsolete entirely, requiring wholesale transition to capability-based or other approaches.
Alternatively, threshold evolution could incorporate dynamic adjustment mechanisms that automatically update based on efficiency benchmarks or capability correlations, maintaining relevance despite technological change. This would require sophisticated measurement systems and potentially automated regulatory frameworks.
The emergence of novel AI architectures like neuromorphic computing or quantum-classical hybrid systems could fundamentally alter the compute-capability relationship, potentially making current FLOP-based measurements irrelevant and requiring entirely new regulatory metrics.
Key Uncertainties and Research Questions
Several critical uncertainties will determine the future effectiveness of compute threshold approaches. The pace and trajectory of algorithmic efficiency improvements remains unpredictable, with potential for breakthrough innovations that dramatically decouple compute from capabilities. Current trend extrapolation suggests 2x annual improvements, but this could accelerate or plateau depending on fundamental algorithmic advances.
The correlation between compute and dangerous capabilities is empirically understudied, particularly for specialized risks like bioweapons development or deceptive alignment. Better understanding these relationships is crucial for calibrating threshold levels and determining when capability-based triggers might be more appropriate.
Enforcement mechanisms remain largely theoretical, with limited real-world testing of verification systems or consequences for non-compliance. The willingness and ability of regulatory agencies to detect and respond to threshold evasion will ultimately determine system effectiveness.
International coordination dynamics are highly uncertain, particularly regarding participation by major AI powers like China and cooperation between democratic and authoritarian governance systems. The success of threshold regimes may depend critically on achieving sufficient global coverage to prevent regulatory arbitrage.
The development of standardized capability evaluation systems presents both technical and political challenges that could determine whether hybrid threshold-capability approaches become feasible. Progress on evaluation methodology, benchmark development, and international standards will shape the evolution of regulatory frameworks beyond pure compute triggers.
Key Research and Sources
The following research organizations have produced foundational work on compute threshold governance:
| Organization | Key Contribution | Focus Area |
|---|---|---|
| GovAI | Training Compute Thresholds, Inference Scaling Governance, Risk Thresholds | Threshold design, alternative approaches |
| CSET Georgetown | AI Governance at the Frontier, preparedness frameworks | Policy implementation, US context |
| Epoch AI | Compute trends, training cost analysis | Empirical compute data, forecasting |
| UK AI Security Institute | Frontier AI Trends Report, capability evaluations | Empirical capability assessment |
| OECD | Thresholds for Frontier AI sessions | International coordination, standards |
Related Approaches
- Export Controls — Restricting access rather than triggering requirements
- Compute Monitoring — Ongoing visibility into training
- International Regimes — Multilateral threshold coordination
Related Pages
AI Transition Model Context
Compute thresholds improve the AI Transition Model through Civilizational Competence:
| Factor | Parameter | Impact |
|---|---|---|
| Civilizational Competence | Regulatory Capacity | Objective triggers enable automated enforcement of safety requirements |
| Civilizational Competence | Institutional Quality | Clear thresholds reduce regulatory discretion and political capture |
Threshold effectiveness depends on keeping pace with algorithmic efficiency improvements; static thresholds risk obsolescence within 3-5 years.