Compute Thresholds

Approach: Define capability boundaries via compute
Status: Established in US and EU policy

Compute thresholds represent one of the most concrete regulatory approaches to AI governance implemented to date, using training compute as a measurable trigger for safety and transparency requirements. Unlike export controls that restrict access or monitoring systems that provide ongoing visibility, thresholds create a simple binary rule: if you train a model above X floating-point operations (FLOP), you must comply with specific regulatory obligations.

This approach has gained traction because compute is both measurable and correlates with model capabilities, albeit imperfectly. The European Union’s AI Act established a 10^25 FLOP threshold in 2024, while the US Executive Order on AI set a 10^26 FLOP trigger in October 2023. These implementations represent the first large-scale attempt to regulate AI development based on resource consumption rather than demonstrated capabilities or actor identity.

However, compute thresholds face a fundamental challenge: algorithmic efficiency improvements of approximately 2x per year are decoupling compute requirements from capabilities. A model requiring 10^25 FLOP in 2023 might achieve equivalent performance with only 10^24 FLOP by 2026, potentially making static thresholds obsolete within 3-5 years. This creates an ongoing tension between the tractability of compute-based triggers and their diminishing relevance as a proxy for AI capabilities and associated risks.

| Dimension | Assessment | Notes |
|---|---|---|
| Tractability | High | Already implemented in major jurisdictions |
| Mechanism | Trigger-based | Cross threshold → face requirements |
| Current Status | Active | EU AI Act, US EO both use compute thresholds |
| Main Challenge | Algorithmic efficiency | Same capabilities with less compute over time |
| Time Horizon | 3-5 years | Before efficiency gains make current thresholds irrelevant |

| Risk | Mechanism | Effectiveness |
|---|---|---|
| Racing Dynamics | Forces safety testing before deployment | Medium |
| Bioweapons | Lower thresholds for bio-sequence models | Medium |
| Deceptive Alignment | Requires evaluation before deployment | Low-Medium |

The following table compares compute threshold implementations across major jurisdictions, revealing significant variation in both threshold levels and triggered requirements:

| Jurisdiction | Threshold | Scope | Key Requirements | Status | Source |
|---|---|---|---|---|---|
| EU AI Act | 10^25 FLOP | GPAI with systemic risk | Transparency, risk evaluation, incident reporting, adversarial testing | Effective Aug 2025 | EC Guidelines |
| US EO 14110 | 10^26 FLOP | General AI systems | Pre-training notification, safety testing, security measures | Active (Oct 2023) | Commerce reporting |
| US EO 14110 | 10^23 FLOP | Biological sequence models | Same as above, lower threshold for bio-risk | Active (Oct 2023) | Commerce reporting |
| China Draft AI Law | TBD (compute + parameters) | “Critical AI” systems | Assessment and approval before market deployment | Draft stage | Asia Society |
| UK AISI | Capability-based | Frontier models | Voluntary evaluation, no formal threshold | Monitoring only | AISI Framework |

The 1000x difference between the US biological threshold (10^23) and general threshold (10^26) reflects assessment that biological capabilities may emerge at much smaller model scales. The EU’s 10^25 threshold sits between these extremes, calibrated to capture approximately GPT-4-scale models while excluding smaller systems.
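To make the comparison concrete, the following is a minimal sketch of how a single training run maps onto these regimes. The threshold values come from the table above; the example compute figures are purely illustrative, and real compliance determinations involve far more than a numeric comparison.

```python
# Minimal sketch: checking a training run against the threshold regimes in the table above.
# Threshold values are from the table; the example compute figures are illustrative only.

THRESHOLDS_FLOP = {
    "EU AI Act (GPAI with systemic risk)": 1e25,
    "US EO 14110 (general)": 1e26,
    "US EO 14110 (biological sequence models)": 1e23,
}

def triggered_regimes(training_flop: float, bio_model: bool = False) -> list[str]:
    """Return the regimes whose threshold a training run would cross."""
    hits = []
    for regime, threshold in THRESHOLDS_FLOP.items():
        if "biological" in regime and not bio_model:
            continue  # the 10^23 trigger applies only to bio-sequence models
        if training_flop > threshold:
            hits.append(regime)
    return hits

print(triggered_regimes(4e25))                   # crosses the EU line, not the US general line
print(triggered_regimes(5e23, bio_model=True))   # small bio-sequence model still triggers the US bio rule
```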

EU AI Act Foundation Models Regulation (2024)

The EU AI Act, which entered into force in August 2024, establishes the most comprehensive compute threshold regime to date. Models trained with more than 10^25 FLOP are classified as General Purpose AI (GPAI) systems with systemic risk, triggering substantial obligations including transparency requirements about training data and processes, systemic risk evaluations, mandatory incident reporting, adversarial testing requirements, and comprehensive documentation and compliance obligations.

The 10^25 FLOP threshold was calibrated to capture models at roughly GPT-4’s training scale, which required approximately 2-5 × 10^25 FLOP based on available estimates. This places the threshold at the current frontier while avoiding over-regulation of smaller models. The EU’s approach focuses heavily on transparency and risk assessment rather than prohibition, reflecting a regulatory philosophy of managed deployment rather than restriction.
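This calibration can be sanity-checked with the standard back-of-envelope rule that training a dense transformer costs roughly 6 FLOP per parameter per training token (C ≈ 6ND). The sketch below applies that approximation; the parameter and token counts are rough public estimates or assumptions, not reported figures.

```python
# Back-of-envelope training-compute estimate, C ~= 6 * N * D
# (roughly 6 FLOP per parameter per training token for a dense transformer).
# Parameter and token counts below are rough public estimates / assumptions.

def training_flop(params: float, tokens: float) -> float:
    """Approximate total training compute in FLOP."""
    return 6 * params * tokens

examples = {
    "~GPT-3 scale (175B params, 300B tokens)": (175e9, 300e9),
    "~Llama 3 405B scale (405B params, ~15T tokens)": (405e9, 15e12),
}

EU_THRESHOLD_FLOP = 1e25

for name, (n, d) in examples.items():
    c = training_flop(n, d)
    side = "above" if c > EU_THRESHOLD_FLOP else "below"
    print(f"{name}: ~{c:.1e} FLOP ({side} the EU 10^25 threshold)")
```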

The implementation timeline is aggressive, with full compliance required by August 2025. Early evidence suggests major AI developers are already adapting their processes to meet EU requirements, with OpenAI, Anthropic, and Google all announcing compliance programs. However, the threshold’s static nature has drawn criticism from researchers who argue that algorithmic improvements will rapidly make it obsolete.

The United States took a different approach with Executive Order 14110, setting a higher threshold of 10^26 FLOP for general AI systems while establishing a much lower 10^23 FLOP threshold specifically for models trained primarily on biological sequence data. Models exceeding these thresholds must report training plans to the Department of Commerce before beginning training, share safety test results with the government, and implement security measures for model weights and training infrastructure.

The dual-threshold approach reflects differentiated risk assessment, with the biological threshold set at roughly GPT-3 scale (10^23 FLOP) to capture potential bioweapon development risks at lower capability levels. This represents approximately 1,000x less compute than the general threshold, acknowledging that dangerous biological capabilities may emerge at smaller scales than general intelligence capabilities.

Notable implementations include Meta’s reporting of Llama 3 training (estimated ~4 × 10^25 FLOP) and OpenAI’s compliance with pre-training notification requirements for models approaching the 10^26 threshold. The Department of Commerce has established preliminary reporting mechanisms, though full regulatory infrastructure is still under development.

The UK has taken a more cautious approach, with the Frontier AI Taskforce (now AI Safety Institute) monitoring compute thresholds without establishing formal regulatory triggers. China’s approach remains opaque, though draft regulations suggest consideration of compute-based measures alongside capability assessments. The result is a fragmented global landscape where companies must navigate multiple threshold regimes with different requirements and measurement standards.

Compute thresholds operate through a multi-stage regulatory pipeline that begins before training commences. The typical sequence involves threshold definition by regulators, pre-training notification by AI developers, threshold crossing triggering specific requirements, mandatory evaluation and testing phases, implementation of required safeguards, and finally authorized deployment under ongoing monitoring.
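A compact sketch of that sequence follows, written as ordered stages with a simple gate on the compute trigger. The stage names are a paraphrase of the process described above, not the text of either regulation.

```python
# Simplified sketch of the threshold-triggered regulatory pipeline described above.
# Stage names are a paraphrase of the process, not the text of any regulation.

PIPELINE_STAGES = [
    "pre-training notification",
    "threshold crossing -> obligations attach",
    "evaluation and testing (red-teaming, capability tests)",
    "implementation of required safeguards",
    "authorized deployment under ongoing monitoring",
]

THRESHOLD_FLOP = 1e25  # example trigger (EU AI Act level)

def required_stages(planned_training_flop: float) -> list[str]:
    """Return the stages a planned run must pass through under this sketch."""
    if planned_training_flop <= THRESHOLD_FLOP:
        return []  # below the trigger: no threshold-based obligations attach
    return PIPELINE_STAGES

print(required_stages(3e25))  # above threshold -> full pipeline
print(required_stages(5e24))  # below threshold -> []
```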

This pipeline structure is designed to provide regulatory visibility into AI development before capabilities emerge, rather than reacting after deployment. However, implementation varies significantly between jurisdictions, with the EU emphasizing post-training compliance verification while the US focuses on pre-training notification and ongoing cooperation.

Pre-training requirements typically include notification of training intent, security measures for training infrastructure, and preliminary risk assessments. Pre-deployment obligations encompass comprehensive safety evaluations including red-teaming exercises, capability testing across multiple domains, detailed risk assessments, and extensive documentation of training processes and data sources.

Ongoing requirements extend throughout the model lifecycle, including incident reporting for safety failures or misuse, monitoring systems for detecting problematic applications, cooperation with regulatory investigations, and periodic compliance audits. The breadth of these requirements reflects the challenge of governing AI systems whose capabilities and risks may emerge or change after initial deployment.

The most fundamental challenge facing compute thresholds is the pace of change on both sides of the compute-capability relationship, which threatens to make static thresholds increasingly irrelevant. Research by Epoch AI documents that training compute of frontier AI models has grown by 4-5x per year since 2020, while hardware efficiency improvements (roughly 12x over the past decade) and lower-precision formats (an approximately 8x gain) are simultaneously reducing the compute needed to achieve equivalent capabilities.

| Trend | Rate | Implication for Thresholds |
|---|---|---|
| Frontier compute growth | 4-5x/year | More models will exceed thresholds |
| Hardware efficiency (FLOP/W) | 1.28x/year | Same compute costs less |
| Training cost growth | 2.4x/year | Frontier models now cost hundreds of millions USD |
| Capability improvement | ~15 points/year (2024) | Nearly doubled from ~8 points/year |

This creates a dual challenge: on one hand, if frontier compute growth continues, today’s thresholds will capture far more models than regulators are resourced to oversee. On the other hand, if algorithmic efficiency improves faster than expected, equivalent capabilities could be achieved with 10-100x less compute, allowing dangerous models to evade oversight entirely. The GovAI research on training compute thresholds explicitly notes that “training compute is an imperfect proxy for risk” and should be used to “detect potentially risky GPAI models that warrant regulatory oversight” rather than as a standalone regulatory mechanism.
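A rough projection makes the time pressure concrete: if the ~2x per year algorithmic-efficiency figure cited earlier holds, the compute needed to match a fixed capability level halves each year, so a capability sitting exactly at the threshold today drops below it within a few years. The numbers below are an illustrative extrapolation, not a forecast.

```python
# Illustrative extrapolation: compute needed to match a fixed capability level,
# assuming algorithmic efficiency improves ~2x per year (a rough trend, not a forecast).

EFFICIENCY_GAIN_PER_YEAR = 2.0
EU_THRESHOLD_FLOP = 1e25

compute_needed = 1e25  # capability level that takes ~10^25 FLOP to reach in 2023
for year in range(2023, 2029):
    side = "at/above" if compute_needed >= EU_THRESHOLD_FLOP else "below"
    print(f"{year}: ~{compute_needed:.2e} FLOP to match ({side} the 10^25 threshold)")
    compute_needed /= EFFICIENCY_GAIN_PER_YEAR
```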

The problem is compounded by the uneven nature of efficiency improvements, which vary significantly across model architectures and training paradigms. Language models, multimodal systems, and specialized scientific models each follow different efficiency trajectories, making it difficult to set universal thresholds that remain relevant across domains. The EU AI Act acknowledges this by including Article 51(3) provisions for the Commission to “amend the thresholds… in light of evolving technological developments, such as algorithmic improvements or increased hardware efficiency.”

Sophisticated actors have multiple strategies for evading compute thresholds while achieving equivalent model performance. The following table summarizes key evasion vectors identified in governance research:

| Evasion Strategy | Mechanism | Difficulty | Potential Countermeasure |
|---|---|---|---|
| Training run splitting | Multiple sub-threshold runs combined via fine-tuning or merging | Medium | Cumulative compute tracking across related runs |
| Model distillation | Train large teacher model privately, distill to smaller student | High | Teacher model reporting requirements |
| Jurisdictional arbitrage | Train in unregulated jurisdiction, deploy globally | Low | Deployment-based jurisdiction rules |
| Creative accounting | Exclude fine-tuning, inference, or multi-stage compute | Medium | Standardized compute definitions |
| Distributed training | Split training across jurisdictions/entities | Medium | Consolidated reporting requirements |
| Inference-time scaling | Use test-time compute instead of training compute | Low (emerging) | Include inference thresholds |

The distillation loophole is particularly concerning: as noted by governance researchers, “a company might use greater than 10^25 FLOPs to train a teacher model that is never marketed or used in the EU, then use that teacher model to train a smaller student model that is nearly as capable but trained using less than 10^25 FLOPs.” This allows regulatory evasion while achieving equivalent model performance.

International arbitrage allows organizations to conduct high-compute training in jurisdictions without established thresholds, then deploy globally. This creates competitive pressure for regulatory harmonization while potentially undermining the effectiveness of unilateral threshold implementations. The GovAI Know-Your-Customer proposal suggests that compute providers could help close these loopholes by identifying and reporting potentially problematic training runs.
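As a concrete illustration of the first countermeasure in the table, cumulative compute tracking would aggregate compute across runs attributed to the same model lineage before comparing against the threshold. The run records and lineage key below are hypothetical.

```python
from collections import defaultdict

# Sketch of cumulative compute tracking across related runs (the countermeasure listed
# above for training-run splitting). Run records and the lineage key are hypothetical.

runs = [
    {"lineage": "model-A", "stage": "pretrain shard 1", "flop": 6e24},
    {"lineage": "model-A", "stage": "pretrain shard 2", "flop": 6e24},
    {"lineage": "model-A", "stage": "fine-tune",        "flop": 5e23},
    {"lineage": "model-B", "stage": "pretrain",         "flop": 3e24},
]

EU_THRESHOLD_FLOP = 1e25

totals: dict[str, float] = defaultdict(float)
for run in runs:
    totals[run["lineage"]] += run["flop"]

for lineage, total in totals.items():
    # Each individual run above is below 10^25 FLOP, but the combined lineage may not be.
    flag = "exceeds" if total > EU_THRESHOLD_FLOP else "within"
    print(f"{lineage}: cumulative ~{total:.1e} FLOP ({flag} the 10^25 threshold)")
```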

Current threshold regimes rely primarily on self-reporting by AI developers, creating significant verification challenges. While major companies have generally complied in good faith with existing requirements, the absence of technical verification mechanisms creates enforcement vulnerabilities. Hardware-level monitoring could provide more reliable compute measurement, but raises significant privacy and trade secret concerns for AI developers.

Definitional ambiguities compound measurement challenges, particularly around edge cases like multi-stage training, transfer learning, and inference-time computation. The emergence of techniques like chain-of-thought reasoning and test-time training blur traditional boundaries between training and inference, potentially creating new categories of compute that existing thresholds don’t address.

Cloud computing platforms could provide third-party verification of compute usage, but this would require standardized reporting mechanisms and could expose competitively sensitive information about training methodologies and resource allocation strategies.

A particularly significant emerging challenge is the shift from training-time to inference-time compute scaling. As GovAI research on inference scaling warns, “the shift from scaling up pre-training compute to inference compute may have profound effects on AI governance. Rapid scaling of inference-at-deployment could potentially undermine AI governance measures that rely on training-compute thresholds.”

Models like OpenAI’s o1 demonstrate that substantial capability improvements can come from inference-time computation rather than training compute. This creates a fundamental gap in current threshold regimes:

| Compute Type | Current Coverage | Governance Challenge |
|---|---|---|
| Training compute | Covered by EU/US thresholds | Well-defined, measurable |
| Fine-tuning compute | Ambiguous coverage | May be excluded from calculations |
| Inference compute (deployment) | Not covered | Grows with usage, hard to predict |
| Test-time training | Not covered | Blurs training/inference boundary |

As inference-time scaling becomes more prevalent, a model trained with below-threshold compute could achieve above-threshold capabilities through extensive inference-time computation, completely evading current regulatory frameworks.
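A back-of-envelope comparison shows why this matters. Using the common approximation of roughly 2 FLOP per parameter per generated token for dense-model inference, a heavily used deployment can accumulate compute on the same order as training. All figures below are illustrative assumptions, not measurements of any real system.

```python
# Back-of-envelope comparison of training vs. lifetime inference compute, using the rough
# ~2 * N FLOP per generated token approximation for a dense model with N parameters.
# All figures are illustrative assumptions, not measurements of any deployed system.

params = 70e9                  # 70B-parameter model (assumption)
training_tokens = 10e12        # 10T training tokens (assumption)
training_flop = 6 * params * training_tokens

tokens_per_query = 2_000       # long chain-of-thought style responses (assumption)
queries_per_day = 50_000_000   # assumed usage volume
days = 365

inference_flop = 2 * params * tokens_per_query * queries_per_day * days

print(f"training:  ~{training_flop:.1e} FLOP")
print(f"inference: ~{inference_flop:.1e} FLOP over one year of deployment")
print(f"ratio:     ~{inference_flop / training_flop:.1f}x")
```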

Compute thresholds provide several valuable safety benefits despite their limitations. They create predictable regulatory entry points that allow companies to plan safety investments and compliance strategies in advance, rather than reacting to post-deployment requirements. The transparency requirements triggered by thresholds generate valuable information about frontier AI development that enables better risk assessment and policy development.

Threshold systems also establish precedents for AI-specific regulation that can evolve toward more sophisticated approaches over time. They provide regulatory agencies with initial experience governing AI development while building institutional capacity for more complex oversight mechanisms. The international coordination emerging around threshold harmonization creates foundations for broader AI governance cooperation.

From an industry perspective, thresholds provide regulatory certainty that enables long-term investment in safety infrastructure while creating level playing fields where all frontier developers face similar requirements.

However, compute thresholds exhibit significant safety limitations that could create false confidence in regulatory coverage. They may miss dangerous capabilities that emerge at lower compute levels, particularly in specialized domains like biotechnology or cybersecurity where domain-specific training data matters more than raw computational scale.

The static nature of current thresholds creates growing blind spots as algorithmic efficiency improves, potentially allowing increasingly capable systems to evade oversight. Threshold evasion strategies could enable bad actors to develop dangerous capabilities while avoiding regulatory scrutiny, particularly if enforcement mechanisms remain weak.

Perhaps most concerning, compute thresholds may distract from more direct capability-based assessments that could provide better safety coverage. The focus on computational inputs rather than capability outputs could lead to regulatory frameworks that miss the most important risk factors while imposing compliance burdens on relatively safe high-compute applications.

The immediate future will see operationalization of existing threshold regimes, with EU AI Act requirements becoming fully effective in August 2025 and US Executive Order provisions being codified into formal regulations. This period will provide crucial empirical data about threshold effectiveness, compliance costs, and gaming strategies that will inform future policy development.

According to GovAI forecasts on frontier model counts, the number of models exceeding absolute compute thresholds will increase superlinearly, while thresholds defined relative to the largest training run see a more stable trend of 14-16 models captured annually from 2025-2028. This suggests static absolute thresholds like the current EU and US implementations will capture an increasing number of models over time, potentially requiring significant regulatory scaling.

| Year | Models Exceeding 10^25 FLOP (Estimate) | Models Exceeding Relative Threshold | Regulatory Implication |
|---|---|---|---|
| 2024 | 5-10 | 14-16 | Current capacity adequate |
| 2025 | 15-25 | 14-16 | EU compliance begins |
| 2026 | 30-50 | 14-16 | May need threshold adjustment |
| 2027 | 60-100 | 14-16 | Scaling challenges |
| 2028 | 100-200 | 14-16 | Potential capacity crisis |

International harmonization discussions are likely to intensify as the compliance burden of divergent threshold regimes becomes apparent to global AI developers. The OECD and UK AI Safety Institute collaborative session on thresholds at the February 2025 AI Action Summit exemplifies growing international coordination efforts. Technical standards development will accelerate, particularly around compute measurement methodologies and verification mechanisms.

The medium-term trajectory will likely see significant evolution away from purely static thresholds toward more sophisticated triggering mechanisms. Algorithmic efficiency improvements will force either frequent threshold updates or adoption of alternative approaches that maintain regulatory relevance despite efficiency gains.

Capability-based triggers are expected to emerge as a complement to or replacement for compute thresholds, using standardized benchmark evaluations to determine regulatory requirements based on demonstrated abilities rather than resource consumption. GovAI research on risk thresholds recommends that “companies define risk thresholds to provide a principled foundation for their decision-making, use these to help set capability thresholds, and then primarily rely on capability thresholds.”

| Threshold Type | Advantages | Disadvantages | Best Use Case |
|---|---|---|---|
| Compute-based (absolute) | Simple, measurable, predictable | Becomes obsolete with efficiency gains | Initial screening, pre-training notification |
| Compute-based (relative) | Adapts to frontier advances | Requires ongoing calibration | Capturing only true frontier models |
| Capability-based | Directly measures risk-relevant properties | Hard to evaluate, may miss novel capabilities | Post-training safety assessment |
| Risk-based | Most principled approach | Most difficult to evaluate reliably | Strategic decision frameworks |
| Hybrid | Balances predictability with relevance | Complex to implement | Long-term regulatory evolution |

International regime development will likely produce multilateral frameworks for threshold coordination, potentially through new international organizations or expanded mandates for existing bodies like the OECD or UN. These frameworks will need to address both threshold harmonization and enforcement cooperation to be effective.

The long-term future of compute thresholds depends critically on the pace of algorithmic efficiency improvements and the development of alternative governance mechanisms. If efficiency gains continue at current rates, compute-based triggers may become obsolete entirely, requiring wholesale transition to capability-based or other approaches.

Alternatively, threshold evolution could incorporate dynamic adjustment mechanisms that automatically update based on efficiency benchmarks or capability correlations, maintaining relevance despite technological change. This would require sophisticated measurement systems and potentially automated regulatory frameworks.
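In principle, such an indexing rule could be very simple: lower the FLOP trigger in proportion to a measured algorithmic-efficiency index so that the capability level it captures stays roughly constant. The sketch below is a hypothetical illustration, not a mechanism proposed in either the EU or US regime.

```python
# Hypothetical dynamically indexed threshold: lower the FLOP trigger in proportion to a
# measured algorithmic-efficiency index so the captured capability level stays constant.

BASE_YEAR = 2024
BASE_THRESHOLD_FLOP = 1e25

def indexed_threshold(year: int, efficiency_gain_per_year: float = 2.0) -> float:
    """Threshold adjusted downward as less compute is needed for the same capability."""
    return BASE_THRESHOLD_FLOP / (efficiency_gain_per_year ** (year - BASE_YEAR))

for year in (2024, 2026, 2028):
    print(f"{year}: trigger at ~{indexed_threshold(year):.1e} FLOP")
```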

The emergence of novel AI architectures like neuromorphic computing or quantum-classical hybrid systems could fundamentally alter the compute-capability relationship, potentially making current FLOP-based measurements irrelevant and requiring entirely new regulatory metrics.

Several critical uncertainties will determine the future effectiveness of compute threshold approaches. The pace and trajectory of algorithmic efficiency improvements remains unpredictable, with potential for breakthrough innovations that dramatically decouple compute from capabilities. Current trend extrapolation suggests 2x annual improvements, but this could accelerate or plateau depending on fundamental algorithmic advances.

The correlation between compute and dangerous capabilities is empirically understudied, particularly for specialized risks like bioweapons development or deceptive alignment. Better understanding these relationships is crucial for calibrating threshold levels and determining when capability-based triggers might be more appropriate.

Enforcement mechanisms remain largely theoretical, with limited real-world testing of verification systems or consequences for non-compliance. The willingness and ability of regulatory agencies to detect and respond to threshold evasion will ultimately determine system effectiveness.

International coordination dynamics are highly uncertain, particularly regarding participation by major AI powers like China and cooperation between democratic and authoritarian governance systems. The success of threshold regimes may depend critically on achieving sufficient global coverage to prevent regulatory arbitrage.

The development of standardized capability evaluation systems presents both technical and political challenges that could determine whether hybrid threshold-capability approaches become feasible. Progress on evaluation methodology, benchmark development, and international standards will shape the evolution of regulatory frameworks beyond pure compute triggers.


The following research organizations have produced foundational work on compute threshold governance:

| Organization | Key Contribution | Focus Area |
|---|---|---|
| GovAI | Training Compute Thresholds, Inference Scaling Governance, Risk Thresholds | Threshold design, alternative approaches |
| CSET Georgetown | AI Governance at the Frontier, preparedness frameworks | Policy implementation, US context |
| Epoch AI | Compute trends, training cost analysis | Empirical compute data, forecasting |
| UK AI Security Institute | Frontier AI Trends Report, capability evaluations | Empirical capability assessment |
| OECD | Thresholds for Frontier AI sessions | International coordination, standards |



Compute thresholds improve the AI Transition Model through Civilizational Competence:

| Factor | Parameter | Impact |
|---|---|---|
| Civilizational Competence | Regulatory Capacity | Objective triggers enable automated enforcement of safety requirements |
| Civilizational Competence | Institutional Quality | Clear thresholds reduce regulatory discretion and political capture |

Threshold effectiveness depends on keeping pace with algorithmic efficiency improvements; static thresholds become obsolete within 3-5 years.