
Values Lock-in: Research Report

| Finding | Key Data | Implication |
| --- | --- | --- |
| RLHF reduces value pluralism | Standard alignment procedures reduce distributional pluralism by 30-40% | Current alignment methods may inadvertently lock in narrow values |
| Algorithmic feedback loops entrench beliefs | LLM-human feedback loops create echo chambers that reduce diversity | Risk of “preference collapse” where minority values are disregarded |
| Cultural bias in AI systems | LLMs reflect English-speaking, Protestant European values disproportionately | Western values encoded as universal defaults |
| Authoritarian AI surveillance | AI surveillance deployed in 20+ countries for dissent suppression | Values can be enforced through technological control |
| Moral stagnation risk | No mechanism for updating values encoded in long-lived AI systems | Humanity could be locked into 2020s ethics indefinitely |
| Value specification challenge | Multiple conflicting ethical frameworks (utilitarian, deontological, virtue ethics) | AI systems must choose which values to align with |

Values lock-in occurs when AI systems permanently entrench particular values, beliefs, or ethical frameworks, making future moral progress difficult or impossible. This represents a critical failure mode because AI development involves numerous value-laden choices—from training data selection to reinforcement learning objectives—that become embedded in systems designed to persist for decades or centuries.

Research identifies three primary mechanisms driving values lock-in. First, Reinforcement Learning from Human Feedback (RLHF) inherits biases from human annotators and exhibits algorithmic bias that can lead to “preference collapse,” where minority perspectives are systematically disregarded. Studies show standard alignment procedures reduce distributional pluralism by 30-40%, and supervised fine-tuning before RLHF can calcify model biases. Second, AI-human feedback loops create echo chambers: models learn human beliefs from data, reinforce these beliefs with generated content, reabsorb the reinforced beliefs, and feed them back to users, leading to loss of diversity and potential lock-in of false beliefs. Third, AI surveillance enables authoritarian regimes to enforce values through technological control; facial recognition and predictive policing systems deployed in 20+ countries enable real-time monitoring that makes dissent nearly impossible.

The cultural dimension is particularly concerning. Large language models disproportionately reflect values from English-speaking and Protestant European countries, creating a Western bias encoded as universal. With 67% of companies planning to increase AI investments over the next three years, these value-laden systems are becoming embedded in critical infrastructure. The fundamental challenge is moral uncertainty: no consensus exists on which values should guide AI systems, yet the technical necessity of specifying objective functions forces premature resolution of unresolved philosophical questions. Without mechanisms for updating values as moral understanding improves, humanity risks locking into 2020s ethics—foreclosing the moral circle expansion that has characterized human history. The window for developing pluralistic alignment techniques narrows as deployed AI systems accumulate and create path dependencies.


Throughout human history, values have evolved. Slavery was once accepted across cultures; now it is universally condemned. Democracy and human rights emerged gradually over centuries. The moral circle expanded from family to tribe to nation to potentially all sentient beings. This capacity for moral progress has been one of humanity’s most important features.

The challenge emerges from several sources. First, AI systems require explicit objective functions—engineers must specify what “good” means. Second, alignment techniques like RLHF rely on human feedback, which reflects current cultural biases and power structures. Third, once deployed at scale, AI systems create path dependencies: changing values requires replacing infrastructure, retraining models, and overcoming network effects.

Research from arXiv notes that “strong AI imbued with particular values may determine the values propagated into the future, and some argue that exponentially increasing compute and data barriers make AI a centralizing force, with the most powerful AI systems potentially being designed by and available to fewer stakeholders over time.”


AI systems cannot remain value-neutral; they must be aligned with some set of values. This necessity creates the fundamental challenge: whose values, and how do we ensure they remain appropriate as society evolves?

Stanford HAI research found that “when a team of Stanford researchers applied cultural psychology theory to study what people want from AI, they found clear associations between the cultural models of agency that are common in cultural contexts and the type of AI that is considered ideal.” This suggests that AI alignment reflects culturally specific preferences rather than universal values.

The problem is compounded by moral uncertainty. As the AI Safety textbook explains, “Moral uncertainty refers to not knowing which moral beliefs are correct. It matters because different views can conflict; without resolving these conflicts, we cannot know how to act morally.”

Reinforcement Learning from Human Feedback (RLHF) has become the dominant technique for aligning large language models. However, research reveals systematic biases that could entrench narrow values:

| Bias Type | Mechanism | Evidence | Consequence |
| --- | --- | --- | --- |
| Human Feedback Bias | Annotators’ systematic biases transfer to models | Demographics, culture, ideology of annotators | Models reflect annotator values, not population distribution |
| Algorithmic Bias | KL-regularization favors reference model | Preference collapse: minority preferences disregarded | 30-40% reduction in distributional pluralism |
| Preference Collapse | Standard RLHF washes out value conflicts | Statistical learners fit to averages by default | Irreducible value tensions eliminated |
| SFT Calcification | Supervised fine-tuning before RLHF | SFT can “calcify model biases” | Early biases become harder to correct |

Research on RLHF algorithmic bias found that “accurately aligning large language models (LLMs) with human preferences is crucial for fair decision-making processes. However, the predominant approach for aligning LLMs through RLHF suffers from an inherent algorithmic bias due to its Kullback-Leibler-based regularization.”

The consequence is that “in extreme cases, this bias could lead to a phenomenon called ‘preference collapse,’ where minority preferences are virtually disregarded.”
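For context, the regularization this critique targets is the standard KL-anchored RLHF objective, written here in its common textbook form rather than the cited paper’s notation:

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right]
\;-\; \beta\, \mathrm{D}_{\mathrm{KL}}\!\left( \pi_\theta(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \right)
```

Because the reward model r is fit to aggregated preference comparisons and the KL term anchors the policy to a single reference model, preferences held by only a minority of annotators exert little pull on the optimum, which is the lever the preference-collapse argument points to.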

Algorithmic Feedback Loops and Echo Chambers


ArXiv research on “The Lock-in Hypothesis” documents a concerning dynamic: “The training and deployment of large language models create a feedback loop with human users: models learn human beliefs from data, reinforce these beliefs with generated content, reabsorb the reinforced beliefs, and feed them back to users. This dynamic resembles an echo chamber.”

The paper defines lock-in as “a state where a set of ideas, values, or beliefs achieves a dominant and persistent position,” where the diversity of alternative beliefs diminishes until they are marginalized or vanish entirely.

| Stage | Mechanism | Effect |
| --- | --- | --- |
| 1. Initial Learning | Models trained on existing human text | Absorb current distribution of beliefs |
| 2. Content Generation | Models generate text reflecting learned beliefs | Users exposed to content aligned with current beliefs |
| 3. Reinforcement | Generated content influences human beliefs | Existing beliefs strengthened, alternatives weaken |
| 4. Data Reabsorption | New human text (influenced by AI) becomes training data | Feedback loop strengthens original beliefs |
| 5. Lock-in | Diversity eliminated, alternatives vanish | Values become permanent |

This feedback dynamic is particularly dangerous because it appears beneficial at each step. Users receive content aligned with their preferences; models improve by learning from user responses. The catastrophic outcome—loss of value diversity—emerges from accumulation of individually rational decisions.
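A minimal toy simulation makes the dynamic concrete. This is an illustrative sketch, not the cited paper’s model; the number of belief clusters, the mixing rate, and the sharpening exponent are assumptions chosen only to show the qualitative effect:

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a belief distribution (used here as a diversity measure)."""
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Illustrative assumptions: five competing belief clusters in the population,
# one slightly more common than the rest.
population = np.array([0.24, 0.19, 0.19, 0.19, 0.19])
model = population.copy()   # model initially trained on population text

ALPHA = 0.3    # how strongly model output shifts human beliefs per cycle
SHARPEN = 2.0  # model over-represents already-dominant beliefs

for cycle in range(30):
    # Stages 1-2: model learns from human text, then generates content that
    # amplifies the majority view.
    generated = model ** SHARPEN
    generated /= generated.sum()
    # Stage 3: generated content nudges human beliefs toward what the model emits.
    population = (1 - ALPHA) * population + ALPHA * generated
    # Stage 4: new, AI-influenced human text becomes the next round of training data.
    model = population.copy()
    if cycle % 10 == 0:
        print(f"cycle {cycle:2d}  diversity (entropy) = {entropy(population):.3f}")

# Stage 5: diversity collapses toward a single dominant belief cluster.
print("final distribution:", np.round(population, 3))
```

Running the loop shows entropy shrinking each cycle even though every individual step (learn from users, give users what they prefer) looks locally reasonable, which is the point made above.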

Large language models exhibit systematic cultural bias favoring Western, English-speaking, Protestant European values:

Cornell University research found that “cultural values and traditions differ across the globe, but large language models (LLMs), used in text-generating programs such as ChatGPT, have a tendency to reflect values from English-speaking and Protestant European countries.”

AI & Society research on value pluralism in ChatGPT found that “an LLM aligned exclusively with one ethical framework risks marginalizing or perpetuating biases against other perspectives, because the benchmarks themselves may not adequately account for ethical pluralism.”

The 2025 UNESCO report on AI and Culture “emphasizes inclusivity for Indigenous communities” and “specifically warns against AI systems that perpetuate Western biases.”

| Cultural Dimension | LLM Default | Implication |
| --- | --- | --- |
| Language | English-dominant training data | Non-English cultures underrepresented |
| Religion | Protestant European values | Non-Christian worldviews marginalized |
| Governance | Democratic liberalism | Alternative political philosophies excluded |
| Economics | Market capitalism | Communitarian or socialist values underweighted |
| Ethics | Individualist frameworks | Collectivist moral systems undervalued |

The challenge of representing multiple values fairly has become a central concern in AI alignment research:

NeurIPS research on pluralistic alignment identifies three approaches to operationalizing pluralism:

| Approach | Definition | Advantage | Limitation |
| --- | --- | --- | --- |
| Overton pluralistic | Present spectrum of reasonable responses | Exposes users to multiple views | Requires defining “reasonable” |
| Steerably pluralistic | Can steer to reflect certain perspectives | User control over values | Requires users to specify preferences |
| Distributionally pluralistic | Well-calibrated to population distribution | Represents actual belief distribution | May entrench majority values |

The research warns that “current alignment techniques may be fundamentally limited for pluralistic AI; indeed, empirical evidence suggests that standard alignment procedures might reduce distributional pluralism in models, motivating the need for further research on pluralistic alignment.”
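One way to make “distributional pluralism” measurable is to compare a model’s answer distribution on a value-laden question against a reference population distribution, for example from survey data. The sketch below is a hedged illustration: the answer options, the numbers, and the choice of Jensen-Shannon distance as the gap metric are assumptions, not drawn from the cited research:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Hypothetical distributions over answer options (agree / neutral / disagree)
# for one value-laden question: a population survey, a base model's sampled
# answers, and the same model after a standard alignment procedure.
population    = np.array([0.45, 0.20, 0.35])
base_model    = np.array([0.48, 0.22, 0.30])
aligned_model = np.array([0.80, 0.12, 0.08])  # collapsed toward the majority view

def pluralism_gap(model_dist, population_dist):
    """Jensen-Shannon distance between model and population answer distributions.
    0 means perfectly calibrated to the population; larger means less distributionally plural."""
    return float(jensenshannon(model_dist, population_dist))

print("base model gap:   ", round(pluralism_gap(base_model, population), 3))
print("aligned model gap:", round(pluralism_gap(aligned_model, population), 3))
```

A growing gap after alignment is one concrete signature of the pluralism reduction the research describes; in practice such a metric would need to be averaged over many questions and a demographically representative survey.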

The Value Kaleidoscope project introduced ValuePrism, “a large-scale dataset of 218k values, rights, and duties connected to 31k human-written situations,” recognizing that “value pluralism is the view that multiple correct values may be held in tension with one another (e.g., when considering lying to a friend to protect their feelings, how does one balance honesty with friendship?). As statistical learners, AI systems fit to averages by default, washing out these potentially irreducible value conflicts.”

AI surveillance technologies enable authoritarian regimes to enforce values through technological control, creating the infrastructure for permanent value entrenchment:

Lawfare analysis warns that “AI law enforcement tends to undermine democratic government, promote authoritarian drift, and entrench existing authoritarian regimes. AI-based systems can reduce structural checks on executive authority and concentrate power among fewer and fewer people.”

Global Deployment

| Country/Region | System | Capability | Purpose |
| --- | --- | --- | --- |
| China | Mass surveillance network | Real-time facial recognition | Monitor public gatherings, protests, daily activities |
| Egypt | Social media monitoring | Keyword/hashtag analysis | Predict and preemptively suppress protests |
| Bahrain | Spyware + AI monitoring | Activist targeting | Arrests and harsh penalties for dissent |
| 20+ countries | “Safe City” packages (Chinese export) | Comprehensive surveillance | Digital authoritarianism infrastructure |

Journal of Democracy analysis documents how “through mass surveillance, facial recognition, predictive policing, online harassment, and electoral manipulation, AI has become a potent tool for authoritarian control.”

The mechanism for values lock-in operates through behavioral modification:

“The continuous and pervasive surveillance by AI not only instills fear in citizens but also molds behavior. The insidious nature of this surveillance means that future generations, growing up under the watchful eyes of AI, might internalize self-censorship and conformity as the norm, becoming desensitized to constant scrutiny.”

China’s Digital Silk Road

Research on digital authoritarianism documents that “through its Digital Silk Road initiative, China has become an exporter of digital authoritarianism and a major digital infrastructure provider to developing and authoritarian states. Instances of digital authoritarianism can be observed in Bangladesh, Colombia, Ethiopia, Guatemala, the Philippines, and Thailand.”

This represents not just individual authoritarian regimes locking in their values, but the export of surveillance infrastructure that enables other regimes to do the same—a meta-level lock-in mechanism.

Even without malicious intent, AI systems may freeze moral progress by encoding current values into long-lived infrastructure:

Research on evolutionary ethics found that “there are core values around which most people would agree that are unlikely to change over long time periods. However, there are also secondary or derived values around which there is much more controversy and within which differences of view occur.”

The problem is distinguishing core from derived values. History shows that many values once considered “core” (divine right of kings, subordination of women, racial hierarchy) were eventually recognized as contingent. AI systems encoding 2025 values cannot distinguish which will prove enduring and which should evolve.

| Historical Example | Once “Core” Value | Current Status | Timeline |
| --- | --- | --- | --- |
| Slavery | Property rights in humans | Universally condemned | ~200 years |
| Women’s rights | Male authority over women | Gender equality (partial) | ~150 years |
| LGBTQ+ rights | Heteronormativity | Expanding recognition | ~50 years |
| Animal welfare | Human supremacy | Growing moral consideration | Ongoing |
| Future examples? | Values we consider “obvious” today | May be condemned by 2125 | Unknown |

ArXiv research on AI and moral enhancement lists concerns that “moral progress might be hindered, and that dependence on AI systems to perform moral reasoning would not only neglect the cultivation of moral excellence but actively undermine it, exposing people to risks of disengagement, of atrophy of human faculties, and of moral manipulation.”

Recent research on AGI development proposes “The Lock-In Phase Hypothesis: Identity Consolidation as a Precursor to AGI.” The paper notes that “large language models remain broadly open and highly steerable, accepting arbitrary system prompts and adopting multiple personae.”

However, “by analogy to human development, the authors hypothesize that progress toward AGI involves a lock-in phase: a transition from open imitation to identity consolidation, where goal structures, refusals, preferences, and internal representations become comparatively stable and resistant to external steering.”

The risk is that “sleeper-agent work shows deceptive backdoors can persist through safety training, underscoring the risk of locking in undesirable traits.”

This suggests that as AI systems become more capable, they may naturally undergo value consolidation—transitioning from malleable systems that can represent multiple perspectives to stable systems with fixed values. If this transition happens without deliberate design, the locked-in values will be whichever happened to be present during the consolidation phase.


The following factors influence values lock-in probability and severity. This analysis is designed to inform future cause-effect diagram creation.

High-confidence factors:

| Factor | Direction | Type | Evidence | Confidence |
| --- | --- | --- | --- | --- |
| RLHF Bias | ↑ Lock-in | cause | Reduces distributional pluralism 30-40%; preference collapse | High |
| Training Data Selection | ↑ Lock-in | leaf | Western, English-speaking bias; determines value distribution | High |
| Feedback Loop Dynamics | ↑ Lock-in | cause | AI output → human beliefs → AI training data → reinforcement | High |
| AI Surveillance Deployment | ↑ Lock-in | intermediate | 20+ countries; enables value enforcement through control | High |
| Moral Uncertainty | ↑ Lock-in | leaf | No consensus on correct values; forces premature resolution | High |
| Infrastructure Persistence | ↑ Lock-in | cause | Deployed systems create path dependencies; replacement costly | High |

Medium-confidence factors:

| Factor | Direction | Type | Evidence | Confidence |
| --- | --- | --- | --- | --- |
| Value Aggregation Method | ↑↓ Lock-in | intermediate | Utilitarian vs. Rawlsian aggregation encodes philosophical commitments | Medium |
| Cultural Prompting Availability | ↓ Lock-in | leaf | Can reduce bias if users aware and capable | Medium |
| Pluralistic Alignment Research | ↓ Lock-in | intermediate | Overton/Steerable/Distributional approaches under development | Medium |
| Identity Consolidation | ↑ Lock-in | cause | AGI progress may involve transition to stable value structures | Medium |
| Digital Silk Road | ↑ Lock-in | intermediate | China exports surveillance infrastructure to 20+ countries | Medium |
| Investment Concentration | ↑ Lock-in | leaf | 67% of companies increasing AI investment; values embedded at scale | Medium |

Low-confidence factors:

| Factor | Direction | Type | Evidence | Confidence |
| --- | --- | --- | --- | --- |
| ISO/IEC 42001 Adoption | ↓ Lock-in | leaf | AI management standards include value alignment; uptake unclear | Low |
| Multi-stakeholder Consultations | ↓ Lock-in | leaf | Proposed by WEF; implementation limited | Low |
| Philosophical Progress | ↓ Lock-in | leaf | Could resolve moral uncertainty; pace too slow | Low |
| Value-Sensitive Design | ↓ Lock-in | intermediate | Embeds ethical considerations in architecture; niche adoption | Low |

1. Pluralistic Alignment Methods

Research proposes moving beyond single-value alignment to pluralistic approaches:

  • Preference Matching RLHF: Replaces KL-regularization with preference matching to avoid preference collapse
  • Distributional Calibration: Ensures model outputs match the population value distribution
  • Steerable Systems: Allows users to specify which values to prioritize in different contexts (illustrated in the sketch below)
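As a purely hypothetical sketch of the steerable approach listed above, a deployment layer could let users declare weights over ethical frameworks and translate them into generation-time instructions. The class and function names here are invented for illustration and do not correspond to any published system:

```python
from dataclasses import dataclass, field

@dataclass
class ValueProfile:
    """User-declared priorities over ethical perspectives (illustrative only)."""
    frameworks: dict[str, float] = field(
        default_factory=lambda: {"utilitarian": 1.0, "deontological": 1.0, "virtue": 1.0}
    )

def build_steering_prompt(profile: ValueProfile) -> str:
    """Turn a value profile into a system instruction for a steerable model."""
    ranked = sorted(profile.frameworks.items(), key=lambda kv: -kv[1])
    weights = ", ".join(f"{name} (weight {w:.1f})" for name, w in ranked)
    return (
        "When a question involves conflicting values, present the tension explicitly "
        f"and weigh perspectives in this order: {weights}. "
        "Flag when the frameworks disagree rather than averaging them away."
    )

# Example: a user who wants duty-based reasoning prioritized.
profile = ValueProfile(frameworks={"deontological": 2.0, "utilitarian": 1.0, "virtue": 0.5})
print(build_steering_prompt(profile))
```

The design choice this illustrates is that value priorities become an explicit, user-visible input rather than an implicit property of the training pipeline.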

2. Value Updating Mechanisms

To prevent permanent lock-in, AI systems need mechanisms for updating values as moral understanding improves:

| Mechanism | Approach | Challenge |
| --- | --- | --- |
| Periodic Retraining | Retrain models on updated value distributions | Expensive; may introduce new biases |
| Dynamic Fine-tuning | Continuously update based on new feedback | Vulnerable to manipulation |
| Constitutional AI | Encode meta-values (e.g., “be open to moral progress”) | Difficult to specify without circularity |
| Human-in-the-loop | Require human oversight for value-critical decisions | Does not scale; humans may have biased judgment |

World Economic Forum guidance emphasizes that “value alignment involves continuously monitoring and updating AI systems to ensure they adapt to evolving societal norms and ethical standards.”

3. Cultural Sensitivity by Design

WEF research advocates for tailored approaches: “Rather than adopting a one-size-fits-all model, AI developers must consider the unique cultural, legal and societal contexts in which their AI systems operate.”

Example: In credit scoring, “fairness might mean different things depending on the cultural context—in some societies, creditworthiness is linked to community trust and social standing; in others, it is purely a function of individual financial behaviour.”

1. International Standards and Frameworks

| Standard | Scope | Status | Effectiveness |
| --- | --- | --- | --- |
| ISO/IEC 42001 | AI management systems | Published 2023 | Voluntary; limited adoption |
| NIST AI RMF | Risk management framework | Published 2023 | US-focused; no enforcement |
| UNESCO Recommendation | Global AI ethics principles | Adopted 2021 | Non-binding; variable implementation |
| EU AI Act | Comprehensive regulation | Enacted 2024 | Regional; extraterritorial effect unclear |

2. Surveillance Governance

Fourth Amendment research argues that “courts assessing whether networked camera or other sensor systems implicate the Fourth Amendment should account for the risks of unregulated, permeating surveillance by AI agents.”

However, legal frameworks struggle to keep pace with technological capabilities. Even in democracies, the infrastructure for authoritarian surveillance exists; only policy prevents its use—and policy can change.

3. Value Representation in Development

The World Economic Forum framework emphasizes that “on the technical side, tools such as ‘reinforcement learning from human feedback’ allow developers to integrate human values directly into AI systems. Meanwhile, value-sensitive design methods help engineers embed ethical considerations into the core architecture of AI systems from the outset.”

Critical challenges:

  • Representative sampling: RLHF annotators not demographically representative
  • Power imbalances: Dominant groups’ values overrepresented in training data
  • Economic incentives: Cheap annotation prioritized over representative sampling

| Intervention | Failure Mode | Probability |
| --- | --- | --- |
| Pluralistic alignment | Computational cost; reduced performance; hard to implement | Medium-High |
| Value updating mechanisms | Vulnerable to manipulation; expensive; may introduce new biases | High |
| International standards | Non-binding; variable adoption; lack enforcement | High |
| Surveillance regulation | Infrastructure already deployed; policy can be reversed | Medium-High |
| Representative development | Economic incentives favor speed over representation | High |

The fundamental challenge is temporal mismatch: Lock-in occurs gradually through accumulation of individually rational decisions, while intervention requires coordination and sacrifice of short-term advantages. By the time the problem is obvious, path dependencies make reversal prohibitively expensive.


| Question | Why It Matters | Current State |
| --- | --- | --- |
| Can pluralistic alignment scale to frontier models? | Need to know if technical solutions are feasible | Proof-of-concept only; unclear if scales |
| What values should guide AI in absence of consensus? | Core philosophical question | Deep disagreement persists |
| How fast do value-encoding path dependencies form? | Determines intervention window | Unknown; may already be late |
| Can deployed surveillance infrastructure be dismantled? | Determines reversibility of authoritarian lock-in | Historical precedent weak |
| Will moral philosophy converge on answers? | If yes, lock-in may resolve naturally | Pace too slow; 2,500 years without consensus |
| Do humans need to practice moral reasoning to maintain competence? | If yes, AI assistance may cause expertise atrophy | Evidence from other domains suggests yes |
| Can feedback loops be broken once established? | Determines if AI-human belief cycles are reversible | No empirical evidence yet |
| Will AGI undergo identity consolidation? | If yes, pluralistic alignment window may be narrow | Theoretical hypothesis; not tested |


| Model Element | Relationship to Values Lock-in |
| --- | --- |
| AI Capabilities (Algorithms) | More capable models have stronger influence on belief formation |
| AI Capabilities (Adoption) | Rapid adoption embeds value-laden systems before pluralism achieved |
| AI Ownership (Companies) | Small number of labs control value specification decisions |
| AI Ownership (Countries) | Western dominance creates cultural bias in value alignment |
| AI Uses (Governments) | Surveillance infrastructure enables authoritarian value enforcement |
| AI Uses (Coordination) | Feedback loops between users and models create echo chambers |
| Civilizational Competence (Epistemics) | Poor epistemic health makes value lock-in harder to detect |
| Civilizational Competence (Governance) | Weak governance allows deployment without value pluralism safeguards |
| Civilizational Competence (Adaptability) | Moral reasoning atrophy reduces capacity to recognize bad lock-in |
| Misalignment Potential (Technical AI Safety) | RLHF bias and preference collapse are technical failure modes |
| Long-term Lock-in (Political Power) | Authoritarian surveillance locks in both values and political control |
| Long-term Lock-in (Economic Power) | Concentrated AI ownership determines whose values get encoded |

  1. Values lock-in is symmetric: Unlike misalignment or misuse risks, values lock-in could preserve beneficial values or entrench harmful ones. The risk is not from encoding wrong values but from making any values permanent.

  2. Technical necessity forces premature resolution: AI systems require objective functions, forcing developers to resolve unresolved philosophical questions. This creates pressure toward lock-in regardless of intention.

  3. Multiple reinforcing mechanisms: RLHF bias, feedback loops, surveillance infrastructure, and investment concentration create mutually reinforcing dynamics that accelerate lock-in.

  4. Cultural bias is systemic: Western, English-speaking, Protestant European values disproportionately represented. This is not intentional discrimination but structural consequence of training data and developer demographics.

  5. Surveillance infrastructure enables enforcement: Authoritarian regimes use AI not just to monitor but to shape behavior, internalizing desired values through fear and conformity.

  6. Intervention window may be narrow: Research suggests standard alignment reduces pluralism by 30-40%. Each year of deployment creates path dependencies. The window for developing pluralistic alternatives narrows as systems accumulate.

  7. Moral uncertainty is fundamental: No consensus exists on which values are correct. Forcing premature resolution through technical necessity risks locking in whichever view happens to dominate AI development in the 2020s—foreclosing future moral progress.

The research suggests values lock-in should be considered a high-probability failure mode that receives insufficient attention because it emerges from individually rational decisions (users prefer aligned content; models improve by learning from users) that accumulate into catastrophic loss of value diversity and moral progress capacity.