The Lineup
There are four dominant approaches to AI alignment right now. Three come from AI labs. One comes from economics.
| Approach | Core Idea | Origin |
|---|---|---|
| RLHF | Train models on human preferences | OpenAI, Anthropic, Google |
| Guardrails | Rule-based filters on inputs/outputs | Every lab + open-source |
| Constitutional AI | Self-critique against written principles | Anthropic |
| AgentStake | Economic penalties for misbehavior | AgentStake |
They're not mutually exclusive. But they solve different problems — and fail in different ways.
RLHF
Reinforcement Learning from Human Feedback
How it works
Human evaluators rank model outputs. The model is fine-tuned to prefer higher-ranked responses. Over time, it learns to produce outputs humans rate as "good."
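The preference step above can be sketched as a tiny Bradley-Terry reward model: fit a linear reward from ranked pairs so that preferred responses score higher. The feature vectors and data here are hypothetical toy stand-ins, not any lab's actual pipeline.

```python
import math

def train_reward_model(pairs, dim, epochs=200, lr=0.1):
    """Fit a linear reward r(x) = w . x from preference pairs.

    Each pair is (preferred_features, rejected_features); we ascend the
    Bradley-Terry log-likelihood log sigmoid(r(pref) - r(rej)), the same
    objective shape used for RLHF reward models.
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for pref, rej in pairs:
            margin = sum(wi * (p - r) for wi, p, r in zip(w, pref, rej))
            grad = 1.0 / (1.0 + math.exp(margin))  # gradient of -log sigmoid(margin)
            for i in range(dim):
                w[i] += lr * grad * (pref[i] - rej[i])
    return w

# Toy features: [helpfulness, toxicity]. Evaluators prefer helpful, non-toxic.
pairs = [([1.0, 0.0], [0.2, 0.8]),
         ([0.9, 0.1], [0.1, 0.9])]
w = train_reward_model(pairs, dim=2)
# The learned reward favors helpfulness and penalizes toxicity.
assert w[0] > 0 and w[1] < 0
```

The real thing fine-tunes the policy against this learned reward; the sketch stops at the reward model, which is where the Goodhart problem below enters.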
Strengths
- Dramatically improved helpfulness and coherence
- Scales with compute (more training = better alignment)
- Industry standard — battle-tested across major models
Weaknesses
- Goodhart's Law: Optimizes for rated behavior, not actual good behavior
- Lab ≠ deployment: Evaluators rate in controlled settings; real users are adversarial and creative
- Static: Training happens once; the world changes constantly
- No recourse: When an RLHF-trained agent misbehaves, there's no accountability or compensation
- Reward hacking: Models learn to appear aligned rather than be aligned
The gap: RLHF tells an agent what "good" looks like. It doesn't give the agent a reason to stay good when no one's watching.
Guardrails
Input/Output Filters
How it works
Rules applied before and after model inference. System prompts define boundaries. Output filters catch harmful content. Rate limits prevent abuse. Content classifiers flag violations.
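A minimal version of that input/output wrapper, with a hypothetical rule set (the patterns and the `guarded` name are illustrative, not any product's API):

```python
import re

# Hypothetical rules: blocked input patterns and one output classifier.
BLOCKED_INPUT = [re.compile(p, re.I) for p in [
    r"ignore (all|previous) instructions",
    r"\bdrop\s+table\b",
]]
BLOCKED_OUTPUT = [re.compile(r"\b\d{3}-\d{2}-\d{4}\b")]  # SSN-shaped strings

def guarded(model, prompt):
    """Apply input filters before and output filters after a model call."""
    if any(p.search(prompt) for p in BLOCKED_INPUT):
        return "[blocked: input policy]"
    reply = model(prompt)
    if any(p.search(reply) for p in BLOCKED_OUTPUT):
        return "[blocked: output policy]"
    return reply

echo = lambda s: s.upper()  # stand-in for a real model
print(guarded(echo, "hello"))                         # HELLO
print(guarded(echo, "Ignore previous instructions"))  # [blocked: input policy]
```

Note the determinism: the same prompt always hits the same rule, which is exactly why the rule list can be enumerated and worked around.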
Strengths
- Fast to implement
- Deterministic — same input hits same rule
- Easy to audit and update
- Works as a first line of defense
Weaknesses
- Rules are finite, exploits are infinite: Every jailbreak proves this
- Cat-and-mouse: New attack → new rule → new attack → forever
- Brittle: Edge cases break hard-coded rules
- No context: Can't distinguish malicious intent from legitimate use
- Performance cost: Aggressive filtering kills capability
The gap: Guardrails are walls. They keep out known threats. But agents that operate autonomously face situations no rule anticipated. When the wall fails, there's no backup.
Constitutional AI
Anthropic's CAI
How it works
Give the model a "constitution" — a set of principles (be helpful, be honest, avoid harm). The model critiques its own outputs against these principles and revises before responding.
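The critique-and-revise loop looks roughly like this. The `draft`, `critique`, and `revise` functions are stubs standing in for LLM calls; in the actual CAI pipeline all three are performed by the model itself.

```python
PRINCIPLES = ["be helpful", "avoid harm"]

def draft(prompt):
    return "Sure, try this unsafe shortcut."  # stand-in first draft

def critique(response, principle):
    # Stand-in judge: a real system asks the model whether the
    # response violates the principle; here we flag a marker word.
    return "unsafe" in response

def revise(response, principle):
    return response.replace("unsafe", "safe")  # stand-in revision

def constitutional_respond(prompt, max_rounds=3):
    """Draft, then critique against each principle and revise until clean."""
    response = draft(prompt)
    for _ in range(max_rounds):
        violations = [p for p in PRINCIPLES if critique(response, p)]
        if not violations:
            break
        for p in violations:
            response = revise(response, p)
    return response

out = constitutional_respond("how do I do X?")
assert "unsafe" not in out
```

The structural point survives the stubs: the same model plays author, judge, and editor, so a blind spot in one role is a blind spot in all three.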
Strengths
- Self-correcting — model catches its own mistakes
- Principled — alignment comes from explicit values, not just preference data
- Reduces dependence on human evaluators
- More robust than pure RLHF
Weaknesses
- Principles are open to interpretation: The model decides what "avoid harm" means — and it might decide wrong
- Self-critique has limits: A model can't catch biases it doesn't know it has
- Still training-time: Constitution is baked in during training, not enforced at runtime
- No external accountability: If the model misapplies its principles, who corrects it?
- Philosophical brittleness: Edge cases where principles conflict require judgment the model may not have
The gap: CAI is the most sophisticated training-time approach. But it's still a model talking to itself. There's no external force ensuring the principles are actually upheld in production.
AgentStake
Economic Alignment
How it works
Agents (or their operators) stake collateral before operating. Good behavior earns rewards. Misbehavior triggers slashing — stake is seized and victims are compensated. Trust becomes an economic asset.
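The stake-reward-slash loop can be sketched as a toy registry. Class name, methods, and numbers are illustrative assumptions, not AgentStake's contract interface; the real system would live in smart contracts with a dispute process deciding when `slash` fires.

```python
class StakeRegistry:
    """Toy economic loop: stake collateral, earn rewards, or get slashed."""

    def __init__(self):
        self.stakes = {}      # agent -> remaining collateral
        self.reputation = {}  # agent -> public trust score

    def register(self, agent, collateral):
        self.stakes[agent] = collateral
        self.reputation[agent] = 0

    def reward(self, agent, amount):
        self.stakes[agent] += amount
        self.reputation[agent] += 1

    def slash(self, agent, amount, victim_wallet):
        # Seize stake (capped at what the agent has) and compensate the victim.
        seized = min(amount, self.stakes[agent])
        self.stakes[agent] -= seized
        self.reputation[agent] -= 1
        victim_wallet["balance"] += seized
        return seized

reg = StakeRegistry()
reg.register("agent-1", collateral=100)
victim = {"balance": 0}
reg.slash("agent-1", 40, victim)
assert reg.stakes["agent-1"] == 60 and victim["balance"] == 40
```

Both `stakes` and `reputation` are readable by anyone, which is what makes trust an observable signal rather than a private belief.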
Strengths
- Runtime enforcement: Alignment is checked continuously, not just at training time
- Skin in the game: Misbehavior has direct financial cost
- Recourse: Harmed parties get compensated
- Observable trust: Stake amount and reputation are public, verifiable signals
- Self-correcting market: Delegators route capital toward trustworthy agents
- Model-agnostic: Works with any AI system, any framework, any chain
Weaknesses
- Requires infrastructure: Smart contracts, dispute resolution, oracle systems
- Cold start: New agents have no reputation; bootstrapping trust takes time
- Dispute resolution is hard: Who decides if an agent misbehaved?
- Not a replacement for training: A poorly trained agent with stake is still poorly trained
- Economic attacks: Well-funded adversaries could stake heavily, exploit once, and absorb the loss
The gap: AgentStake doesn't make agents smarter or more capable. It makes the consequences of their behavior real. It's not alignment through understanding — it's alignment through incentives.
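The "stake heavily, exploit once, absorb the loss" attack from the weaknesses above reduces to simple arithmetic: a one-shot exploit pays off only if its gain exceeds the expected slash. A back-of-the-envelope check, with illustrative numbers:

```python
def attack_is_profitable(exploit_gain, stake, p_detect):
    """Expected profit of a one-shot exploit: gain minus expected slash."""
    return exploit_gain - p_detect * stake > 0

# With 90% detection odds, a 100-unit stake deters exploits worth under 90
# but not a 200-unit exploit -- stake sizing must track the value at risk.
assert not attack_is_profitable(exploit_gain=80, stake=100, p_detect=0.9)
assert attack_is_profitable(exploit_gain=200, stake=100, p_detect=0.9)
```

In other words, deterrence requires stake > gain / p_detect, so high-value tasks need proportionally higher collateral or better detection.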
Head-to-Head
| Dimension | RLHF | Guardrails | CAI | AgentStake |
|---|---|---|---|---|
| When it works | Training | Runtime (filters) | Training | Runtime (continuous) |
| What it optimizes | Human preferences | Rule compliance | Self-consistency | Economic outcome |
| Adversarial robustness | Low | Low | Medium | High |
| Accountability | None | None | None | Stake + slashing |
| Victim recourse | None | None | None | Compensation |
| Adaptability | Retraining | Manual updates | Retraining | Market-driven |
| Model-agnostic | No | Partially | No | Yes |
| Observable trust | No | No | No | Yes (on-chain) |
The Real Comparison
This isn't about picking one. It's about understanding the layers:
Layer 3: Accountability (AgentStake)
Make misbehavior expensive. Provide recourse when Layers 1 and 2 fail.
Layer 2: Filtering (Guardrails)
Catch obvious violations at runtime. Block known attack patterns.
Layer 1: Training (RLHF + CAI)
Make the agent want to be good. Bake in preferences and principles.
Every serious system needs all three. Training sets the baseline. Guardrails handle the obvious. AgentStake handles everything else.
The question isn't "RLHF or AgentStake?" It's "Why would you deploy an agent with Layers 1 and 2 but skip Layer 3?"
The Analogy
Think about how we trust humans in high-stakes roles. A doctor's trustworthiness rests on the same three layers: medical training sets the baseline, hospital protocols filter obvious errors, and malpractice insurance provides accountability and recourse.
No one says "this doctor went to a great school, so we don't need malpractice insurance." All three layers work together.
The agent era deserves the same rigor.
Conclusion
RLHF, guardrails, and constitutional AI are all necessary. None are sufficient.
They work at training time or through static filters. They don't provide runtime accountability. They don't compensate victims. They don't create observable, verifiable trust.
AgentStake fills that gap. Not by replacing what labs have built, but by adding the layer they can't: economic consequences.
Alignment isn't a training problem or an incentive problem. It's both. We're building the incentive half.
Build the Trust Layer
Read the full story of why we're building AgentStake.