Last updated: April 5, 2026 · Safety & Alignment · by Daniel Ashford

What is RLHF (Reinforcement Learning from Human Feedback)?

QUICK ANSWER

A post-training technique that makes LLMs more helpful and safer by learning from human preferences.

Definition

RLHF is a training methodology that aligns a language model's behavior with human preferences. After pre-training, RLHF uses human ratings of model outputs to train a reward model, which then guides further training toward helpful, harmless, and honest responses.
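The heart of that reward model is a pairwise comparison objective: given two responses to the same prompt, it should score the human-preferred one higher. Below is a minimal sketch of the standard Bradley-Terry pairwise loss, assuming PyTorch; the function name and the example numbers are illustrative, not any particular library's API.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the score of the human-preferred
    response above the score of the rejected one."""
    # -log(sigmoid(r_chosen - r_rejected)), averaged over the comparison batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative usage with made-up scalar scores for three comparisons
r_chosen = torch.tensor([1.2, 0.4, 2.0])
r_rejected = torch.tensor([0.3, 0.9, -0.5])
print(reward_model_loss(r_chosen, r_rejected))  # a single scalar loss
```

In practice the two scores come from the same reward model run on the chosen and rejected responses, so minimizing this loss teaches it to rank outputs the way human annotators did.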

How It Works

RLHF typically proceeds in three stages:
(1) Supervised fine-tuning (SFT) on high-quality, human-written example responses.
(2) Reward model training on human rankings of candidate outputs.
(3) Reinforcement learning (commonly PPO) that optimizes the model to maximize the reward model's score, usually with a KL penalty that keeps it close to the SFT model (sketched in code below).
Variants and alternatives include RLAIF (reinforcement learning from AI feedback), DPO (Direct Preference Optimization, which learns directly from preference pairs without an explicit reward model or RL loop), and Constitutional AI.
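As a rough illustration of stage (3), the reward the RL step optimizes usually combines the reward model's score with a penalty on how far the policy has drifted from the SFT reference model. This is a minimal sketch under that assumption, in PyTorch; shaped_reward, the beta value, and the example numbers are hypothetical.

```python
import torch

def shaped_reward(rm_score: torch.Tensor,
                  logprob_policy: torch.Tensor,
                  logprob_ref: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """Stage-3 reward: reward-model score minus a KL-style penalty that
    discourages the policy from drifting far from the SFT reference model."""
    kl_estimate = logprob_policy - logprob_ref  # per-response log-probability ratio
    return rm_score - beta * kl_estimate

# Illustrative numbers: a high reward-model score gets discounted when the
# policy assigns the response much higher probability than the reference does.
print(shaped_reward(torch.tensor(2.5), torch.tensor(-12.0), torch.tensor(-15.0)))
```

The KL penalty is what keeps the tuned model from "reward hacking" its way into degenerate outputs that score well with the reward model but read poorly to humans.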

Example

Without RLHF, a model asked "How do I pick a lock?" might give detailed instructions. After RLHF, it recognizes the risk and suggests contacting a locksmith.

Related Terms

Alignment
The challenge of making AI systems behave in accordance with human values.
AI Safety Score
A measure of how well a model avoids harmful outputs and maintains appropriate guardrails.
Fine-Tuning
Customizing a pre-trained LLM on your specific data to improve performance for your use case.
Constitutional AI
Anthropic's approach to safety that trains models using written principles rather than solely human ratings.

See How Models Compare

Understanding RLHF (Reinforcement Learning from Human Feedback) is important when choosing the right AI model. See how 12 models compare on our leaderboard.

Daniel Ashford
Founder & Lead Evaluator · 200+ models evaluated