Training Language Models to Follow Instructions with Human Feedback

The alignment gap

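Scaling up a language model improves next-token prediction, but it does not by itself make the model better at doing what users want. Large models still give unhelpful answers, hallucinate facts, and produce toxic completions. Ouyang et al. therefore treat capability and alignment with user intent as separate problems: a model can be highly capable and still misaligned.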

RLHF pipeline

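The method has three stages. First, a pretrained model is fine-tuned with supervised learning on labeler-written demonstrations of the desired behavior. Second, a reward model is trained on human comparisons: labelers rank several model outputs for the same prompt, and the reward model learns to score outputs so that its scores reproduce those rankings. Third, the fine-tuned policy is optimized with reinforcement learning, using proximal policy optimization (PPO), to maximize the reward model's score.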
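The second stage is the easiest to make concrete. The sketch below shows a pairwise ranking loss of the kind used to train a reward model from comparisons: for every pair of outputs where the labeler preferred one over the other, the loss is the negative log-sigmoid of the score difference, averaged over all pairs from one ranking. The function names and the plain-float representation of scores are assumptions made for illustration; a real reward model would produce these scores from a neural network and backpropagate through them.

```python
import math
from itertools import combinations
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def ranking_loss(scores_best_first: Sequence[float]) -> float:
    """Mean pairwise loss for one labeler ranking (illustrative helper).

    scores_best_first[i] is the reward model's scalar score for the output
    the labeler ranked i-th, with index 0 being the most preferred.
    """
    # combinations() keeps input order, so in each pair the first score
    # belongs to the output the labeler preferred.
    pairs = list(combinations(scores_best_first, 2))
    if not pairs:
        raise ValueError("need at least two ranked outputs")
    total = -sum(log_sigmoid(preferred - rejected) for preferred, rejected in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # Scores that agree with the labeler's ranking give a lower loss
    # than scores that reverse it.
    print(ranking_loss([2.0, 1.0, 0.0]))
    print(ranking_loss([0.0, 1.0, 2.0]))
```

When the reward model scores two outputs equally, the loss for that pair is log 2 (about 0.693); it falls toward zero as the preferred output's score pulls ahead and grows when the ordering is reversed. Averaging over all pairs from a single ranking, rather than treating each pair as an independent example, keeps the comparisons from one prompt together in one update.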

Why it mattered

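InstructGPT became the template for ChatGPT-era assistants. RLHF is now standard and also controversial: it encodes the preferences of a particular set of labelers, can hide failure modes behind a polished tone, and adds training complexity. The paper acknowledges the first point directly, noting that its models are aligned to the preferences of its labelers and researchers rather than to any broader notion of human values.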