Training Language Models to Follow Instructions with Human Feedback
Making language models bigger does not inherently make them better at following a user's intent.
¶
The alignment gap
¶
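Larger models predict text well but often produce outputs users do not want: unhelpful answers, hallucinations, toxic completions. Ouyang et al. treat this as a gap between capability and alignment with user intent: the next-token objective rewards plausible continuations, not following instructions helpfully and safely.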
¶
RLHF pipeline
¶
The method has three stages: supervised fine-tuning on labeler-written demonstrations, training a reward model on human comparisons of model outputs, and optimizing the policy with reinforcement learning (PPO) against that reward model. In the second stage, labelers rank several outputs for the same prompt, and the reward model learns to score preferred outputs higher.
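The reward-model and RL stages can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' code. It assumes a reward model that has already produced one scalar score per output, and it shows two ideas: a pairwise ranking loss that shrinks when preferred outputs score higher, and a reward with a KL-style penalty that discourages the policy from drifting away from the supervised model. The function names and the `beta` value are hypothetical.

```python
import math
from itertools import combinations


def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Return -log(sigmoid(r_preferred - r_rejected)), computed stably."""
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(diff)) equals softplus(-diff) = log(1 + exp(-diff)).
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def ranking_loss(rewards_best_to_worst: list[float]) -> float:
    """Average the pairwise loss over every pair implied by one ranking.

    The list holds reward-model scores ordered as the labeler ranked the
    outputs, best first, so the earlier item in each pair is the preferred one.
    """
    pairs = list(combinations(rewards_best_to_worst, 2))
    if not pairs:
        raise ValueError("need at least two ranked outputs")
    return sum(pairwise_loss(w, l) for w, l in pairs) / len(pairs)


def kl_shaped_reward(reward: float, logp_policy: float, logp_sft: float,
                     beta: float = 0.1) -> float:
    """Subtract a penalty for diverging from the supervised (SFT) model.

    beta is an arbitrary illustrative coefficient, not a value from the paper.
    """
    return reward - beta * (logp_policy - logp_sft)


if __name__ == "__main__":
    # Scores that agree with the labeler's ranking give a lower loss
    # than scores that reverse it.
    print(ranking_loss([2.0, 1.0, 0.0]))
    print(ranking_loss([0.0, 1.0, 2.0]))
```

In a real pipeline the scores would come from a neural reward model and the loss would be minimized by gradient descent; the shaped reward would then be the signal PPO maximizes.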
¶
Why it mattered
¶
InstructGPT became the template for ChatGPT-era assistants. RLHF is now standard, and also controversial: it encodes labeler preferences, hides failure modes behind polished tone, and adds training complexity. The paper states these tradeoffs openly rather than presenting RLHF as a complete solution.