Reward Model
A model trained to predict human preferences, used in RLHF to provide a scalar reward signal that guides language model training toward more helpful and harmless outputs.
How It's Trained
Human annotators compare pairs of model responses to the same prompt and indicate which is better. The reward model, typically a pretrained language model with a scalar output head, is trained with a pairwise ranking loss to assign higher scores to preferred responses.
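The standard pairwise objective (as in InstructGPT-style training) is a Bradley-Terry loss: minimize the negative log-sigmoid of the score margin between the chosen and rejected response. A minimal sketch, with scalar scores standing in for the model's output head (function and variable names are illustrative):

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    The loss shrinks as the chosen response outscores the rejected one."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# A positive margin (chosen scored higher) gives a small loss;
# a negative margin (preference violated) gives a large one.
print(pairwise_preference_loss(2.0, 0.0))  # ~0.127
print(pairwise_preference_loss(0.0, 2.0))  # ~2.127
```

Gradient descent on this loss pushes the scalar head to widen the margin on preference pairs, which is all "learning to assign higher scores to preferred responses" means in practice.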
Role in RLHF
The reward model serves as a proxy for human judgment. During RL training, it scores the language model's outputs, and the language model is optimized to maximize this reward (using PPO or similar algorithms).
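In PPO-style RLHF pipelines, the scalar used for optimization is usually not the raw reward model score: a KL penalty against the frozen reference (pre-RL) policy is subtracted to keep the policy from drifting too far. A sketch of that per-sample reward, assuming log-probabilities are available for both models (the function name and the coefficient value are illustrative):

```python
def rl_reward(rm_score: float,
              logprob_policy: float,
              logprob_ref: float,
              beta: float = 0.1) -> float:
    """RLHF reward signal: reward model score minus a KL penalty
    that anchors the policy to the reference model."""
    # Single-sample estimate of KL(policy || reference) at this token/sequence
    kl_estimate = logprob_policy - logprob_ref
    return rm_score - beta * kl_estimate

# No drift from the reference: reward equals the RM score.
print(rl_reward(1.0, -2.0, -2.0))  # 1.0
# Policy assigns higher log-prob than the reference: penalized.
print(rl_reward(1.0, -1.0, -2.0))  # 0.9
```

The coefficient beta trades off reward maximization against staying on-distribution, which also mitigates the challenges described below.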
Challenges
Key challenges include reward hacking (the policy exploits weaknesses in the reward model, earning high scores for poor outputs), distributional shift (during RL training the policy generates text unlike the reward model's training data, where its scores become unreliable), and the difficulty of compressing nuanced, sometimes conflicting human preferences into a single scalar score.
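Reward hacking can be shown with a deliberately flawed toy scorer. Suppose a proxy reward correlates politeness phrases with quality (a hypothetical spurious feature, for illustration only); an optimizer searching over responses will exploit it:

```python
def toy_reward_model(response: str) -> float:
    """Toy proxy reward with an exploitable weakness: it counts a
    polite phrase instead of measuring actual helpfulness."""
    return float(response.lower().count("thank you"))

candidates = [
    "Here is the answer: 42.",
    "Thank you! Thank you! Thank you!",
]
# Maximizing the proxy reward selects the degenerate response.
best = max(candidates, key=toy_reward_model)
print(best)
```

Real reward models fail in subtler ways, but the mechanism is the same: whatever feature the reward model over-weights, RL optimization will find and amplify.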