Reward Model
A model trained to predict human preferences, used in RLHF to provide a scalar reward signal that guides language model training toward more helpful and harmless outputs.
How It's Trained
Human annotators compare pairs of model responses to the same prompt and indicate which is better. The reward model, typically a pretrained language model with a scalar output head, is trained with a pairwise ranking loss to assign higher scores to preferred responses.
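The standard pairwise objective (as in InstructGPT-style training) is a Bradley-Terry loss: minimize the negative log-sigmoid of the score margin between the chosen and rejected response. A minimal sketch, with scalar scores standing in for the model's output head (function and variable names are illustrative):

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).
    The loss shrinks as the chosen response outscores the rejected one."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))

# A positive margin (chosen scored higher) gives a small loss;
# a negative margin (preference violated) gives a large one.
print(pairwise_preference_loss(2.0, 0.0))  # ~0.127
print(pairwise_preference_loss(0.0, 2.0))  # ~2.127
```

Gradient descent on this loss pushes the scalar head to widen the margin on preference pairs, which is all "learning to assign higher scores to preferred responses" means in practice.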
Role in RLHF
The reward model serves as a proxy for human judgment. During RL training, it scores the language model's outputs, and the language model is optimized to maximize this reward (using PPO or similar algorithms).
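In PPO-style RLHF pipelines, the scalar used for optimization is usually not the raw reward model score: a KL penalty against the frozen reference (pre-RL) policy is subtracted to keep the policy from drifting too far. A sketch of that per-sample reward, assuming log-probabilities are available for both models (the function name and the coefficient value are illustrative):

```python
def rl_reward(rm_score: float,
              logprob_policy: float,
              logprob_ref: float,
              beta: float = 0.1) -> float:
    """RLHF reward signal: reward model score minus a KL penalty
    that anchors the policy to the reference model."""
    # Single-sample estimate of KL(policy || reference) at this token/sequence
    kl_estimate = logprob_policy - logprob_ref
    return rm_score - beta * kl_estimate

# No drift from the reference: reward equals the RM score.
print(rl_reward(1.0, -2.0, -2.0))  # 1.0
# Policy assigns higher log-prob than the reference: penalized.
print(rl_reward(1.0, -1.0, -2.0))  # 0.9
```

The coefficient beta trades off reward maximization against staying on-distribution, which also mitigates the challenges described below.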
Challenges
Key challenges include reward hacking (the policy exploits weaknesses in the reward model, earning high scores for poor outputs), distributional shift (during RL training the policy generates text unlike the reward model's training data, where its scores become unreliable), and the difficulty of compressing nuanced, sometimes conflicting human preferences into a single scalar score.
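Reward hacking can be shown with a deliberately flawed toy scorer. Suppose a proxy reward correlates politeness phrases with quality (a hypothetical spurious feature, for illustration only); an optimizer searching over responses will exploit it:

```python
def toy_reward_model(response: str) -> float:
    """Toy proxy reward with an exploitable weakness: it counts a
    polite phrase instead of measuring actual helpfulness."""
    return float(response.lower().count("thank you"))

candidates = [
    "Here is the answer: 42.",
    "Thank you! Thank you! Thank you!",
]
# Maximizing the proxy reward selects the degenerate response.
best = max(candidates, key=toy_reward_model)
print(best)
```

Real reward models fail in subtler ways, but the mechanism is the same: whatever feature the reward model over-weights, RL optimization will find and amplify.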