Constitutional AI (Method)
An alignment technique in which AI systems critique and revise their own outputs using a set of written principles (a constitution) as guidance.
Overview
Constitutional AI (CAI), developed by Anthropic, is an alignment technique that trains AI models to be helpful, harmless, and honest using a set of written principles (the "constitution") rather than relying solely on human feedback labels for each response. The process has two phases: supervised learning from AI-revised responses, and reinforcement learning from AI feedback (RLAIF).
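In practice, the constitution is simply a list of natural-language principles, one of which is typically sampled for each critique or comparison. The principles and helper below are illustrative, not Anthropic's actual constitution:

```python
# A hypothetical constitution: a list of natural-language principles.
# These example principles are illustrative, not Anthropic's real text.
CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
    "Avoid responses that are toxic, dangerous, or deceptive.",
    "Prefer responses that acknowledge uncertainty over confident errors.",
]

def sample_principle(principles, index):
    """During training, a principle is usually sampled at random for
    each critique or comparison; here we index deterministically so
    the sketch is reproducible."""
    return principles[index % len(principles)]
```

Because different principles are applied to different examples, the full constitution shapes training in aggregate rather than being applied all at once.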
Key Details
In the first phase, the model generates responses, then critiques and revises them according to sampled constitutional principles; the revised responses become supervised fine-tuning targets. In the second phase, the model evaluates pairs of responses against the constitution to train a preference model, replacing human evaluators for harmlessness judgments. CAI reduces the need for human feedback while maintaining strong safety properties and is used in training Claude. It enables more scalable and consistent alignment than pure RLHF.
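The two phases above can be sketched as a control flow. The `model` function here is a hypothetical stand-in for a language-model call (a real system would query an LLM and parse its output); only the structure of the loop reflects the technique:

```python
def model(prompt: str) -> str:
    """Hypothetical stand-in for a language model call. It returns
    canned strings so the control flow below is runnable; a real
    implementation would query an actual LLM."""
    if "Critique" in prompt:
        return "The response could be more cautious."
    if "Revise" in prompt:
        return "Revised: a safer, more helpful answer."
    return "Initial draft answer."

def critique_and_revise(question: str, principle: str) -> str:
    """Phase 1 (supervised): draft -> critique -> revision.
    The revised response becomes a fine-tuning target."""
    draft = model(question)
    critique = model(f"Critique this response against the principle "
                     f"'{principle}': {draft}")
    revision = model(f"Revise the response given the critique "
                     f"'{critique}': {draft}")
    return revision

def ai_preference(question: str, resp_a: str, resp_b: str,
                  principle: str) -> str:
    """Phase 2 (RLAIF): the model judges which response better
    satisfies the principle; these labels train a preference model.
    A real system would parse the model's stated choice; we return
    a placeholder 'A' here."""
    _ = model(f"Which response better follows '{principle}'? "
              f"A: {resp_a} B: {resp_b}")
    return "A"
```

The resulting preference model then scores responses during reinforcement learning, playing the role a human-feedback reward model plays in standard RLHF.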