Reinforcement learning is the branch of machine learning concerned with how agents should take actions in an environment to maximize cumulative reward. Unlike supervised learning, where models learn from labeled examples, RL agents learn from experience: they take actions, observe outcomes, and gradually discover strategies that lead to better results. This trial-and-error approach has produced some of AI's most dramatic achievements, from defeating world champions at Go to controlling nuclear fusion reactors.

The RL Framework

Every reinforcement learning problem can be described using the same fundamental framework. An agent interacts with an environment through a cycle of observation, action, and reward. At each timestep, the agent observes the current state of the environment, selects an action based on its policy, receives a reward signal, and transitions to a new state.
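This observation-action-reward cycle can be sketched in a few lines of code. The environment and names below (`CoinFlipEnv`, `run_episode`) are illustrative inventions, not from any RL library; the point is the shape of the loop, which every RL system shares.

```python
import random

class CoinFlipEnv:
    """Toy environment: action 1 pays off with probability 0.7, action 0 with 0.3."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        return 0  # a single dummy state

    def step(self, action):
        p = 0.7 if action == 1 else 0.3
        reward = 1.0 if self.rng.random() < p else 0.0
        return 0, reward  # (next state, reward)

def run_episode(env, policy, steps=100):
    """The basic RL interaction loop: observe state, act, receive reward, repeat."""
    state = env.reset()
    total = 0.0
    for _ in range(steps):
        action = policy(state)           # agent selects an action from its policy
        state, reward = env.step(action) # environment returns outcome
        total += reward                  # cumulative reward is what RL maximizes
    return total

ret = run_episode(CoinFlipEnv(), policy=lambda s: 1)
```

Real environments differ only in what `step` returns (richer states, termination flags), not in the structure of this loop.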

Key Concepts

  • State (s): A representation of the current situation. In a chess game, the state is the board position. In robotics, the state includes joint angles, velocities, and sensor readings
  • Action (a): A choice the agent can make. Actions can be discrete (move left, right, or jump) or continuous (apply 3.7 Nm of torque to the motor)
  • Reward (r): A scalar signal indicating how good or bad the outcome of an action was. Designing the reward function is often the most critical and challenging part of an RL problem
  • Policy (pi): The agent's strategy, mapping states to actions. The goal of RL is to find the optimal policy that maximizes expected cumulative reward
  • Value Function V(s): The expected cumulative reward from a given state, following the current policy. It tells the agent how good a state is in the long run
  • Q-Function Q(s,a): The expected cumulative reward from taking a specific action in a specific state. It tells the agent how good a particular action is

Markov Decision Processes

The mathematical foundation of RL is the Markov Decision Process (MDP). An MDP assumes that the future depends only on the current state and action, not on the history of states that preceded it. This Markov property simplifies the problem enormously: the agent does not need to remember its entire history, only its current state.

An MDP is defined by the tuple (S, A, P, R, gamma), where S is the set of states, A is the set of actions, P is the transition probability function (the probability of reaching a new state given the current state and action), R is the reward function, and gamma is the discount factor that determines how much the agent values immediate versus future rewards.
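A small MDP can be written down directly as a transition table, and the optimal value function computed by repeatedly applying the Bellman backup. The two-state MDP below is a made-up example for illustration: state 1 pays a reward of 1 per step, and the "go" action in state 0 reaches it with probability 0.8.

```python
GAMMA = 0.9  # discount factor: how much future reward is worth relative to immediate reward

# P[s][a] = list of (probability, next_state, reward) triples
P = {
    0: {0: [(1.0, 0, 0.0)],                   # "stay": remain in state 0, no reward
        1: [(0.8, 1, 0.0), (0.2, 0, 0.0)]},   # "go": usually reach the rewarding state
    1: {0: [(1.0, 1, 1.0)],                   # staying in state 1 pays 1 per step
        1: [(1.0, 0, 0.0)]},
}

def value_iteration(P, gamma, iters=500):
    """Repeatedly back up: V(s) = max_a sum_{s'} P(s'|s,a) * (r + gamma * V(s'))."""
    V = {s: 0.0 for s in P}
    for _ in range(iters):
        V = {s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                    for outcomes in P[s].values())
             for s in P}
    return V

V = value_iteration(P, GAMMA)
```

With gamma = 0.9, staying in state 1 forever is worth 1 / (1 - 0.9) = 10, and state 0's value is slightly lower because reaching state 1 takes time and sometimes fails.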

"Reinforcement learning is the science of decision making. It is the formal study of how to act in the world when actions have consequences, and when the goal is not just to do well now but to do well over time."

Exploration vs. Exploitation

The exploration-exploitation tradeoff is the central tension in reinforcement learning. Should the agent exploit what it already knows, taking the best action according to its current knowledge? Or should it explore unfamiliar actions that might lead to better rewards? Too much exploitation leads to suboptimal policies that miss better strategies. Too much exploration wastes time on actions the agent already knows are poor.

Common Strategies

  • Epsilon-Greedy: With probability epsilon, take a random action; otherwise, take the best-known action. Epsilon typically decreases over time as the agent learns
  • Upper Confidence Bound (UCB): Choose actions that have high estimated value or high uncertainty, naturally balancing exploration and exploitation
  • Thompson Sampling: Maintain a probability distribution over action values and sample from it, exploring actions proportionally to their probability of being optimal
  • Intrinsic Motivation: Reward the agent for visiting novel states or reducing its own uncertainty, encouraging systematic exploration
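Epsilon-greedy, the simplest of these strategies, fits in a few lines. This is a generic sketch (the decay schedule shown is one common choice, not a standard constant):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon, explore uniformly at random;
    otherwise exploit the action with the highest estimated value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))           # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# A typical decay schedule: explore heavily early, settle near a small floor later
epsilons = [max(0.05, 0.9 * 0.99 ** t) for t in range(1000)]
```

With epsilon = 0 the agent is purely greedy; with epsilon = 1 it is purely random. The decay schedule shifts the balance from exploration toward exploitation as the agent's value estimates improve.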

Key Takeaway

The exploration-exploitation tradeoff is fundamental to RL. An effective agent must balance trying new things (exploration) with leveraging what it already knows (exploitation). The balance shifts over time as the agent becomes more knowledgeable about its environment.

Value-Based Methods

Q-Learning

Q-learning is the foundational value-based RL algorithm. It maintains a table of Q-values, Q(s,a), representing the expected return for each state-action pair. The agent updates these values from experience using the Bellman optimality equation: the Q-value of a state-action pair moves toward the immediate reward plus the discounted maximum Q-value of the next state. Q-learning is off-policy, meaning it can learn from experience generated by any behavior policy; this enables techniques such as experience replay and improves sample efficiency.
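The tabular Q-learning update is short enough to show in full. This is a minimal sketch of the standard update rule, applied here to one hypothetical transition; `alpha` is the learning rate:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, n_actions=2):
    """Bellman update: move Q(s,a) toward r + gamma * max_a' Q(s', a').
    Note the max over next actions -- this is what makes Q-learning off-policy."""
    best_next = max(Q[(s_next, a2)] for a2 in range(n_actions))
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)  # unseen state-action pairs default to 0
# One hypothetical transition: from state 0, action 1 yielded reward 1 and led back to state 0
q_learning_update(Q, s=0, a=1, r=1.0, s_next=0)
```

Starting from all-zero Q-values, this single update moves Q(0, 1) a fraction alpha of the way toward the observed reward.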

SARSA

SARSA (State-Action-Reward-State-Action) is similar to Q-learning but on-policy: it updates Q-values based on the action actually taken rather than the optimal action. This makes SARSA more conservative, as it accounts for the exploration behavior of the policy, which can be advantageous in environments where exploration is risky.
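The only change from Q-learning is the bootstrap target: SARSA uses the Q-value of the action actually taken next, not the maximum. The hand-picked Q-values below are a contrived illustration of why that matters:

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy update: bootstrap from the next action actually taken, not the max."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

Q = defaultdict(float)
Q[(1, 0)] = 5.0   # the greedy action in the next state looks great...
Q[(1, 1)] = -2.0  # ...but the policy actually explored and took this risky action

sarsa_update(Q, s=0, a=0, r=0.0, s_next=1, a_next=1)
# SARSA bootstraps from Q(1,1) = -2, so Q(0,0) decreases.
# Q-learning would have used max_a Q(1,a) = 5 and increased it instead.
```

Because SARSA's estimates reflect the exploration the policy really performs, it learns to avoid states where exploration itself is costly.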

Policy-Based Methods

Value-based methods work well for discrete action spaces but struggle with continuous actions. Policy gradient methods directly parameterize the policy and optimize it using gradient ascent on expected reward. Instead of estimating how good actions are and choosing the best, policy gradients directly adjust the probabilities of taking different actions to increase expected reward.

REINFORCE

The simplest policy gradient algorithm, REINFORCE, collects full episodes, calculates returns, and updates the policy to increase the probability of actions that led to high returns. Its simplicity comes at the cost of high variance in gradient estimates, which makes training unstable.
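REINFORCE can be demonstrated end to end on a two-armed bandit, where each episode is a single pull. The setup below (arm payoffs, learning rate, the running-average baseline) is an invented example, but the update rule is the standard REINFORCE gradient for a softmax policy:

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_bandit(steps=2000, lr=0.1, seed=0):
    """REINFORCE on a 2-armed bandit: arm 1 pays mean 1.0, arm 0 pays mean 0.2.
    A running-average baseline is subtracted from returns to reduce variance."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]     # policy parameters
    means = [0.2, 1.0]      # true (unknown to the agent) arm payoffs
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        a = 0 if rng.random() < probs[0] else 1       # sample from the policy
        r = means[a] + rng.gauss(0, 0.1)              # noisy reward
        baseline += 0.05 * (r - baseline)
        # grad of log pi(a) w.r.t. logits is one_hot(a) - probs
        for i in range(2):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * (r - baseline) * grad   # increase prob of good actions
    return softmax(logits)

probs = reinforce_bandit()
```

After training, the policy concentrates nearly all its probability on the better arm. Removing the baseline makes the same code converge far more erratically, which is the variance problem the text describes.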

Actor-Critic Methods

Actor-Critic methods combine the strengths of value-based and policy-based approaches. The actor is a policy network that selects actions. The critic is a value network that evaluates how good those actions are. The critic reduces variance in gradient estimates while the actor enables handling continuous action spaces. This combination is the foundation of most modern RL algorithms.
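A single step of a one-step actor-critic makes the division of labor concrete. This is a schematic sketch (the tabular critic and softmax-logit actor are simplifications of the neural networks used in practice):

```python
def actor_critic_step(logits, V, s, a, r, s_next, probs,
                      alpha_actor=0.1, alpha_critic=0.1, gamma=0.99):
    """One step of one-step actor-critic.
    The critic's TD error doubles as the advantage signal for the actor."""
    td_error = r + gamma * V[s_next] - V[s]   # critic's surprise at the outcome
    V[s] += alpha_critic * td_error           # critic update: improve value estimate
    for i in range(len(logits[s])):           # actor update: policy-gradient step
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[s][i] += alpha_actor * td_error * grad
    return td_error

logits = {0: [0.0, 0.0]}
V = {0: 0.0, 1: 0.0}
delta = actor_critic_step(logits, V, s=0, a=1, r=1.0, s_next=1, probs=[0.5, 0.5])
```

A positive TD error means the action turned out better than the critic expected, so the actor raises that action's probability; a negative one lowers it. Using the TD error instead of a full-episode return is what cuts the variance relative to REINFORCE.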

Modern RL Algorithms

PPO (Proximal Policy Optimization)

PPO is among the most widely used RL algorithms today. It constrains policy updates to prevent large, destabilizing changes, using a clipped objective function that limits how much the policy can change in a single update. PPO is simple to implement, works well across diverse environments, and requires minimal hyperparameter tuning compared to alternatives.
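The clipped objective itself is only a few lines. Here it is for a single sample, where `ratio` is the probability of the action under the new policy divided by its probability under the old one, and `advantage` is the estimated advantage of that action:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A).
    Any gain from pushing the ratio outside [1-eps, 1+eps] is cut off,
    so a single update cannot move the policy too far."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage the objective stops rewarding ratios above 1 + eps; with a negative advantage it stops rewarding ratios below 1 - eps. In either direction, the incentive to make a large policy change disappears at the clip boundary.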

SAC (Soft Actor-Critic)

SAC adds an entropy term to the reward, encouraging the agent to maintain exploration throughout training. This maximum entropy framework leads to more robust policies that generalize better to new situations. SAC is particularly effective for continuous control tasks in robotics and simulation.
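The entropy bonus at the heart of SAC's objective is easy to state for a discrete policy (SAC itself operates on continuous Gaussian policies; this discrete version is a simplified illustration, and `alpha` is the temperature weighting the bonus):

```python
import math

def entropy(probs):
    """Shannon entropy of a discrete action distribution (in nats).
    Highest for a uniform policy, zero for a deterministic one."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def soft_reward(r, probs, alpha=0.1):
    """SAC-style objective: environment reward plus a bonus for staying stochastic."""
    return r + alpha * entropy(probs)
```

Because a deterministic policy earns no bonus, the agent keeps some randomness in its actions for as long as the entropy term outweighs the cost of suboptimal actions, which is what sustains exploration throughout training.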

Reward Design

The reward function is how you communicate your goals to the RL agent. Poorly designed rewards lead to reward hacking, where the agent finds unintended ways to maximize reward without actually achieving the desired behavior. A classic example: a cleaning robot rewarded for not seeing dirt might learn to close its eyes rather than clean.

Effective reward design requires thinking carefully about what behaviors you want to encourage and what shortcuts the agent might find. Reward shaping provides intermediate rewards that guide the agent toward the final goal. Inverse reinforcement learning infers the reward function from expert demonstrations. RLHF (Reinforcement Learning from Human Feedback) uses human preferences to define rewards, which has become the dominant approach for fine-tuning large language models.

Challenges and Limitations

  • Sample Efficiency: RL agents typically require millions of interactions to learn, making them impractical for real-world environments where each interaction is expensive or slow
  • Reward Specification: Translating complex human goals into scalar reward signals is difficult and error-prone
  • Sim-to-Real Transfer: Policies learned in simulation often fail in the real world due to differences between the simulated and real environments
  • Credit Assignment: In long-horizon tasks, determining which past actions were responsible for current rewards is challenging
  • Safety: RL agents explore by trial and error, which can be dangerous in safety-critical applications

Despite these challenges, reinforcement learning remains one of the most exciting and rapidly advancing areas of AI. Its ability to discover novel strategies through interaction, rather than imitation, gives it a unique potential to solve problems that other approaches cannot.

Key Takeaway

Reinforcement learning enables agents to learn optimal behavior through interaction with an environment. Understanding the core concepts of states, actions, rewards, and policies, along with the exploration-exploitation tradeoff, provides the foundation for applying RL to real-world problems.