Classical reinforcement learning algorithms like Q-learning work by maintaining tables that map every state-action pair to a value. But what happens when the state space is too large for a table, say, the pixels on a screen or the positions of every joint in a robot? This is where deep reinforcement learning enters: using neural networks to approximate the value functions or policies that would be impossible to store explicitly. The marriage of deep learning's representational power with RL's decision-making framework has produced agents capable of superhuman performance in games, robotics, and beyond.

DQN: The Deep Q-Network Revolution

In 2013, DeepMind introduced the Deep Q-Network (DQN), a landmark algorithm that learned to play Atari games from raw pixel inputs, eventually matching or exceeding human-level performance on many of them. DQN replaces the Q-table with a neural network that takes the state (game screen pixels) as input and outputs Q-values for each possible action.

Key Innovations

  • Experience Replay: Instead of learning from sequential experiences (which are correlated), DQN stores transitions in a replay buffer and samples random mini-batches for training. This breaks temporal correlations and improves stability
  • Target Network: A separate copy of the Q-network, updated periodically rather than at every step, provides stable targets for learning. Without this, the moving target problem causes training to diverge
  • Frame Stacking: Multiple consecutive frames are stacked as input, giving the network information about motion and velocity that a single frame cannot provide
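
The experience replay mechanism described above can be sketched in a few lines. This is an illustrative minimal version (class and method names are our own, not DeepMind's implementation): a fixed-capacity buffer that overwrites old transitions and samples uniformly at random.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        # deque with maxlen silently discards the oldest transition when full
        self.buffer = deque(maxlen=capacity)

    def push(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlations
        # present in sequential gameplay
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

A training loop would push one transition per environment step and, once the buffer holds enough data, draw a mini-batch per gradient update.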

"DQN demonstrated that a single algorithm and architecture could learn from raw pixels to achieve superhuman performance across dozens of different Atari games. It was the moment deep reinforcement learning became real."

DQN Variants

Subsequent research improved DQN substantially. Double DQN addresses overestimation of Q-values by using one network to select actions and another to evaluate them. Dueling DQN separates the network into state-value and action-advantage streams. Prioritized Experience Replay samples more important transitions more frequently. Rainbow DQN combines these and several other improvements (such as multi-step returns and distributional value estimates) into a single agent.
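
The Double DQN idea reduces to one line of bookkeeping in the target computation. A minimal numpy sketch (function and argument names are illustrative): the online network picks the greedy action, while the target network supplies that action's value.

```python
import numpy as np

def double_dqn_target(q_online_next, q_target_next, rewards, dones, gamma=0.99):
    """Double DQN target: the online net selects the action,
    the target net evaluates it, which curbs overestimation."""
    best_actions = np.argmax(q_online_next, axis=1)                    # selection
    evaluated = q_target_next[np.arange(len(rewards)), best_actions]   # evaluation
    # Zero out the bootstrap term on terminal transitions
    return rewards + gamma * (1.0 - dones) * evaluated
```

Vanilla DQN would instead take `np.max(q_target_next, axis=1)`, letting the same (noisy) network both select and evaluate, which is exactly the source of the overestimation bias.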

Key Takeaway

DQN showed that neural networks could approximate Q-functions for complex, high-dimensional environments. Its key innovations, experience replay and target networks, remain essential building blocks in modern deep RL algorithms.

Policy Gradient Methods

DQN and its variants work for discrete action spaces, but many real-world problems involve continuous actions: how much torque to apply, what angle to steer, how fast to accelerate. Policy gradient methods handle continuous actions naturally by parameterizing the policy directly as a neural network that outputs action distributions.
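
"Parameterizing the policy directly" concretely means the network outputs the parameters of an action distribution. A toy sketch under our own naming (a linear mean with a learned log-std, the common setup for continuous control):

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_policy(state, weights_mu, log_std):
    """Tiny linear Gaussian policy: state -> distribution mean.
    The std is a learned per-dimension constant, kept positive via exp."""
    mu = state @ weights_mu
    std = np.exp(log_std)
    action = mu + std * rng.standard_normal(mu.shape)   # sample a continuous action
    # Diagonal-Gaussian log-density of the sampled action
    log_prob = -0.5 * (((action - mu) / std) ** 2
                       + 2 * log_std + np.log(2 * np.pi)).sum()
    return action, log_prob
```

Policy gradient methods then adjust `weights_mu` and `log_std` to raise the log-probability of actions that led to high returns.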

A3C: Asynchronous Advantage Actor-Critic

A3C, introduced by DeepMind in 2016, was a breakthrough in both performance and training efficiency. Instead of using a replay buffer, A3C runs multiple agents in parallel, each interacting with its own copy of the environment. These agents asynchronously update a shared model, providing diverse experience without the memory cost of replay buffers.

The "advantage" in A3C refers to using the advantage function A(s,a) = Q(s,a) - V(s), which measures how much better an action is compared to the average. This reduces variance in gradient estimates compared to raw returns. A3C demonstrated strong performance on both Atari games and continuous control tasks.

A2C: Synchronous Version

A2C is the synchronous variant of A3C. Instead of each worker updating the model independently, A2C collects experiences from all workers, computes a single update, and distributes it. This is simpler to implement, often runs faster on modern GPU hardware, and produces equivalent or better results.

PPO: The Workhorse of Deep RL

Proximal Policy Optimization (PPO), introduced by OpenAI in 2017, has become the most widely used deep RL algorithm. Its popularity stems from its simplicity, generality, and robust performance across a wide range of tasks.

How PPO Works

PPO belongs to the family of trust region methods that prevent the policy from changing too dramatically in a single update. Large policy changes can be catastrophic: a policy that worked well can suddenly become terrible, and recovery is difficult because all subsequent experience is generated by the broken policy.

PPO uses a clipped surrogate objective that limits the ratio between the new and old policy probabilities. If the new policy assigns much higher or lower probability to an action than the old policy, the objective is clipped, preventing the optimization from pushing the policy too far. This simple mechanism provides most of the benefits of more complex trust region methods like TRPO at a fraction of the implementation complexity.
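
The clipped surrogate objective is compact enough to show directly. A minimal numpy sketch (to be maximized; function and argument names are ours, and a full PPO loss would add value and entropy terms):

```python
import numpy as np

def ppo_clip_objective(new_log_probs, old_log_probs, advantages, eps=0.2):
    """PPO clipped surrogate objective, averaged over a batch."""
    ratio = np.exp(new_log_probs - old_log_probs)        # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Taking the elementwise minimum removes any incentive to push
    # the ratio outside [1 - eps, 1 + eps]
    return np.minimum(unclipped, clipped).mean()
```

When the new and old policies agree, the ratio is 1 and the objective reduces to the mean advantage; once the ratio leaves the clip range in the direction the advantage favors, the gradient through that sample vanishes.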

Why PPO Dominates

  • Simplicity: PPO can be implemented in a few hundred lines of code
  • Generality: Works for discrete actions, continuous actions, and mixed action spaces
  • Stability: The clipping mechanism prevents catastrophic policy updates
  • Scalability: Parallelizes well across multiple environments and GPUs
  • RLHF: PPO is the standard algorithm used for fine-tuning language models with human feedback

SAC: Soft Actor-Critic

Soft Actor-Critic (SAC) introduces the concept of maximum entropy RL, where the agent maximizes both expected reward and policy entropy. The entropy bonus encourages exploration and leads to more robust policies that do not collapse to a single deterministic behavior. SAC is off-policy (can learn from past experience), sample-efficient, and particularly effective for continuous control tasks.

SAC uses twin Q-networks (taking the minimum to prevent overestimation), an automatically tuned temperature parameter for the entropy bonus, and a squashed Gaussian policy that naturally handles bounded continuous actions. For robotics and simulation-based control, SAC often outperforms PPO in terms of sample efficiency.
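
The twin-Q minimum and entropy bonus combine in SAC's target computation. A simplified numpy sketch (names are illustrative; real SAC also maintains slowly-updated target copies of both critics):

```python
import numpy as np

def sac_target(q1_next, q2_next, next_log_probs, rewards, dones,
               gamma=0.99, alpha=0.2):
    """Soft Q target: minimum of the twin critics plus an entropy
    bonus of -alpha * log pi for the next action."""
    min_q = np.minimum(q1_next, q2_next)          # twin-Q minimum curbs overestimation
    soft_value = min_q - alpha * next_log_probs   # entropy-regularized value
    return rewards + gamma * (1.0 - dones) * soft_value
```

The temperature `alpha` trades reward against entropy; in practice SAC tunes it automatically by gradient descent against a target entropy rather than fixing it as done here.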

Model-Based Deep RL

All the algorithms discussed so far are model-free: they learn directly from experience without building an explicit model of the environment. Model-based deep RL learns a dynamics model that predicts how the environment transitions between states, then uses this model to plan or generate imaginary experience for training.

Notable model-based approaches include Dreamer (which learns a world model in a latent space and trains a policy entirely within the learned model), MuZero (which combines learned dynamics with Monte Carlo tree search), and TD-MPC (which uses model predictive control with learned dynamics). Model-based methods can be dramatically more sample-efficient but require accurate world models, which are hard to learn in complex environments.
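
The "imaginary experience" idea shared by these methods can be sketched generically. A toy version (the `policy` and `model` callables are stand-ins, not any specific library's API): roll the learned dynamics model forward without touching the real environment.

```python
def imagine_rollout(state, policy, model, horizon=5):
    """Generate 'imaginary' transitions by unrolling a learned dynamics model.
    `model(state, action)` returns a predicted (next_state, reward)."""
    trajectory = []
    for _ in range(horizon):
        action = policy(state)
        next_state, reward = model(state, action)   # prediction, not a real step
        trajectory.append((state, action, reward))
        state = next_state
    return trajectory
```

Dreamer trains its policy on exactly this kind of rollout (in a learned latent space), while MuZero uses the model inside a tree search instead; in both cases the quality of the policy is capped by the accuracy of the model.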

Choosing the Right Algorithm

  • Discrete actions, moderate state spaces: DQN variants remain strong choices
  • Continuous actions, lots of environment interaction: PPO is the default starting point
  • Continuous actions, limited interaction budget: SAC offers better sample efficiency
  • Very limited interaction budget: Model-based methods like Dreamer can learn from fewer interactions
  • RLHF for language models: PPO is the standard, with DPO emerging as a simpler alternative

Deep RL remains an active and rapidly evolving research area. New algorithms, architectures, and training techniques continue to push the boundaries of what RL agents can achieve, from manipulating objects with robotic hands to discovering new mathematical theorems.

Key Takeaway

Deep RL combines neural networks with RL to solve complex, high-dimensional problems. DQN handles discrete actions with experience replay and target networks. PPO is the go-to algorithm for most applications. SAC excels at continuous control. Choose based on your action space, sample budget, and application requirements.