Language, music, stock prices, and weather patterns all share a fundamental property: order matters. The word "not" changes the meaning of an entire sentence. A note depends on the notes that came before it. Standard feedforward networks treat each input independently, ignoring this temporal structure. Recurrent Neural Networks (RNNs) were designed to solve exactly this problem.

The Core Idea: Memory Through Recurrence

An RNN processes a sequence one element at a time, maintaining a hidden state that acts as memory. At each time step, the network takes the current input and the previous hidden state, and produces a new hidden state and (optionally) an output:

h_t = f(W_hh * h_{t-1} + W_xh * x_t + b)

The key insight is that h_t depends on h_{t-1}, which depends on h_{t-2}, and so on. This recurrence allows information from earlier in the sequence to influence processing of later elements.
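The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not a trainable implementation: the dimensions, weight scales, and the choice of tanh as the nonlinearity f are toy assumptions, not from the text.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One recurrence step: h_t = tanh(W_hh @ h_prev + W_xh @ x_t + b)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b)

# Toy dimensions: 3-dim inputs, 4-dim hidden state.
rng = np.random.default_rng(0)
W_xh = rng.normal(scale=0.1, size=(4, 3))
W_hh = rng.normal(scale=0.1, size=(4, 4))
b = np.zeros(4)

h = np.zeros(4)                                  # initial hidden state
sequence = [rng.normal(size=3) for _ in range(5)]
for x_t in sequence:
    h = rnn_step(x_t, h, W_xh, W_hh, b)          # h summarizes everything seen so far
```

Note that the same weight matrices are reused at every time step; only the hidden state changes as the sequence is consumed.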

"An RNN is a neural network with a loop. It reads a sequence one step at a time, carrying forward a summary of everything it has seen so far in its hidden state."

The Vanishing Gradient Problem in RNNs

In theory, RNNs can capture arbitrarily long-range dependencies. In practice, training them with backpropagation through time (BPTT) exposes a critical flaw: the vanishing gradient problem. As gradients propagate back through many time steps, they shrink exponentially (or, with large weights, explode), making it effectively impossible for a vanilla RNN to learn dependencies spanning more than roughly 10-20 steps.
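The exponential shrinkage is easy to see numerically. During BPTT, the gradient is repeatedly multiplied by the recurrent Jacobian; the sketch below drops the tanh factor (which only shrinks the gradient further) and just multiplies by a fixed weight matrix with spectral norm below 1. The matrix size and weight scale are arbitrary toy choices.

```python
import numpy as np

rng = np.random.default_rng(1)
W_hh = rng.normal(scale=0.1, size=(8, 8))  # small weights -> spectral norm < 1

grad = np.ones(8)        # gradient arriving at the last time step
norms = []
for t in range(30):      # propagate back through 30 time steps
    grad = W_hh.T @ grad # one Jacobian multiply per step (tanh factor omitted)
    norms.append(np.linalg.norm(grad))

# norms decays roughly geometrically: early time steps receive almost no signal.
```

With weights large enough that the spectral norm exceeds 1, the same loop explodes instead of vanishing, which is why gradient clipping (see the practical tips below) is standard with vanilla RNNs.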

LSTM: Long Short-Term Memory

Introduced by Hochreiter and Schmidhuber in 1997, LSTM networks solve the vanishing gradient problem with a clever gating mechanism. Instead of a single hidden state, LSTMs maintain a cell state that acts as an information highway, allowing gradients to flow unchanged across many time steps.

The Three Gates

  • Forget gate: Decides what information to discard from the cell state. It reads the previous hidden state and current input and outputs a value between 0 (completely forget) and 1 (completely keep) for each element of the cell state.
  • Input gate: Decides what new information to store in the cell state. It has two parts: a sigmoid layer that decides which values to update, and a tanh layer that creates candidate values to add.
  • Output gate: Decides what to output from the cell state. The cell state is passed through tanh (to normalize values) and multiplied by the output gate's sigmoid to select which parts become the hidden state output.
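The three gates above can be sketched as a single NumPy function. For compactness, this toy version stacks all four pre-activations into one matrix multiply; the dimensions and weight scale are illustrative assumptions, not from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev; x_t] to the stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.size
    f = sigmoid(z[0:H])        # forget gate: what to discard from the cell state
    i = sigmoid(z[H:2*H])      # input gate: which candidate values to write
    g = np.tanh(z[2*H:3*H])    # candidate values to add
    o = sigmoid(z[3*H:4*H])    # output gate: what to expose as the hidden state
    c = f * c_prev + i * g     # cell state update: the "conveyor belt"
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
H, X = 4, 3
W = rng.normal(scale=0.1, size=(4 * H, H + X))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x_t in (rng.normal(size=X) for _ in range(5)):
    h, c = lstm_step(x_t, h, c, W, b)
```

The line `c = f * c_prev + i * g` is the heart of the design: when f is near 1 and i is near 0, the cell state passes through unchanged, which is what lets gradients survive many time steps.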

Key Takeaway

The cell state in an LSTM is like a conveyor belt running through time. Information can be added or removed through gates, but the default path is to keep information flowing unchanged. This is why LSTMs can capture dependencies spanning hundreds of time steps.

GRU: Gated Recurrent Unit

The GRU, introduced by Cho et al. in 2014, simplifies the LSTM by merging the cell state and hidden state and using only two gates instead of three:

  • Reset gate: Controls how much of the previous hidden state to forget when computing the candidate hidden state.
  • Update gate: Controls how much of the candidate hidden state to use versus keeping the old hidden state. This combines the forget and input gates of the LSTM into one.
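A GRU step can be sketched in the same style. Note that the literature is split on which side of the interpolation the update gate multiplies; this sketch uses one common convention, and the dimensions and weight scale are again toy assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step with update gate z and reset gate r."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx + b_z)   # update gate: new candidate vs. old state
    r = sigmoid(W_r @ hx + b_r)   # reset gate: how much history feeds the candidate
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)
    return (1.0 - z) * h_prev + z * h_cand  # interpolate old state and candidate

rng = np.random.default_rng(0)
H, X = 4, 3
W_z, W_r, W_h = (rng.normal(scale=0.1, size=(H, H + X)) for _ in range(3))
b_z = b_r = b_h = np.zeros(H)
h = np.zeros(H)
for x_t in (rng.normal(size=X) for _ in range(5)):
    h = gru_step(x_t, h, W_z, W_r, W_h, b_z, b_r, b_h)
```

Compared with the LSTM sketch, there is no separate cell state and one fewer gate, which is where the parameter savings come from.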

GRUs have fewer parameters than LSTMs and are often faster to train. In practice, GRUs and LSTMs perform comparably on most tasks, with neither consistently outperforming the other.

Bidirectional RNNs

A standard RNN processes sequences from left to right. But for many tasks, future context is also important. A bidirectional RNN runs two separate RNNs: one processing the sequence forward and one backward. Their outputs are concatenated at each time step, giving every position access to both past and future context. This requires the full sequence to be available up front, which suits tasks like named entity recognition and the encoder side of machine translation.
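The forward/backward/concatenate recipe can be sketched directly on top of a simple tanh RNN step. All names and dimensions here are illustrative assumptions.

```python
import numpy as np

def make_step(W_xh, W_hh, b):
    """Build a simple tanh RNN step closure over its weights."""
    return lambda x, h: np.tanh(W_hh @ h + W_xh @ x + b)

def run_rnn(xs, h0, step):
    """Run an RNN over xs, returning the hidden state at every time step."""
    hs, h = [], h0
    for x in xs:
        h = step(x, h)
        hs.append(h)
    return hs

def bidirectional(xs, h0, step_fwd, step_bwd):
    """Concatenate a forward pass with a backward pass at each time step."""
    fwd = run_rnn(xs, h0, step_fwd)
    bwd = run_rnn(xs[::-1], h0, step_bwd)[::-1]  # reverse back to input order
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

rng = np.random.default_rng(0)
H, X = 4, 3
def params():
    return (rng.normal(scale=0.1, size=(H, X)),
            rng.normal(scale=0.1, size=(H, H)),
            np.zeros(H))

step_fwd = make_step(*params())
step_bwd = make_step(*params())   # the two directions have independent weights
xs = [rng.normal(size=X) for _ in range(5)]
outputs = bidirectional(xs, np.zeros(H), step_fwd, step_bwd)  # each output is size 2H
```

Each output vector is twice the hidden size: the first half summarizes everything before the position, the second half everything after it.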

Applications

  • Language modeling: Predicting the next word in a sentence. RNNs powered early language models before Transformers took over.
  • Machine translation: The encoder-decoder architecture uses one RNN to encode the source sentence and another to decode the target sentence.
  • Speech recognition: Processing audio spectrograms as sequences to produce text transcriptions.
  • Time series forecasting: Predicting future values of stocks, weather, energy consumption, and other temporal data.
  • Music generation: Learning patterns in musical sequences and generating new compositions.
  • Text generation: Producing coherent text character by character or word by word.

RNNs vs. Transformers

Transformers have largely replaced RNNs for natural language processing due to their parallelizable attention mechanism and ability to capture long-range dependencies without recurrence. However, RNNs still have advantages in specific scenarios:

  • Low memory for long sequences: RNNs process one step at a time, using constant memory regardless of sequence length. Standard Transformer self-attention requires memory quadratic in sequence length.
  • Streaming data: RNNs naturally handle data arriving one element at a time.
  • Smaller datasets: RNNs can work well with less data than Transformers typically need.
  • Edge deployment: Simpler RNN architectures may be more suitable for resource-constrained devices.

Key Takeaway

LSTMs and GRUs solved the vanishing gradient problem that made vanilla RNNs impractical for long sequences. While Transformers now dominate NLP, understanding RNNs remains important because they are still used in many production systems and provide foundational concepts for sequential modeling.

Practical Tips

  1. Use gradient clipping to prevent exploding gradients, especially with vanilla RNNs.
  2. Apply dropout between layers (not within recurrent connections) to regularize.
  3. Start with GRUs if you are unsure. They are simpler and often match LSTM performance.
  4. Consider bidirectional architectures when future context is available at prediction time.
  5. Stack multiple layers for more capacity, but monitor for overfitting.
  6. Try Transformers first for NLP tasks, falling back to LSTMs only if data or compute constraints favor them.
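Tip 1 deserves a concrete sketch. Global-norm clipping rescales all gradients together so their combined L2 norm stays below a threshold; the threshold of 5.0 and the example gradients below are arbitrary illustrative values.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-8))  # no-op when already under the limit
    return [g * scale for g in grads], total

# Simulate an exploding gradient and clip it.
grads = [np.full(100, 10.0), np.full(50, -4.0)]
clipped, norm_before = clip_by_global_norm(grads)
norm_after = np.sqrt(sum(np.sum(g * g) for g in clipped))
```

Because every array is scaled by the same factor, clipping preserves the direction of the overall gradient; it only limits the step size.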

RNNs, LSTMs, and GRUs represent a fundamental paradigm in deep learning: the idea that neural networks can maintain and update memory over time. This concept continues to influence modern architecture design, from state-space models to memory-augmented Transformers.