The encoder-decoder architecture is one of the most important design patterns in deep learning. Originally conceived for machine translation, it provides a general framework for any task that transforms one sequence into another. From the earliest RNN-based seq2seq models to the transformer and its descendants like T5 and BART, the encoder-decoder pattern has evolved dramatically while retaining its core principle: encode the input into a rich representation, then decode that representation into the desired output.
The Original Seq2Seq Framework
The sequence-to-sequence (seq2seq) model was introduced independently by two research groups in 2014: Sutskever, Vinyals, and Le at Google, and Cho et al. at the University of Montreal. The architecture consists of two recurrent neural networks:
- Encoder RNN: Processes the input sequence token by token, updating its hidden state at each step. The final hidden state serves as a compressed representation of the entire input.
- Decoder RNN: Takes the encoder's final hidden state as its initial state and generates the output sequence one token at a time, using each generated token as input for the next step.
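The two-RNN loop described above can be sketched in a few lines of NumPy. This is a toy with random, untrained weights: the dimensions, the vanilla-RNN update, and the greedy decoding loop are all illustrative assumptions, not a faithful reproduction of the original LSTM-based models.

```python
import numpy as np

# Toy seq2seq: random weights, integer token ids, tiny dimensions.
rng = np.random.default_rng(0)
d_emb, d_hid, vocab = 8, 16, 10

emb = rng.normal(size=(vocab, d_emb))          # embedding table
W_xh = rng.normal(size=(d_emb, d_hid)) * 0.1   # input-to-hidden
W_hh = rng.normal(size=(d_hid, d_hid)) * 0.1   # hidden-to-hidden
W_out = rng.normal(size=(d_hid, vocab)) * 0.1  # hidden-to-vocab logits

def rnn_step(tok_id, h):
    """One vanilla-RNN update: h' = tanh(x W_xh + h W_hh)."""
    return np.tanh(emb[tok_id] @ W_xh + h @ W_hh)

# Encoder: fold the whole input into one fixed-size hidden state.
src = [3, 1, 4, 1, 5]
h = np.zeros(d_hid)
for tok in src:
    h = rnn_step(tok, h)
context = h  # the single vector handed to the decoder

# Decoder: start from the context and greedily emit one token per step.
BOS, max_len = 0, 4
tok, out = BOS, []
h = context
for _ in range(max_len):
    h = rnn_step(tok, h)
    tok = int(np.argmax(h @ W_out))  # greedy choice of next token
    out.append(tok)
```

Note that `context` is the only channel between the two networks, which is exactly the bottleneck discussed below.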
This was groundbreaking for machine translation. Previous statistical methods required extensive feature engineering and separate components for language modeling, translation modeling, and reordering. Seq2seq replaced all of this with a single end-to-end neural network.
Seq2seq showed that a neural network could learn to translate between languages without being explicitly told anything about grammar, vocabulary mappings, or linguistic rules.
The Bottleneck Problem
The original seq2seq architecture had a critical flaw: the entire input sequence had to be compressed into a single fixed-size vector. For short sentences, this worked reasonably well. But as sentences grew beyond 20-30 words, translation quality degraded sharply. The fixed-size vector simply could not capture all the nuances of longer inputs.
Key Takeaway
The original seq2seq model compressed the entire input into a single vector, creating an information bottleneck that limited its ability to handle long sequences effectively.
Adding Attention to Seq2Seq
The attention mechanism, introduced by Bahdanau et al. in 2014, solved the bottleneck problem by allowing the decoder to access all encoder hidden states at every generation step. Instead of relying on a single compressed vector, the decoder computes attention weights over the encoder outputs and creates a custom context vector for each output token.
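The per-step context vector can be sketched as follows. Bahdanau et al. used an additive (small MLP) scoring function; this sketch uses a plain dot product for brevity, and all shapes and values are illustrative.

```python
import numpy as np

# One attention step: the decoder state queries all encoder hidden states
# and builds a context vector specific to this output position.
rng = np.random.default_rng(1)
T_src, d = 6, 16                           # source length, hidden size

enc_states = rng.normal(size=(T_src, d))   # one hidden state per input token
dec_state = rng.normal(size=(d,))          # current decoder hidden state

scores = enc_states @ dec_state            # one score per source position
weights = np.exp(scores - scores.max())
weights /= weights.sum()                   # softmax -> attention weights
context = weights @ enc_states             # weighted sum of encoder states
```

Because `context` is recomputed for every output token, the decoder is no longer limited to a single fixed summary of the input.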
This attention-augmented encoder-decoder became the dominant architecture for machine translation from 2015 to 2017. Google Translate switched from its statistical system to an attention-based seq2seq model in 2016, producing dramatic improvements in translation quality.
The attention mechanism also provided interpretability. By visualizing the attention weights, researchers could see which input words the model focused on when generating each output word, revealing that the model learned meaningful alignment patterns between source and target languages.
The Transformer Encoder-Decoder
The 2017 paper "Attention Is All You Need" replaced the recurrent components entirely with self-attention, creating the transformer architecture. The encoder-decoder structure remained, but both components were rebuilt using attention layers:
Transformer Encoder
The encoder consists of a stack of identical layers (6 in the original paper). Each layer has two sub-layers:
- Multi-head self-attention: Every input position attends to every other input position, building contextualized representations
- Position-wise feed-forward network: A two-layer MLP applied independently to each position, adding nonlinear transformation capacity
Each sub-layer uses residual connections and layer normalization. Because the encoder processes all input positions in parallel rather than one step at a time, training is dramatically faster than with recurrent encoders.
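The two sub-layers can be sketched in NumPy, single-head and with random weights. This follows the post-layer-norm arrangement of the original paper; all names and sizes here are illustrative assumptions, not a trained model.

```python
import numpy as np

# One transformer encoder layer: self-attention, then a position-wise FFN,
# each wrapped in residual connection + layer normalization.
rng = np.random.default_rng(2)
T, d, d_ff = 5, 16, 32                     # sequence length, model dim, FFN dim

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    sd = x.std(-1, keepdims=True)
    return (x - mu) / (sd + eps)

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)          # (T, T): every position vs every other
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)          # row-wise softmax
    return w @ v

Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d, d_ff)) * 0.1
W2 = rng.normal(size=(d_ff, d)) * 0.1

x = rng.normal(size=(T, d))                            # T token embeddings
x = layer_norm(x + self_attention(x, Wq, Wk, Wv))      # sub-layer 1
x = layer_norm(x + np.maximum(x @ W1, 0.0) @ W2)       # sub-layer 2 (ReLU FFN)
```

The `(T, T)` score matrix is where every position attends to every other; the FFN then transforms each position independently.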
Transformer Decoder
The decoder also consists of a stack of identical layers, but each layer has three sub-layers:
- Masked multi-head self-attention: Each output position attends to all previous output positions but not future ones, enforcing autoregressive generation
- Multi-head cross-attention: The decoder attends to the encoder's output, retrieving relevant information from the input
- Position-wise feed-forward network: Same as in the encoder
The masking in the decoder's self-attention is crucial. During training, the entire target sequence is fed in at once for efficiency, but the mask prevents each position from attending to future positions, simulating the autoregressive generation that occurs at inference time.
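The mask itself is simple to construct: an upper-triangular matrix of `-inf` added to the attention scores before the softmax. A minimal illustration with uniform scores:

```python
import numpy as np

# Causal mask: position i may attend to positions <= i. Setting future
# positions to -inf makes their softmax weight exactly zero.
T = 4
mask = np.triu(np.ones((T, T)), k=1)   # 1s above the diagonal = "future"
scores = np.zeros((T, T))              # uniform scores, for demonstration
scores[mask == 1] = -np.inf            # block attention to the future

w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)          # row-wise softmax
# Row 0 attends only to position 0; row 3 attends uniformly to positions 0..3.
```

Every position's row is computed in one matrix operation during training, yet each row sees only its own past, matching what the model will see at inference time.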
Encoder-Only and Decoder-Only Variants
The transformer architecture quickly spawned specialized variants that use only one half of the encoder-decoder pair:
Encoder-Only Models (BERT Family)
BERT and its variants use only the transformer encoder with bidirectional self-attention. Every token can attend to every other token, including those that come after it. This makes encoder-only models excellent for understanding tasks like classification, named entity recognition, and question answering, but they cannot generate text autoregressively.
Decoder-Only Models (GPT Family)
GPT and its successors use only the transformer decoder with causal (masked) self-attention. Each token can attend only to previous tokens, enabling autoregressive text generation. Decoder-only models have become dominant for general-purpose language modeling and form the basis of ChatGPT, Claude, and most modern LLMs.
The transformer's encoder-decoder architecture proved so versatile that its two halves each spawned entire families of models: encoder-only models for understanding and decoder-only models for generation.
Modern Encoder-Decoder Models
Despite the popularity of encoder-only and decoder-only approaches, the full encoder-decoder architecture remains important and has produced several influential models:
T5 (Text-to-Text Transfer Transformer)
Google's T5 frames every NLP task as a text-to-text problem. Classification, translation, summarization, and question answering all use the same encoder-decoder format. The input is prefixed with a task description (e.g., "translate English to German:"), and the model generates the output as text. This unified framework simplified multi-task training and demonstrated the generality of the encoder-decoder approach.
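The framing can be sketched as a simple formatting step. The prefix strings below follow the pattern reported for T5, but the helper function and task names are hypothetical, for illustration only:

```python
# Sketch of T5's text-to-text framing: every task becomes an
# (input string -> target string) pair, so one encoder-decoder
# model handles all of them with no task-specific heads.
def to_text_to_text(task, text):
    prefixes = {
        "translate": "translate English to German: ",
        "summarize": "summarize: ",
    }
    return prefixes[task] + text

src = to_text_to_text("translate", "The house is wonderful.")
# The model would then generate the target (here, a German sentence)
# as ordinary text conditioned on this formatted input.
```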
BART (Bidirectional and Auto-Regressive Transformers)
Facebook's BART combines a bidirectional encoder (like BERT) with an autoregressive decoder (like GPT). It is pre-trained by corrupting text with various noise functions (token masking and deletion, text infilling, sentence permutation) and training the model to reconstruct the original. BART excels at generation tasks and has been particularly successful on abstractive summarization benchmarks.
mBART and mT5
Multilingual variants extend the encoder-decoder framework to handle many languages simultaneously. mBART is pre-trained on monolingual data in 25 languages, while mT5 covers 101 languages. These models enable cross-lingual transfer, where training on high-resource languages improves performance on low-resource languages.
Key Takeaway
Full encoder-decoder models like T5 and BART remain powerful choices for tasks that naturally involve transforming one sequence into another, such as translation, summarization, and question answering.
When to Use Each Architecture
Choosing between encoder-only, decoder-only, and encoder-decoder architectures depends on the task:
- Encoder-only: Best for classification, token labeling, and tasks requiring bidirectional understanding. Use when the output is a label, score, or tagged version of the input.
- Decoder-only: Best for open-ended generation, creative writing, and conversational AI. Use when the task is primarily about generating new text. Currently the dominant paradigm for general-purpose LLMs.
- Encoder-decoder: Best for structured transformations where input and output are clearly distinct sequences, such as translation, summarization, and code generation from natural language specifications.
The trend in the field has shifted heavily toward decoder-only models for most applications, partly because they are simpler to scale and partly because instruction tuning and in-context learning have made them surprisingly effective at tasks traditionally associated with encoder-decoder models. Still, for production systems that demand maximum quality on a specific transformation task, encoder-decoder architectures often retain an edge.
From the original seq2seq model to modern transformers, the encoder-decoder pattern has proven to be one of the most enduring and adaptable architectural ideas in deep learning. Understanding it is essential for understanding the entire landscape of modern AI.
