While encoder-only models (BERT) excel at understanding and decoder-only models (GPT) dominate generation, encoder-decoder models combine both to handle tasks where you need to process an input and produce a structurally different output. Machine translation, summarization, question answering, and text reformulation all fit naturally into this architecture. T5, BART, and their successors represent the most powerful expression of this approach.

How Encoder-Decoder Models Work

An encoder-decoder Transformer has two distinct stacks. The encoder processes the full input with bidirectional attention, creating rich contextual representations. The decoder generates output tokens autoregressively, using causal attention for its own outputs and cross-attention to attend to the encoder's representations.

Cross-attention is the critical bridge between the two components. In each decoder layer, after the self-attention operation, a cross-attention layer allows the decoder to attend to all encoder positions. The queries come from the decoder's current representation, while the keys and values come from the encoder's output. This mechanism lets the decoder dynamically focus on relevant parts of the input when generating each output token.
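The query/key/value flow just described can be sketched in a few lines of NumPy. This is a single-head toy illustration under assumed dimensions and random weights, not a full Transformer layer: the point is only that queries come from the decoder while keys and values come from the encoder.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, Wq, Wk, Wv):
    """Single-head cross-attention sketch.

    Queries come from the decoder's current representations; keys and
    values come from the encoder's output, so every decoder position
    can attend over every input position.
    """
    Q = decoder_states @ Wq                    # (tgt_len, d_k)
    K = encoder_states @ Wk                    # (src_len, d_k)
    V = encoder_states @ Wv                    # (src_len, d_k)
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # (tgt_len, src_len)
    weights = softmax(scores, axis=-1)         # attention over input positions
    return weights @ V, weights

# Toy shapes: 3 decoder positions attending over 5 encoder positions.
rng = np.random.default_rng(0)
d_model = 8
enc = rng.normal(size=(5, d_model))
dec = rng.normal(size=(3, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out, attn = cross_attention(dec, enc, Wq, Wk, Wv)
print(out.shape, attn.shape)  # (3, 8) (3, 5)
```

Note that the attention matrix has one row per decoder position and one column per encoder position, and each row sums to 1: each generated token distributes its attention across the whole input.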

"The encoder-decoder architecture is the natural choice when your task involves transforming an input into a structurally different output -- the encoder understands, the decoder generates, and cross-attention connects them."

T5: Text-to-Text Transfer Transformer

Google's T5 introduced an elegantly simple idea: frame every NLP task as a text-to-text problem. Translation? Input: "translate English to French: Hello" → output: "Bonjour". Summarization? Input: "summarize: [long text]" → output: "[summary]". Classification? Input: "classify: I love this movie" → output: "positive".

This unified framing allowed T5 to be trained on a diverse mixture of tasks with a single model architecture and a single training objective: generate the correct text output for each input. The model learned to distinguish between tasks through the text prefix, making it remarkably flexible.
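The prefix mechanism is simple enough to sketch directly. The helper and prefix strings below are illustrative (they mirror the spirit of T5's conventions rather than reproducing its exact task set):

```python
# Hypothetical task-prefix table in the style of T5; the exact strings
# are illustrative, not T5's full set.
TASK_PREFIXES = {
    "translate_en_fr": "translate English to French: ",
    "summarize": "summarize: ",
    "classify": "classify: ",
}

def frame(task, text):
    """Frame any task as plain text input for a single seq2seq model.

    The model distinguishes tasks purely through the prefix; even
    classification labels come back as generated text (e.g. "positive").
    """
    return TASK_PREFIXES[task] + text

print(frame("translate_en_fr", "Hello"))
# translate English to French: Hello
```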

T5 came with an extensive ablation study examining architectural choices, pre-training objectives, training data, and scaling. This study, published as "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer," became one of the most cited references for Transformer design decisions. Key findings included:

  • Encoder-decoder architectures outperform decoder-only models of the same size on many tasks.
  • The "span corruption" pre-training objective (masking contiguous spans of text and predicting them) performs as well as or better than masking individual tokens, while producing shorter targets and therefore cheaper training.
  • Pre-training data quality and diversity significantly impact downstream performance.
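The span-corruption objective itself is easy to illustrate. The sketch below is deterministic for clarity (T5 samples spans randomly, corrupting roughly 15% of tokens with a mean span length of 3); each corrupted span becomes one sentinel token in the input, and the target lists each sentinel followed by the tokens it replaced:

```python
def span_corrupt(tokens, spans):
    """T5-style span corruption sketch.

    `spans` is a sorted list of non-overlapping (start, end) index pairs;
    T5 samples these randomly, but fixed spans keep the example readable.
    """
    inp, tgt = [], []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        inp.extend(tokens[cursor:start])   # keep uncorrupted tokens
        inp.append(sentinel)               # one sentinel replaces the whole span
        tgt.append(sentinel)
        tgt.extend(tokens[start:end])      # target recovers the span's tokens
        cursor = end
    inp.extend(tokens[cursor:])
    tgt.append(f"<extra_id_{len(spans)}>") # final sentinel terminates the target
    return inp, tgt

tokens = "Thank you for inviting me to your party last week".split()
inp, tgt = span_corrupt(tokens, [(1, 3), (7, 8)])
print(" ".join(inp))  # Thank <extra_id_0> inviting me to your <extra_id_1> last week
print(" ".join(tgt))  # <extra_id_0> you for <extra_id_1> party <extra_id_2>
```

Because the target contains only the corrupted spans plus sentinels, it is much shorter than the full input, which is where the training-efficiency benefit comes from.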

Key Takeaway

T5's text-to-text framework demonstrated that a single model architecture, trained with a unified objective, can handle virtually any NLP task. This simplification was both elegant and practical.

BART: Denoising Sequence-to-Sequence Pre-training

Facebook's BART (Bidirectional and Auto-Regressive Transformers) took a different approach to pre-training. It corrupted input text with various noise functions -- token masking, token deletion, sentence permutation, and text infilling (replacing whole spans with a single mask token) -- and trained the model to reconstruct the original text.

This denoising objective gave BART particularly strong performance on generation tasks, especially summarization. By learning to reconstruct clean text from corrupted inputs, BART developed a deep understanding of text structure and coherence that translated directly to the ability to generate fluent, well-organized summaries.
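A couple of these noise functions can be sketched directly. This is a toy illustration of token masking, token deletion, and sentence permutation (BART's full recipe also includes text infilling and document rotation); the probabilities and `<mask>` symbol are illustrative, and the training target is always the original, uncorrupted text:

```python
import random

def bart_noise(tokens, mask_prob=0.3, delete_prob=0.1, seed=0):
    """Sketch of two BART-style noise functions over a token list."""
    rng = random.Random(seed)
    noised = []
    for tok in tokens:
        r = rng.random()
        if r < delete_prob:
            continue                 # token deletion: drop the token entirely
        elif r < delete_prob + mask_prob:
            noised.append("<mask>")  # token masking: replace with a mask symbol
        else:
            noised.append(tok)       # leave the token untouched
    return noised

def permute_sentences(sentences, seed=0):
    """Sentence permutation: shuffle sentence order; the model must restore it."""
    rng = random.Random(seed)
    out = list(sentences)
    rng.shuffle(out)
    return out

corrupted = bart_noise("the cat sat on the mat".split(), seed=42)
print(corrupted)  # some tokens masked or dropped; the model reconstructs the original
```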

mBART: Multilingual BART

mBART extended BART's denoising approach to 25 languages, using a single model with a shared vocabulary. This multilingual pre-training proved remarkably effective for machine translation, particularly for low-resource language pairs where training data is scarce. mBART demonstrated that cross-lingual transfer learning in encoder-decoder models could significantly improve translation quality.

Encoder-Decoder vs Decoder-Only: When to Choose What

The rise of large decoder-only models has raised a natural question: when should you still use an encoder-decoder architecture? The answer depends on your task and constraints.

Choose Encoder-Decoder When:

  • The input and output are structurally different: Translation, summarization, and reformulation benefit from the encoder's ability to build a complete representation before the decoder starts generating.
  • You need efficiency: For a given parameter budget, encoder-decoder models often match or outperform decoder-only models on specific tasks. The encoder processes the input once with bidirectional attention, and the decoder attends to that fixed representation through cross-attention rather than carrying the full input inside its own growing context.
  • Your task is well-defined: If you know the input-output mapping and can fine-tune on task-specific data, encoder-decoder models provide excellent performance in a compact form.

Choose Decoder-Only When:

  • You need generality: Decoder-only models handle a wider range of tasks without task-specific fine-tuning.
  • In-context learning is important: Few-shot prompting works more naturally with decoder-only models.
  • You want to scale: The largest and most capable models are all decoder-only, and the scaling laws for decoders are better understood.

Modern Applications and Legacy

Encoder-decoder models continue to power important applications. Google Translate uses encoder-decoder architectures. Many production summarization systems use BART or Pegasus. Whisper, OpenAI's speech recognition model, uses an encoder-decoder design where the encoder processes audio and the decoder generates transcription text.

While decoder-only models dominate the headlines, the encoder-decoder architecture remains a vital part of the Transformer ecosystem. For tasks that require transforming structured input into structured output, it offers the most natural and often the most efficient solution.

Key Takeaway

Encoder-decoder models remain the best choice for structured transformation tasks like translation and summarization. While decoder-only models are more general, encoder-decoder architectures offer superior efficiency and performance when the task fits their strengths.