Seq2Seq and Encoder-Decoder Models
2025-06-24
Back in December, Sequence to Sequence Learning with Neural Networks, by Ilya Sutskever, Oriol Vinyals, and Quoc Le, was awarded one of the 2024 NeurIPS Test of Time awards. For anyone who isn’t steeped in academic research, Test of Time awards are typically given out to papers that have had a significant and lasting impact on their field (usually awarded at least 10 years after publication), and this one certainly deserves that recognition.
Ilya gave a retrospective talk at NeurIPS reflecting on the 10 years since the paper came out: what they got right, what they got wrong, and how it has played a role in this crazy decade of progress in NLP and AI. It’s probably not a stretch to say that many of the capabilities of modern AI can trace their roots back to key contributions made in this work.
Ironically, I hadn’t actually read the original paper until I watched a recording of that talk. Despite my ignorance, I feel that this paper serves as a useful starting point for understanding many of the ideas behind modern AI systems.
In this post, I’ll unpack the core concepts behind sequence-to-sequence models and the encoder-decoder architecture popularized1 by the paper. This design has led to major advancements in machine translation, text generation, and LLMs, and I think it’s super helpful to revisit and understand how it works.
Sequence to Sequence Models
Sequence to sequence (seq2seq) models are designed to convert an input sequence into an output sequence. This is an extremely useful ability, since many real-world problems/tasks involve sequences of variable and/or unknown length. Classic examples include machine translation (input a sequence of words in one language and output a sequence of words in another) and question answering (input a question and output the answer in natural language).
By 2014, Deep Neural Networks (DNNs) were already quite good at problems like speech recognition and image classification, which notably do not typically involve sequences unless you frame the problem in a non-standard way. Meanwhile, Recurrent Neural Networks (RNNs), especially advanced variants like LSTMs, could handle sequence to sequence generation when the inputs and outputs have the same length. But these models fell short when input and output lengths could differ, a limitation that motivated the architecture now known as the Encoder-Decoder model.
The Encoder-Decoder Architecture
At a high level, the Encoder-Decoder model (as described in the paper, though the idea has generalized far beyond it by now) consists of two RNNs: an encoder RNN and a decoder RNN. The original paper actually used multilayered LSTMs, but I’ll refer to them as RNNs for simplicity and without loss of generality.
The encoder takes the embedded input sequence and processes it one token at a time. After consuming the entire input, it outputs a single, fixed-size “context vector” that represents a compressed version of the entire input. In the case of machine translation, this context vector would ideally capture all of the semantic meaning of the source sentence that you would want represented in its translation.
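To make this concrete, here is a minimal PyTorch sketch of an encoder. The layer sizes, variable names, and batch-first tensor layout are assumptions of my own, not details from the paper: an embedding layer feeds an LSTM, and the LSTM’s final hidden and cell states serve as the fixed-size context vector.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, src_tokens):
        # src_tokens: (batch, src_len) integer token ids
        embedded = self.embedding(src_tokens)      # (batch, src_len, embed_dim)
        _, (hidden, cell) = self.lstm(embedded)    # consume the whole input sequence
        # The final hidden and cell states act as the fixed-size context vector.
        return hidden, cell
```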
The decoder then takes the context vector from the encoder and uses it as a starting “seed” to generate the output sequence. At each time step, the decoder generates one token, using the previously generated token as input. Crucially, the decoder relies only on information from the context vector. It doesn’t have direct access to the original input sequence, and has no concept of its length or structure. This is part of what allows the model to handle input and output sequences of different lengths, and much of the magic lies in how well that context vector captures the input.
Each decoder hidden state is passed through a linear layer that maps it to a vector of logits, which a softmax then turns into a probability distribution over the output vocabulary. The output token is then chosen by taking the highest-probability token (or by sampling from this distribution).
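Continuing the sketch from above (again, the hyperparameters and names are my own), the decoder is another LSTM initialized with the encoder’s context vector, followed by a linear layer that maps each hidden state to logits over the output vocabulary. The softmax itself is applied by the loss function during training and by the argmax/sampling step at inference.

```python
class Decoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)  # hidden state -> logits

    def forward(self, prev_tokens, hidden, cell):
        # prev_tokens: (batch, steps) previously generated (or ground-truth) tokens
        embedded = self.embedding(prev_tokens)                        # (batch, steps, embed_dim)
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))  # (batch, steps, hidden_dim)
        logits = self.out(output)                                     # (batch, steps, vocab_size)
        return logits, hidden, cell
```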
SOS and EOS
You might notice the SOS and EOS tokens in Figure 1. These are special tokens that represent “start of sentence” and “end of sentence”, respectively. During decoding, the model begins generating the output sequence when it sees SOS, and stops when it produces an EOS. It learns this behavior during training, but at inference time, it is explicitly fed SOS as the first decoder input and hard-coded to stop once EOS is generated. The original paper used EOS for both purposes, but using distinct SOS and EOS tokens is the more common modern practice.
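As a sketch of what this looks like at inference time (sos_id, eos_id, and max_len are placeholders I’ve introduced, not details from the paper), a greedy decoding loop seeds the decoder with SOS, feeds back its own predictions, and stops once EOS appears or a length cap is hit. It builds on the hypothetical Encoder and Decoder classes from earlier.

```python
@torch.no_grad()
def greedy_decode(encoder, decoder, src_tokens, sos_id, eos_id, max_len=50):
    hidden, cell = encoder(src_tokens)  # fixed-size context vector
    prev = torch.full((src_tokens.size(0), 1), sos_id, dtype=torch.long)
    generated = []
    for _ in range(max_len):
        logits, hidden, cell = decoder(prev, hidden, cell)
        prev = logits.argmax(dim=-1)          # greedy pick (could sample instead)
        generated.append(prev)
        if (prev == eos_id).all():            # stop once every sequence has emitted EOS
            break
    return torch.cat(generated, dim=1)        # (batch, generated_len)
```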
A neat trick: Reversing the input sequence
The paper also introduces a clever trick where you actually reverse the input sequence, so that the beginning of the source sentence appears closer to the beginning of the translated sentence when the decoder starts generating tokens. The idea is that if you keep the original order, the early parts of the source sentence might be too far from the decoder and “forgotten” by the time the encoder outputs the context vector. The authors found that this led to empirically better results in their translation benchmarks, but this trick isn’t used as much these days since long-range dependencies are better handled through attention mechanisms.
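In code, the trick amounts to flipping the time axis of the source before it reaches the encoder. A minimal sketch using the hypothetical encoder from earlier (with padded batches you would reverse only the non-pad positions, which I omit here):

```python
# Reverse the source along the time dimension before encoding.
src_reversed = src_tokens.flip(dims=[1])   # (batch, src_len), tokens in reverse order
hidden, cell = encoder(src_reversed)
```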
Training and Inference
During training, the encoder processes example input sequences and the decoder learns to generate correct output sequences (e.g., correctly translated sentences). Importantly, during training, the decoder does not actually use its predicted previous token as input to the next time step (contrary to what I drew in Figure 1). Instead, it uses the ground truth previous token, which helps with training stability. Today, this is known as “teacher forcing”. The decoder’s predicted output at each time step is compared to the ground truth token using cross-entropy loss, and all the weights of the end-to-end model are updated with backpropagation through time in standard RNN fashion.
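Here is a sketch of one teacher-forced training step, built on the hypothetical encoder and decoder from earlier (the right-shift convention, the optimizer handling, and the lack of padding masks are all simplifications of my own): the decoder is fed the ground-truth target shifted right by one position, and cross-entropy is computed against the target itself.

```python
def train_step(encoder, decoder, optimizer, src_tokens, tgt_tokens, sos_id):
    # tgt_tokens: (batch, tgt_len) ground-truth output, ending in EOS
    optimizer.zero_grad()
    hidden, cell = encoder(src_tokens)

    # Teacher forcing: the decoder sees SOS plus the ground truth shifted right,
    # rather than its own previous predictions.
    sos = torch.full((tgt_tokens.size(0), 1), sos_id, dtype=torch.long)
    decoder_input = torch.cat([sos, tgt_tokens[:, :-1]], dim=1)

    logits, _, _ = decoder(decoder_input, hidden, cell)     # (batch, tgt_len, vocab_size)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),                # (batch * tgt_len, vocab_size)
        tgt_tokens.reshape(-1),                             # (batch * tgt_len,)
    )
    loss.backward()       # backpropagation through time
    optimizer.step()
    return loss.item()
```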
During inference, the model of course does not know the ground truth output. So the decoder uses its predicted output in the previous time step as input for the next, even if it might be incorrect.
And that’s pretty much it for the basics of seq2seq models and the encoder-decoder architecture! While the results of this paper were very impressive at the time, you can imagine that a single fixed-size context vector to represent an entire input sequence might not be enough, especially for very long or complex sentences. One solution to this is a (now revolutionary) mechanism called attention, which I’ll discuss in a future post.
While this paper really helped popularize the encoder-decoder model thanks to its strong results and use of LSTMs, it never actually refers to the architecture by that name. In fact, the encoder-decoder idea was introduced earlier by Cho et al. in Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation and also by Kalchbrenner and Blunsom in Recurrent Continuous Translation Models, which the authors cite extensively.↩︎