Recurrent Neural Networks (RNNs) and LSTMs, Explained

A regular neural network has the memory of a goldfish. Show it an image of a cat and it’ll confidently say “cat” — but it has no idea what you showed it a second ago, and frankly it doesn’t care. That’s fine for photos. It’s a disaster for anything where order matters: a sentence, a stock chart, a melody, your heart rate over an afternoon. For those, the question isn’t just “what is this input?” but “what is this input, given everything that came before it?”

That “given everything before” is the whole reason recurrent neural networks exist.

Sequence data and the hidden state

A recurrent neural network (RNN) processes a sequence one step at a time while carrying a little summary of the past along with it. That summary is the hidden state — a vector that gets updated at every step and fed back into the network for the next one. Think of it as the network muttering notes to itself as it reads: “okay, subject was plural, verb should agree, we’re mid-clause…”

Conceptually, at each timestep t:

h_t = tanh(W_h @ h_prev + W_x @ x_t + b)   # new hidden state
y_t = W_y @ h_t + b_y                        # output for this step

The same weights (W_h, W_x) are reused at every step — that’s the “recurrent” part. This weight-sharing is what lets an RNN handle sequences of any length with a fixed, modest number of parameters. Feed it words and it can predict the next one. Feed it sensor readings and it can forecast tomorrow’s.
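To make the loop concrete, here is a minimal NumPy sketch of the two equations above. The function name `rnn_forward`, the layer sizes, the random weights and the all-zero starting hidden state are assumptions chosen for illustration, not part of any particular library.

```python
import numpy as np

def rnn_forward(xs, W_h, W_x, b, W_y, b_y):
    """Run a vanilla RNN over a sequence xs of shape (T, input_dim)."""
    h = np.zeros(W_h.shape[0])            # assumed starting point: h_0 = 0
    hidden_states, outputs = [], []
    for x_t in xs:                        # the same weights are reused at every step
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        hidden_states.append(h)
        outputs.append(W_y @ h + b_y)
    return np.stack(hidden_states), np.stack(outputs)

# Hypothetical sizes and random weights, purely for demonstration.
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 3, 5, 2
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
b = np.zeros(hidden_dim)
W_y = rng.normal(scale=0.5, size=(output_dim, hidden_dim))
b_y = np.zeros(output_dim)

# One set of weights handles a 4-step and a 40-step sequence alike.
short_hs, short_ys = rnn_forward(rng.normal(size=(4, input_dim)), W_h, W_x, b, W_y, b_y)
long_hs, long_ys = rnn_forward(rng.normal(size=(40, input_dim)), W_h, W_x, b, W_y, b_y)
print(short_ys.shape, long_ys.shape)
```

Note that the parameter count never changes between the two calls; only the number of loop iterations does.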

The vanishing-gradient problem

Beautiful in theory. Then you try to teach it.

RNNs learn via backpropagation through time: you unroll the loop across all timesteps and push the error gradient backwards through every one of them. And here’s the catch — that gradient gets multiplied by the same weight matrix again and again as it travels back. Multiply a number smaller than one by itself fifty times and it shrinks to a rounding error. Multiply a number larger than one and it explodes.
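The arithmetic is easy to check for yourself. The sketch below uses a single scalar as a stand-in for the per-step factor; in a real RNN that factor is a matrix product involving W_h and the tanh derivative, so treat this as a simplified illustration. The helper name `repeated_factor` is made up for the example.

```python
def repeated_factor(factor, steps):
    """Scale a gradient of 1.0 by the same factor once per timestep."""
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad

shrunk = repeated_factor(0.9, 50)   # roughly 0.005
grown = repeated_factor(1.1, 50)    # roughly 117

print(f"0.9 applied fifty times: {shrunk:.4f}")
print(f"1.1 applied fifty times: {grown:.1f}")
```

A factor only ten percent away from one, in either direction, is enough to change the gradient by several orders of magnitude over fifty steps.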

The shrinking case is the famous vanishing-gradient problem, analysed by Hochreiter in 1991 and by Bengio, Simard and Frasconi in 1994, a few years before the LSTM arrived. The practical consequence: by the time the gradient crawls back to early timesteps, it’s so faint that the network simply can’t learn long-range dependencies. Ask a vanilla RNN to connect “The cat, which had wandered in from the garden hours earlier, was…” and it has usually forgotten the cat by the time it reaches the verb. Short memory, all over again.

LSTMs and GRUs: gates to the rescue

The fix was to stop forcing information through a tanh blender at every step and instead give the network a cell state — a kind of conveyor belt that carries information across many timesteps with only light, deliberate edits.

The Long Short-Term Memory (LSTM) network, introduced by Hochreiter and Schmidhuber in 1997 and extended with a forget gate by Gers, Schmidhuber and Cummins a few years later, manages that conveyor belt with three learnable gates:

a forget gate that decides what to drop from the cell state,
an input gate that decides what new information to write, and
an output gate that decides what to expose as the hidden state.

Because the cell state is mostly added to rather than repeatedly multiplied, gradients can flow back across long spans without vanishing nearly as fast. That’s the entire trick, and it’s why LSTMs were the default choice for sequence modelling until transformers arrived.
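Here is one way the three gates and the cell state might fit together in NumPy. This is a sketch of the commonly taught LSTM formulation, not the internals of any specific framework; the stacked weight layout, the function name `lstm_step` and the hand-picked biases are all assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W: (4 * hidden, hidden + input), b: (4 * hidden,)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0 * hidden:1 * hidden])   # forget gate: what to drop
    i = sigmoid(z[1 * hidden:2 * hidden])   # input gate: what to write
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate: what to expose
    g = np.tanh(z[3 * hidden:4 * hidden])   # candidate values to write
    c = f * c_prev + i * g                  # cell state: scaled, then added to
    h = o * np.tanh(c)                      # hidden state read off the cell
    return h, c

# Hand-set (not learned) biases: forget gate nearly open, input gate nearly shut.
hidden, input_dim = 4, 3
W = np.zeros((4 * hidden, hidden + input_dim))
b = np.zeros(4 * hidden)
b[0:hidden] = 10.0
b[hidden:2 * hidden] = -10.0

c0 = np.array([1.0, -2.0, 0.5, 3.0])
h, c = np.zeros(hidden), c0.copy()
for _ in range(100):
    h, c = lstm_step(np.ones(input_dim), h, c, W, b)

print("largest drift in cell state after 100 steps:", np.max(np.abs(c - c0)))
```

With the gates set this way, the cell state survives a hundred steps almost untouched, which is the conveyor-belt behaviour in miniature. A trained network learns when to open and close those gates rather than having them fixed.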

The Gated Recurrent Unit (GRU), proposed by Cho et al. in 2014, is the LSTM’s leaner cousin: it merges the gates down to two (reset and update) and ditches the separate cell state. Fewer parameters, faster training, and often indistinguishable accuracy. Rule of thumb: try a GRU first; reach for an LSTM if you genuinely need the extra capacity.
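For comparison, a GRU step can be sketched the same way. The names `gru_step` and `count_params` are hypothetical, the weights are random, and the sign convention for the update gate differs between papers and libraries, so read this as one reasonable variant. The parameter count assumes one weight matrix and one bias per gate or candidate block; some implementations add extra bias terms.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step; each W has shape (hidden, hidden + input)."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx + b_z)               # update gate: how much to overwrite
    r = sigmoid(W_r @ hx + b_r)               # reset gate: how much past to consult
    h_cand = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]) + b_h)
    return (1.0 - z) * h_prev + z * h_cand    # no separate cell state

def count_params(hidden, input_dim, n_blocks):
    """Weights plus biases for n_blocks gate/candidate blocks."""
    return n_blocks * (hidden * (hidden + input_dim) + hidden)

rng = np.random.default_rng(0)
hidden, input_dim = 4, 3
shape = (hidden, hidden + input_dim)
W_z, W_r, W_h = (rng.normal(scale=0.5, size=shape) for _ in range(3))
b_z = b_r = b_h = np.zeros(hidden)

h = np.zeros(hidden)
for x_t in rng.normal(size=(10, input_dim)):
    h = gru_step(x_t, h, W_z, W_r, W_h, b_z, b_r, b_h)

gru_params = count_params(64, 128, n_blocks=3)    # 2 gates + candidate
lstm_params = count_params(64, 128, n_blocks=4)   # 3 gates + candidate
print(gru_params, lstm_params)
```

Under these assumptions a GRU carries three quarters of the parameters of an LSTM with the same hidden size, which is where the "leaner cousin" label comes from.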

In practice you rarely wire the gates by hand. Keras makes it almost insultingly easy:

from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding

model = Sequential([
    Embedding(input_dim=10000, output_dim=128),  # token IDs -> vectors
    LSTM(64, return_sequences=False),            # the gated magic
    Dense(1, activation="sigmoid"),              # e.g. sentiment score
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

Where they shone — and why transformers took over

For years this family ran the show: machine translation, speech recognition, autocomplete, handwriting generation, and time-series forecasting for everything from electricity demand to ECG signals. If your data had an order, an LSTM was the safe bet.

Then 2017 happened. Vaswani et al.’s “Attention Is All You Need” introduced the transformer, which threw out recurrence entirely in favour of self-attention — letting the model look at every position in a sequence at once instead of shuffling along one step at a time. Two things followed: training parallelised beautifully across modern GPUs (no more waiting for step 99 before starting step 100), and long-range dependencies became a direct lookup rather than a fragile relay race. That combination is what made today’s large language models possible, and it’s why transformers have largely superseded RNNs for serious NLP.
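To see what "every position at once" looks like, here is a stripped-down sketch of scaled dot-product self-attention in NumPy. It is a single head with no masking and no positional encoding, and the function name `self_attention`, the sizes and the random projection matrices are assumptions for illustration only.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over X of shape (T, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # every position scored against every other
    scores -= scores.max(axis=-1, keepdims=True)   # subtract row max for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # softmax over positions
    return weights @ V, weights

rng = np.random.default_rng(0)
T, d_model, d_head = 6, 8, 4
X = rng.normal(size=(T, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

There is no loop over timesteps here: the whole sequence is handled with a few matrix multiplications, which is what makes the computation easy to parallelise, and the weights matrix gives position 1 a direct line to position 6 without passing through anything in between.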

RNNs haven’t entirely left the building, mind you — they’re still lean and capable on small devices and short time-series problems, and researchers keep poking at efficient revivals like xLSTM and minimal GRUs. But for anything ambitious and text-shaped, attention won.

The takeaway

When you’ve got sequence data, ask one question first: do distant parts of the sequence need to talk to each other? If they do and the sequence is long, default to a transformer. If your sequences are short, your data is modest, or you’re squeezing onto constrained hardware, a GRU (or an LSTM) is still a perfectly sharp, far cheaper tool. RNNs taught machines to remember; transformers taught them to pay attention. Knowing which problem you actually have is what saves you a week of training the wrong model.
