A recurrent neural network, particularly an LSTM, doesn’t learn temporal dependencies by looking at the past; it learns by remembering the past.
Let’s see this in action. Imagine a simple noisy sine wave as our time series data. We’ll generate 10,000 points, slice them into fixed-length windows, split those into training and testing sets, and then feed them into an LSTM.
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Generate synthetic time series data (sine wave)
time = np.arange(0, 1000, 0.1)
data = np.sin(time) + np.random.normal(scale=0.1, size=len(time))
# Scale data
scaler = MinMaxScaler(feature_range=(-1, 1))
data_scaled = scaler.fit_transform(data.reshape(-1, 1))
# Create sequences
def create_sequences(data, seq_length):
    xs, ys = [], []
    for i in range(len(data) - seq_length):
        x = data[i:(i + seq_length), 0]
        y = data[i + seq_length, 0]
        xs.append(x)
        ys.append(y)
    return np.array(xs), np.array(ys)
SEQ_LENGTH = 50
X, y = create_sequences(data_scaled, SEQ_LENGTH)
# Split data chronologically; shuffling a time series here would leak
# overlapping future windows into the training set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
# Reshape for LSTM [samples, timesteps, features]
X_train = X_train.reshape((X_train.shape[0], X_train.shape[1], 1))
X_test = X_test.reshape((X_test.shape[0], X_test.shape[1], 1))
# Build LSTM model
model = Sequential([
    LSTM(units=50, return_sequences=True, input_shape=(SEQ_LENGTH, 1)),
    Dropout(0.2),
    LSTM(units=50, return_sequences=False),
    Dropout(0.2),
    Dense(units=25),
    Dense(units=1)
])
model.compile(optimizer='adam', loss='mean_squared_error')
# Train model
history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.1, verbose=1)
# Plot training history
plt.figure(figsize=(12, 6))
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss During Training')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()
# Make predictions
y_pred_scaled = model.predict(X_test)
# Inverse transform predictions
y_pred = scaler.inverse_transform(y_pred_scaled)
y_test_original = scaler.inverse_transform(y_test.reshape(-1, 1))
# Plot predictions vs actual
plt.figure(figsize=(14, 7))
plt.plot(y_test_original, label='Actual Values')
plt.plot(y_pred, label='Predicted Values')
plt.title('LSTM Prediction vs Actual Values')
plt.xlabel('Time Steps (Test Set)')
plt.ylabel('Value')
plt.legend()
plt.show()
This code trains an LSTM to predict the next value in a time series based on a fixed window of past values (our SEQ_LENGTH). The LSTM layer itself is the core component. It has internal states (cell state and hidden state) that are updated at each time step. These states act as the "memory" that allows the network to capture long-term dependencies. The return_sequences=True argument in the first LSTM layer is crucial when stacking LSTMs, as it ensures that the output at each time step is passed to the next layer, not just the final output. The Dropout layers help prevent overfitting by randomly setting a fraction of neuron outputs to zero during training.
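To make the return_sequences distinction concrete, here is a minimal sketch (with arbitrary batch and layer sizes, not taken from the model above) that simply prints the output shape of an LSTM layer in both modes:

```python
import numpy as np
from tensorflow.keras.layers import LSTM

# A dummy batch shaped [samples, timesteps, features]
batch = np.random.rand(4, 50, 1).astype("float32")

seq_layer = LSTM(units=50, return_sequences=True)    # full sequence out
last_layer = LSTM(units=50, return_sequences=False)  # only the final step

print(seq_layer(batch).shape)   # (4, 50, 50): one 50-dim vector per time step
print(last_layer(batch).shape)  # (4, 50): just the final time step's output
```

A stacked LSTM needs the full per-step sequence as its input, which is why the first layer in the model uses return_sequences=True while the second, which feeds the Dense head, does not.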
The problem LSTMs solve is the vanishing gradient problem that plagues simpler recurrent neural networks when dealing with long sequences. Traditional RNNs struggle to propagate gradient information over many time steps, effectively "forgetting" early inputs. LSTMs, with their gating mechanisms (input, forget, and output gates) and a cell state, are designed to selectively remember or forget information, allowing them to learn dependencies that span much longer periods. The cell state acts as a conveyor belt, carrying relevant information across time steps with minimal degradation.
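The gating arithmetic is simpler than it sounds. Here is a pure-NumPy sketch of a single LSTM cell step with randomly initialized (untrained) weights, just to show how the forget, input, and output gates combine with the cell state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    # W: input weights (4*units, features), U: recurrent weights (4*units, units)
    z = W @ x + U @ h_prev + b
    units = h_prev.shape[0]
    f = sigmoid(z[0:units])             # forget gate: what to drop from c_prev
    i = sigmoid(z[units:2 * units])     # input gate: how much new info to write
    g = np.tanh(z[2 * units:3 * units]) # candidate values to write
    o = sigmoid(z[3 * units:])          # output gate: what to expose as h
    c = f * c_prev + i * g              # cell state: the "conveyor belt"
    h = o * np.tanh(c)                  # hidden state: gated view of the cell
    return h, c

rng = np.random.default_rng(0)
units, features = 8, 1
W = rng.normal(size=(4 * units, features))
U = rng.normal(size=(4 * units, units))
b = np.zeros(4 * units)

h = np.zeros(units)
c = np.zeros(units)
for x_t in rng.normal(size=(50, features)):  # walk a 50-step sequence
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h.shape, c.shape)  # both remain (8,) regardless of sequence length
```

Because the cell state update is additive (f * c_prev + i * g) rather than a repeated matrix squashing, gradients can flow along it over many steps without vanishing the way they do in a plain RNN.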
The key levers you control are SEQ_LENGTH (how much past data the model sees at once), the number of LSTM units (the complexity and capacity of the memory), the number of LSTM layers (for deeper temporal feature extraction), Dropout rates (for regularization), the optimizer (how weights are updated), and the epochs (how long training runs).
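The units lever has a direct, calculable cost. Each of the four gates owns its input weights, recurrent weights, and biases, so an LSTM layer's parameter count is 4 * (units * (units + features) + units), growing roughly quadratically in units. A quick sketch:

```python
def lstm_param_count(units, features):
    # 4 gates, each with: input weights (units x features),
    # recurrent weights (units x units), and a bias vector (units)
    return 4 * (units * (units + features) + units)

print(lstm_param_count(50, 1))   # 10400: the first layer of the model above
print(lstm_param_count(100, 1))  # 40800: doubling units ~quadruples parameters
```

This matches what model.summary() reports for the first LSTM layer, and is worth checking before reaching for more units, since capacity that the data cannot support just feeds overfitting.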
A common misconception is that LSTMs explicitly store past values. They don’t. Instead, they learn to adjust their internal state (the cell state and hidden state) in such a way that, when combined with the current input, they produce an output that reflects the relevant historical patterns. The "memory" is a distributed representation within the network’s weights and states, not a direct lookup table of past observations.
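One quick way to see this: the state has a fixed size set by units, not by how much history the network has seen. A small check (standalone, with an arbitrary 50-unit layer) using Keras's return_state to expose the hidden and cell states:

```python
import numpy as np
from tensorflow.keras.layers import LSTM

layer = LSTM(units=50, return_state=True)  # returns [output, h, c]
for steps in (10, 500):
    _, h, c = layer(np.random.rand(1, steps, 1).astype("float32"))
    print(steps, h.shape, c.shape)  # state stays (batch, units) either way
```

Whether the layer reads 10 steps or 500, the entire "memory" is compressed into those 50 numbers per state vector, which is exactly why it is a learned summary rather than a lookup table.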
The next concept to explore is how to handle multivariate time series with LSTMs, where you have multiple related sequences influencing each other.