TensorFlow’s MirroredStrategy is, on a single multi-GPU machine, a more efficient way to distribute training than the older ParameterServerStrategy: rather than routing weights through a central set of servers, it replicates the model on every GPU and keeps the replicas synchronized directly.
Let’s see MirroredStrategy in action. Imagine you have a simple Keras model and want to train it on two GPUs.
```python
import tensorflow as tf

# Define a simple Keras model
def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    return model

# Create a MirroredStrategy
strategy = tf.distribute.MirroredStrategy()
print(f"Number of devices: {strategy.num_replicas_in_sync}")

# Build and compile the model within the strategy's scope
with strategy.scope():
    model = build_model()
    optimizer = tf.keras.optimizers.Adam()
    model.compile(optimizer=optimizer,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

# Load and normalize MNIST
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
y_train = y_train.astype('int32')

# Train the model; the batch is split across the replicas
model.fit(x_train, y_train, epochs=5, batch_size=32)
```
When you run this, TensorFlow automatically splits each input batch across the available GPUs and mirrors (replicates) the model’s variables on every one of them. MirroredStrategy creates a replica of your model on each GPU. During the forward pass, each replica processes its slice of the input batch, and gradients are then computed independently on each replica. The crucial part is the synchronization: before the optimizer updates the weights, the gradients from all replicas are aggregated and averaged, so every replica applies the same update and all copies of the model stay in sync. The print(f"Number of devices: {strategy.num_replicas_in_sync}") line outputs the number of devices TensorFlow detected and is using for distribution.
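To make the batch-splitting concrete, here is a small sketch (the dataset contents and sizes are illustrative, not from the example above) of how a global batch is divided among replicas:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

# The batch_size passed to model.fit is the *global* batch size;
# each replica sees roughly global_batch // num_replicas examples per step.
GLOBAL_BATCH = 32
per_replica_batch = GLOBAL_BATCH // strategy.num_replicas_in_sync

# For custom training loops, the same split is done explicitly by
# distributing a tf.data.Dataset across the replicas:
dataset = tf.data.Dataset.from_tensor_slices(tf.zeros([128, 784])).batch(GLOBAL_BATCH)
dist_dataset = strategy.experimental_distribute_dataset(dataset)
```

With a single device (e.g., CPU only) the strategy has one replica and the "split" is trivial, which makes this easy to test locally before moving to a multi-GPU machine.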
The core problem MirroredStrategy solves is efficiently scaling training across multiple processing units without the overhead of explicit communication for every single operation. Unlike older methods where a central parameter server could become a bottleneck, MirroredStrategy aggregates gradients with an all-reduce directly between the GPUs, with no central server in the loop. This minimizes communication latency and maximizes GPU utilization. The strategy.scope() context manager is key: any Keras objects (like the model, optimizer, and their variables) created within this scope are automatically distributed according to the strategy.
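A quick way to see the scope at work is to create one variable inside it and one outside (a minimal sketch; the values are arbitrary). The variable created under the scope is managed by the strategy and mirrored to every replica, while the one created outside is an ordinary single-device variable:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    v = tf.Variable(1.0)  # created under the strategy: mirrored to every replica

w = tf.Variable(1.0)      # created outside the scope: a plain, single-device variable

# Both still behave like ordinary variables in eager code:
print(v.numpy(), w.numpy())
```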
The ParameterServerStrategy (and its asynchronous variant) was designed for distributed training across multiple machines, often with a dedicated set of "parameter servers" that hold the model’s weights. Workers would fetch weights from the servers, compute gradients, and push those gradients back. This is powerful for massive models and clusters but introduces significant communication overhead. MirroredStrategy is optimized for a single machine with multiple GPUs, where communication between GPUs is much faster. It achieves data parallelism by replicating the entire model on each device and synchronously averaging gradients.
The surprising efficiency of MirroredStrategy comes from its implicit collective operations. When gradients are calculated on each GPU, TensorFlow averages them with an all-reduce, exposed at the API level as tf.distribute.Strategy.reduce with tf.distribute.ReduceOp.SUM or tf.distribute.ReduceOp.MEAN. On NVIDIA hardware this is typically implemented with NCCL (NVIDIA Collective Communications Library), which is extremely fast for intra-node GPU communication. The optimizer then applies the averaged gradient to the local copy of the weights on each GPU, ensuring consistency.
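As a rough sketch of what happens under the hood, a minimal custom training step can run a per-replica gradient computation with strategy.run and average the results with strategy.reduce. The loss and variable here are illustrative, not part of the MNIST example:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    w = tf.Variable(2.0)  # a single "weight", mirrored to each replica

@tf.function
def train_step(x):
    def replica_fn(x):
        with tf.GradientTape() as tape:
            loss = w * x  # toy loss; d(loss)/dw = x
        return tape.gradient(loss, w)
    # Run the computation on every replica...
    per_replica_grads = strategy.run(replica_fn, args=(x,))
    # ...then average the per-replica gradients, as MirroredStrategy
    # does implicitly when you use model.fit.
    return strategy.reduce(tf.distribute.ReduceOp.MEAN, per_replica_grads, axis=None)

grad = train_step(tf.constant(3.0))
```

Because the toy loss is w * x, the reduced gradient is simply the input value, which makes the synchronization easy to verify by hand.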
A common pitfall is not placing model and optimizer creation within the strategy.scope(). If you create your model or optimizer outside this context, they won’t be aware of the distribution strategy, and training will likely occur on only one device (the CPU or the first GPU), negating the benefits of the multi-GPU setup. The optimizer itself is also replicated, and its apply_gradients method is orchestrated by the strategy to update all replicas correctly.
If you intend to distribute training across multiple machines, you would typically use MultiWorkerMirroredStrategy or ParameterServerStrategy (though MultiWorkerMirroredStrategy is generally preferred for its synchronous nature and ease of use).
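For reference, MultiWorkerMirroredStrategy discovers its peers through the TF_CONFIG environment variable. A sketch for a hypothetical two-worker cluster follows; the host names and port are placeholders, and the strategy construction is left commented out because it would attempt to contact the (nonexistent) peer:

```python
import json
import os

# Every worker sets the same "cluster" section and its own "task" index.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["worker0.example.com:12345", "worker1.example.com:12345"]},
    "task": {"type": "worker", "index": 0},  # this process is worker 0
})

# Each worker then builds the strategy identically:
# import tensorflow as tf
# strategy = tf.distribute.MultiWorkerMirroredStrategy()
```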