TensorFlow’s GPU memory growth isn’t a magic "use less memory" switch; it simply tells TensorFlow to allocate GPU memory incrementally, as it is actually needed, instead of reserving nearly all of it upfront (the default behavior).

Here’s the core problem: by default, TensorFlow maps nearly all available GPU memory at startup. This is usually a bad idea, because other processes (your display driver, or other TensorFlow jobs) may need that memory. When TensorFlow grabs it all, or when an allocation needs a contiguous block that isn’t available, you get an "Out Of Memory" (OOM) error, even though total GPU usage isn’t at 100%.

Let’s fix that.

Common Causes and Fixes for TensorFlow GPU OOM Errors

  1. Default Behavior: Allocating All Memory

    • Diagnosis: You see ResourceExhaustedError: OOM when allocating tensor with shape... right at the start of your script, or after a few iterations.
    • Cause: Memory growth is off by default (in TF 1.x this was the allow_growth=False session option), so TensorFlow maps nearly all GPU memory at startup.
    • Fix: Configure TensorFlow to allow memory growth.
      import tensorflow as tf
      
      gpus = tf.config.experimental.list_physical_devices('GPU')
      if gpus:
          try:
              # Currently, memory growth needs to be the same across GPUs
              for gpu in gpus:
                  tf.config.experimental.set_memory_growth(gpu, True)
              logical_gpus = tf.config.experimental.list_logical_devices('GPU')
              print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
          except RuntimeError as e:
              # Memory growth must be set before GPUs have been initialized
              print(e)
      
    • Why it works: This tells TensorFlow to only allocate memory as needed, rather than reserving it all upfront. It will still grow the allocation, but it won’t claim everything at once, leaving room for other processes or future TensorFlow needs within the session.
  2. Batch Size Too Large for the Model

    • Diagnosis: OOM errors occur during training even with memory growth enabled, often partway through the first forward or backward pass, and the reported failing tensor shape includes your batch dimension.
    • Cause: Activation memory scales roughly linearly with batch size: every sample in a batch keeps its own intermediate activations alive until the backward pass. With a large model, a big batch can exceed VRAM even when the weights alone fit comfortably.
    • Fix: Reduce the batch size.
      # Example: instead of batch_size=128, try batch_size=32 or 16
      # Adjust your input pipeline:
      # train_dataset = train_dataset.batch(32)
      
    • Why it works: Halving the batch size roughly halves the activation memory needed per training step. If a smaller batch hurts convergence, gradient accumulation lets you recover the effective batch size: run several small forward/backward passes and apply the averaged gradients once.
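As a back-of-the-envelope check on how batch size drives activation memory, here is a pure-Python estimate (the layer output sizes below are made-up illustrative values, not from any real model):

```python
# Rough activation-memory estimate for one forward pass.
# Elements produced per sample by each layer (illustrative values).
layer_output_elements = [224 * 224 * 64, 112 * 112 * 128, 56 * 56 * 256]

def activation_bytes(batch_size, bytes_per_element=4):
    """Memory held for activations: scales linearly with batch size."""
    per_sample = sum(layer_output_elements)  # elements kept per sample
    return batch_size * per_sample * bytes_per_element

mb = 1024 * 1024
print(f"batch 16:  {activation_bytes(16) / mb:.0f} MiB")   # → batch 16:  343 MiB
print(f"batch 128: {activation_bytes(128) / mb:.0f} MiB")  # → batch 128: 2744 MiB
```

Eight times the batch costs eight times the activation memory, which is why shrinking the batch is usually the quickest OOM fix.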
  3. Memory Fragmentation

    • Diagnosis: You get OOM errors after your script has been running for a while, or after multiple training steps, even with allow_growth=True and a reasonable batch size. nvidia-smi shows significant free memory, but TensorFlow still fails.
    • Cause: TensorFlow’s memory allocator, while growing, can leave small, non-contiguous "holes" of free memory. When a large tensor needs to be allocated, and no single contiguous block is large enough, an OOM occurs.
    • Fix: Restart the Python kernel or the entire script. Alternatively, for more advanced control, set a per-process memory limit.
      import tensorflow as tf
      
      gpus = tf.config.experimental.list_physical_devices('GPU')
      if gpus:
          try:
              # Cap TensorFlow at a fixed slice of the GPU (memory_limit
              # is in MB), e.g. 12 GB out of a 16 GB card.
              tf.config.experimental.set_virtual_device_configuration(
                  gpus[0],
                  [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=12288)])  # 12GB
              # Note: set_memory_growth cannot be combined with a virtual
              # device configuration on the same GPU.
          except RuntimeError as e:
              print(e)
      
    • Why it works: Restarting clears TensorFlow’s internal allocator state, effectively defragmenting it. A fixed memory_limit (in MB) creates a logical device with a stable arena for the allocator to manage; TensorFlow stops trying to grow its reservation, which can reduce fragmentation within that cap and leaves headroom for other processes.
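If restarts only buy you time, newer TensorFlow builds (roughly 2.10+, with CUDA 11.2+) let you opt in to CUDA’s stream-ordered allocator via an environment variable. Treat this as an experiment rather than a guaranteed fix; set it before the process starts:

```shell
# Opt in to CUDA's stream-ordered (async) allocator, which is less
# prone to fragmentation than TensorFlow's default BFC allocator.
export TF_GPU_ALLOCATOR=cuda_malloc_async
# Then launch training in the same shell, e.g.:
# python train.py
```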
  4. Large Input Data or Intermediate Tensors

    • Diagnosis: OOM occurs when loading a large dataset, or when a specific layer in your model (e.g., a large embedding layer, or a convolution with many filters) is initialized.
    • Cause: The model itself, or the data pipeline, is trying to create tensors that are simply too large for the available GPU memory, even with growth enabled.
    • Fix:
      • Data: Use data generators that yield batches, or use tf.data.Dataset with prefetch and cache carefully to avoid loading everything into memory at once. Downsample or preprocess your data if possible.
      • Model:
        • Reduce the size of intermediate layers (e.g., fewer filters in CNNs, smaller embedding dimensions).
        • Use mixed-precision training (FP16) to halve the memory footprint of most tensors.
        # For mixed precision (Keras' model.fit applies loss scaling
        # automatically; for custom training loops, wrap your optimizer
        # in mixed_precision.LossScaleOptimizer):
        from tensorflow.keras import mixed_precision
        mixed_precision.set_global_policy('mixed_float16')
        
    • Why it works: Data generators stream data, preventing the entire dataset from occupying GPU memory. tf.data optimizations manage memory efficiently. Reducing model tensor sizes directly lowers the peak memory demand. Mixed precision uses 16-bit floating-point numbers, which consume half the memory of 32-bit floats for weights, activations, and gradients.
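The streaming idea behind data generators can be sketched in plain Python (the helper name batch_stream is made up for illustration; with TensorFlow you would typically feed such a generator to tf.data.Dataset.from_generator):

```python
def batch_stream(samples, batch_size):
    """Yield fixed-size batches lazily, so only one batch
    is materialized in memory at a time."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final partial batch
        yield batch

# Only the current batch lives in memory, however large `samples` is.
print(list(batch_stream(range(10), 4)))  # → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```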
  5. Multiple TensorFlow Processes/GPUs

    • Diagnosis: OOM errors occur when you run multiple training jobs simultaneously, or when using tf.distribute.Strategy across multiple GPUs.
    • Cause: Each process or each GPU (if not managed by a distribution strategy) tries to allocate its own memory, potentially exceeding the total available physical GPU memory.
    • Fix:
      • Multiple Jobs: Manually set CUDA_VISIBLE_DEVICES environment variable for each job to restrict it to specific GPUs.
      • Distribution Strategy: Ensure your strategy is configured correctly; if you need per-GPU caps, use a virtual device memory_limit, though set_memory_growth is usually the better default.
      # For job 1 on GPU 0
      CUDA_VISIBLE_DEVICES=0 python your_script_1.py
      
      # For job 2 on GPU 1
      CUDA_VISIBLE_DEVICES=1 python your_script_2.py
      
    • Why it works: CUDA_VISIBLE_DEVICES makes TensorFlow (and CUDA) only see and use the GPUs you specify, preventing contention for the same physical hardware. A correctly implemented distribution strategy manages memory allocation across the assigned GPUs.
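The same restriction can be applied from inside the script, as long as the variable is set before TensorFlow is imported (the value "0" below is just an example GPU index):

```python
import os

# Must run BEFORE `import tensorflow`: CUDA enumerates visible
# devices when the runtime is first initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"   # this process sees only GPU 0

# import tensorflow as tf   # would now list a single physical GPU
print(os.environ["CUDA_VISIBLE_DEVICES"])  # → 0
```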
  6. Outdated Drivers or TensorFlow Version

    • Diagnosis: You’ve tried all the above, and still face persistent OOM errors, or erratic memory behavior.
    • Cause: Bugs in older NVIDIA drivers or TensorFlow versions could lead to inefficient memory management or incorrect reporting of available memory.
    • Fix: Update your NVIDIA drivers and TensorFlow/CUDA toolkit to the latest stable versions compatible with each other. Check TensorFlow’s installation guide for compatibility matrices.
    • Why it works: Newer versions often contain performance improvements and bug fixes related to GPU memory management and CUDA interactions.

After fixing these, if your model still needs more memory than the GPU physically has, you will keep hitting ResourceExhaustedError even with memory growth enabled; at that point the remaining levers are a smaller model, mixed precision, or splitting the work across multiple GPUs.

Want structured learning?

Take the full TensorFlow course →