Train with GPUs

Train ZenML pipelines on GPUs and scale out with 🤗 Accelerate.


Need more compute than your laptop can offer? This tutorial shows how to:

  1. Request GPU resources for individual steps.

  2. Build a CUDA‑enabled container image so the GPU is actually visible.

  3. Reset the CUDA cache between steps (optional but handy for memory‑heavy jobs).

  4. Scale to multiple GPUs or nodes with the 🤗 Accelerate integration.


1 Request extra resources for a step

If your orchestrator supports it, you can reserve CPU, GPU, and RAM directly on a ZenML @step:

from zenml import step
from zenml.config import ResourceSettings

@step(settings={
    "resources": ResourceSettings(cpu_count=8, gpu_count=2, memory="16GB")
})
def training_step(...):
    ...  # heavy training logic
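
You can also attach the same reservation without editing the step definition. A minimal sketch, assuming the training_step defined above and that your ZenML version supports overriding step configuration via with_options:

from zenml.config import ResourceSettings

# Override the step's settings at configuration time instead of in the decorator
training_step_on_gpu = training_step.with_options(
    settings={
        "resources": ResourceSettings(cpu_count=8, gpu_count=2, memory="16GB")
    }
)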

👉 Check your orchestrator's docs; some (e.g. SkyPilot) expose dedicated settings instead of ResourceSettings.

If your orchestrator can't satisfy these requirements, consider off‑loading the step to a dedicated step operator, as sketched below.
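
A step operator runs a single step on dedicated training infrastructure while the rest of the pipeline stays where it is. A minimal sketch, assuming a step operator is already registered in your active stack (the name "sagemaker" is illustrative):

from zenml import step

# Route only this step to the registered step operator; other steps are unaffected
@step(step_operator="sagemaker")
def training_step() -> None:
    ...  # heavy training logic runs on the step operator's infrastructure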


2 Build a CUDA‑enabled container image

Requesting a GPU is not enough—your Docker image needs the CUDA runtime, too.

from zenml import pipeline
from zenml.config import DockerSettings

docker = DockerSettings(
    parent_image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    requirements=["zenml", "torchvision"]
)

@pipeline(settings={"docker": docker})
def my_gpu_pipeline(...):
    ...

Use the official CUDA images for TensorFlow/PyTorch or the pre‑built ones offered by AWS, GCP or Azure.
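
To confirm the GPU is actually visible from inside the container, a quick sanity-check step helps before launching a long training run. A minimal sketch using PyTorch's standard CUDA queries:

import torch
from zenml import step

@step
def check_gpu() -> None:
    # True only if the CUDA runtime in the image and the host driver are compatible
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))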


Optional – clear the CUDA cache

If you need to squeeze every last MB out of the GPU, consider clearing the cache at the beginning of each step:

import gc
import torch

def cleanup_memory() -> None:
    # gc.collect() returns the number of objects it freed; keep collecting
    # until nothing is left, releasing cached CUDA blocks after each pass.
    while gc.collect():
        torch.cuda.empty_cache()

Call cleanup_memory() at the start of your GPU steps.
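
As a minimal sketch (the step name and body are illustrative):

from zenml import step

@step
def gpu_training_step() -> None:
    cleanup_memory()  # start from an empty CUDA cache
    ...  # memory-heavy training logic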


3 Multi‑GPU / multi‑node training with 🤗 Accelerate

ZenML integrates with the Hugging Face Accelerate launcher. Wrap your training step with run_with_accelerate to fan it out over multiple GPUs or machines:

from zenml import step, pipeline
from zenml.integrations.huggingface.steps import run_with_accelerate

@run_with_accelerate(num_processes=4, multi_gpu=True)
@step
def training_step(...):
    ...  # your distributed training code

@pipeline
def dist_pipeline(...):
    training_step(...)

Common arguments:

  • num_processes: total processes to launch (one per GPU)

  • multi_gpu=True: enable multi‑GPU mode

  • cpu=True: force CPU training

  • mixed_precision: "fp16" / "bf16" / "no"

Accelerate‑decorated steps must be called with keyword arguments and cannot be wrapped a second time inside the pipeline definition.
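
A minimal sketch of both rules together, using mixed_precision from the list above (the parameter names train_dataset and epochs are illustrative):

from zenml import pipeline, step
from zenml.integrations.huggingface.steps import run_with_accelerate

@run_with_accelerate(num_processes=4, multi_gpu=True, mixed_precision="bf16")
@step
def training_step(train_dataset: str, epochs: int = 3) -> None:
    ...  # distributed training logic

@pipeline
def dist_pipeline() -> None:
    # Accelerate-decorated steps must be called with keyword arguments:
    training_step(train_dataset="my_dataset", epochs=5)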

Prepare the container

Use the same CUDA image as above and add Accelerate to the requirements:

docker = DockerSettings(
    parent_image="pytorch/pytorch:2.1.0-cuda12.1-cudnn8-runtime",
    requirements=["zenml", "accelerate", "torchvision"]
)

4 Troubleshooting & Tips

| Problem | Quick fix |
| --- | --- |
| GPU is unused | Verify the CUDA toolkit inside the container (nvcc --version) and check driver compatibility |
| OOM even after cache reset | Reduce the batch size, use gradient accumulation, or request more GPU memory |
| Accelerate hangs | Make sure ports are open between nodes; pass main_process_port explicitly |
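
For the last case, a hedged sketch: main_process_port is an accelerate launch option, and we assume here that run_with_accelerate forwards extra keyword arguments to the launcher unchanged:

from zenml import step
from zenml.integrations.huggingface.steps import run_with_accelerate

# Pin the rendezvous port so worker nodes can reach the main process
@run_with_accelerate(num_processes=8, multi_gpu=True, main_process_port=29500)
@step
def training_step() -> None:
    ...  # distributed training logic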

Need help? Join us on Slack.
