
CUDA warning: driver shutting down - Dataloader GPU memory data to trainer transfer #6636


Closed
matt3o opened this issue Jun 21, 2023 · 13 comments



matt3o commented Jun 21, 2023

Describe the bug
I just tried to get some sample code for #6626 but ran into a warning I have seen many times before. The problem appears when a transform pushes the data to the GPU and the data is then handed over from the DataLoader worker to the main process.
This is not a hard bug, but it is very annoying since the warning gets spammed a lot.
A temporary workaround I found is to add persistent_workers=True to the DataLoader; the warning is then only shown at the end of the program, and sometimes not at all.
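
For reference, this is roughly what the workaround looks like (a sketch based on the loader in the sample below; persistent_workers is a standard torch DataLoader flag that MONAI's DataLoader forwards):

# Sketch of the temporary workaround: same loader as in the sample below, with
# persistent_workers=True so the spawned worker is kept alive between epochs
# instead of being torn down, which is when the warnings appear.
loader = DataLoader(
    dataset,
    num_workers=1,
    batch_size=1,
    multiprocessing_context="spawn",
    persistent_workers=True,
)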

Warning message:

[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)

To Reproduce
Run this minimal sample:

import torch
from torch import optim, nn
from monai.engines import SupervisedTrainer
from monai.data import DataLoader, ArrayDataset
import gc
from monai.networks.nets import UNet
from monai.inferers import SimpleInferer, SlidingWindowInferer
from monai.networks.nets.dynunet import DynUNet

from monai.engines import SupervisedEvaluator, SupervisedTrainer
import monai.transforms as mt


NETWORK_INPUT_SHAPE = (1, 128, 128, 256)
NUM_IMAGES = 50

def get_xy():
    xs = [256 * torch.rand(NETWORK_INPUT_SHAPE) for _ in range(NUM_IMAGES)]
    ys = [torch.rand(NETWORK_INPUT_SHAPE) for _ in range(NUM_IMAGES)]
    return xs, ys

transform = mt.Compose([
    mt.ToDevice(device="cuda")
])

def get_data_loader():
    x, y = get_xy()
    dataset = ArrayDataset(x, seg=y, img_transform=transform, seg_transform=transform)

    loader = DataLoader(dataset, num_workers=1, batch_size=1, multiprocessing_context='spawn')
    return loader


def get_model():
    return DynUNet(
            spatial_dims=3,
            in_channels=1,
            out_channels=1,
            kernel_size=[3, 3, 3, 3, 3, 3],
            strides=[1, 2, 2, 2, 2, [2, 2, 1]],
            upsample_kernel_size=[2, 2, 2, 2, [2, 2, 1]],
            norm_name="instance",
            deep_supervision=False,
            res_block=True,
).to(device=device)

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    train_loader = get_data_loader()
    model = get_model()
    MAX_EPOCHS = 2

    optimizer = optim.Adam(model.parameters())
    inferer = SlidingWindowInferer(roi_size=(64, 64, 64), sw_batch_size=10, mode="gaussian")

    trainer = SupervisedTrainer(
        device=device,
        max_epochs=MAX_EPOCHS,
        amp=True,
        train_data_loader=train_loader,
        network=model,
        optimizer=optimizer,
        inferer=inferer,
        loss_function=nn.CrossEntropyLoss(),
        prepare_batch=lambda batchdata, device, non_blocking: (
            batchdata[0].to(device),
            batchdata[1].squeeze(1).to(device, dtype=torch.long),
        ),
    )

    trainer.run()

Expected behavior
No CUDA warnings.

Environment

Verified on different environments.

================================
Printing MONAI config...
================================
MONAI version: 1.1.0
Numpy version: 1.23.5
Pytorch version: 1.13.1+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: a2ec3752f54bfc3b40e7952234fbeb5452ed63e3
MONAI __file__: /home/matteo/anaconda3/envs/monai/lib/python3.9/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: 0.4.10
Nibabel version: 5.0.1
scikit-image version: 0.20.0
Pillow version: 9.5.0
Tensorboard version: 2.12.1
gdown version: 4.7.1
TorchVision version: 0.14.0+cu117
tqdm version: 4.64.1
lmdb version: 1.4.0
psutil version: 5.9.4
pandas version: 1.5.3
einops version: 0.6.0
transformers version: 4.21.3
mlflow version: 2.2.2
pynrrd version: 1.0.0

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies


================================
Printing system config...
================================
System: Linux
Linux version: Ubuntu 22.04.2 LTS
Platform: Linux-5.19.0-45-generic-x86_64-with-glibc2.35
Processor: x86_64
Machine: x86_64
Python version: 3.9.16
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 12
Num logical CPUs: 24
Num usable CPUs: 24
CPU usage (%): [4.1, 3.6, 4.2, 3.6, 3.6, 3.7, 3.6, 3.1, 3.6, 4.1, 5.2, 99.5, 4.1, 3.6, 4.6, 3.6, 3.6, 3.6, 3.6, 3.6, 3.6, 5.2, 3.6, 4.2]
CPU freq. (MHz): 3687
Load avg. in last 1, 5, 15 mins (%): [6.3, 7.6, 7.5]
Disk usage (%): 24.8
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 31.2
Available memory (GB): 26.9
Used memory (GB): 3.9

================================
Printing GPU config...
================================
Num GPUs: 1
Has CUDA: True
CUDA version: 11.7
cuDNN enabled: True
cuDNN version: 8500
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
GPU 0 Name: NVIDIA GeForce RTX 3090 Ti
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 84
GPU 0 Total memory (GB): 22.2
GPU 0 CUDA capability (maj.min): 8.6

Additional context
Adding an evaluator makes things worse, and an additional warning is now shown:

[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

The code for that:

import torch
from torch import optim, nn
from monai.engines import SupervisedTrainer
from monai.data import DataLoader, ArrayDataset
import gc
from monai.networks.nets import UNet
from monai.inferers import SimpleInferer, SlidingWindowInferer
from monai.networks.nets.dynunet import DynUNet

from monai.handlers import (
    CheckpointSaver,
    LrScheduleHandler,
    MeanDice,
    StatsHandler,
    TensorBoardStatsHandler,
    ValidationHandler,
    from_engine,
    GarbageCollector,
)
from monai.engines import SupervisedEvaluator, SupervisedTrainer
import monai.transforms as mt


NETWORK_INPUT_SHAPE = (1, 128, 128, 256)
NUM_IMAGES = 50

def get_xy():
    xs = [256 * torch.rand(NETWORK_INPUT_SHAPE) for _ in range(NUM_IMAGES)]
    ys = [torch.rand(NETWORK_INPUT_SHAPE) for _ in range(NUM_IMAGES)]
    return xs, ys

transform = mt.Compose([
    mt.ToDevice(device="cuda")
])

def get_data_loader():
    x, y = get_xy()
    dataset = ArrayDataset(x, seg=y, img_transform=transform, seg_transform=transform)

    loader = DataLoader(dataset, num_workers=1, batch_size=1, multiprocessing_context='spawn')
    return loader


def get_model():
    return DynUNet(
            spatial_dims=3,
            # 1 channel for the image; the other channels would carry the per-label signal, which has the same size as the image
            in_channels=1,
            out_channels=1,
            kernel_size=[3, 3, 3, 3, 3, 3],
            strides=[1, 2, 2, 2, 2, [2, 2, 1]],
            upsample_kernel_size=[2, 2, 2, 2, [2, 2, 1]],
            norm_name="instance",
            deep_supervision=False,
            res_block=True,
).to(device=device)

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    train_loader = get_data_loader()
    model = get_model()
    MAX_EPOCHS = 2

    optimizer = optim.Adam(model.parameters())
    inferer = SlidingWindowInferer(roi_size=(64, 64, 64), sw_batch_size=10, mode="gaussian")
    val_inferer = SlidingWindowInferer(roi_size=(64, 64, 64), sw_batch_size=10, mode="gaussian")

    val_handlers = [
        StatsHandler(output_transform=lambda x: None),
    ]

    evaluator = SupervisedEvaluator(
        device=device,
        amp=True,
        val_data_loader=train_loader,
        network=model,
        inferer=val_inferer,
        prepare_batch=lambda batchdata, device, non_blocking: (
            batchdata[0].to(device),
            batchdata[1].squeeze(1).to(device, dtype=torch.long),
        ),
        val_handlers = val_handlers,
    )
    lr_scheduler = torch.optim.lr_scheduler.PolynomialLR(optimizer, total_iters=MAX_EPOCHS, power=2)
    train_handlers = [
        ValidationHandler(
            validator=evaluator, interval=1, epoch_level=True,
        ),
        LrScheduleHandler(lr_scheduler=lr_scheduler, print_lr=True),
    ]

    trainer = SupervisedTrainer(
        device=device,
        max_epochs=MAX_EPOCHS,
        amp=True,
        train_data_loader=train_loader,
        network=model,
        optimizer=optimizer,
        inferer=inferer,
        loss_function=nn.CrossEntropyLoss(),
        prepare_batch=lambda batchdata, device, non_blocking: (
            batchdata[0].to(device),
            batchdata[1].squeeze(1).to(device, dtype=torch.long),
        ),
        train_handlers=train_handlers,
    )
    trainer.run()

wyli commented Jun 22, 2023

Thanks for reporting this. I think multiprocessing_context='spawn' creates and removes new processes during training, which actually introduces overhead.

Have you tried monai.data.ThreadDataLoader? It is often better than the multiprocessing loader when handling transforms on the GPU:

loader = monai.data.ThreadDataLoader(dataset, num_workers=0, batch_size=1)

or

loader = ThreadDataLoader(dataset, num_workers=1, batch_size=1, use_thread_workers=True)
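
For example, the first option would slot into the minimal sample above roughly like this (a sketch; the rest of the script stays unchanged):

from monai.data import ThreadDataLoader

def get_data_loader():
    x, y = get_xy()
    dataset = ArrayDataset(x, seg=y, img_transform=transform, seg_transform=transform)
    # num_workers=0 runs the GPU transform in the main process, so no extra CUDA
    # context is created in a worker that later shuts down.
    return ThreadDataLoader(dataset, num_workers=0, batch_size=1)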


matt3o commented Jun 22, 2023

@wyli Thanks for the quick reply!
Yes. With ThreadDataLoader without spawn (loader = ThreadDataLoader(dataset, num_workers=1, batch_size=1)) I get this error:
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

With spawn (loader = ThreadDataLoader(dataset, num_workers=1, batch_size=1, multiprocessing_context='spawn')) I get exactly the same results:

2023-06-22 13:54:36,946 - Engine run resuming from iteration 0, epoch 0 until 2 epochs
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
....
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

With loader = ThreadDataLoader(dataset, num_workers=0, batch_size=1, multiprocessing_context='spawn') I get
ValueError: multiprocessing_context can only be used with multi-process loading (num_workers > 0), but got num_workers=0

It works with loader = ThreadDataLoader(dataset, num_workers=1, batch_size=1, multiprocessing_context='spawn', use_thread_workers=True).
Kind of confusing. I had been playing around with the different DataLoaders and options but could not find a working combination myself, so thanks for your help, this one works!
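
To recap the combinations above (outcomes as reported, not re-tested here):

# ThreadDataLoader(dataset, num_workers=1, batch_size=1)
#   -> RuntimeError: Cannot re-initialize CUDA in forked subprocess
# ThreadDataLoader(dataset, num_workers=1, batch_size=1, multiprocessing_context='spawn')
#   -> same "driver shutting down" warnings as with the plain DataLoader
# ThreadDataLoader(dataset, num_workers=0, batch_size=1, multiprocessing_context='spawn')
#   -> ValueError: multiprocessing_context requires num_workers > 0
# ThreadDataLoader(dataset, num_workers=1, batch_size=1, multiprocessing_context='spawn', use_thread_workers=True)
#   -> works, no warnings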


matt3o commented Jun 22, 2023

@wyli Sadly this topic is not resolved yet. Using the ThreadDataLoader changes the behaviour of my real code; I believe the affine matrix for the spatial transforms is getting lost in the process.

The transforms, similar to DeepEdit, look like this:

            # Initial transforms on the CPU which does not hurt since they are executed asynchronously and only once
            InitLoggerd(args), # necessary if the dataloader runs in an extra thread / process
            LoadImaged(keys=("image", "label"), reader="ITKReader"),
            EnsureChannelFirstd(keys=("image", "label")),
            NormalizeLabelsInDatasetd(keys="label", label_names=labels, device=device),
            Orientationd(keys=["image", "label"], axcodes="RAS"),
            Spacingd(keys=["image", "label"], pixdim=spacing),
            CropForegroundd(keys=("image", "label"), source_key="image", select_fn=threshold_foreground),
            ScaleIntensityRanged(keys="image", a_min=0, a_max=43, b_min=0.0, b_max=1.0, clip=True), # 0.05 and 99.95 percentiles of the spleen HUs

            ### Random Transforms ###
            RandCropByPosNegLabeld(keys=("image", "label"), label_key="label", spatial_size=args.train_crop_size, pos=0.6, neg=0.4) if args.train_crop_size is not None else NoOpd(),
            DivisiblePadd(keys=["image", "label"], k=64, value=0) if args.inferer == "SimpleInferer" else NoOpd(), # UNet needs this
            RandFlipd(keys=("image", "label"), spatial_axis=[0], prob=0.10),
            RandFlipd(keys=("image", "label"), spatial_axis=[1], prob=0.10),
            RandFlipd(keys=("image", "label"), spatial_axis=[2], prob=0.10),
            RandRotate90d(keys=("image", "label"), prob=0.10, max_k=3),
            
            # Move to GPU
            ToTensord(keys=("image", "label"), device=device, track_meta=False),

With the DataLoader the resulting shape is torch.Size([1, 3, 224, 224, 320]); with the ThreadDataLoader the result is torch.Size([1, 3, 169, 169, 109]).
The following message is printed when using the ThreadDataLoader:

`data_array` is not of type MetaTensor, assuming affine to be identity.
`data_array` is not of type MetaTensor, assuming affine to be identity.

I have run into this before and I believe it is caused by Spacingd(keys=["image", "label"], pixdim=spacing) not finding the affine matrix, but I have no idea why this happens with the ThreadDataLoader and not with the normal DataLoader. Any insight here would be great, since this makes the ThreadDataLoader solution unusable for me.
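
A quick way to narrow this down (a diagnostic sketch, not part of the original script, assuming the dictionary-based pipeline above with an "image" key) is to check whether the batches coming out of each loader are still MetaTensors carrying an affine:

from monai.data import MetaTensor

batch = next(iter(train_loader))
img = batch["image"]
# If this prints False / "no affine", the metadata was already dropped before
# Spacingd ran, which would explain the "assuming affine to be identity" warning.
print(type(img), isinstance(img, MetaTensor))
print(getattr(img, "affine", "no affine"))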


wyli commented Jun 22, 2023

Yes, please use track_meta=True in the ToTensord.
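
i.e. the last transform in the pipeline above would become:

ToTensord(keys=("image", "label"), device=device, track_meta=True),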


matt3o commented Jun 22, 2023

That does not change anything. Do I have to call it earlier? The ToTensord call comes after all the relevant transforms, as far as I can tell.


matt3o commented Jun 22, 2023

Same problem, even with ToTensord right at the start:

            InitLoggerd(args), # necessary if the dataloader runs in an extra thread / process
            LoadImaged(keys=("image", "label"), reader="ITKReader"),
            ToTensord(keys=("image", "label"), device=device, track_meta=True),
            EnsureChannelFirstd(keys=("image", "label")),
            NormalizeLabelsInDatasetd(keys="label", label_names=labels, device=device),
            Orientationd(keys=["image", "label"], axcodes="RAS"),
            Spacingd(keys=["image", "label"], pixdim=spacing),
            CropForegroundd(keys=("image", "label"), source_key="image", select_fn=threshold_foreground),
            ScaleIntensityRanged(keys="image", a_min=0, a_max=43, b_min=0.0, b_max=1.0, clip=True), # 0.05 and 99.95 percentiles of the spleen HUs


matt3o commented Jun 22, 2023

I just changed my code to use Dataset instead of PersistentDataset, just to be sure this is not a caching effect. Same result with Dataset.


wyli commented Jun 22, 2023

OK... perhaps this NormalizeLabelsInDatasetd implicitly converts a MetaTensor into a plain torch tensor and removes the metadata. Could you please try dropping that transform to confirm? I can also look into this soon.
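
If that turns out to be the cause, a common pattern for a custom dictionary transform is to convert the result back to the input type so that MetaTensor metadata (including the affine) is preserved. A rough sketch only: the class name RemapLabelsd is hypothetical and stands in for NormalizeLabelsInDatasetd, which is user code not shown in this issue.

import torch
from monai.config import KeysCollection
from monai.transforms import MapTransform
from monai.utils import convert_to_dst_type

class RemapLabelsd(MapTransform):
    """Hypothetical example: remap label values while preserving MetaTensor metadata."""

    def __init__(self, keys: KeysCollection, label_names: dict):
        super().__init__(keys)
        self.label_names = label_names

    def __call__(self, data):
        d = dict(data)
        for key in self.key_iterator(d):
            label = d[key]
            label_t = torch.as_tensor(label)
            new_label = torch.zeros_like(label_t)
            for new_idx, old_value in enumerate(self.label_names.values(), start=1):
                new_label[label_t == old_value] = new_idx
            # convert back to the input type, so a MetaTensor stays a MetaTensor
            d[key], *_ = convert_to_dst_type(new_label, dst=label)
        return d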


matt3o commented Jun 22, 2023

Same result, so the transforms still don't work as expected, and the code then crashes later of course since information is missing. The code is extremely similar to this one: https://github.com/Project-MONAI/MONAI/blob/dev/monai/apps/deepedit/transforms.py#L86, so it should already check whether the tensor is a plain tensor or a MetaTensor.


matt3o commented Jun 22, 2023

@wyli It would be great if you find the time to check that out 😊 As I said, with the DataLoader it works and with the ThreadDataLoader it doesn't. Just for completeness, I'll paste the calling code for both below.

-    train_loader = DataLoader(
-        train_ds, shuffle=True, num_workers=args.num_workers, batch_size=1, multiprocessing_context='spawn', persistent_workers=True,
+    train_loader = ThreadDataLoader(
+        train_ds, shuffle=True, num_workers=args.num_workers, batch_size=1, multiprocessing_context='spawn', use_thread_workers=True#, persistent_workers=True,


wyli commented Jun 22, 2023

Sure. I forgot to mention that ThreadDataLoader with num_workers greater than 1 may have some problems, because some transforms are not thread-safe; DataLoader, on the other hand, can work with more than one process.

The "`data_array` is not of type MetaTensor, assuming affine to be identity." message looks like a separate issue, I'll have a look.
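
In other words, the two setups being contrasted would look roughly like this (illustrative only, not taken from the user's script; num_workers=4 is an arbitrary example):

# Thread-based: the transform chain (including the GPU step) runs in the main
# process; some transforms are not thread-safe, so keep the worker count low.
train_loader = ThreadDataLoader(train_ds, shuffle=True, num_workers=0, batch_size=1)

# Process-based: can scale to several worker processes, but GPU transforms in the
# workers then need the 'spawn' start method (and persistent_workers helps against
# the "driver shutting down" spam, as noted at the top of this issue).
train_loader = DataLoader(train_ds, shuffle=True, num_workers=4, batch_size=1,
                          multiprocessing_context="spawn", persistent_workers=True)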


matt3o commented Jun 22, 2023

Ah, no worries there, I am using args.num_workers == 1 by default. Good to know anyway, then I won't increase it. I found old code that mentioned setting num_workers to 0, but that no longer works.


wyli commented Jun 26, 2023

(Quoting matt3o's earlier comment in full, see above: the ThreadDataLoader changes the behaviour of the real code, and the affine matrix for the spatial transforms appears to get lost in the process.)

Hi @diazandr3s, I'm not sure about the root cause of this DeepEdit transform + ThreadDataLoader issue, please have a look if you have time. Thanks! (Converting this to a discussion for now; please feel free to create a bug report if it turns out not to be a usage question.)

Project-MONAI locked and limited conversation to collaborators on Jun 26, 2023
wyli converted this issue into discussion #6657 on Jun 26, 2023
