
CUDA warning: driver shutting down - Dataloader GPU memory data to trainer transfer #6636


Closed
matt3o opened this issue Jun 21, 2023 · 13 comments



matt3o commented Jun 21, 2023

Describe the bug
I just tried to get some sample code for #6626 but ran into a warning I have seen many times before. The problem appears when a transform pushes the data to the GPU and the data is then handed over from the DataLoader worker to the main process.
This is not a hard bug, but it is very annoying since the warning gets spammed a lot.
A temporary workaround I found is to add persistent_workers=True to the DataLoader; the warning is then only shown at the end of the program, and sometimes not at all.
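
For reference, this is roughly what the workaround looks like (a sketch based on the loader in the sample below; persistent_workers is a standard torch DataLoader flag that MONAI's DataLoader forwards):

# Sketch of the temporary workaround: same loader as in the sample below, with
# persistent_workers=True so the spawned worker is kept alive between epochs
# instead of being torn down, which is when the warnings appear.
loader = DataLoader(
    dataset,
    num_workers=1,
    batch_size=1,
    multiprocessing_context="spawn",
    persistent_workers=True,
)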

Warning message:

[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)

To Reproduce
Run this minimal sample:

import torch
from torch import optim, nn
from monai.engines import SupervisedTrainer
from monai.data import DataLoader, ArrayDataset
import gc
from monai.networks.nets import UNet
from monai.inferers import SimpleInferer, SlidingWindowInferer
from monai.networks.nets.dynunet import DynUNet

from monai.engines import SupervisedEvaluator, SupervisedTrainer
import monai.transforms as mt


NETWORK_INPUT_SHAPE = (1, 128, 128, 256)
NUM_IMAGES = 50

def get_xy():
    xs = [256 * torch.rand(NETWORK_INPUT_SHAPE) for _ in range(NUM_IMAGES)]
    ys = [torch.rand(NETWORK_INPUT_SHAPE) for _ in range(NUM_IMAGES)]
    return xs, ys

transform = mt.Compose([
    mt.ToDevice(device="cuda")
])

def get_data_loader():
    x, y = get_xy()
    dataset = ArrayDataset(x, seg=y, img_transform=transform, seg_transform=transform)

    loader = DataLoader(dataset, num_workers=1, batch_size=1, multiprocessing_context='spawn')
    return loader


def get_model():
    return DynUNet(
            spatial_dims=3,
            in_channels=1,
            out_channels=1,
            kernel_size=[3, 3, 3, 3, 3, 3],
            strides=[1, 2, 2, 2, 2, [2, 2, 1]],
            upsample_kernel_size=[2, 2, 2, 2, [2, 2, 1]],
            norm_name="instance",
            deep_supervision=False,
            res_block=True,
).to(device=device)

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    train_loader = get_data_loader()
    model = get_model()
    MAX_EPOCHS = 2

    optimizer = optim.Adam(model.parameters())
    inferer = SlidingWindowInferer(roi_size=(64, 64, 64), sw_batch_size=10, mode="gaussian")

    trainer = SupervisedTrainer(
        device=device,
        max_epochs=MAX_EPOCHS,
        amp=True,
        train_data_loader=train_loader,
        network=model,
        optimizer=optimizer,
        inferer=inferer,
        loss_function=nn.CrossEntropyLoss(),
        prepare_batch=lambda batchdata, device, non_blocking: (
            batchdata[0].to(device),
            batchdata[1].squeeze(1).to(device, dtype=torch.long),
        ),
    )

    trainer.run()

Expected behavior
No CUDA warnings.

Environment

Verified on different environments.

================================
Printing MONAI config...
================================
MONAI version: 1.1.0
Numpy version: 1.23.5
Pytorch version: 1.13.1+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: a2ec3752f54bfc3b40e7952234fbeb5452ed63e3
MONAI __file__: /home/matteo/anaconda3/envs/monai/lib/python3.9/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: 0.4.10
Nibabel version: 5.0.1
scikit-image version: 0.20.0
Pillow version: 9.5.0
Tensorboard version: 2.12.1
gdown version: 4.7.1
TorchVision version: 0.14.0+cu117
tqdm version: 4.64.1
lmdb version: 1.4.0
psutil version: 5.9.4
pandas version: 1.5.3
einops version: 0.6.0
transformers version: 4.21.3
mlflow version: 2.2.2
pynrrd version: 1.0.0

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies


================================
Printing system config...
================================
System: Linux
Linux version: Ubuntu 22.04.2 LTS
Platform: Linux-5.19.0-45-generic-x86_64-with-glibc2.35
Processor: x86_64
Machine: x86_64
Python version: 3.9.16
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 12
Num logical CPUs: 24
Num usable CPUs: 24
CPU usage (%): [4.1, 3.6, 4.2, 3.6, 3.6, 3.7, 3.6, 3.1, 3.6, 4.1, 5.2, 99.5, 4.1, 3.6, 4.6, 3.6, 3.6, 3.6, 3.6, 3.6, 3.6, 5.2, 3.6, 4.2]
CPU freq. (MHz): 3687
Load avg. in last 1, 5, 15 mins (%): [6.3, 7.6, 7.5]
Disk usage (%): 24.8
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 31.2
Available memory (GB): 26.9
Used memory (GB): 3.9

================================
Printing GPU config...
================================
Num GPUs: 1
Has CUDA: True
CUDA version: 11.7
cuDNN enabled: True
cuDNN version: 8500
Current device: 0
Library compiled for CUDA architectures: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
GPU 0 Name: NVIDIA GeForce RTX 3090 Ti
GPU 0 Is integrated: False
GPU 0 Is multi GPU board: False
GPU 0 Multi processor count: 84
GPU 0 Total memory (GB): 22.2
GPU 0 CUDA capability (maj.min): 8.6

Additional context
Adding an evaluator makes things worse, and an additional warning is now shown:

[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: driver shutting down (function uncheckedSetDevice)
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

The code for that:

import torch
from torch import optim, nn
from monai.engines import SupervisedTrainer
from monai.data import DataLoader, ArrayDataset
import gc
from monai.networks.nets import UNet
from monai.inferers import SimpleInferer, SlidingWindowInferer
from monai.networks.nets.dynunet import DynUNet

from monai.handlers import (
    CheckpointSaver,
    LrScheduleHandler,
    MeanDice,
    StatsHandler,
    TensorBoardStatsHandler,
    ValidationHandler,
    from_engine,
    GarbageCollector,
)
from monai.engines import SupervisedEvaluator, SupervisedTrainer
import monai.transforms as mt


NETWORK_INPUT_SHAPE = (1, 128, 128, 256)
NUM_IMAGES = 50

def get_xy():
    xs = [256 * torch.rand(NETWORK_INPUT_SHAPE) for _ in range(NUM_IMAGES)]
    ys = [torch.rand(NETWORK_INPUT_SHAPE) for _ in range(NUM_IMAGES)]
    return xs, ys

transform = mt.Compose([
    mt.ToDevice(device="cuda")
])

def get_data_loader():
    x, y = get_xy()
    dataset = ArrayDataset(x, seg=y, img_transform=transform, seg_transform=transform)

    loader = DataLoader(dataset, num_workers=1, batch_size=1, multiprocessing_context='spawn')
    return loader


def get_model():
    return DynUNet(
            spatial_dims=3,
            # 1 channel for the image; the other channels would carry the per-label signal, which has the same size as the image
            in_channels=1,
            out_channels=1,
            kernel_size=[3, 3, 3, 3, 3, 3],
            strides=[1, 2, 2, 2, 2, [2, 2, 1]],
            upsample_kernel_size=[2, 2, 2, 2, [2, 2, 1]],
            norm_name="instance",
            deep_supervision=False,
            res_block=True,
).to(device=device)

if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    train_loader = get_data_loader()
    model = get_model()
    MAX_EPOCHS = 2

    optimizer = optim.Adam(model.parameters())
    inferer = SlidingWindowInferer(roi_size=(64, 64, 64), sw_batch_size=10, mode="gaussian")
    val_inferer = SlidingWindowInferer(roi_size=(64, 64, 64), sw_batch_size=10, mode="gaussian")

    val_handlers = [
        StatsHandler(output_transform=lambda x: None),
    ]

    evaluator = SupervisedEvaluator(
        device=device,
        amp=True,
        val_data_loader=train_loader,
        network=model,
        inferer=val_inferer,
        prepare_batch=lambda batchdata, device, non_blocking: (
            batchdata[0].to(device),
            batchdata[1].squeeze(1).to(device, dtype=torch.long),
        ),
        val_handlers = val_handlers,
    )
    lr_scheduler = torch.optim.lr_scheduler.PolynomialLR(optimizer, total_iters=MAX_EPOCHS, power=2)
    train_handlers = [
        ValidationHandler(
            validator=evaluator, interval=1, epoch_level=True,
        ),
        LrScheduleHandler(lr_scheduler=lr_scheduler, print_lr=True),
    ]

    trainer = SupervisedTrainer(
        device=device,
        max_epochs=MAX_EPOCHS,
        amp=True,
        train_data_loader=train_loader,
        network=model,
        optimizer=optimizer,
        inferer=inferer,
        loss_function=nn.CrossEntropyLoss(),
        prepare_batch=lambda batchdata, device, non_blocking: (
            batchdata[0].to(device),
            batchdata[1].squeeze(1).to(device, dtype=torch.long),
        ),
        train_handlers=train_handlers,
    )
    trainer.run()

wyli commented Jun 22, 2023

Thanks for reporting this. I think multiprocessing_context='spawn' creates and removes new processes during training, which actually introduces overhead.

Have you tried monai.data.ThreadDataLoader? It is often better than the multiprocessing loader when handling transforms on the GPU:

loader = monai.data.ThreadDataLoader(dataset, num_workers=0, batch_size=1)

or

loader = ThreadDataLoader(dataset, num_workers=1, batch_size=1, use_thread_workers=True)
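
For example, the first option would slot into the minimal sample above roughly like this (a sketch; the rest of the script stays unchanged):

from monai.data import ThreadDataLoader

def get_data_loader():
    x, y = get_xy()
    dataset = ArrayDataset(x, seg=y, img_transform=transform, seg_transform=transform)
    # num_workers=0 runs the GPU transform in the main process, so no extra CUDA
    # context is created in a worker that later shuts down.
    return ThreadDataLoader(dataset, num_workers=0, batch_size=1)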


matt3o commented Jun 22, 2023

@wyli Thanks for the quick reply!
Yes. With ThreadDataLoader without spawn (loader = ThreadDataLoader(dataset, num_workers=1, batch_size=1)) I get this error:
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

With spawn (loader = ThreadDataLoader(dataset, num_workers=1, batch_size=1, multiprocessing_context='spawn')) I get exactly the same results:

2023-06-22 13:54:36,946 - Engine run resuming from iteration 0, epoch 0 until 2 epochs
[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
....
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

With loader = ThreadDataLoader(dataset, num_workers=0, batch_size=1, multiprocessing_context='spawn') I get
ValueError: multiprocessing_context can only be used with multi-process loading (num_workers > 0), but got num_workers=0

It works with loader = ThreadDataLoader(dataset, num_workers=1, batch_size=1, multiprocessing_context='spawn', use_thread_workers=True).
Kind of confusing. I had been playing around with the different DataLoaders and options but could not find a working combination myself, so thanks for your help, this one works!
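
To recap the combinations above (outcomes as reported, not re-tested here):

# ThreadDataLoader(dataset, num_workers=1, batch_size=1)
#   -> RuntimeError: Cannot re-initialize CUDA in forked subprocess
# ThreadDataLoader(dataset, num_workers=1, batch_size=1, multiprocessing_context='spawn')
#   -> same "driver shutting down" warnings as with the plain DataLoader
# ThreadDataLoader(dataset, num_workers=0, batch_size=1, multiprocessing_context='spawn')
#   -> ValueError: multiprocessing_context requires num_workers > 0
# ThreadDataLoader(dataset, num_workers=1, batch_size=1, multiprocessing_context='spawn', use_thread_workers=True)
#   -> works, no warnings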


matt3o commented Jun 22, 2023

@wyli Sadly this topic is not resolved yet. Using the ThreadDataLoader changes the behaviour of my real code; I believe the affine matrix for the spatial transforms is getting lost in the process.

The transforms, similar to DeepEdit, look like this:

            # Initial transforms on the CPU which does not hurt since they are executed asynchronously and only once
            InitLoggerd(args), # necessary if the dataloader runs in an extra thread / process
            LoadImaged(keys=("image", "label"), reader="ITKReader"),
            EnsureChannelFirstd(keys=("image", "label")),
            NormalizeLabelsInDatasetd(keys="label", label_names=labels, device=device),
            Orientationd(keys=["image", "label"], axcodes="RAS"),
            Spacingd(keys=["image", "label"], pixdim=spacing),
            CropForegroundd(keys=("image", "label"), source_key="image", select_fn=threshold_foreground),
            ScaleIntensityRanged(keys="image", a_min=0, a_max=43, b_min=0.0, b_max=1.0, clip=True), # 0.05 and 99.95 percentiles of the spleen HUs

            ### Random Transforms ###
            RandCropByPosNegLabeld(keys=("image", "label"), label_key="label", spatial_size=args.train_crop_size, pos=0.6, neg=0.4) if args.train_crop_size is not None else NoOpd(),
            DivisiblePadd(keys=["image", "label"], k=64, value=0) if args.inferer == "SimpleInferer" else NoOpd(), # UNet needs this
            RandFlipd(keys=("image", "label"), spatial_axis=[0], prob=0.10),
            RandFlipd(keys=("image", "label"), spatial_axis=[1], prob=0.10),
            RandFlipd(keys=("image", "label"), spatial_axis=[2], prob=0.10),
            RandRotate90d(keys=("image", "label"), prob=0.10, max_k=3),
            
            # Move to GPU
            ToTensord(keys=("image", "label"), device=device, track_meta=False),

With the DataLoader the resulting shape is torch.Size([1, 3, 224, 224, 320]); with the ThreadDataLoader the result is torch.Size([1, 3, 169, 169, 109]).
The following message is printed when using the ThreadDataLoader:

`data_array` is not of type MetaTensor, assuming affine to be identity.
`data_array` is not of type MetaTensor, assuming affine to be identity.

I have run into this before and I believe it is caused by Spacingd(keys=["image", "label"], pixdim=spacing) not finding the affine matrix, but I have no idea why this happens with the ThreadDataLoader and not with the normal DataLoader. Any insight here would be great, since this makes the ThreadDataLoader solution unusable for me.
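
A quick way to narrow this down (a diagnostic sketch, not part of the original script, assuming the dictionary-based pipeline above with an "image" key) is to check whether the batches coming out of each loader are still MetaTensors carrying an affine:

from monai.data import MetaTensor

batch = next(iter(train_loader))
img = batch["image"]
# If this prints False / "no affine", the metadata was already dropped before
# Spacingd ran, which would explain the "assuming affine to be identity" warning.
print(type(img), isinstance(img, MetaTensor))
print(getattr(img, "affine", "no affine"))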


wyli commented Jun 22, 2023

Yes, please use track_meta=True in the ToTensord.
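
i.e. the last transform in the pipeline above would become:

ToTensord(keys=("image", "label"), device=device, track_meta=True),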


matt3o commented Jun 22, 2023

That does not change anything. Do I have to call it earlier? The ToTensord call comes after all the relevant transforms, as far as I can tell.


matt3o commented Jun 22, 2023

Same problem, even with ToTensord right at the start:

            InitLoggerd(args), # necessary if the dataloader runs in an extra thread / process
            LoadImaged(keys=("image", "label"), reader="ITKReader"),
            ToTensord(keys=("image", "label"), device=device, track_meta=True),
            EnsureChannelFirstd(keys=("image", "label")),
            NormalizeLabelsInDatasetd(keys="label", label_names=labels, device=device),
            Orientationd(keys=["image", "label"], axcodes="RAS"),
            Spacingd(keys=["image", "label"], pixdim=spacing),
            CropForegroundd(keys=("image", "label"), source_key="image", select_fn=threshold_foreground),
            ScaleIntensityRanged(keys="image", a_min=0, a_max=43, b_min=0.0, b_max=1.0, clip=True), # 0.05 and 99.95 percentiles of the spleen HUs


matt3o commented Jun 22, 2023

I just changed my code to use Dataset instead of PersistentDataset, just to be sure this is not a caching effect. Same result with Dataset.


wyli commented Jun 22, 2023

OK... perhaps this NormalizeLabelsInDatasetd implicitly converts a MetaTensor into a plain torch tensor and removes the metadata. Could you please try dropping that transform to confirm? I can also look into this soon.
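
If that turns out to be the cause, a common pattern for a custom dictionary transform is to convert the result back to the input type so that MetaTensor metadata (including the affine) is preserved. A rough sketch only: the class name RemapLabelsd is hypothetical and stands in for NormalizeLabelsInDatasetd, which is user code not shown in this issue.

import torch
from monai.config import KeysCollection
from monai.transforms import MapTransform
from monai.utils import convert_to_dst_type

class RemapLabelsd(MapTransform):
    """Hypothetical example: remap label values while preserving MetaTensor metadata."""

    def __init__(self, keys: KeysCollection, label_names: dict):
        super().__init__(keys)
        self.label_names = label_names

    def __call__(self, data):
        d = dict(data)
        for key in self.key_iterator(d):
            label = d[key]
            label_t = torch.as_tensor(label)
            new_label = torch.zeros_like(label_t)
            for new_idx, old_value in enumerate(self.label_names.values(), start=1):
                new_label[label_t == old_value] = new_idx
            # convert back to the input type, so a MetaTensor stays a MetaTensor
            d[key], *_ = convert_to_dst_type(new_label, dst=label)
        return d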


matt3o commented Jun 22, 2023

Same result, so the transforms still don't work as expected, and the code then crashes later of course since information is missing. The code is extremely similar to this one: https://github.com/Project-MONAI/MONAI/blob/dev/monai/apps/deepedit/transforms.py#L86, so it should already check whether the tensor is a plain tensor or a MetaTensor.


matt3o commented Jun 22, 2023

@wyli It would be great if you find the time to check that out 😊 As I said, with the DataLoader it works and with the ThreadDataLoader it doesn't. Just for completeness, I'll paste the calling code for both below.

-    train_loader = DataLoader(
-        train_ds, shuffle=True, num_workers=args.num_workers, batch_size=1, multiprocessing_context='spawn', persistent_workers=True,
+    train_loader = ThreadDataLoader(
+        train_ds, shuffle=True, num_workers=args.num_workers, batch_size=1, multiprocessing_context='spawn', use_thread_workers=True#, persistent_workers=True,


wyli commented Jun 22, 2023

Sure. I forgot to mention that ThreadDataLoader with num_workers greater than 1 may have some problems, because some transforms are not thread-safe; DataLoader, on the other hand, can work with more than one process.

The "`data_array` is not of type MetaTensor, assuming affine to be identity." message looks like a separate issue, I'll have a look.
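
In other words, the two setups being contrasted would look roughly like this (illustrative only, not taken from the user's script; num_workers=4 is an arbitrary example):

# Thread-based: the transform chain (including the GPU step) runs in the main
# process; some transforms are not thread-safe, so keep the worker count low.
train_loader = ThreadDataLoader(train_ds, shuffle=True, num_workers=0, batch_size=1)

# Process-based: can scale to several worker processes, but GPU transforms in the
# workers then need the 'spawn' start method (and persistent_workers helps against
# the "driver shutting down" spam, as noted at the top of this issue).
train_loader = DataLoader(train_ds, shuffle=True, num_workers=4, batch_size=1,
                          multiprocessing_context="spawn", persistent_workers=True)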


matt3o commented Jun 22, 2023

Ah, no worries there, I am using args.num_workers == 1 by default. Good to know anyway, then I won't increase it. I found old code that mentioned setting num_workers to 0, but that no longer works.


wyli commented Jun 26, 2023

(Quoting matt3o's earlier comment in full, see above: the ThreadDataLoader changes the behaviour of the real code, and the affine matrix for the spatial transforms appears to get lost in the process.)

Hi @diazandr3s, I'm not sure about the root cause of this DeepEdit transform + ThreadDataLoader issue, please have a look if you have time. Thanks! (Converting this to a discussion for now; please feel free to create a bug report if it turns out not to be a usage question.)

Project-MONAI locked and limited conversation to collaborators on Jun 26, 2023
wyli converted this issue into discussion #6657 on Jun 26, 2023
