CUDA warning: driver shutting down - Dataloader GPU memory data to trainer transfer #6636
Comments
Thanks for reporting this, I think the … Have you tried … or …?
@wyli Thanks for the quick reply! With spawn (…) … With … working results with …
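For reference, a minimal sketch of what passing a spawn multiprocessing context to the MONAI DataLoader looks like (the dataset and argument values below are illustrative assumptions, not taken from this thread):

```python
# Sketch (assumption): a 'spawn' start method so CUDA can be used safely inside worker processes.
from monai.data import DataLoader, Dataset

train_ds = Dataset(data=[{"image": i} for i in range(4)])  # placeholder dataset for illustration
train_loader = DataLoader(
    train_ds,
    batch_size=1,
    shuffle=True,
    num_workers=1,
    multiprocessing_context="spawn",
    persistent_workers=True,  # the workaround mentioned in the issue description below
)
```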
@wyli This topic is sadly not resolved, however. Using ThreadDataLoader changes the behaviour of my real code; I believe the affine matrix for the spatial transforms is getting lost in the process. The transforms, similar to DeepEdit, look like this:

```python
# Initial transforms on the CPU, which does not hurt since they are executed asynchronously and only once
InitLoggerd(args), # necessary if the dataloader runs in an extra thread / process
LoadImaged(keys=("image", "label"), reader="ITKReader"),
EnsureChannelFirstd(keys=("image", "label")),
NormalizeLabelsInDatasetd(keys="label", label_names=labels, device=device),
Orientationd(keys=["image", "label"], axcodes="RAS"),
Spacingd(keys=["image", "label"], pixdim=spacing),
CropForegroundd(keys=("image", "label"), source_key="image", select_fn=threshold_foreground),
ScaleIntensityRanged(keys="image", a_min=0, a_max=43, b_min=0.0, b_max=1.0, clip=True), # 0.05 and 99.95 percentiles of the spleen HUs
### Random Transforms ###
RandCropByPosNegLabeld(keys=("image", "label"), label_key="label", spatial_size=args.train_crop_size, pos=0.6, neg=0.4) if args.train_crop_size is not None else NoOpd(),
DivisiblePadd(keys=["image", "label"], k=64, value=0) if args.inferer == "SimpleInferer" else NoOpd(), # UNet needs this
RandFlipd(keys=("image", "label"), spatial_axis=[0], prob=0.10),
RandFlipd(keys=("image", "label"), spatial_axis=[1], prob=0.10),
RandFlipd(keys=("image", "label"), spatial_axis=[2], prob=0.10),
RandRotate90d(keys=("image", "label"), prob=0.10, max_k=3),
# Move to GPU
ToTensord(keys=("image", "label"), device=device, track_meta=False),
```

With the DataLoader the resulting shape is torch.Size([1, 3, 224, 224, 320]).
I have run into this error before and I believe this is due to …
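To illustrate the suspected metadata issue, a minimal sketch (an assumption for illustration, not code from this thread) showing that track_meta=False yields a plain torch.Tensor without the affine, while track_meta=True keeps a MetaTensor:

```python
# Sketch: with track_meta=False the result is a plain torch.Tensor, so the affine
# (and applied_operations) that spatial transforms rely on is no longer available.
import torch
from monai.data import MetaTensor
from monai.transforms import ToTensord

sample = {"image": MetaTensor(torch.rand(1, 8, 8, 8), affine=torch.eye(4))}

with_meta = ToTensord(keys="image", track_meta=True)(dict(sample))
without_meta = ToTensord(keys="image", track_meta=False)(dict(sample))

print(type(with_meta["image"]).__name__)     # MetaTensor -> affine preserved
print(type(without_meta["image"]).__name__)  # Tensor -> metadata stripped
```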
yes, please use …
That does not change anything. Do I have to call it sooner? The ToTensord call comes after all the relevant transforms, imo.
Same problem, even with ToTensord right at the start:

```python
InitLoggerd(args),  # necessary if the dataloader runs in an extra thread / process
LoadImaged(keys=("image", "label"), reader="ITKReader"),
ToTensord(keys=("image", "label"), device=device, track_meta=True),
EnsureChannelFirstd(keys=("image", "label")),
NormalizeLabelsInDatasetd(keys="label", label_names=labels, device=device),
Orientationd(keys=["image", "label"], axcodes="RAS"),
Spacingd(keys=["image", "label"], pixdim=spacing),
CropForegroundd(keys=("image", "label"), source_key="image", select_fn=threshold_foreground),
ScaleIntensityRanged(keys="image", a_min=0, a_max=43, b_min=0.0, b_max=1.0, clip=True),  # 0.05 and 99.95 percentiles of the spleen HUs
```
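As a usage note, a list like the one above is presumably wrapped in a Compose and handed to the dataset, roughly as follows (a sketch; train_files is an assumed placeholder, not a name from this thread):

```python
from monai.data import Dataset
from monai.transforms import Compose

train_files = [{"image": "img.nii.gz", "label": "lbl.nii.gz"}]  # placeholder file list (assumed)
train_transforms = Compose([
    # ... the transforms listed above ...
])
train_ds = Dataset(data=train_files, transform=train_transforms)
```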
I just changed my code to Dataset instead of PersistentDataset, just to be sure this is not a caching effect. Same results with Dataset.
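For reference, the kind of swap described above, as a sketch with the same assumed names:

```python
from monai.data import Dataset, PersistentDataset

# PersistentDataset caches the deterministic transform results on disk; using the plain Dataset
# recomputes everything on each access, which rules out a stale-cache effect.
# train_ds = PersistentDataset(data=train_files, transform=train_transforms, cache_dir="./cache")
train_ds = Dataset(data=train_files, transform=train_transforms)
```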
ok... perhaps this …
Same result, so the transforms still don't work as expected, plus the code then crashes later, of course, since information is missing. The code is extremely similar to this one: https://github.com/Project-MONAI/MONAI/blob/dev/monai/apps/deepedit/transforms.py#L86, so it should check whether the tensor is a normal tensor or a MetaTensor.
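A hypothetical sketch of the kind of check mentioned above (not the actual DeepEdit transform code):

```python
import torch
from monai.data import MetaTensor

def get_affine(img: torch.Tensor) -> torch.Tensor:
    """Return the image affine, falling back to identity if the metadata was stripped."""
    if isinstance(img, MetaTensor):
        return img.affine
    return torch.eye(4)
```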
@wyli It would be great if you could find the time to check that out 😊 As I said, with DataLoader it works, with ThreadDataLoader it doesn't. Just for completeness, I'll paste the calling code for both below.

```diff
- train_loader = DataLoader(
-     train_ds, shuffle=True, num_workers=args.num_workers, batch_size=1, multiprocessing_context='spawn', persistent_workers=True,
+ train_loader = ThreadDataLoader(
+     train_ds, shuffle=True, num_workers=args.num_workers, batch_size=1, multiprocessing_context='spawn', use_thread_workers=True#, persistent_workers=True,
```
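For completeness, the ThreadDataLoader variant written out as a single call (a sketch; train_ds is assumed from the surrounding code, and use_thread_workers=True runs the workers as threads rather than processes):

```python
from monai.data import ThreadDataLoader

train_loader = ThreadDataLoader(
    train_ds,                  # assumed: the dataset built from the transforms above
    shuffle=True,
    num_workers=1,             # args.num_workers is reported as 1 further down
    batch_size=1,
    use_thread_workers=True,   # workers run as threads and share the main process's CUDA context
)
```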
sure, I forgot to mention that the issue of …
Ah, no worries there, I am using args.num_workers==1 by default. Good to know anyway; then I won't increase it. I found old code mentioning setting num_workers to 0, but that no longer works.
Hi @diazandr3s, I'm not sure about the root cause of this DeepEdit transform + ThreadDataLoader issue, please have a look if you have time, thanks! (converting this to a discussion for now, please feel free to create a bug report if it's not a usage question)
This issue was moved to a discussion. You can continue the conversation there.
Describe the bug
I just tried to put together some sample code for #6626 but ran into a warning I have seen many times before. The problem appears when a transform pushes the data to the GPU and the data is then handed over from the DataLoader thread to the main thread.
This is not a hard bug, but it is very annoying since the warning gets spammed a lot.
A temporary workaround I found is to add persistent_workers=True to the DataLoader; then the warning is only shown at the end of the program, and sometimes not at all.
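To make the described setup concrete, here is a minimal sketch (hypothetical, not the original reproduction code): a dict transform moves each sample to the GPU inside a spawned worker, and the batch is then consumed in the main process.

```python
# Minimal sketch of the described setup (hypothetical, not the original reproduction code).
# ToTensord moves each sample to the GPU inside the DataLoader worker; the batch is then
# consumed in the main process, which is where the shutdown warning was observed.
import torch
from monai.data import DataLoader, Dataset
from monai.transforms import Compose, ToTensord

def main():
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    data = [{"image": torch.rand(1, 32, 32, 32)} for _ in range(4)]
    transforms = Compose([ToTensord(keys="image", device=device, track_meta=False)])
    ds = Dataset(data=data, transform=transforms)
    loader = DataLoader(
        ds,
        batch_size=1,
        num_workers=1,
        multiprocessing_context="spawn",
        persistent_workers=True,   # the workaround mentioned above
    )
    for batch in loader:
        print(batch["image"].device, batch["image"].shape)

if __name__ == "__main__":
    main()
```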
Warning message:
To Reproduce
Run this code, minimal sample:
Expected behavior
No CUDA warnings.
Environment
Verified in several different environments.
Additional context
Adding an evaluator complicates things further, and an additional warning is shown:
The code for that: