cpu_offload vRAM memory consumption larger than 4GB #1934


Closed
Sanster opened this issue Jan 6, 2023 · 2 comments
Labels: bug (Something isn't working), stale (Issues that haven't received updates)

Comments


Sanster commented Jan 6, 2023

Describe the bug

I am using the code from https://huggingface.co/docs/diffusers/optimization/fp16#offloading-to-cpu-with-accelerate-for-memory-savings to test cpu_offload, but the vRAM consumption is larger than 4GB:

GPU         cpu_offload enabled   vRAM cost
1080        Yes                   4539 MB
1080        No                    5101 MB
TITAN RTX   Yes                   5134 MB
TITAN RTX   No                    5668 MB

Reproduction

I am using the code from https://huggingface.co/docs/diffusers/optimization/fp16#offloading-to-cpu-with-accelerate-for-memory-savings

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
image = pipe(prompt).images[0]
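For reference, the numbers in the table above come from nvidia-smi. A minimal sketch of checking peak memory from inside the script instead (an assumed alternative, not part of the original report; torch.cuda.max_memory_allocated reports allocator statistics, not total process memory):

import torch
from diffusers import StableDiffusionPipeline

# Reset the allocator's peak statistics before running the pipeline.
torch.cuda.reset_peak_memory_stats()

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")  # same setup as the snippet above

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
image = pipe(prompt).images[0]

# Peak memory held by PyTorch tensors; nvidia-smi reports a higher number because
# it also counts the CUDA context and the caching allocator's reserved memory.
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1024**2:.0f} MB")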

Logs

No response

System Info

Tested on 1080 / TITAN RTX

  • diffusers version: 0.11.1
  • accelerate version: 0.15.0
  • Platform: Linux-4.15.0-142-generic-x86_64-with-glibc2.29
  • Python version: 3.8.10
  • PyTorch version (GPU?): 1.10.1+cu111 (True)
  • Huggingface_hub version: 0.11.1
  • Transformers version: 4.25.1
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: No
Sanster added the bug (Something isn't working) label on Jan 6, 2023
patrickvonplaten (Contributor) commented

Hey @Sanster,

Thanks a lot for the super clean bug report. When running your code-snippet in combination with:

nvidia-smi

I can observe the same numbers as in your table.

I think the problem is that we move all modules to GPU in the very beginning. We shouldn't do this. When setting:

pipe.enable_sequential_cpu_offload()

it's important not to have run .to("cuda") beforehand. E.g. the following should give much better memory numbers:

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    safety_checker=None,
)

prompt = "a photo of an astronaut riding a horse on mars"
pipe.enable_sequential_cpu_offload()
image = pipe(prompt, num_inference_steps=4).images[0]

When running the above I'm getting <3GB memory usage.

In this example, I'm not using the safety_checker since there is currently a bug with cpu_offload + the safety checker; it should be corrected in PR #1968, along with the incorrect documentation you spotted. Thanks a lot ❤️

It's an interesting use case since it might not be super intuitive that one has to remove .to("cuda") when using enable_sequential_cpu_offload(...).

cc @pcuenca @patil-suraj @anton-l and maybe also @sgugger @muellerzr just FYI since it might be a common problem people run into.


github-actions bot commented Feb 5, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
