
[Pipeline] Add new pipeline for ParaDiGMS -- parallel sampling of diffusion models #3716


Merged: 25 commits into huggingface:main, Jun 20, 2023

Conversation

@AndyShih12 (Contributor) commented Jun 8, 2023

This pull request implements the paper Parallel Sampling of Diffusion Models: https://arxiv.org/abs/2305.16317
Based on the repository: https://github.com/AndyShih12/paradigms

Example of use:

import torch
from diffusers import DDPMParallelScheduler
from diffusers import StableDiffusionParadigmsPipeline

scheduler = DDPMParallelScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")

pipe = StableDiffusionParadigmsPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# replicate the UNet across all available GPUs so the batched
# timestep evaluations can run in parallel
ngpu, batch_per_device = torch.cuda.device_count(), 5
pipe.wrapped_unet = torch.nn.DataParallel(pipe.unet, device_ids=list(range(ngpu)))

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt, parallel=ngpu * batch_per_device, num_inference_steps=1000).images[0]
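
(Here parallel sets the window of timesteps evaluated concurrently in each batched UNet call; ngpu * batch_per_device sizes it so each GPU handles roughly batch_per_device timesteps per call.)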

@alexblattner

@AndyShih12 can this be used with other pipelines as well, to make them concurrent?

@AndyShih12 (Contributor Author)

Yes, ParaDiGMS should be compatible with most pipelines; just change the denoising loop to the parallel denoising loop. But each pipeline lives in its own file, so we would have to make a parallel version of each pipeline separately.
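
For intuition, here is a toy sketch of the parallel denoising idea from the paper: guess a whole window of the trajectory at once, then refine it with Picard fixed-point iterations, where all function evaluations within an iteration are independent and can be batched. The names (drift, picard_solve) and the toy drift function are made up for illustration; the real pipeline additionally tracks per-step error and slides the window forward as early steps converge.

import torch

# toy drift standing in for the diffusion model's per-step update
def drift(x, t):
    return -x * (1.0 - t)

def picard_solve(x0, t_grid, num_iters=8):
    # guess the whole trajectory at once: every point starts at x0
    xs = x0.unsqueeze(0).repeat(len(t_grid), 1)
    dt = t_grid[1] - t_grid[0]
    for _ in range(num_iters):
        # all evaluations are independent given the current guess,
        # so they can run as one big batch across GPUs
        f = drift(xs, t_grid.unsqueeze(1))
        # Picard update: x_k <- x_0 + sum_{i<k} f(x_i, t_i) * dt
        incr = torch.cumsum(f, dim=0) * dt
        xs = torch.cat([x0.unsqueeze(0), x0.unsqueeze(0) + incr[:-1]], dim=0)
    return xs[-1]

x0 = torch.randn(4)
print(picard_solve(x0, torch.linspace(0.0, 1.0, steps=16)))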

@AndyShih12 (Contributor Author)

requesting feedback @patrickvonplaten @sayakpaul

@alexblattner

@patrickvonplaten @sayakpaul I think that there's an issue with the way pipelines are made. This pipeline will work for txt2img only, even though only one specific part needs to be modified. I think there should be a better way to mix multiple pipelines than making more and more pipelines. I am unsure what the best approach would be, but I think this would save a lot of work in the long run.

@sayakpaul (Member)

@patrickvonplaten @sayakpaul I think that there's an issue with the way pipelines are made. This pipeline will work for txt2img only, even though only one specific part needs to be modified. I think there should be a better way to mix multiple pipelines than making more and more pipelines. I am unsure what the best approach would be, but I think this would save a lot of work in the long run.

I understand your point of view. But as stated in our documentation, we want the codebase to be flexible and readable. The way we achieve that is by keeping each pipeline as self-contained as possible, taking inspiration from transformers. Cc'ing @patrickvonplaten to share more on this if needed.

@sayakpaul (Member) commented Jun 12, 2023

@AndyShih12 thanks so much for your PR.

Could you also give us some comparisons of how parallel sampling improves the efficiency of a standard pipeline, say, StableDiffusionPipeline?

I will get to reviewing the PR soon.

@HuggingFaceDocBuilderDev commented Jun 12, 2023

The documentation is not available anymore as the PR was closed or merged.

@AndyShih12
Copy link
Contributor Author

AndyShih12 commented Jun 12, 2023

@sayakpaul thank you!

Sure, parallel sampling on 8 GPUs can give a 3.1x speedup over StableDiffusionPipeline for 1000-step DDPM, and a 1.8x speedup for 200-step DDIM. There are some more details and comparisons in the paper.

[Image: results_table (speedup comparisons from the paper)]

Here is what we can expect the speedup to be when using fewer GPUs, on 1000-step DDPM.

[Image: paraddpm (speedup vs. number of GPUs, 1000-step DDPM)]

Here is a script to compare running times. As we can see above, it's important to run with multiple GPUs.

import torch
from diffusers import DDPMParallelScheduler, DDIMParallelScheduler  # DDIMParallelScheduler can be swapped in for the 200-step DDIM comparison
from diffusers import StableDiffusionParadigmsPipeline, StableDiffusionPipeline

scheduler = DDPMParallelScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")
prompt = "a photo of an astronaut riding a horse on mars"

sequential_pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", scheduler=scheduler, torch_dtype=torch.float16)
sequential_pipe = sequential_pipe.to("cuda")

ngpu = torch.cuda.device_count()
parallel_pipe = StableDiffusionParadigmsPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", scheduler=scheduler, torch_dtype=torch.float16)
parallel_pipe = parallel_pipe.to("cuda")
# replicate the UNet across all GPUs so batched timestep evaluations run in parallel
parallel_pipe.wrapped_unet = torch.nn.DataParallel(parallel_pipe.unet, device_ids=list(range(ngpu)))

num_inference_steps, batch_per_device = 1000, 5
# warmup
_ = sequential_pipe(prompt, num_inference_steps=10).images[0]
_ = parallel_pipe(prompt, parallel=ngpu * batch_per_device, num_inference_steps=10).images[0]
# run
image_sequential = sequential_pipe(prompt, num_inference_steps=num_inference_steps).images[0]
image_parallel = parallel_pipe(prompt, parallel=ngpu * batch_per_device, num_inference_steps=num_inference_steps).images[0]
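
One note: the script above runs both pipelines but never prints timings. Continuing the script, a standard way to time the two runs (plain PyTorch plus the stdlib, nothing specific to this PR) is:

import time

def timed(fn):
    # synchronize so we measure completed GPU work, not just kernel launches
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn()
    torch.cuda.synchronize()
    return out, time.perf_counter() - start

image_sequential, t_seq = timed(lambda: sequential_pipe(prompt, num_inference_steps=num_inference_steps).images[0])
image_parallel, t_par = timed(lambda: parallel_pipe(prompt, parallel=ngpu * batch_per_device, num_inference_steps=num_inference_steps).images[0])
print(f"sequential: {t_seq:.1f}s | parallel: {t_par:.1f}s | speedup: {t_seq / t_par:.2f}x")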

@sayakpaul (Member)

Thanks for providing the info!

Sure, parallel sampling on 8 GPUs can give a 3.1x speedup over StableDiffusionPipeline for 1000-step DDPM, and a 1.8x speedup for 200-step DDIM.

Do you mean that if we use the pipeline without parallel sampling on StableDiffusionPipeline and run inference with 8 GPUs for 1000-step DDPM, it's 3.1x slower? That's quite some speedup!

Additionally, I wanted to make you aware of our distributed inference support in case you aren't: https://huggingface.co/docs/diffusers/main/en/training/distributed_inference. It would be good to have a direct comparison of the timings using the setup used in our docs as well.

Nonetheless, I am reviewing your PR now.

@AndyShih12 (Contributor Author)

Ok, I've refactored the schedulers so that DDPMParallelScheduler and DDIMParallelScheduler are now separate classes! @patrickvonplaten

Comment on lines 241 to 246:

# if t is a tensor, match the number of dimensions of sample
if isinstance(t, torch.Tensor):
    num_dims = len(sample.shape)
    # pad t with 1s to match num_dims
    t = t.reshape(-1, *(1,) * (num_dims - 1))

Member:

Could I get a brief explanation of why this was needed?

Contributor:

Also curious here

@AndyShih12 (Contributor Author) commented Jun 16, 2023

Typically timestep t is a scalar, and sample has shape e.g. (batch, 3, 8, 8).
For parallel sampling, t has shape (batch,) and sample has shape (batch, 3, 8, 8).

When running net.forward(sample, t, ...) with the typical UNet, both versions run fine.
But the way this test model is implemented, it errors when broadcasting the dimensions for parallel sampling, so I'm broadcasting manually.
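
Concretely, with toy shapes (this is just an illustration, not code from the PR):

import torch

sample = torch.randn(4, 3, 8, 8)  # (batch, 3, 8, 8)
t = torch.rand(4)                 # per-sample timesteps, shape (batch,)

# pad t with singleton dims: (batch,) -> (batch, 1, 1, 1)
t = t.reshape(-1, *(1,) * (sample.ndim - 1))
out = sample * (1.0 - t)          # broadcasts cleanly to (batch, 3, 8, 8)
print(out.shape)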

@patrickvonplaten (Contributor) left a comment

Nice, I think we're close to getting this one in!

@AndyShih12 (Contributor Author)

Awesome, I made the new changes so that scheduling_ddim.py is now untouched! Let me know if there is anything else!

@patrickvonplaten (Contributor)

It looks like a couple of scheduler tests are now failing:

FAILED tests/schedulers/test_scheduler_euler_ancestral.py::EulerAncestralDiscreteSchedulerTest::test_full_loop_device - assert 2.4515058968472374 < 0.01
 +  where 2.4515058968472374 = abs((154.77070589684723 - 152.3192))
 +    where 154.77070589684723 = <built-in method item of Tensor object at 0x7f19eaf3e9a0>()
 +      where <built-in method item of Tensor object at 0x7f19eaf3e9a0> = tensor(154.7707, dtype=torch.float64).item
FAILED tests/schedulers/test_scheduler_euler_ancestral.py::EulerAncestralDiscreteSchedulerTest::test_full_loop_no_noise - assert 2.4515058968472374 < 0.01
 +  where 2.4515058968472374 = abs((154.77070589684723 - 152.3192))
 +    where 154.77070589684723 = <built-in method item of Tensor object at 0x7f19eb100b80>()
 +      where <built-in method item of Tensor object at 0x7f19eb100b80> = tensor(154.7707, dtype=torch.float64).item
FAILED tests/schedulers/test_scheduler_euler_ancestral.py::EulerAncestralDiscreteSchedulerTest::test_full_loop_with_v_prediction - assert 4.660559621037578 < 0.01
 +  where 4.660559621037578 = abs((103.78334037896242 - 108.4439))
 +    where 103.78334037896242 = <built-in method item of Tensor object at 0x7f19dfd93bd0>()
 +      where <built-in method item of Tensor object at 0x7f19dfd93bd0> = tensor(103.7833, dtype=torch.float64).item
FAILED tests/schedulers/test_scheduler_kdpm2_ancestral.py::KDPM2AncestralDiscreteSchedulerTest::test_full_loop_device - assert 165.99396548951336 < 0.1
 +  where 165.99396548951336 = abs((14015.375765489513 - 13849.3818))
 +    where 14015.375765489513 = <built-in method item of Tensor object at 0x7f19ead8f360>()
 +      where <built-in method item of Tensor object at 0x7f19ead8f360> = tensor(14015.3758, dtype=torch.float64).item
FAILED tests/schedulers/test_scheduler_kdpm2_ancestral.py::KDPM2AncestralDiscreteSchedulerTest::test_full_loop_no_noise - assert 165.98806548951325 < 0.01
 +  where 165.98806548951325 = abs((14015.375765489513 - 13849.3877))
 +    where 14015.375765489513 = <built-in method item of Tensor object at 0x7f19eb0ac310>()
 +      where <built-in method item of Tensor object at 0x7f19eb0ac310> = tensor(14015.3758, dtype=torch.float64).item
FAILED tests/schedulers/test_scheduler_kdpm2_ancestral.py::KDPM2AncestralDiscreteSchedulerTest::test_full_loop_with_v_prediction - assert 4.016957222634858 < 0.01
 +  where 4.016957222634858 = abs((333.0139572226349 - 328.997))
 +    where 333.0139572226349 = <built-in method item of Tensor object at 0x7f19dfa89e50>()
 +      where <built-in method item of Tensor object at 0x7f19dfa89e50> = tensor(333.0140, dtype=torch.float64).item

Could we try to fix those before merging? :-)

@sayakpaul (Member)

I ran the failing test cases (test_scheduler_kdpm2_ancestral.py and test_scheduler_euler_ancestral.py) from the main branch of diffusers. They passed, whereas when I ran them from @AndyShih12's branch, they didn't.

@AndyShih12, let's maybe try to fix the failing tests before merging :)

@AndyShih12 (Contributor Author)

Oh, thank you for the catch; I had assumed the test issue was due to the main branch. Sorry for not testing it thoroughly.

Indeed, I fixed it by properly casting the type/device of the new tensor in the dummy model.
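
For reference, this is the kind of one-line fix meant here (hypothetical, since the dummy-model code isn't shown in this thread):

# hypothetical: match the padded timestep tensor to the sample before use
t = t.reshape(-1, *(1,) * (sample.ndim - 1)).to(device=sample.device, dtype=sample.dtype)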

@sayakpaul (Member)

The failing test seems to be unrelated. Let me update the branch once and rerun the workflow.

@sayakpaul (Member)

Thanks for your great contribution!

@sayakpaul merged commit 73b125d into huggingface:main on Jun 20, 2023
@AndyShih12 (Contributor Author)

Thank you both for all the help!

yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
[Pipeline] Add new pipeline for ParaDiGMS -- parallel sampling of diffusion models (huggingface#3716)

* add paradigms parallel sampling pipeline

* linting

* ran make fix-copies

* add paradigms parallel sampling pipeline

* linting

* ran make fix-copies

* Apply suggestions from code review

Co-authored-by: Sayak Paul <[email protected]>

* changes based on review

* add docs for paradigms

* update docs with paradigms abstract

* improve documentation, and add tests for ddim/ddpm batch_step_no_noise

* fix docs and run make fix-copies

* minor changes to docs.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <[email protected]>

* move parallel scheduler to new classes for DDPMParallelScheduler and DDIMParallelScheduler

* remove changes for scheduling_ddim, adjust licenses, credits, and commented code

* fix tensor type that is breaking tests

---------

Co-authored-by: Sayak Paul <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
[Pipeline] Add new pipeline for ParaDiGMS -- parallel sampling of diffusion models (huggingface#3716)