
[Pipeline] Add new pipeline for ParaDiGMS -- parallel sampling of diffusion models #3716


Merged: 25 commits into huggingface:main, Jun 20, 2023

Conversation

@AndyShih12 (Contributor) commented Jun 8, 2023

This pull request implements the paper Parallel Sampling of Diffusion Models: https://arxiv.org/abs/2305.16317
Based on the repository: https://github.com/AndyShih12/paradigms

Example of use:

import torch
from diffusers import DDPMParallelScheduler
from diffusers import StableDiffusionParadigmsPipeline

scheduler = DDPMParallelScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")

pipe = StableDiffusionParadigmsPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

# replicate the UNet across all available GPUs so the batched
# timestep evaluations can run in parallel
ngpu, batch_per_device = torch.cuda.device_count(), 5
pipe.wrapped_unet = torch.nn.DataParallel(pipe.unet, device_ids=list(range(ngpu)))

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt, parallel=ngpu * batch_per_device, num_inference_steps=1000).images[0]
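
(Here parallel sets the window of timesteps evaluated concurrently in each batched UNet call; ngpu * batch_per_device sizes it so each GPU handles roughly batch_per_device timesteps per call.)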

@alexblattner

@AndyShih12 can this be used with other pipelines as well, to make them concurrent?

@AndyShih12 (Contributor Author)

Yes, ParaDiGMS should be compatible with most pipelines; just change the denoising loop to the parallel denoising loop. But each pipeline lives in its own file, so we would have to make a parallel version of each pipeline separately.
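
For intuition, here is a toy sketch of the parallel denoising idea from the paper: guess a whole window of the trajectory at once, then refine it with Picard fixed-point iterations, where all function evaluations within an iteration are independent and can be batched. The names (drift, picard_solve) and the toy drift function are made up for illustration; the real pipeline additionally tracks per-step error and slides the window forward as early steps converge.

import torch

# toy drift standing in for the diffusion model's per-step update
def drift(x, t):
    return -x * (1.0 - t)

def picard_solve(x0, t_grid, num_iters=8):
    # guess the whole trajectory at once: every point starts at x0
    xs = x0.unsqueeze(0).repeat(len(t_grid), 1)
    dt = t_grid[1] - t_grid[0]
    for _ in range(num_iters):
        # all evaluations are independent given the current guess,
        # so they can run as one big batch across GPUs
        f = drift(xs, t_grid.unsqueeze(1))
        # Picard update: x_k <- x_0 + sum_{i<k} f(x_i, t_i) * dt
        incr = torch.cumsum(f, dim=0) * dt
        xs = torch.cat([x0.unsqueeze(0), x0.unsqueeze(0) + incr[:-1]], dim=0)
    return xs[-1]

x0 = torch.randn(4)
print(picard_solve(x0, torch.linspace(0.0, 1.0, steps=16)))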

@AndyShih12 (Contributor Author)

requesting feedback @patrickvonplaten @sayakpaul

@alexblattner

@patrickvonplaten @sayakpaul I think that there's an issue with the way pipelines are made. This pipeline will work for txt2img only, even though only one specific part needs to be modified. I think there should be a better way to mix multiple pipelines than making more and more pipelines. I am unsure what the best approach would be, but I think this would save a lot of work in the long run.

@sayakpaul (Member)

@patrickvonplaten @sayakpaul I think that there's an issue with the way pipelines are made. This pipeline will work for txt2img only, even though only one specific part needs to be modified. I think there should be a better way to mix multiple pipelines than making more and more pipelines. I am unsure what the best approach would be, but I think this would save a lot of work in the long run.

I understand your point of view. But as stated in our documentation, we want the codebase to be flexible and readable. The way we achieve that is by keeping each pipeline as self-contained as possible, taking inspiration from transformers. Cc'ing @patrickvonplaten to share more on this if needed.

@sayakpaul (Member) commented Jun 12, 2023

@AndyShih12 thanks so much for your PR.

Could you also give us some comparisons of how parallel sampling improves the efficiency of a standard pipeline, say, StableDiffusionPipeline?

I will get to reviewing the PR soon.

@HuggingFaceDocBuilderDev commented Jun 12, 2023

The documentation is not available anymore as the PR was closed or merged.

@AndyShih12
Copy link
Contributor Author

AndyShih12 commented Jun 12, 2023

@sayakpaul thank you!

Sure, parallel sampling on 8 GPUs can give a 3.1x speedup over StableDiffusionPipeline for 1000-step DDPM, and a 1.8x speedup for 200-step DDIM. There are some more details and comparisons in the paper.

[Image: results_table (speedup comparisons from the paper)]

Here is what we can expect the speedup to be when using fewer GPUs, on 1000-step DDPM.

[Image: paraddpm (speedup vs. number of GPUs, 1000-step DDPM)]

Here is a script to compare running times. As we can see above, it's important to run with multiple GPUs.

import torch
from diffusers import DDPMParallelScheduler, DDIMParallelScheduler  # DDIMParallelScheduler can be swapped in for the 200-step DDIM comparison
from diffusers import StableDiffusionParadigmsPipeline, StableDiffusionPipeline

scheduler = DDPMParallelScheduler.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="scheduler")
prompt = "a photo of an astronaut riding a horse on mars"

sequential_pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", scheduler=scheduler, torch_dtype=torch.float16)
sequential_pipe = sequential_pipe.to("cuda")

ngpu = torch.cuda.device_count()
parallel_pipe = StableDiffusionParadigmsPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", scheduler=scheduler, torch_dtype=torch.float16)
parallel_pipe = parallel_pipe.to("cuda")
# replicate the UNet across all GPUs so batched timestep evaluations run in parallel
parallel_pipe.wrapped_unet = torch.nn.DataParallel(parallel_pipe.unet, device_ids=list(range(ngpu)))

num_inference_steps, batch_per_device = 1000, 5
# warmup
_ = sequential_pipe(prompt, num_inference_steps=10).images[0]
_ = parallel_pipe(prompt, parallel=ngpu * batch_per_device, num_inference_steps=10).images[0]
# run
image_sequential = sequential_pipe(prompt, num_inference_steps=num_inference_steps).images[0]
image_parallel = parallel_pipe(prompt, parallel=ngpu * batch_per_device, num_inference_steps=num_inference_steps).images[0]
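
One note: the script above runs both pipelines but never prints timings. Continuing the script, a standard way to time the two runs (plain PyTorch plus the stdlib, nothing specific to this PR) is:

import time

def timed(fn):
    # synchronize so we measure completed GPU work, not just kernel launches
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = fn()
    torch.cuda.synchronize()
    return out, time.perf_counter() - start

image_sequential, t_seq = timed(lambda: sequential_pipe(prompt, num_inference_steps=num_inference_steps).images[0])
image_parallel, t_par = timed(lambda: parallel_pipe(prompt, parallel=ngpu * batch_per_device, num_inference_steps=num_inference_steps).images[0])
print(f"sequential: {t_seq:.1f}s | parallel: {t_par:.1f}s | speedup: {t_seq / t_par:.2f}x")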

@sayakpaul (Member)

Thanks for providing the info!

Sure, parallel sampling on 8 GPUs can give a 3.1x speedup over StableDiffusionPipeline for 1000-step DDPM, and a 1.8x speedup for 200-step DDIM.

Do you mean that if we use the pipeline without parallel sampling on StableDiffusionPipeline and run inference with 8 GPUs for 1000-step DDPM, it's 3.1x slower? That's quite some speedup!

Additionally, I wanted to make you aware of our distributed inference support in case you aren't: https://huggingface.co/docs/diffusers/main/en/training/distributed_inference. It would be good to have a direct comparison of the timings using the setup used in our docs as well.

Nonetheless, I am reviewing your PR now.

@AndyShih12 (Contributor Author)

Ok, I've refactored the schedulers so that DDPMParallelScheduler and DDIMParallelScheduler are now separate classes! @patrickvonplaten

Comment on lines 241 to 246:

# if t is a tensor, match the number of dimensions of sample
if isinstance(t, torch.Tensor):
    num_dims = len(sample.shape)
    # pad t with 1s to match num_dims
    t = t.reshape(-1, *(1,) * (num_dims - 1))

Member:

Could I get a brief explanation of why this was needed?

Contributor:

Also curious here

@AndyShih12 (Contributor Author) commented Jun 16, 2023

Typically timestep t is a scalar, and sample has shape e.g. (batch, 3, 8, 8).
For parallel sampling, t has shape (batch,) and sample has shape (batch, 3, 8, 8).

When running net.forward(sample, t, ...) with the typical UNet, both versions run fine.
But the way this test model is implemented, it errors when broadcasting the dimensions for parallel sampling, so I'm broadcasting manually.
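
Concretely, with toy shapes (this is just an illustration, not code from the PR):

import torch

sample = torch.randn(4, 3, 8, 8)  # (batch, 3, 8, 8)
t = torch.rand(4)                 # per-sample timesteps, shape (batch,)

# pad t with singleton dims: (batch,) -> (batch, 1, 1, 1)
t = t.reshape(-1, *(1,) * (sample.ndim - 1))
out = sample * (1.0 - t)          # broadcasts cleanly to (batch, 3, 8, 8)
print(out.shape)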

@patrickvonplaten (Contributor) left a comment

Nice, I think we're close to getting this one in!

@AndyShih12 (Contributor Author)

Awesome, I made the new changes so that scheduling_ddim.py is now untouched! Let me know if there is anything else!

@patrickvonplaten (Contributor)

It looks like a couple of scheduler tests are now failing:

FAILED tests/schedulers/test_scheduler_euler_ancestral.py::EulerAncestralDiscreteSchedulerTest::test_full_loop_device - assert 2.4515058968472374 < 0.01
 +  where 2.4515058968472374 = abs((154.77070589684723 - 152.3192))
 +    where 154.77070589684723 = <built-in method item of Tensor object at 0x7f19eaf3e9a0>()
 +      where <built-in method item of Tensor object at 0x7f19eaf3e9a0> = tensor(154.7707, dtype=torch.float64).item
FAILED tests/schedulers/test_scheduler_euler_ancestral.py::EulerAncestralDiscreteSchedulerTest::test_full_loop_no_noise - assert 2.4515058968472374 < 0.01
 +  where 2.4515058968472374 = abs((154.77070589684723 - 152.3192))
 +    where 154.77070589684723 = <built-in method item of Tensor object at 0x7f19eb100b80>()
 +      where <built-in method item of Tensor object at 0x7f19eb100b80> = tensor(154.7707, dtype=torch.float64).item
FAILED tests/schedulers/test_scheduler_euler_ancestral.py::EulerAncestralDiscreteSchedulerTest::test_full_loop_with_v_prediction - assert 4.660559621037578 < 0.01
 +  where 4.660559621037578 = abs((103.78334037896242 - 108.4439))
 +    where 103.78334037896242 = <built-in method item of Tensor object at 0x7f19dfd93bd0>()
 +      where <built-in method item of Tensor object at 0x7f19dfd93bd0> = tensor(103.7833, dtype=torch.float64).item
FAILED tests/schedulers/test_scheduler_kdpm2_ancestral.py::KDPM2AncestralDiscreteSchedulerTest::test_full_loop_device - assert 165.99396548951336 < 0.1
 +  where 165.99396548951336 = abs((14015.375765489513 - 13849.3818))
 +    where 14015.375765489513 = <built-in method item of Tensor object at 0x7f19ead8f360>()
 +      where <built-in method item of Tensor object at 0x7f19ead8f360> = tensor(14015.3758, dtype=torch.float64).item
FAILED tests/schedulers/test_scheduler_kdpm2_ancestral.py::KDPM2AncestralDiscreteSchedulerTest::test_full_loop_no_noise - assert 165.98806548951325 < 0.01
 +  where 165.98806548951325 = abs((14015.375765489513 - 13849.3877))
 +    where 14015.375765489513 = <built-in method item of Tensor object at 0x7f19eb0ac310>()
 +      where <built-in method item of Tensor object at 0x7f19eb0ac310> = tensor(14015.3758, dtype=torch.float64).item
FAILED tests/schedulers/test_scheduler_kdpm2_ancestral.py::KDPM2AncestralDiscreteSchedulerTest::test_full_loop_with_v_prediction - assert 4.016957222634858 < 0.01
 +  where 4.016957222634858 = abs((333.0139572226349 - 328.997))
 +    where 333.0139572226349 = <built-in method item of Tensor object at 0x7f19dfa89e50>()
 +      where <built-in method item of Tensor object at 0x7f19dfa89e50> = tensor(333.0140, dtype=torch.float64).item

Could we try to fix those before merging? :-)

@sayakpaul (Member)

I ran the failing test cases (test_scheduler_kdpm2_ancestral.py and test_scheduler_euler_ancestral.py) from the main branch of diffusers. They passed, whereas when I ran them from @AndyShih12's branch, they didn't.

@AndyShih12, let's maybe try to fix the failing tests before merging :)

@AndyShih12 (Contributor Author)

Oh, thank you for the catch; I had assumed the test issue was due to the main branch. Sorry for not testing it thoroughly.

Indeed, I fixed it by properly casting the type/device of the new tensor in the dummy model.
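
For reference, this is the kind of one-line fix meant here (hypothetical, since the dummy-model code isn't shown in this thread):

# hypothetical: match the padded timestep tensor to the sample before use
t = t.reshape(-1, *(1,) * (sample.ndim - 1)).to(device=sample.device, dtype=sample.dtype)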

@sayakpaul (Member)

The failing test seems to be unrelated. Let me update the branch once and rerun the workflow.

@sayakpaul (Member)

Thanks for your great contribution!

@sayakpaul merged commit 73b125d into huggingface:main on Jun 20, 2023
@AndyShih12 (Contributor Author)

Thank you both for all the help!

yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
[Pipeline] Add new pipeline for ParaDiGMS -- parallel sampling of diffusion models (huggingface#3716)

* add paradigms parallel sampling pipeline

* linting

* ran make fix-copies

* add paradigms parallel sampling pipeline

* linting

* ran make fix-copies

* Apply suggestions from code review

Co-authored-by: Sayak Paul <[email protected]>

* changes based on review

* add docs for paradigms

* update docs with paradigms abstract

* improve documentation, and add tests for ddim/ddpm batch_step_no_noise

* fix docs and run make fix-copies

* minor changes to docs.

* Apply suggestions from code review

Co-authored-by: Patrick von Platen <[email protected]>

* move parallel scheduler to new classes for DDPMParallelScheduler and DDIMParallelScheduler

* remove changes for scheduling_ddim, adjust licenses, credits, and commented code

* fix tensor type that is breaking tests

---------

Co-authored-by: Sayak Paul <[email protected]>
Co-authored-by: Patrick von Platen <[email protected]>
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024
[Pipeline] Add new pipeline for ParaDiGMS -- parallel sampling of diffusion models (huggingface#3716)