Proposal to address precision issues in CI #4775

DN6 · 2023-08-25T07:05:20Z

What does this PR do?

Looking through the precision related tests Push failures, I noticed that most of the 85 failures are coming from tests defined in the PipelineMixin class. With 30 failures coming from test_float16_inference The reason for this is that all Pipeline tests inherit from this Mixin class, and run their tests with the default precision tolerance for the test.

Given that most pipelines on Slow Tests are using different checkpoints, they tend to produce very different levels of precision error.
e.g. (notice that Panorama, IF, and Kandinsky have different scales of error)

FAILED tests/pipelines/stable_diffusion/test_stable_diffusion_panorama.py::StableDiffusionPanoramaPipelineFastTests::test_float16_inference - AssertionError: 0.9291724 not less than 0.01 : The outputs of the fp16 and fp32 pipelines are too different.
FAILED tests/pipelines/deepfloyd_if/test_if_img2img_superresolution.py::IFImg2ImgSuperResolutionPipelineFastTests::test_float16_inference - AssertionError: 0.010922372 not less than 0.01 : The outputs of the fp16 and fp32 pipelines are too different.
FAILED tests/pipelines/kandinsky_v22/test_kandinsky.py::KandinskyV22PipelineFastTests::test_float16_inference - AssertionError: 0.05452037 not less than 0.01 : The outputs of the fp16 and fp32 pipelines are too different.
========================================================================== 30 failed, 63 warnings in 30.36s ==========================================================================

Rather than hardcoding these precision values into the individual tests, I am proposing that we use a measurement metric that is slightly more robust to the small numerical fluctuations caused by GPU Non-Determinism. Using cosine similarity distance between the generated outputs of the pipelines and our expected output to assess similarity.

def numpy_cosine_similarity_distance(generated_output, expected_output):
    similarity = np.dot(generated_output, expected_output) / (norm(generated_output) * norm(expected_output))
    distance = 1.0 - similarity.mean()

    return distance

Results from running the same tests with Cosine Similarity Distance and the default precision tolerance

FAILED tests/pipelines/stable_diffusion/test_stable_diffusion_pix2pix_zero.py::StableDiffusionPix2PixZeroPipelineFastTests::test_float16_inference - AssertionError: 0.08929622173309326 not less than 0.01 : The outputs of the fp16 and fp32 pipelines are too different.
FAILED tests/pipelines/stable_diffusion/test_stable_diffusion_model_editing.py::StableDiffusionModelEditingPipelineFastTests::test_float16_inference - AssertionError: 0.08928412199020386 not less than 0.01 : The outputs of the fp16 and fp32 pipelines are too different.
FAILED tests/pipelines/shap_e/test_shap_e_img2img.py::ShapEImg2ImgPipelineFastTests::test_float16_inference - RuntimeError: cumsum_cuda_kernel does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this ...
FAILED tests/pipelines/stable_diffusion/test_stable_diffusion_panorama.py::StableDiffusionPanoramaPipelineFastTests::test_float16_inference - AssertionError: 0.09012287855148315 not less than 0.01 : The outputs of the fp16 and fp32 pipelines are too different.
===================================================================== 4 failed, 26 passed, 63 warnings in 29.90s =====================================================================

To test the sensitivity of Cosine Similarity I ran a quick test, by adding gradually increasing levels of noise to a flattened image numpy array and measured the Max Abs Distance (what we use now) and the Cosine Similarity

Notice that at low noise levels (caused by GPU non-determinism) Cosine Similarity is relatively flat.

At higher noise levels Cosine Similarity does start to change more significantly.

Colab Notebook with Full Analysis:
https://colab.research.google.com/gist/DN6/fcbc10649977fbecd604e5feef1f5b05/cosine-similarity-tests.ipynb

Moving other precision related tests to use Cosine Distance will likely lead to a big reduction in flaky test behaviour due to GPU no determinism. Opening this PR for thoughts and discussion.

Fixes # (issue)

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Did you read our philosophy doc (important for complex PRs)?
Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the
documentation guidelines, and
here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2023-08-25T07:13:21Z

The documentation is not available anymore as the PR was closed or merged.

patrickvonplaten · 2023-08-25T09:08:29Z

I like the idea! Let's go for it

patrickvonplaten · 2023-08-25T09:08:46Z

Could you run make style real quick? :-)

DN6 · 2023-08-25T12:21:14Z

@patrickvonplaten Merging just the changes that will affect the test_float16_inference tests. It should reduce approximately 30 failing slow tests. I'll expand to other tests in a follow up PR.

* proposal for flaky tests * clean up

DN6 added 2 commits August 25, 2023 02:38

proposal for flaky tests

cf19966

Merge branch 'main' into test-cleanup-fp16-fix

27ad77d

DN6 requested a review from patrickvonplaten August 25, 2023 07:05

patrickvonplaten approved these changes Aug 25, 2023

View reviewed changes

clean up

2b5e630

DN6 merged commit 3bba44d into main Aug 25, 2023

DN6 changed the title ~~[WIP ] Proposal to address precision issues in CI~~ Proposal to address precision issues in CI Aug 25, 2023

kashif deleted the test-cleanup-fp16-fix branch September 11, 2023 19:06

yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023

[WIP ] Proposal to address precision issues in CI (huggingface#4775)

b71f848

* proposal for flaky tests * clean up

AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024

[WIP ] Proposal to address precision issues in CI (huggingface#4775)

4d4ae9e

* proposal for flaky tests * clean up

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Proposal to address precision issues in CI #4775

Proposal to address precision issues in CI #4775

Uh oh!

DN6 commented Aug 25, 2023

Uh oh!

HuggingFaceDocBuilderDev commented Aug 25, 2023 •

edited

Loading

Uh oh!

patrickvonplaten commented Aug 25, 2023

Uh oh!

patrickvonplaten commented Aug 25, 2023

Uh oh!

DN6 commented Aug 25, 2023

Uh oh!

Uh oh!

Proposal to address precision issues in CI #4775

Proposal to address precision issues in CI #4775

Uh oh!

Conversation

DN6 commented Aug 25, 2023

What does this PR do?

Before submitting

Who can review?

Uh oh!

HuggingFaceDocBuilderDev commented Aug 25, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

patrickvonplaten commented Aug 25, 2023

Uh oh!

patrickvonplaten commented Aug 25, 2023

Uh oh!

DN6 commented Aug 25, 2023

Uh oh!

Uh oh!

HuggingFaceDocBuilderDev commented Aug 25, 2023 •

edited

Loading