
Proposal to address precision issues in CI #4775


Merged: 3 commits into main, Aug 25, 2023

Conversation

DN6
Collaborator

@DN6 DN6 commented Aug 25, 2023

What does this PR do?

Looking through the precision-related test failures on Push, I noticed that most of the 85 failures come from tests defined in the PipelineMixin class, with 30 of them coming from test_float16_inference. The reason is that all pipeline tests inherit from this mixin class and run with the default precision tolerance for the test.

Given that most pipelines in the slow tests use different checkpoints, they tend to produce very different levels of precision error, e.g. (notice that Panorama, IF, and Kandinsky have different scales of error):

FAILED tests/pipelines/stable_diffusion/test_stable_diffusion_panorama.py::StableDiffusionPanoramaPipelineFastTests::test_float16_inference - AssertionError: 0.9291724 not less than 0.01 : The outputs of the fp16 and fp32 pipelines are too different.
FAILED tests/pipelines/deepfloyd_if/test_if_img2img_superresolution.py::IFImg2ImgSuperResolutionPipelineFastTests::test_float16_inference - AssertionError: 0.010922372 not less than 0.01 : The outputs of the fp16 and fp32 pipelines are too different.
FAILED tests/pipelines/kandinsky_v22/test_kandinsky.py::KandinskyV22PipelineFastTests::test_float16_inference - AssertionError: 0.05452037 not less than 0.01 : The outputs of the fp16 and fp32 pipelines are too different.
========================================================================== 30 failed, 63 warnings in 30.36s ==========================================================================

Rather than hardcoding these precision values into the individual tests, I am proposing that we use a measurement metric that is slightly more robust to the small numerical fluctuations caused by GPU non-determinism: the cosine similarity distance between a pipeline's generated output and the expected output.

import numpy as np
from numpy.linalg import norm

def numpy_cosine_similarity_distance(generated_output, expected_output):
    # Cosine similarity between the flattened generated and expected outputs
    similarity = np.dot(generated_output, expected_output) / (norm(generated_output) * norm(expected_output))
    # A distance of 0.0 means the outputs point in exactly the same direction
    distance = 1.0 - similarity.mean()

    return distance
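As a quick sanity check of the metric, here is a minimal, self-contained sketch (the helper is repeated so the snippet runs on its own; the fp16/fp32 outputs are simulated with a hypothetical random array rather than a real pipeline):

```python
import numpy as np
from numpy.linalg import norm


def numpy_cosine_similarity_distance(generated_output, expected_output):
    # Cosine similarity of the two flattened 1-D outputs
    similarity = np.dot(generated_output, expected_output) / (norm(generated_output) * norm(expected_output))
    return 1.0 - similarity.mean()


# Hypothetical pipeline output: a flattened 64x64 RGB image in [0, 1)
output_fp32 = np.random.RandomState(0).rand(3 * 64 * 64).astype(np.float32)
# Simulate fp16 inference by round-tripping through half precision
output_fp16 = output_fp32.astype(np.float16).astype(np.float32)

max_abs_diff = np.abs(output_fp32 - output_fp16).max()
cosine_distance = numpy_cosine_similarity_distance(output_fp32, output_fp16)
print(f"max abs diff: {max_abs_diff:.2e}, cosine distance: {cosine_distance:.2e}")
```

With pure rounding error like this, the cosine distance stays orders of magnitude below the 0.01 threshold used in the failing tests above.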

Results from running the same tests with cosine similarity distance and the default precision tolerance:

FAILED tests/pipelines/stable_diffusion/test_stable_diffusion_pix2pix_zero.py::StableDiffusionPix2PixZeroPipelineFastTests::test_float16_inference - AssertionError: 0.08929622173309326 not less than 0.01 : The outputs of the fp16 and fp32 pipelines are too different.
FAILED tests/pipelines/stable_diffusion/test_stable_diffusion_model_editing.py::StableDiffusionModelEditingPipelineFastTests::test_float16_inference - AssertionError: 0.08928412199020386 not less than 0.01 : The outputs of the fp16 and fp32 pipelines are too different.
FAILED tests/pipelines/shap_e/test_shap_e_img2img.py::ShapEImg2ImgPipelineFastTests::test_float16_inference - RuntimeError: cumsum_cuda_kernel does not have a deterministic implementation, but you set 'torch.use_deterministic_algorithms(True)'. You can turn off determinism just for this ...
FAILED tests/pipelines/stable_diffusion/test_stable_diffusion_panorama.py::StableDiffusionPanoramaPipelineFastTests::test_float16_inference - AssertionError: 0.09012287855148315 not less than 0.01 : The outputs of the fp16 and fp32 pipelines are too different.
===================================================================== 4 failed, 26 passed, 63 warnings in 29.90s ===================================================================== 

To test the sensitivity of cosine similarity, I ran a quick experiment: I added gradually increasing levels of noise to a flattened image numpy array and measured the max abs distance (what we use now) and the cosine similarity distance.

Notice that at low noise levels (comparable to those caused by GPU non-determinism), cosine similarity distance stays relatively flat.

(figure: csd-low-noise)

At higher noise levels, cosine similarity distance does start to change more significantly.

(figure: csd-high-noise)

Colab Notebook with Full Analysis:
https://colab.research.google.com/gist/DN6/fcbc10649977fbecd604e5feef1f5b05/cosine-similarity-tests.ipynb
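The linked notebook has the full analysis; the core of the experiment can be sketched like this (the image is stood in for by a hypothetical random array, and the noise scales are illustrative, not the exact values from the notebook):

```python
import numpy as np
from numpy.linalg import norm


def numpy_cosine_similarity_distance(generated_output, expected_output):
    similarity = np.dot(generated_output, expected_output) / (norm(generated_output) * norm(expected_output))
    return 1.0 - similarity.mean()


rng = np.random.RandomState(0)
image = rng.rand(3 * 64 * 64).astype(np.float64)  # stand-in for a flattened pipeline output

# Sweep gradually increasing Gaussian noise and compare the two metrics
max_abs_vals, cos_vals = [], []
for noise_scale in [1e-4, 1e-3, 1e-2, 1e-1]:
    noisy = image + rng.normal(0.0, noise_scale, size=image.shape)
    max_abs = np.abs(image - noisy).max()          # what the tests use today
    cos_dist = numpy_cosine_similarity_distance(image, noisy)
    max_abs_vals.append(max_abs)
    cos_vals.append(cos_dist)
    print(f"noise={noise_scale:.0e}  max_abs={max_abs:.5f}  cosine_distance={cos_dist:.8f}")
```

At the smallest noise scale the max abs distance is already in the 1e-4 range while the cosine distance stays many orders of magnitude smaller; only at the larger scales does the cosine distance grow appreciably, matching the flat-then-rising behaviour in the plots.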

Moving other precision-related tests to cosine distance will likely lead to a big reduction in flaky test behaviour due to GPU non-determinism. Opening this PR for thoughts and discussion.

Fixes # (issue)

Before submitting

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@DN6 DN6 requested a review from patrickvonplaten August 25, 2023 07:05
@HuggingFaceDocBuilderDev

HuggingFaceDocBuilderDev commented Aug 25, 2023

The documentation is not available anymore as the PR was closed or merged.

@patrickvonplaten
Contributor

I like the idea! Let's go for it

@patrickvonplaten
Contributor

Could you run make style real quick? :-)

@DN6
Collaborator Author

DN6 commented Aug 25, 2023

@patrickvonplaten Merging just the changes that affect the test_float16_inference tests. This should fix approximately 30 failing slow tests. I'll expand to the other tests in a follow-up PR.

@DN6 DN6 merged commit 3bba44d into main Aug 25, 2023
@DN6 DN6 changed the title [WIP ] Proposal to address precision issues in CI Proposal to address precision issues in CI Aug 25, 2023
@kashif kashif deleted the test-cleanup-fp16-fix branch September 11, 2023 19:06
yoonseokjin pushed a commit to yoonseokjin/diffusers that referenced this pull request Dec 25, 2023
AmericanPresidentJimmyCarter pushed a commit to AmericanPresidentJimmyCarter/diffusers that referenced this pull request Apr 26, 2024