
Fix Flaky Euler-Maruyama Tests #6287


Merged
merged 9 commits into pymc-devs:main on Nov 19, 2022

Conversation

williambdean
Contributor

@williambdean williambdean commented Nov 11, 2022

What is this PR about?
This PR addresses the flaky tests for the EulerMaruyama class, which were brought up here and here.

This PR includes seeding the current tests and ensuring that the V4 port didn't change the functionality.
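
A minimal sketch of what seeding means here (toy model, not the actual test code): pass a fixed random_seed so the sampler output is reproducible across CI runs.

import pymc as pm

with pm.Model():
    lam = pm.Normal("lam", 0.0, 1.0)
    # with a fixed seed, repeated runs of the test see the same posterior draws
    idata = pm.sample(draws=500, tune=500, random_seed=1234, progressbar=False)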

Checklist

Major / Breaking Changes

  • ...

Bugfixes / New features

  • ...

Docs / Maintenance

  • Seeding the Euler-Maruyama linear_model test
  • Adding a space before the : in the Parameters section for better rendering
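
As context for the second item, numpydoc expects a space before the colon ("name : type") in a Parameters section so the type renders separately; a minimal illustration (hypothetical entry, not the actual diff):

Parameters
----------
sde_fn : callable
    Function returning the drift and diffusion terms of the SDE.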

@williambdean
Contributor Author

williambdean commented Nov 11, 2022

So I took a look at the differences @ricardoV94...
The linear_model test seems to be a direct port from the last tagged version of V3.
Old test -> Latest V4 test

Also, the implementation of the logp seems to be the same besides the addition of the init_dist's logp. I view this as a bug fix, but I haven't dug up that discussion. Not sure if you were involved there. EDIT: Looking at it again, I guess it's just a slightly different formulation that includes the init_dist; I had thought it was simply omitted from the logp.
Old logp -> Latest V4 logp
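
For reference, a conceptual sketch of the logp (plain NumPy/SciPy, not the actual PyMC implementation): the Euler-Maruyama logp is a sum of Normal transition terms, and the V4 version additionally scores the first value under init_dist.

import numpy as np
from scipy import stats

def em_logp(x, f, g, dt, init_dist=None):
    # transition terms: x[t+1] ~ Normal(x[t] + f(x[t]) * dt, g(x[t]) * sqrt(dt))
    mu = x[:-1] + f(x[:-1]) * dt
    sigma = g(x[:-1]) * np.sqrt(dt)
    logp = stats.norm(mu, sigma).logpdf(x[1:]).sum()
    if init_dist is not None:
        # the term added in V4: account for the first value explicitly
        logp += init_dist.logpdf(x[0])
    return logp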

Other than that, it seems the port was as intended. I view the random seeding as guarding against the small chance that the parameter used to generate the data falls outside the posterior credible interval.

@codecov

codecov bot commented Nov 11, 2022

Codecov Report

Merging #6287 (ca7b5bd) into main (4acd98e) will decrease coverage by 3.77%.
The diff coverage is 0.00%.

Additional details and impacted files


@@            Coverage Diff             @@
##             main    #6287      +/-   ##
==========================================
- Coverage   94.17%   90.39%   -3.78%     
==========================================
  Files         111      111              
  Lines       23908    23869      -39     
==========================================
- Hits        22515    21577     -938     
- Misses       1393     2292     +899     
Impacted Files Coverage Δ
pymc/distributions/timeseries.py 28.46% <ø> (-66.00%) ⬇️
pymc/tests/distributions/test_timeseries.py 0.00% <0.00%> (-99.51%) ⬇️
pymc/tests/step_methods/hmc/test_quadpotential.py 0.00% <0.00%> (-95.82%) ⬇️
pymc/model_graph.py 66.47% <0.00%> (-12.36%) ⬇️
pymc/step_methods/hmc/quadpotential.py 73.84% <0.00%> (-6.78%) ⬇️
pymc/distributions/logprob.py 61.72% <0.00%> (-5.56%) ⬇️
pymc/distributions/shape_utils.py 97.44% <0.00%> (-0.43%) ⬇️
pymc/step_methods/hmc/hmc.py 92.85% <0.00%> (-0.13%) ⬇️
... and 11 more

@williambdean
Contributor Author

williambdean commented Nov 11, 2022

I am also looking into the failure rate as mentioned in the issue. WIP

I modified the linear_model test to store the two booleans values. i.e.

{
    "lamh: (lo < lam) and (lam < hi), 
    "zh": ((lo < z) * (z < hi)).mean() > 0.95, 
}

and ran the test 50 times, storing each experiment as a row in a DataFrame.
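
Roughly like this, where run_linear_model_test is a hypothetical helper standing in for the body of the existing test:

import pandas as pd

rows = []
for _ in range(50):
    # hypothetical helper returning the interval bounds, the true lam, and the ppc draws z
    lo, hi, lam, z = run_linear_model_test()
    rows.append({
        "lamh": (lo < lam) and (lam < hi),
        "zh": ((lo < z) * (z < hi)).mean() > 0.95,
    })
df = pd.DataFrame(rows)
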
For the main branch,

>>> df.mean()
lamh    0.88
zh     1.00
dtype: float64

And with tag 3.11.5

>>> df.mean()
lamh    0.96
zh     1.00
dtype: float64

So lamh is the culprit for the assertion errors. However, there is indeed a higher failure rate on the main branch across these two samples (and potentially under the assumption of a 0.95 success rate; it takes a bit of time to run these).
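
One way to make the "0.95 success rate" comparison concrete (not something done in this PR) is a quick binomial check on the 50 runs:

from scipy import stats

# main branch: 44/50 runs passed the lamh check; v3.11.5: 48/50
print(stats.binomtest(k=44, n=50, p=0.95).pvalue)
print(stats.binomtest(k=48, n=50, p=0.95).pvalue)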

@williambdean
Contributor Author

williambdean commented Nov 11, 2022

Also, the previous test uses advi+adapt_diag instead of NUTS, but I'm not sure atm how that might impact the results.

Differences so far:

  1. ADVI+adapt_diag vs NUTS
  2. Accounting for first value based on init_dist in the logp
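
For reference, a sketch of the two sampling setups being compared (toy model, not the actual test):

import pymc as pm

with pm.Model():
    x = pm.Normal("x", 0.0, 1.0)
    # V3-era test: ADVI-based initialization for NUTS with a diagonal mass matrix
    idata_advi = pm.sample(tune=500, draws=500, init="advi+adapt_diag", random_seed=1234)
    # current test: PyMC's default initialization (jitter+adapt_diag)
    idata_default = pm.sample(tune=500, draws=500, random_seed=1234)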

@williambdean
Contributor Author

williambdean commented Nov 11, 2022

And now another test is failing: pymc/tests/distributions/test_timeseries.py::TestEulerMaruyama::test_batched_size[False-2] lol
There is no comparable test in the previous version.

Which could also be fixed with seeding... However, the specific test is

    y_eval = draw(y, draws=2)
    assert y_eval[0].shape == (batch_size, steps)
>   assert not np.any(np.isclose(y_eval[0], y_eval[1]))

...which I think is just trying to check that the samples are distinct. In that case, I think np.any(~np.isclose(y_eval[0], y_eval[1])) might be better, covering the case where the two time series are close at some point but are still different overall.

Even though two draws are unlikely to be the same across all the steps, I think a seed is still a wise choice here too.

assert y_eval[0].shape == (batch_size, steps)
assert not np.any(np.isclose(y_eval[0], y_eval[1]))
assert np.any(~np.isclose(y_eval[0], y_eval[1]))
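
A sketch of how the seeded check might look (toy distribution standing in for the EulerMaruyama variable; pm.draw does accept a random_seed argument):

import numpy as np
import pymc as pm

y = pm.Normal.dist(mu=0.0, sigma=1.0, shape=(5, 10))  # stand-in for the time series
y_eval = pm.draw(y, draws=2, random_seed=1234)
assert y_eval[0].shape == (5, 10)
assert np.any(~np.isclose(y_eval[0], y_eval[1]))  # seeded, but the two draws still differ
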
Contributor Author


np.any(~np.isclose([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))   # -> False since all the same
np.any(~np.isclose([5, 4, 3, 2, 1], [1, 2, 3, 4, 5]))   # -> True since at least one different 
np.any(~np.isclose([1, 2, 3, 4, 5], [6, 7, 8, 9, 10]))  # -> True since all different 

Member


Do you reckon this check is better than checking that all values are different?

Contributor Author

@williambdean williambdean Nov 18, 2022


Surprisingly, I had it fail on my end when checking that all values are different (the previous test). Not sure what the chances of that happening are, but it did. That is, the two arrays were the same in at least one spot. In the case of an intersection between the two ys, I see this as better capturing that they are different arrays.
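
To illustrate the difference on arrays that intersect at a single point (toy numbers):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, 5.0, 7.0])      # distinct draws that happen to coincide at one position
not np.any(np.isclose(a, b))       # False -> the previous assert fails here
np.any(~np.isclose(a, b))          # True  -> the new assert still passes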

Member


The test should now be deterministic, no?

Contributor Author


Yes. There are two changes:

  1. Deterministic because of the random seed
  2. Change in the assert which, I thought, better checks that the two draws are different based on the logic described. If you disagree, I will change back

Contributor Author


It still passes under the seed with the previous assert, though, but so can any test catered to do so.

Member


I don't have a strong preference, but all values being different is certainly stronger evidence that it's random :D

Contributor Author


Fair. 😃

@williambdean williambdean marked this pull request as ready for review November 11, 2022 16:32
Member

@michaelosthege michaelosthege left a comment


Looking good!

One less flaky boi :)

@williambdean
Contributor Author

I have to quickly rebase onto the new main since the seed args in the tests were just changed to random_seed (by me).

@williambdean
Contributor Author

Failed test on test_gp.py

@michaelosthege
Member

Failed test on test_gp.py

That's a known flaky test #6299.

We can just override it.

@michaelosthege michaelosthege merged commit cb37afa into pymc-devs:main Nov 19, 2022