[core][stab/01] script to find and list tests by flakiness level #52955

Open · wants to merge 2 commits into master

Conversation

can-anyscale
Collaborator

@can-anyscale can-anyscale commented May 13, 2025

Add a script to find and list tests by flakiness level. This forms the basis for finding and sorting tests by other metrics, such as test duration.

Test:

> bazel run //ci/ray_ci/automation:find_tests -- core

Tests sorted by flaky_percentage:
         - Test: darwin://python/ray/tests:test_advanced_5, Flaky Percentage: 6.666666666666667
         - Test: darwin://python/ray/tests:test_threaded_actor, Flaky Percentage: 6.666666666666667
         - Test: darwin://python/ray/tests:test_tqdm, Flaky Percentage: 6.666666666666667
         - Test: darwin://python/ray/dashboard:modules/job/tests/test_job_agent, Flaky Percentage: 3.3333333333333335
         - Test: darwin://python/ray/tests:test_failure, Flaky Percentage: 3.3333333333333335
         - Test: darwin://python/ray/tests:test_gcs_fault_tolerance, Flaky Percentage: 3.3333333333333335
         - Test: darwin://python/ray/tests:test_placement_group_3, Flaky Percentage: 3.3333333333333335
         - Test: darwin://python/ray/tests:test_placement_group_5, Flaky Percentage: 3.3333333333333335
         - Test: darwin://python/ray/tests:test_reconstruction_2, Flaky Percentage: 3.3333333333333335
         - Test: darwin://python/ray/tests:test_streaming_generator_4, Flaky Percentage: 3.3333333333333335
         - Test: darwin://:accessor_test, Flaky Percentage: 0.0
         - Test: darwin://:actor_creator_test, Flaky Percentage: 0.0
         ...

@can-anyscale can-anyscale marked this pull request as ready for review May 13, 2025 13:59
@can-anyscale can-anyscale requested a review from a team as a code owner May 13, 2025 13:59
@can-anyscale can-anyscale requested a review from a team May 13, 2025 14:00
@jjyao jjyao added the go add ONLY when ready to merge, run all tests label May 13, 2025
Contributor

@israbbani israbbani left a comment


Thank you for doing this! I left a few comments.

Comment on lines +74 to +93
def main(
team: str,
test_history_length: int,
test_prefix: str,
order_by: str,
debug: bool,
) -> None:
Contributor


I think a help option that explains how flakiness is calculated would be helpful.

Collaborator Author


totally

>= FLAKY_PERCENTAGE_THRESHOLD
)

def get_flaky_percentage(
Contributor


The way this is written, I don't think we can ever see a flaky percentage of 100%.

For example, with a test history of:

Pass -> Fail -> Pass -> Fail -> Pass -> Fail
Transitions=2/6 = 33%

It also means that this history is more flaky than the previous:

Fail -> Pass -> Fail -> Pass -> Fail -> Pass
Transitions=3/6 = 50%

Is there a reason why we don't count Pass -> Fail as flaky transitions?
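The asymmetry described above can be sketched as follows. `flaky_percentage` here is a hypothetical stand-in for the PR's `get_flaky_percentage` (not its actual code), counting only Fail -> Pass transitions as the comment describes:

```python
from typing import List


def flaky_percentage(results: List[str]) -> float:
    """Hypothetical sketch: percentage of Fail -> Pass transitions
    over the run history, per the behavior described in the review."""
    transitions = sum(
        1
        for prev, curr in zip(results, results[1:])
        if prev == "FAIL" and curr == "PASS"
    )
    return 100.0 * transitions / len(results)


# Reproduces the two examples from the comment above:
print(flaky_percentage(["PASS", "FAIL", "PASS", "FAIL", "PASS", "FAIL"]))  # 2/6 ~ 33.33
print(flaky_percentage(["FAIL", "PASS", "FAIL", "PASS", "FAIL", "PASS"]))  # 3/6 = 50.0
```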

Collaborator Author


yes, you're correct; it's just one way of computing a score that reflects the flip-flopping behavior and can be used as a number for comparison. no special reason why we don't count Pass -> Fail; if we do, we just need to double FLAKY_PERCENTAGE_THRESHOLD to make it not too punishing
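For reference, a hypothetical variant that counts transitions in both directions; it roughly doubles the score, which is why the threshold would need doubling as the reply suggests:

```python
from typing import List


def flip_flop_percentage(results: List[str]) -> float:
    """Hypothetical variant: count every state change, in both
    directions (Pass -> Fail and Fail -> Pass)."""
    flips = sum(1 for prev, curr in zip(results, results[1:]) if prev != curr)
    return 100.0 * flips / len(results)


# Every adjacent pair here is a state change: 5/6 ~ 83.33
print(flip_flop_percentage(["PASS", "FAIL", "PASS", "FAIL", "PASS", "FAIL"]))
```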

results: List[TestResult],
) -> float:
"""
Get the percentage of flaky tests in the test history
Contributor


This function does the actual flaky test calculation. Can you add the calculation method, and why we chose it, to the docstring?

Collaborator Author


will do

test_stats.sort(key=lambda x: x.get_flaky_percentage(), reverse=True)
print(f"Tests sorted by {order_by}:")
for test_stat in test_stats:
print(f"\t - {test_stat}")
Contributor


It might be more readable to truncate the decimals to some fixed number.
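A minimal sketch of the suggested truncation, using Python's fixed-precision format spec (the variable name is illustrative, not from the PR):

```python
# Illustrative only: format the percentage to two decimal places
# instead of printing the full float repr.
flaky = 3.3333333333333335
print(f"Flaky Percentage: {flaky:.2f}")  # -> Flaky Percentage: 3.33
```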

tests = [
test for test in Test.gen_from_s3(test_prefix) if test.get_oncall() == team
]
print(f"Analyzing {len(tests)} tests for team {team}")
Contributor


I'm not certain, but if these don't include flaky tests that have been disabled, I would add a log line for that.

Collaborator Author


I think this also includes tests that are disabled by the test automation bot (the database, however, captures tests based on their run activity on CI, so it doesn't capture tests that have never run on CI, or haven't run within the data retention window, 30 days I think).

Collaborator

@aslonnie aslonnie left a comment


should we just fix the numbers on go/flaky?

Collaborator

@aslonnie aslonnie left a comment


or if we prefer a rewrite, maybe add another new, static, core-team-specific page in https://github.com/ray-project/travis-tracker-v2 ?

also, the reef team is planning to revamp/refactor the flaky test dashboard and the test db.. all this code is likely going to be rather short-lived..

@can-anyscale
Collaborator Author

@aslonnie i talked briefly to the team and suggested that they not invest too much in go/flaky at this point; it has diverged very far from what matters to the release

@can-anyscale
Collaborator Author

@edoakes , @jjyao - how much do you want this in the short term?
@aslonnie - what is the timeline for the revamping?

@israbbani
Contributor

If we can get a compact representation in the CLI tool, I would prefer it to the website (especially since we can look at the code, script against it, and make modifications).

The information I would be interested in is:

  • Which tests should I fix?

    • List ordered by flakiness (but currently enabled)
    • List of tests that were disabled because they were too flaky
  • Information to debug and fix each test:

    • Does it run on premerge or postmerge?
    • Is it flaky on only one platform?
    • Link to the last failed run
    • Link to the last successful run
    • List of last X runs (successful and failing)
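One hypothetical shape for a per-test record covering the wishlist above (all field names are illustrative, not from the PR):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TestReport:
    """Illustrative record bundling the debugging info requested above."""

    name: str
    flaky_percentage: float
    enabled: bool                  # currently enabled, vs disabled for flakiness
    premerge: bool                 # runs on premerge (True) or postmerge (False)
    flaky_platforms: List[str] = field(default_factory=list)
    last_failed_run_url: Optional[str] = None
    last_passed_run_url: Optional[str] = None
    recent_runs: List[str] = field(default_factory=list)  # e.g. ["PASS", "FAIL"]
```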

Labels
go add ONLY when ready to merge, run all tests
4 participants