[core][stab/01] script to find and list tests by flakiness level #52955

Open · wants to merge 2 commits into master

Conversation

can-anyscale
Collaborator

@can-anyscale can-anyscale commented May 13, 2025

Add a script to find and list tests by flakiness level. This forms the basis for finding and sorting tests by other metrics, such as test duration.

Test:

> bazel run //ci/ray_ci/automation:find_tests -- core

Tests sorted by flaky_percentage:
         - Test: darwin://python/ray/tests:test_advanced_5, Flaky Percentage: 6.666666666666667
         - Test: darwin://python/ray/tests:test_threaded_actor, Flaky Percentage: 6.666666666666667
         - Test: darwin://python/ray/tests:test_tqdm, Flaky Percentage: 6.666666666666667
         - Test: darwin://python/ray/dashboard:modules/job/tests/test_job_agent, Flaky Percentage: 3.3333333333333335
         - Test: darwin://python/ray/tests:test_failure, Flaky Percentage: 3.3333333333333335
         - Test: darwin://python/ray/tests:test_gcs_fault_tolerance, Flaky Percentage: 3.3333333333333335
         - Test: darwin://python/ray/tests:test_placement_group_3, Flaky Percentage: 3.3333333333333335
         - Test: darwin://python/ray/tests:test_placement_group_5, Flaky Percentage: 3.3333333333333335
         - Test: darwin://python/ray/tests:test_reconstruction_2, Flaky Percentage: 3.3333333333333335
         - Test: darwin://python/ray/tests:test_streaming_generator_4, Flaky Percentage: 3.3333333333333335
         - Test: darwin://:accessor_test, Flaky Percentage: 0.0
         - Test: darwin://:actor_creator_test, Flaky Percentage: 0.0
         ...

@can-anyscale can-anyscale marked this pull request as ready for review May 13, 2025 13:59
@can-anyscale can-anyscale requested a review from a team as a code owner May 13, 2025 13:59
@can-anyscale can-anyscale requested a review from a team May 13, 2025 14:00
@jjyao jjyao added the go add ONLY when ready to merge, run all tests label May 13, 2025
Contributor

@israbbani israbbani left a comment


Thank you for doing this! I left a few comments.

Comment on lines +74 to +93
def main(
team: str,
test_history_length: int,
test_prefix: str,
order_by: str,
debug: bool,
) -> None:
Contributor


I think a help option that explains how flakiness is calculated would be helpful.

Collaborator Author


totally

>= FLAKY_PERCENTAGE_THRESHOLD
)

def get_flaky_percentage(
Contributor


The way this is written, I don't think we can ever see a flaky percentage of 100%.

For example, with a test history of:

Pass -> Fail -> Pass -> Fail -> Pass -> Fail
Transitions=2/6 = 33%

It also means that this history is more flaky than the previous:

Fail -> Pass -> Fail -> Pass -> Fail -> Pass
Transitions=3/6 = 50%

Is there a reason why we don't count Pass -> Fail as flaky transitions?
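The asymmetry described above can be sketched as follows. `flaky_percentage` here is a hypothetical stand-in for the PR's `get_flaky_percentage` (not its actual code), counting only Fail -> Pass transitions as the comment describes:

```python
from typing import List


def flaky_percentage(results: List[str]) -> float:
    """Hypothetical sketch: percentage of Fail -> Pass transitions
    over the run history, per the behavior described in the review."""
    transitions = sum(
        1
        for prev, curr in zip(results, results[1:])
        if prev == "FAIL" and curr == "PASS"
    )
    return 100.0 * transitions / len(results)


# Reproduces the two examples from the comment above:
print(flaky_percentage(["PASS", "FAIL", "PASS", "FAIL", "PASS", "FAIL"]))  # 2/6 ~ 33.33
print(flaky_percentage(["FAIL", "PASS", "FAIL", "PASS", "FAIL", "PASS"]))  # 3/6 = 50.0
```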

Collaborator Author


yes, you're correct; it's just one way of computing a score that reflects the flip-flopping behavior and can be used as a number for comparison. no special reason why we don't count Pass -> Fail; if we do, we just need to double FLAKY_PERCENTAGE_THRESHOLD to make it not too punishing
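For reference, a hypothetical variant that counts transitions in both directions; it roughly doubles the score, which is why the threshold would need doubling as the reply suggests:

```python
from typing import List


def flip_flop_percentage(results: List[str]) -> float:
    """Hypothetical variant: count every state change, in both
    directions (Pass -> Fail and Fail -> Pass)."""
    flips = sum(1 for prev, curr in zip(results, results[1:]) if prev != curr)
    return 100.0 * flips / len(results)


# Every adjacent pair here is a state change: 5/6 ~ 83.33
print(flip_flop_percentage(["PASS", "FAIL", "PASS", "FAIL", "PASS", "FAIL"]))
```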

results: List[TestResult],
) -> float:
"""
Get the percentage of flaky tests in the test history
Contributor


This function does the actual flaky test calculation. Can you add the calculation method, and why we chose it, to the docstring?

Collaborator Author


will do

test_stats.sort(key=lambda x: x.get_flaky_percentage(), reverse=True)
print(f"Tests sorted by {order_by}:")
for test_stat in test_stats:
print(f"\t - {test_stat}")
Contributor


It might be more readable to truncate the decimals to some fixed number.
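A minimal sketch of the suggested truncation, using Python's fixed-precision format spec (the variable name is illustrative, not from the PR):

```python
# Illustrative only: format the percentage to two decimal places
# instead of printing the full float repr.
flaky = 3.3333333333333335
print(f"Flaky Percentage: {flaky:.2f}")  # -> Flaky Percentage: 3.33
```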

tests = [
test for test in Test.gen_from_s3(test_prefix) if test.get_oncall() == team
]
print(f"Analyzing {len(tests)} tests for team {team}")
Contributor


I'm not certain, but if these don't include flaky tests that have been disabled, I would add a log line for that.

Collaborator Author


I think this also includes tests that are disabled by the test automation bot (the database, however, captures tests based on their run activity on CI, so it doesn't capture tests that have never run on CI, or haven't run within the data retention window, 30 days I think).

Collaborator

@aslonnie aslonnie left a comment


should we just fix the numbers on go/flaky?

Collaborator

@aslonnie aslonnie left a comment


or if we prefer a rewrite, maybe add another new, static, core-team-specific page in https://github.com/ray-project/travis-tracker-v2 ?

also, the reef team is planning to revamp/refactor the flaky test dashboard and the test db.. all this code is likely going to be rather short-lived..

@can-anyscale
Collaborator Author

@aslonnie i talked briefly to the team and suggested that they not invest too much in go/flaky at this point; it has diverged very far from what matters to the release

@can-anyscale
Collaborator Author

@edoakes , @jjyao - how much do you want this in the short term?
@aslonnie - what is the timeline for the revamping?

@israbbani
Contributor

If we can get a compact representation in the CLI tool, I would prefer it to the website (especially since we can look at the code, script against it, and make modifications).

The information I would be interested in is:

  • Which tests should I fix?

    • List ordered by flakiness (but currently enabled)
    • List of tests that were disabled because they were too flaky
  • Information to debug and fix each test:

    • Does it run on premerge or postmerge?
    • Is it flaky on only one platform?
    • Link to the last failed run
    • Link to the last successful run
    • List of last X runs (successful and failing)
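One hypothetical shape for a per-test record covering the wishlist above (all field names are illustrative, not from the PR):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TestReport:
    """Illustrative record bundling the debugging info requested above."""

    name: str
    flaky_percentage: float
    enabled: bool                  # currently enabled, vs disabled for flakiness
    premerge: bool                 # runs on premerge (True) or postmerge (False)
    flaky_platforms: List[str] = field(default_factory=list)
    last_failed_run_url: Optional[str] = None
    last_passed_run_url: Optional[str] = None
    recent_runs: List[str] = field(default_factory=list)  # e.g. ["PASS", "FAIL"]
```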

Labels
go add ONLY when ready to merge, run all tests
4 participants