
Conversation

@iamjustinhsu (Contributor) commented Oct 6, 2025

Why are these changes needed?

Metrics currently persist after execution finishes; they are not reset when the executor shuts down. This PR resets them in streaming_executor.shutdown. It also includes two potential drive-by fixes for metric calculation.
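
As a rough sketch of the reset-on-shutdown idea (the gauge names, the _Gauge stub, and _clear_metrics below are illustrative, not the actual Ray Data implementation):

# Sketch only: reset per-dataset gauges when the streaming executor shuts down,
# so stale values don't linger on the dashboard after execution finishes.
class _Gauge:
    # Stand-in for a metrics gauge; real code would use Ray's metric classes.
    def __init__(self):
        self.value = None

    def set(self, value, tags=None):
        self.value = (value, tags)

class StreamingExecutor:
    def __init__(self, dataset_id):
        self._dataset_id = dataset_id
        self._cpu_usage_gauge = _Gauge()
        self._gpu_usage_gauge = _Gauge()

    def shutdown(self):
        # ... existing teardown logic ...
        self._clear_metrics()

    def _clear_metrics(self):
        # Zero out per-dataset gauges so the dashboard stops showing usage
        # for a dataset that has finished executing.
        tags = {"dataset": self._dataset_id}
        self._cpu_usage_gauge.set(0, tags=tags)
        self._gpu_usage_gauge.set(0, tags=tags)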

Related issue number

Checks

  • I've signed off every commit (using the -s flag, i.e., git commit -s) in this PR.
  • I've run pre-commit jobs to lint the changes in this PR. (pre-commit setup)
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

targets=[
    Target(
-       expr='sum(ray_data_block_generation_time{{{global_filters}, operator=~"$Operator"}}) by (dataset, operator)',
+       expr='increase(ray_data_block_generation_time{{{global_filters}, operator=~"$Operator"}}[5m]) / increase(ray_data_num_task_outputs_generated{{{global_filters}, operator=~"$Operator"}}[5m])',
@iamjustinhsu (Contributor, Author) commented Oct 6, 2025

W/O PR: shows the total cumulative sum of block generation time (meaningless on its own)
W/ PR: shows the average block generation time per output block over a 5-minute window
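
To make the change concrete, a small worked example (the numbers are made up; only the shape of the calculation mirrors the PromQL above):

# Counter growth over the last 5 minutes (illustrative values).
block_generation_time_increase_s = 120.0   # seconds spent generating blocks
num_task_outputs_generated_increase = 480  # blocks produced in the same window

# increase(time) / increase(count) plots ~0.25 s of generation time per output
# block, instead of an ever-growing cumulative sum.
avg_generation_time_per_block_s = (
    block_generation_time_increase_s / num_task_outputs_generated_increase
)
print(avg_generation_time_per_block_s)  # 0.25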

targets=[
    Target(
-       expr='sum(ray_data_task_submission_backpressure_time{{{global_filters}, operator=~"$Operator"}}) by (dataset, operator)',
+       expr='increase(ray_data_task_submission_backpressure_time{{{global_filters}, operator=~"$Operator"}}[5m]) / increase(ray_data_num_tasks_submitted{{{global_filters}, operator=~"$Operator"}}[5m])',
@iamjustinhsu (Contributor, Author) commented Oct 6, 2025

W/O PR: shows the total cumulative task submission backpressure time (could be meaningful)
W/ PR: shows the average backpressure time per submitted task over a 5-minute window (I find this more meaningful)

@iamjustinhsu changed the title from "[data] reset metrics on executor shutdown" to "[data] reset cpu + gpu metrics on executor shutdown" on Oct 9, 2025
include_parent=False
)
# Reset the scheduling loop duration gauge.
self._sched_loop_duration_s.set(0, tags={"dataset": self._dataset_id})
Contributor commented:

is this meant to be nuked?

@iamjustinhsu (Contributor, Author) replied:

Yes, update_metrics calls it.

@iamjustinhsu marked this pull request as ready for review on October 9, 2025 at 22:48
@iamjustinhsu requested a review from a team as a code owner on October 9, 2025 at 22:48
for op in self._op_usages:
    self._op_usages[op] = ExecutionResources.zero()
    self.op_resource_allocator._op_budgets[op] = ExecutionResources.zero()
    self.op_resource_allocator._output_budgets[op] = 0

Bug: Null Resource Allocator Causes Shutdown Failures

The clear_usages_and_budget method assumes op_resource_allocator always exists and is a ReservationOpResourceAllocator. This causes an AssertionError when _op_resource_allocator is None (e.g., when resource allocation is disabled), leading to failures during executor shutdown.

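One possible guard, sketched as a method meant to slot into the existing class rather than a standalone script; the names (clear_usages_and_budget, _op_resource_allocator, ReservationOpResourceAllocator, ExecutionResources) follow the diff and bug report above, and this is not necessarily the PR's actual fix:

def clear_usages_and_budget(self):
    # Always reset per-operator usage, regardless of allocator type.
    for op in self._op_usages:
        self._op_usages[op] = ExecutionResources.zero()

    # Budgets only exist on a reservation-based allocator; _op_resource_allocator
    # can be None when resource allocation is disabled, so guard before touching it.
    allocator = self._op_resource_allocator
    if isinstance(allocator, ReservationOpResourceAllocator):
        for op in self._op_usages:
            allocator._op_budgets[op] = ExecutionResources.zero()
            allocator._output_budgets[op] = 0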

@ray-gardener (bot) added the labels data (Ray Data-related issues) and observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling) on Oct 10, 2025
@iamjustinhsu changed the title from "[data] reset cpu + gpu metrics on executor shutdown" to "[data] reset cpu + gpu metrics on executor shutdown and updating task submission/block generation metrics" on Oct 20, 2025