[data] reset cpu + gpu metrics on executor shutdown and updating task submission/block generation metrics #57246
Conversation
Signed-off-by: iamjustinhsu <[email protected]>
targets=[
    Target(
-       expr='sum(ray_data_block_generation_time{{{global_filters}, operator=~"$Operator"}}) by (dataset, operator)',
+       expr='increase(ray_data_block_generation_time{{{global_filters}, operator=~"$Operator"}}[5m]) / increase(ray_data_num_task_outputs_generated{{{global_filters}, operator=~"$Operator"}}[5m])',
Without this PR: shows the total sum of block generation time (meaningless).
With this PR: shows the average block generation time over a 5-minute window.
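To make the new expression concrete, here is a small worked example with made-up numbers (the values below are hypothetical, not taken from this PR): dividing the 5-minute increase of total block generation time by the 5-minute increase of generated task outputs gives the mean generation time per block.

```python
# Hypothetical values for one operator over a 5-minute window (illustration only):
# increase(ray_data_block_generation_time[5m])       -> seconds spent generating blocks
# increase(ray_data_num_task_outputs_generated[5m])  -> number of blocks produced
block_generation_time_5m_s = 120.0
num_task_outputs_generated_5m = 60

# The new panel expression is this ratio: mean block generation time per output.
avg_block_generation_time_s = block_generation_time_5m_s / num_task_outputs_generated_5m
print(avg_block_generation_time_s)  # 2.0 seconds per block
```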
targets=[
    Target(
-       expr='sum(ray_data_task_submission_backpressure_time{{{global_filters}, operator=~"$Operator"}}) by (dataset, operator)',
+       expr='increase(ray_data_task_submission_backpressure_time{{{global_filters}, operator=~"$Operator"}}[5m]) / increase(ray_data_num_tasks_submitted{{{global_filters}, operator=~"$Operator"}}[5m])',
Without this PR: shows the total sum of submitted tasks (could be meaningful).
With this PR: shows the current number of submitted tasks (I find this more meaningful).
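For reference, the new expression in the diff above divides the 5-minute increase of task submission backpressure time by the 5-minute increase of submitted tasks, i.e., the average backpressure time per submitted task. A small worked example with hypothetical numbers (not measurements from this PR):

```python
# Hypothetical values over a 5-minute window (illustration only):
task_submission_backpressure_time_5m_s = 30.0  # total seconds spent backpressured
num_tasks_submitted_5m = 150                   # tasks submitted in the window

# The new panel expression computes this per-task average.
avg_backpressure_per_task_s = task_submission_backpressure_time_5m_s / num_tasks_submitted_5m
print(avg_backpressure_per_task_s)  # 0.2 seconds of backpressure per submitted task
```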
    include_parent=False
)
# Reset the scheduling loop duration gauge.
self._sched_loop_duration_s.set(0, tags={"dataset": self._dataset_id})
is this meant to be nuked?
Yes, `update_metrics` calls it.
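To illustrate why the inline reset is redundant, here is a minimal sketch (class and method names are assumptions for illustration, not the actual Ray implementation): the metrics-update path already writes the scheduling-loop gauge, so calling it with a zero duration during shutdown covers the reset that used to be done inline.

```python
class SchedLoopMetricsSketch:
    """Illustrative stand-in for the executor's metrics bookkeeping (hypothetical)."""

    def __init__(self, gauge, dataset_id: str):
        self._sched_loop_duration_s = gauge
        self._dataset_id = dataset_id

    def update_metrics(self, loop_duration_s: float = 0.0) -> None:
        # Called on each scheduling-loop iteration and again during shutdown;
        # passing 0.0 on shutdown makes a separate explicit reset unnecessary.
        self._sched_loop_duration_s.set(
            loop_duration_s, tags={"dataset": self._dataset_id}
        )
```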
for op in self._op_usages:
    self._op_usages[op] = ExecutionResources.zero()
    self.op_resource_allocator._op_budgets[op] = ExecutionResources.zero()
    self.op_resource_allocator._output_budgets[op] = 0
Bug: Null Resource Allocator Causes Shutdown Failures

The `clear_usages_and_budget` method assumes `op_resource_allocator` always exists and is a `ReservationOpResourceAllocator`. This causes an `AssertionError` when `_op_resource_allocator` is `None` (e.g., when resource allocation is disabled), leading to failures during executor shutdown.
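One possible defensive fix along these lines, written as the resource manager method (a sketch only: the loop body and attribute names come from the diff above, but the `isinstance`/`None` guard is my assumption, not necessarily the fix this PR will take):

```python
def clear_usages_and_budget(self) -> None:
    # Reset per-operator usage regardless of which allocator (if any) is active.
    for op in self._op_usages:
        self._op_usages[op] = ExecutionResources.zero()

    # Only touch allocator budgets when a reservation-based allocator exists;
    # _op_resource_allocator can be None when resource allocation is disabled.
    allocator = self._op_resource_allocator
    if isinstance(allocator, ReservationOpResourceAllocator):
        for op in self._op_usages:
            allocator._op_budgets[op] = ExecutionResources.zero()
            allocator._output_budgets[op] = 0
```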
Why are these changes needed?
On executor shutdown, the metrics currently persist even after execution has finished. The plan is to reset them in `streaming_executor.shutdown`. This PR also includes two potential drive-by fixes for metric calculation.
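A rough sketch of the intended shutdown flow (all names here except those visible in the diffs above are hypothetical, and this is not the PR's exact code): when the streaming executor shuts down, it clears per-operator usages and budgets and re-emits metrics so dashboards stop reporting stale CPU/GPU and scheduling values.

```python
# Hypothetical illustration of the shutdown-time reset described above.
class StreamingExecutorSketch:
    def __init__(self, resource_manager, op_metrics_list, dataset_id: str):
        self._resource_manager = resource_manager
        self._op_metrics_list = op_metrics_list
        self._dataset_id = dataset_id

    def shutdown(self) -> None:
        # Zero out per-operator usages and budgets (see clear_usages_and_budget above).
        self._resource_manager.clear_usages_and_budget()
        # Re-emit metrics so the cleared values reach the exported gauges.
        for op_metrics in self._op_metrics_list:
            op_metrics.update_metrics()
```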
Related issue number
Checks

- I've signed off every commit (`git commit -s`) in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.