
Conversation

Contributor

@tianyi-ge tianyi-ge commented Sep 29, 2025

Why are these changes needed?

  1. Currently, the reporter agent is spawned by the raylet process. It assumes that all core workers are direct children of the raylet (it iterates for proc in raylet_proc.children()), but that is no longer the case with new features (uv, image_url). The reporter agent needs another way to find all core workers.
  2. The driver is not spawned by the raylet, so it is never monitored.

Implementation:

  1. Add a gRPC endpoint to the raylet process (node manager) and allow the reporter agent to connect to it.
  2. The reporter agent fetches the worker list, including the driver, via the gRPC reply. It creates a raylet client with a dedicated thread. A rough sketch of the Python side is shown below.
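A minimal, illustration-only sketch of that Python side, assuming the RayletClient binding and get_worker_pids(timeout) call described in the note below; the helper name and the empty-dict fallback here are mine, not the actual diff:

import psutil

def get_worker_processes(raylet_client, timeout_s=1):
    # Ask the local raylet for the PIDs of all alive workers and drivers
    # instead of scanning the raylet's child processes with psutil.
    pids = raylet_client.get_worker_pids(timeout=timeout_s)
    workers = {}
    for pid in pids:
        try:
            # A process may exit between the RPC reply and this lookup.
            workers[pid] = psutil.Process(pid)
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    return workers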

Related issue number

Closes #56739

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Note

Reporter agent now fetches worker/driver PIDs via a new Raylet GetWorkerPIDs RPC using a new RayletClient binding, replacing psutil child-process scanning.

  • Backend (Raylet RPC):
    • Add GetWorkerPIDs RPC in node_manager.proto and wire it into NodeManagerService.
    • Implement NodeManager::HandleGetWorkerPIDs to return PIDs of all alive workers and drivers.
    • Extend RayletClient (C++) with GetWorkerPIDs(timeout_ms) and an alternate ctor (ip, port); expose to Python via Cython (includes/raylet_client.pxi, includes/common.pxd).
  • Python/Cython plumbing:
    • Include includes/raylet_client.pxi in _raylet.pyx to expose RayletClient to Python.
  • Dashboard Reporter:
    • Update reporter_agent.py to use RayletClient(ip, node_manager_port).get_worker_pids(timeout) to discover workers; build psutil.Process objects from returned PIDs.
    • Add RAYLET_RPC_TIMEOUT_SECONDS = 1 in dashboard/consts.py and use it for RPC timeout.
  • Server registration:
    • Register new handler in node_manager_server.h macro list.

Written by Cursor Bugbot for commit f76f633. This will update automatically on new commits.

@tianyi-ge tianyi-ge requested a review from a team as a code owner September 29, 2025 15:33
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds a new gRPC endpoint to the node manager for fetching worker and driver PIDs, which is a solid approach for discovering all worker processes. The changes to the protobuf definition and the C++ implementation are mostly correct. However, I've found a critical issue in the Python client code due to a typo that would cause the RPC call to fail. I've also included a few suggestions for improving error handling and code efficiency.

@tianyi-ge tianyi-ge changed the title [core] add get pid rpc to node manager [core] allow reporter agent to get pid via rpc to raylet Sep 29, 2025

@ray-gardener ray-gardener bot added the core (Issues that should be addressed in Ray Core), observability (Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling), and community-contribution (Contributed by the community) labels Sep 29, 2025
Comment on lines 523 to 524
// Get the worker managed by local raylet.
// Failure: Sends to local raylet, so should never fail.
Collaborator

we should still add error handling & retries just in case (there could be a logical bug in the raylet)

@tianyi-ge
Contributor Author

@edoakes thank you for the prompt comments; I'll fix it soon. Also, after discussing with @can-anyscale, I'll replace the Python grpcio lib with a Cython wrapper, "RayletClient"


@can-anyscale can-anyscale self-assigned this Oct 1, 2025
Contributor

@can-anyscale can-anyscale left a comment

Let's figure out a way to test that the solution works

)
try:
    return raylet_client.get_worker_pids(timeout=timeout)
except Exception as e:
Contributor

let's not catch all exceptions; be explicit about which exceptions are acceptable to be thrown from get_worker_pids and for which exceptions Ray should just fail out loud

Contributor

yes, these should be RPC exceptions or something; try not to catch all exceptions if possible
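For example, a minimal sketch of what being explicit could look like, assuming the binding raises TimeoutError on RPC timeouts and RuntimeError on other non-OK statuses (as discussed further down); the helper name and the empty-list fallback are illustrative:

import logging

logger = logging.getLogger(__name__)

def get_worker_pids_or_empty(raylet_client, timeout_s):
    try:
        return raylet_client.get_worker_pids(timeout=timeout_s)
    except TimeoutError:
        # Expected when the raylet is slow or busy; degrade to an empty result.
        logger.exception("Timed out getting worker PIDs from raylet")
        return []
    except RuntimeError:
        # Non-OK RPC status surfaced by the binding; log with a traceback.
        logger.exception("Raylet returned an error for GetWorkerPIDs")
        return []
    # Any other exception propagates so that programming errors fail loudly.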

rpc IsLocalWorkerDead(IsLocalWorkerDeadRequest) returns (IsLocalWorkerDeadReply);
// Get the PIDs of all workers currently alive that are managed by the local Raylet.
// This includes connected driver processes.
// Failure: Will retry on failure with logging
Contributor

nit: what "with logging" means? more useful information would be to retry how many time; what will the reply look like on failures (partial results, empty) etc.



ThreadedRayletClient::ThreadedRayletClient(const std::string &ip_address, int port)
    : RayletClient() {
  io_service_ = std::make_unique<instrumented_io_context>();
Contributor

maybe you can just use this https://github.com/ray-project/ray/blob/master/src/ray/common/asio/asio_util.h#L53 so you don't need to maintain the thread yourself

Contributor

there are also patterns to make sure the io_context is reused across raylet clients within one process

Contributor Author

thanks for your suggestions

Contributor Author

I guess in the future, if the raylet client is used in multiple places, reusing the io_context will be important.
But to use IOContextProvider, I have to create a "default io context" anyway. It's also manually maintained, right?

Contributor

oh dang, sorry, I forgot to include the link to the pattern; you can create a static InstrumentedIOContextWithThread and reuse it across the constructor of ThreadedRayletClient: https://github.com/ray-project/ray/blob/master/src/ray/gcs_rpc_client/gcs_client.cc#L219

Contributor Author

I ran a test creating 5 actors. The RPC reply has 12 processes, including the 5 actors, 5 idle workers, the driver (python in the following screenshot), and a dashboard server head, which aligns with the dashboard.

2025-10-08 11:29:33,355	INFO reporter_agent.py:913 -- Worker PIDs from raylet: [41692, 41694, 41685, 41689, 41693, 41688, 41690, 41691, 41687, 41686, 41676, 41618]

should the dashboard server head be here?

Collaborator

ah -- I don't think the dashboard server head should be there... the reason it's showing up is that it connects to Ray with ray.init. We'll need some way of filtering it. I believe it is started in a namespace prefixed with _ray_internal. We do other such filtering here:

# This includes the _ray_internal_dashboard job that gets automatically

If the namespace is available in the raylet, we can add the filtering there and exclude any workers that are associated with a _ray_internal* namespace
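For reference, the filtering boils down to a namespace prefix check; a minimal sketch of the idea (the function is hypothetical, and in this PR the actual filtering ends up in the raylet):

def is_system_driver(namespace: str) -> bool:
    # Drivers started by Ray itself (e.g. the dashboard head connecting via
    # ray.init) use namespaces prefixed with _ray_internal.
    return namespace.startswith("_ray_internal")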

Collaborator

Agree, for system driver processes, we should hide them.

Contributor Author

thanks @edoakes. I added a new option, filter_system_drivers. It finds the corresponding namespace and checks its prefix. Now the dashboard server head is gone.



rpc::ClientCallManager &client_call_manager,
std::function<void()> raylet_unavailable_timeout_callback);

RayletClient() = default;
Contributor

do we need this? Ideally we shouldn't change the raylet interface; it was designed so that it always reuses the thread from its caller (so that Ray logic won't compete with Ray application logic)

Contributor Author

Because I need to construct the threaded raylet client.

The current RayletClient constructor inits grpc_client_ and retryable_grpc_client_ right away, but ThreadedRayletClient needs to init them after getting the io_service and client_call_manager, so I need a default RayletClient constructor for the ThreadedRayletClient constructor.

Contributor

kk perhaps move this constructor into protected

Collaborator

We won't have this problem if we do wrapper instead of inheritance.

std::shared_ptr<std::vector<int32_t>> worker_pids, int64_t timeout_ms) {
rpc::GetWorkerPIDsRequest request;
auto promise = std::make_shared<std::promise<Status>>();
std::weak_ptr<std::promise<Status>> weak_promise = promise;
Contributor

why do you need to use weak_ptr?

Contributor Author

Because the promise is captured in the callback lambda. When GetWorkerPIDs ends, the promise is destructed, but the callback may still be called after that; the weak_ptr avoids a use-after-free.

Contributor

got it makes sense

Signed-off-by: tianyi-ge <[email protected]>
@edoakes edoakes added the go (add ONLY when ready to merge, run all tests) label Oct 14, 2025
@edoakes
Collaborator

edoakes commented Oct 14, 2025

Kicked off full CI tests: https://buildkite.com/ray-project/premerge/builds/51531

@can-anyscale
Contributor

LGTM, pending for test results, thanks


/// Threaded raylet client is provided for python (e.g. ReporterAgent) to communicate with
/// raylet. It creates and manages a separate thread to run the grpc event loop
class ThreadedRayletClient : public RayletClient {
Collaborator

@jjyao jjyao Oct 14, 2025

Do we really need a ThreadedRayletClient? We don't have ThreadedGcsClient.

We can have something similar to ConnectOnSingletonIoContext

CreateRayletClientOnSingletonIoContext(ip, port) {
static InstrumentedIOContextWithThread io_context("raylet_client_io_context");
static ClientCallManager client_call_manager();
return RayletClient(ip, port, client_call_manager);
}

or we can create a wrapper of the existing raylet client instead of inheritance:

class RayletClientWithIoContext {
 raylet_client_;
 io_context_;
 client_call_manager_;
}

I think wrapper is probably cleaner.

Contributor Author

yeah, I've considered a wrapper before. As you mentioned, RayletClient would then need to add GetWorkerPIDs and a new constructor Raylet(ip, port, client_call_manager). Both approaches look good to me, as they won't affect the Cython usage, but the wrapper seems like a more decoupled way.

Comment on lines 38 to 39
Status GetWorkerPIDs(std::shared_ptr<std::vector<int32_t>> worker_pids,
int64_t timeout_ms);
Collaborator

This should go into the existing RayletClient.


return Status::TimedOut("Timed out getting worker PIDs from raylet");
}
return future.get();
}

Bug: Conflicting Timeouts in GetWorkerPIDs Method

The GetWorkerPIDs method uses the same timeout_ms for both the RPC call and the future.wait_for. This creates competing timeouts, which can cause the method to return TimedOut prematurely, even if the RPC call is still active or would eventually succeed.


Contributor

@can-anyscale can-anyscale left a comment

There are several nits, but overall LGTM, thank you.

I'll leave some time for @jjyao @edoakes to take a look too before merging.

    # Get worker pids from raylet via gRPC.
    return self._raylet_client.get_worker_pids()
except TimeoutError as e:
    logger.debug(f"Failed to get worker pids from raylet: {e}")
Contributor

logger.error

Collaborator

should use logger.exception here, not logger.error

logger.exception formats the stack trace automatically, without needing to include the exception repr

try:
    proc = psutil.Process(pid)
    workers[self._generate_worker_key(proc)] = proc
except (psutil.NoSuchProcess, psutil.AccessDenied):
Contributor

logger.error("...")

if status.IsTimedOut():
    raise TimeoutError(status.message())
elif not status.ok():
    raise RuntimeError(
Contributor

this also raises RuntimeError, need to catch this exception upstream

try:
    # Get worker pids from raylet via gRPC.
    return self._raylet_client.get_worker_pids()
except TimeoutError as e:
Contributor

catch RuntimeError as well

///
/// \param filter_dead_drivers whether or not if this method will filter dead drivers
/// that are still registered.
/// \param filter_system_drivers whether or not if this method will filter system
Contributor

should this be called filter_ray_internal_processes? @jjyao, @edoakes, I don't have the context on whether we already have "system driver" as a concept, or whether this PR is introducing a new one

Collaborator

filter_system_drivers looks fine to me since it's mirroring filter_dead_drivers

@can-anyscale
Contributor

Also kicking off a release test to check that the processes reported by the metrics do not regress. Here is the list of processes reported on master:

Screenshot 2025-10-15 at 12 13 50 PM

PushMutableObjectReply *reply,
SendReplyCallback send_reply_callback) = 0;

virtual void HandleGetWorkerPIDs(GetWorkerPIDsRequest request,
Collaborator

let's make sure runtime env agent metrics are still reported

@can-anyscale
Contributor

processes reported by this PR

Screenshot 2025-10-15 at 3 12 48 PM

@can-anyscale
Contributor

@jjyao: all processes are reported as expected with this PR

@jjyao
Collaborator

jjyao commented Oct 16, 2025

@jjyao: all processes are reported as expected with this PR

@can-anyscale I didn't see runtime env agent metrics from your screenshot.

@tianyi-ge
Contributor Author

@jjyao Is the runtime env agent a special driver or core worker? Is it possible to assert on it in my unit test?

return Status::TimedOut("Timed out getting worker PIDs from raylet");
}
return future.get();
}

Bug: Race Condition in Dual Timeout Handling

The GetWorkerPIDs method has a race condition due to dual timeout handling. Both the RPC call and future.wait_for use the same timeout_ms, which can cause future.wait_for to incorrectly report a timeout even if the RPC successfully completed.



def test_report_stats():
@patch("ray.dashboard.modules.reporter.reporter_agent.RayletClient")
def test_report_stats(mock_raylet_client):
Collaborator

the mock_raylet_client is not used?

Contributor Author

it will be used in the ReporterAgent constructor to avoid creating a real gRPC client
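For reference, a rough sketch of that pattern; the return-value wiring below is illustrative, not the actual test body:

from unittest.mock import patch

@patch("ray.dashboard.modules.reporter.reporter_agent.RayletClient")
def test_report_stats(mock_raylet_client):
    # ReporterAgent instantiates the patched class, so it receives a MagicMock
    # instead of opening a real gRPC connection to the raylet.
    mock_raylet_client.return_value.get_worker_pids.return_value = [123, 456]
    # ... construct the ReporterAgent and assert on the reported stats ...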

assert resp_data["rayInitCluster"] == meta["ray_init_cluster"]


def test_reporter_raylet_agent(ray_start_with_dashboard):
Collaborator

I think this test depends on the fact that the total CPU resource of the node is 1, so we don't create extra idle workers. Could you make it explicit by doing:

@pytest.mark.parametrize(
    "ray_start_with_dashboard",
    [
        {
            "num_cpus": 1,
        }
    ],
    indirect=True,
)

/// PID of GCS process to record metrics.
constexpr char kGcsPidKey[] = "gcs_pid";

// Please keep this in sync with the definition in ray_constants.py.
Collaborator

We can enforce the sync by exposing the C++ constant to Python via Cython. We have examples in common.pxi and common.pxd:

RAY_NODE_TPU_POD_TYPE_KEY = kLabelKeyTpuPodType.decode()

// worker clients. The unavailable callback will eventually be retried so if this fails.
rpc IsLocalWorkerDead(IsLocalWorkerDeadRequest) returns (IsLocalWorkerDeadReply);
// Get the PIDs of all workers currently alive that are managed by the local Raylet.
// This includes connected driver processes.
Collaborator

We should mention system drivers are excluded

Comment on lines 474 to 475
std::weak_ptr<std::promise<Status>> weak_promise = promise;
std::weak_ptr<std::vector<int32_t>> weak_worker_pids = worker_pids;
Collaborator

why do we need weak_ptr and promise here?

def _get_worker_pids_from_raylet(self) -> List[int]:
    try:
        # Get worker pids from raylet via gRPC.
        return self._raylet_client.get_worker_pids()
Collaborator

this is an RPC, so we should make it async and change get_worker_pids_from_raylet to async.

Contributor

ah yes, @tianyi-ge, there is a pattern to turn this async gRPC call into an await/future method in Python; example here: https://github.com/ray-project/ray/blob/master/python/ray/includes/gcs_client.pxi#L177-L191
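Roughly, the linked pattern completes an asyncio future from the RPC callback instead of blocking a thread. A simplified, self-contained sketch of the idea (the real version is written in Cython against the C++ callback; names here are made up):

import asyncio
import threading

def call_rpc_with_callback(callback):
    # Stand-in for the C++ client: delivers the reply on another thread.
    threading.Thread(target=lambda: callback([123, 456]), daemon=True).start()

async def get_worker_pids_async():
    loop = asyncio.get_running_loop()
    fut = loop.create_future()

    def on_reply(pids):
        # The reply arrives off the asyncio thread, so hand it back thread-safely.
        loop.call_soon_threadsafe(fut.set_result, pids)

    call_rpc_with_callback(on_reply)
    return await fut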

raylet_proc = self._get_raylet_proc()
if raylet_proc is None:
pids = asyncio.run(self._get_worker_pids_from_raylet())
logger.debug(f"Worker PIDs from raylet: {pids}")

Bug: Asyncio Loop Conflict in Worker Process Retrieval

The _get_worker_processes method uses asyncio.run() to execute _get_worker_pids_from_raylet(). Since the ReporterAgent runs within an existing asyncio event loop, calling asyncio.run() from it raises a RuntimeError and crashes the application.
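A minimal sketch of two ways to avoid the nested event loop, assuming the coroutine from the surrounding code; function and parameter names here are illustrative:

import asyncio

async def fetch_pids(get_worker_pids_from_raylet):
    # When the caller already runs on the agent's event loop, just await the
    # coroutine instead of calling asyncio.run().
    return await get_worker_pids_from_raylet()

def fetch_pids_from_sync_code(get_worker_pids_from_raylet, loop):
    # If the caller must stay synchronous and runs off the loop thread,
    # schedule the coroutine onto the agent's loop and wait for the result.
    future = asyncio.run_coroutine_threadsafe(get_worker_pids_from_raylet(), loop)
    return future.result(timeout=1)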


Signed-off-by: tianyi-ge <[email protected]>

Labels

  • community-contribution: Contributed by the community
  • core: Issues that should be addressed in Ray Core
  • go: add ONLY when ready to merge, run all tests
  • observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling


Development

Successfully merging this pull request may close these issues.

[Observability] Improve ray system metrics

4 participants