Skip to content

Conversation

abrarsheikh
Copy link
Contributor

@abrarsheikh abrarsheikh commented Sep 23, 2025

currently the RequestRouterConfig is initialized from the controller. When RequestRouterConfig is initialized, we try to import the Class and serialize it. This is unsafe because the controller does not have the application runtime-env. See the following stack trace for evidence of how RequestRouterConfig is initialized in controller

	Unexpected error occurred while applying config for application 'app1': 
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/serve/_private/application_state.py", line 691, in _reconcile_build_app_task
    overrided_infos = override_deployment_info(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/serve/_private/application_state.py", line 1333, in override_deployment_info
    override_options["deployment_config"] = DeploymentConfig(**original_options)
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/pydantic/v1/main.py", line 339, in __init__
    values, fields_set, validation_error = validate_model(__pydantic_self__.__class__, data)
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/pydantic/v1/main.py", line 1074, in validate_model
    v_, errors_ = field.validate(value, values, loc=field.alias, cls=cls_)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/pydantic/v1/fields.py", line 881, in validate
    v, errors = self._validate_singleton(v, values, loc, cls)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/pydantic/v1/fields.py", line 1098, in _validate_singleton
    return self._apply_validators(v, values, loc, cls, self.validators)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/pydantic/v1/fields.py", line 1154, in _apply_validators
    v = validator(cls, v, values, self, self.model_config)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/pydantic/v1/class_validators.py", line 337, in <lambda>
    return lambda cls, v, values, field, config: validator(v)
                                                 ^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/pydantic/v1/main.py", line 711, in validate
    return cls(**value)
           ^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/serve/config.py", line 135, in __init__
    self._serialize_request_router_cls()
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/serve/config.py", line 151, in _serialize_request_router_cls
    request_router_class = import_attr(request_router_path)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/ray/_common/utils.py", line 44, in import_attr
    module = importlib.import_module(module_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/importlib/__init__.py", line 90, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen importlib._bootstrap>", line 1387, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1360, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1324, in _find_and_load_unlocked
ModuleNotFoundError: No module named 'custom_request_router'

This PR import the custom class from build_serve_application so that the operation is performed in context of runtime env

The same idea applies to the deployment level autoscaling function and application-level autoscaling function.

Another issue happens because cloudpickle sees the UniformRequestRouter class as a top-level symbol in an importable module (custom_request_router).

When cloudpickle.dumps() runs, it recognizes that the object can be re-imported via
import custom_request_router; custom_request_router.UniformRequestRouter,
so instead of embedding the actual code, it just stores that import reference.

That’s why the pickle bytes only contain
b'...custom_request_router...UniformRequestRouter...' — no source or bytecode.

but we want cloudpickle to embed the class definition instead of just referencing it, call:

import importlib, cloudpickle
mod = importlib.import_module("custom_request_router")
cloudpickle.register_pickle_by_value(mod)
payload = cloudpickle.dumps(mod.UniformRequestRouter)

This tells cloudpickle to serialize everything from that module by value (with code included) rather than by reference.

We cannot rely on the reference because custom request router class can be used in serve proxy running outside of runtime env, hence having the entire bytecode make cloudpickle.loads possible

@abrarsheikh abrarsheikh requested a review from a team as a code owner September 23, 2025 22:09
@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Sep 23, 2025
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly defers the loading of the custom request router to the replica, which resolves the issue of ModuleNotFoundError in the controller. The change from serializing/deserializing the class to normalizing it to a path and importing it on demand is a good approach. I've added a couple of comments to update the docstrings which are now outdated due to these changes. Also, it seems the _serialized_request_router_cls attribute in RequestRouterConfig is now obsolete and could be removed in a follow-up to improve code clarity.

@abrarsheikh
Copy link
Contributor Author

I think this still wont work given runtime-env for proxy is not from application

@ray-gardener ray-gardener bot added the serve Ray Serve Related Issue label Sep 24, 2025
Copy link
Contributor

@zcin zcin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the intention to explicitly only support this for non-ingress deployments then? (unless they make it importable from the image)

@abrarsheikh
Copy link
Contributor Author

Is the intention to explicitly only support this for non-ingress deployments then? (unless they make it importable from the image)

We would encourage users to use

  1. Custom routers shipped with serve, there wont be that many of these to be honest.
  2. If they build their own, then we make these nuances clear to them in the documentation.

cursor[bot]

This comment was marked as outdated.

@abrarsheikh abrarsheikh force-pushed the SERVE-1134-abrar-request_router branch from 773063f to e97f7ba Compare October 3, 2025 17:03
@abrarsheikh abrarsheikh changed the title load custom request router lazily import custom class safely in controller by using runtime env Oct 3, 2025
Signed-off-by: abrar <[email protected]>
cursor[bot]

This comment was marked as outdated.

Signed-off-by: abrar <[email protected]>
Signed-off-by: abrar <[email protected]>
Copy link
Contributor

@eicherseiji eicherseiji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @abrarsheikh!

Is the intention to explicitly only support this for non-ingress deployments then? (unless they make it importable from the image)
We would encourage users to use

  1. Custom routers shipped with serve, there wont be that many of these to be honest.
  2. If they build their own, then we make these nuances clear to them in the documentation.

@abrarsheikh would it be possible to add the docs update to this PR as well?

@abrarsheikh
Copy link
Contributor Author

@abrarsheikh would it be possible to add the docs update to this PR as well?

I was planning to do that in a follow up PR, this once already quite long

cursor[bot]

This comment was marked as outdated.

Copy link
Contributor

@eicherseiji eicherseiji left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @abrarsheikh!

cursor[bot]

This comment was marked as outdated.

Signed-off-by: abrar <[email protected]>
cursor[bot]

This comment was marked as outdated.

cursor[bot]

This comment was marked as outdated.

Signed-off-by: abrar <[email protected]>
cursor[bot]

This comment was marked as outdated.

cursor[bot]

This comment was marked as outdated.

@edoakes
Copy link
Collaborator

edoakes commented Oct 17, 2025

@abrarsheikh I'm not very familiar with cloudpickle.register_pickle_by_value(mod), so I'm not sure what edge cases it does and doesn't handle. Do you have a clear sense of the guarantees here? Is there any chance of an un-importable class depending on the specifics of the environments?

@abrarsheikh
Copy link
Contributor Author

@abrarsheikh I'm not very familiar with cloudpickle.register_pickle_by_value(mod), so I'm not sure what edge cases it does and doesn't handle. Do you have a clear sense of the guarantees here? Is there any chance of an un-importable class depending on the specifics of the environments?

Gotchas and limitations -- based on my limited understanding of cloudpickle and limited by serve specific requirements :D , I am sure there are things that could go wrong.

  1. Transitive dependencies aren’t vendored. For example, if the class inherits from a superclass defined in another module, that dependency must still exist in the target environment.
  2. Globals and constants in external modules aren’t serialized by value — they’re still referenced by import.
  3. Environment parity matters. Differences in Python version, cloudpickle version, or library versions can affect deserialization.

When users provide a policy, we can fully support it as long as it’s simple, self-contained Python code relying only on the standard library. Once the policy becomes more complex (e.g., depends on other custom modules or packages), they’ll need to ensure those modules are bundled into the Docker image or environment.

This PR improves the out-of-the-box experience for most users and newcomers by lowering friction, but it doesn’t completely eliminate environment-dependency issues for complex setups.

Signed-off-by: abrar <[email protected]>
Signed-off-by: abrar <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests serve Ray Serve Related Issue

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants