Allow timeout during trained model download process #129003


Open · wants to merge 3 commits into main

Conversation

@dan-rubinstein dan-rubinstein commented Jun 5, 2025

Description

We currently allow users to provide a timeout when creating an inference endpoint and when performing an inference request. When creating an endpoint that requires a trained model deployment to be started, or when calling inference on a default endpoint whose trained model deployment has not been started, we first download the model if it has not been downloaded previously. Today this download step ignores the user's requested timeout: the model is downloaded in full, and the request only times out later, while the model deployment is starting. This change fixes that poor experience by allowing the request to time out during the model download. When the timeout fires, the model download and the trained model deployment start still continue in the background, so the user does not need to take any further action for the process to complete.

Testing

  • Tested that locally creating an ElasticsearchInternalService endpoint with a small timeout (1 second) will throw the ModelDeploymentTimeoutException and will complete the download/deployment start asynchronously.
  • Tested that calling inference on a default endpoint with no model downloaded/no trained model deployment started has the same experience as the test above.
  • Should we have some QA tests or IT tests for this?

TODO: Test what happens when an inference endpoint is created with a short timeout. It still downloads the model, creates the endpoint, and starts the deployment, but the error message is confusing because it tells the user to try again.

@dan-rubinstein dan-rubinstein added >bug :ml Machine learning Team:ML Meta label for the ML team v8.19.0 v9.1.0 labels Jun 5, 2025
@elasticsearchmachine
Collaborator

Hi @dan-rubinstein, I've created a changelog YAML for you.

@dan-rubinstein
Member Author

@elasticmachine merge upstream

@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

4 participants