Allow timeout during trained model download process #129003

Merged

dan-rubinstein
Member

@dan-rubinstein dan-rubinstein commented Jun 5, 2025

Description

We currently allow users to provide a timeout when creating an inference endpoint and when performing an inference request. When a user creates an endpoint that requires a trained model deployment to be started, or sends an inference request to a default endpoint whose trained model deployment has not been started, we download the model (if it has not already been downloaded) before starting the deployment. During this download we do not currently time out when the user's requested timeout is exceeded; instead, we download the model fully and only then time out while the model deployment is starting.

This change fixes that poor experience by allowing the request to time out during the model download. If the timeout occurs, we still retain the existing behavior: the model download and the trained model deployment start continue in the background, so the user does not have to take any further action for the process to complete.
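For readers who want a concrete picture of the behavior described above, here is a minimal standalone sketch in plain Java. It is not the code in this PR and does not use any Elasticsearch classes: `TimeoutDuringDownloadSketch`, `downloadAndDeploy`, `awaitWithTimeout`, and `SketchDeploymentTimeoutException` are hypothetical names invented for illustration, and the download is simulated with a sleep. It only shows the general pattern, where the caller stops waiting once its timeout elapses while the download and deployment chain is left running.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical stand-in for a timeout error; NOT the real Elasticsearch exception class.
class SketchDeploymentTimeoutException extends RuntimeException {
    SketchDeploymentTimeoutException(String message) {
        super(message);
    }
}

class TimeoutDuringDownloadSketch {

    // Simulates "download the model, then start the deployment" as one async chain.
    // The sleep stands in for the download; the suffix stands in for the deployment start.
    static CompletableFuture<String> downloadAndDeploy(String modelId, long downloadMillis) {
        return CompletableFuture.supplyAsync(() -> {
            try {
                Thread.sleep(downloadMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("download interrupted", e);
            }
            return modelId;
        }).thenApply(id -> id + ":started");
    }

    // Waits for the chain up to the caller's timeout. On timeout the caller gets an
    // exception, but the future is deliberately NOT cancelled, so the work keeps going.
    static String awaitWithTimeout(CompletableFuture<String> work, long timeoutMillis) {
        try {
            return work.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            throw new SketchDeploymentTimeoutException(
                "Timed out after " + timeoutMillis + " ms; work continues in the background");
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("download or deployment failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // A 300 ms simulated download with a 50 ms caller timeout.
        CompletableFuture<String> work = downloadAndDeploy("my-model", 300);
        try {
            awaitWithTimeout(work, 50);
        } catch (SketchDeploymentTimeoutException e) {
            System.out.println("caller sees: " + e.getMessage());
        }
        // The background chain still finishes without any further action by the caller.
        System.out.println("background result: " + work.join());
    }
}
```

The key design point in the sketch is that the timeout path never calls `cancel` on the future, which mirrors the requirement that the download and deployment start still complete after the user's request has timed out.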

Testing

  • Tested locally that creating an ElasticsearchInternalService endpoint with a small timeout (1 second) throws ModelDeploymentTimeoutException and that the download and deployment start still complete asynchronously.
  • Tested that calling inference on a default endpoint with no model downloaded and no trained model deployment started behaves the same way as the test above.
  • Should we have some QA tests or IT tests for this?
    • Discussed with Wei and he will be working on QA tests for this as part of this issue

@dan-rubinstein dan-rubinstein added the >bug, :ml (Machine learning), Team:ML (Meta label for the ML team), v8.19.0, and v9.1.0 labels Jun 5, 2025
@elasticsearchmachine
Collaborator

Hi @dan-rubinstein, I've created a changelog YAML for you.

@dan-rubinstein
Member Author

@elasticmachine merge upstream

@dan-rubinstein dan-rubinstein marked this pull request as ready for review June 6, 2025 17:26
@elasticsearchmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@dan-rubinstein
Member Author

@elasticmachine merge upstream

elasticmachine and others added 2 commits July 2, 2025 15:35

@dan-rubinstein dan-rubinstein merged commit 136442d into elastic:main Jul 2, 2025
32 checks passed
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jul 3, 2025
* Allow timeout during trained model download process

* Update docs/changelog/129003.yaml

* Update timeout message

---------

Co-authored-by: Elastic Machine <[email protected]>
Labels
>bug, :ml (Machine learning), Team:ML (Meta label for the ML team), v9.2.0
4 participants