Skip to content

Issue-105420: Fix bug causing incorrect error on force deleting already deleted model #107188

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

batcity
Copy link

@batcity batcity commented Apr 8, 2024

What I did:

Previously, attempting to force delete a referenced model in Elasticsearch that had already been deleted would result in an error indicating that the model is still being referenced by ingest processors. This behavior was misleading since the model no longer exists in the system.

Changes made:

  • Modified the deleteModel method to to properly check for the existence of the model before proceeding with the deletion process.

With this fix, attempting to force delete an already deleted model will now correctly indicate that the model isn't found, rather than misleadingly reporting it as still being referenced.

Related issue:

Fixes #105420

How did I test my fix:

I built and started up elasticsearch locally, then I downloaded a new model using the following command:

curl --insecure -u "elastic:<password>" -X PUT "https://localhost:9200/_ml/trained_models/.elser_model_2?pretty" -H 'Content-Type: application/json' -d' { "input": { "field_names": ["text_field"] } } '

Then I attempted to delete it and it said that it's being referenced as expected:

curl --insecure -u "elastic:<password>" -X DELETE "https://localhost:9200/_ml/trained_models/.elser_model_2?pretty" { "error" : { "root_cause" : [ { "type" : "status_exception", "reason" : "Cannot delete model [.elser_model_2] as it is still referenced by ingest processors; use force to delete the model" } ], "type" : "status_exception", "reason" : "Cannot delete model [.elser_model_2] as it is still referenced by ingest processors; use force to delete the model" }, "status" : 409 }

I force deleted it the first time:

curl --insecure -u "elastic:<password>" -X DELETE "https://localhost:9200/_ml/trained_models/.elser_model_2?force=true" {"acknowledged":true}%

It now returns the right error response when I attempt to delete the same model again:

curl --insecure -u "elastic:<password>" -X DELETE "https://localhost:9200/_ml/trained_models/.elser_model_2" {"error":{"root_cause":[{"type":"resource_not_found_exception","reason":"Could not find trained model [.elser_model_2]"}],"type":"resource_not_found_exception","reason":"Could not find trained model [.elser_model_2]"},"status":404}%

PR checklist:

  • Have you signed the contributor license agreement? ✔️Yes

  • Have you followed the contributor guidelines? ✔️Yes

  • If submitting code, have you built your formula locally prior to submission with gradle check?

    I ran tests in the package to make sure I didn't break anything, here's the command I ran:

    ./gradlew :x-pack:plugin:ml:test

    Here's a screenshot showing that the tests succeeded:

Screenshot 2024-04-10 at 4 44 21 AM
  • If submitting code, is your pull request against main? Unless there is a good reason otherwise, we prefer pull requests against main and will backport as needed. ✔️Yes
  • If submitting code, have you checked that your submission is for an OS and architecture that we support? ✔️Yes
  • If you are submitting this code for a class then read our policy for that. N/A

@elasticsearchmachine elasticsearchmachine added external-contributor Pull request authored by a developer outside the Elasticsearch team v8.14.0 labels Apr 8, 2024
@batcity batcity changed the title draft: Issue-105420: Fix bug causing incorrect error on force deleting already deleted model Issue-105420: Fix bug causing incorrect error on force deleting already deleted model Apr 9, 2024
@batcity batcity marked this pull request as ready for review April 9, 2024 23:45
@elasticsearchmachine elasticsearchmachine added the needs:triage Requires assignment of a team area label label Apr 9, 2024
@davidkyle davidkyle added :ml Machine learning and removed needs:triage Requires assignment of a team area label labels Apr 10, 2024
@elasticsearchmachine elasticsearchmachine added the Team:ML Meta label for the ML team label Apr 10, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/ml-core (Team:ML)

@davidkyle
Copy link
Member

@elasticmachine test this please

@davidkyle davidkyle added the >bug label Apr 10, 2024
@maxhniebergall maxhniebergall self-requested a review April 10, 2024 18:27
Copy link
Contributor

@maxhniebergall maxhniebergall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for submitting this change! I have requested a few changes, but I am open to doing it a different way. Let me know what you think!

edit: please run this command once your next commit is ready ./gradlew :x-pack:plugin:core:spotlessApply :x-pack:plugin:core:precommit :x-pack:plugin:ml:spotlessApply :x-pack:plugin:ml:precommit :x-pack:plugin:inference:spotlessApply :x-pack:plugin:inference:precommit :x-pack:plugin:ml:qa:native-multi-node-tests:precommit

String id = request.getId();
IngestMetadata currentIngestMetadata = state.metadata().custom(IngestMetadata.TYPE);
Set<String> referencedModels = getReferencedModelKeys(currentIngestMetadata, ingestService);

if (modelExists(request.getId()) == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! this looks very clean and readable.

trainedModelProvider.getTrainedModel(modelId, GetTrainedModelsAction.Includes.empty(), null, trainedModelListener);

try {
boolean latchReached = latch.await(5, TimeUnit.SECONDS);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want a 5 second timeout here which will throw an exception. If this timeout occurs, I don't think it will be any clearer what happened for the end user.

}
};

trainedModelProvider.getTrainedModel(modelId, GetTrainedModelsAction.Includes.empty(), null, trainedModelListener);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't pass a parent task to this request, it wont be cancelable. I think it would be better to pass in the task from the masterOperation here.

Comment on lines +247 to +250
protected boolean modelExists(String modelId) {
CountDownLatch latch = new CountDownLatch(1);
final AtomicBoolean modelExists = new AtomicBoolean(false);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid using a latch and requiring a timeout, I think we could replace this function with an actionListener. What do you think?


@Override
public void onFailure(Exception e) {
logger.error("Failed to retrieve model {}: {}", modelId, e.getMessage(), e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.error("Failed to retrieve model {}: {}", modelId, e.getMessage(), e);
logger.error("Failed to retrieve model [" + modelId + "]: [" + e.getMessage() + "]", e);

@@ -228,6 +244,39 @@ private void deleteModel(DeleteTrainedModelAction.Request request, ClusterState
}
}

protected boolean modelExists(String modelId) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can avoid the countdown latch and hence blocking the calling thread by using a listener.

You don't have to timeout the call to trainedModelProvider.getTrainedModel() if it does timeout simply out let the error propagate from the call.

    private void modelExists(String modelId, ActionListener<Boolean> listener) {
        trainedModelProvider.getTrainedModel(modelId, GetTrainedModelsAction.Includes.empty(), null,
            ActionListener.wrap(
                model -> listener.onResponse(Boolean.TRUE),
                exception -> {
                    if (ExceptionsHelper.unwrapCause(exception) instanceof ResourceNotFoundException) {
                        listener.onResponse(Boolean.FALSE);
                    } else {
                        listener.onFailure(exception);
                    }
                }
            )
        );
    }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After you make the modelExists() function async with a listener you will need to chain the various processing steps together. The best way to do this is use a SubscribableListener

Here's an example if it being used:
https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportGetTrainedModelsStatsAction.java#L128


@Override
public void onFailure(Exception e) {
logger.error("Failed to retrieve model {}: {}", modelId, e.getMessage(), e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't an error as we are checking the model's existence here. If the model doesn't exist then that is fine and it should be reported back to the caller rather than logged.

throw new ElasticsearchException("Timeout while waiting for trained model to be retrieved");
}
} catch (InterruptedException e) {
throw new ElasticsearchException("Unexpected exception", e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is not necessary if you take my suggestion but in Java it's best practice in to reset the interrupt flag with Thread.currentThread().interrupt();

@maxhniebergall maxhniebergall removed their assignment May 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug external-contributor Pull request authored by a developer outside the Elasticsearch team :ml Machine learning Team:ML Meta label for the ML team v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wrong error message returned on delete trained model when model already deleted
5 participants