-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Issue-105420: Fix bug causing incorrect error on force deleting already deleted model #107188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Pinging @elastic/ml-core (Team:ML) |
@elasticmachine test this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for submitting this change! I have requested a few changes, but I am open to doing it a different way. Let me know what you think!
edit: please run this command once your next commit is ready ./gradlew :x-pack:plugin:core:spotlessApply :x-pack:plugin:core:precommit :x-pack:plugin:ml:spotlessApply :x-pack:plugin:ml:precommit :x-pack:plugin:inference:spotlessApply :x-pack:plugin:inference:precommit :x-pack:plugin:ml:qa:native-multi-node-tests:precommit
String id = request.getId(); | ||
IngestMetadata currentIngestMetadata = state.metadata().custom(IngestMetadata.TYPE); | ||
Set<String> referencedModels = getReferencedModelKeys(currentIngestMetadata, ingestService); | ||
|
||
if (modelExists(request.getId()) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice! this looks very clean and readable.
trainedModelProvider.getTrainedModel(modelId, GetTrainedModelsAction.Includes.empty(), null, trainedModelListener); | ||
|
||
try { | ||
boolean latchReached = latch.await(5, TimeUnit.SECONDS); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want a 5 second timeout here which will throw an exception. If this timeout occurs, I don't think it will be any clearer what happened for the end user.
} | ||
}; | ||
|
||
trainedModelProvider.getTrainedModel(modelId, GetTrainedModelsAction.Includes.empty(), null, trainedModelListener); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we don't pass a parent task to this request, it wont be cancelable. I think it would be better to pass in the task from the masterOperation here.
protected boolean modelExists(String modelId) { | ||
CountDownLatch latch = new CountDownLatch(1); | ||
final AtomicBoolean modelExists = new AtomicBoolean(false); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid using a latch and requiring a timeout, I think we could replace this function with an actionListener. What do you think?
|
||
@Override | ||
public void onFailure(Exception e) { | ||
logger.error("Failed to retrieve model {}: {}", modelId, e.getMessage(), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.error("Failed to retrieve model {}: {}", modelId, e.getMessage(), e); | |
logger.error("Failed to retrieve model [" + modelId + "]: [" + e.getMessage() + "]", e); |
@@ -228,6 +244,39 @@ private void deleteModel(DeleteTrainedModelAction.Request request, ClusterState | |||
} | |||
} | |||
|
|||
protected boolean modelExists(String modelId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can avoid the countdown latch and hence blocking the calling thread by using a listener.
You don't have to timeout the call to trainedModelProvider.getTrainedModel()
if it does timeout simply out let the error propagate from the call.
private void modelExists(String modelId, ActionListener<Boolean> listener) {
trainedModelProvider.getTrainedModel(modelId, GetTrainedModelsAction.Includes.empty(), null,
ActionListener.wrap(
model -> listener.onResponse(Boolean.TRUE),
exception -> {
if (ExceptionsHelper.unwrapCause(exception) instanceof ResourceNotFoundException) {
listener.onResponse(Boolean.FALSE);
} else {
listener.onFailure(exception);
}
}
)
);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After you make the modelExists()
function async with a listener you will need to chain the various processing steps together. The best way to do this is use a SubscribableListener
Here's an example if it being used:
https://github.com/elastic/elasticsearch/blob/main/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportGetTrainedModelsStatsAction.java#L128
|
||
@Override | ||
public void onFailure(Exception e) { | ||
logger.error("Failed to retrieve model {}: {}", modelId, e.getMessage(), e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't an error as we are checking the model's existence here. If the model doesn't exist then that is fine and it should be reported back to the caller rather than logged.
throw new ElasticsearchException("Timeout while waiting for trained model to be retrieved"); | ||
} | ||
} catch (InterruptedException e) { | ||
throw new ElasticsearchException("Unexpected exception", e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is not necessary if you take my suggestion but in Java it's best practice in to reset the interrupt flag with Thread.currentThread().interrupt();
What I did:
Previously, attempting to force delete a referenced model in Elasticsearch that had already been deleted would result in an error indicating that the model is still being referenced by ingest processors. This behavior was misleading since the model no longer exists in the system.
Changes made:
deleteModel
method to to properly check for the existence of the model before proceeding with the deletion process.With this fix, attempting to force delete an already deleted model will now correctly indicate that the model isn't found, rather than misleadingly reporting it as still being referenced.
Related issue:
Fixes #105420
How did I test my fix:
I built and started up elasticsearch locally, then I downloaded a new model using the following command:
curl --insecure -u "elastic:<password>" -X PUT "https://localhost:9200/_ml/trained_models/.elser_model_2?pretty" -H 'Content-Type: application/json' -d' { "input": { "field_names": ["text_field"] } } '
Then I attempted to delete it and it said that it's being referenced as expected:
curl --insecure -u "elastic:<password>" -X DELETE "https://localhost:9200/_ml/trained_models/.elser_model_2?pretty" { "error" : { "root_cause" : [ { "type" : "status_exception", "reason" : "Cannot delete model [.elser_model_2] as it is still referenced by ingest processors; use force to delete the model" } ], "type" : "status_exception", "reason" : "Cannot delete model [.elser_model_2] as it is still referenced by ingest processors; use force to delete the model" }, "status" : 409 }
I force deleted it the first time:
curl --insecure -u "elastic:<password>" -X DELETE "https://localhost:9200/_ml/trained_models/.elser_model_2?force=true" {"acknowledged":true}%
It now returns the right error response when I attempt to delete the same model again:
curl --insecure -u "elastic:<password>" -X DELETE "https://localhost:9200/_ml/trained_models/.elser_model_2" {"error":{"root_cause":[{"type":"resource_not_found_exception","reason":"Could not find trained model [.elser_model_2]"}],"type":"resource_not_found_exception","reason":"Could not find trained model [.elser_model_2]"},"status":404}%
PR checklist:
Have you signed the contributor license agreement? ✔️Yes
Have you followed the contributor guidelines? ✔️Yes
If submitting code, have you built your formula locally prior to submission with
gradle check
?I ran tests in the package to make sure I didn't break anything, here's the command I ran:
./gradlew :x-pack:plugin:ml:test
Here's a screenshot showing that the tests succeeded: