Recovery stage of a searchable snapshot shard stuck at FINALIZE #83878

Open
dnhatn opened this issue Feb 13, 2022 · 10 comments
Labels
>bug, :Distributed Indexing/Recovery, Team:Distributed (Obsolete)

Comments


dnhatn commented Feb 13, 2022

Elasticsearch Version: 7.17

If there's a failure in prewarmCache, the recovery stage of a searchable snapshot shard will be stuck at FINALIZE even though its recovery has actually completed properly.
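
For reference, the stuck stage is visible through the recovery API. A shard in this state keeps reporting FINALIZE indefinitely even though all files and bytes have been transferred (abbreviated, illustrative output; the index name is a placeholder):

GET /my-index/_recovery

{
  "my-index": {
    "shards": [
      {
        "id": 28,
        "type": "SNAPSHOT",
        "stage": "FINALIZE",
        ...
      }
    ]
  }
}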

This is the failure during prewarmCache:

[
  "[instance-0000000058] [2022.02.04-000017][28] prewarming failed for file [_4db.fdt] 
 org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution 
 at org.elasticsearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:80) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:72) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListenerDirectly(ListenableFuture.java:112) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:100) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:149) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:147) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.StepListener.innerOnFailure(StepListener.java:57) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:36) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.support.GroupedActionListener.onFailure(GroupedActionListener.java:72) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.ActionListener$RunAfterActionListener.onFailure(ActionListener.java:350) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:77) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:765) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28) [elasticsearch-7.17.0.jar:7.17.0] 
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?] 
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?] 
 at java.lang.Thread.run(Thread.java:833) [?:?] 
 Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Failed to prefetch file part in cache 
 at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:257) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:231) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:53) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:65) ~[elasticsearch-7.17.0.jar:7.17.0] 
 ... 16 more 
 Caused by: java.io.IOException: Failed to prefetch file part in cache 
 at org.elasticsearch.xpack.searchablesnapshots.store.input.CachedBlobContainerIndexInput.prefetchPart(CachedBlobContainerIndexInput.java:216) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.store.SearchableSnapshotDirectory.lambda$prewarmCache$10(SearchableSnapshotDirectory.java:547) ~[?:?] 
 at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:34) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.0.jar:7.17.0] 
 ... 3 more 
 Caused by: java.util.concurrent.ExecutionException: org.apache.lucene.store.AlreadyClosedException: Cache file is evicted 
 at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:257) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:244) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:75) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.xpack.searchablesnapshots.store.input.CachedBlobContainerIndexInput.prefetchPart(CachedBlobContainerIndexInput.java:206) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.store.SearchableSnapshotDirectory.lambda$prewarmCache$10(SearchableSnapshotDirectory.java:547) ~[?:?] 
 at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:34) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.0.jar:7.17.0] 
 ... 3 more 
 Caused by: org.apache.lucene.store.AlreadyClosedException: Cache file is evicted 
 at org.elasticsearch.xpack.searchablesnapshots.cache.common.CacheFile.throwAlreadyEvicted(CacheFile.java:325) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.cache.common.CacheFile.ensureOpen(CacheFile.java:320) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.cache.common.CacheFile.acquireFileChannelReference(CacheFile.java:454) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.cache.common.CacheFile.populateAndRead(CacheFile.java:356) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.store.input.CachedBlobContainerIndexInput.prefetchPart(CachedBlobContainerIndexInput.java:194) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.store.SearchableSnapshotDirectory.lambda$prewarmCache$10(SearchableSnapshotDirectory.java:547) ~[?:?] 
 at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:34) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.0.jar:7.17.0] 
 ... 3 more"
]
dnhatn added the >bug, :Distributed Indexing/Recovery, and needs:triage labels on Feb 13, 2022
elasticmachine added the Team:Distributed (Obsolete) label on Feb 13, 2022
elasticmachine commented

Pinging @elastic/es-distributed (Team:Distributed)

pquentin removed the needs:triage label on Feb 14, 2022

tlrx commented Mar 29, 2022

We noticed that the current behaviour, where shards of fully mounted indices get stuck at the FINALIZE stage because of prewarming failures, is confusing for users.

@henningandersen and I discussed this and we agreed on implementing retry logic (maybe at the directory level) for cache file prewarming. In addition to this retry logic, we could add a FINALIZE_RETRY stage that would make it more explicit that we encountered errors during prewarming and are now retrying. We should also make sure that we are not polluting the logs with prewarming errors.
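
To illustrate the idea, here is a minimal, hypothetical sketch of what a directory-level retry around part prefetching could look like; PartPrefetcher, MAX_RETRIES, and prefetchWithRetry are illustrative names, not actual Elasticsearch identifiers:

import java.io.IOException;

// Illustrative sketch only: retry a failed part prefetch a few times before
// giving up, so that a transient cache file eviction does not leave the
// recovery stage stuck.
interface PartPrefetcher {
    void prefetchPart(int part) throws IOException;
}

final class PrewarmRetry {
    private static final int MAX_RETRIES = 3;

    static void prefetchWithRetry(PartPrefetcher prefetcher, int part) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                prefetcher.prefetchPart(part); // fails if the cache file was evicted
                return;
            } catch (IOException e) {
                last = e; // log quietly (e.g. at DEBUG) to avoid polluting the logs
            }
        }
        throw last; // give up; recovery should still be allowed to leave FINALIZE
    }
}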


rpasche commented Jul 14, 2022

It looks like we have now been hit by this problem in our production clusters.


rpasche commented Jul 14, 2022

Question: Could such a cache eviction be caused by the _cache/clear API call (while the restore is running)? https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-clearcache.html


tlrx commented Jul 26, 2022

> Question: Could such a cache eviction be caused by the _cache/clear API call (while the restore is running)? https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-clearcache.html

No. This API clears the index request, query, and field data caches, but not the caches used by searchable snapshots.

Usually, cache file evictions occur when a shard is relocated, removed, or closed during its prewarming.
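
For reference, the clear cache API only exposes flags for those index-level caches (an illustrative request; the index name is a placeholder):

POST /my-index/_cache/clear?query=true&request=true&fielddata=true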

gbanasiak commented

It's worth noting that there might be different scenarios leading to this problem. Here is an example with a network socket read timeout, from version 7.17.7:

2023-04-12T03:14:07.236Z WARN [0] prewarming failed for file [_86.fdt]
[..]
        "stacktrace": [
          "org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution",
[..]
          "Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Failed to prefetch file part in cache",
[..]
         "Caused by: java.io.IOException: Failed to prefetch file part in cache",
[..]
          "Caused by: java.net.SocketTimeoutException: Read timed out",                 <----- HERE
          "at sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:273) ~[?:?]",
          "at sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:299) ~[?:?]",
          "at sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:340) ~[?:?]",
          "at sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:789) ~[?:?]",
          "at java.net.Socket$SocketInputStream.read(Socket.java:1025) ~[?:?]",
          "at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:477) ~[?:?]",
          "at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:471) ~[?:?]",
          "at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70) ~[?:?]",
          "at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1465) ~[?:?]",
          "at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1069) ~[?:?]",
          "at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[?:?]",
          "at org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197) ~[?:?]",
          "at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176) ~[?:?]",
          "at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:107) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.services.s3.internal.S3AbortableInputStream.read(S3AbortableInputStream.java:125) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at org.elasticsearch.repositories.s3.S3RetryingInputStream.read(S3RetryingInputStream.java:141) ~[?:?]",
          "at java.io.FilterInputStream.read(FilterInputStream.java:119) ~[?:?]",
          "at org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.read(RateLimitingInputStream.java:62) ~[elasticsearch-7.17.7.jar:7.17.7]",
          "at java.io.FilterInputStream.read(FilterInputStream.java:119) ~[?:?]",
[..]


nerophon commented Aug 4, 2023

This is still happening. It caused repeated plan-change failures for a routine operation on a large cluster, and required 4 hours of manual labour to identify and work around. The process was to identify the stuck shards and cancel their allocation with the cluster reroute API:

POST _cluster/reroute
{
  "commands": [
    {
      "cancel": {
        "index": "<index_name>",
        "shard": <shard_number>,
        "node": "<node_name>",
        "allow_primary": <true/false>
      }
    }
  ]
}
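
For anyone hitting the same thing: the stuck shards can usually be spotted first with the cat recovery API, then cancelled one by one as above (an illustrative request; the column list is optional):

GET _cat/recovery?v&active_only=true&h=index,shard,stage,type,target_node

Any row that stays at the finalize stage indefinitely is a candidate.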

Please note that this happened on an 8.9 ESS cluster, so either this bug is not limited to 7.17, or we have found a new bug with identical symptoms but a different root cause.


joeafari commented Mar 7, 2025

Still occurring on an 8.14 cluster

renshuki commented

Still happening on 8.17


tlrx commented Apr 28, 2025

@joeafari @renshuki The issue is not fixed, so it may happen on any version, including 9.0.
