Recovery stage of a searchable snapshot shard stuck at FINALIZE #83878

Open
dnhatn opened this issue Feb 13, 2022 · 10 comments
Labels
>bug, :Distributed Indexing/Recovery, Team:Distributed (Obsolete)

Comments


dnhatn commented Feb 13, 2022

Elasticsearch Version: 7.17

If there's a failure in prewarmCache, the recovery stage of a searchable snapshot shard will be stuck at FINALIZE even though its recovery has actually completed properly.
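
For reference, the stuck stage is visible through the recovery API. A shard in this state keeps reporting FINALIZE indefinitely even though all files and bytes have been transferred (abbreviated, illustrative output; the index name is a placeholder):

GET /my-index/_recovery

{
  "my-index": {
    "shards": [
      {
        "id": 28,
        "type": "SNAPSHOT",
        "stage": "FINALIZE",
        ...
      }
    ]
  }
}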

This is the failure during prewarmCache:

[
  "[instance-0000000058] [2022.02.04-000017][28] prewarming failed for file [_4db.fdt] 
 org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution 
 at org.elasticsearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:80) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:72) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListenerDirectly(ListenableFuture.java:112) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:100) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:149) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:147) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.StepListener.innerOnFailure(StepListener.java:57) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:36) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.support.GroupedActionListener.onFailure(GroupedActionListener.java:72) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.ActionListener$Delegating.onFailure(ActionListener.java:66) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.ActionListener$RunAfterActionListener.onFailure(ActionListener.java:350) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:77) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:765) [elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28) [elasticsearch-7.17.0.jar:7.17.0] 
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?] 
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?] 
 at java.lang.Thread.run(Thread.java:833) [?:?] 
 Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Failed to prefetch file part in cache 
 at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:257) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:231) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:53) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:65) ~[elasticsearch-7.17.0.jar:7.17.0] 
 ... 16 more 
 Caused by: java.io.IOException: Failed to prefetch file part in cache 
 at org.elasticsearch.xpack.searchablesnapshots.store.input.CachedBlobContainerIndexInput.prefetchPart(CachedBlobContainerIndexInput.java:216) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.store.SearchableSnapshotDirectory.lambda$prewarmCache$10(SearchableSnapshotDirectory.java:547) ~[?:?] 
 at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:34) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.0.jar:7.17.0] 
 ... 3 more 
 Caused by: java.util.concurrent.ExecutionException: org.apache.lucene.store.AlreadyClosedException: Cache file is evicted 
 at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:257) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:244) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:75) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.xpack.searchablesnapshots.store.input.CachedBlobContainerIndexInput.prefetchPart(CachedBlobContainerIndexInput.java:206) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.store.SearchableSnapshotDirectory.lambda$prewarmCache$10(SearchableSnapshotDirectory.java:547) ~[?:?] 
 at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:34) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.0.jar:7.17.0] 
 ... 3 more 
 Caused by: org.apache.lucene.store.AlreadyClosedException: Cache file is evicted 
 at org.elasticsearch.xpack.searchablesnapshots.cache.common.CacheFile.throwAlreadyEvicted(CacheFile.java:325) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.cache.common.CacheFile.ensureOpen(CacheFile.java:320) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.cache.common.CacheFile.acquireFileChannelReference(CacheFile.java:454) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.cache.common.CacheFile.populateAndRead(CacheFile.java:356) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.store.input.CachedBlobContainerIndexInput.prefetchPart(CachedBlobContainerIndexInput.java:194) ~[?:?] 
 at org.elasticsearch.xpack.searchablesnapshots.store.SearchableSnapshotDirectory.lambda$prewarmCache$10(SearchableSnapshotDirectory.java:547) ~[?:?] 
 at org.elasticsearch.action.ActionRunnable$1.doRun(ActionRunnable.java:34) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) ~[elasticsearch-7.17.0.jar:7.17.0] 
 at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.0.jar:7.17.0] 
 ... 3 more"
]
dnhatn added the >bug, :Distributed Indexing/Recovery, and needs:triage labels on Feb 13, 2022
elasticmachine added the Team:Distributed (Obsolete) label on Feb 13, 2022
elasticmachine commented

Pinging @elastic/es-distributed (Team:Distributed)

pquentin removed the needs:triage label on Feb 14, 2022

tlrx commented Mar 29, 2022

We noticed that the current behaviour, where shards of fully mounted indices get stuck at the FINALIZE stage because of prewarming failures, is confusing for users.

@henningandersen and I discussed this and we agreed on implementing retry logic (maybe at the directory level) for cache file prewarming. In addition to this retry logic, we could add a FINALIZE_RETRY stage that would make it more explicit that we encountered errors during prewarming and are now retrying. We should also make sure that we are not polluting the logs with prewarming errors.
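
To illustrate the idea, here is a minimal, hypothetical sketch of what a directory-level retry around part prefetching could look like; PartPrefetcher, MAX_RETRIES, and prefetchWithRetry are illustrative names, not actual Elasticsearch identifiers:

import java.io.IOException;

// Illustrative sketch only: retry a failed part prefetch a few times before
// giving up, so that a transient cache file eviction does not leave the
// recovery stage stuck.
interface PartPrefetcher {
    void prefetchPart(int part) throws IOException;
}

final class PrewarmRetry {
    private static final int MAX_RETRIES = 3;

    static void prefetchWithRetry(PartPrefetcher prefetcher, int part) throws IOException {
        IOException last = null;
        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            try {
                prefetcher.prefetchPart(part); // fails if the cache file was evicted
                return;
            } catch (IOException e) {
                last = e; // log quietly (e.g. at DEBUG) to avoid polluting the logs
            }
        }
        throw last; // give up; recovery should still be allowed to leave FINALIZE
    }
}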


rpasche commented Jul 14, 2022

It looks like we have now been hit by this problem in our production clusters.


rpasche commented Jul 14, 2022

Question: Could such a cache eviction be caused by the _cache/clear API call (while the restore is running)? https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-clearcache.html


tlrx commented Jul 26, 2022

> Question: Could such a cache eviction be caused by the _cache/clear API call (while the restore is running)? https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-clearcache.html

No. This API clears the index request, query, and field data caches, but not the caches used by searchable snapshots.

Usually, cache file evictions occur when a shard is relocated, removed, or closed during its prewarming.
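
For reference, the clear cache API only exposes flags for those index-level caches (an illustrative request; the index name is a placeholder):

POST /my-index/_cache/clear?query=true&request=true&fielddata=true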

gbanasiak commented

It's worth noting that there might be different scenarios leading to this problem. Here is an example with a network socket read timeout, from version 7.17.7:

2023-04-12T03:14:07.236Z WARN [0] prewarming failed for file [_86.fdt]
[..]
        "stacktrace": [
          "org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution",
[..]
          "Caused by: java.util.concurrent.ExecutionException: java.io.IOException: Failed to prefetch file part in cache",
[..]
         "Caused by: java.io.IOException: Failed to prefetch file part in cache",
[..]
          "Caused by: java.net.SocketTimeoutException: Read timed out",                 <----- HERE
          "at sun.nio.ch.NioSocketImpl.timedRead(NioSocketImpl.java:273) ~[?:?]",
          "at sun.nio.ch.NioSocketImpl.implRead(NioSocketImpl.java:299) ~[?:?]",
          "at sun.nio.ch.NioSocketImpl.read(NioSocketImpl.java:340) ~[?:?]",
          "at sun.nio.ch.NioSocketImpl$1.read(NioSocketImpl.java:789) ~[?:?]",
          "at java.net.Socket$SocketInputStream.read(Socket.java:1025) ~[?:?]",
          "at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:477) ~[?:?]",
          "at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:471) ~[?:?]",
          "at sun.security.ssl.SSLSocketInputRecord.bytesInCompletePacket(SSLSocketInputRecord.java:70) ~[?:?]",
          "at sun.security.ssl.SSLSocketImpl.readApplicationRecord(SSLSocketImpl.java:1465) ~[?:?]",
          "at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:1069) ~[?:?]",
          "at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[?:?]",
          "at org.apache.http.impl.io.SessionInputBufferImpl.read(SessionInputBufferImpl.java:197) ~[?:?]",
          "at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:176) ~[?:?]",
          "at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:135) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:180) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:107) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at com.amazonaws.services.s3.internal.S3AbortableInputStream.read(S3AbortableInputStream.java:125) ~[?:?]",
          "at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:90) ~[?:?]",
          "at org.elasticsearch.repositories.s3.S3RetryingInputStream.read(S3RetryingInputStream.java:141) ~[?:?]",
          "at java.io.FilterInputStream.read(FilterInputStream.java:119) ~[?:?]",
          "at org.elasticsearch.index.snapshots.blobstore.RateLimitingInputStream.read(RateLimitingInputStream.java:62) ~[elasticsearch-7.17.7.jar:7.17.7]",
          "at java.io.FilterInputStream.read(FilterInputStream.java:119) ~[?:?]",
[..]


nerophon commented Aug 4, 2023

This is still happening. It caused repeated plan-change failures for a routine operation on a large cluster, and required 4 hours of manual labour to identify and work around. The process was to identify the stuck shards and cancel their allocation with the cluster reroute API:

POST _cluster/reroute
{
  "commands": [
    {
      "cancel": {
        "index": "<index_name>",
        "shard": <shard_number>,
        "node": "<node_name>",
        "allow_primary": <true/false>
      }
    }
  ]
}
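
For anyone hitting the same thing: the stuck shards can usually be spotted first with the cat recovery API, then cancelled one by one as above (an illustrative request; the column list is optional):

GET _cat/recovery?v&active_only=true&h=index,shard,stage,type,target_node

Any row that stays at the finalize stage indefinitely is a candidate.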

Please note that this happened on an 8.9 ESS cluster, so either this bug is not limited to 7.17, or we have found a new bug with identical symptoms but a different root cause.


joeafari commented Mar 7, 2025

Still occurring on an 8.14 cluster

renshuki commented

Still happening on 8.17


tlrx commented Apr 28, 2025

@joeafari @renshuki The issue is not fixed, so it may happen on any version, including 9.0.
