Taking a snapshot of an index closed prior to 7.2.0 fails with an NPE in 7.11.2 #70676

Open · DaveCTurner opened this issue Mar 22, 2021 · 23 comments
Labels: >bug · :Distributed Coordination/Snapshot/Restore · good first issue · help wanted · Team:Distributed (Obsolete)

Comments

@DaveCTurner (Contributor)

Elasticsearch version (bin/elasticsearch --version): master, various 7.x versions

Plugins installed: N/A

JVM version (java -version): N/A

OS version (uname -a if on a Unix-like system): N/A

Description of the problem including expected versus actual behavior:

In SnapshotsService#shards we assume that if an index has metadata in the cluster state then it has a routing table entry, but this isn't true if the index was closed prior to 7.2. As a workaround, you can open and close any such indices.
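
As a concrete sketch of that workaround, using the low-level Java REST client (the helper class and index name are hypothetical, not taken from this issue):

import java.io.IOException;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

// Workaround sketch: re-opening and re-closing the index rewrites it in the
// >= 7.2.0 closed-index format, which includes a routing table entry, so
// subsequent snapshots no longer hit the null IndexRoutingTable.
class ClosedIndexWorkaround {
    static void reopenAndReclose(RestClient client, String indexName) throws IOException {
        client.performRequest(new Request("POST", "/" + indexName + "/_open"));
        client.performRequest(new Request("POST", "/" + indexName + "/_close"));
    }
}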

Steps to reproduce:

Introduce a closed index in, say, 7.1.1 and then try to take a snapshot in master. For instance, apply these changes to b65992e ...

$ git diff
diff --git a/qa/full-cluster-restart/src/test/java/org/elasticsearch/upgrades/FullClusterRestartIT.java b/qa/full-cluster-restart/src/test/java/org/elasticsearch/upgrades/FullClusterRestartIT.java
index 865b076d4e1..bd20357bae9 100644
--- a/qa/full-cluster-restart/src/test/java/org/elasticsearch/upgrades/FullClusterRestartIT.java
+++ b/qa/full-cluster-restart/src/test/java/org/elasticsearch/upgrades/FullClusterRestartIT.java
@@ -14,6 +14,7 @@ import org.elasticsearch.client.Request;
 import org.elasticsearch.client.Response;
 import org.elasticsearch.client.ResponseException;
 import org.elasticsearch.client.RestClient;
+import org.elasticsearch.cluster.block.ClusterBlockException;
 import org.elasticsearch.cluster.metadata.IndexMetadata;
 import org.elasticsearch.cluster.metadata.MetadataIndexStateService;
 import org.elasticsearch.common.Booleans;
@@ -769,6 +770,9 @@ public class FullClusterRestartIT extends AbstractFullClusterRestartTestCase {
      * and some routing configuration.
      */
     public void testSnapshotRestore() throws IOException {
+        // Close an index in the old version to verify that we handle snapshotting it correctly in the newer version
+        final String closedIndexName = index + "_closed";
+
         int count;
         if (isRunningAgainstOldCluster()) {
             // Create the index
@@ -779,6 +783,9 @@ public class FullClusterRestartIT extends AbstractFullClusterRestartTestCase {
             }
             createIndex(index, settings.build());
             indexRandomDocuments(count, true, true, i -> jsonBuilder().startObject().field("field", "value").endObject());
+
+            createIndex(closedIndexName, settings.build());
+            closeIndex(closedIndexName);
         } else {
             count = countOfIndexedRandomDocuments();
         }
@@ -859,6 +866,14 @@ public class FullClusterRestartIT extends AbstractFullClusterRestartTestCase {
         if (false == isRunningAgainstOldCluster()) {
             checkSnapshot("new_snap", count, Version.CURRENT);
         }
+
+        final String closedSnapshotName = isRunningAgainstOldCluster() ? "old_snap_closed" : "new_snap_closed";
+        Request createSnapshotOfClosedIndex = new Request("PUT", "/_snapshot/repo/" + closedSnapshotName);
+        createSnapshotOfClosedIndex.addParameter("wait_for_completion", "true");
+        createSnapshotOfClosedIndex.setJsonEntity("{\"indices\": \"" + closedIndexName + "\"}");
+        final ResponseException responseException
+                = expectThrows(ResponseException.class, () -> client().performRequest(createSnapshotOfClosedIndex));
+        assertThat(responseException.getMessage(), containsString("index closed"));
     }

     public void testHistoryUUIDIsAdded() throws Exception {

... and then run ./gradlew :qa:full-cluster-restart:v7.1.1#bwcTest -Dtests.class=org.elasticsearch.upgrades.FullClusterRestartIT -Dtests.method=testSnapshotRestore.

Provide logs (if relevant):

In tests, one of the nodes fails with an AssertionError:

[2021-03-22T17:25:08,048][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [v7.1.1-1] fatal error in thread [elasticsearch[v7.1.1-1][masterService#updateTask][T#1]], exiting
java.lang.AssertionError: null
        at org.elasticsearch.snapshots.SnapshotsService.shards(SnapshotsService.java:2285) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.snapshots.SnapshotsService$1.execute(SnapshotsService.java:330) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.repositories.blobstore.BlobStoreRepository$1.execute(BlobStoreRepository.java:391) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:48) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:686) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:308) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:203) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:140) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:139) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:177) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:669) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:241) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:204) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]

In production it's an NPE instead:

[REDACTED] [REDACTED][REDACTED] failed to create snapshot
java.lang.NullPointerException: Cannot invoke "org.elasticsearch.cluster.routing.IndexRoutingTable.shard(int)" because "indexRoutingTable" is null
	at org.elasticsearch.snapshots.SnapshotsService.shards(SnapshotsService.java:2717) ~[elasticsearch-7.11.1.jar:7.11.1]
	at org.elasticsearch.snapshots.SnapshotsService.access$1000(SnapshotsService.java:115) ~[elasticsearch-7.11.1.jar:7.11.1]
	at org.elasticsearch.snapshots.SnapshotsService$2.execute(SnapshotsService.java:399) ~[elasticsearch-7.11.1.jar:7.11.1]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$1.execute(BlobStoreRepository.java:380) ~[elasticsearch-7.11.1.jar:7.11.1]
	at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:48) ~[elasticsearch-7.11.1.jar:7.11.1]
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:691) ~[elasticsearch-7.11.1.jar:7.11.1]
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:313) ~[elasticsearch-7.11.1.jar:7.11.1]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:208) [elasticsearch-7.11.1.jar:7.11.1]
	at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:62) [elasticsearch-7.11.1.jar:7.11.1]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:140) [elasticsearch-7.11.1.jar:7.11.1]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:139) [elasticsearch-7.11.1.jar:7.11.1]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:177) [elasticsearch-7.11.1.jar:7.11.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673) [elasticsearch-7.11.1.jar:7.11.1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:241) [elasticsearch-7.11.1.jar:7.11.1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:204) [elasticsearch-7.11.1.jar:7.11.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]
DaveCTurner added the >bug and :Distributed Coordination/Snapshot/Restore labels on Mar 22, 2021
elasticmachine added the Team:Distributed (Obsolete) label on Mar 22, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner (Contributor, Author)

We agreed to fix this by marking such shards as MISSING in future, effectively reinstating the branch that was removed from server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java in #39644.

The test I suggested above only works for BWC versions before 7.2.0 because of that change: more recent versions can successfully snapshot a closed index.
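
For orientation, a rough sketch of that approach (the variable names are loosely modelled on SnapshotsService#shards and the markMissing helper is hypothetical, so treat this as an illustration rather than the actual fix):

// Sketch only, not the real Elasticsearch code: when building the per-shard
// snapshot status map, guard against a null IndexRoutingTable, which can occur
// for indices closed before 7.2.0.
IndexRoutingTable indexRoutingTable = currentState.routingTable().index(indexMetadata.getIndex());
for (int shardNum = 0; shardNum < indexMetadata.getNumberOfShards(); shardNum++) {
    ShardId shardId = new ShardId(indexMetadata.getIndex(), shardNum);
    if (indexRoutingTable == null) {
        // Reinstates the branch removed in #39644: with no routing table entry
        // the shard cannot be assigned to a node, so record it as MISSING.
        builder.put(shardId, markMissing(shardId, "missing routing table entry"));
    } else {
        // Existing behaviour: assign the shard snapshot to the node holding the primary.
        ShardRouting primary = indexRoutingTable.shard(shardNum).primaryShard();
        // ... existing assignment logic ...
    }
}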

DaveCTurner added the good first issue and help wanted labels on Mar 23, 2021
@SivaTharun

@DaveCTurner I am new to open-source contributions and would like to pick up this issue.

@DaveCTurner (Contributor, Author)

Sure, go ahead @SivaTharun

@jnayan1 commented Apr 11, 2021

First-timer here, can I get started on this issue?

@DaveCTurner (Contributor, Author)

@SivaTharun are you still working on this?

@DaveCTurner (Contributor, Author)

No reply, so I think this issue is up for grabs again.

@SivaTharun

@DaveCTurner sorry for the late reply, I missed the conversation in my inbox. Can I start working on this issue?

@DaveCTurner (Contributor, Author)

Yes, sure, I thought you'd already started :)

@rushikesh-rd

@DaveCTurner can I work on this? I'm new to contributing.

arteam self-assigned this on Sep 1, 2021
arteam removed their assignment on Nov 9, 2021
@sarahlee429

@DaveCTurner - hi Dave. Not really sure what the status of this is. I'm interested in learning more about Elasticsearch and would like to start helping out here. Can I work on this issue?

@DaveCTurner (Contributor, Author)

Sure, it doesn't look like anyone is working on this.

@KhushV25

Hey, I am a newbie and would like to work on this. How can I get started?

@sarahlee429

@KhushV25 - I started looking at this. Haven't made too much progress yet. I'm still in the process of trying to set up my dev environment to test against. Did you want to try to work on it together?

@KhushV25

@sarahlee429 - Hey, I would love to work with you, but I am a total beginner and this would be my first ever project. Would it be okay for you to work with me?

@sarahlee429

@KhushV25 sure thing. I'm pretty new to ES also, so we can learn together. The README has quite a bit of info about getting started, if you haven't gone through it yet.

@KhushV25

@sarahlee429 Hey, can we discuss this on Discord for better communication? Here is my user id: khushhere.

@baseely commented Jan 8, 2022

@KhushV25 @sarahlee429 Just checking whether either of you is already working on this issue. Could you please share the status of your work so far? We can discuss more here and share ideas before digging into further implementation.

@zembrzuski (Contributor)

I am also a total beginner and I would love to start contributing to Elasticsearch.

@KhushV25 @sarahlee429 - Have you guys made some progress on this issue? If you have, please let me know, so maybe we can work together to learn and contribute to the community!

@baseely - I've created a PR - I know it is rough work, but I would appreciate it if you could do a quick review and help me get on track to produce good-quality code to fix this issue.

Thanks,
Rodrigo

@zembrzuski (Contributor)

Hi @DaveCTurner

I know you might be very busy working on more important issues, but could you please take a quick look at the PR that I raised?

I raised the PR a few days ago, but I haven't got any feedback yet. This is my first contribution to Elasticsearch (and I know the PR is still a work in progress), but I would appreciate it if you could spend a few minutes giving me some feedback, so I know whether I am on the right path.

Link to the PR

Thank you so much!

@chen-ni (Contributor) commented Jul 13, 2022

Hi @zembrzuski, are you still working on this? If not then I'd like to pick up your work :)

@jonny5203

Hi, I am new here. I noticed that this issue has been open since 2021; if no one is currently working on it, I would love to take it over.

@DaveCTurner (Contributor, Author)

Go for it @jonny5203, though please note that any fix will need to include a test case that reproduces the problem, and that will require you to write a test that starts with a cluster running v7.1.1 or earlier and then upgrades it all the way to the target branch version (i.e. main, which today targets v9.1.0). I believe this will be possible, but I doubt it will be easy for a new contributor. All power to your elbow, of course, but you might want to pick easier issues for your first contributions.
