Deadlock: Restarting Ignite node on segmentation gets stuck indefinitely

I have a setup with two Ignite (2.17.0) server nodes that use ZookeeperDiscoverySpi for discovery. When a network partition occurs between the Ignite nodes, both nodes can still communicate with ZooKeeper. As expected, ZookeeperDiscoverySpi triggers a NODE_SEGMENTED event.
According to my configured segmentation policy (RESTART_JVM), the affected Ignite node stops and restarts.
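
For reference, a minimal sketch of a node configuration matching the setup described above. The ZooKeeper connection string and the timeout values are hypothetical placeholders, not the values from my environment, and the sketch assumes the `ignite-zookeeper` module is on the classpath.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.plugin.segmentation.SegmentationPolicy;
import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

public class SegmentedNodeStartup {
    public static void main(String[] args) {
        // Discovery through ZooKeeper; the address below is a hypothetical placeholder.
        ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();
        zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181");
        zkSpi.setSessionTimeout(30_000); // hypothetical value, in milliseconds
        zkSpi.setJoinTimeout(10_000);    // hypothetical value, in milliseconds

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setDiscoverySpi(zkSpi);
        // Policy from the report: restart the JVM when the node is segmented.
        cfg.setSegmentationPolicy(SegmentationPolicy.RESTART_JVM);

        Ignite ignite = Ignition.start(cfg);
        System.out.println("Started node: " + ignite.cluster().localNode().id());
    }
}
```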

However, during restart the network partition still exists, so the restarting node is again unable to join the cluster. This causes another NODE_SEGMENTED event, and the restart procedure is triggered again.

At this point, the restart thread becomes blocked because it cannot acquire the instance lock on IgnitionEx (inside the synchronized stop0() method). The lock is already held by the main thread, which is stuck inside IgnitionEx’s synchronized start0() method.

Inside start0(), the main thread hangs indefinitely in GridCachePartitionExchangeManager#onKernalStart() while waiting for an exchange future. Each attempt to get the exchange future times out because the node cannot communicate with the peer Ignite node. The timeout exception is caught and the wait is retried, so the main thread loops forever and never releases the instance lock.

This results in:

- The restart thread is blocked on the instance lock.
- The main thread is stuck in an infinite retry loop.
- The node never recovers after segmentation while the network partition persists. The behavior is effectively a deadlock.
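
The locking pattern described above can be modeled with plain JDK classes. This is a simplified, hypothetical sketch and not Ignite code: `SegmentationRestartModel`, `exchangeFuture`, `abortStart`, and `observeRestartThreadState()` are names invented for illustration, and the `abortStart` flag exists only so the demo terminates (the reported loop has no such exit).

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Simplified, hypothetical model of the reported locking pattern; not Ignite code. */
class SegmentationRestartModel {
    /** Stand-in for the exchange future; it never completes, as if the peer were unreachable. */
    private final CompletableFuture<Void> exchangeFuture = new CompletableFuture<>();
    /** Signals that start0() has acquired the instance monitor. */
    private final CountDownLatch startEntered = new CountDownLatch(1);
    /** Demo-only escape hatch so the model can shut down cleanly. */
    private volatile boolean abortStart;

    /** Models start0(): holds the instance monitor while retrying after every timeout. */
    public synchronized void start0() throws InterruptedException {
        startEntered.countDown();
        while (!abortStart) {
            try {
                exchangeFuture.get(50, TimeUnit.MILLISECONDS);
                return;
            } catch (TimeoutException e) {
                // Timeout is swallowed and the wait is retried; the monitor stays held.
            } catch (ExecutionException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    /** Models stop0(): needs the same instance monitor that start0() is holding. */
    public synchronized void stop0() {
        // Nothing to stop in the model; acquiring the monitor is the point.
    }

    /** Runs both threads and returns the state of the restart thread while start0() loops. */
    public static Thread.State observeRestartThreadState() {
        SegmentationRestartModel node = new SegmentationRestartModel();
        Thread starter = new Thread(() -> {
            try {
                node.start0();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "main-start");
        Thread restarter = new Thread(node::stop0, "restart");
        try {
            starter.start();
            node.startEntered.await();   // start0() now owns the monitor
            restarter.start();           // second NODE_SEGMENTED: the restart calls stop0()
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
            while (restarter.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline)
                Thread.sleep(10);
            Thread.State observed = restarter.getState();
            node.abortStart = true;      // demo only: let start0() exit and release the monitor
            starter.join();
            restarter.join();
            return observed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

Calling `SegmentationRestartModel.observeRestartThreadState()` should report the restart thread as `BLOCKED` for as long as the start loop keeps retrying, which matches what I see in the thread dump for the real node.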

Deadlock: Restarting Ignite node on segmentation gets stuck indefinitely #12480
