Description
I have a setup with two Ignite (2.17.0) server nodes using ZookeeperDiscoverySpi for discovery. When a network partition occurs between the Ignite nodes, both nodes can still communicate with ZooKeeper. As expected, ZookeeperDiscoverySpi triggers a NODE_SEGMENTED event.
According to my configured segmentation policy (RESTART_JVM), the affected Ignite node stops and restarts.
However, during restart the network partition still exists, so the restarting node is again unable to join the cluster. This causes another NODE_SEGMENTED event, and the restart procedure is triggered again.
At this point, the restart thread becomes blocked because it cannot acquire the instance lock on IgnitionEx (inside the synchronized stop0() method). The lock is already held by the main thread, which is stuck inside IgnitionEx’s synchronized start0() method.
Inside start0(), the main thread hangs indefinitely in GridCachePartitionExchangeManager#onKernalStart() while waiting for an exchange future. It repeatedly times out trying to get the exchange future because it cannot communicate with the peer Ignite node. The timeout exception is caught and retried indefinitely, causing the main thread to loop forever and never release the instance lock.
This results in:
Restart thread blocked on instance lock
Main thread stuck inside an infinite retry loop
Node never recovers after segmentation during a network partition; the effect is like a deadlock
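
To make the locking pattern easier to discuss, here is a minimal plain-Java sketch of what I believe is happening. It is not Ignite code: the class `SegmentationLockDemo`, the `awaitExchange()` helper, and the `partitionHealed` flag are hypothetical stand-ins, and only the method names `start0()` and `stop0()` mirror the ones mentioned above. One thread holds the instance monitor inside a synchronized `start0()` while retrying a timed-out wait, and a second thread tries to enter a synchronized `stop0()` on the same instance.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of the reported behavior; none of this is Ignite's actual code.
class SegmentationLockDemo {
    private final CountDownLatch startEntered = new CountDownLatch(1);
    private volatile boolean partitionHealed = false;

    // Stand-in for waiting on the exchange future: times out while the partition persists.
    private void awaitExchange() throws TimeoutException, InterruptedException {
        Thread.sleep(20); // sleeping does not release the monitor held by start0()
        if (!partitionHealed) {
            throw new TimeoutException("peer node unreachable");
        }
    }

    // Holds the instance monitor for the whole retry loop.
    synchronized void start0() throws InterruptedException {
        startEntered.countDown();
        while (true) {
            try {
                awaitExchange();
                return;
            } catch (TimeoutException e) {
                // Timeout is swallowed and the wait is retried, so the monitor is never released.
            }
        }
    }

    // Needs the same instance monitor, so it cannot run while start0() is looping.
    synchronized void stop0() {
        // The real method would stop the node; nothing is needed for this sketch.
    }

    // Starts both threads and returns the state observed for the restart thread.
    static Thread.State observeRestartThreadState() throws InterruptedException {
        SegmentationLockDemo node = new SegmentationLockDemo();

        Thread starter = new Thread(() -> {
            try {
                node.start0();
            } catch (InterruptedException ignored) {
                // Exit quietly if interrupted.
            }
        }, "main-start");
        starter.setDaemon(true);
        starter.start();
        node.startEntered.await(); // start0() now owns the monitor

        Thread restart = new Thread(node::stop0, "restart");
        restart.setDaemon(true);
        restart.start();

        // Poll for up to two seconds until the restart thread is blocked on the monitor.
        Thread.State state = restart.getState();
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(2);
        while (state != Thread.State.BLOCKED && System.nanoTime() < deadline) {
            Thread.sleep(10);
            state = restart.getState();
        }

        // "Heal" the partition so both threads can finish and the demo exits cleanly.
        node.partitionHealed = true;
        starter.join(1000);
        restart.join(1000);
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("restart thread state: " + observeRestartThreadState());
    }
}
```

Running `main` should report the restart thread as BLOCKED for as long as the simulated partition lasts. In the sketch the loop only ends because the flag is flipped; in my case the partition persists, so nothing ever releases the lock.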