Deadlock: Restarting Ignite node on segmentation gets stuck indefinitely #12480

@OmShinde1513

Description

I have a setup with two Ignite (2.17.0) server nodes that use ZookeeperDiscoverySpi for discovery. When a network partition occurs between the Ignite nodes, both nodes can still communicate with ZooKeeper. As expected, ZookeeperDiscoverySpi triggers a NODE_SEGMENTED event.
According to my configured segmentation policy (RESTART_JVM), the affected Ignite node stops and restarts.

However, the network partition still exists during the restart, so the restarting node is again unable to join the cluster. This causes another NODE_SEGMENTED event, and the restart procedure is triggered a second time.

At this point, the restart thread becomes blocked because it cannot acquire the instance lock on IgnitionEx (inside the synchronized stop0() method). The lock is already held by the main thread, which is stuck inside IgnitionEx’s synchronized start0() method.
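One way to confirm this kind of lock contention from inside the JVM is to list the threads in the BLOCKED state together with the owner of the monitor they are waiting for. The following is a minimal sketch that uses only the standard `java.lang.management` API; the class name `BlockedThreadReport` and the output format are made up for illustration and are not part of Ignite. A `jstack` thread dump shows the same information.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper; not part of Ignite. */
class BlockedThreadReport {
    /** Returns one line per BLOCKED thread: "<thread> waits for <lock> held by <owner>". */
    public static List<String> blockedThreads() {
        List<String> lines = new ArrayList<>();
        // dumpAllThreads(lockedMonitors, lockedSynchronizers) takes a snapshot of all live threads.
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(true, false)) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                lines.add(info.getThreadName() + " waits for " + info.getLockName()
                    + " held by " + info.getLockOwnerName());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        blockedThreads().forEach(System.out::println);
    }
}
```

In the situation described here, I would expect such a report to show the restart thread waiting for the instance monitor, with the main thread as its owner.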

Inside start0(), the main thread hangs indefinitely in GridCachePartitionExchangeManager#onKernalStart() while waiting for an exchange future. Each attempt to get the exchange future times out because the node cannot communicate with the peer Ignite node. The timeout exception is caught and the wait is retried indefinitely, so the main thread loops forever and never releases the instance lock.

This results in:

Restart thread blocked on instance lock

Main thread stuck inside an infinite retry loop

Node never recovers after segmentation during a network partition; the effect is like a deadlock
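The pattern described above can be modeled with plain JDK classes. This is only a sketch of the locking behavior, not Ignite code: the class `SegmentationHangSketch`, the field `exchangeFut`, the 50 ms timeout, and the thread names are all hypothetical stand-ins. `start0()` and `stop0()` here just mirror the synchronized methods named in this report.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical stand-in for the start/stop locking described above; not Ignite code. */
class SegmentationHangSketch {
    /** Never completes, like an exchange future whose peer node is unreachable. */
    private final CompletableFuture<Void> exchangeFut = new CompletableFuture<>();

    /** Released once the starting thread is inside start0() and holds the monitor. */
    private final CountDownLatch startEntered = new CountDownLatch(1);

    /** Holds the instance monitor while retrying the future after every timeout. */
    synchronized void start0() throws InterruptedException {
        startEntered.countDown();
        while (true) {
            try {
                exchangeFut.get(50, TimeUnit.MILLISECONDS);
                return;
            } catch (TimeoutException e) {
                // Swallowed and retried, so the monitor is never released.
            } catch (ExecutionException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    /** Needs the same monitor, so it cannot run while start0() is looping. */
    synchronized void stop0() {
        exchangeFut.cancel(true);
    }

    /** Returns the state of the restart thread while start0() is still looping. */
    public static Thread.State observeRestartThreadState() throws InterruptedException {
        SegmentationHangSketch node = new SegmentationHangSketch();

        Thread starter = new Thread(() -> {
            try {
                node.start0();
            } catch (InterruptedException | RuntimeException ignored) {
                // The sketch ends start0() by interrupting this thread.
            }
        }, "main-start");
        Thread restart = new Thread(node::stop0, "restart");
        starter.setDaemon(true);
        restart.setDaemon(true);

        starter.start();
        node.startEntered.await();   // start0() now owns the monitor
        restart.start();             // stop0() must wait for that monitor

        // Bounded poll until the restart thread is waiting on the monitor.
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(5);
        while (restart.getState() != Thread.State.BLOCKED && System.nanoTime() < deadline)
            Thread.sleep(10);

        Thread.State state = restart.getState();
        starter.interrupt();         // end the demo loop; the reported hang has no such exit
        return state;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("restart thread state: " + observeRestartThreadState());
    }
}
```

The only reason the sketch terminates is the explicit `interrupt()` at the end. In the behavior reported here nothing plays that role, so the thread calling stop0() stays blocked for as long as start0() keeps retrying.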
