Fault Tolerance

Automatically fail-over jobs to other nodes in case of a crash.

❗️

This is a legacy Apache Ignite documentation

The new documentation is hosted here: https://ignite.apache.org/docs/latest/

Overview

Ignite supports automatic job failover. In case of a node crash, jobs are automatically transferred to other available nodes for re-execution. However, in Ignite you can also treat any job result as a failure as well. The worker node can still be alive, but it may be running low on CPU, I/O, disk space, etc. There are many conditions that may result in a failure within your application and you can trigger a failover based on them. Moreover, you have the ability to choose the node a job should be failed over to, as it could be different for different applications or different computations within the same application.

The FailoverSpi is responsible for handling the selection of a new node for the execution of a failed job. FailoverSpi inspects the failed job and the list of all available grid nodes on which the job execution can be retried. It ensures that the job is not re-mapped to the same node it had failed on. Failover is triggered when the method ComputeTask.result(...) returns the ComputeJobResultPolicy.FAILOVER policy. Ignite comes with a number of built-in customizable Failover SPI implementations.

At Least Once Guarantee

As long as there is at least one node standing, no job will ever be lost.

By default, Ignite will failover all jobs from stopped or crashed nodes automatically. For custom failover behavior, you should implement the ComputeTask.result() method. The example below triggers a failover whenever a job throws any IgniteException (or its subclasses):

public class MyComputeTask extends ComputeTaskSplitAdapter<String, String> {
    ...
      
    @Override 
    public ComputeJobResultPolicy result(ComputeJobResult res, List<ComputeJobResult> rcvd) {
        IgniteException err = res.getException();
     
        if (err != null)
            return ComputeJobResultPolicy.FAILOVER;
    
        // If there is no exception, wait for all job results.
        return ComputeJobResultPolicy.WAIT;
    }
  
    ...
}

Closure Failover

Closure failover is by default governed by ComputeTaskAdapter, which is triggered if a remote node either crashes or rejects closure execution. This default behavior may be overridden by using the IgniteCompute.withNoFailover() method, which creates an instance of IgniteCompute with a no-failover flag set on it. Here is an example:

IgniteCompute compute = ignite.compute().withNoFailover();

compute.apply(() -> {
    // Do something
    ...
}, "Some argument");

AlwaysFailOverSpi

Ignite splits a task into jobs and assigns them to multiple nodes for faster processing. In case of a node failure, AlwaysFailoverSpi always reroutes a failed job to another node. First, an attempt will be made to reroute the failed job to a node that has not executed any other job from the same task. If no such node is available, then an attempt will be made to reroute the failed job to one of the nodes that may be running other jobs from the same task. If none of the above attempts succeeds, then the job will not be failed over and null will be returned.

The following configuration parameters can be used to configure AlwaysFailoverSpi.

Setter Method

Description

Default

setMaximumFailoverAttempts(int)

Sets the maximum number of attempts to fail-over a failed job to other nodes.

5

<bean id="grid.custom.cfg" class="org.apache.ignite.IgniteConfiguration" singleton="true">
  ...
  <property name="failoverSpi">
  	<bean class="org.apache.ignite.spi.failover.always.AlwaysFailoverSpi">
    	<property name="maximumFailoverAttempts" value="5"/>
  	</bean>
  </property>
  ...
</bean>
AlwaysFailoverSpi failSpi = new AlwaysFailoverSpi();
 
IgniteConfiguration cfg = new IgniteConfiguration();
 
// Override maximum failover attempts.
failSpi.setMaximumFailoverAttempts(5);
 
// Override the default failover SPI.
cfg.setFailoverSpi(failSpi);
 
// Start Ignite node.
Ignition.start(cfg);