IGFS as Hadoop FileSystem
In-Memory Distributed Hadoop-Compliant File System.
IGFS and the Ignite Hadoop Accelerator have been discontinued. Instead, use the architecture described on the page below to accelerate Hadoop deployments with Ignite in-memory clusters: https://ignite.apache.org/use-cases/hadoop-acceleration.html
Ignite Hadoop Accelerator ships with a Hadoop-compliant IGFS file system implementation called IgniteHadoopFileSystem. Hadoop can run over this file system in a plug-and-play fashion, significantly reducing I/O and improving both latency and throughput.
Configure Ignite
Apache Ignite Hadoop Accelerator performs file system operations within an Ignite cluster. Several prerequisites must be satisfied:
- IGNITE_HOME environment variable must be set and point to the root of the Ignite installation directory.
- Each cluster node must have Hadoop jars in CLASSPATH. See the respective Ignite installation guide for your Hadoop distribution for details.
- IGFS must be configured on the cluster node. See https://apacheignite-fs.readme.io/docs/igfs for details on how to do that.
- To let IGFS accept requests from Hadoop, an endpoint should be configured (the default configuration file is ${IGNITE_HOME}/config/default-config.xml).
Ignite offers two endpoint types:
- shmem - working over shared memory (not available on Windows);
- tcp - working over standard socket API.
The shared memory endpoint is the recommended approach when the code executing file system operations runs on the same machine as the Ignite node. Note that the port parameter is also used in shared memory mode to perform the initial client-server handshake:
<bean class="org.apache.ignite.configuration.FileSystemConfiguration">
  <property name="name" value="myIgfs"/>
  ...
  <property name="ipcEndpointConfiguration">
    <bean class="org.apache.ignite.igfs.IgfsIpcEndpointConfiguration">
      <property name="type" value="SHMEM"/>
      <property name="port" value="12345"/>
    </bean>
  </property>
  ...
</bean>
FileSystemConfiguration fileSystemCfg = new FileSystemConfiguration();

fileSystemCfg.setName("myIgfs");

IgfsIpcEndpointConfiguration endpointCfg = new IgfsIpcEndpointConfiguration();

// Shared memory endpoint; the port is still used for the initial handshake.
endpointCfg.setType(IgfsIpcEndpointType.SHMEM);
endpointCfg.setPort(12345);

fileSystemCfg.setIpcEndpointConfiguration(endpointCfg);
The TCP endpoint should be used when the Ignite node is located on another machine, or when shared memory is not available.
<bean class="org.apache.ignite.configuration.FileSystemConfiguration">
  <property name="name" value="myIgfs"/>
  ...
  <property name="ipcEndpointConfiguration">
    <bean class="org.apache.ignite.igfs.IgfsIpcEndpointConfiguration">
      <property name="type" value="TCP"/>
      <property name="host" value="myHost"/>
      <property name="port" value="12345"/>
    </bean>
  </property>
  ...
</bean>
FileSystemConfiguration fileSystemCfg = new FileSystemConfiguration();

fileSystemCfg.setName("myIgfs");

IgfsIpcEndpointConfiguration endpointCfg = new IgfsIpcEndpointConfiguration();

// TCP endpoint for remote nodes or when shared memory is unavailable.
endpointCfg.setType(IgfsIpcEndpointType.TCP);
endpointCfg.setHost("myHost");
endpointCfg.setPort(12345);

fileSystemCfg.setIpcEndpointConfiguration(endpointCfg);
If host is not set, it defaults to 127.0.0.1.
If port is not set, it defaults to 10500.
If ipcEndpointConfiguration is not set, then a shared memory endpoint with the default port will be used on Linux systems, and a TCP endpoint with the default port will be used on Windows.
Run Ignite
When the Ignite node is configured, start it using the following command:
$ bin/ignite.sh
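Alternatively, the node can be configured and started programmatically. Below is a minimal sketch, assuming the myIgfs file system configured earlier; the class name is a hypothetical placeholder:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.FileSystemConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class StartIgfsNode {
    public static void main(String[] args) {
        FileSystemConfiguration fileSystemCfg = new FileSystemConfiguration();
        fileSystemCfg.setName("myIgfs");
        // No ipcEndpointConfiguration set: the platform default endpoint
        // (shmem on Linux, TCP on Windows) with the default port is used.

        IgniteConfiguration igniteCfg = new IgniteConfiguration();
        igniteCfg.setFileSystemConfiguration(fileSystemCfg);

        // Start the node; it serves IGFS requests until stopped.
        Ignition.start(igniteCfg);
    }
}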
Configure Hadoop
To run a Hadoop job over IGFS, the following prerequisites must be satisfied:
- IGNITE_HOME environment variable must be set and point to the root of the Ignite installation directory.
- Hadoop must have the Ignite libraries ${IGNITE_HOME}/libs/ignite-core-[version].jar, ${IGNITE_HOME}/libs/ignite-hadoop/ignite-hadoop-[version].jar, and ${IGNITE_HOME}/libs/ignite-shmem-1.0.0.jar in CLASSPATH.
This can be achieved in several ways.
- Add these JARs to the HADOOP_CLASSPATH environment variable.
- Copy or symlink these JARs to the folder where your Hadoop installation stores shared libraries.
See the respective Ignite installation guide for your Hadoop distribution for details.
- Ignite Hadoop Accelerator file system must be configured for the action you are going to perform.
At the very least, you must provide the fully qualified file system class name:
<configuration>
  ...
  <property>
    <name>fs.igfs.impl</name>
    <value>org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem</value>
  </property>
  ...
</configuration>
If you want to set Ignite File System as the default file system for your environment, then add the following property:
<configuration>
  ...
  <property>
    <name>fs.default.name</name>
    <value>igfs://myIgfs@/</value>
  </property>
  ...
</configuration>
Here the value is the URI of the IGFS endpoint of the Ignite node. The rules this URI must follow are provided at the end of this article (see File system URI).
There are several ways to pass this configuration on to your Hadoop jobs.
First, you may create a separate core-site.xml file with these configuration properties and use it for job runs:
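A minimal sketch of such a file, combining the two properties shown above:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.igfs.impl</name>
    <value>org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>igfs://myIgfs@/</value>
  </property>
</configuration>

Put this file into a separate configuration directory and pass that directory to Hadoop as shown in the Run Hadoop section below.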
Second, you may set these properties for a particular job programmatically:
Configuration conf = new Configuration();
...
conf.set("fs.igfs.impl", "org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem");
conf.set("fs.default.name", "igfs://myIgfs@/");
...
Job job = new Job(conf, "word count");
Third, you can change the fs.igfs.impl property in the default core-site.xml of your Hadoop deployment.
Caution
Do not change the fs.default.name property in the default core-site.xml file, because HDFS will not be able to start. In this case, access IGFS using fully qualified paths, e.g. igfs://myIgfs@/path/to/file instead of /path/to/file.
Run Hadoop
How you run a job depends on how you have configured your Hadoop.
If you created a separate core-site.xml:
hadoop --config [path_to_config] [arguments]
Note that the separate Hadoop config directory and the Ignite libraries can easily be referenced using a custom Hadoop launcher script, e.g. hadoop-ignited:
export IGNITE_HOME=[path_to_ignite]
# Add necessary Ignite libraries to the Hadoop client classpath:
VERSION=[ignite_version_identifier]
export HADOOP_CLASSPATH=${IGNITE_HOME}/libs/ignite-core-${VERSION}.jar:${IGNITE_HOME}/libs/ignite-hadoop/ignite-hadoop-${VERSION}.jar:${IGNITE_HOME}/libs/ignite-shmem-1.0.0.jar
hadoop --config [path_to_config_directory] "${@}"
This way, to run a Hadoop job using IGFS you run hadoop-ignited ..., while to run the same job on default Hadoop you run just hadoop ... with the same arguments.
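For example, assuming the standard Hadoop examples JAR and hypothetical input/output paths, the same word count job could be launched either way:

$ hadoop-ignited jar hadoop-mapreduce-examples.jar wordcount /input /output
$ hadoop jar hadoop-mapreduce-examples.jar wordcount /input /output

With IGFS set as the default file system in the separate core-site.xml, the first command reads and writes IGFS paths, while the second runs against the stock Hadoop configuration.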
If you start the job programmatically, then submit it:
...
Job job = new Job(conf, "word count");
...
job.submit();
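For completeness, below is a minimal sketch of such a driver. The class name and the input/output paths are hypothetical placeholders, and mapper/reducer setup is elided since it does not differ from a regular MapReduce job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IgfsWordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Route file system calls to IGFS instead of HDFS.
        conf.set("fs.igfs.impl", "org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem");
        conf.set("fs.default.name", "igfs://myIgfs@/");

        Job job = new Job(conf, "word count");
        // setJarByClass, setMapperClass, setReducerClass, etc. would go here.

        // Hypothetical paths; they resolve to IGFS because it is the default FS.
        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));

        job.submit();
    }
}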
File system URI
URI to access IGFS has the following structure: igfs://[name]@[host]:[port]/, where:
- name - name of IGFS to connect to (as specified in FileSystemConfiguration.setName(...));
- host - optional IGFS endpoint host (IgfsIpcEndpointConfiguration.host). Defaults to 127.0.0.1;
- port - optional IGFS endpoint port (IgfsIpcEndpointConfiguration.port). Defaults to 10500.
Sample URIs:
- igfs://myIgfs@/ - connect to IGFS named myIgfs running on localhost and the default port;
- igfs://myIgfs@myHost/ - connect to IGFS named myIgfs running on a specific host and the default port;
- igfs://myIgfs@myHost:12345/ - connect to IGFS named myIgfs running on a specific host and port;
- igfs://myIgfs@:12345/ - connect to IGFS named myIgfs running on localhost and a specific port.
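As a sanity check, any of these URIs can be used directly through the Hadoop FileSystem API. Below is a minimal sketch, assuming the fs.igfs.impl property from the Configure Hadoop section is set; the class name is a hypothetical placeholder:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IgfsUriCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.igfs.impl", "org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem");

        // Connect to the IGFS named "myIgfs" on localhost and the default port.
        FileSystem fs = FileSystem.get(URI.create("igfs://myIgfs@/"), conf);

        // List the file system root to verify the connection.
        for (FileStatus status : fs.listStatus(new Path("/")))
            System.out.println(status.getPath());
    }
}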
High Availability IGFS Client
The High Availability (HA) IGFS client is based on the Ignite client node. An Ignite client node is started on the host where the Hadoop job runs. The regular TCP client connection to an Ignite cluster described above does not keep opened IGFS I/O streams available after a reconnect to the cluster; the Ignite client node does support this case.
To configure a connection to the cluster through an Ignite client node, add the Hadoop job property fs.igfs.<igfs_authority>.config_path with the path to the Ignite node configuration:
Configuration conf = new Configuration();
...
conf.set("fs.default.name", "igfs://myIgfs@/");
conf.set("fs.igfs.myIgfs.config_path", "ignite/config/client.xml");
...
Job job = new Job(conf, "word count");
The server node configuration XML can be used here; client mode is forced for the node that is started for the IGFS connection.