IGFS as Hadoop FileSystem
In-Memory Distributed Hadoop-Compliant File System.
IGFS and the Ignite Hadoop Accelerator have been discontinued. Instead, use the architecture described on the page below to accelerate Hadoop deployments with Ignite in-memory clusters: https://ignite.apache.org/use-cases/hadoop-acceleration.html
Ignite Hadoop Accelerator ships with a Hadoop-compliant IGFS file system implementation called IgniteHadoopFileSystem. Hadoop can run over this file system in a plug-and-play fashion, significantly reducing I/O and improving both latency and throughput.
Configure Ignite
Apache Ignite Hadoop Accelerator performs file system operations within an Ignite cluster. Several prerequisites must be satisfied:
- IGNITE_HOME environment variable must be set and point to the root of the Ignite installation directory.
- Each cluster node must have Hadoop jars in CLASSPATH. See the respective Ignite installation guide for your Hadoop distribution for details.
- IGFS must be configured on the cluster node. See https://apacheignite-fs.readme.io/docs/igfs for details on how to do that.
- To let IGFS accept requests from Hadoop, an endpoint should be configured (the default configuration file is ${IGNITE_HOME}/config/default-config.xml).
Ignite offers two endpoint types:
- shmem - working over shared memory (not available on Windows);
- tcp - working over standard socket API.
The shared memory endpoint is the recommended approach when the code executing file system operations runs on the same machine as the Ignite node. Note that the port parameter is also used in shared memory mode to perform the initial client-server handshake:
<bean class="org.apache.ignite.configuration.FileSystemConfiguration">
  <property name="name" value="myIgfs"/>
  ...
  <property name="ipcEndpointConfiguration">
    <bean class="org.apache.ignite.igfs.IgfsIpcEndpointConfiguration">
      <property name="type" value="SHMEM"/>
      <property name="port" value="12345"/>
    </bean>
  </property>
  ...
</bean>
FileSystemConfiguration fileSystemCfg = new FileSystemConfiguration();

fileSystemCfg.setName("myIgfs");

IgfsIpcEndpointConfiguration endpointCfg = new IgfsIpcEndpointConfiguration();

// Shared memory endpoint; the port is still used for the initial handshake.
endpointCfg.setType(IgfsIpcEndpointType.SHMEM);
endpointCfg.setPort(12345);

fileSystemCfg.setIpcEndpointConfiguration(endpointCfg);
The TCP endpoint should be used when the Ignite node is located on another machine, or when shared memory is not available.
<bean class="org.apache.ignite.configuration.FileSystemConfiguration">
  <property name="name" value="myIgfs"/>
  ...
  <property name="ipcEndpointConfiguration">
    <bean class="org.apache.ignite.igfs.IgfsIpcEndpointConfiguration">
      <property name="type" value="TCP"/>
      <property name="host" value="myHost"/>
      <property name="port" value="12345"/>
    </bean>
  </property>
  ...
</bean>
FileSystemConfiguration fileSystemCfg = new FileSystemConfiguration();

fileSystemCfg.setName("myIgfs");

IgfsIpcEndpointConfiguration endpointCfg = new IgfsIpcEndpointConfiguration();

// TCP endpoint for remote nodes or when shared memory is unavailable.
endpointCfg.setType(IgfsIpcEndpointType.TCP);
endpointCfg.setHost("myHost");
endpointCfg.setPort(12345);

fileSystemCfg.setIpcEndpointConfiguration(endpointCfg);
If host is not set, it defaults to 127.0.0.1.
If port is not set, it defaults to 10500.
If ipcEndpointConfiguration is not set, then a shared memory endpoint with the default port will be used on Linux systems, and a TCP endpoint with the default port will be used on Windows.
Run Ignite
When the Ignite node is configured, start it using the following command:
$ bin/ignite.sh
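Alternatively, the node can be configured and started programmatically. Below is a minimal sketch, assuming the myIgfs file system configured earlier; the class name is a hypothetical placeholder:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.FileSystemConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class StartIgfsNode {
    public static void main(String[] args) {
        FileSystemConfiguration fileSystemCfg = new FileSystemConfiguration();
        fileSystemCfg.setName("myIgfs");
        // No ipcEndpointConfiguration set: the platform default endpoint
        // (shmem on Linux, TCP on Windows) with the default port is used.

        IgniteConfiguration igniteCfg = new IgniteConfiguration();
        igniteCfg.setFileSystemConfiguration(fileSystemCfg);

        // Start the node; it serves IGFS requests until stopped.
        Ignition.start(igniteCfg);
    }
}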
Configure Hadoop
To run a Hadoop job over IGFS, the following prerequisites must be satisfied:
- IGNITE_HOME environment variable must be set and point to the root of the Ignite installation directory.
- Hadoop must have the Ignite libraries ${IGNITE_HOME}/libs/ignite-core-[version].jar, ${IGNITE_HOME}/libs/ignite-hadoop/ignite-hadoop-[version].jar, and ${IGNITE_HOME}/libs/ignite-shmem-1.0.0.jar in CLASSPATH.
This can be achieved in several ways.
- Add these JARs to the HADOOP_CLASSPATH environment variable.
- Copy or symlink these JARs to the folder where your Hadoop installation stores shared libraries.
See the respective Ignite installation guide for your Hadoop distribution for details.
- Ignite Hadoop Accelerator file system must be configured for the action you are going to perform.
At the very least, you must provide the fully qualified file system class name:
<configuration>
  ...
  <property>
    <name>fs.igfs.impl</name>
    <value>org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem</value>
  </property>
  ...
</configuration>
If you want to set Ignite File System as the default file system for your environment, then add the following property:
<configuration>
  ...
  <property>
    <name>fs.default.name</name>
    <value>igfs://myIgfs@/</value>
  </property>
  ...
</configuration>
Here the value is the URI of the IGFS endpoint of the Ignite node. The rules this URI must follow are provided at the end of this article (see File system URI).
There are several ways to pass this configuration on to your Hadoop jobs.
First, you may create a separate core-site.xml file with these configuration properties and use it for job runs:
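A minimal sketch of such a file, combining the two properties shown above:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.igfs.impl</name>
    <value>org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>igfs://myIgfs@/</value>
  </property>
</configuration>

Put this file into a separate configuration directory and pass that directory to Hadoop as shown in the Run Hadoop section below.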
Second, you may set these properties for a particular job programmatically:
Configuration conf = new Configuration();
...
conf.set("fs.igfs.impl", "org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem");
conf.set("fs.default.name", "igfs://myIgfs@/");
...
Job job = new Job(conf, "word count");
Third, you can change the fs.igfs.impl property in the default core-site.xml of your Hadoop deployment.
Caution
Do not change the fs.default.name property in the default core-site.xml file, because HDFS will not be able to start. In this case, access IGFS using fully qualified paths, e.g. igfs://myIgfs@/path/to/file instead of /path/to/file.
Run Hadoop
How you run a job depends on how you have configured your Hadoop.
If you created a separate core-site.xml:
hadoop --config [path_to_config] [arguments]
Note that the separate Hadoop config directory and the Ignite libraries can easily be referenced using a custom Hadoop launcher script, e.g. hadoop-ignited:
export IGNITE_HOME=[path_to_ignite]
# Add necessary Ignite libraries to the Hadoop client classpath:
VERSION=[ignite_version_identifier]
export HADOOP_CLASSPATH=${IGNITE_HOME}/libs/ignite-core-${VERSION}.jar:${IGNITE_HOME}/libs/ignite-hadoop/ignite-hadoop-${VERSION}.jar:${IGNITE_HOME}/libs/ignite-shmem-1.0.0.jar
hadoop --config [path_to_config_directory] "${@}"
This way, to run a Hadoop job using IGFS you run hadoop-ignited ..., while to run the same job on default Hadoop you run just hadoop ... with the same arguments.
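For example, assuming the standard Hadoop examples JAR and hypothetical input/output paths, the same word count job could be launched either way:

$ hadoop-ignited jar hadoop-mapreduce-examples.jar wordcount /input /output
$ hadoop jar hadoop-mapreduce-examples.jar wordcount /input /output

With IGFS set as the default file system in the separate core-site.xml, the first command reads and writes IGFS paths, while the second runs against the stock Hadoop configuration.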
If you start the job programmatically, then submit it:
...
Job job = new Job(conf, "word count");
...
job.submit();
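For completeness, below is a minimal sketch of such a driver. The class name and the input/output paths are hypothetical placeholders, and mapper/reducer setup is elided since it does not differ from a regular MapReduce job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IgfsWordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Route file system calls to IGFS instead of HDFS.
        conf.set("fs.igfs.impl", "org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem");
        conf.set("fs.default.name", "igfs://myIgfs@/");

        Job job = new Job(conf, "word count");
        // setJarByClass, setMapperClass, setReducerClass, etc. would go here.

        // Hypothetical paths; they resolve to IGFS because it is the default FS.
        FileInputFormat.addInputPath(job, new Path("/input"));
        FileOutputFormat.setOutputPath(job, new Path("/output"));

        job.submit();
    }
}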
File system URI
URI to access IGFS has the following structure: igfs://[name]@[host]:[port]/, where:
- name - name of IGFS to connect to (as specified in FileSystemConfiguration.setName(...));
- host - optional IGFS endpoint host (IgfsIpcEndpointConfiguration.host). Defaults to 127.0.0.1;
- port - optional IGFS endpoint port (IgfsIpcEndpointConfiguration.port). Defaults to 10500.
Sample URIs:
- igfs://myIgfs@/ - connect to IGFS named myIgfs running on localhost and the default port;
- igfs://myIgfs@myHost/ - connect to IGFS named myIgfs running on a specific host and the default port;
- igfs://myIgfs@myHost:12345/ - connect to IGFS named myIgfs running on a specific host and port;
- igfs://myIgfs@:12345/ - connect to IGFS named myIgfs running on localhost and a specific port.
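As a sanity check, any of these URIs can be used directly through the Hadoop FileSystem API. Below is a minimal sketch, assuming the fs.igfs.impl property from the Configure Hadoop section is set; the class name is a hypothetical placeholder:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IgfsUriCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.igfs.impl", "org.apache.ignite.hadoop.fs.v1.IgniteHadoopFileSystem");

        // Connect to the IGFS named "myIgfs" on localhost and the default port.
        FileSystem fs = FileSystem.get(URI.create("igfs://myIgfs@/"), conf);

        // List the file system root to verify the connection.
        for (FileStatus status : fs.listStatus(new Path("/")))
            System.out.println(status.getPath());
    }
}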
High Availability IGFS Client
The High Availability (HA) IGFS client is based on the Ignite client node. An Ignite client node is started on the host where the Hadoop job runs. The regular TCP client connection to an Ignite cluster described above does not keep opened IGFS I/O streams available after a reconnect to the cluster; the Ignite client node does support this case.
To configure a connection to the cluster through an Ignite client node, add the Hadoop job property fs.igfs.<igfs_authority>.config_path with the path to the Ignite node configuration:
Configuration conf = new Configuration();
...
conf.set("fs.default.name", "igfs://myIgfs@/");
conf.set("fs.igfs.myIgfs.config_path", "ignite/config/client.xml");
...
Job job = new Job(conf, "word count");
The server node configuration XML can be used here; client mode is forced for the node that is started for the IGFS connection.