MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via jeagles)

Jonathan Turner Eagles · Jonathan Turner Eagles · commit 300fb37fcde9 · 2014-04-29T21:23:50.000Z
git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/trunk@1591107 13f79535-47bb-0310-9956-ffa450edef68
diff --git a/hadoop-mapreduce-project/CHANGES.txt b/hadoop-mapreduce-project/CHANGES.txt
@@ -175,6 +175,9 @@ Release 2.5.0 - UNRELEASED
     MAPREDUCE-5812. Make job context available to
     OutputCommitter.isRecoverySupported() (Mohammad Kamrul Islam via jlowe)
 
+    MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via
+    jeagles)
+
   OPTIMIZATIONS
 
   BUG FIXES 
diff --git a/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm b/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/HadoopArchives.md.vm
@@ -0,0 +1,138 @@
+<!---
+  Licensed under the Apache License, Version 2.0 (the "License");
+  you may not use this file except in compliance with the License.
+  You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License. See accompanying LICENSE file.
+-->
+
+#set ( $H3 = '###' )
+
+Hadoop Archives Guide
+=====================
+
+ - [Overview](#Overview)
+ - [How to Create an Archive](#How_to_Create_an_Archive)
+ - [How to Look Up Files in Archives](#How_to_Look_Up_Files_in_Archives)
+ - [Archives Examples](#Archives_Examples)
+     - [Creating an Archive](#Creating_an_Archive)
+     - [Looking Up Files](#Looking_Up_Files)
+ - [Hadoop Archives and MapReduce](#Hadoop_Archives_and_MapReduce)
+
+Overview
+--------
+
+  Hadoop archives are special format archives. A Hadoop archive maps to a file
+  system directory. A Hadoop archive always has a \*.har extension. A Hadoop
+  archive directory contains metadata (in the form of _index and _masterindex)
+  and data (part-\*) files. The _index file contains the names of the files
+  that are part of the archive and their locations within the part files.
+
+How to Create an Archive
+------------------------
+
+  `Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>`
+
+  -archiveName is the name of the archive you would like to create, for example
+  foo.har. The name should have a \*.har extension. The parent argument
+  specifies the path relative to which the files are archived. For example:
+
+  `-p /foo/bar a/b/c e/f/g`
+
+  Here /foo/bar is the parent path and a/b/c, e/f/g are paths relative to the
+  parent. Note that the archive is created by a MapReduce job, so you need a
+  MapReduce cluster to run this command. For a detailed example, see the later
+  sections.
+
+  If you just want to archive a single directory /foo/bar, you can use:
+
+  `hadoop archive -archiveName zoo.har -p /foo/bar /outputdir`
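+
+  The rules above (a \*.har archive name, sources given relative to the parent)
+  can be sketched as a small shell helper. This is only an illustration:
+  build_archive_cmd is a hypothetical name, its argument order differs from
+  the real command line, and it prints the command instead of running it.
+
+```shell
+# Hypothetical helper, not part of Hadoop.
+# Arguments: archive name, parent path, destination, then zero or more sources.
+build_archive_cmd() {
+  name=$1; parent=$2; dest=$3
+  shift 3
+  # The archive name should carry the .har extension.
+  case "$name" in
+    *.har) ;;
+    *) echo "archive name must end in .har: $name" >&2; return 1 ;;
+  esac
+  srcs=""
+  for src in "$@"; do
+    # Sources are written relative to the parent, so reject absolute paths.
+    case "$src" in
+      /*) echo "source must be relative to the parent: $src" >&2; return 1 ;;
+    esac
+    srcs="$srcs $src"
+  done
+  # Print the command only; no cluster is contacted.
+  echo "hadoop archive -archiveName $name -p $parent$srcs $dest"
+}
+
+build_archive_cmd foo.har /foo/bar /outputdir a/b/c e/f/g
+build_archive_cmd zoo.har /foo/bar /outputdir
+```
+
+  The second call has no sources and prints the single-directory form shown
+  above.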
+
+How to Look Up Files in Archives
+--------------------------------
+
+  The archive exposes itself as a file system layer, so all the fs shell
+  commands work on archives, but with a different URI. Also, note that
+  archives are immutable, so renames, deletes and creates return an error.
+  The URI for Hadoop Archives is:
+
+  `har://scheme-hostname:port/archivepath/fileinarchive`
+
+  If no scheme is provided, it assumes the underlying filesystem. In that case
+  the URI would look like:
+
+  `har:///archivepath/fileinarchive`
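+
+  The two URI forms can be assembled with a few lines of shell. The sketch
+  below is illustrative only: har_uri is a hypothetical helper, and the host
+  name namenode and port 8020 are assumptions, not values from this guide.
+
+```shell
+# Hypothetical helper, not part of Hadoop.
+# Arguments: archive path, file inside the archive, then optionally the
+# scheme, host and port of the underlying filesystem.
+har_uri() {
+  archive=$1; file=$2; scheme=$3; host=$4; port=$5
+  if [ -n "$scheme" ]; then
+    # Explicit underlying filesystem: har://scheme-hostname:port/...
+    echo "har://$scheme-$host:$port$archive/$file"
+  else
+    # No scheme given: the underlying (default) filesystem is assumed.
+    echo "har://$archive/$file"
+  fi
+}
+
+har_uri /user/zoo/foo.har dir1
+har_uri /user/zoo/foo.har dir1 hdfs namenode 8020
+```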
+
+Archives Examples
+-----------------
+
+$H3 Creating an Archive
+
+  `hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`
+
+  The above example creates an archive using /user/hadoop as the relative
+  archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2
+  will be archived in the file system directory /user/zoo/foo.har.
+  Archiving does not delete the input files. If you want to delete the input
+  files after creating the archive (to reduce namespace), you will have to do
+  it on your own.
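+
+  One possible way to handle the manual cleanup is sketched below. It is an
+  assumption-laden illustration, not a recommended procedure: cleanup_inputs
+  is a hypothetical helper, and it only checks that the archive directory
+  exists, which does not prove that the archive is complete. The first
+  argument is the command to run, so passing "echo hdfs" previews the steps.
+
+```shell
+# Hypothetical helper, not part of Hadoop.
+# Arguments: runner command, archive path, then the input directories.
+cleanup_inputs() {
+  run=$1; archive=$2
+  shift 2
+  # Stop unless the archive directory is present (a weak sanity check).
+  $run dfs -test -d "$archive" || return 1
+  for dir in "$@"; do
+    # Remove each archived input directory recursively.
+    $run dfs -rm -r "$dir"
+  done
+}
+
+# Preview only; replace "echo hdfs" with "hdfs" to run against a cluster.
+cleanup_inputs "echo hdfs" /user/zoo/foo.har /user/hadoop/dir1 /user/hadoop/dir2
+```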
+
+$H3 Looking Up Files
+
+  Looking up files in hadoop archives is as easy as doing an ls on the
+  filesystem. After you have archived the directories /user/hadoop/dir1 and
+  /user/hadoop/dir2 as in the example above, to see all the files in the
+  archives you can just run:
+
+  `hdfs dfs -ls -R har:///user/zoo/foo.har/`
+
+  To understand the significance of the -p argument, let's go through the above
+  example again. If you just do an ls (not ls -R) on the hadoop archive using
+
+  `hdfs dfs -ls har:///user/zoo/foo.har`
+
+  The output should be:
+
+```
+har:///user/zoo/foo.har/dir1
+har:///user/zoo/foo.har/dir2
+```
+
+  As you may recall, the archives were created with the following command:
+
+  `hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`
+
+  If we were to change the command to:
+
+  `hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo`
+
+  then an ls on the hadoop archive using
+
+  `hdfs dfs -ls har:///user/zoo/foo.har`
+
+  would give you
+
+```
+har:///user/zoo/foo.har/hadoop/dir1
+har:///user/zoo/foo.har/hadoop/dir2
+```
+
+  Notice that the archived files have been archived relative to /user/ rather
+  than /user/hadoop.
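+
+  The effect of -p can be summarized as: the parent is dropped and only the
+  relative source path is kept inside the archive. The sketch below prints
+  that mapping for both commands; show_archive_layout is a hypothetical
+  helper that only formats strings and does not read any archive.
+
+```shell
+# Hypothetical helper, not part of Hadoop.
+# Arguments: parent path, destination directory, archive name, then sources.
+show_archive_layout() {
+  parent=$1; dest=$2; name=$3
+  shift 3
+  for src in "$@"; do
+    # Left: the directory that is archived. Right: where it appears.
+    echo "$parent/$src -> har://$dest/$name/$src"
+  done
+}
+
+show_archive_layout /user/hadoop /user/zoo foo.har dir1 dir2
+show_archive_layout /user /user/zoo foo.har hadoop/dir1 hadoop/dir2
+```
+
+  Both calls archive the same two directories; only the paths under foo.har
+  differ.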
+
+Hadoop Archives and MapReduce
+-----------------------------
+
+  Using Hadoop Archives in MapReduce is as easy as specifying a different input
+  filesystem than the default file system. If you have a hadoop archive stored
+  in HDFS in /user/zoo/foo.har, then to use this archive as MapReduce input,
+  all you need to do is specify the input directory as har:///user/zoo/foo.har.
+  Since Hadoop Archives is exposed as a file system, MapReduce will be able to
+  use all the logical input files in Hadoop Archives as input.
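+
+  As a rough illustration, a job can be pointed at a directory inside the
+  archive by passing a har URI as its input path. The sketch below only
+  prints such a command. The jar name, the wordcount example job, the output
+  path, and the assumption that dir1 directly contains plain text files are
+  all illustrative and depend on your installation and data.
+
+```shell
+# Hypothetical helper, not part of Hadoop.
+# Arguments: examples jar, path inside the archive, output directory.
+wordcount_on_har() {
+  jar=$1; input=$2; output=$3
+  # Print the command only; the har:// prefix selects the archive file system.
+  echo "hadoop jar $jar wordcount har://$input $output"
+}
+
+# The jar name below is a placeholder for the examples jar of your release.
+wordcount_on_har hadoop-mapreduce-examples.jar /user/zoo/foo.har/dir1 /user/zoo/wc-out
+```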
diff --git a/hadoop-project/src/site/site.xml b/hadoop-project/src/site/site.xml
@@ -92,6 +92,7 @@
       <item name="Encrypted Shuffle" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html"/>
       <item name="Pluggable Shuffle/Sort" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html"/>
       <item name="Distributed Cache Deploy" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html"/>
+      <item name="Hadoop Archives" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html"/>
       <item name="DistCp" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html"/>
     </menu>