
Commit 300fb37

Author: Jonathan Turner Eagles (committed)

MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via jeagles)

git-svn-id: https://svn.apache.org/repos/asf/hadoop/common/trunk@1591107 13f79535-47bb-0310-9956-ffa450edef68

1 parent d9f7fa5 commit 300fb37

File tree

3 files changed: +142 -0 lines changed


hadoop-mapreduce-project/CHANGES.txt

Lines changed: 3 additions & 0 deletions
@@ -175,6 +175,9 @@ Release 2.5.0 - UNRELEASED
     MAPREDUCE-5812. Make job context available to
     OutputCommitter.isRecoverySupported() (Mohammad Kamrul Islam via jlowe)
 
+    MAPREDUCE-5638. Port Hadoop Archives document to trunk (Akira AJISAKA via
+    jeagles)
+
   OPTIMIZATIONS
 
   BUG FIXES
Lines changed: 138 additions & 0 deletions
@@ -0,0 +1,138 @@
<!---
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
#set ( $H3 = '###' )

Hadoop Archives Guide
=====================

- [Overview](#Overview)
- [How to Create an Archive](#How_to_Create_an_Archive)
- [How to Look Up Files in Archives](#How_to_Look_Up_Files_in_Archives)
- [Archives Examples](#Archives_Examples)
    - [Creating an Archive](#Creating_an_Archive)
    - [Looking Up Files](#Looking_Up_Files)
- [Hadoop Archives and MapReduce](#Hadoop_Archives_and_MapReduce)
Overview
--------

Hadoop archives are special format archives. A Hadoop archive maps to a file
system directory. A Hadoop archive always has a \*.har extension. A Hadoop
archive directory contains metadata (in the form of _index and _masterindex)
and data (part-\*) files. The _index file contains the names of the files that
are part of the archive and their locations within the part files.
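
For instance, listing the archive directory itself shows the layout described above. The archive name and path below are hypothetical, for illustration only:

```
hdfs dfs -ls /user/zoo/foo.har

# Typical contents:
#   /user/zoo/foo.har/_index         file names and their offsets within the part files
#   /user/zoo/foo.har/_masterindex   an index over the _index file
#   /user/zoo/foo.har/part-0         the archived file data, concatenated
```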

How to Create an Archive
------------------------

`Usage: hadoop archive -archiveName name -p <parent> <src>* <dest>`

-archiveName is the name of the archive you would like to create, for example
foo.har. The name should have a \*.har extension. The parent argument
specifies the relative path against which the files are archived. For
example:

`-p /foo/bar a/b/c e/f/g`

Here /foo/bar is the parent path, and a/b/c and e/f/g are paths relative to
the parent. Note that the archive is created by a MapReduce job, so you need
a MapReduce cluster to run the command. See the later sections for a detailed
example.

If you just want to archive a single directory /foo/bar then you can just use

`hadoop archive -archiveName zoo.har -p /foo/bar /outputdir`

How to Look Up Files in Archives
--------------------------------

The archive exposes itself as a file system layer, so all the fs shell
commands work on archives, but with a different URI. Also, note that
archives are immutable, so renames, deletes, and creates return an error.
The URI for a Hadoop archive is

`har://scheme-hostname:port/archivepath/fileinarchive`

If no scheme is provided, the underlying filesystem is assumed. In that case
the URI would look like

`har:///archivepath/fileinarchive`
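
As a sketch of the two URI forms, reading a file out of a hypothetical archive might look like the following (the `hdfs` scheme, `namenode` host, and `8020` port are placeholders for your cluster's configuration):

```
# Fully qualified form: underlying filesystem scheme plus namenode host/port.
hdfs dfs -cat har://hdfs-namenode:8020/user/zoo/foo.har/dir1/somefile

# Scheme-less form: falls back to the default (underlying) filesystem.
hdfs dfs -cat har:///user/zoo/foo.har/dir1/somefile
```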

Archives Examples
-----------------

$H3 Creating an Archive

`hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`

The above example creates an archive using /user/hadoop as the relative
archive directory. The directories /user/hadoop/dir1 and /user/hadoop/dir2
will be archived in the following file system directory -- /user/zoo/foo.har.
Archiving does not delete the input files. If you want to delete the input
files after creating the archives (to reduce namespace), you will have to do
it on your own.
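
A possible manual cleanup, sketched with the example paths above, would be to verify the archive first and then remove the originals yourself (the archive tool never does this for you):

```
# Sanity-check that the archive contains what you expect before deleting anything.
hdfs dfs -ls -R har:///user/zoo/foo.har

# Then remove the now-archived input directories by hand.
hdfs dfs -rm -r /user/hadoop/dir1 /user/hadoop/dir2
```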

$H3 Looking Up Files

Looking up files in hadoop archives is as easy as doing an ls on the
filesystem. After you have archived the directories /user/hadoop/dir1 and
/user/hadoop/dir2 as in the example above, to see all the files in the
archives you can just run:

`hdfs dfs -ls -R har:///user/zoo/foo.har/`

To understand the significance of the -p argument, let's go through the above
example again. If you just do an ls (not a recursive lsr) on the hadoop
archive using

`hdfs dfs -ls har:///user/zoo/foo.har`

the output should be:

```
har:///user/zoo/foo.har/dir1
har:///user/zoo/foo.har/dir2
```

As you may recall, the archives were created with the following command:

`hadoop archive -archiveName foo.har -p /user/hadoop dir1 dir2 /user/zoo`

If we were to change the command to:

`hadoop archive -archiveName foo.har -p /user/ hadoop/dir1 hadoop/dir2 /user/zoo`

then an ls on the hadoop archive using

`hdfs dfs -ls har:///user/zoo/foo.har`

would give you

```
har:///user/zoo/foo.har/hadoop/dir1
har:///user/zoo/foo.har/hadoop/dir2
```

Notice that the archived files have been archived relative to /user/ rather
than /user/hadoop.
Hadoop Archives and MapReduce
-----------------------------

Using Hadoop Archives in MapReduce is as easy as specifying a different input
filesystem than the default file system. If you have a hadoop archive stored
in HDFS in /user/zoo/foo.har, then to use this archive as MapReduce input,
all you need to do is specify the input directory as har:///user/zoo/foo.har.
Since a Hadoop archive is exposed as a file system, MapReduce is able to use
all the logical input files in the archive as input.
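
For example, a stock example job could be pointed at the archive directly. The jar path, version wildcard, and output directory below are illustrative assumptions; adjust them to your installation:

```
# Run the bundled wordcount example over files inside the archive.
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount \
  har:///user/zoo/foo.har/dir1 /user/zoo/wordcount-out
```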

hadoop-project/src/site/site.xml

Lines changed: 1 addition & 0 deletions
@@ -92,6 +92,7 @@
       <item name="Encrypted Shuffle" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/EncryptedShuffle.html"/>
       <item name="Pluggable Shuffle/Sort" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/PluggableShuffleAndPluggableSort.html"/>
       <item name="Distributed Cache Deploy" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistributedCacheDeploy.html"/>
+      <item name="Hadoop Archives" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/HadoopArchives.html"/>
       <item name="DistCp" href="hadoop-mapreduce-client/hadoop-mapreduce-client-core/DistCp.html"/>
     </menu>