Vcdiff Algorithm
D. Korn
Request for Comments: 3284 AT&T Labs
Category: Standards Track J. MacDonald
UC Berkeley
J. Mogul
Hewlett-Packard Company
K. Vo
AT&T Labs
June 2002
Copyright Notice
Abstract
Table of Contents
1. Executive Summary
The encoding format Vcdiff proposed here addresses the above issues.
Vcdiff achieves the characteristics below:
Output compactness:
The basic encoding format compactly represents compressed or
delta files. Applications can further extend the basic
encoding format with "secondary encoders" to achieve more
compression.
Data portability:
The basic encoding format is free from machine byte order and
word size issues. This allows data to be encoded on one
machine and decoded on a different machine with different
architecture.
Algorithm genericity:
The decoding algorithm is independent from string matching and
windowing algorithms. This allows competition among
implementations of the encoder while keeping the same decoder.
Decoding efficiency:
Except for secondary encoder issues, the decoding algorithm
runs in time proportionate to the size of the target file and
uses space proportionate to the maximal window size. Vcdiff
differs from more conventional compressors in that it uses only
byte-aligned data, thus avoiding bit-level operations, which
improves decoding speed at the slight cost of compression
efficiency.
2. Conventions
The basic data unit is a byte. For portability, Vcdiff shall limit a
byte to its lower eight bits even on machines with larger bytes. The
bits in a byte are ordered from right to left so that the least
significant bit (LSB) has value 1 and the most significant bit
(MSB) has value 128.
+-------------------------------------------+
| 10111010 | 11101111 | 10011010 | 00010101 |
+-------------------------------------------+
  MSB+58     MSB+111    MSB+26     0+21
Henceforth, the terms "byte" and "integer" will refer to a byte and
an unsigned integer as described.
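The four bytes in the diagram above carry the base-128 digits 58, 111, 26 and 21, most significant digit first, with the MSB of every byte except the last set as a continuation flag; together they represent the value 123456789. A minimal sketch of that encoding in C follows. The names vcd_put_int and vcd_get_int are hypothetical and not part of this specification, and the sketch assumes that values fit in an unsigned long.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: write v as base-128 digits, most significant
 * digit first, setting the MSB (0x80) on every byte except the last.
 * Returns the number of bytes written; buf must have room for
 * (bits in unsigned long + 6) / 7 bytes. */
size_t vcd_put_int(unsigned long v, unsigned char *buf)
{
    unsigned char tmp[(sizeof(unsigned long) * CHAR_BIT + 6) / 7];
    size_t n = 0, i;

    do {                        /* collect digits, least significant first */
        tmp[n++] = (unsigned char)(v & 0x7F);
        v >>= 7;
    } while (v != 0);

    for (i = 0; i < n; ++i) {   /* emit in reverse, flagging continuation */
        unsigned char digit = tmp[n - 1 - i];
        buf[i] = (unsigned char)(i + 1 < n ? (digit | 0x80) : digit);
    }
    return n;
}

/* Hypothetical helper: read one integer from buf (at most len bytes).
 * Returns the number of bytes consumed, or 0 if the input is truncated
 * or the value would overflow unsigned long. */
size_t vcd_get_int(const unsigned char *buf, size_t len, unsigned long *out)
{
    unsigned long v = 0;
    size_t i;

    for (i = 0; i < len; ++i) {
        if (v > (ULONG_MAX >> 7))
            return 0;           /* the next shift would lose bits */
        v = (v << 7) | (unsigned long)(buf[i] & 0x7F);
        if ((buf[i] & 0x80) == 0) {
            *out = v;           /* last byte: continuation flag clear */
            return i + 1;
        }
    }
    return 0;                   /* ran out of input mid-integer */
}
```

Because only the low seven bits of each byte carry data and the byte order is fixed, the result does not depend on the byte order or word size of the machine.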
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in BCP 14, RFC 2119 [12].
3. Delta Instructions
Assume that S[j] represents the jth byte in S, and T[k] represents
the kth byte in T. Then, for the delta instructions, we treat the
data windows S and T as substrings of a superstring U, formed by
concatenating them like this:
S[0]S[1]...S[s-1]T[0]T[1]...T[t-1]
Below are example source and target windows and the delta
instructions that encode the target window in terms of the source
window.
a b c d e f g h i j k l m n o p
a b c d w x y z e f g h e f g h e f g h e f g h z z z z
COPY 4, 0
ADD 4, w x y z
COPY 4, 4
COPY 12, 24
RUN 4, z
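The example can be replayed with a short C sketch. The inst_t record and the function apply_delta are hypothetical and exist only for illustration; the actual format encodes instructions through the code table rather than as structures. Addresses index the superstring U, so "COPY 12, 24" begins at T[8] (24 minus the 16 bytes of S) and reads bytes that the same instruction has just written.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical instruction record used only by this sketch. */
typedef enum { OP_ADD, OP_RUN, OP_COPY } op_t;

typedef struct {
    op_t        op;
    size_t      size;   /* number of target bytes produced          */
    size_t      addr;   /* COPY only: start position in U           */
    const char *data;   /* ADD: size bytes; RUN: the byte to repeat */
} inst_t;

/* Rebuild the target window T from the source window S (s bytes).
 * Returns the number of target bytes written, or (size_t)-1 if an
 * instruction is invalid or would write past cap bytes. */
size_t apply_delta(const char *S, size_t s,
                   const inst_t *in, size_t n_inst,
                   char *T, size_t cap)
{
    size_t t = 0, i, k;

    for (i = 0; i < n_inst; ++i) {
        if (in[i].size > cap - t)
            return (size_t)-1;
        switch (in[i].op) {
        case OP_ADD:
            memcpy(T + t, in[i].data, in[i].size);
            t += in[i].size;
            break;
        case OP_RUN:
            memset(T + t, in[i].data[0], in[i].size);
            t += in[i].size;
            break;
        case OP_COPY:
            if (in[i].addr >= s + t)    /* must start in known data */
                return (size_t)-1;
            /* Copy byte by byte: the range may overlap its own output. */
            for (k = 0; k < in[i].size; ++k) {
                size_t u = in[i].addr + k;
                T[t] = (u < s) ? S[u] : T[u - s];
                ++t;
            }
            break;
        }
    }
    return t;
}
```

Applying the five instructions above to the 16-byte source window yields the 28-byte target window shown in the example.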
Header
    Header1                                  - byte
    Header2                                  - byte
    Header3                                  - byte
    Header4                                  - byte
    Hdr_Indicator                            - byte
    [Secondary compressor ID]                - byte
    [Length of code table data]              - integer
    [Code table data]
        Size of near cache                   - byte
        Size of same cache                   - byte
        Compressed code table data
Window1
    Win_Indicator                            - byte
    [Source segment size]                    - integer
    [Source segment position]                - integer
    The delta encoding of the target window
        Length of the delta encoding         - integer
        The delta encoding
            Size of the target window        - integer
            Delta_Indicator                  - byte
            Length of data for ADDs and RUNs - integer
            Length of instructions and sizes - integer
            Length of addresses for COPYs    - integer
            Data section for ADDs and RUNs   - array of bytes
            Instructions and sizes section   - array of bytes
            Addresses section for COPYs      - array of bytes
Window2
...
The first three Header bytes are the ASCII characters 'V', 'C' and
'D' with their most significant bits turned on (in hexadecimal, the
values are 0xD6, 0xC3, and 0xC4). The fourth Header byte is
currently set to zero. In the future, it might be used to indicate
the version of Vcdiff.
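A decoder would typically check these four bytes before reading anything else. The following is a minimal sketch; vcd_has_magic is a hypothetical name, and the character constants assume an ASCII execution character set.

```c
#include <stddef.h>

/* Return 1 if buf starts with the four Vcdiff header bytes described
 * above ('V', 'C', 'D' with the MSB set, then a zero byte), else 0.
 * Hypothetical helper, used only for this sketch. */
int vcd_has_magic(const unsigned char *buf, size_t len)
{
    static const unsigned char magic[4] = {
        'V' | 0x80,   /* 0xD6 */
        'C' | 0x80,   /* 0xC3 */
        'D' | 0x80,   /* 0xC4 */
        0x00          /* fourth byte, currently zero */
    };
    size_t i;

    if (len < sizeof magic)
        return 0;
    for (i = 0; i < sizeof magic; ++i)
        if (buf[i] != magic[i])
            return 0;
    return 1;
}
```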
       7 6 5 4 3 2 1 0
      +-+-+-+-+-+-+-+-+
      | | | | | | | | |
      +-+-+-+-+-+-+-+-+
                   ^ ^
                   | |
                   | +-- VCD_DECOMPRESS
                   +---- VCD_CODETABLE
If both bits are set, then the compressor ID byte is included before
the code table data length and the code table data.
Win_Indicator - byte
[Source segment length] - integer
[Source segment position] - integer
The delta encoding of the target window
Win_Indicator:
This byte is a set of bits, as shown:
       7 6 5 4 3 2 1 0
      +-+-+-+-+-+-+-+-+
      | | | | | | | | |
      +-+-+-+-+-+-+-+-+
                   ^ ^
                   | |
                   | +-- VCD_SOURCE
                   +---- VCD_TARGET
The Win_Indicator byte MUST NOT have more than one of the bits
set (non-zero). It MAY have none of these bits set.
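The rule can be enforced with a small check such as the sketch below. The bit values follow the diagram (VCD_SOURCE is bit 0, VCD_TARGET is bit 1); the win_kind_t type and the function vcd_win_kind are hypothetical, and the sketch assumes that the remaining six bits are simply ignored.

```c
#define VCD_SOURCE 0x01   /* bit 0 of Win_Indicator */
#define VCD_TARGET 0x02   /* bit 1 of Win_Indicator */

/* Hypothetical classification of a Win_Indicator byte. */
typedef enum {
    WIN_NO_SOURCE,     /* neither bit set, which is allowed          */
    WIN_FROM_SOURCE,   /* only VCD_SOURCE set                        */
    WIN_FROM_TARGET,   /* only VCD_TARGET set                        */
    WIN_INVALID        /* both bits set, which the format forbids    */
} win_kind_t;

/* Classify the two defined bits; all other bits are ignored here. */
win_kind_t vcd_win_kind(unsigned char win_indicator)
{
    switch (win_indicator & (VCD_SOURCE | VCD_TARGET)) {
    case 0:          return WIN_NO_SOURCE;
    case VCD_SOURCE: return WIN_FROM_SOURCE;
    case VCD_TARGET: return WIN_FROM_TARGET;
    default:         return WIN_INVALID;
    }
}
```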
Delta_Indicator:
This byte is a set of bits, as shown:
       7 6 5 4 3 2 1 0
      +-+-+-+-+-+-+-+-+
      | | | | | | | | |
      +-+-+-+-+-+-+-+-+
                 ^ ^ ^
                 | | |
                 | | +-- VCD_DATACOMP
                 | +---- VCD_INSTCOMP
                 +------ VCD_ADDRCOMP
To make clear the above description, below are examples of cache data
structures and algorithms to initialize and update them:
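void cache_init(Cache_t* ka)
{   int i;
    ka->next_slot = 0;
    for(i = 0; i < ka->s_near; ++i)
        ka->near[i] = 0;
    for(i = 0; i < ka->s_same*256; ++i)    /* clear the same cache too */
        ka->same[i] = 0;
}

void cache_update(Cache_t* ka, int addr)   /* record one used address */
{
    if(ka->s_near > 0)
    {   ka->near[ka->next_slot] = addr;
        ka->next_slot = (ka->next_slot + 1) % ka->s_near;
    }
    if(ka->s_same > 0)
        ka->same[addr % (ka->s_same*256)] = addr;
}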
VCD_HERE: This mode has value 1. The address was encoded as the
integer value "here - addr".
Near modes: The "near modes" are in the range [2,s_near+1]. Let m
be the mode of the address encoding. The address was encoded
as the integer value "addr - near[m-2]".
5.4 Example code for encoding and decoding of COPY instruction addresses
Note that the address caches are updated immediately after an address
is encoded or decoded. In this way, the decoder is always
synchronized with the encoder.
cache_update(ka,addr);
Note that the addr_encode() algorithm chooses the best address mode
using a local optimization, but that may not lead to the best
encoding efficiency because different modes lead to different
instruction encodings, as described below.
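int addr_decode(Cache_t* ka, int here, int mode)
{   int addr, m;

    /* addrint() and addrbyte() read the next integer or byte
     * from the addresses section for COPYs. */
    if(mode == VCD_SELF)
        addr = addrint();
    else if(mode == VCD_HERE)
        addr = here - addrint();
    else if((m = mode - 2) >= 0 && m < ka->s_near) /* near cache */
        addr = ka->near[m] + addrint();
    else /* same cache */
    {   m = mode - (2 + ka->s_near);
        addr = ka->same[m*256 + addrbyte()];
    }

    cache_update(ka, addr);
    return addr;
}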
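The following self-contained sketch puts the cache routines, an encoder and a decoder together so that the synchronization described above can be exercised. Several details are assumptions made for illustration: the cache sizes are fixed at 4 (near) and 3 (same) so that modes run from 0 to 8, the Cache_t layout uses fixed arrays, the encoder follows the locally optimal strategy described for addr_encode(), and the decoder receives the encoded value as a parameter instead of calling addrint() or addrbyte(). A decoder for untrusted input would also need to validate the mode and value, which is omitted here.

```c
#include <string.h>

#define VCD_SELF 0
#define VCD_HERE 1
#define S_NEAR   4              /* assumed near cache size */
#define S_SAME   3              /* assumed same cache size */

typedef struct {
    int near[S_NEAR];           /* most recently used addresses      */
    int next_slot;              /* circular index into near[]        */
    int same[S_SAME * 256];     /* addresses hashed by addr % size   */
} Cache_t;

void cache_init(Cache_t *ka)
{
    memset(ka, 0, sizeof *ka);
}

void cache_update(Cache_t *ka, int addr)
{
    ka->near[ka->next_slot] = addr;
    ka->next_slot = (ka->next_slot + 1) % S_NEAR;
    ka->same[addr % (S_SAME * 256)] = addr;
}

/* Pick the mode that gives the smallest encoded value, preferring a
 * same-cache hit because its value fits in one byte. Stores the mode
 * in *mode and returns the value to encode. */
int addr_encode(Cache_t *ka, int addr, int here, int *mode)
{
    int i, d, bestd = addr, bestm = VCD_SELF;

    if ((d = here - addr) < bestd) {
        bestd = d; bestm = VCD_HERE;
    }
    for (i = 0; i < S_NEAR; ++i)
        if ((d = addr - ka->near[i]) >= 0 && d < bestd) {
            bestd = d; bestm = i + 2;
        }
    d = addr % (S_SAME * 256);
    if (ka->same[d] == addr) {
        bestd = d % 256; bestm = S_NEAR + 2 + d / 256;
    }
    cache_update(ka, addr);     /* keep in step with the decoder */
    *mode = bestm;
    return bestd;
}

/* Inverse of addr_encode(): recover addr from (mode, value). */
int addr_decode(Cache_t *ka, int here, int mode, int value)
{
    int addr, m;

    if (mode == VCD_SELF)
        addr = value;
    else if (mode == VCD_HERE)
        addr = here - value;
    else if ((m = mode - 2) >= 0 && m < S_NEAR)     /* near cache */
        addr = ka->near[m] + value;
    else {                                          /* same cache */
        m = mode - (2 + S_NEAR);
        addr = ka->same[m * 256 + value];
    }
    cache_update(ka, addr);     /* keep in step with the encoder */
    return addr;
}
```

As long as both sides start from cache_init() and see the same sequence of addresses, every (mode, value) pair produced by the encoder decodes back to the original address.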
The Vcdiff data format is designed so that a decoder does not need to
be aware of the choices made in encoding algorithms. This is
achieved with the notion of an "instruction code table", containing
256 entries. Each entry defines either a single delta instruction
or a pair of instructions that have been combined. Note that the
code table itself only exists in main memory, not in the delta file
(unless using an application-defined code table, described in Section
7). The encoded data simply includes the index of each instruction
and, since there are only 256 indices, each index can be represented
as a single byte.
+-----------------------------------------------+
| inst1 | size1 | mode1 | inst2 | size2 | mode2 |
+-----------------------------------------------+
inst: An "inst" field can have one of the four values: NOOP (0),
ADD (1), RUN (2) or COPY (3) to indicate the instruction
types. NOOP means that no instruction is specified. In
this case, both the corresponding size and mode fields will
be zero.
If a line in the depiction includes more than one entry using the
[i,j] notation, implying a "nested loop" to convert the line to a
range of table entries, the first such [i,j] range specifies the
outer loop, and the second specifies the inner loop.
Line 1 shows the single RUN instruction with index 0. As the size
field is 0, this RUN instruction always has its actual size encoded
separately.
The last line, line 21, shows the nine instruction pairs, where the
first instruction is a COPY and the second is an ADD. In this case,
all COPY instructions have size 4 with mode ranging from 0 to 8 and
all the ADD instructions have size 1. Thus, the entry with the
largest index 255 combines a COPY instruction of size 4 and mode 8
with an ADD instruction of size 1.
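The nested-loop reading of the table can be made concrete by generating the entries in index order. The sketch below is an illustration only: the code_entry_t type and the function names are hypothetical, and the rows between line 1 and line 21 follow our reading of the default table depiction, which should be consulted as the authority. It assumes the cache sizes that give modes 0 to 8.

```c
enum { VCD_NOOP = 0, VCD_ADD = 1, VCD_RUN = 2, VCD_COPY = 3 };

/* Hypothetical in-memory form of one code table entry. */
typedef struct {
    unsigned char inst1, size1, mode1;
    unsigned char inst2, size2, mode2;
} code_entry_t;

void put_entry(code_entry_t *e, int i1, int s1, int m1,
               int i2, int s2, int m2)
{
    e->inst1 = (unsigned char)i1; e->size1 = (unsigned char)s1;
    e->mode1 = (unsigned char)m1; e->inst2 = (unsigned char)i2;
    e->size2 = (unsigned char)s2; e->mode2 = (unsigned char)m2;
}

/* Fill tbl[0..255] in index order; returns the number of entries. */
int build_default_table(code_entry_t tbl[256])
{
    int i = 0, mode, sz, asz, csz;

    /* Line 1: a single RUN whose size is encoded separately. */
    put_entry(&tbl[i++], VCD_RUN, 0, 0, VCD_NOOP, 0, 0);

    /* Single ADDs: size 0 (encoded separately), then sizes 1 to 17. */
    for (sz = 0; sz <= 17; ++sz)
        put_entry(&tbl[i++], VCD_ADD, sz, 0, VCD_NOOP, 0, 0);

    /* Single COPYs per mode: size 0, then sizes 4 to 18. */
    for (mode = 0; mode <= 8; ++mode) {
        put_entry(&tbl[i++], VCD_COPY, 0, mode, VCD_NOOP, 0, 0);
        for (sz = 4; sz <= 18; ++sz)
            put_entry(&tbl[i++], VCD_COPY, sz, mode, VCD_NOOP, 0, 0);
    }

    /* ADD then COPY pairs: the ADD size is the outer loop and the
     * COPY size the inner loop. Modes 6 to 8 only pair with size 4. */
    for (mode = 0; mode <= 8; ++mode) {
        int max_copy = (mode < 6) ? 6 : 4;
        for (asz = 1; asz <= 4; ++asz)
            for (csz = 4; csz <= max_copy; ++csz)
                put_entry(&tbl[i++], VCD_ADD, asz, 0, VCD_COPY, csz, mode);
    }

    /* Line 21: COPY of size 4 in each mode, then ADD of size 1. */
    for (mode = 0; mode <= 8; ++mode)
        put_entry(&tbl[i++], VCD_COPY, 4, mode, VCD_ADD, 1, 0);

    return i;
}
```

The loops produce exactly 256 entries, with the RUN instruction at index 0 and the COPY of size 4 and mode 8 paired with an ADD of size 1 at index 255.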
The choice of the minimum size 4 for COPY instructions in the default
code table was made from experiments that showed that excluding small
matches (fewer than 4 bytes long) improved the compression rates.
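As discussed in Section 4.3, the delta instructions and associated
data are encoded in three arrays of bytes:
      Data section for ADDs and RUNs,
      Instructions and sizes section, and
      Addresses section for COPYs.
These data sections may also have been compressed by a secondary
compressor. Assume that any such compressed data has been
decompressed, so that we now have three arrays:
      inst: bytes coding the instructions and sizes.
      data: unmatched data associated with ADDs and RUNs.
      addr: bytes coding the addresses of COPYs.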
For example, if during the processing of the target window, the next
unconsumed tuple in the inst array has an index value of 19, then the
first instruction is a COPY, whose size is found as the immediately
following integer in the inst array. Since the mode of this COPY
instruction is VCD_SELF, the corresponding address is found by
consuming the next integer in the addr array. The data array is left
intact. As the second instruction for code index 19 is a NOOP, this
tuple is finished.
Although the default code table used in Vcdiff is good for general
purpose encoders, there are times when other code tables may perform
better. For example, to code a file with many identical segments of
data, it may be advantageous to have a COPY instruction with the
specific size of these data segments, so that the instruction can be
encoded in a single byte. Such a special code table MUST then be
encoded in the delta file so that the decoder can reconstruct it
before decoding the data.
The "compressed code table data" encodes the delta between the
default code table (source) and the new code table (target) in the
same manner as described in Section 4.3 for encoding a target window
in terms of a source window. This delta is computed using the
following steps:
i. Add in order the 256 bytes representing the types of the first
instructions in the instruction pairs.
ii. Add in order the 256 bytes representing the types of the
second instructions in the instruction pairs.
iii. Add in order the 256 bytes representing the sizes of the first
instructions in the instruction pairs.
iv. Add in order the 256 bytes representing the sizes of the
second instructions in the instruction pairs.
v. Add in order the 256 bytes representing the modes of the first
instructions in the instruction pairs.
vi. Add in order the 256 bytes representing the modes of the
second instructions in the instruction pairs.
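Steps i through vi lay the table out as six consecutive runs of 256 bytes, 1536 bytes in all. The sketch below shows that layout and its inverse; the code_entry_t type and both function names are hypothetical and serve only to illustrate the byte order.

```c
#include <stddef.h>

/* Hypothetical in-memory form of one code table entry. */
typedef struct {
    unsigned char inst1, size1, mode1;
    unsigned char inst2, size2, mode2;
} code_entry_t;

#define CODE_STRING_LEN (6 * 256)   /* six fields, 256 entries each */

/* Flatten the table into a byte string following steps i to vi. */
void code_table_to_string(const code_entry_t tbl[256],
                          unsigned char out[CODE_STRING_LEN])
{
    size_t k;

    for (k = 0; k < 256; ++k) {
        out[0 * 256 + k] = tbl[k].inst1;   /* i.   first types   */
        out[1 * 256 + k] = tbl[k].inst2;   /* ii.  second types  */
        out[2 * 256 + k] = tbl[k].size1;   /* iii. first sizes   */
        out[3 * 256 + k] = tbl[k].size2;   /* iv.  second sizes  */
        out[4 * 256 + k] = tbl[k].mode1;   /* v.   first modes   */
        out[5 * 256 + k] = tbl[k].mode2;   /* vi.  second modes  */
    }
}

/* Inverse operation: rebuild the table from the byte string. */
void string_to_code_table(const unsigned char in[CODE_STRING_LEN],
                          code_entry_t tbl[256])
{
    size_t k;

    for (k = 0; k < 256; ++k) {
        tbl[k].inst1 = in[0 * 256 + k];
        tbl[k].inst2 = in[1 * 256 + k];
        tbl[k].size1 = in[2 * 256 + k];
        tbl[k].size2 = in[3 * 256 + k];
        tbl[k].mode1 = in[4 * 256 + k];
        tbl[k].mode2 = in[5 * 256 + k];
    }
}
```

Grouping each field across all entries, rather than storing the entries one after another, places similar values next to each other, which tends to produce long matches against the same layout of the default table.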
The decoder can then reverse the above steps to decode the compressed
table data using the method of Section 6, employing the default code
table, to generate the new code table. Note that the decoder does
not need to know about the details of the encoding algorithm used in
step (c). It is able to decode the new code table because the Vcdiff
format is independent from the choice of encoding algorithm, and
because the encoder in step (c) uses the known, default code table.
8. Performance
The encoding format is compact. For compression only, using the
LZ-77 string parsing strategy and without any secondary compressors,
the typical compression rate is better than Unix compress and close
to gzip. For differencing, the data format is better than all known
methods in terms of its stated goals, which are primarily decoding
speed and encoding efficiency.
The above table shows the raw sizes of the tar files and the sizes of
the compressed results. The differencing results in the gcc-2.95.2
column were obtained by compressing gcc-2.95.2, given gcc-2.95.1.
Likewise, the results in the gcc-2.95.3 column were obtained by
compressing gcc-2.95.3, given gcc-2.95.2.
Rows 2, 3 and 4 show that, for compression only, the compression rate
from Vcdiff is worse than gzip and better than compress.
The last three rows in the column gcc-2.95.2 show that when two file
versions are very similar, differencing can give dramatically good
compression rates. Vcdiff-d and Vcdiff-dc use the same simple window
selection method of aligning by file offsets, but Vcdiff-dc also does
compression so its output is slightly smaller. Vcdiff-dcw uses a
content-based algorithm to search for source data that likely will
match a given target window. Although it does a good job, the
algorithm does not always find the best matches, which in this case,
are given by the simple algorithm of Vcdiff-d. As a result, the
output size for Vcdiff-dcw is slightly larger.
9. Further Issues
Secondary compressors:
As discussed in Section 4.3, certain sections in the delta
encoding of a window may be further compressed by a secondary
compressor. In our experience, the basic Vcdiff format is
adequate for most purposes, so secondary compressors are
seldom needed. In particular, for normal use of data
differencing, where the files to be compared have long stretches
of matches, much of the gain in compression rate is already
achieved by normal string matching. However, for
applications beyond differencing of such nearly identical files,
secondary compressors may be needed to achieve maximal
compression.
10. Summary
11. Acknowledgements
http://www.research.att.com/sw/tools
This document does not define any values in this number space.
16. References
[1] D.G. Korn and K.P. Vo, Vdelta: Differencing and Compression,
Practical Reusable Unix Software, Editor B. Krishnamurthy, John
Wiley & Sons, Inc., 1995.
[7] D.G. Korn, K.P. Vo, Sfio: A buffered I/O Library, Proc. of the
Summer '91 Usenix Conference, 1991.
[11] Mogul, J., Krishnamurthy, B., Douglis, F., Feldmann, A., Goland,
Y. and A. Van Hoff, "Delta Encoding in HTTP", RFC 3229, January
2002.
[12] Bradner, S., "Key words for use in RFCs to Indicate Requirement
Levels", BCP 14, RFC 2119, March 1997.
David G. Korn
AT&T Labs, Room D237
180 Park Avenue
Florham Park, NJ 07932
Jeffrey C. Mogul
Western Research Laboratory
Hewlett-Packard Company
1501 Page Mill Road, MS 1251
Palo Alto, California, 94304, U.S.A.
Joshua P. MacDonald
Computer Science Division
University of California, Berkeley
345 Soda Hall
Berkeley, CA 94720
The limited permissions granted above are perpetual and will not be
revoked by the Internet Society or its successors or assigns.
Acknowledgement