request-body-canonicalization/latest/index.md
5 additions & 5 deletions
@@ -2,7 +2,7 @@
## Abstract
-Originally CDX files were only used to index web archives containing GET requests. As browser-based capture methods can record non-GET requests such as those generated by JavaScript, a way for CDX/CDXJ index records to differentiate based on request method and request body is needed. This document describes the mechanism used for encoding the request method and body in the CDX/CDXJ key by appending additional query parameters, as originally implemented by pywb.
+Originally, CDX files were only used to index web archives containing GET requests. As browser-based capture methods can record non-GET requests such as those generated by JavaScript, a way for CDX/CDXJ index records to differentiate based on request method and request body is needed. This document describes the mechanism used for encoding the request method and body in the CDX/CDXJ key by appending additional query parameters, as originally implemented by pywb.
## Conformance
@@ -23,18 +23,18 @@ The key words MAY and MUST in this document are to be interpreted as described i
Web archiving data is often stored in specialized formats, which include a full record of the HTTP network traffic as well as additional metadata. The archived data is often accessed via random-access, loading the appropriate chunks of data based on URLs requested by end users.
-This specification is designed to describe how to store two key file formats used for web archives:
+Web archiving data is often stored in two key file formats:
1. WARC — A widely accepted [ISO standard][3] used by many institutions around the world for storing web archive data.
2. WACZ — A new format [developed by Webrecorder][4] for packaging WARCs with other web archive data which supports random-access reads.
Both formats are 'composite' formats, containing smaller amounts of data interspersed with metadata. In the case of WARC, the format consists of concatenated records which are appended one after the other, eg. `cat A.warc B.warc > C.warc`. The WARCs may or may not be gzipped, in which case the result is a multi-member gzip.
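The multi-member gzip behaviour mentioned above can be checked with a minimal sketch. The byte strings are stand-ins rather than real WARC records, and compressing each one separately before joining is an assumption that mirrors concatenating two gzipped WARC files:

```python
import gzip

# Stand-in "records"; real WARC records would carry headers and payloads.
record_a = b"record A\n"
record_b = b"record B\n"

# Compress each record as its own gzip member, then join the members,
# which mirrors `cat A.warc.gz B.warc.gz > C.warc.gz`.
combined = gzip.compress(record_a) + gzip.compress(record_b)

# A multi-member gzip stream decompresses to the concatenation of its members.
restored = gzip.decompress(combined)
```

Because each member is independently compressed, a reader that knows a member's byte offset can decompress that record alone, which is what makes random access practical.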
-WACZ files use the ZIP format which contains a specialized file and directory layout. ZIP is also a composite format, containing the raw (sometimes compressed) data as well as header data which contains the location files and directories within the ZIP file.
+WACZ files use the ZIP format, which contains a specialized file and directory layout. ZIP is also a composite format, containing the raw (sometimes compressed) data as well as header data which contains the location files and directories within the ZIP file.
## Web Archive Index Formats (CDX and CDXJ)
-Web archive search and retrieval is frequently intermediated by index files of WARC data, in the CDX or CDXJ formats. WACZ files contain CDXJ indices, which may or may not be gzipped, within the ZIP file that comprises the WACZ.
+Web archive search and retrieval is frequently intermediated by index files of WARC data in the CDX or CDXJ formats. WACZ files contain CDXJ indices, which may or may not be gzipped, within the ZIP file that comprises the WACZ.
### CDX
@@ -70,7 +70,7 @@ The JSON Block contains a serialized [JSON][7] object with newlines escaped so t
### Motivation
-POST-canonicalization provides a standardized way of representing a non-GET HTTP request as a GET request for indexing and playback in web archives. The original HTTP request type as well as the encoded request body are appended to the original URL and included in CDX/CDXJ indices as the Searchable URL. This allows web archive playback engines to then reconstruct the original non-GET requests for use in playback with their original HTTP method and request body.
+Request body canonicalization provides a standardized way of representing a non-GET HTTP request as a GET request for indexing and playback in web archives. The original HTTP request type as well as the encoded request body are appended to the original URL and included in CDX/CDXJ indices as the Searchable URL. This allows web archive playback engines to then reconstruct the original non-GET requests for use in playback with their original HTTP method and request body.
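
For readers of this change, the idea in the Motivation paragraph can be sketched roughly as follows. This is not the specification's or pywb's implementation. The helper name `canonicalize_request` is hypothetical, the `__wb_method` parameter name is assumed from pywb's convention, and only UTF-8 form-encoded bodies are handled:

```python
from urllib.parse import parse_qsl, urlencode


def canonicalize_request(url: str, method: str, body: bytes) -> str:
    """Hypothetical helper: fold a non-GET request into a GET-style URL."""
    if method.upper() == "GET":
        return url  # GET requests are indexed under their URL unchanged

    # Assumed parameter name for the original method (pywb-style convention).
    params = [("__wb_method", method.upper())]

    # Assumption: the body is UTF-8, application/x-www-form-urlencoded data.
    params += parse_qsl(body.decode("utf-8"), keep_blank_values=True)

    # Append to the existing query string if there is one.
    separator = "&" if "?" in url else "?"
    return url + separator + urlencode(params)


searchable = canonicalize_request("https://example.com/api", "POST", b"a=1&b=two")
```

Two POST requests to the same URL with different bodies then produce different keys, so an index can tell them apart. JSON, multipart, and binary bodies would need their own encoding rules, which this sketch does not attempt.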