# Stardex
Stardex (Streaming Tar Index) is a zero-trust, streaming tar parser and per-file hasher designed for backup pipelines.
It reads a tar stream from stdin, emits per-file metadata and hashes to stdout as JSONL (or other formats), and never modifies the stream, which makes it a natural fit for `tee` in a pipeline.
## Features
- **Streaming-native**: Handles large streams without seeking; a 2 MiB buffer (configurable) is reused across entries.
- **Safe & strict**: Validates tar header checksums and stops on malformed archives; never emits tar bytes.
- **Deterministic**: JSONL/CSV/SQL outputs include explicit `hash_algo` + `hash` when hashing is performed; preserves non-UTF-8 paths via `path_raw_b64`.
- **Fast**: Uses BLAKE3 by default for high-performance hashing; other algorithms are available on demand.
- **Flexible output**: JSONL (default), CSV, or SQL `INSERT` statements (all with matching fields).
- **PAX-aware**: PAX headers are size-limited (256 MiB by default), length fields are validated, and overrides (path/size/mtime/mode) are applied to the top-level fields.
## Installation

### From crates.io (Recommended)
Once published:
```shell
cargo install stardex
```
### From Source

```shell
git clone https://github.com/tpet93/stardex.git
cd stardex
cargo install --path .
```
## Usage

### Basic Usage

```shell
tar -cf - my_directory | stardex > index.jsonl
```
### In a Pipeline
Calculate hashes while compressing and writing to a file (or tape):
```shell
tar -cf - /data \
  | tee >(stardex --algo blake3 > index.jsonl) \
  | zstd -T0 > backup.tar.zst
```
### Advanced Pipeline (Tape Backup)
Calculate per-file hashes, a global tar hash, and a compressed archive hash in one pass:
```shell
tar -cf - directory \
  | tee >(stardex --algo sha256 --global-hash sha256 --summary-out summary.json > index.jsonl) \
  | zstd -T0 \
  | tee >(sha256sum > archive.tar.zst.sha256) \
  > archive.tar.zst
```
This produces:
- `index.jsonl`: Per-file metadata and SHA-256 hashes.
- `summary.json`: Total tar size and SHA-256 hash of the uncompressed tar stream.
- `archive.tar.zst`: The compressed archive.
- `archive.tar.zst.sha256`: SHA-256 hash of the compressed archive.
## Speed Test

Run the benchmark script to see how fast stardex runs on your system:

```shell
./tests/benchmark.sh
```
## Options

- `--algo <ALGO>`: Hashing algorithm. Options: `blake3` (default), `sha256`, `md5`, `sha1`, `xxh64`, `xxh3`, `xxh128`, `none`.
- `--format <FORMAT>`: Output format. Options: `jsonl` (default), `csv`, `sql`.
- `--buffer-size <SIZE>`: Read buffer size (default: `2M`). Supports human-readable units (e.g., `64K`, `1M`, `10M`).
- `--no-fail`: Drain stdin on error instead of exiting (prevents broken pipes).
- `--init-sql`: With `--format sql`, emit the schema and wrap inserts in `BEGIN; ... COMMIT;` so you can pipe directly into `sqlite3 file.sqlite`.
## TODO

- Re-enable `stardex man` and shell-completion generation once the publishing bugs around those assets are resolved. Contributions welcome!
## Behavior & Limits

- Hashing is applied only to data-bearing entries (`Regular`, `GNUSparse`, `Continuous`). Metadata-only entries are still validated and emitted without hashes. `--algo none` disables hashing entirely but leaves all metadata intact.
- PAX headers are parsed using their declared length and capped at 256 MiB by default (the `STARDEX_PAX_MAX_SIZE` env var can override this). Malformed length fields or oversized headers fail fast. PAX overrides for `path`, `size`, `mtime`, and `mode` are reflected in the top-level fields.
- `--no-fail` drains stdin to EOF after an error to avoid breaking downstream pipes, then exits with status 0 (so downstream tools stay running).
## Output Format (JSONL)
```json
{
  "path": "my_directory/file.txt",
  "path_is_utf8": true,
  "path_raw_b64": null,
  "file_type": "Regular",
  "size": 1234,
  "mode": 420,
  "mtime": 1700000000,
  "hash_algo": "blake3",
  "hash": "...",
  "pax": {
    "path": "...",
    "mtime": "..."
  },
  "offset": 0
}
```
`path_raw_b64` is emitted when the tar entry name is not valid UTF-8, allowing lossless reconstruction without emitting tar bytes. CSV and SQL formats contain the same fields (SQL output is emitted as `INSERT` statements with proper escaping). `offset` is the byte offset of the entry header within the tar stream.

SQL column order: `path`, `path_is_utf8`, `path_raw_b64`, `file_type`, `size`, `mode`, `mtime`, `hash_algo`, `hash`, `pax` (JSON), `offset`.
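For ad-hoc queries against the JSONL index, `jq` is a good fit. The two records below are fabricated to match the schema above so the snippet is self-contained; in practice, point it at the `index.jsonl` you generated:

```shell
# Fabricated sample index (fields follow the documented schema;
# the hash values are placeholders, not real digests).
cat > index.jsonl <<'EOF'
{"path":"a.txt","path_is_utf8":true,"file_type":"Regular","size":3,"hash_algo":"blake3","hash":"deadbeef","offset":0}
{"path":"dir/","path_is_utf8":true,"file_type":"Directory","size":0,"hash_algo":null,"hash":null,"offset":512}
EOF

# List hash and path for data-bearing entries only.
jq -r 'select(.file_type == "Regular") | [.hash, .path] | @tsv' index.jsonl
```

Because metadata-only entries carry `"hash": null`, filtering on `file_type` (or `.hash != null`) keeps the output clean.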
## SQLite one-liner

```shell
tar -cf - /path/to/dir \
  | stardex --format sql --init-sql \
  | sqlite3 archive.sqlite
```
## License
MIT