Stardex


Stardex (Streaming Tar Index) is a zero-trust, streaming tar parser and per-file hasher designed for backup pipelines.

It reads a tar stream from stdin, emits per-file metadata and hashes to stdout as JSONL, CSV, or SQL, and never modifies the stream; it is designed to sit on a tee branch in a pipeline.

Features

  • Streaming-native: Handles large streams without seeking; 2 MiB buffer (configurable) reused across entries.
  • Safe & strict: Validates tar header checksums and stops on malformed archives; never emits tar bytes (see the quick check after this list).
  • Deterministic: JSONL/CSV/SQL outputs include explicit hash_algo + hash when hashing is performed; preserves non-UTF-8 paths via path_raw_b64.
  • Fast: Uses BLAKE3 by default for high-performance hashing; other algorithms available on demand.
  • Flexible output: JSONL (default), CSV, or SQL INSERT statements (all with matching fields).
  • PAX-aware: PAX headers are size-limited (256 MiB by default), length fields are validated, and overrides (path/size/mtime/mode) are applied to the top-level fields.
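
As a quick sanity check of the strict parsing, feed stardex something that is not a tar stream; it should reject the input immediately and emit nothing (a sketch; the exact exit code is not specified here):

# 10 KiB of random bytes will virtually never carry a valid tar
# header checksum, so stardex should fail fast with a non-zero status.
head -c 10240 /dev/urandom | stardex; echo "exit status: $?"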

Installation

From crates.io:

cargo install stardex

From Source

git clone https://github.com/tpet93/stardex.git
cd stardex
cargo install --path .

Usage

Basic Usage

tar -cf - my_directory | stardex > index.jsonl
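
To spot-check the index, any JSONL-aware tool works; for example, pretty-printing the first record (assuming jq is installed):

head -n 1 index.jsonl | jq .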

In a Pipeline

Calculate hashes while compressing and writing to a file (or tape):

tar -cf - /data \
  | tee >(stardex --algo blake3 > index.jsonl) \
  | zstd -T0 > backup.tar.zst
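
Note that >(...) process substitution is a bash/zsh feature. On a strictly POSIX shell the same one-pass fan-out can be built with a named pipe; a minimal sketch (the FIFO name is arbitrary):

# Create a FIFO and index the stream from it in the background.
mkfifo tarstream
stardex --algo blake3 > index.jsonl < tarstream &

# tee duplicates the tar stream into the FIFO while zstd compresses it.
tar -cf - /data | tee tarstream | zstd -T0 > backup.tar.zst

wait
rm tarstream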

Advanced Pipeline (Tape Backup)

Calculate per-file hashes, a global tar hash, and a compressed archive hash in one pass:

tar -cf - directory \
  | tee >(stardex --algo sha256 --global-hash sha256 --summary-out summary.json > index.jsonl) \
  | zstd -T0 \
  | tee >(sha256sum > archive.tar.zst.sha256) \
  > archive.tar.zst

This produces:

  • index.jsonl: Per-file metadata and SHA256 hashes.
  • summary.json: Total tar size and SHA256 hash of the uncompressed tar stream.
  • archive.tar.zst: The compressed archive.
  • archive.tar.zst.sha256: SHA256 hash of the compressed archive.
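
After a restore, the index doubles as a verification manifest. Since the pipeline above hashed with sha256, sha256sum can recheck every hashed entry; a sketch assuming jq is available and that you run it from the directory the archive was extracted into:

# Turn the index into a sha256sum checklist ("<hash>  <path>") and verify.
jq -r 'select(.hash != null) | "\(.hash)  \(.path)"' index.jsonl \
  | sha256sum --check --quiet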

Speed Test

Run the benchmark script to see how fast stardex can go on your system:

./tests/benchmark.sh

Options

  • --algo <ALGO>: Hashing algorithm to use. Options: blake3 (default), sha256, md5, sha1, xxh64, xxh3, xxh128, none.
  • --format <FORMAT>: Output format. Options: jsonl (default), csv, sql.
  • --buffer-size <SIZE>: Set read buffer size (default: 2M). Supports human-readable units (e.g., 64K, 1M, 10M).
  • --no-fail: Drain stdin on error instead of exiting (prevents broken pipes).
  • --init-sql: When using --format sql, emit the schema and wrap inserts in BEGIN; ... COMMIT; so you can pipe directly into sqlite3 file.sqlite.
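
These flags compose. For example, indexing an existing compressed archive with a fast non-cryptographic hash, CSV output, and a larger buffer:

zstd -dc backup.tar.zst | stardex --algo xxh3 --format csv --buffer-size 8M > index.csv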

TODO

  • Re-enable stardex man and shell completion generation once the publishing bugs around those assets are resolved. Contributions welcome!

Behavior & Limits

  • Hashing is applied only to data-bearing entries (Regular, GNUSparse, Continuous). Metadata-only entries are still validated and emitted without hashes. --algo none disables hashing entirely but leaves all metadata intact.
  • PAX headers are parsed using their declared length and capped at 256 MiB by default (STARDEX_PAX_MAX_SIZE env var can override). Malformed length fields or oversized headers fail fast. PAX overrides for path, size, mtime, and mode are reflected in the top-level fields.
  • --no-fail drains stdin to EOF after an error to avoid breaking downstream pipes, and then exits with status 0 (so downstream tools stay running).
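
The last point matters most when stardex sits on a tee branch: without --no-fail, an indexing error would break the pipe and could abort the whole backup. A sketch:

# Even if indexing hits a malformed entry, the compression branch
# keeps running and the backup still completes.
tar -cf - /data \
  | tee >(stardex --no-fail > index.jsonl) \
  | zstd -T0 > backup.tar.zst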

Output Format (JSONL)

{
  "path": "my_directory/file.txt",
  "path_is_utf8": true,
  "path_raw_b64": null,
  "file_type": "Regular",
  "size": 1234,
  "mode": 420,
  "mtime": 1700000000,
  "hash_algo": "blake3",
  "hash": "...",
  "pax": {
    "path": "...",
    "mtime": "..."
  },
  "offset": 0
}

path_raw_b64 is emitted when the tar entry name is not valid UTF-8, allowing lossless reconstruction without emitting tar bytes. CSV and SQL formats contain the same fields (SQL output is emitted as INSERT statements with proper escaping). offset is the byte offset of the entry header within the tar stream.
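
Because every record carries hash alongside path, the JSONL index supports quick ad-hoc queries. For example, listing digests that occur more than once (i.e., duplicate file contents), assuming jq is installed:

jq -r 'select(.hash != null) | .hash' index.jsonl | sort | uniq -d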

SQL column order: path, path_is_utf8, path_raw_b64, file_type, size, mode, mtime, hash_algo, hash, pax (JSON), offset.

SQLite one-liner

tar -cf - /path/to/dir \
  | stardex --format sql --init-sql \
  | sqlite3 archive.sqlite
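
The resulting database can then be queried like any other table. The table name is defined by the schema that --init-sql emits; the query below assumes it is called files, which is hypothetical, so check the real name first:

# Inspect the emitted schema, then query it; 'files' is an assumed name.
sqlite3 archive.sqlite '.schema'
sqlite3 archive.sqlite "SELECT path, size FROM files ORDER BY size DESC LIMIT 10;"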

License

MIT
