Skip to content

Python bindings for lz4_flex, the fastest pure-Rust implementation of the LZ4 compression algorithm.

License

Notifications You must be signed in to change notification settings

LVivona/safelz4

Repository files navigation

safelz4

GitHub PyPI Python Version PyPI Downloads

Python bindings for lz4_flex, the fastest pure-Rust implementation of the LZ4 compression algorithm.

Installation

Pip

You can install safelz4 via the pip manager:

pip install safelz4

From source

For the sources, you need Rust

# Install Rust
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Make sure it's up to date and using stable channel
rustup update
git clone https://github.com/LVivona/safelz4.git
cd safelz4
pip install setuptools_rust
pip install maturin
# install
pip install -e .

Getting Started

Block Format

safelz4 block

The block format is suitable only for smaller chunks of data, as each block must be fully compressed or decompressed in memory. For larger data sequences, the frame format should be used instead, as it supports streaming and includes metadata for better handling of large-byte sequences. specs

import os
import sys
from typing import Union, Generator
from safelz4.block import compress_prepend_size, decompress_size_prepended

def chunk_blocks(filename : Union[os.PathLike, str], chunk_size : int = 1048576) -> Generator[bytes, None, None]:
    """compress read bytes into chunks blocks"""
    with open(filename, "rb") as f:
        while content := f.read(chunk_size):
            buffer = compress_prepend_size(content)
            yield buffer

# 1 Mb chunck
blocks = chunk_blocks("dickens.txt")

for block in blocks:
    output = decompress_size_prepended(buffer)
    sys.stdout.write(output.decode("utf-8"))

Frame Format

safelz4 frame

Frames are containers that encapsulate a set of compressed blocks. Information about the blocks is stored both in the frame header and within the blocks themselves. Read more within the specs

import safelz4

buffer = None
with open("dickens.txt", "rb") as file:
    buffer = file.read(-1)
    safelz4.compress_into_file("dickens.lz4", buffer)


with safelz4.open("dickens.lz4", "rb") as f:
   while content := f.read(100):
      print(content.decode("utf-8"))

Bechmarks

Benchmark results are available in the benches folder. We evaluated performance in two key scenarios:

Full byte availability, where the entire buffer is accessible during compression and decompression.

Streamed access, using reader and writer interfaces with chunked input.

Summary

In full buffer scenarios, lz4 generally performs well and occasionally outpaces safelz4, especially on larger files. However, safelz4 still remained competitive, with close times.

In reader/writer scenarios (chunked input, 1024 bytes), safelz4 significantly outperforms lz4, consistently achieving more than 2x speed improvement in both compression and decompression.

Streamed access (chunk 1024 bytes)

open Write Benchmark lz4 safelz4
ctx_compression_writer_compression_1k.txt 22.5 us 8.84 us: 2.54x faster
ctx_compression_writer_compression_34k.txt 22.6 us 9.07 us: 2.49x faster
ctx_compression_writer_compression_65k.txt 23.0 us 9.18 us: 2.50x faster
ctx_compression_writer_compression_66k_JSON.txt 23.1 us 9.18 us: 2.51x faster
ctx_compression_writer_dickens.txt 23.9 us 9.16 us: 2.61x faster
ctx_compression_writer_hdfs.json 22.9 us 9.21 us: 2.49x faster
ctx_compression_writer_reymont.pdf 22.9 us 9.26 us: 2.48x faster
ctx_compression_writer_xml_collection.xml 23.1 us 9.27 us: 2.49x faster
Geometric mean (ref) 2.51x faster
open Read Benchmark lz4 safelz4
ctx_decompression_writer_compression_1k.txt 17.6 us 11.0 us: 1.59x faster
ctx_decompression_writer_compression_34k.txt 46.2 us 23.8 us: 1.94x faster
ctx_decompression_writer_compression_65k.txt 68.6 us 34.6 us: 1.98x faster
ctx_decompression_writer_compression_66k_JSON.txt 61.9 us 27.1 us: 2.28x faster
ctx_decompression_writer_dickens.txt 8.67 ms 4.11 ms: 2.11x faster
ctx_decompression_writer_hdfs.json 4.39 ms 1.77 ms: 2.48x faster
ctx_decompression_writer_reymont.pdf 5.74 ms 2.92 ms: 1.97x faster
ctx_decompression_writer_xml_collection.xml 3.97 ms 1.99 ms: 2.00x faster
Geometric mean (ref) 2.03x faster

Full byte availability Run(s)

frame.compress Benchmark lz4 safelz4
compression_compression_1k.txt 839 ns 829 ns: 1.01x faster
compression_compression_34k.txt 32.5 us 26.3 us: 1.23x faster
compression_compression_65k.txt 60.1 us 49.9 us: 1.20x faster
compression_compression_66k_JSON.txt 24.7 us 26.5 us: 1.07x slower
compression_dickens.txt 15.9 ms 17.0 ms: 1.07x slower
compression_hdfs.json 2.63 ms 3.16 ms: 1.20x slower
compression_reymont.pdf 11.4 ms 12.3 ms: 1.08x slower
compression_xml_collection.xml 4.12 ms 4.58 ms: 1.11x slower
Geometric mean (ref) 1.01x slower
frame.decompress Benchmark lz4 safelz4
decompress_compression_1k.txt 416 ns 612 ns: 1.47x slower
decompress_compression_34k.txt 10.0 us 8.96 us: 1.12x faster
decompress_compression_65k.txt 17.1 us 15.4 us: 1.11x faster
decompress_compression_66k_JSON.txt 8.04 us 9.45 us: 1.18x slower
decompress_dickens.txt 2.13 ms 4.00 ms: 1.88x slower
decompress_hdfs.json 1.03 ms 1.50 ms: 1.45x slower
decompress_reymont.pdf 1.99 ms 2.42 ms: 1.21x slower
decompress_xml_collection.xml 1.19 ms 1.68 ms: 1.41x slower
Geometric mean (ref) 1.26x slower

NOTE: All benchmarks were performed using python package pypref, on a system equipped with an Apple M4 Max processor and 36GB of unified memory.

Acknowledgement

This project acknowledges the outstanding work of Yann Collet.

Special thanks also to the maintainers of the lz4_flex Rust crate for providing a safe, pure-Rust implementation of LZ4 compression and decompression.

Other Implementation

LZ4 implementations, including:

Python Library Build Status Version Licence
python-lz4 Build Status PyPI - License

Licence

MIT License

About

Python bindings for lz4_flex, the fastest pure-Rust implementation of the LZ4 compression algorithm.

Topics

Resources

License

Contributing

Security policy

Stars

Watchers

Forks

Packages

No packages published