Ligerito

a polynomial commitment scheme over binary extension fields.

⚠️ IMPORTANT: For optimal performance (3x speedup), install with native CPU optimizations:

RUSTFLAGS="-C target-cpu=native" cargo install ligerito

Without this flag, the prover will be significantly slower. See installation for details.

what it's good for

  • committing to large polynomials with small proofs (~150 KB for 2^20 polynomial)
  • fast proving on modern cpus with simd (50ms for 1M elements on AVX-512, SMT off)
  • tiered simd: AVX-512 → AVX2 → SSE → scalar fallback
  • verifier-only builds for constrained environments (polkavm, wasm, embedded)
  • transparent setup (no trusted setup required)
  • enabling verifiable light client p2p networks

what it's not good for

  • general-purpose zkp (no arbitrary circuits, only polynomial commitments)
  • proving without simd (slow without hardware acceleration)
  • tiny polynomials (proof overhead significant below 2^12)
  • scenarios requiring smallest possible proofs (starkware/plonky2 may be smaller)

library usage

add to Cargo.toml:

[dependencies]
ligerito = "0.3.1"

⚠️ for development: clone the workspace to get automatic native optimizations:

git clone https://github.com/rotkonetworks/zeratul
cd zeratul
cargo build --release -p ligerito

example:

use ligerito::{prove, verify, hardcoded_config_20, hardcoded_config_20_verifier};
use ligerito_binary_fields::{BinaryElem32, BinaryElem128};
use std::marker::PhantomData;

fn main() {
    // create prover config for a 2^20-element polynomial
    let config = hardcoded_config_20(
        PhantomData::<BinaryElem32>,
        PhantomData::<BinaryElem128>,
    );

    // polynomial to commit (2^20 elements)
    let poly: Vec<BinaryElem32> = vec![BinaryElem32::from(42); 1 << 20];

    // generate proof
    let proof = prove(&config, &poly).unwrap();

    // verify proof
    let verifier_config = hardcoded_config_20_verifier();
    let valid = verify(&verifier_config, &proof).unwrap();
    assert!(valid);
}

transcript backends

// sha256 (default, no extra deps, works in no_std)
use ligerito::{prove_sha256, verify_sha256};
let proof = prove_sha256(&config, &poly).unwrap();
let valid = verify_sha256(&verifier_config, &proof).unwrap();

// merlin (requires std + transcript-merlin feature)
use ligerito::{prove, verify};
let proof = prove(&config, &poly).unwrap();
let valid = verify(&verifier_config, &proof).unwrap();

supported sizes

configs available: hardcoded_config_{12,16,20,24,28,30} for prover and hardcoded_config_{12,16,20,24,28,30}_verifier for verifier.
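
for example, committing to a 2^24-element polynomial uses the matching config pair. this is a minimal sketch: it assumes the same BinaryElem32/BinaryElem128 field pair and config signature as the 2^20 example above, which may differ for larger sizes.

use ligerito::{prove, verify, hardcoded_config_24, hardcoded_config_24_verifier};
use ligerito_binary_fields::{BinaryElem32, BinaryElem128};
use std::marker::PhantomData;

fn main() {
    // prover and verifier configs must target the same size (here 2^24)
    let config = hardcoded_config_24(
        PhantomData::<BinaryElem32>,
        PhantomData::<BinaryElem128>,
    );
    let verifier_config = hardcoded_config_24_verifier();

    let poly: Vec<BinaryElem32> = vec![BinaryElem32::from(7); 1 << 24];
    let proof = prove(&config, &poly).unwrap();
    assert!(verify(&verifier_config, &proof).unwrap());
}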

build configurations

cargo build --release

includes: prover, verifier, parallelism, simd

verifier-only

cargo build --release --no-default-features --features="std,verifier-only"

~50% smaller binary, perfect for polkavm/on-chain verification

no_std verifier

cargo build --release --no-default-features --features="verifier-only"

minimal build for wasm/embedded (requires alloc)

cli usage

installation

recommended (optimized for your cpu):

# from crates.io
RUSTFLAGS="-C target-cpu=native" cargo install ligerito

# or from source
git clone https://github.com/rotkonetworks/zeratul
cd zeratul
cargo install --path crates/ligerito

the workspace config automatically enables native cpu optimizations.

performance by simd tier (8 cores, SMT off):

AVX-512 + VPCLMULQDQ:  ~50ms for 2^20 prove (fastest)
AVX2 + VPCLMULQDQ:     ~65ms for 2^20 prove
SSE + PCLMULQDQ:       ~95ms for 2^20 prove
No SIMD (scalar):      ~220ms for 2^20 prove

important: disable SMT (hyperthreading) for accurate benchmarks; the extra hardware threads cause cache contention.

without optimizations (not recommended):

cargo install ligerito  # will show build warning about missing SIMD

check your build:

ligerito --version  # should show v0.2.4 or later
ligerito bench --size 20  # quick performance test (no I/O overhead)
# look for [release SIMD] in prove output for optimal performance

prove and verify

# generate random test data (2^20 = 1M elements)
ligerito generate --size 20 --pattern random > poly.bin

# generate proof from polynomial data
cat poly.bin | ligerito prove --size 20 > proof.bin

# verify proof
cat proof.bin | ligerito verify --size 20
# output: "VALID" with exit code 0

# roundtrip test
cat poly.bin | ligerito prove --size 20 | ligerito verify --size 20

transcript backends

prover and verifier must use the same transcript backend:

# sha256 (default)
ligerito prove --size 20 --transcript sha256 < poly.bin > proof.bin
ligerito verify --size 20 --transcript sha256 < proof.bin

# merlin (requires transcript-merlin feature)
ligerito prove --size 20 --transcript merlin < poly.bin > proof.bin
ligerito verify --size 20 --transcript merlin < proof.bin

generate test data

# random data (default)
ligerito generate --size 20 --pattern random > poly.bin

# all zeros
ligerito generate --size 20 --pattern zeros > poly.bin

# all ones
ligerito generate --size 20 --pattern ones > poly.bin

# sequential (0, 1, 2, ...)
ligerito generate --size 20 --pattern sequential > poly.bin

# save to file
ligerito generate --size 20 --pattern random --output test.bin

benchmark (no I/O overhead)

# prove only
ligerito bench --size 24

# prove + verify
ligerito bench --size 24 --verify

# multiple iterations
ligerito bench --size 24 --verify --iterations 10

# with specific thread count
RAYON_NUM_THREADS=16 ligerito bench --size 24

show configuration

ligerito config --size 20

data format

polynomials are binary data: size * 4 bytes (4 bytes per BinaryElem32 element).

example for 2^12:

# 2^12 elements = 4096 elements * 4 bytes = 16384 bytes
dd if=/dev/urandom of=test.bin bs=16384 count=1
cat test.bin | ligerito prove --size 12 | ligerito verify --size 12
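
the same test file can also be produced from rust with only the standard library. a minimal sketch: it writes 2^12 elements as 4-byte words; treating each element as a little-endian u32 is an assumption about the CLI's on-disk layout (for random data, as in the dd example above, the byte order does not matter).

use std::fs::File;
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // 2^12 elements * 4 bytes per BinaryElem32 = 16384 bytes
    let n: u32 = 1 << 12;
    let mut out = BufWriter::new(File::create("test.bin")?);
    for i in 0..n {
        // simple deterministic pattern; 4 little-endian bytes per element is assumed
        out.write_all(&i.to_le_bytes())?;
    }
    out.flush()
}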

features

  • std (default): standard library support
  • prover (default): include proving functionality
  • verifier-only: minimal verifier build
  • parallel (default): multi-threaded with rayon
  • hardware-accel (default): simd acceleration
  • transcript-sha256: sha256 transcript (always available)
  • transcript-merlin: merlin transcript (requires std)
  • cli: command-line binary
  • wasm: browser support with wasm-bindgen
  • wasm-parallel: multi-threaded WASM via Web Workers (requires SharedArrayBuffer)

wasm usage

# build WASM with SIMD128 (recommended)
RUSTFLAGS='-C target-feature=+simd128' \
  cargo build --target wasm32-unknown-unknown --features wasm --no-default-features

# with parallel support (requires SharedArrayBuffer + CORS headers)
RUSTFLAGS='-C target-feature=+simd128,+atomics,+bulk-memory,+mutable-globals' \
  cargo build --target wasm32-unknown-unknown --features wasm-parallel --no-default-features

wasm simd128 optimizations (illustrated in the sketch after this list):

  • i8x16_swizzle for parallel 16-way table lookups
  • v128_xor for SIMD GF(2) additions
  • 4-bit lookup table with Karatsuba decomposition
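
as an illustration of the two building blocks above (not the crate's internal code), a lane-wise GF(2) add and a 16-way 4-bit table lookup with core::arch::wasm32 intrinsics look roughly like this, assuming the build enables the simd128 target feature:

#[cfg(target_arch = "wasm32")]
mod simd128_sketch {
    use core::arch::wasm32::{i8x16_swizzle, u8x16_splat, v128, v128_and, v128_xor};

    // GF(2) addition of 16 bytes at a time is a plain lane-wise xor
    pub fn gf2_add(a: v128, b: v128) -> v128 {
        v128_xor(a, b)
    }

    // 16 parallel 4-bit table lookups: `table` holds 16 precomputed entries,
    // `indices` holds one index per byte lane
    pub fn lut16(table: v128, indices: v128) -> v128 {
        // mask indices to 0..=15 so every lane selects a table entry
        let idx = v128_and(indices, u8x16_splat(0x0f));
        i8x16_swizzle(table, idx)
    }
}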

supported sizes

  • 2^12 (4,096 elements, 16 KB)
  • 2^16 (65,536 elements, 256 KB)
  • 2^20 (1,048,576 elements, 4 MB)
  • 2^24 (16,777,216 elements, 64 MB)
  • 2^28 (268,435,456 elements, 1 GB)
  • 2^30 (1,073,741,824 elements, 4 GB)

performance

benchmark results

benchmarked on amd ryzen 9 7945hx (16 cores / 32 threads, AVX-512 + VPCLMULQDQ):

size   elements   prove time   verify time   proof size   throughput
2^20   1.05M      41ms         5ms           149 KB       25.6M elem/s
2^24   16.8M      570ms        29ms          245 KB       29.4M elem/s
2^28   268.4M     ~9s          ~400ms        ~2.4 MB      29.8M elem/s

use the built-in benchmark command (no I/O overhead):

ligerito bench --size 24 --verify --iterations 5

parallel scaling analysis

the prover is memory-bandwidth bound, not compute-bound. scaling efficiency drops significantly beyond 8 cores:

threads   2^24 prove   speedup   efficiency
1         4339ms       1.0x      100%
2         2301ms       1.9x      94%
4         1251ms       3.5x      87%
8         780ms        5.6x      70%
16        617ms        7.0x      44%
32        570ms        7.6x      24%

why scaling stops at ~8 cores:

  1. memory bandwidth saturation: 2^24 elements = 64MB polynomial. the FFT reads/writes this data multiple times per level. DDR5 bandwidth (~100GB/s) becomes the bottleneck, not CPU cycles.

  2. L3 cache pressure: ryzen 7945hx has 64MB L3 (32MB per CCD). the polynomial barely fits in cache. with >8 cores, cross-CCD traffic over infinity fabric adds latency.

  3. amdahl's law: FFT has log2(n) = 24 levels. only some levels run in parallel. merkle tree construction and transcript hashing have limited parallelism.

  4. diminishing returns: 7.6x speedup on 32 threads (vs theoretical 32x) is actually good for memory-bound workloads. typical memory-bound algorithms achieve 4-8x on many-core systems.
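
given the analysis above, library users calling prove directly can cap the worker pool at the physical-core count instead of relying on RAYON_NUM_THREADS. a minimal sketch, assuming the parallel feature uses rayon's global thread pool (rayon is the backend listed under features):

fn main() {
    // cap the rayon global pool before the first call to prove;
    // build_global can only succeed once per process
    rayon::ThreadPoolBuilder::new()
        .num_threads(8) // ~8 physical cores is where scaling flattens out
        .build_global()
        .expect("rayon global pool already initialized");

    // ... build the config, load the polynomial, then call ligerito::prove as usual
}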

simd tier comparison

fft butterfly performance (2^24 elements, single iteration):

tier      instruction      elements/op   time   vs SSE
AVX-512   vpclmulqdq zmm   8             17ms   2.6x
AVX2      vpclmulqdq ymm   4             25ms   1.8x
SSE       pclmulqdq xmm    2             45ms   baseline

runtime detection automatically selects the best available tier.
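
the dispatch pattern behind that tiering looks roughly like this. an illustrative sketch using std's runtime feature-detection macro, not the crate's internal code; the avx-512 detection strings require a toolchain where they are stabilized:

#[cfg(target_arch = "x86_64")]
fn best_tier() -> &'static str {
    // prefer the widest carry-less multiply the CPU actually supports
    if is_x86_feature_detected!("avx512f") && is_x86_feature_detected!("vpclmulqdq") {
        "AVX-512 + VPCLMULQDQ"
    } else if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("vpclmulqdq") {
        "AVX2 + VPCLMULQDQ"
    } else if is_x86_feature_detected!("pclmulqdq") {
        "SSE + PCLMULQDQ"
    } else {
        "scalar fallback"
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn best_tier() -> &'static str {
    "scalar fallback"
}

fn main() {
    println!("selected tier: {}", best_tier());
}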

cli vs library performance

the CLI has ~2x overhead when using pipes due to I/O:

method                       2^24 prove
ligerito bench --size 24     570ms
generate | prove | verify    ~1200ms

the overhead comes from:

  • piping 64MB through stdin/stdout
  • random number generation in generate
  • proof serialization/deserialization

for benchmarking, always use ligerito bench which measures pure algorithm performance.

optimal configuration

# best performance: all threads, SMT on
RAYON_NUM_THREADS=32 ligerito bench --size 24

# consistent benchmarks: physical cores only, SMT off
echo "off" | sudo tee /sys/devices/system/cpu/smt/control
RAYON_NUM_THREADS=16 taskset -c 0-15 ligerito bench --size 24
echo "on" | sudo tee /sys/devices/system/cpu/smt/control

# single CCD (lower latency, less throughput)
RAYON_NUM_THREADS=8 taskset -c 0-7 ligerito bench --size 24

improving performance further

current bottlenecks and potential solutions:

bottleneck           solution                       expected gain
memory bandwidth     GPU acceleration (HBM)         5-10x
FFT memory access    cache-blocked FFT              1.2-1.5x
field element size   GF(2^16) instead of GF(2^32)   1.5-2x
cross-CCD latency    NUMA-aware allocation          1.1-1.2x

reference

Ligerito paper by Andrija Novakovic and Guillermo Angeris

license

MIT / Apache-2.0
