🚦 Zero-false-positive drift detection for analytics pipelines.
Fast: Optimized with Cython - handles 10,000+ unique values in under 2 seconds.
Flexible: Works with both CSV and Parquet files.
⚠️ Early Stage Project: This project is in active development with frequent breaking changes. The API and data formats may change between versions. Please pin your dependencies if using in production.
pip install driftfx
import pandas as pd, driftfx as dr
dr.snapshot(df_baseline, "baseline", cols=["name"])
result = dr.check(df_new, "baseline", cols=["name"])
if not result.is_clean():
if result.renames:
print("Renames / typos:", result.renames[:5]) # first 5 examples
if result.brand_new:
print("Brand-new names:", result.brand_new[:5]) # first 5 examples
# snapshot baseline (CSV or Parquet)
$ driftfx snapshot --input data.csv --cols name --baseline baseline/
$ Snapshot complete ✓
# check new batch
$ driftfx check --input new.parquet --cols name --baseline baseline/
$ [✖] name: 17 renames / 31 new codes
$ Drift detected: 48 anomalies
With Cython-optimized Levenshtein distance calculations:
Operation | Time | Throughput | Dataset |
---|---|---|---|
Snapshot | 1.6s | 5,264 rows/s | 10,000 unique values |
Check | 1.2s | 7,893 rows/s | 10,050 rows |
The Cython implementation provides an 8x speedup for snapshot operations compared to pure Python.