feat: add new aggregation checks and improve validation summary #33

flitzpiepe93 · 2025-05-17T15:45:51Z

This merge request introduces several new aggregate-level data quality checks to the SparkDQ framework, along with a few documentation and type-handling improvements.

✅ New Features

UniqueRowsCheck: Detects duplicate rows (optionally based on a subset of columns).
UniqueRatioCheck: Checks if the fraction of distinct values in a column exceeds a defined threshold.
CompletenessRatioCheck: Validates that a column has a sufficient non-null ratio.
ColumnsAreCompleteCheck: Ensures a list of specified columns contains no nulls (all-or-nothing check).
DistinctRatioCheck: Validates whether a column’s distinct-to-total ratio exceeds a specified minimum.
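To make the metrics behind these checks concrete, here is a minimal plain-Python sketch of the quantities they describe. It is not the SparkDQ implementation and does not use Spark. The function names, the row representation (a list of dicts), and the handling of nulls and empty inputs are all assumptions made for illustration.

```python
from collections import Counter
from typing import Any, Optional, Sequence

Row = dict[str, Any]  # hypothetical row representation, not a Spark Row


def completeness_ratio(rows: Sequence[Row], column: str) -> float:
    """Fraction of rows whose value in `column` is not None."""
    if not rows:
        return 1.0  # assumption: an empty dataset counts as complete
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)


def distinct_ratio(rows: Sequence[Row], column: str) -> float:
    """Distinct non-null values divided by the total row count."""
    if not rows:
        return 0.0  # assumption: no rows means no distinct values
    distinct = {row[column] for row in rows if row.get(column) is not None}
    return len(distinct) / len(rows)


def duplicate_rows(
    rows: Sequence[Row], subset: Optional[Sequence[str]] = None
) -> list[Row]:
    """Rows whose key (all columns, or only `subset`) occurs more than once."""

    def key(row: Row) -> tuple:
        cols = subset if subset is not None else sorted(row)
        return tuple(row.get(col) for col in cols)

    counts = Counter(key(row) for row in rows)
    return [row for row in rows if counts[key(row)] > 1]


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 3, "email": "a@example.com"},
    {"id": 3, "email": "a@example.com"},
]

email_completeness = completeness_ratio(rows, "email")
email_distinct = distinct_ratio(rows, "email")
full_duplicates = duplicate_rows(rows)
email_duplicates = duplicate_rows(rows, subset=["email"])
# An all-or-nothing completeness check passes only if every column is fully populated.
columns_complete = all(completeness_ratio(rows, c) == 1.0 for c in ["id", "email"])
```

A ratio check would then compare such a value against its configured threshold, while the all-or-nothing variant fails as soon as any listed column contains a null.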

🛠 Improvements

Added an all_checks_passed property to the ValidationSummary to simplify downstream evaluations.
Added a # type: ignore comment to ColumnsAreCompleteCheck to silence a mypy error caused by a known type inference issue.
Linked the Spark installation guide in the README.md for easier setup.
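As a rough illustration of how an all_checks_passed property can simplify downstream code, consider the sketch below. The class name and its two fields are hypothetical stand-ins. The actual ValidationSummary in SparkDQ may hold different data and derive the property differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationSummarySketch:
    # Hypothetical fields for illustration; not the real SparkDQ summary.
    total_checks: int
    failed_checks: int

    @property
    def all_checks_passed(self) -> bool:
        """True when no check reported a failure."""
        return self.failed_checks == 0


summary = ValidationSummarySketch(total_checks=5, failed_checks=0)

# Downstream code can gate on a single boolean instead of inspecting counts.
if not summary.all_checks_passed:
    raise RuntimeError("data quality checks failed")
```

Exposing this as a read-only property keeps the pass/fail rule in one place, so callers such as pipeline gates do not re-derive it from individual counters.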

📚 Documentation

Added a usage example link to the Sphinx documentation.

Notes for Reviewers:

All new checks are covered by unit tests.
Each check integrates cleanly into the existing validation engine.
Let me know if you prefer breaking this into smaller MRs.

codecov · 2025-05-17T15:49:04Z

Codecov Report

Attention: Patch coverage is 98.67550% with 2 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...te/completeness_checks/completeness_ratio_check.py	93.33%	1 Missing and 1 partial ⚠️


flitzpiepe93 added 11 commits May 16, 2025 19:17

docs(sphinx): add example link

b1b0e98

feat: add property to validation summary to indicate if all checks passed

7416dd4

feat(check): add UniqueRowsCheck to detect duplicate rows based on full or partial column sets

5f46bdf

docs(README): add spark installation guideline link

b2469ca

feat(check): add UniqueRatioCheck to validate minimum uniqueness ratio in a column

84ba7bd

chore: add init to test folder

6c44ba9

feat(check): add CompletenessRatioCheck to validate non-null ratio of a column

098da34

feat(check): add aggregate-level check to ensure completeness of specified columns

5d4fdd8

feat(check): add DistinctRatioCheck to validate column cardinality

b829f36

chore: add mypy ignore to ColumnsAreCompleteCheck

4f7fe73

chore(dependencies): execute sync

bdda37c

flitzpiepe93 self-assigned this May 17, 2025

flitzpiepe93 added documentation Improvements or additions to documentation enhancement New feature or request labels May 17, 2025

flitzpiepe93 merged commit a0cfdc8 into main May 17, 2025
5 checks passed

flitzpiepe93 deleted the feat/add-new-checks branch May 17, 2025 16:05

This was referenced May 17, 2025

New Check: Unique Ratio #19

Closed

New Check: Distinct Ratio #18

Closed

New Check: Duplicated Rows / Unique Values #15

Closed

New Check: Completeness Ratio #17

Closed
