feat: add new aggregation checks and improve validation summary #33

Merged: 11 commits, May 17, 2025

flitzpiepe93
Contributor

This merge request introduces several new aggregate-level data quality checks to the SparkDQ framework, along with a few documentation and type-handling improvements.

✅ New Features

  • UniqueRowsCheck: Detects duplicate rows (optionally based on a subset of columns).
  • UniqueRatioCheck: Checks if the fraction of distinct values in a column exceeds a defined threshold.
  • CompletenessRatioCheck: Validates that a column has a sufficient non-null ratio.
  • ColumnsAreCompleteCheck: Ensures a list of specified columns contains no nulls (all-or-nothing check).
  • DistinctRatioCheck: Validates whether a column’s distinct-to-total ratio exceeds a specified minimum.
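To make the semantics of the new checks concrete, here is a minimal plain-Python sketch of the metrics they describe. It is not the SparkDQ API: the function names, the list-of-dicts row format, and the choices to exclude nulls from the distinct count and to divide by the total row count are all assumptions made for illustration.

```python
from typing import Any, Dict, Optional, Sequence

Row = Dict[str, Any]  # hypothetical row format, not a Spark DataFrame


def has_unique_rows(rows: Sequence[Row], subset: Optional[Sequence[str]] = None) -> bool:
    """True when no two rows share the same values in `subset` (all columns if None)."""
    keys = [tuple(row[col] for col in (subset or sorted(row))) for row in rows]
    return len(set(keys)) == len(keys)


def distinct_ratio(rows: Sequence[Row], column: str) -> float:
    """Distinct non-null values divided by total rows (assumed definition)."""
    if not rows:
        return 0.0
    distinct = {row[column] for row in rows if row[column] is not None}
    return len(distinct) / len(rows)


def completeness_ratio(rows: Sequence[Row], column: str) -> float:
    """Fraction of rows whose value in `column` is not null."""
    if not rows:
        return 1.0  # assumption: an empty dataset counts as complete
    return sum(row[column] is not None for row in rows) / len(rows)


def columns_are_complete(rows: Sequence[Row], columns: Sequence[str]) -> bool:
    """All-or-nothing: True only if none of `columns` contains a null."""
    return all(row[col] is not None for row in rows for col in columns)


rows = [
    {"id": 1, "country": "DE", "email": "a@example.com"},
    {"id": 2, "country": "DE", "email": None},
    {"id": 3, "country": "FR", "email": "c@example.com"},
    {"id": 3, "country": "FR", "email": "c@example.com"},  # exact duplicate
]

print(has_unique_rows(rows))                          # False
print(distinct_ratio(rows, "country"))                # 0.5
print(completeness_ratio(rows, "email"))              # 0.75
print(columns_are_complete(rows, ["id", "country"]))  # True
```

A ratio check would then compare the computed value against its configured threshold, while the two boolean checks fail on the first violation.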

🛠 Improvements

  • Added an all_checks_passed property to the ValidationSummary to simplify downstream evaluations.
  • Added a # type: ignore comment to ColumnsAreCompleteCheck to silence a mypy error caused by a known type inference issue.
  • Linked the Spark installation guide in the README.md for easier setup.
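As a rough illustration of how an all_checks_passed property can simplify downstream code, here is a small sketch. Only the names ValidationSummary and all_checks_passed come from this MR; the class below is a stand-in, and its fields (total_checks, failed_checks) are hypothetical, not the actual attributes of the SparkDQ class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValidationSummarySketch:
    """Stand-in for ValidationSummary; field names are assumptions."""

    total_checks: int
    failed_checks: int

    @property
    def all_checks_passed(self) -> bool:
        # A single boolean callers can branch on instead of comparing counters.
        return self.failed_checks == 0


summary = ValidationSummarySketch(total_checks=5, failed_checks=0)

# Downstream code, for example a pipeline gate, can stay short:
if not summary.all_checks_passed:
    raise RuntimeError("Data quality validation failed")
```

Exposing this as a read-only property keeps the pass/fail rule in one place, so callers do not each reimplement it.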

📚 Documentation

  • Added a usage example link to the Sphinx documentation.

Notes for Reviewers:

  • All new checks are covered by unit tests.
  • Each check integrates cleanly into the existing validation engine.
  • Let me know if you prefer breaking this into smaller MRs.

flitzpiepe93 self-assigned this on May 17, 2025
flitzpiepe93 added the documentation and enhancement labels on May 17, 2025

codecov bot commented May 17, 2025

Codecov Report

Attention: Patch coverage is 98.67550% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...te/completeness_checks/completeness_ratio_check.py | 93.33% | 1 Missing and 1 partial ⚠️ |

flitzpiepe93 merged commit a0cfdc8 into main on May 17, 2025
5 checks passed
flitzpiepe93 deleted the feat/add-new-checks branch on May 17, 2025 at 16:05