fix: Improve Validation Output, Declarative Config Consistency, and Docs #29

Merged
merged 9 commits into from
May 12, 2025
Binary file added docs/source/_static/integration_patterns.png
25 changes: 25 additions & 0 deletions docs/source/built_in_checks/aggregate.rst
@@ -0,0 +1,25 @@
Aggregate Checks
================

.. toctree::
:maxdepth: 1
:caption: Built-in Checks
:hidden:

checks/schema/column_presence_check
checks/count/count_between_check
checks/count/count_exact_check
checks/count/count_min_check
checks/count/count_max_check
checks/schema/schema_check

.. csv-table::
:header: "Check", "Description"
:widths: 20, 80

":ref:`count-min-check` ", "Ensures that the DataFrame contains at least a defined minimum number of rows."
":ref:`count-max-check` ", "Ensures that the DataFrame does not exceed a defined maximum number of rows."
":ref:`count-between-check` ", "Ensures that the number of rows in the dataset falls within a defined inclusive range."
":ref:`count-exact-check` ", "Ensures that the dataset contains exactly the specified number of rows."
":ref:`column-presence-check` ", "Verifies the existence of required columns in the DataFrame, independent of their data types."
":ref:`schema-check` ", "Ensures that a DataFrame matches an expected schema by verifying column names and data types, with optional strict enforcement against unexpected columns."
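
Reviewer's sketch, not part of the PR: the four row-count descriptions in the table above reduce to plain comparisons on the DataFrame's row count. The function names are hypothetical and chosen only for illustration. The boundary behaviour (inclusive limits) is read from the table wording, not from SparkDQ's source.

```python
def count_min_ok(row_count: int, min_count: int) -> bool:
    """'At least a defined minimum number of rows.'"""
    return row_count >= min_count


def count_max_ok(row_count: int, max_count: int) -> bool:
    """'Does not exceed a defined maximum number of rows.'"""
    return row_count <= max_count


def count_between_ok(row_count: int, min_count: int, max_count: int) -> bool:
    """'Falls within a defined inclusive range.'"""
    return min_count <= row_count <= max_count


def count_exact_ok(row_count: int, expected_count: int) -> bool:
    """'Contains exactly the specified number of rows.'"""
    return row_count == expected_count


# A batch of 1000 rows sits on the inclusive lower bound of [1000, 5000].
batch_passes = count_between_ok(1000, 1000, 5000)
```

In a real pipeline the row count would come from the DataFrame itself. The sketch leaves that out so that it runs without Spark.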
@@ -33,7 +33,7 @@ Declarative Configuration

 - check: is-contained-in-check
 check-id: validate_status_and_country
-allowed_values:
+allowed-values:
 status:
 - ACTIVE
 - INACTIVE
@@ -33,7 +33,7 @@ Declarative Configuration

 - check: is-not-contained-in-check
 check-id: block_test_status_and_invalid_countries
-forbidden_values:
+forbidden-values:
 status:
 - TEST
 - DUMMY
@@ -31,8 +31,8 @@ Declarative Configuration

 - check: row-count-between-check
 check-id: expected_daily_batch_size
-min_count: 1000
-max_count: 5000
+min-count: 1000
+max-count: 5000
 severity: error

 Typical Use Cases
@@ -30,7 +30,7 @@ Declarative Configuration

 - check: row-count-exact-check
 check-id: validate_snapshot_size
-expected_count: 500
+expected-count: 500
 severity: error

 Typical Use Cases
@@ -30,7 +30,7 @@ Declarative Configuration

 - check: row-count-max-check
 check-id: prevent_oversize_batch
-max_count: 100000
+max-count: 100000
 severity: error

@@ -30,7 +30,7 @@ Declarative Configuration

 - check: row-count-min-check
 check-id: minimum_required_records
-min_count: 10000
+min-count: 10000
 severity: warning

 Typical Use Cases
@@ -42,8 +42,8 @@ Declarative Configuration
 check-id: allowed_record_date_range
 columns:
 - record_date
-min_value: "2020-01-01"
-max_value: "2023-12-31"
+min-value: "2020-01-01"
+max-value: "2023-12-31"
 inclusive: [true, true]
 severity: critical

@@ -39,7 +39,7 @@ Declarative Configuration
 check-id: maximum_allowed_record_date
 columns:
 - record_date
-max_value: "2023-12-31"
+max-value: "2023-12-31"
 inclusive: true
 severity: critical

@@ -39,7 +39,7 @@ Declarative Configuration
 check-id: minimum_allowed_record_date
 columns:
 - record_date
-min_value: "2020-01-01"
+min-value: "2020-01-01"
 inclusive: true
 severity: critical

@@ -42,8 +42,8 @@ Declarative Configuration
 check-id: allowed_discount_range
 columns:
 - discount
-min_value: 0.0
-max_value: 100.0
+min-value: 0.0
+max-value: 100.0
 inclusive: [true, true]
 severity: critical

@@ -39,7 +39,7 @@ Declarative Configuration
 check-id: maximum_allowed_discount
 columns:
 - discount
-max_value: 100.0
+max-value: 100.0
 inclusive: true
 severity: critical

@@ -40,7 +40,7 @@ Declarative Configuration
 columns:
 - price
 - discount
-min_value: 0.0
+min-value: 0.0
 inclusive: true
 severity: critical

@@ -30,7 +30,7 @@ Declarative Configuration

 - check: column-presence-check
 check-id: enforce_required_columns
-required_columns:
+required-columns:
 - id
 - event_timestamp
 - status
@@ -50,7 +50,7 @@ Declarative Configuration

 - check: schema-check
 check-id: enforce_schema_contract
-expected_schema:
+expected-schema:
 id: int
 name: string
 amount: decimal(10,2)
@@ -42,8 +42,8 @@ Declarative Configuration
 check-id: allowed_event_time_range
 columns:
 - event_time
-min_value: "2020-01-01 00:00:00"
-max_value: "2023-12-31 23:59:59"
+min-value: "2020-01-01 00:00:00"
+max-value: "2023-12-31 23:59:59"
 inclusive: [true, true]
 severity: critical

@@ -39,7 +39,7 @@ Declarative Configuration
 check-id: maximum_allowed_event_time
 columns:
 - event_time
-max_value: "2023-12-31 23:59:59"
+max-value: "2023-12-31 23:59:59"
 inclusive: true
 severity: critical

@@ -40,7 +40,7 @@ Declarative Configuration
 check-id: minimum_allowed_event_time
 columns:
 - event_time
-min_value: "2020-01-01 00:00:00"
+min-value: "2020-01-01 00:00:00"
 inclusive: true
 severity: critical

@@ -1,129 +1,50 @@
Built-In Checks
===============

SparkDQ includes various built-in checks for validating the integrity and quality of your data.
The following table lists all available checks with their identifiers and a brief
description — click on a check name to see details and examples.

All configuration examples are shown in YAML format for better readability.
You can still define checks via JSON, or external sources — as long as the
configuration is provided as a Python dictionary at runtime.
Row-Level Checks
================

.. toctree::
:maxdepth: 1
:caption: Built-in Checks
:hidden:

checks/columns_comparison/column_less
checks/schema/column_presence_check
checks/count/count_between_check
checks/count/count_exact_check
checks/count/count_min_check
checks/count/count_max_check

checks/date/date_between_check
checks/date/date_min_check
checks/date/date_max_check

checks/date/date_min_check
checks/null/exactly_one_not_null_check

checks/contained_in/is_contained_in_check
checks/contained_in/is_not_contained_in_check

checks/null/not_null_check
checks/null/null_check
checks/numeric/numeric_min_check
checks/numeric/numeric_max_check
checks/numeric/numeric_between_check

checks/numeric/numeric_max_check
checks/numeric/numeric_min_check
checks/strings/regex_match_check

checks/schema/schema_check
checks/strings/string_between_length
checks/strings/string_max_length
checks/strings/string_min_length

checks/timestamp/timestamp_min_check
checks/timestamp/timestamp_max_check
checks/timestamp/timestamp_between_check

Comparison Checks
-----------------

.. csv-table::
:header: "Check", "Description"
:widths: 20, 80

":ref:`column_less_than_check`", "Validates that values in one column are strictly less than (or less than or equal to) the values in another column."

Contained-In Checks
----------------------
checks/timestamp/timestamp_max_check
checks/timestamp/timestamp_min_check

.. csv-table::
:header: "Check", "Description"
:widths: 20, 80

":ref:`date-between-check` ", "Ensures that date column values are within a defined range."
":ref:`date-max-check` ", "Ensures that date column values are less than a defined maximum date."
":ref:`date-min-check` ", "Ensures that date column values are greater than a defined minimum date."
":ref:`exactly_one_not_null_check` ", "Validates that exactly one of the specified columns is non-null per row."
":ref:`is-contained-in-check` ", "Ensures that column values are contained within a predefined set of allowed values."
":ref:`is-not-contained-in-check` ", "Ensures that column values are not contained within a set of forbidden values."

Count Checks
------------

.. csv-table::
:header: "Check", "Description"
:widths: 20, 80

":ref:`count-min-check` ", "Ensures that the DataFrame contains at least a defined minimum number of rows."
":ref:`count-max-check` ", "Ensures that the DataFrame does not exceed a defined maximum number of rows."
":ref:`count-between-check` ", "Ensures that the number of rows in the dataset falls within a defined inclusive range."
":ref:`count-exact-check` ", "Ensures that the dataset contains exactly the specified number of rows."

Null Checks
-----------

.. csv-table::
:header: "Check", "Description"
:widths: 20, 80

":ref:`null_check` ", "Verifies whether a given column contains any null values."
":ref:`not_null_check` ", "Checks whether the specified column contains at least one non-null value."
":ref:`exactly_one_not_null_check` ", "Validates that exactly one of the specified columns is non-null per row."

Range Checks
------------

.. csv-table::
:header: "Check", "Description"
:widths: 20, 80

":ref:`numeric-min-check` ", "Ensures that numeric column values are greater than a defined minimum."
":ref:`numeric-max-check` ", "Ensures that numeric column values are less than a defined maximum."
":ref:`null_check` ", "Verifies whether a given column contains any null values."
":ref:`numeric-between-check` ", "Ensures that numeric column values are within a defined inclusive range."
":ref:`date-min-check` ", "Ensures that date column values are greater than a defined minimum date."
":ref:`date-max-check` ", "Ensures that date column values are less than a defined maximum date."
":ref:`date-between-check` ", "Ensures that date column values are within a defined range."
":ref:`timestamp-min-check` ", "Ensures that timestamp column values are greater than a defined timestamp."
":ref:`timestamp-max-check` ", "Ensures that timestamp column values are less than a defined maximum timestamp."
":ref:`numeric-max-check` ", "Ensures that numeric column values are less than a defined maximum."
":ref:`numeric-min-check` ", "Ensures that numeric column values are greater than a defined minimum."
":ref:`timestamp-between-check` ", "Ensures that timestamp column values are within a defined inclusive range between a minimum and maximum timestamp."

Schema Checks
-------------

.. csv-table::
:header: "Check", "Description"
:widths: 20, 80

":ref:`column-presence-check` ", "Verifies the existence of required columns in the DataFrame, independent of their data types."
":ref:`schema-check` ", "Ensures that a DataFrame matches an expected schema by verifying column names and data types, with optional strict enforcement against unexpected columns."

String Checks
-------------

.. csv-table::
:header: "Check", "Description"
:widths: 20, 80

":ref:`timestamp-max-check` ", "Ensures that timestamp column values are less than a defined maximum timestamp."
":ref:`timestamp-min-check` ", "Ensures that timestamp column values are greater than a defined timestamp."
":ref:`regex_match_check`", "Validates that string values match a given regular expression pattern."
":ref:`string_min_length_check`", "Validates that string values in a column meet a minimum length requirement."
":ref:`string_max_length_check`", "Ensures that string values do not exceed a maximum length."
":ref:`string_length_between_check`", "Checks that string lengths fall within a specified range."
":ref:`string_max_length_check`", "Ensures that string values do not exceed a maximum length."
":ref:`string_min_length_check`", "Validates that string values in a column meet a minimum length requirement."
2 changes: 1 addition & 1 deletion docs/source/custom_checks/implementation/aggregate.rst
@@ -75,7 +75,7 @@ Minimal Example
 @register_check_config(check_name="my-custom-count-check")
 class RowCountMinCheckConfig(BaseAggregateCheckConfig):
 check_class = RowCountMinCheck
-min_count: int = Field(..., description="Minimum number of rows expected")
+min_count: int = Field(..., description="Minimum number of rows expected", alias="min-count")

 @model_validator(mode="after")
 def validate_min(self) -> "RowCountMinCheckConfig":
Expand Down
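
Reviewer's sketch, not part of the PR: the `alias="min-count"` added in the hunk above is what lets a kebab-case configuration key populate a snake_case Python attribute. The example below assumes Pydantic v2 (implied by `model_validator(mode="after")` in the diff) and uses a hypothetical stand-in class, `DemoMinCountConfig`, rather than SparkDQ's real base classes. Whether SparkDQ's own base config also accepts the snake_case name is not visible in this diff.

```python
from pydantic import BaseModel, Field, ValidationError, model_validator


class DemoMinCountConfig(BaseModel):
    """Hypothetical stand-in for a check config; not a SparkDQ class."""

    check_id: str = Field(..., alias="check-id")
    min_count: int = Field(..., description="Minimum number of rows expected", alias="min-count")

    @model_validator(mode="after")
    def validate_min(self) -> "DemoMinCountConfig":
        # Runs after field validation, so min_count is already an int here.
        if self.min_count < 0:
            raise ValueError("min-count must be non-negative")
        return self


# Kebab-case keys (as written in YAML) map onto the snake_case attributes.
cfg = DemoMinCountConfig.model_validate({"check-id": "c1", "min-count": 10})

# With default model settings the old snake_case key is not accepted:
# the aliased field is then reported as missing.
try:
    DemoMinCountConfig.model_validate({"check-id": "c1", "min_count": 10})
    old_key_rejected = False
except ValidationError:
    old_key_rejected = True
```

The attribute keeps its Python-friendly name (`cfg.min_count`), so code that reads the config does not change. Only the external spelling of the key does.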
4 changes: 2 additions & 2 deletions docs/source/custom_checks/plugin_architecture.rst
@@ -33,8 +33,8 @@ Workflow
 .. code-block:: python

 config = [
-{"check": "null-check", "check_id": "c1", "column": "age"},
-{"check": "positive-value", "check_id": "c2", "column": "salary"}
+{"check": "null-check", "check-id": "c1", "column": "age"},
+{"check": "positive-value", "check-id": "c2", "column": "salary"}
 ]

 2. For each entry, it uses the ``check`` to look up the matching `CheckConfig` class via the central **registry**.
Expand Down
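
Reviewer's sketch, not part of the PR: step 2 above (looking up a config class by its ``check`` name in a central registry) can be illustrated with a dictionary and a class decorator. Every name here (`register`, `build_configs`, the `Demo*` classes) is hypothetical. SparkDQ's actual registry and its handling of `check-id` are not shown in this diff, and the key normalisation below is an assumption standing in for field aliases.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Type

# Hypothetical central registry: check name -> config class.
_REGISTRY: Dict[str, Type[Any]] = {}


def register(check_name: str) -> Callable[[Type[Any]], Type[Any]]:
    """Class decorator that stores the decorated class under check_name."""
    def decorator(cls: Type[Any]) -> Type[Any]:
        if check_name in _REGISTRY:
            raise ValueError(f"duplicate check name: {check_name}")
        _REGISTRY[check_name] = cls
        return cls
    return decorator


@register("null-check")
@dataclass
class DemoNullCheckConfig:
    check_id: str
    column: str


@register("positive-value")
@dataclass
class DemoPositiveValueConfig:
    check_id: str
    column: str


def build_configs(entries: List[Dict[str, Any]]) -> List[Any]:
    """Resolve each entry's ``check`` name and instantiate its config class."""
    configs = []
    for entry in entries:
        params = dict(entry)  # copy so the caller's dict is left untouched
        cls = _REGISTRY[params.pop("check")]  # KeyError for unknown names
        # Assumption: map kebab-case keys ("check-id") to field names ("check_id").
        kwargs = {key.replace("-", "_"): value for key, value in params.items()}
        configs.append(cls(**kwargs))
    return configs


config = [
    {"check": "null-check", "check-id": "c1", "column": "age"},
    {"check": "positive-value", "check-id": "c2", "column": "salary"},
]
configs = build_configs(config)
```

Because the lookup is keyed by name, a new check only has to register itself. The code that walks the configuration list does not change.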
5 changes: 5 additions & 0 deletions docs/source/getting_started/applying_validation.rst
@@ -9,6 +9,11 @@ different destinations.

 Here are two practical approaches that you can implement with just a few lines of code.

+.. image:: /_static/integration_patterns.png
+:alt: Integration patterns diagram
+:align: center
+:width: 100%
+
 Fail-Fast Validation
 --------------------

4 changes: 2 additions & 2 deletions docs/source/getting_started/defining_checks.rst
@@ -121,8 +121,8 @@ definitions via dictionaries — for example loaded from YAML or JSON files.

 - check: row-count-between-check
 check-id: my-count-check
-min_count: 100
-max_count: 5000
+min-count: 100
+max-count: 5000

 To load the configuration into SparkDQ, use the following code:

Expand Down
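
Reviewer's sketch, not part of the PR: the hunk above states that definitions are plain dictionaries loaded from YAML or JSON, and it renames the keys to kebab-case. The example below parses a JSON rendering of the same check with the standard library and flags any keys that still contain underscores. The helper `find_snake_case_keys` is hypothetical and only meant as a migration aid. It does not call any SparkDQ API.

```python
import json
from typing import Any, Dict, List

# JSON rendering of the YAML check definition shown in the hunk above.
raw = """
[
  {
    "check": "row-count-between-check",
    "check-id": "my-count-check",
    "min-count": 100,
    "max-count": 5000
  }
]
"""

checks: List[Dict[str, Any]] = json.loads(raw)


def find_snake_case_keys(definitions: List[Dict[str, Any]]) -> List[str]:
    """Return top-level keys that still use underscores (pre-rename spelling)."""
    return sorted({key for item in definitions for key in item if "_" in key})


# The new-style definition has no leftover underscore keys ...
leftover_new = find_snake_case_keys(checks)

# ... while an old-style definition is flagged.
old_style = [{"check": "row-count-between-check", "check-id": "c", "min_count": 100, "max_count": 5000}]
leftover_old = find_snake_case_keys(old_style)
```

The helper only inspects top-level parameter names. Nested mappings such as column names under `expected-schema` are deliberately left alone, since those are data rather than configuration keys.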
4 changes: 1 addition & 3 deletions docs/source/getting_started/validation_dataframes.rst
@@ -21,7 +21,7 @@ Here’s how it works:
.. code-block:: python

from sparkdq.checks import NullCheckConfig
from sparkdq.core import Severity
from sparkdq.engine import BatchDQEngine
from sparkdq.management import CheckSet

check_set = CheckSet()
@@ -81,8 +81,6 @@ sparkdq automatically annotates your DataFrame with additional columns:

 * ``_dq_errors``: Array of structured errors for each failed check (name, check-id, severity)

-* ``_dq_aggregate_errors``: Optional column for dataset-wide violations
-
 * ``_dq_validation_ts``: Timestamp marking when the validation run was executed

 This enriched metadata allows you to: