New Check: Duplicated Rows / Unique Values #15

Closed
flitzpiepe93 opened this issue May 4, 2025 · 1 comment
Labels: enhancement, feature request

Comments

flitzpiepe93 commented May 4, 2025

Description

Introduce a check that validates whether duplicate rows exist in a DataFrame, optionally based on a subset of columns.

Example Usage

DuplicatedRowsCheckConfig(
    check_id="no-duplicates",
    subset_columns=["trip_id", "pickup_time"]
)

Expected Behavior

  • Flags rows that appear more than once (based on full row or selected columns)
  • Allows optional specification of subset_columns
  • Ignores nulls by default unless configured otherwise
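
The behavior listed above can be sketched in plain Python. This is only an illustration under stated assumptions, not the implementation from the linked MR: the issue does not say which DataFrame library is targeted, so rows are modeled as dictionaries, and the function name `flag_duplicated_rows` and the `ignore_nulls` parameter are hypothetical. "Ignores nulls" is read here as "a row with a null in any compared column is never flagged", which is one possible interpretation.

```python
from collections import Counter
from typing import Any, Dict, List, Optional, Sequence

Row = Dict[str, Any]


def flag_duplicated_rows(
    rows: Sequence[Row],
    subset_columns: Optional[Sequence[str]] = None,
    ignore_nulls: bool = True,  # hypothetical switch, not named in the issue
) -> List[bool]:
    """Return one flag per row: True if the row's key occurs more than once."""
    if not rows:
        return []

    # Compare the full row when no subset is given (assumes all rows share keys).
    columns = list(subset_columns) if subset_columns is not None else list(rows[0].keys())

    def key_of(row: Row) -> Optional[tuple]:
        key = tuple(row.get(column) for column in columns)
        # Assumed null handling: skip rows with a null in any compared column.
        if ignore_nulls and any(value is None for value in key):
            return None
        return key

    keys = [key_of(row) for row in rows]
    counts = Counter(key for key in keys if key is not None)

    # Every occurrence of a repeated key is flagged, not only the later ones.
    return [key is not None and counts[key] > 1 for key in keys]


# Illustrative data only; column names are taken from the example usage.
trips = [
    {"trip_id": 1, "pickup_time": "08:00", "fare": 10.0},
    {"trip_id": 1, "pickup_time": "08:00", "fare": 12.5},
    {"trip_id": 2, "pickup_time": "09:00", "fare": 7.0},
    {"trip_id": None, "pickup_time": "10:00", "fare": 5.0},
    {"trip_id": None, "pickup_time": "10:00", "fare": 5.5},
]
flags = flag_duplicated_rows(trips, subset_columns=["trip_id", "pickup_time"])
```

With the subset `["trip_id", "pickup_time"]`, the first two trips are flagged because they share a key even though their fares differ. The two rows with a missing `trip_id` are only flagged when `ignore_nulls=False`. Values in the compared columns must be hashable for this sketch to work.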

Benefits

  • Ensures uniqueness in datasets
  • Prevents unintentional data duplication
  • Supports primary-key style constraints
flitzpiepe93 added the enhancement and feature request labels May 4, 2025
flitzpiepe93 changed the title from "New Check: Duplicated Rows" to "New Check: Duplicated Rows / Unique Values" May 6, 2025
flitzpiepe93 commented

Issue solved with the following MR: #33
