Skip to content

Conversation

@ariosramirez
Copy link

resolves #9656

Problem

Currently, the check strategy for Snapshots does not throw an error if there are duplicate column names specified in the check_cols config, and accidentally specifying a column twice results in a record being inserted even if none of the column values changed.

Problem described in #9656 (comment)

Solution

This PR adds validation to detect duplicate column names in the check_cols configuration of a snapshot. It raises an error if duplicates are found.

Checklist

  • I have read the contributing guide and understand what's expected of me
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • This PR has no interface changes (e.g. macros, cli, logs, json artifacts, config files, adapter interface, etc) or this PR has already received feedback and approval from Product or DX
  • This PR includes type annotations for new and modified functions

@ariosramirez ariosramirez requested a review from a team as a code owner February 28, 2024 17:04
@cla-bot cla-bot bot added the cla:yes label Feb 28, 2024
@github-actions
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

1 similar comment
@github-actions
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@dbeatty10 dbeatty10 added the community This PR is from a community member label Mar 22, 2024
@graciegoheen graciegoheen added the ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering label May 28, 2024
raise ValidationError("A snapshot must have a materialized value of 'snapshot'")

# Validate if there are duplicate column names in check_cols.
if len(data.get("check_cols")) != len(set(data.get("check_cols"))):
Copy link
Contributor

@jordivandooren jordivandooren Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though .get() is used, if check_cols is missing, this will result in an error because NoneType has no len().

Suggested change
if len(data.get("check_cols")) != len(set(data.get("check_cols"))):
if "check_cols" in data and len(data["check_cols"]) != len(set(data["check_cols"])):

But, judging by the code above it, this check should only apply in case of the check strategy, where the existence of this check_cols has been validated beforehand. I suggest moving this new validation to after line 40.

I expect the current implementation to break the breaks the timestamp strategy, because of the absence of check_cols.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jordivandooren Could you please review the latest changes? I updated my code, and everything is now up-to-date. The duplicated columns validation has been moved inside the final_validate function.

@cla-bot
Copy link

cla-bot bot commented Sep 18, 2024

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: ariosramirez.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@cla-bot cla-bot bot removed the cla:yes label Sep 18, 2024
@cla-bot
Copy link

cla-bot bot commented Sep 19, 2024

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: ariosramirez.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

2 similar comments
@cla-bot
Copy link

cla-bot bot commented Sep 19, 2024

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: ariosramirez.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@cla-bot
Copy link

cla-bot bot commented Sep 19, 2024

Thank you for your pull request and welcome to our community. We could not parse the GitHub identity of the following contributors: ariosramirez.
This is most likely caused by a git client misconfiguration; please make sure to:

  1. check if your git client is configured with an email to sign commits git config --list | grep email
  2. If not, set it up using git config --global user.email [email protected]
  3. Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@codecov
Copy link

codecov bot commented Sep 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.89%. Comparing base (89caa33) to head (14f5ccb).
Report is 96 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9698      +/-   ##
==========================================
- Coverage   89.10%   88.89%   -0.21%     
==========================================
  Files         183      191       +8     
  Lines       23626    24400     +774     
==========================================
+ Hits        21051    21690     +639     
- Misses       2575     2710     +135     
Flag Coverage Δ
integration 85.95% <66.66%> (-0.45%) ⬇️
unit 62.84% <100.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
Unit Tests 62.84% <100.00%> (+0.04%) ⬆️
Integration Tests 85.95% <66.66%> (-0.45%) ⬇️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dataders
Copy link
Contributor

  • check if your git client is configured with an email to sign commits git config --list | grep email

  • If not, set it up using git config --global user.email [email protected]

  • Make sure that the git commit email is configured in your GitHub account settings, see https://github.com/settings/emails

@ariosramirez did you do these steps already as well?

@QMalcolm
Copy link
Contributor

Hi @ariosramirez, I think I understand what the problem is.

I checked our CLA database, and it does look like you've signed it. What I believe is happening is that cla-bot checks every commit in the PR. A majority of your commits are signed with the email [email protected], and that email address seems correct. However, four commits (8082fa7, 5343684, 67d5cb4, and 7bd6cb5) are signed with the email [email protected]. It appears that email address is no longer associated with your github account. If you still have access to that email address, I recommend re-adding it to your list of associated email address in github (you can have multiple, I myself have five email address associated with my github account). If you no longer have access to that email address, you may need to modify the signature on those commits.

@QMalcolm
Copy link
Contributor

@ariosramirez Alternatively, you can squash the problematic (or all the commits) into one new commit. This should give all the work your current email, I believe

@ariosramirez ariosramirez force-pushed the ADAP-9656_raise_an_error_when_check_cols_has_duplicated_values branch from f6a7335 to b859dea Compare September 20, 2024 04:47
@cla-bot cla-bot bot added the cla:yes label Sep 20, 2024
@ariosramirez
Copy link
Author

QMalcolm

Thanks, @QMalcolm the squash commit worked well.

@dbeatty10 dbeatty10 added test all run tests for all python versions + systems artifact_minor_upgrade To bypass the CI check by confirming that the change is not breaking labels Mar 31, 2025
@dbeatty10
Copy link
Contributor

@QMalcolm I added the artifact_minor_upgrade label only to try to get it to run the rest of the CI checks. Could you take a look if that label should be left on or removed?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

artifact_minor_upgrade To bypass the CI check by confirming that the change is not breaking cla:yes community This PR is from a community member ready_for_review Externally contributed PR has functional approval, ready for code review from Core engineering test all run tests for all python versions + systems

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Warn if duplicate columns are found in check Snapshot strategy

6 participants