Skip to content

feat: during LakeFS file operations, skip merge when 0 changes #3346

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

smeyerre
Copy link
Contributor

Description

If a branch does not have any file changes, we no longer merge the branch to avoid empty commits.

Related Issue(s)

Documentation

N/A

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Mar 25, 2025
Copy link

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@smeyerre
Copy link
Contributor Author

First time contributing here so I'll include my manual testing process here in case I missed anything!

Manual Testing

Testing was performed using a real LakeFS instance:

  1. Environment Setup

    • Started LakeFS using Docker Compose
    • Built delta-rs with the implemented changes
  2. Test Steps

    • Created a Delta table with initial data (verified first commit in LakeFS UI)
    • Added more data to the table (verified second commit in LakeFS UI)
    • Performed a no-op operation (repair with dry_run=True)
    • Verified that no empty commit was created (commit count remained at 2)
  3. Results

    • Confirmed that real operations (table creation, data addition) still create commits
    • Confirmed that no-op operations no longer create empty commits
    • The fix works correctly without affecting normal Delta Lake operations

@smeyerre smeyerre force-pushed the smeyerre/issue/3190/lakefs-skip-merge-in-file-operations branch 2 times, most recently from eae4513 to 74e2beb Compare March 25, 2025 03:18
@smeyerre smeyerre changed the title LakeFS: During file operations, skip merge when 0 changes feat: during LakeFS file operations, skip merge when 0 changes Mar 25, 2025
@smeyerre smeyerre force-pushed the smeyerre/issue/3190/lakefs-skip-merge-in-file-operations branch from 74e2beb to f637ea0 Compare March 25, 2025 04:10
@ion-elgreco
Copy link
Collaborator

ion-elgreco commented Mar 25, 2025

I had a quick glance and it looks good! You should change the test though, a repair dry run doesn trigger pre execute and post execute since it returns the metrics before that.

But an easier way to trigger a file operation lakefs commit is vacuum

@smeyerre
Copy link
Contributor Author

Ahh good to know, thanks I'll make that change!

Copy link

codecov bot commented Mar 25, 2025

Codecov Report

Attention: Patch coverage is 76.52174% with 27 lines in your changes missing coverage. Please review.

Project coverage is 72.20%. Comparing base (6437e7a) to head (aee2e6c).

Files with missing lines Patch % Lines
crates/lakefs/src/client.rs 81.08% 12 Missing and 2 partials ⚠️
crates/lakefs/src/logstore.rs 56.66% 8 Missing and 5 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3346      +/-   ##
==========================================
- Coverage   72.20%   72.20%   -0.01%     
==========================================
  Files         145      145              
  Lines       45890    45986      +96     
  Branches    45890    45986      +96     
==========================================
+ Hits        33135    33202      +67     
- Misses      10667    10679      +12     
- Partials     2088     2105      +17     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@smeyerre
Copy link
Contributor Author

Ok I'm realizing I missed a failing test earlier. It seems that when creating checkpoints, the checkpoint files aren't reflected by the LakeFS diff API.

I can think of a couple ways to go about handling this:

  1. Add something like a force_merge parameter to commit_merge that gets passed when doing checkpoint operations
  2. Add a more rigorous check for changes in has_changes by doing something like looking for checkpoint files as well

Option 1 seems a little bit weird to me so I'd be more inclined to do 2 but open to feedback on this (also I could have easily missed a better option).

@ion-elgreco
Copy link
Collaborator

Ok I'm realizing I missed a failing test earlier. It seems that when creating checkpoints, the checkpoint files aren't reflected by the LakeFS diff API.

I can think of a couple ways to go about handling this:

  1. Add something like a force_merge parameter to commit_merge that gets passed when doing checkpoint operations
  2. Add a more rigorous check for changes in has_changes by doing something like looking for checkpoint files as well

Option 1 seems a little bit weird to me so I'd be more inclined to do 2 but open to feedback on this (also I could have easily missed a better option).

What does the diff API return here? In theory it should show any change

@smeyerre
Copy link
Contributor Author

smeyerre commented Apr 7, 2025

Ok I'm realizing I missed a failing test earlier. It seems that when creating checkpoints, the checkpoint files aren't reflected by the LakeFS diff API.
I can think of a couple ways to go about handling this:

  1. Add something like a force_merge parameter to commit_merge that gets passed when doing checkpoint operations
  2. Add a more rigorous check for changes in has_changes by doing something like looking for checkpoint files as well

Option 1 seems a little bit weird to me so I'd be more inclined to do 2 but open to feedback on this (also I could have easily missed a better option).

What does the diff API return here? In theory it should show any change

Apologies, I've been out of town the last week and a half.

So in my manual testing it seems like the LakeFS diff API actually returns empty results when a checkpoint has been made. Is that not expected?

@ion-elgreco
Copy link
Collaborator

Ok I'm realizing I missed a failing test earlier. It seems that when creating checkpoints, the checkpoint files aren't reflected by the LakeFS diff API.
I can think of a couple ways to go about handling this:

  1. Add something like a force_merge parameter to commit_merge that gets passed when doing checkpoint operations
  2. Add a more rigorous check for changes in has_changes by doing something like looking for checkpoint files as well

Option 1 seems a little bit weird to me so I'd be more inclined to do 2 but open to feedback on this (also I could have easily missed a better option).

What does the diff API return here? In theory it should show any change

Apologies, I've been out of town the last week and a half.

So in my manual testing it seems like the LakeFS diff API actually returns empty results when a checkpoint has been made. Is that not expected?

Are you sure the test is correct? I remember we used the diff api before doing a merge on the python side and that worked

@smeyerre smeyerre force-pushed the smeyerre/issue/3190/lakefs-skip-merge-in-file-operations branch from c33a55b to 023046c Compare May 12, 2025 23:44
@smeyerre smeyerre requested a review from houqp as a code owner May 12, 2025 23:44
@smeyerre smeyerre force-pushed the smeyerre/issue/3190/lakefs-skip-merge-in-file-operations branch from 023046c to 659d288 Compare May 13, 2025 00:01
@github-actions github-actions bot removed the proofs label May 13, 2025
@smeyerre smeyerre force-pushed the smeyerre/issue/3190/lakefs-skip-merge-in-file-operations branch from 659d288 to 509e955 Compare May 13, 2025 00:13
@smeyerre
Copy link
Contributor Author

Ok I'm realizing I missed a failing test earlier. It seems that when creating checkpoints, the checkpoint files aren't reflected by the LakeFS diff API.
I can think of a couple ways to go about handling this:

  1. Add something like a force_merge parameter to commit_merge that gets passed when doing checkpoint operations
  2. Add a more rigorous check for changes in has_changes by doing something like looking for checkpoint files as well

Option 1 seems a little bit weird to me so I'd be more inclined to do 2 but open to feedback on this (also I could have easily missed a better option).

What does the diff API return here? In theory it should show any change

Apologies, I've been out of town the last week and a half.
So in my manual testing it seems like the LakeFS diff API actually returns empty results when a checkpoint has been made. Is that not expected?

Are you sure the test is correct? I remember we used the diff api before doing a merge on the python side and that worked

Ah ok I think I fixed the issue. It seems like I had the branch parameters to the diff API call reversed which messed with the checkpoint diffs specifically. The tests are all passing now, and my manual tests worked as well.

@ion-elgreco
Copy link
Collaborator

@smeyerre great! Can you fix the formatting? Then we're good to go 💯

@smeyerre smeyerre force-pushed the smeyerre/issue/3190/lakefs-skip-merge-in-file-operations branch from 81c5b91 to ca4d409 Compare May 13, 2025 15:52
@smeyerre smeyerre force-pushed the smeyerre/issue/3190/lakefs-skip-merge-in-file-operations branch from ca4d409 to 8df3021 Compare May 13, 2025 15:55
@smeyerre
Copy link
Contributor Author

@smeyerre great! Can you fix the formatting? Then we're good to go 💯

Oh oops, ok that should be fixed now, unit/formatting tests are passing locally.

@ion-elgreco
Copy link
Collaborator

@smeyerre don't worry about the other test failures, there is a breaking change in the arrow release, I will merge it once I've addressed it in main

@smeyerre
Copy link
Contributor Author

@smeyerre don't worry about the other test failures, there is a breaking change in the arrow release, I will merge it once I've addressed it in main

Awesome thanks!

@ion-elgreco ion-elgreco force-pushed the smeyerre/issue/3190/lakefs-skip-merge-in-file-operations branch from 8df3021 to 9068be4 Compare May 14, 2025 15:04
Copy link
Collaborator

@ion-elgreco ion-elgreco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this optimization!! <3

@ion-elgreco ion-elgreco enabled auto-merge May 14, 2025 15:04
@ion-elgreco ion-elgreco added this pull request to the merge queue May 14, 2025
Merged via the queue into delta-io:main with commit a99e502 May 14, 2025
26 of 27 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

lakefs: during file operations, skip merge when 0 changes
2 participants