Skip to content

fix: ignore temp log entries #3423

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
May 16, 2025
Merged

Conversation

corwinjoy
Copy link
Contributor

Description

Fix delta-rs methods so that delta_log entries in subdirectories, representing pending but uncommited transactions, are ignored.

Related Issue(s)

Closes load_with_datetime

Documentation

As per the bug report only root level delta files should be included:

According to the delta protocol:
https://github.com/delta-io/delta/blob/master/PROTOCOL.md#delta-log-entries
"Delta files are stored as JSON in a directory at the root of the table named _delta_log, and together with checkpoints make up the log of all changes that have occurred to a table."

@github-actions github-actions bot added the binding/rust Issues for the Rust crate label May 10, 2025
Copy link

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

@corwinjoy corwinjoy changed the title Ignore temp log entries fix: Ignore temp log entries May 10, 2025
@@ -480,6 +480,7 @@ async fn list_log_files_with_checkpoint(
let version_prefix = format!("{:020}", cp.version);
let start_from = log_root.child(version_prefix.as_str());

// QUESTION: DOES THIS NEED TO BE FILTERED TO ELIMINATE temp subdirs?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function, plus the one below may need to be fixed as well to ignore temp subdirectories but I could use some feedback here + pointers to the associated unit test.

@@ -0,0 +1,2 @@
{"commitInfo":{"timestamp":1587968626637,"operation":"WRITE","operationParameters":{"mode":"Overwrite","partitionBy":"[]"},"readVersion":4,"isBlindAppend":true}}
{"add":{"path":"part-00000-c1777d7d-89d9-4790-b38a-6ee7e24456b1-c001.snappy.parquet","partitionValues":{},"size":262,"modificationTime":1587968626600,"dataChange":true}}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the pending operation in the .tmp subfolder + parquet file below.
Essentially just copied from existing transactions.

@corwinjoy
Copy link
Contributor Author

@adamreeve @EnricoMi

@corwinjoy corwinjoy changed the title fix: Ignore temp log entries fix: ignore temp log entries May 10, 2025
@corwinjoy corwinjoy force-pushed the ignore_temp_log_entries branch from 9ad4e09 to c319ae4 Compare May 10, 2025 01:03
@EnricoMi
Copy link

What about ignoring any . file (not only .tmp) unless the file is specifically supported (e.g. .*.crc)?

Prefixing a file with a . expresses "please ignore".

@adamreeve
Copy link
Contributor

What about ignoring any . file (not only .tmp) unless the file is specifically supported (e.g. .*.crc)?

This fix will already ignore any files or directories beginning with ., as it excludes files in any subdirectories regardless of the subdirectory name, and the existing code skips any files where the name doesn't match the log file name format.

I wonder if it would make sense for object-store to support a non-recursive list operation as a longer-term fix? This seems like it would be a pretty common operation, and that way this operation could be optimised in object store implementations where possible, rather than needing to filter on the client side. Although for this scenario in Delta Lake I guess there won't often be many nested results to filter out.

@@ -13,6 +15,11 @@ async fn time_travel_by_ds() {
("00000000000000000002.json", "2020-05-03T22:47:31-07:00"),
("00000000000000000003.json", "2020-05-04T22:47:31-07:00"),
("00000000000000000004.json", "2020-05-05T22:47:31-07:00"),
// Final file is uncommitted by Spark and is in a .tmp subdir

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this fix ignores any sub-directory, the test should also contain a testcase of a sub-directory not starting with a dot.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need a second test case? The name of the subdirectory does not matter, any and all files in subdirectories are ignored. I only used the name .tmp to clarify what is going on / better match what spark does.

@rtyler rtyler self-assigned this May 14, 2025
Copy link

codecov bot commented May 15, 2025

Codecov Report

Attention: Patch coverage is 92.85714% with 1 line in your changes missing coverage. Please review.

Project coverage is 71.93%. Comparing base (e985f95) to head (f0bf7ed).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
crates/core/src/table/mod.rs 85.71% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #3423   +/-   ##
=======================================
  Coverage   71.92%   71.93%           
=======================================
  Files         150      150           
  Lines       46494    46504   +10     
  Branches    46494    46504   +10     
=======================================
+ Hits        33441    33451   +10     
+ Misses      10899    10898    -1     
- Partials     2154     2155    +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@rtyler rtyler enabled auto-merge May 16, 2025 13:00
@rtyler rtyler added this to the Rust v1.0.0 milestone May 16, 2025
@rtyler rtyler force-pushed the ignore_temp_log_entries branch from 4310411 to 4a6740e Compare May 16, 2025 13:30
…directory of delta_log.

Signed-off-by: Corwin Joy <[email protected]>
Co-authored-by: Adam Reeve <[email protected]>
@rtyler rtyler force-pushed the ignore_temp_log_entries branch from 4a6740e to f0bf7ed Compare May 16, 2025 13:36
@rtyler rtyler added this pull request to the merge queue May 16, 2025
Merged via the queue into delta-io:main with commit 95e6a35 May 16, 2025
25 checks passed
@corwinjoy
Copy link
Contributor Author

@rtyler Thanks for reviewing this and merging!
Just to confirm, I guess the conclusion was that the functions list_log_files and list_log_files_with_checkpoint in log_segment.rs do not need to be updated to filter out temp files?
If they do, I am happy to add a followup.
Anyway, thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/rust Issues for the Rust crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

delta-rs includes pending versions written by spark
4 participants