Closed
Description
Environment
Delta-rs version: 0.25.5
Binding: python
Environment: Linux
Bug
What happened:
Merge seems to scan files/partitions that should be filtered out by predicate.
The merge operation returns num_target_files_scanned=3 when I expect 2.
Unless I have misunderstood what this variable means?
What you expected to happen:
The merge operation should return num_target_files_scanned=2 when merge only touches two partitions (with one file each). This works as expected in deltalake version 0.24.0.
How to reproduce it:
import pandas as pd
from deltalake import write_deltalake, DeltaTable
merge_col = "_merge_key"
partition_col = "month"
insert_frame = pd.DataFrame(
{
merge_col: ["2020-01-01", "2020-02-01", "2020-03-01"],
partition_col: ["2020-01", "2020-02", "2020-03"],
"value": ["a", "b", "c"],
}
)
write_deltalake("test_table", insert_frame, partition_by=[partition_col], mode="overwrite")
merge_frame = pd.DataFrame(
{
merge_col: ["2020-01-01", "2020-03-01"],
partition_col: ["2020-01", "2020-03"],
"value": ["d", "f"],
}
)
dt = DeltaTable("test_table")
predicate = "target._merge_key = source._merge_key AND target.month = source.month"
res = dt.merge(
merge_frame,
predicate,
source_alias="source",
target_alias="target",
).when_matched_update_all().when_not_matched_insert_all().execute()
# Merge should only scan 2 files since only 2 partitions are updated
assert res["num_target_files_scanned"] == 2
assert res["num_target_files_skipped_during_scan"] == 1