
Merge touches more partitions than necessary? #3432

Closed
@halvorlu


Environment

Delta-rs version: 0.25.5

Binding: python

Environment: Linux


Bug

What happened:
Merge seems to scan files/partitions that should be filtered out by the predicate.
The merge operation returns num_target_files_scanned=3 when I expect 2,
unless I have misunderstood what this metric means.

What you expected to happen:
The merge operation should return num_target_files_scanned=2, because the merge only touches two partitions (with one file each). This works as expected in deltalake version 0.24.0.

How to reproduce it:

import pandas as pd
from deltalake import write_deltalake, DeltaTable
merge_col = "_merge_key"
partition_col = "month"
insert_frame = pd.DataFrame(
    {
        merge_col: ["2020-01-01", "2020-02-01", "2020-03-01"],
        partition_col: ["2020-01", "2020-02", "2020-03"],
        "value": ["a", "b", "c"],
    }
)
write_deltalake("test_table", insert_frame, partition_by=[partition_col], mode="overwrite")
merge_frame = pd.DataFrame(
    {
        merge_col: ["2020-01-01", "2020-03-01"],
        partition_col: ["2020-01", "2020-03"],
        "value": ["d", "f"],
    }
)
dt = DeltaTable("test_table")
predicate = "target._merge_key = source._merge_key AND target.month = source.month"
res = dt.merge(
    merge_frame,
    predicate,
    source_alias="source",
    target_alias="target",
).when_matched_update_all().when_not_matched_insert_all().execute()
# Merge should only scan 2 files since only 2 partitions are updated
assert res["num_target_files_scanned"] == 2
assert res["num_target_files_skipped_during_scan"] == 1
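
To make the expectation behind those two assertions explicit, here is a minimal sketch that does not use deltalake at all. It assumes one file per partition and that the merge can prune target files by the partition values present in the source (because the predicate includes target.month = source.month). The TargetFile class, the expected_scan_metrics helper and the file paths are hypothetical and only illustrate the counting; they are not delta-rs internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetFile:
    """Hypothetical stand-in for one data file in a partitioned table."""
    path: str
    partition_value: str


def expected_scan_metrics(target_files, source_partition_values):
    """Count the files a partition-pruned merge would scan versus skip.

    Assumption: the join predicate constrains the partition column, so only
    files whose partition value occurs in the source can contain matches.
    """
    wanted = set(source_partition_values)
    scanned = [f for f in target_files if f.partition_value in wanted]
    return {
        "num_target_files_scanned": len(scanned),
        "num_target_files_skipped_during_scan": len(target_files) - len(scanned),
    }


# Mirrors the reproduction: three partitions with one file each,
# and a source frame that only contains rows for two of them.
target_files = [
    TargetFile("month=2020-01/part-0.parquet", "2020-01"),  # illustrative path
    TargetFile("month=2020-02/part-0.parquet", "2020-02"),  # illustrative path
    TargetFile("month=2020-03/part-0.parquet", "2020-03"),  # illustrative path
]
metrics = expected_scan_metrics(target_files, ["2020-01", "2020-03"])
print(metrics)
```

Under these assumptions the 2020-02 file is never a candidate for a match, which is why I expect it to be counted as skipped rather than scanned.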

Labels: bug

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions