
DataFrame from read_parquet is slow after repartition (each new partition reads from many old partitions) #1181

@lhoestq


Describe the issue:

When I run read_parquet on several files, I get a DataFrame without known divisions. If I then increase npartitions with .repartition(), the partitions of the resulting DataFrame are very slow to compute, because each new partition reads from many of the original partitions.

For example, calling head() reads from many files instead of just one. This does not happen when using dask without dask-expr.

Minimal Complete Verifiable Example:

minimal reproducible example:

import fsspec
import pyarrow as pa
import pyarrow.parquet as pq

fs = fsspec.filesystem("memory")
pq.write_table(pa.table({"i": [0]}), "0.parquet", filesystem=fs)
pq.write_table(pa.table({"i": [1]}), "1.parquet", filesystem=fs)

import dask.dataframe as dd
df = dd.read_parquet("memory://*.parquet")
df.npartitions  # 2
df.divisions  # (None, None, None)

# Now calling .head() should only read the first partition, so it should only read the first file
fs.delete("1.parquet")
df.head()  # works
df.repartition(npartitions=5).head()  # fails with dask-expr, and works without dask-expr
# FileNotFoundError
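
To make the expectation concrete, here is a plain-Python sketch of a repartition scheme in which every new partition is cut from exactly one old partition. It is only an illustration of the behavior I expect, not Dask's or dask-expr's actual code; `split_counts` and `source_partition` are hypothetical names, and giving the remainder to the last old partition is an assumption.

```python
def split_counts(old_npartitions, new_npartitions):
    """Number of pieces each old partition is cut into (hypothetical helper)."""
    if new_npartitions < old_npartitions:
        raise ValueError("this sketch only covers increasing npartitions")
    div, mod = divmod(new_npartitions, old_npartitions)
    counts = [div] * old_npartitions
    counts[-1] += mod  # assumption: the last old partition absorbs the remainder
    return counts


def source_partition(old_npartitions, new_npartitions):
    """For each new partition, the index of the single old partition it reads."""
    sources = []
    counts = split_counts(old_npartitions, new_npartitions)
    for old_index, n_pieces in enumerate(counts):
        sources.extend([old_index] * n_pieces)
    return sources


# 2 files repartitioned into 5 partitions, as in the example above
print(source_partition(2, 5))  # [0, 0, 1, 1, 1]
```

Under such a scheme the first new partition depends only on old partition 0, so head() would only need to read 0.parquet and deleting 1.parquet would not matter.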

real-world issue:

import dask.dataframe as dd

df = dd.read_parquet("hf://datasets/HuggingFaceTB/finemath/finemath-3plus/train-*.parquet")  # 128 files
df.npartitions  # 512
df.divisions  # (None,) * 513
df = df.repartition(npartitions=2048)
df.head()  # super slow with dask-expr (downloads data from many many files), fast without dask-expr

Anything else we need to know?:

I need this to work correctly to showcase Dask to the Hugging Face community :)

There are >200k datasets on HF in Parquet format, and making partitions fast again can be a big improvement. E.g. a fast .head() can be a big plus for DX.

Environment:

  • Dask version: 2024.12.1, dask-expr 1.1.21 (also on main with 2024.12.1+8.g4fb8993c and 1.1.21+1.gedb6fd5)
  • Python version: 3.12.2
  • Operating System: macOS
  • Install method (conda, pip, source): pip
