DataFrame from read_parquet is slow after repartition (each new partition reads from many old partitions)

**Describe the issue**:

When I run `read_parquet` on several files, I get a DataFrame without `divisions`. The issue is: if I increase `npartitions` with `.repartition()`, the resulting DataFrame has very slow partitions.

For example, calling `head()` starts reading from many files instead of one. This is not the case when using `dask` without `dask-expr`

**Minimal Complete Verifiable Example**:

minimal reproducible example:

```python
import fsspec
import pyarrow as pa
import pyarrow.parquet as pq

fs = fsspec.filesystem("memory")
pq.write_table(pa.table({"i": [0]}), f"0.parquet", filesystem=fs)
pq.write_table(pa.table({"i": [1]}), f"1.parquet", filesystem=fs)

import dask.dataframe as dd
df = dd.read_parquet("memory://*.parquet")
df.npartitions  # 2
df.divisions  # (None, None, None)

# Now calling .head() should only read the first partition, so it should only read the first file
fs.delete("1.parquet")
df.head()  # works
df.repartition(npartitions=5).head()  # fails with dask-expr, and works without dask-expr
# FileNotFoundError
```

real world issue

```python
import dask.dataframe as dd

df = dd.read_parquet("hf://datasets/HuggingFaceTB/finemath/finemath-3plus/train-*.parquet")  # 128 files
df.npartitions  # 512
df.divisions  # (None,) * 513
df = df.repartition(npartitions=2048)
df.head()  # super slow with dask-expr (downloads data from many many files), fast without dask-expr
```

**Anything else we need to know?**:

I need this to work correctly to showcase Dask to the Hugging Face community :)

There are >200k datasets on HF in Parquet format, and making partitions fast again can be a big improvement. E.g. a fast `.head()`  can be a big plus for DX.

**Environment**:

- Dask version: 2024.12.1, dask-expr 1.1.21 (also on `main` with 2024.12.1+8.g4fb8993c and 1.1.21+1.gedb6fd5)
- Python version: 3.12.2
- Operating System: MacOS
- Install method (conda, pip, source): pip

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

DataFrame from read_parquet is slow after repartition (each new partition reads from many old partitions) #1181

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

DataFrame from read_parquet is slow after repartition (each new partition reads from many old partitions) #1181

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions