Description
Describe the issue:
When I run read_parquet on several files, I get a DataFrame without divisions. The issue is: if I increase npartitions with .repartition(), the resulting DataFrame has very slow partitions.
For example, calling head() starts reading from many files instead of one. This is not the case when using dask without dask-expr.
Minimal Complete Verifiable Example:
minimal reproducible example:
import fsspec
import pyarrow as pa
import pyarrow.parquet as pq
fs = fsspec.filesystem("memory")
pq.write_table(pa.table({"i": [0]}), "0.parquet", filesystem=fs)
pq.write_table(pa.table({"i": [1]}), "1.parquet", filesystem=fs)
import dask.dataframe as dd
df = dd.read_parquet("memory://*.parquet")
df.npartitions # 2
df.divisions # (None, None, None)
# Now calling .head() should only read the first partition, so it should only read the first file
fs.delete("1.parquet")
df.head() # works
df.repartition(npartitions=5).head() # fails with dask-expr, and works without dask-expr
# FileNotFoundError
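To make the expectation concrete, here is a small pure-Python model of the mapping I would expect when increasing npartitions without known divisions. It is only a sketch under an assumption: split_plan is a hypothetical helper, not Dask's or dask-expr's actual implementation. It assumes each input partition is cut into equal pieces and the last input absorbs the remainder, so every output partition depends on exactly one input partition (and therefore one file).

```python
def split_plan(n_in, n_out):
    """Map each output partition to the single input partition it is cut from.

    Hypothetical model, not Dask's implementation: every input partition is
    split into n_out // n_in pieces and the last input absorbs the remainder.
    """
    if n_out < n_in:
        raise ValueError("this sketch only covers increasing npartitions")
    pieces, remainder = divmod(n_out, n_in)
    counts = [pieces] * n_in
    counts[-1] += remainder
    plan = []
    for source, count in enumerate(counts):
        # Each of these output partitions is a slice of one input partition.
        plan.extend([source] * count)
    return plan


plan = split_plan(2, 5)
print(plan)  # [0, 0, 1, 1, 1]
```

Under this model the first output partition only depends on input partition 0, so head() after repartition(npartitions=5) would only need 0.parquet, which matches the behaviour without dask-expr.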
real world issue
import dask.dataframe as dd
df = dd.read_parquet("hf://datasets/HuggingFaceTB/finemath/finemath-3plus/train-*.parquet") # 128 files
df.npartitions # 512
df.divisions # (None,) * 513
df = df.repartition(npartitions=2048)
df.head() # super slow with dask-expr (downloads data from many many files), fast without dask-expr
Anything else we need to know?:
I need this to work correctly to showcase Dask to the Hugging Face community :)
There are >200k datasets on HF in Parquet format, and making partitions fast again can be a big improvement. E.g. a fast .head() can be a big plus for DX.
Environment:
- Dask version: 2024.12.1, dask-expr 1.1.21 (also on main with 2024.12.1+8.g4fb8993c and 1.1.21+1.gedb6fd5)
- Python version: 3.12.2
- Operating System: macOS
- Install method (conda, pip, source): pip