Clean-up indexing adapter classes #10355
base: main
Conversation
Grouping the logic into one method will make it easier to override the behavior in subclasses (interval index) without affecting readability much. It also yields more DRY code.
Prevent the numpy.dtype conversions or castings that were implemented in various places; gather the logic into one method.
Maybe slice PandasIndexingAdapter / CoordinateTransformIndexingAdapter before formatting them as arrays. For PandasIndexingAdapter, this prevents converting a large pd.RangeIndex into an explicit index or array.
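To illustrate the idea (this is not xarray's actual implementation): slicing a pd.RangeIndex yields another lazy RangeIndex, so selecting only the edge items before conversion avoids materializing the full range.

```python
import numpy as np
import pandas as pd

# A large lazy index: no 16 GB buffer is allocated here.
idx = pd.RangeIndex(2_000_000_000)

# Slicing a RangeIndex returns another RangeIndex (still lazy),
# so only the selected edge items get converted to a numpy array.
subset = idx[:3].append(idx[-3:])
arr = np.asarray(subset)  # 6 values instead of 2 billion
```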
xarray/core/formatting.py (outdated):

@@ -651,6 +655,12 @@ def short_array_repr(array):
 def short_data_repr(array):
     """Format "data" for DataArray and Variable."""
     internal_data = getattr(array, "variable", array)._data

     if isinstance(
         internal_data, PandasIndexingAdapter | CoordinateTransformIndexingAdapter
Can't we get away with adding _repr_inline_ to CoordinateTransformIndexingAdapter? I think it's preferable to avoid this kind of special casing.
This function does not format the inline repr but the data of a DataArray or Variable.
We could add a short_data_repr method to those adapter classes and check here whether internal_data has such a method, although that is not much different from this special casing.
I guess a better request is to see whether we can just reuse format_array_flat, which already does indexing and should just work for these classes.
Hmm, again, format_array_flat is for formatting the inline repr, whereas the special case introduced here is for formatting the data repr.
Without this special case, short_data_repr() will convert the indexing adapters into numpy arrays via PandasIndexingAdapter.get_duck_array() and CoordinateTransformIndexingAdapter.get_duck_array() over their full shape / size. For both the inline and the data reprs, we want to select only the first and last relevant items before doing this conversion.
first_n_items() and last_n_items() in xarray.core.formatting do similar things to PandasIndexingAdapter._get_array_subset() and CoordinateTransformIndexingAdapter._get_array_subset(). We could perhaps reuse the two former instead, although for the data repr (at least for CoordinateTransform, and possibly later for pd.IntervalIndex) we don't want a flattened result, so this would involve more refactoring. Also, this wouldn't remove the special case here.
Alternatively, we could tweak Variable._in_memory such that it returns False when ._data is a PandasIndexingAdapter (only when it wraps a pd.RangeIndex) or a CoordinateTransformIndexingAdapter, which would turn their data repr from, e.g.,

>>> xidx = PandasIndex(pd.RangeIndex(2_000_000_000), "x")
>>> ds = xr.Dataset(coords=xr.Coordinates.from_xindex(xidx))
>>> print(ds.x.variable)
<xarray.IndexVariable 'x' (x: 2000000000)> Size: 16GB
array([         0,          1,          2, ..., 1999999997, 1999999998,
       1999999999])

into

>>> print(ds.x.variable)
<xarray.IndexVariable 'x' (x: 2000000000)> Size: 16GB
[2000000000 values with dtype=dtype('int64')]

On one hand I like seeing the actual first and last values of a (lazy) range index or coordinate transform. On the other hand, I find it a bit confusing that it is shown like a plain numpy array.
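The proposed tweak could look roughly like this (a sketch using stand-in classes, not xarray's actual internals):

```python
import pandas as pd


class PandasIndexingAdapter:
    # stand-in for xarray's internal adapter wrapping a pandas index
    def __init__(self, index):
        self.array = index


class CoordinateTransformIndexingAdapter:
    # stand-in for the adapter wrapping a coordinate transform
    pass


def in_memory(data) -> bool:
    # Sketch of the proposed Variable._in_memory logic: report
    # RangeIndex-backed and transform-backed data as NOT in memory,
    # so the repr shows a lazy placeholder instead of coercing to numpy.
    if isinstance(data, CoordinateTransformIndexingAdapter):
        return False
    if isinstance(data, PandasIndexingAdapter):
        return not isinstance(data.array, pd.RangeIndex)
    return True
```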
Alternatively we could tweak Variable._in_memory such that it returns False when ._data is a PandasIndexingAdapter (only when it wraps a pd.RangeIndex) or a CoordinateTransformIndexingAdapter
This is what I ended up refactoring in 0e5154c, for PandasMultiIndex coordinates too.
I think it makes sense to show those coordinate variables as lazy variables rather than showing them as numpy arrays. They aren't numpy arrays, and coercing them like that may trigger a lot of computation and memory allocation.
A possible new source of confusion, however, is that they now reuse the same repr as the other lazy variables loaded from disk. One difference is that the index coordinates will still be lazy even after "loading" a dataset, i.e.,
>>> print(ds.load().x.variable)
<xarray.IndexVariable 'x' (x: 2000000000)> Size: 16GB
[2000000000 values with dtype=dtype('int64')]
I'm not sure what would be best.
You could create another lazy array class that inherits from the current one but overrides __repr__ to show that they're different (and if that later turns out to be wrong, we don't have to go through a deprecation cycle, as all that machinery is internal anyway).
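For example (hypothetical class names; the real lazy-array machinery lives in xarray's internals):

```python
class LazyArrayPlaceholder:
    # stand-in for the existing lazy-array repr behavior
    def __init__(self, size, dtype):
        self.size, self.dtype = size, dtype

    def __repr__(self):
        return f"[{self.size} values with dtype={self.dtype}]"


class LazyIndexArrayPlaceholder(LazyArrayPlaceholder):
    # subclass overriding only __repr__, to distinguish index-backed
    # lazy data from data lazily loaded from disk
    def __repr__(self):
        return f"[{self.size} lazily computed values with dtype={self.dtype}]"
```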
I agree that for index variables, it is useful to show (a) the first and last "n" values and (b) that they are lazy.
That said, I think this PR is an OK "engineering" fix. We should rethink these reprs a bit.
Shall we add a benchmark to test peak memory so that we don't regress here?
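Such a benchmark could be sketched in asv style (assumption: xarray's airspeed-velocity suite; here with a plain pandas stand-in so the snippet is self-contained):

```python
import pandas as pd


class RangeIndexReprMemory:
    # asv convention (assumption): methods prefixed with peakmem_ are
    # measured for peak memory usage rather than execution time.
    def setup(self):
        self.index = pd.RangeIndex(100_000_000)

    def peakmem_repr(self):
        # should stay cheap: formatting must not materialize the full index
        repr(self.index)
```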
whats-new.rst
Mostly cherry-picked from #10296 (cleaner in a separate PR).
This PR also prevents converting a whole pd.RangeIndex or a CoordinateTransform object into a numpy array when showing the array repr. The object is sliced / indexed into a smaller one before doing the conversion. For large indexes this can make a big difference.
However, I'm now wondering how we should format the array (inline) repr for those two cases. It is useful to show some concrete values, but at the same time it is misleading to show them like plain numpy arrays.
It would certainly help to have some visual mark for lazy variables (perhaps next to their size), but that's another topic.