Skip to content

Add NDPointIndex (KDTree) #10478

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 20 commits into
base: main
Choose a base branch
from
Open

Add NDPointIndex (KDTree) #10478

wants to merge 20 commits into from

Conversation

benbovy
Copy link
Member

@benbovy benbovy commented Jul 1, 2025

This PR adds an xarray.indexes.TreeIndex for point-wise label-based indexing of irregular data (n-dimensional coordinates). By default it is using scipy.spatial.KDTree under the hood.

It is heavily inspired from xoak. Actually it could eventually replace it, or alternatively xoak could be built on top of this. There's no support for (dask) chunked coordinates here, although in xoak that feature is experimental and not working well anyway.

TreeIndex shares some similarities with CoordinateTransformIndex in its implementation and supported features:

  • Like for CoordinateTransform, there's a small TreeAdapter abstraction layer that allows plugging in alternative index objects like sklearn.neighbors.KDTree, sklearn.neighbors.BallTree, etc. We can keep that private for now and publicly expose it later.
  • There's some rudimentary support for (exact) alignment. Comparing scipy.spatial.KDTree objects is done here based on the fact that they hold a view to their original data points (i.e., that's basically equivalent to comparing the associated coordinate variables).
  • Only advanced label-based selection (i.e., point-wise selection using DataArray or Variable objects) is supported.

@dcherian
Copy link
Contributor

dcherian commented Jul 1, 2025

Nice! Would be good to add a section to this gallery example

@benbovy
Copy link
Member Author

benbovy commented Jul 2, 2025

Would be good to add a section to this gallery example

While I agree that it would be nice to showcase TreeIndex in the example gallery, I'm not sure that the specific ROMS ocean model example (based on lat/lon 2D coordinates) will be relevant with a (default) KDTree and no longitude wraparound. It will make more sense to use something like xoak's sklearn_geo_balltree or s2point.

@benbovy
Copy link
Member Author

benbovy commented Jul 2, 2025

This is ready for review.

I can do one pass over the doc's user guide and examples to see if we can use or mention this. I can also do that in a follow-up PR.

Any better idea for the index name than TreeIndex? I'm not a big fan of it. NDPointIndex? KDPointIndex?

@benbovy
Copy link
Member Author

benbovy commented Jul 2, 2025

We can also see later if there's a need to implement TreeIndex.concat, TreeIndex.stack, TreeIndex.isel... and if yes how best to do so. One concern I have with automatically re-building and propagating a TreeIndex is that building such index may be expensive (likely more than other kinds of indexes).

benbovy added 2 commits July 2, 2025 15:38
Display the name of the wrapper index adapter type.
@benbovy benbovy changed the title Add (KD)TreeIndex Add NDPointIndex (KDTree) Jul 2, 2025
@benbovy
Copy link
Member Author

benbovy commented Jul 2, 2025

Any better idea for the index name than TreeIndex? I'm not a big fan of it. NDPointIndex? KDPointIndex?

I renamed it to NDPointIndex that I like better.

The index inline repr now displays the underlying structure:

    <xarray.Dataset> Size: 64B
    Dimensions:  (y: 2, x: 2)
    Coordinates:
      * xx       (y, x) float64 32B 1.0 2.0 3.0 0.0
      * yy       (y, x) float64 32B 11.0 21.0 29.0 9.0
    Dimensions without coordinates: y, x
    Data variables:
        *empty*
    Indexes:
      ┌ xx       NDPointIndex (ScipyKDTreeAdapter)
      └ yy

@dcherian
Copy link
Contributor

dcherian commented Jul 3, 2025

Very nice! At the meeting, there was uniform support for merging this.

@benbovy
Copy link
Member Author

benbovy commented Jul 3, 2025

A few ideas for further improvements (I'll reference this comment in a new issue):

  • Alignment with tolerance

    • This is a good case
    • Until Xarray's API allows providing a tolerance value when aligning objects, we could allow providing it as an option at index creation.
    • Picking a tolerance value among the ones set for each index to align would depend on the join method, e.g., pick the smallest value for "exact" and "inner" and pick the largest value for "outer".
    • For tolerance > 0, we could rely Scipy KDTree's count_neighbors. For tolerance == 0 we could just compare the data points as implemented in this PR.
  • Join and re-index

    • I guess it is possible to support it via querying all matching point pairs within the given tolerance.
    • Raise an error if tolerance == 0 ?
    • We could rely Scipy KDTree's query(..., distance_upper_bound=tolerance). query_ball_tree() could also be a good candidate but it returns nested Python lists that might be slow to process.
  • Positional indexing (isel)

    • Currently (this PR) the index is dropped after isel, but we could create a new one (from-scratch) instead.
    • Implementation should be pretty straightforward, we can transform the input dimension indexers into a flat indexer for the (n_points, n_coordinates) point data array via numpy.ravel_multi_index.
    • Perhaps make this opt-in via an option at index creation ?
  • Concat

    • I guess numpy provides all the utilities needed to implement it in a clever way so that the concatenated (n_points, n_coordinates) point data array stay consistent with the location of the concat dimension dim in the original coordinate variables.
  • Stack

    • Would it be a useful alternative to pandas.MultiIndex for floating-point coordinates, even though the original distribution of the data points is regular?
    • Implementation is trivial
    • Cannot support unstack (it only works with pandas.MultiIndex)
  • Interpolation

    • This is a good case for n-dimensional nearest neighbor, linear, IDW, etc. interpolation.
    • We'd need to figure out first how to plug custom indexes with Xarray's interp API.
  • GroupBy

    • It would be great to have specialized grouper objects like KNearestNeighborsGrouper(points: xr.DataArray, k: int) and BallNeigborsGrouper(points: xr.DataArray, radius: float) (@dcherian).
  • neighbors accessor?

    • It would be nice to have some Xarray-friendly API to, e.g., return the distances to the nearest neighbors, select k-nearest neighbors, etc.

@benbovy
Copy link
Member Author

benbovy commented Jul 3, 2025

Integration with xoak: xarray-contrib/xoak#44

Copy link
Contributor

@dcherian dcherian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excited to see this in.

The tests look good to me. Later we should create some kind of integration test suite for custom indexes that tests all the various flavors of indexing Xarray supports.

@benbovy benbovy added the plan to merge Final call for comments label Jul 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Selecting point closest to (lon0, lat0) when lon,lat coordinates are 2D API design for pointwise indexing
3 participants