Skip to content

Add bcftools-style filtering #1330

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

tomwhite
Copy link
Collaborator

@tomwhite tomwhite commented Jul 1, 2025

Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool!

# filter to variants where at least one sample has been selected
# note we have to call compute here since xarray indexing with a Dask or Cubed
# boolean array is not supported
return ds.isel(variants=ds.call_mask.any(dim="samples").compute())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the implications here for the memory requirements when working on a large dataset?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. It will materialise a 1D boolean array of length #variants into memory, which unless a restrictive region filter has been applied first could be very large (potentially the whole genome).

This is an area where there is scope to improve - perhaps by making Xarray and the underlying distributed processing engine able to handle this case efficiently, or by using masked arrays in some way?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a 1D boolean array length #variants is fine - it could be improved but it's not terrible. I was worried that it was a O(num_samples) thing.

max(ds.sizes["variants"], int(chunk_indexes[-1] * variant_chunksize + 1)),
)

ds_sliced = ds.isel(variants=variant_slice)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works but could be very inefficient. Imagine we just want one small region at the start of the genome and one at the end - this would read all of the intervening chunks even though they are not needed.

What we really want is a way to index by a set of slices (or even chunks) in one go. NumPy and Xarray don't provide such a primitive, but it's something we could perhaps build.

cc-ing @keewis @TomNicholas and @alxmrs as we were discussing this exact issue in yesterday's Pangeo Distributed Computing meeting (in the context of a geospatial application Justus is working on).

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened pydata/xarray#10479 to discuss this for xarray

@tomwhite
Copy link
Collaborator Author

tomwhite commented Jul 1, 2025

Looks like the build is failing on Python 3.12 due to pyranges/sorted_nearest#10 from pyranges. Not sure why vcztools CI is working on Python 3.12 though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support bcftools-style filtering
3 participants