Add bcftools-style filtering #1330
base: main
Conversation
Cool!
# filter to variants where at least one sample has been selected
# note we have to call compute here since xarray indexing with a Dask or Cubed
# boolean array is not supported
return ds.isel(variants=ds.call_mask.any(dim="samples").compute())
What are the implications here for the memory requirements when working on a large dataset?
Good question. It will materialise a 1D boolean array of length #variants into memory, which, unless a restrictive region filter has been applied first, could be very large (potentially covering the whole genome).
This is an area where there is scope to improve - perhaps by making Xarray and the underlying distributed processing engine able to handle this case efficiently, or by using masked arrays in some way?
I think a 1D boolean array of length #variants is fine - it could be improved but it's not terrible. I was worried that it was an O(num_samples) thing.
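To make the discussion above concrete, here is a minimal sketch of the filtering pattern being reviewed, using a small in-memory dataset (the toy data and variable names other than `call_mask` are made up for illustration). With a Dask- or Cubed-backed dataset, the 1D mask would need an explicit `.compute()` before `isel`, which is exactly the length-#variants array that gets materialised:

```python
import numpy as np
import xarray as xr

# Toy call mask: 4 variants x 3 samples (invented data for illustration)
call_mask = xr.DataArray(
    np.array(
        [
            [True, False, False],
            [False, False, False],
            [False, True, True],
            [False, False, False],
        ]
    ),
    dims=("variants", "samples"),
)
ds = xr.Dataset({"call_mask": call_mask})

# 1D boolean array of length #variants -- with a lazy backend this is
# the array that must be computed (materialised) before indexing
mask = ds.call_mask.any(dim="samples")

# Keep only variants where at least one sample is selected
ds_filtered = ds.isel(variants=mask)
print(ds_filtered.sizes["variants"])
```

Note the memory cost is one boolean per variant, independent of the number of samples, which is why the comment above considers it acceptable.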
    max(ds.sizes["variants"], int(chunk_indexes[-1] * variant_chunksize + 1)),
)

ds_sliced = ds.isel(variants=variant_slice)
This works but could be very inefficient. Imagine we just want one small region at the start of the genome and one at the end - this would read all of the intervening chunks even though they are not needed.
What we really want is a way to index by a set of slices (or even chunks) in one go. NumPy and Xarray don't provide such a primitive, but it's something we could perhaps build.
cc-ing @keewis @TomNicholas and @alxmrs as we were discussing this exact issue in yesterday's Pangeo Distributed Computing meeting (in the context of a geospatial application Justus is working on).
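As a rough illustration of the workaround available today (not part of this PR), one can slice each wanted region separately and concatenate the results, so that chunks between the regions are never read; the slice boundaries below are invented for the example. A true multi-slice indexing primitive would avoid even the concatenation overhead:

```python
import numpy as np
import xarray as xr

# Toy dataset: 100 variants with a position coordinate (invented data)
ds = xr.Dataset({"pos": ("variants", np.arange(100))})

# Two small regions: one at the start and one at the end of the "genome"
region_slices = [slice(0, 5), slice(90, 95)]

# Select each region independently, then concatenate along variants;
# with a chunked backend, only the chunks overlapping the slices are read
ds_regions = xr.concat(
    [ds.isel(variants=s) for s in region_slices], dim="variants"
)
print(ds_regions.sizes["variants"])
```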
I've opened pydata/xarray#10479 to discuss this for xarray
Looks like the build is failing on Python 3.12 due to pyranges/sorted_nearest#10 from pyranges. Not sure why vcztools CI is working on Python 3.12 though.