Add `bcftools`-style filtering #1330

tomwhite · 2025-07-01T09:51:31Z

- Fixes #1329 (Support bcftools-style filtering)
- Tests added
- User visible changes (including notable bug fixes) are documented in changelog.rst
- New functions are listed in api.rst

jeromekelleher

Cool!

jeromekelleher · 2025-07-01T10:58:30Z

sgkit/filtering.py

+    # filter to variants where at least one sample has been selected
+    # note we have to call compute here since xarray indexing with a Dask or Cubed
+    # boolean array is not supported
+    return ds.isel(variants=ds.call_mask.any(dim="samples").compute())


What are the implications here for the memory requirements when working on a large dataset?

Good question. It materialises a 1D boolean array of length #variants in memory. Unless a restrictive region filter has been applied first, that array could be very large (potentially covering the whole genome).

This is an area where there is scope to improve - perhaps by making Xarray and the underlying distributed processing engine able to handle this case efficiently, or by using masked arrays in some way?

I think a 1D boolean array of length #variants is fine; it could be improved, but it's not terrible. I was worried that it was an O(num_samples) thing.
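
For a rough sense of scale, here is a minimal NumPy-only sketch of the reduction discussed above. It assumes the mask is held as a plain NumPy boolean array (one byte per element) after `.compute()`. The helper name `variant_mask_nbytes` and the dataset sizes are hypothetical and not part of sgkit.

```python
import numpy as np


def variant_mask_nbytes(n_variants: int) -> int:
    """Bytes needed for a 1D NumPy boolean mask with one element per variant."""
    return n_variants * np.dtype(bool).itemsize


# Toy stand-in for call_mask with dimensions (variants, samples).
rng = np.random.default_rng(0)
call_mask = rng.random((1000, 50)) < 0.001

# Reducing over samples leaves one boolean per variant, so the size of the
# materialised array does not depend on the number of samples.
variant_mask = call_mask.any(axis=1)
assert variant_mask.shape == (1000,)
assert variant_mask.nbytes == variant_mask_nbytes(1000)

# Hypothetical whole-genome scale: 100 million variants.
print(variant_mask_nbytes(100_000_000) / 1e6, "MB")  # 100.0 MB
```

Under that assumption the cost grows linearly with the number of variants only, which matches the point made in the thread.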

tomwhite · 2025-07-01T13:10:06Z

sgkit/filtering.py

+        max(ds.sizes["variants"], int(chunk_indexes[-1] * variant_chunksize + 1)),
+    )
+
+    ds_sliced = ds.isel(variants=variant_slice)


This works but could be very inefficient. Imagine we just want one small region at the start of the genome and one at the end - this would read all of the intervening chunks even though they are not needed.

What we really want is a way to index by a set of slices (or even chunks) in one go. NumPy and Xarray don't provide such a primitive, but it's something we could perhaps build.

cc-ing @keewis @TomNicholas and @alxmrs as we were discussing this exact issue in yesterday's Pangeo Distributed Computing meeting (in the context of a geospatial application Justus is working on).

I've opened pydata/xarray#10479 to discuss this for xarray.
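
To illustrate the inefficiency described in this thread, here is a small NumPy-only sketch comparing the chunks a single covering slice spans with the chunks the requested regions actually touch. The helpers `chunks_touched` and `take_slices`, the chunk size, and the region values are all hypothetical; this is not the PR's implementation and not an existing NumPy or Xarray primitive.

```python
import numpy as np


def chunks_touched(slices, chunk_size):
    """Return the sorted indexes of chunks that overlap any half-open slice."""
    touched = set()
    for s in slices:
        if s.stop > s.start:
            first = s.start // chunk_size
            last = (s.stop - 1) // chunk_size
            touched.update(range(first, last + 1))
    return sorted(touched)


def take_slices(arr, slices):
    """Select several slices along axis 0 and concatenate the pieces."""
    return np.concatenate([arr[s] for s in slices], axis=0)


n_variants, chunk_size = 1_000, 100
positions = np.arange(n_variants)

# One small region at the start and one at the end.
regions = [slice(5, 20), slice(980, 995)]

needed = chunks_touched(regions, chunk_size)      # chunks the regions overlap
spanned = list(range(needed[0], needed[-1] + 1))  # chunks one covering slice reads
selected = take_slices(positions, regions)

print(len(needed), "chunks needed vs", len(spanned), "chunks spanned")
```

With Xarray, one possible workaround in the same spirit would be `xr.concat([ds.isel(variants=s) for s in regions], dim="variants")`, though whether that is efficient for a given backend is an open question that the linked xarray issue is meant to discuss.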

tomwhite · 2025-07-01T13:45:19Z

Looks like the build is failing on Python 3.12 due to pyranges/sorted_nearest#10 from pyranges. Not sure why vcztools CI is working on Python 3.12 though.

tomwhite added 2 commits July 1, 2025 10:39

bcftools-style filtering

910c83b

Add vcztools dependency

2ba601e

tomwhite mentioned this pull request Jul 1, 2025

How to make a small zarr file from a large VCF? sgkit-dev/bio2zarr#365

Open

keewis mentioned this pull request Jul 1, 2025

indexing by a list of slices pydata/xarray#10479

Open
