Echo Unmapped VID Investigation [VS-1671] #9202
base: ah_var_store
d87e3e1 to 0b4f7f7
Pull Request Overview
This PR adds the investigation artifacts for Echo unmapped VIDs by introducing new scripts, workflows, prompts, documentation, and updated Docker setups to support extracting and merging JSON/TSV from gVCFs.
- Bumps the `variants_docker` image tag and adds a new `MergeJSONs` WDL task to combine JSON shards.
- Introduces Python scripts and Claude prompts for generating bcftools commands, filtering VCFs by VIDs, comparing VCFs, and processing gVCF variants.
- Updates Dockerfiles (base and final) to build and include GCS-enabled htslib/bcftools/vcftools and exposes the new workflow in `.dockstore.yml`.
Reviewed Changes
Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
scripts/variantstore/wdl/GvsUtils.wdl | Updated variants_docker tag; added MergeJSONs task to join JSON shards and emit TSV. |
scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/read_gvcfs.prompt | New Claude prompt for reading gVCFs and reblocked gVCFs. |
scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/generate_bcftools_commands.py | New script to parse VIDs and emit bcftools view commands. |
scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/generate_bcftools_commands.prompt | New prompt describing VID-based bcftools command generation. |
scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/filter_vcf_by_vids.py | New script to filter a VCF by matching VIDs. |
scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/filter_vcf_by_vids.prompt | New prompt outlining the VCF filtering requirements. |
scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/compare_vcfs.py | New script to compare input vs. left-aligned VCF variants. |
scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/compare_vcfs.prompt | New prompt describing the VCF comparison and output format. |
scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/SearchGVCFsForUnmappedVIDs.wdl | New WDL workflow to query BigQuery, read gVCFs, merge outputs, and upload content. |
scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/README.md | Documentation for the pseudo-VID resolution procedure. |
scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/ECHO_VID_INVESTIGATION.md | Detailed scientific-facing investigation write-up. |
scripts/variantstore/scripts/reorder_gvcf_content_cols.py | New helper to reorder TSV columns before BigQuery load. |
scripts/variantstore/scripts/process_gvcf_variants.py | New script to query gVCFs with bcftools and emit enriched JSON. |
scripts/variantstore/scripts/build_base.Dockerfile | Updated to build and install htslib, bcftools, and vcftools; cleans up artifacts. |
scripts/variantstore/scripts/Dockerfile | Updated final image to include htslib libraries and set LD_LIBRARY_PATH. |
.dockstore.yml | Exposes the new vs_1671_vat_discrepancies branch and SearchGVCFsForUnmappedVIDs workflow. |
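
The GvsUtils.wdl row above describes MergeJSONs as joining JSON shards and emitting a TSV. As a rough illustration of that kind of merge, and not the task's actual implementation, here is a minimal Python sketch. It assumes each shard is a JSON array of flat objects; the function name `merge_json_shards` and the column handling are hypothetical.

```python
import csv
import json
from pathlib import Path
from typing import Iterable, List


def merge_json_shards(shard_paths: Iterable[str], tsv_path: str) -> int:
    """Concatenate JSON-array shards and write them as one TSV.

    Assumption: every shard holds a JSON array of flat objects.
    Returns the number of data rows written.
    """
    rows: List[dict] = []
    # Sort the paths so the output order does not depend on shard arrival order.
    for path in sorted(shard_paths):
        with open(path) as handle:
            rows.extend(json.load(handle))

    # Build the header as the union of keys, in first-seen order.
    columns: List[str] = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    with open(tsv_path, "w", newline="") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=columns,
            delimiter="\t",
            restval="",  # keys missing from a row become empty cells
            lineterminator="\n",
        )
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Taking the union of keys keeps a shard with an extra or missing field from breaking the merge; a fixed column order, if the BigQuery load needs one, would have to be applied separately.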
Comments suppressed due to low confidence (2)
scripts/variantstore/wdl/GvsUtils.wdl:130
- Typo in comment: 'handlful' should be 'handful'.
# there are a handlful of tasks that require the larger GNU libc-based `slim`.
scripts/variantstore/scripts/reorder_gvcf_content_cols.py:12
- The comment indicates JSON input, but this script reads a TSV file. Please update the comment to reflect that it processes TSV data.
# Load input JSON
    else:
        search_range = 200

    start = position + 1
The range calculation only searches downstream of the variant. To cover the intended ±range around the position, 'start' should be computed as 'position - search_range' (or adjusted per spec).
Suggested change:
- start = position + 1
+ start = max(1, position - search_range)
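
To make the suggestion concrete, here is a minimal sketch of a symmetric search window. The helper names `search_window` and `region_string` are hypothetical, and the `end` computation is an assumption, since the original diff only shows how `start` is derived.

```python
from typing import Tuple


def search_window(position: int, search_range: int = 200) -> Tuple[int, int]:
    """Return a 1-based, inclusive (start, end) window around position.

    The start is clamped to 1 so variants near the beginning of a
    contig do not yield a zero or negative coordinate.
    """
    start = max(1, position - search_range)
    end = position + search_range  # assumption: the window is symmetric
    return start, end


def region_string(contig: str, position: int, search_range: int = 200) -> str:
    """Format the window as a contig:start-end region string."""
    start, end = search_window(position, search_range)
    return f"{contig}:{start}-{end}"
```

With the default range, `region_string("chr1", 1000)` gives `chr1:800-1200`, whereas the original `start = position + 1` would begin the window at 1001 and miss anything upstream of the variant.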
Assorted artifacts of the Echo unmapped VID investigation.
Happy integration run with the updated Docker.
BQ tables
- `pseudo_vid_mapping` links vids in `pseudo_vids_only_in_vat` to non-left-aligned alleles that appear in the sites-only VCF used for making the VAT.
- `pseudo_vid_sample_id` links the non-left-aligned alleles in `pseudo_vid_mapping` to samples containing these alleles.
- `pseudo_vid_gvcf_content` includes reblocked and unreblocked gVCF paths and file content for non-left-aligned alleles on a per-sample basis.
Docs
- README.md
- ECHO_VID_INVESTIGATION.md
Technical artifacts