Echo Unmapped VID Investigation [VS-1671] #9202

Open: wants to merge 53 commits into base branch ah_var_store

@mcovarr (Collaborator) commented Jun 5, 2025

Assorted artifacts of the Echo unmapped VID investigation.

  • Happy integration run with the updated Docker.

  • BQ tables

    • pseudo_vid_mapping links vids in pseudo_vids_only_in_vat to non-left-aligned alleles that appear in the sites-only VCF used for making the VAT.
    • pseudo_vid_sample_id links the non-left-aligned alleles in pseudo_vid_mapping to samples containing these alleles.
    • pseudo_vid_gvcf_content includes reblocked and unreblocked gVCF paths and file content for non-left-aligned alleles on a per-sample basis.
  • Docs

    • For devs in README.md
    • For scientific folk in ECHO_VID_INVESTIGATION.md
  • Technical artifacts

    • Claude prompts
    • Claude-generated Python scripts
    • Updated Variants Docker with GCS-enabled bcftools
    • WDL that uses said bcftools to read selected lines from 30K+ VCFs without localization
    • Updated Dockerfiles
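
The bcftools-based lookup listed above can be sketched roughly as follows. This is an illustration only, not the PR's `generate_bcftools_commands.py`: it assumes a VID has the hypothetical form `contig-position-ref-alt`, and the function names and the `gs://` path are made up. The only bcftools features it relies on are `bcftools view` with `-H` (suppress the header) and `-r` (restrict to a region).

```python
import shlex
from typing import NamedTuple


class Vid(NamedTuple):
    contig: str
    position: int
    ref: str
    alt: str


def parse_vid(vid: str) -> Vid:
    """Split an ASSUMED 'contig-position-ref-alt' identifier into its parts.

    Contig names that themselves contain '-' are not handled by this sketch.
    """
    contig, position, ref, alt = vid.split("-")
    return Vid(contig, int(position), ref, alt)


def bcftools_view_command(vid: str, vcf_path: str) -> str:
    """Build a bcftools command that prints records at the VID's position."""
    v = parse_vid(vid)
    # Query only the VID's own position; a real search would likely widen this.
    region = f"{v.contig}:{v.position}-{v.position}"
    parts = ["bcftools", "view", "-H", "-r", region, vcf_path]
    # Quote each argument so the command is safe to paste into a shell.
    return " ".join(shlex.quote(p) for p in parts)


if __name__ == "__main__":
    # Hypothetical VID and bucket path, for illustration only.
    print(bcftools_view_command("chr1-12345-A-AT", "gs://example-bucket/sample.g.vcf.gz"))
```

Region queries require an indexed VCF. Reading directly from a `gs://` path also depends on a bcftools/htslib build with GCS support, which is what the updated Docker image in this PR is described as providing.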

@mcovarr mcovarr force-pushed the vs_1671_vat_discrepancies branch from d87e3e1 to 0b4f7f7 Compare June 6, 2025 19:11
@mcovarr mcovarr changed the title from "VAT Discrepancies [VS-1671]" to "Echo Unmapped VID Investigation [VS-1671]" Jun 10, 2025
@mcovarr mcovarr requested a review from Copilot June 10, 2025 13:41
@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds the investigation artifacts for Echo unmapped VIDs by introducing new scripts, workflows, prompts, documentation, and updated Docker setups to support extracting and merging JSON/TSV from gVCFs.

  • Bumps the variants_docker image tag and adds a new MergeJSONs WDL task to combine JSON shards.
  • Introduces Python scripts and Claude prompts for generating bcftools commands, filtering VCFs by VIDs, comparing VCFs, and processing gVCF variants.
  • Updates Dockerfiles (base and final) to build and include GCS-enabled htslib/bcftools/vcftools and exposes the new workflow in .dockstore.yml.

Reviewed Changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| scripts/variantstore/wdl/GvsUtils.wdl | Updated variants_docker tag; added MergeJSONs task to join JSON shards and emit TSV. |
| scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/read_gvcfs.prompt | New Claude prompt for reading gVCFs and reblocked gVCFs. |
| scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/generate_bcftools_commands.py | New script to parse VIDs and emit bcftools view commands. |
| scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/generate_bcftools_commands.prompt | New prompt describing VID-based bcftools command generation. |
| scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/filter_vcf_by_vids.py | New script to filter a VCF by matching VIDs. |
| scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/filter_vcf_by_vids.prompt | New prompt outlining the VCF filtering requirements. |
| scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/compare_vcfs.py | New script to compare input vs. left-aligned VCF variants. |
| scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/compare_vcfs.prompt | New prompt describing the VCF comparison and output format. |
| scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/SearchGVCFsForUnmappedVIDs.wdl | New WDL workflow to query BigQuery, read gVCFs, merge outputs, and upload content. |
| scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/README.md | Documentation for the pseudo-VID resolution procedure. |
| scripts/variantstore/variant-annotations-table/pseudo_vids_only_in_vat/ECHO_VID_INVESTIGATION.md | Detailed scientific-facing investigation write-up. |
| scripts/variantstore/scripts/reorder_gvcf_content_cols.py | New helper to reorder TSV columns before BigQuery load. |
| scripts/variantstore/scripts/process_gvcf_variants.py | New script to query gVCFs with bcftools and emit enriched JSON. |
| scripts/variantstore/scripts/build_base.Dockerfile | Updated to build and install htslib, bcftools, and vcftools; cleans up artifacts. |
| scripts/variantstore/scripts/Dockerfile | Updated final image to include htslib libraries and set LD_LIBRARY_PATH. |
| .dockstore.yml | Exposes the new vs_1671_vat_discrepancies branch and SearchGVCFsForUnmappedVIDs workflow. |
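
For orientation, the "join JSON shards and emit TSV" step attributed to MergeJSONs might look roughly like the sketch below. It is not the task's actual implementation. It assumes each shard is a JSON array of flat objects, and the function name and field names are hypothetical.

```python
import csv
import io
import json
from typing import Dict, Iterable, List


def merge_json_shards_to_tsv(shard_texts: Iterable[str]) -> str:
    """Concatenate JSON-array shards and render the rows as one TSV string."""
    rows: List[Dict[str, object]] = []
    for text in shard_texts:
        # ASSUMPTION: every shard parses to a list of flat dictionaries.
        rows.extend(json.loads(text))

    # Use first-seen key order across all rows as the column order.
    columns: List[str] = []
    for row in rows:
        for key in row:
            if key not in columns:
                columns.append(key)

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=columns, delimiter="\t", lineterminator="\n", restval=""
    )
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a row are written as empty cells
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical shard contents, for illustration only.
    shards = [
        '[{"sample_id": 1, "vid": "chr1-5-A-T"}]',
        '[{"sample_id": 2, "vid": "chr1-9-G-C", "path": "x"}]',
    ]
    print(merge_json_shards_to_tsv(shards), end="")
```

Deriving the header from the union of keys keeps rows aligned when shards carry different fields. A fixed, explicit column list would be safer before a BigQuery load, which is presumably why the PR also includes a column-reordering helper.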
Comments suppressed due to low confidence (2)

scripts/variantstore/wdl/GvsUtils.wdl:130

  • Typo in comment: 'handlful' should be 'handful'.
# there are a handlful of tasks that require the larger GNU libc-based `slim`.

scripts/variantstore/scripts/reorder_gvcf_content_cols.py:12

  • The comment indicates JSON input, but this script reads a TSV file. Please update the comment to reflect that it processes TSV data.
# Load input JSON

else:
    search_range = 200

start = position + 1

Copilot AI Jun 10, 2025

The range calculation only searches downstream of the variant. To cover the intended ±range around the position, 'start' should be computed as 'position - search_range' (or adjusted per spec).

Suggested change
- start = position + 1
+ start = max(1, position - search_range)
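
A minimal sketch of the symmetric window the suggestion describes, for comparison with the downstream-only version. The function name is hypothetical, and computing the end as `position + search_range` is an assumption, since the quoted snippet does not show how the end coordinate is derived.

```python
from typing import Tuple


def search_window(position: int, search_range: int) -> Tuple[int, int]:
    """Return a 1-based, inclusive (start, end) window centred on position."""
    # Clamp at 1 because VCF coordinates are 1-based; no position 0 or below.
    start = max(1, position - search_range)
    # ASSUMPTION: the downstream bound mirrors the upstream one.
    end = position + search_range
    return start, end


if __name__ == "__main__":
    print(search_window(1000, 200))  # window spans both sides of the variant
    print(search_window(50, 200))    # start is clamped near the contig start
```

Whether a symmetric window is actually wanted depends on the investigation's intent. If the alleles being searched for can only lie on one side of the left-aligned position, a one-sided range may be deliberate.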

@mcovarr mcovarr marked this pull request as ready for review June 10, 2025 14:32