
Commit 34ed02e

Changes after creating 190K Exome Callset [VS-189] (#8459)

1 parent 7155e0c · commit 34ed02e

5 files changed (+73, −71 lines)

.dockstore.yml

Lines changed: 0 additions & 4 deletions
@@ -127,8 +127,6 @@ workflows:
     branches:
     - master
     - ah_var_store
-    - bulk_ingest_staging
-    - vs_1032_beta_wdl_pin
 - name: GvsPrepareRangesCallset
   subclass: WDL
   primaryDescriptorPath: /scripts/variantstore/wdl/GvsPrepareRangesCallset.wdl
@@ -209,7 +207,6 @@ workflows:
     branches:
     - master
     - ah_var_store
-    - vs_1014_exome_warp_ps
 - name: GvsQuickstartVcfIntegration
   subclass: WDL
   primaryDescriptorPath: /scripts/variantstore/wdl/GvsQuickstartVcfIntegration.wdl
@@ -231,7 +228,6 @@ workflows:
     branches:
     - master
     - ah_var_store
-    - vs_1032_beta_wdl_pin
 - name: GvsIngestTieout
   subclass: WDL
   primaryDescriptorPath: /scripts/variantstore/wdl/GvsIngestTieout.wdl
Lines changed: 53 additions & 47 deletions
@@ -1,55 +1,61 @@
 # Running Exomes on GVS

-This document describes the changes necessary to run exome gVCFs through the GVS workflow. The changes needed to run exomes primarily involve using different parameters.
-**NOTE** Currently this document is written to be at the developer level (that is for experienced developers). For other docs (specifically, for our beta users) see https://github.com/broadinstitute/gatk/tree/ah_var_store/scripts/variantstore/beta_docs/
-
-**NOTE** For Exome we want to use the latest BGE exome interval list:
-gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list
+This document describes the changes necessary to run exome gVCFs through the GVS workflow. The changes needed to run exomes primarily involve using different parameters. Currently this document is written at the developer level (that is, for experienced developers). For other docs (specifically, for our beta users) see https://github.com/broadinstitute/gatk/tree/ah_var_store/scripts/variantstore/beta_docs/

 ## Setup
-- Create a Terra workspace
+
+- Create a Terra workspace and a BigQuery dataset with the necessary corresponding permissions for your PROXY group.
 - Populate the workspace with the following workflows:
-  - [GvsBulkIngestGenomes](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsBulkIngestGenomes) workflow
-  - [GvsAssignIds](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsAssignIds) workflow
-  - [GvsImportGenomes](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsImportGenomes) workflow
-  - [GvsPopulateAltAllele](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsPopulateAltAllele) workflow
-  - [GvsCreateFilterSet](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsCreateFilterSet) workflow
-  - [GvsPrepareRangesCallset](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsPrepareRangesCallset) workflow
-  - [GvsExtractCallset](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsExtractCallset) workflow
-  - [GvsCalculatePrecisionAndSensitivity](https://dockstore.org/workflows/github.com/broadinstitute/gatk/GvsCalculatePrecisionAndSensitivity) workflow
+  - [GvsBulkIngestGenomes](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsBulkIngestGenomes) workflow
+  - [GvsAssignIds](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsAssignIds) workflow (only if you want to calculate Precision and Sensitivity)
+  - [GvsImportGenomes](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsImportGenomes) workflow (only if you want to calculate Precision and Sensitivity)
+  - [GvsPopulateAltAllele](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsPopulateAltAllele) workflow
+  - [GvsCreateFilterSet](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsCreateFilterSet) workflow
+  - [GvsPrepareRangesCallset](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsPrepareRangesCallset) workflow
+  - [GvsExtractCallset](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsExtractCallset) workflow
+  - [GvsCalculatePrecisionAndSensitivity](https://dockstore.org/workflows/github.com/broadinstitute/gatk/GvsCalculatePrecisionAndSensitivity) workflow (only if you want to calculate Precision and Sensitivity)

 ## The Pipeline
 1. `GvsBulkIngestGenomes` workflow
-   - Run this workflow in order to load all samples into the database tables so that they can be run through the GVS workflow. This workflow encompasses the tasks described below (in `GvsAssignIds` and `GvsImportGenomes`)
-   - Run at the `sample set` level ("Step 1" in workflow submission) with a sample set of all the new samples to be included in the callset.
-   - **NOTE** For Exomes, use `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list` for the `interval_list`
-   - OR you can run the two workflows that `GvsBulkIngestGenomes` calls (for instance if you also need to load control samples)
-     1. `GvsAssignIds` workflow
-        - To optimize the GVS internal queries, each sample must have a unique and consecutive integer ID assigned. Running the `GvsAssignIds` will create a unique GVS ID for each sample (`sample_id`) and update the BQ `sample_info` table (creating it if it doesn't exist). This workflow takes care of creating the BQ `vet_*`, `ref_ranges_*` and `cost_observability` tables needed for the sample IDs generated.
-        - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
-        - The `external_sample_names` input should be the GCS path of a text file that lists all the sample names (external sample IDs).
-        - If new controls are being added, they need to be done in a separate run, with the `samples_are_controls` input set to "true" (the referenced Data columns may also be different, e.g. "this.control_samples.control_sample_id" instead of "this.samples.research_id").
-     2. `GvsImportGenomes` workflow
-        - This will import the re-blocked gVCF files into GVS. The workflow will check whether data for that sample has already been loaded into GVS. It is designed to be re-run (with the same inputs) if there is a failure during one of the workflow tasks (e.g. BigQuery write API interrupts).
-        - Run at the `sample set` level ("Step 1" in workflow submission). You can either run this on a sample_set of all the samples and rely on the workflow logic to break it up into batches.
-        - You will want to set the `external_sample_names`, `input_vcfs` and `input_vcf_indexes` inputs based on the columns in the workspace Data table, e.g. "this.samples.research_id", "this.samples.reblocked_gvcf_v2" and "this.samples.reblocked_gvcf_index_v2".
-        - **NOTE** For Exomes, use `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list` for the `interval_list`
-
-3. `GvsPopulateAltAllele` workflow
-   - This step loads data into the `alt_allele` table from the `vet_*` tables in preparation for running the filtering step.
-   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
-4. `GvsCreateFilterSet` workflow
-   - This step calculates features from the `alt_allele` table, and trains the VQSR filtering model along with site-level QC filters and loads them into BigQuery into a series of `filter_set_*` tables.
-   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
-   - **NOTE** For Exomes, use `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list` for the `interval_list`
-5. `GvsPrepareRangesCallset` workflow
-   - This workflow transforms the data in the vet tables into a schema optimized for callset stats creation and for calculating sensitivity and precision.
-   - This workflow may only need to be run once (for controls extract for Precision and Sensitivity). Run it with `control_samples` set to "true" (the default value is `false`).
-   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
-   - **NOTE** For Exomes, set the parameter `use_interval_weights` to `false`. This avoids a bug seen in WeightedSplitIntervals when using exomes, forcing it to use the standard version of SplitIntervals.
-6. `GvsCalculatePrecisionAndSensitivity` workflow
-   - This workflow needs to be run with the control sample chr20 vcfs from `GvsExtractCallset` step which were placed in the `output_gcs_dir`.
-   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
-   - **NOTE** For Exomes, use `gs://gvs-internal/truth/HG001.exome_evaluation_regions.v1.1.bed` as the "truth" bed for NA12878/HG001.
-
-
+   - Run this workflow in order to load all non-control samples into the database tables so that they can be run through the GVS workflow.
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+   - Set the `interval_list` input to `"gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"`
+
+To ingest control samples (which you will need to calculate Precision and Sensitivity), you will need to run the `GvsAssignIds` and `GvsImportGenomes` workflows just for them:
+1. `GvsAssignIds` workflow
+   - This workflow is set up to be re-run (with the same inputs) if there is a failure, but be sure to check for an existing `sample_id_assignment_lock` table in your dataset; if it exists, delete it before re-running.
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+   - The `external_sample_names` input should be the GCS path of a text file that lists all the control sample names (external sample IDs).
+   - Set the input `samples_are_controls` to `true`.
+1. `GvsImportGenomes` workflow
+   - This will import the re-blocked gVCF files into GVS. The workflow will check whether data for that sample has already been loaded into GVS. It is designed to be re-run (with the same inputs) if there is a failure during one of the workflow tasks (e.g. BigQuery write API interrupts).
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+   - You will want to set the `external_sample_names`, `input_vcfs` and `input_vcf_indexes` inputs based on the GCP locations of files that contain lists of these values (in the same order). For NA12878/HG001, the files you will need for ingest are:
+     - `input_vcfs`: `"gs://broad-gotc-test-storage/germline_single_sample/exome/scientific/truth/master/D5327.NA12878/NA12878.rb.g.vcf.gz"`
+     - `input_vcf_indexes`: `"gs://broad-gotc-test-storage/germline_single_sample/exome/scientific/truth/master/D5327.NA12878/NA12878.rb.g.vcf.gz.tbi"`
+   - Set the `interval_list` input to `"gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"`
+1. `GvsPopulateAltAllele` workflow
+   - This step loads data into the `alt_allele` table from the `vet_*` tables in preparation for running the filtering step.
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+1. `GvsCreateFilterSet` workflow
+   - This step calculates features from the `alt_allele` table, and trains the VETS model along with site-level QC filters and loads them into BigQuery into a series of `filter_set_*` tables.
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+   - Set the `interval_list` input to `"gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"`
+   - Set the `use_VQSR_lite` input to `true` to use VETS (instead of VQSR)
+1. `GvsPrepareRangesCallset` workflow
+   - This workflow transforms the data in the vet tables into a schema optimized for VCF extraction.
+   - This workflow will need to be run once to extract the callset as a whole and an additional time to create the files used to calculate Precision and Sensitivity (with `control_samples` set to `true`).
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+1. `GvsExtractCallset` workflow
+   - This workflow takes the tables created in the `Prepare` step to output joint VCF shards.
+   - This workflow will need to be run once to extract the callset into VCF shards and an additional time to calculate Precision and Sensitivity (with `control_samples` set to `true`).
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+   - Set the parameter `use_interval_weights` to `false`. This avoids a bug seen in WeightedSplitIntervals when using exomes, forcing it to use the standard version of SplitIntervals.
+   - Set the `output_gcs_dir` to a GCS location to collect all the VCF shards into one place. If you are running it twice (to calculate Precision and Sensitivity), be sure to provide distinct locations for each.
+1. `GvsCalculatePrecisionAndSensitivity` workflow
+   - This workflow needs to be run with the control sample VCF shards from the `GvsExtractCallset` step which were placed in the `output_gcs_dir` GCS location.
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+   - Truth inputs for NA12878/HG001:
+     - `truth_beds`: `"gs://gvs-internal/truth/HG001.exome_evaluation_regions.v1.1.bed"`
+     - `truth_vcfs`: `"gs://gvs-internal/truth/HG001_exome_filtered.recode.vcf.gz"`
+     - `truth_vcf_indices`: `"gs://gvs-internal/truth/HG001_exome_filtered.recode.vcf.gz.tbi"`
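The `GvsAssignIds` note above asks you to check for a leftover `sample_id_assignment_lock` table before any retry. A minimal sketch of that check-and-delete using the google-cloud-bigquery Python client; the project and dataset names here are placeholders, not values from this commit:

```python
from google.cloud import bigquery

# Placeholders: substitute your own Terra-linked project and GVS dataset.
PROJECT = "my-terra-project"
DATASET = "my_gvs_dataset"

client = bigquery.Client(project=PROJECT)

# delete_table(..., not_found_ok=True) is a no-op when the lock table is
# absent, so this is safe to run before any GvsAssignIds re-run.
client.delete_table(f"{PROJECT}.{DATASET}.sample_id_assignment_lock", not_found_ok=True)
print("sample_id_assignment_lock cleared (if it existed)")
```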

scripts/variantstore/wdl/GvsCalculatePrecisionAndSensitivity.wdl

Lines changed: 4 additions & 4 deletions
@@ -16,7 +16,7 @@ workflow GvsCalculatePrecisionAndSensitivity {
     Array[File] truth_vcf_indices
     Array[File] truth_beds

-    File ref_fasta
+    File ref_fasta = "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta"

     String? basic_docker
     String? variants_docker
@@ -348,8 +348,8 @@ task IsVQSRLite {
   command {
     set +e

-    # See if there are any non-header lines that contain the string 'AS_VQS_SENS'. If so, grep will return 0 else 1
-    grep -v '^#' ~{input_vcf} | grep AS_VQS_SENS > /dev/null
+    # See if there are any non-header lines that contain the string 'CALIBRATION_SENSITIVITY'. If so, grep will return 0 else 1
+    grep -v '^#' ~{input_vcf} | grep CALIBRATION_SENSITIVITY > /dev/null
     if [[ $? -eq 0 ]]; then
       echo "true" > ~{is_vqsr_lite_file}
     else
@@ -422,7 +422,7 @@ task EvaluateVcf {
     Int disk_size_gb = ceil(2 * size(ref_fasta, "GiB")) + 500
   }

-  String max_score_field_tag = if (is_vqsr_lite == true) then 'MAX_AS_VQS_SENS' else 'MAX_AS_VQSLOD'
+  String max_score_field_tag = if (is_vqsr_lite == true) then 'MAX_CALIBRATION_SENSITIVITY' else 'MAX_AS_VQSLOD'

   command <<<
     set -e -o pipefail
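The `IsVQSRLite` change above keys the VCF-flavor detection off the `CALIBRATION_SENSITIVITY` string instead of `AS_VQS_SENS`. A standalone sketch of the same check in Python (the function name is mine; the task itself uses the grep pipeline shown in the diff):

```python
import gzip
import sys

def is_vqsr_lite(vcf_path: str) -> bool:
    """Return True when any non-header record mentions CALIBRATION_SENSITIVITY,
    mirroring: grep -v '^#' input.vcf | grep CALIBRATION_SENSITIVITY"""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue  # skip header lines, like grep -v '^#'
            if "CALIBRATION_SENSITIVITY" in line:
                return True
    return False

if __name__ == "__main__":
    # Prints "true"/"false" the way the task writes its output file.
    print("true" if is_vqsr_lite(sys.argv[1]) else "false")
```

`EvaluateVcf` then picks `MAX_CALIBRATION_SENSITIVITY` or `MAX_AS_VQSLOD` as its score field based on this flag, as the third hunk shows.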

scripts/variantstore/wdl/GvsUtils.wdl

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ task GetToolVersions {
     # GVS generally uses the smallest `alpine` version of the Google Cloud SDK as it suffices for most tasks, but
     # there are a handful of tasks that require the larger GNU libc-based `slim`.
     String cloud_sdk_slim_docker = "gcr.io/google.com/cloudsdktool/cloud-sdk:435.0.0-slim"
-    String variants_docker = "us.gcr.io/broad-dsde-methods/variantstore:2023-08-11-alpine-3d48f01dd"
+    String variants_docker = "us.gcr.io/broad-dsde-methods/variantstore:2023-08-29-alpine"
     String gatk_docker = "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:varstore_2023_08_11"
     String variants_nirvana_docker = "us.gcr.io/broad-dsde-methods/variantstore:nirvana_2022_10_19"
     String real_time_genomics_docker = "docker.io/realtimegenomics/rtg-tools:latest"

Lines changed: 15 additions & 15 deletions
@@ -1,7 +1,7 @@
 import sys
 import gzip

-# Add new header for MAX_AS_VQS_SENS and MAX_AS_VQSLOD
+# Add new header for MAX_CALIBRATION_SENSITIVITY and MAX_AS_VQSLOD

 with gzip.open(sys.argv[1], 'rt') as file1:
     for line in file1:
@@ -10,46 +10,46 @@
         if "##" in line:
             print(line)
             continue
-
+
         if "#CHROM" in line:
-            print('##INFO=<ID=MAX_AS_VQS_SENS,Number=1,Type=Float,Description="Maximum of AS_VQS_SENS scores">')
+            print('##INFO=<ID=MAX_CALIBRATION_SENSITIVITY,Number=1,Type=Float,Description="Maximum of CALIBRATION_SENSITIVITY scores">')
             print('##INFO=<ID=MAX_AS_VQSLOD,Number=1,Type=Float,Description="Maximum of AS_VQSLOD scores">')
             print(line)
             continue

         parts = line.split("\t")
-
+
         # strip out hard filtered sites, so vcfeval can use "all-records" to plot the ROC curves
         if ("ExcessHet" in parts[6] or "LowQual" in parts[6] or "NO_HQ_GENOTYPES" in parts[6]):
             continue

-        info = parts[7]
-        d = dict([ tuple(x.split("=")) for x in info.split(";") if "=" in x])
+        info = parts[7]
+        d = dict([ tuple(x.split("=")) for x in info.split(";") if "=" in x])

         format_key = [x for x in parts[8].split(":")]
         sample_data = dict(zip(format_key, parts[9].split(":")))

         gt = sample_data['GT']
-
+
         if gt == "0/0" or gt == "./.":
             continue
-
+
         if 'FT' in sample_data:
             ft = sample_data['FT']

             # if there is a non-passing FT value
             if not (ft == "PASS" or ft == "."):
-
+
                 # overwrite FILTER if it was PASS or "."
                 if (parts[6] == "PASS" or parts[6] == "."):
                     parts[6] = ft
                 # otherwise append it to the end
                 else:
                     parts[6] = parts[6] + "," + ft

-        if "AS_VQS_SENS" in d:
-            if "," in d['AS_VQS_SENS']:
-                pieces = [x for x in d['AS_VQS_SENS'].split(",") if (x != "." and x != "NaN") ]
+        if "CALIBRATION_SENSITIVITY" in d:
+            if "," in d['CALIBRATION_SENSITIVITY']:
+                pieces = [x for x in d['CALIBRATION_SENSITIVITY'].split(",") if (x != "." and x != "NaN") ]

                 if (len(pieces) == 1):
                     m = pieces[0]
@@ -58,8 +58,8 @@
                 else:
                     m = max([float(x) for x in pieces])
             else:
-                m = d['AS_VQS_SENS']
-            parts[7] = f"{info};MAX_AS_VQS_SENS={m}"
+                m = d['CALIBRATION_SENSITIVITY']
+            parts[7] = f"{info};MAX_CALIBRATION_SENSITIVITY={m}"
         elif "AS_VQSLOD" in d:
             if "," in d['AS_VQSLOD']:
                 pieces = [x for x in d['AS_VQSLOD'].split(",") if (x != "." and x != "NaN") ]
@@ -74,6 +74,6 @@
                 m = d['AS_VQSLOD']
             parts[7] = f"{info};MAX_AS_VQSLOD={m}"
         else:
-            sys.exit(f"Can find neither AS_VQS_SENS nor AS_VQSLOD in {line}")
+            sys.exit(f"Can find neither CALIBRATION_SENSITIVITY nor AS_VQSLOD in {line}")

         print("\t".join(parts))
