
Commit 34ed02e

Changes after creating 190K Exome Callset [VS-189] (#8459)

1 parent 7155e0c · commit 34ed02e

5 files changed (+73, −71 lines)

.dockstore.yml

Lines changed: 0 additions & 4 deletions
@@ -127,8 +127,6 @@ workflows:
     branches:
     - master
     - ah_var_store
-    - bulk_ingest_staging
-    - vs_1032_beta_wdl_pin
 - name: GvsPrepareRangesCallset
   subclass: WDL
   primaryDescriptorPath: /scripts/variantstore/wdl/GvsPrepareRangesCallset.wdl
@@ -209,7 +207,6 @@ workflows:
     branches:
     - master
     - ah_var_store
-    - vs_1014_exome_warp_ps
 - name: GvsQuickstartVcfIntegration
   subclass: WDL
   primaryDescriptorPath: /scripts/variantstore/wdl/GvsQuickstartVcfIntegration.wdl
@@ -231,7 +228,6 @@ workflows:
     branches:
     - master
     - ah_var_store
-    - vs_1032_beta_wdl_pin
 - name: GvsIngestTieout
   subclass: WDL
   primaryDescriptorPath: /scripts/variantstore/wdl/GvsIngestTieout.wdl
Lines changed: 53 additions & 47 deletions
@@ -1,55 +1,61 @@
 # Running Exomes on GVS

-This document describes the changes necessary to run exome gVCFs through the GVS workflow. The changes needed to run exomes primarily involve using different parameters.
-**NOTE** Currently this document is written to be at the developer level (that is for experienced developers). For other docs (specifically, for our beta users) see https://github.com/broadinstitute/gatk/tree/ah_var_store/scripts/variantstore/beta_docs/
-
-**NOTE** For Exome we want to use the latest BGE exome interval list:
-gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list
+This document describes the changes necessary to run exome gVCFs through the GVS workflow. The changes needed to run exomes primarily involve using different parameters. Currently this document is written at the developer level (that is, for experienced developers). For other docs (specifically, for our beta users) see https://github.com/broadinstitute/gatk/tree/ah_var_store/scripts/variantstore/beta_docs/

 ## Setup
-- Create a Terra workspace
+
+- Create a Terra workspace and a BigQuery dataset with the necessary corresponding permissions for your PROXY group.
 - Populate the workspace with the following workflows:
-  - [GvsBulkIngestGenomes](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsBulkIngestGenomes) workflow
-  - [GvsAssignIds](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsAssignIds) workflow
-  - [GvsImportGenomes](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsImportGenomes) workflow
-  - [GvsPopulateAltAllele](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsPopulateAltAllele) workflow
-  - [GvsCreateFilterSet](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsCreateFilterSet) workflow
-  - [GvsPrepareRangesCallset](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsPrepareRangesCallset) workflow
-  - [GvsExtractCallset](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsExtractCallset) workflow
-  - [GvsCalculatePrecisionAndSensitivity](https://dockstore.org/workflows/github.com/broadinstitute/gatk/GvsCalculatePrecisionAndSensitivity) workflow
+  - [GvsBulkIngestGenomes](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsBulkIngestGenomes) workflow
+  - [GvsAssignIds](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsAssignIds) workflow (only if you want to calculate Precision and Sensitivity)
+  - [GvsImportGenomes](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsImportGenomes) workflow (only if you want to calculate Precision and Sensitivity)
+  - [GvsPopulateAltAllele](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsPopulateAltAllele) workflow
+  - [GvsCreateFilterSet](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsCreateFilterSet) workflow
+  - [GvsPrepareRangesCallset](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsPrepareRangesCallset) workflow
+  - [GvsExtractCallset](https://dockstore.org/my-workflows/github.com/broadinstitute/gatk/GvsExtractCallset) workflow
+  - [GvsCalculatePrecisionAndSensitivity](https://dockstore.org/workflows/github.com/broadinstitute/gatk/GvsCalculatePrecisionAndSensitivity) workflow (only if you want to calculate Precision and Sensitivity)

 ## The Pipeline
 1. `GvsBulkIngestGenomes` workflow
-   - Run this workflow in order to load all samples into the database tables so that they can be run through the GVS workflow. This workflow encompasses the tasks described below (in `GvsAssignIds` and `GvsImportGenomes`)
-   - Run at the `sample set` level ("Step 1" in workflow submission) with a sample set of all the new samples to be included in the callset.
-   - **NOTE** For Exomes, use `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list` for the `interval_list`
-   - OR you can run the two workflows that `GvsBulkIngestGenomes` calls (for instance if you also need to load control samples)
-     1. `GvsAssignIds` workflow
-        - To optimize the GVS internal queries, each sample must have a unique and consecutive integer ID assigned. Running the `GvsAssignIds` will create a unique GVS ID for each sample (`sample_id`) and update the BQ `sample_info` table (creating it if it doesn't exist). This workflow takes care of creating the BQ `vet_*`, `ref_ranges_*` and `cost_observability` tables needed for the sample IDs generated.
-        - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
-        - The `external_sample_names` input should be the GCS path of a text file that lists all the sample names (external sample IDs).
-        - If new controls are being added, they need to be done in a separate run, with the `samples_are_controls` input set to "true" (the referenced Data columns may also be different, e.g. "this.control_samples.control_sample_id" instead of "this.samples.research_id").
-     2. `GvsImportGenomes` workflow
-        - This will import the re-blocked gVCF files into GVS. The workflow will check whether data for that sample has already been loaded into GVS. It is designed to be re-run (with the same inputs) if there is a failure during one of the workflow tasks (e.g. BigQuery write API interrupts).
-        - Run at the `sample set` level ("Step 1" in workflow submission). You can either run this on a sample_set of all the samples and rely on the workflow logic to break it up into batches.
-        - You will want to set the `external_sample_names`, `input_vcfs` and `input_vcf_indexes` inputs based on the columns in the workspace Data table, e.g. "this.samples.research_id", "this.samples.reblocked_gvcf_v2" and "this.samples.reblocked_gvcf_index_v2".
-        - **NOTE** For Exomes, use `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list` for the `interval_list`
-
-3. `GvsPopulateAltAllele` workflow
-   - This step loads data into the `alt_allele` table from the `vet_*` tables in preparation for running the filtering step.
-   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
-4. `GvsCreateFilterSet` workflow
-   - This step calculates features from the `alt_allele` table, and trains the VQSR filtering model along with site-level QC filters and loads them into BigQuery into a series of `filter_set_*` tables.
-   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
-   - **NOTE** For Exomes, use `gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list` for the `interval_list`
-5. `GvsPrepareRangesCallset` workflow
-   - This workflow transforms the data in the vet tables into a schema optimized for callset stats creation and for calculating sensitivity and precision.
-   - This workflow may only need to be run once (for controls extract for Precision and Sensitivity). Run it with `control_samples` set to "true" (the default value is `false`).
-   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
-   - **NOTE** For Exomes, set the parameter `use_interval_weights` to `false`. This avoids a bug seen in WeightedSplitIntervals when using exomes, forcing it to use the standard version of SplitIntervals.
-6. `GvsCalculatePrecisionAndSensitivity` workflow
-   - This workflow needs to be run with the control sample chr20 vcfs from `GvsExtractCallset` step which were placed in the `output_gcs_dir`.
-   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
-   - **NOTE** For Exomes, use `gs://gvs-internal/truth/HG001.exome_evaluation_regions.v1.1.bed` as the "truth" bed for NA12878/HG001.
-
-
+   - Run this workflow in order to load all non-control samples into the database tables so that they can be run through the GVS workflow.
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+   - Set the `interval_list` input to `"gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"`
+
+To ingest control samples (which you will need to calculate Precision and Sensitivity), you will need to run the `GvsAssignIds` and `GvsImportGenomes` workflows just for them:
+1. `GvsAssignIds` workflow
+   - This workflow is set up to be re-run (with the same inputs) if there is a failure, but be sure to check for an existing `sample_id_assignment_lock` table in your dataset; if it exists, delete it before re-running.
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+   - The `external_sample_names` input should be the GCS path of a text file that lists all the control sample names (external sample IDs).
+   - Set the input `samples_are_controls` to `true`.
+1. `GvsImportGenomes` workflow
+   - This will import the re-blocked gVCF files into GVS. The workflow will check whether data for that sample has already been loaded into GVS. It is designed to be re-run (with the same inputs) if there is a failure during one of the workflow tasks (e.g. BigQuery write API interrupts).
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+   - You will want to set the `external_sample_names`, `input_vcfs` and `input_vcf_indexes` inputs based on the GCP locations of files that contain lists of these values (in the same order). For NA12878/HG001, the files you will need for ingest are:
+     - `input_vcfs`: `"gs://broad-gotc-test-storage/germline_single_sample/exome/scientific/truth/master/D5327.NA12878/NA12878.rb.g.vcf.gz"`
+     - `input_vcf_indexes`: `"gs://broad-gotc-test-storage/germline_single_sample/exome/scientific/truth/master/D5327.NA12878/NA12878.rb.g.vcf.gz.tbi"`
+   - Set the `interval_list` input to `"gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"`
+1. `GvsPopulateAltAllele` workflow
+   - This step loads data into the `alt_allele` table from the `vet_*` tables in preparation for running the filtering step.
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+1. `GvsCreateFilterSet` workflow
+   - This step calculates features from the `alt_allele` table, and trains the VETS model along with site-level QC filters and loads them into BigQuery into a series of `filter_set_*` tables.
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+   - Set the `interval_list` input to `"gs://gcp-public-data--broad-references/hg38/v0/bge_exome_calling_regions.v1.1.interval_list"`
+   - Set the `use_VQSR_lite` input to `true` to use VETS (instead of VQSR)
+1. `GvsPrepareRangesCallset` workflow
+   - This workflow transforms the data in the vet tables into a schema optimized for VCF extraction.
+   - This workflow will need to be run once to extract the callset as a whole and an additional time to create the files used to calculate Precision and Sensitivity (with `control_samples` set to `true`).
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+1. `GvsExtractCallset` workflow
+   - This workflow takes the tables created in the `Prepare` step to output joint VCF shards.
+   - This workflow will need to be run once to extract the callset into VCF shards and an additional time to calculate Precision and Sensitivity (with `control_samples` set to `true`).
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+   - Set the parameter `use_interval_weights` to `false`. This avoids a bug seen in WeightedSplitIntervals when using exomes, forcing it to use the standard version of SplitIntervals.
+   - Set the `output_gcs_dir` to a GCS location to collect all the VCF shards into one place. If you are running it twice (to calculate Precision and Sensitivity), be sure to provide distinct locations for each.
+1. `GvsCalculatePrecisionAndSensitivity` workflow
+   - This workflow needs to be run with the control sample VCF shards from the `GvsExtractCallset` step which were placed in the `output_gcs_dir` GCS location.
+   - This workflow does not use the Terra Data Entity Model to run, so be sure to select the `Run workflow with inputs defined by file paths` workflow submission option.
+   - Truth inputs for NA12878/HG001:
+     - `truth_beds`: `"gs://gvs-internal/truth/HG001.exome_evaluation_regions.v1.1.bed"`
+     - `truth_vcfs`: `"gs://gvs-internal/truth/HG001_exome_filtered.recode.vcf.gz"`
+     - `truth_vcf_indices`: `"gs://gvs-internal/truth/HG001_exome_filtered.recode.vcf.gz.tbi"`
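The `GvsAssignIds` note above asks you to check for a leftover `sample_id_assignment_lock` table before any retry. A minimal sketch of that check-and-delete using the google-cloud-bigquery Python client; the project and dataset names here are placeholders, not values from this commit:

```python
from google.cloud import bigquery

# Placeholders: substitute your own Terra-linked project and GVS dataset.
PROJECT = "my-terra-project"
DATASET = "my_gvs_dataset"

client = bigquery.Client(project=PROJECT)

# delete_table(..., not_found_ok=True) is a no-op when the lock table is
# absent, so this is safe to run before any GvsAssignIds re-run.
client.delete_table(f"{PROJECT}.{DATASET}.sample_id_assignment_lock", not_found_ok=True)
print("sample_id_assignment_lock cleared (if it existed)")
```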

scripts/variantstore/wdl/GvsCalculatePrecisionAndSensitivity.wdl

Lines changed: 4 additions & 4 deletions
@@ -16,7 +16,7 @@ workflow GvsCalculatePrecisionAndSensitivity {
     Array[File] truth_vcf_indices
     Array[File] truth_beds

-    File ref_fasta
+    File ref_fasta = "gs://gcp-public-data--broad-references/hg38/v0/Homo_sapiens_assembly38.fasta"

     String? basic_docker
     String? variants_docker
@@ -348,8 +348,8 @@ task IsVQSRLite {
   command {
     set +e

-    # See if there are any non-header lines that contain the string 'AS_VQS_SENS'. If so, grep will return 0 else 1
-    grep -v '^#' ~{input_vcf} | grep AS_VQS_SENS > /dev/null
+    # See if there are any non-header lines that contain the string 'CALIBRATION_SENSITIVITY'. If so, grep will return 0 else 1
+    grep -v '^#' ~{input_vcf} | grep CALIBRATION_SENSITIVITY > /dev/null
     if [[ $? -eq 0 ]]; then
       echo "true" > ~{is_vqsr_lite_file}
     else
@@ -422,7 +422,7 @@ task EvaluateVcf {
     Int disk_size_gb = ceil(2 * size(ref_fasta, "GiB")) + 500
   }

-  String max_score_field_tag = if (is_vqsr_lite == true) then 'MAX_AS_VQS_SENS' else 'MAX_AS_VQSLOD'
+  String max_score_field_tag = if (is_vqsr_lite == true) then 'MAX_CALIBRATION_SENSITIVITY' else 'MAX_AS_VQSLOD'

   command <<<
     set -e -o pipefail
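The `IsVQSRLite` change above keys the VCF-flavor detection off the `CALIBRATION_SENSITIVITY` string instead of `AS_VQS_SENS`. A standalone sketch of the same check in Python (the function name is mine; the task itself uses the grep pipeline shown in the diff):

```python
import gzip
import sys

def is_vqsr_lite(vcf_path: str) -> bool:
    """Return True when any non-header record mentions CALIBRATION_SENSITIVITY,
    mirroring: grep -v '^#' input.vcf | grep CALIBRATION_SENSITIVITY"""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue  # skip header lines, like grep -v '^#'
            if "CALIBRATION_SENSITIVITY" in line:
                return True
    return False

if __name__ == "__main__":
    # Prints "true"/"false" the way the task writes its output file.
    print("true" if is_vqsr_lite(sys.argv[1]) else "false")
```

`EvaluateVcf` then picks `MAX_CALIBRATION_SENSITIVITY` or `MAX_AS_VQSLOD` as its score field based on this flag, as the third hunk shows.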

scripts/variantstore/wdl/GvsUtils.wdl

Lines changed: 1 addition & 1 deletion
@@ -55,7 +55,7 @@ task GetToolVersions {
     # GVS generally uses the smallest `alpine` version of the Google Cloud SDK as it suffices for most tasks, but
     # there are a handful of tasks that require the larger GNU libc-based `slim`.
     String cloud_sdk_slim_docker = "gcr.io/google.com/cloudsdktool/cloud-sdk:435.0.0-slim"
-    String variants_docker = "us.gcr.io/broad-dsde-methods/variantstore:2023-08-11-alpine-3d48f01dd"
+    String variants_docker = "us.gcr.io/broad-dsde-methods/variantstore:2023-08-29-alpine"
     String gatk_docker = "us.gcr.io/broad-dsde-methods/broad-gatk-snapshots:varstore_2023_08_11"
     String variants_nirvana_docker = "us.gcr.io/broad-dsde-methods/variantstore:nirvana_2022_10_19"
     String real_time_genomics_docker = "docker.io/realtimegenomics/rtg-tools:latest"

Lines changed: 15 additions & 15 deletions
@@ -1,7 +1,7 @@
 import sys
 import gzip

-# Add new header for MAX_AS_VQS_SENS and MAX_AS_VQSLOD
+# Add new header for MAX_CALIBRATION_SENSITIVITY and MAX_AS_VQSLOD

 with gzip.open(sys.argv[1], 'rt') as file1:
     for line in file1:
@@ -10,46 +10,46 @@
         if "##" in line:
             print(line)
             continue
-
+
         if "#CHROM" in line:
-            print('##INFO=<ID=MAX_AS_VQS_SENS,Number=1,Type=Float,Description="Maximum of AS_VQS_SENS scores">')
+            print('##INFO=<ID=MAX_CALIBRATION_SENSITIVITY,Number=1,Type=Float,Description="Maximum of CALIBRATION_SENSITIVITY scores">')
             print('##INFO=<ID=MAX_AS_VQSLOD,Number=1,Type=Float,Description="Maximum of AS_VQSLOD scores">')
             print(line)
             continue

         parts = line.split("\t")
-
+
         # strip out hard filtered sites, so vcfeval can use "all-records" to plot the ROC curves
         if ("ExcessHet" in parts[6] or "LowQual" in parts[6] or "NO_HQ_GENOTYPES" in parts[6]):
             continue

-        info = parts[7]
-        d = dict([ tuple(x.split("=")) for x in info.split(";") if "=" in x])
+        info = parts[7]
+        d = dict([ tuple(x.split("=")) for x in info.split(";") if "=" in x])

         format_key = [x for x in parts[8].split(":")]
         sample_data = dict(zip(format_key, parts[9].split(":")))

         gt = sample_data['GT']
-
+
         if gt == "0/0" or gt == "./.":
             continue
-
+
         if 'FT' in sample_data:
             ft = sample_data['FT']

             # if there is a non-passing FT value
             if not (ft == "PASS" or ft == "."):
-
+
                 # overwrite FILTER if it was PASS or "."
                 if (parts[6] == "PASS" or parts[6] == "."):
                     parts[6] = ft
                 # otherwise append it to the end
                 else:
                     parts[6] = parts[6] + "," + ft

-        if "AS_VQS_SENS" in d:
-            if "," in d['AS_VQS_SENS']:
-                pieces = [x for x in d['AS_VQS_SENS'].split(",") if (x != "." and x != "NaN") ]
+        if "CALIBRATION_SENSITIVITY" in d:
+            if "," in d['CALIBRATION_SENSITIVITY']:
+                pieces = [x for x in d['CALIBRATION_SENSITIVITY'].split(",") if (x != "." and x != "NaN") ]

                 if (len(pieces) == 1):
                     m = pieces[0]
@@ -58,8 +58,8 @@
                 else:
                     m = max([float(x) for x in pieces])
             else:
-                m = d['AS_VQS_SENS']
-            parts[7] = f"{info};MAX_AS_VQS_SENS={m}"
+                m = d['CALIBRATION_SENSITIVITY']
+            parts[7] = f"{info};MAX_CALIBRATION_SENSITIVITY={m}"
         elif "AS_VQSLOD" in d:
             if "," in d['AS_VQSLOD']:
                 pieces = [x for x in d['AS_VQSLOD'].split(",") if (x != "." and x != "NaN") ]
@@ -74,6 +74,6 @@
                 m = d['AS_VQSLOD']
             parts[7] = f"{info};MAX_AS_VQSLOD={m}"
         else:
-            sys.exit(f"Can find neither AS_VQS_SENS nor AS_VQSLOD in {line}")
+            sys.exit(f"Can find neither CALIBRATION_SENSITIVITY nor AS_VQSLOD in {line}")

         print("\t".join(parts))
