+{"nbformat_minor": 1, "nbformat": 4, "cells": [{"source": "# VariantSpark example using Hail library\n### Initialisation: loading libraries", "cell_type": "markdown", "metadata": {}}, {"source": "%%juspark\n{\"spark.driver.memory\":\"24G\",\"spark.jars\":\"/home/hadoop/miniconda2/envs/jupyter/lib/python2.7/site-packages/varspark/jars/variant-spark_2.11-0.2.0-a1-all.jar\"}\n", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"scrolled": false, "trusted": true}}, {"source": "import hail\nimport varspark.hail\nhc = hail.HailContext(sc)\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport matplotlib.patches as mpatches\nimport numpy as np\nfrom math import log, isnan\nfrom pprint import pprint\nfrom decimal import Decimal\nfrom hail import KeyTable\nfrom hail.keytable import asc\nfrom hail.keytable import desc", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "numCPU=32\nnumPartition=numCPU*4\n\n# Set Demo to True to quickly run the notebook on a very small dataset with 20K SNPs\nDemo=False ", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "### Code for Creating Manhattan Plot", "cell_type": "markdown", "metadata": {}}, {"source": "def ColorImportant(pd):\n pd['c'] = \"silver\"\n pd.loc[pd['chr'] == '1', 'c'] = \"lightgrey\"\n pd.loc[pd['chr'] == '3', 'c'] = \"lightgrey\"\n pd.loc[pd['chr'] == '5', 'c'] = \"lightgrey\"\n pd.loc[pd['chr'] == '7', 'c'] = \"lightgrey\"\n pd.loc[pd['chr'] == '9', 'c'] = \"lightgrey\"\n pd.loc[pd['chr'] == '11', 'c'] = \"lightgrey\"\n pd.loc[pd['chr'] == '13', 'c'] = \"lightgrey\"\n pd.loc[pd['chr'] == '15', 'c'] = \"lightgrey\"\n pd.loc[pd['chr'] == '17', 'c'] = \"lightgrey\"\n pd.loc[pd['chr'] == '19', 'c'] = \"lightgrey\"\n pd.loc[pd['chr'] == '21', 'c'] = \"lightgrey\"\n \n pd.loc[pd['jk'] == '2_223034082', 'c'] = \"black\" #B6\n pd.loc[pd['jk'] == '4_54511913' , 'c'] = \"red\" #B2\n pd.loc[pd['jk'] == '5_126626044', 'c'] = \"green\" #R1\n pd.loc[pd['jk'] == '7_17284577' , 'c'] = \"blue\" #C2\n \n pd['Important'] = False\n \n pd.loc[pd['jk'] == '2_223034082', 'Important'] = True #B6\n pd.loc[pd['jk'] == '4_54511913' , 'Important'] = True #B2\n pd.loc[pd['jk'] == '5_126626044', 'Important'] = True #R1\n pd.loc[pd['jk'] == '7_17284577' , 'Important'] = True #C2\n return pd\n\nofs = [\n 0,\n 0, # offset for chr1\n 249250621, # offset for chr2\n 492449994, # offset for chr3\n 690472424,\n 881626700,\n 1062541960,\n 1233657027,\n 1392795690,\n 1539159712,\n 1680373143,\n 1815907890,\n 1950914406,\n 2084766301,\n 2199936179,\n 2307285719,\n 2409817111,\n 2500171864,\n 2581367074,\n 2659444322,\n 2722469842,\n 2781598825,\n 2832903391\n]\n", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "\n* [**VariantSpark**](http://bioinformatics.csiro.au/variantspark) is a machine learning library for real-time genomic data analysis (for thousands of samples and millions of variants) and is... \n * Built on top of Apache Spark and written in Scala\n * Authored by the team at [CSIRO Bioinformatics](http://bioinformatics.csiro.au/) in Australia\n * Uses a custom machine learning **random forest** implementation to find the most *important* variants attributing to a phenotype of interest \n* This demo... 
{"source": "\n* [**VariantSpark**](http://bioinformatics.csiro.au/variantspark) is a machine learning library for real-time genomic data analysis (for thousands of samples and millions of variants) and is...\n  * Built on top of Apache Spark and written in Scala\n  * Authored by the team at [CSIRO Bioinformatics](http://bioinformatics.csiro.au/) in Australia\n  * Built around a custom **random forest** implementation that finds the variants most *important* in predicting a phenotype of interest\n* This demo...\n  * Includes a dataset with a subset of the samples and variants (in VCF format) from the 1000 Genomes Project\n  * Uses a synthetic phenotype called *HipsterIndex* (in CSV format) that combines several real phenotypes (monobrow, beard, etc.)", "cell_type": "markdown", "metadata": {}}, {"source": "## Hipster Index\n\nThe synthetic HipsterIndex was created from the 1000 Genomes Project data using the following four SNPs:\n\n| ID | SNP ID | chromosome | position | phenotype | reference |\n|---:|----------:|----:|-------:|-----:|----------:|\n| B6 | [rs2218065](https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=2218065) | chr2 | 223034082 | monobrow | [Adhikari K, et al. (2016) Nat Commun.](https://www.ncbi.nlm.nih.gov/pubmed/?term=26926045) |\n| R1 | [rs1363387](https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1363387) | chr5 | 126626044 | Retina horizontal cells (checks) | [Kay, JN et al. (2012) Nature](https://www.ncbi.nlm.nih.gov/pubmed/?term=22407321) |\n| B2 | [rs4864809](https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=4864809) | chr4 | 54511913 | beard | [Adhikari K, et al. (2016) Nat Commun.](https://www.ncbi.nlm.nih.gov/pubmed/?term=26926045) |\n| C2 | [rs4410790](https://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=4410790) | chr7 | 17284577 | coffee consumption | [Cornelis MC et al. (2011) PLoS Genet.](https://www.ncbi.nlm.nih.gov/pubmed/?term=21490707) |\n\nTo simulate the phenotype, we first encode each genotype as an integer: *homozygote reference* as 0, *heterozygote* as 1 and *homozygote alternative* as 2. We then compute a score for each sample using the following equation:\n\n`Score = (5*B6*B2 - 4*B6 - 4*B2 + 1) + (7*R1*C2 - 6*R1 - 6*C2 + 1)`\n\nNext, we add noise to the score, where the noise is drawn from a normal distribution with `mean = 0` and `standard deviation = 6`. Finally, we sort the samples by `score+noise` and take the first half as controls and the second half as cases. This yields a binary annotation for the individuals.\n\nIn the rest of this notebook, we use VariantSpark to reverse-engineer the association of the selected SNPs with the phenotype of interest (i.e. being a hipster). We also run a traditional GWAS and show how it fails to identify the significant SNPs when such complex interactions are involved.", "cell_type": "markdown", "metadata": {}},
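{"source": "The next cell is a minimal sketch of the simulation just described, added for illustration only (it is not part of the original pipeline). The genotype vectors `b6`, `b2`, `r1` and `c2` are randomly generated stand-ins for the real 0/1/2 encodings of the four SNPs above.", "cell_type": "markdown", "metadata": {}}, {"source": "# Sketch of the HipsterIndex simulation (illustrative stand-in data only)\nimport numpy as np\n\nnSamples = 1000\nrng = np.random.RandomState(13)\n\n# Hypothetical 0/1/2 genotype encodings, one entry per sample\nb6, b2, r1, c2 = [rng.randint(0, 3, nSamples) for _ in range(4)]\n\nscore = (5*b6*b2 - 4*b6 - 4*b2 + 1) + (7*r1*c2 - 6*r1 - 6*c2 + 1)\nnoise = rng.normal(loc=0, scale=6, size=nSamples)\n\n# Sort by score+noise: the lower half become controls, the upper half cases\norder = np.argsort(score + noise)\nisCase = np.zeros(nSamples, dtype=bool)\nisCase[order[nSamples // 2:]] = True\nprint(isCase.sum())  # half of the samples are cases", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}},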
{"source": "### Load phenotypes data", "cell_type": "markdown", "metadata": {}}, {"source": "table = hc.import_table('s3://variant-spark/datasets/Hipster2Small/Hipster.csv', delimiter=',', impute=True)\\\n.key_by('Sample')\n\n# For details of the important variants please refer to the following link\n# https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/8497971343024764/53198984527781/2559267461126367/latest.html\nintervals = map(hail.representation.Interval.parse, ['2:223034081-223034083', '5:126626043-126626045', '4:54511912-54511914', '7:17284576-17284578'])\n", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "### Report the phenotype data", "cell_type": "markdown", "metadata": {}}, {"source": "tpd = table.to_pandas()", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "tpd.head(5)", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "tpd.hist('Score')", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "tpd.hist('Noise')", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "tpd.groupby('isCase')['Sample'].count()", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "tpd.hist('Score+Noise')", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "### Load VCF file into Hail VDS format", "cell_type": "markdown", "metadata": {}}, {"source": "vcf = 's3://variant-spark/datasets/Hipster2Small/Small.558938.vcf.bgz'\nrate = 0.001\nnTree = 1000\n\nif Demo:\n    vcf = 's3://variant-spark/datasets/Hipster2Small/Tiny.vcf.bgz'\n    rate = 0.05\n    nTree = 100\n", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "caseCtrlVds = hc.import_vcf(vcf).repartition(numPartition).cache()", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"scrolled": true, "trusted": true}}, {"source": "### Count the number of samples and SNPs", "cell_type": "markdown", "metadata": {}}, {"source": "print(caseCtrlVds.count())", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "### Annotate dataset with phenotype", "cell_type": "markdown", "metadata": {}}, {"source": "caseCtrlVds = caseCtrlVds.annotate_samples_table(table, root='sa.pheno')", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "### Compute PCA vectors", "cell_type": "markdown", "metadata": {}}, {"source": "pca_kt = caseCtrlVds.sample_variants(rate).cache()\\\n.pca(k=2, scores='sa.pca_score', loadings='va.pca_loadings', eigenvalues='global.pca_evals')\\\n.samples_table()\n\npca_table = pca_kt.to_pandas()\n\ncolors = {'Control': 'green', 'Case': 'red'}\nplt.figure(figsize=(15,10))\nplt.scatter(pca_table[\"sa.pca_score.PC1\"], pca_table[\"sa.pca_score.PC2\"],\n            c = pca_table[\"sa.pheno.population\"].map(colors),\n            alpha = .5)\nplt.xlim(pca_table[\"sa.pca_score.PC1\"].min(), pca_table[\"sa.pca_score.PC1\"].max())\nplt.ylim(pca_table[\"sa.pca_score.PC2\"].min(), pca_table[\"sa.pca_score.PC2\"].max())\nplt.xlabel(\"PC1\")\nplt.ylabel(\"PC2\")\nplt.title(\"PCA\")\nlegend_entries = [mpatches.Patch(color=c, label=pheno) for pheno, c in colors.items()]\nplt.legend(handles=legend_entries, loc=2)\nplt.show()\n\ncaseCtrlVds = caseCtrlVds.annotate_samples_table(pca_kt, root='sa.pca')\n\ndel pca_kt\ndel pca_table", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "### GWAS using VariantSpark (1000 Trees)", "cell_type": "markdown", "metadata": {}}, {"source": "%%sh\ndate +%s > time.txt", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "vskt = caseCtrlVds.importance_analysis(\"sa.pheno.Hipster\", n_trees = nTree, mtry_fraction = 0.1, oob = False,\\\n                                       seed = 13L, batch_size = 100)\\\n.important_variants(1000000)\\\n.rename({'variant':'v'}).rename({'importance':'vsis'})\\\n.order_by(desc('vsis')).indexed('vs_rank').key_by('v')", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}},
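{"source": "Before annotating the VDS, it can help to peek at the top-ranked variants as a sanity check. This cell is added for illustration; note that it pulls the whole importance table into pandas, which is fine at this demo's scale but may be slow for much larger runs.", "cell_type": "markdown", "metadata": {}}, {"source": "# Peek at the ten variants with the highest importance score.\n# vskt is keyed by variant and already carries a rank column ('vs_rank').\ntopVariants = vskt.expand_types().flatten().to_pandas()\ntopVariants.sort_values('vs_rank').head(10)", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}},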
pca_table[\"sa.pca_score.PC2\"].max())\nplt.xlabel(\"PC1\")\nplt.ylabel(\"PC2\")\nplt.title(\"PCA\")\nlegend_entries = [mpatches.Patch(color=c, label=pheno) for pheno, c in colors.items()]\nplt.legend(handles=legend_entries, loc=2)\nplt.show()\n\ncaseCtrlVds = caseCtrlVds.annotate_samples_table(pca_kt, root='sa.pca')\n\ndel pca_kt\ndel pca_table", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "### GWAS using VariantSpark (1000 Trees)", "cell_type": "markdown", "metadata": {}}, {"source": "%%sh\ndate +%s > time.txt", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "vskt = caseCtrlVds.importance_analysis(\"sa.pheno.Hipster\", n_trees = nTree, mtry_fraction = 0.1, oob = False,\\\n seed = 13L, batch_size = 100)\\\n.important_variants(1000000)\\\n.rename({'variant':'v'}).rename({'importance':'vsis'})\\\n.order_by(desc('vsis')).indexed('vs_rank').key_by('v')", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "### Annotate Hail VDS dataset with \"Importance Score\" computed using VariantSpark", "cell_type": "markdown", "metadata": {}}, {"source": "caseCtrlVds = caseCtrlVds.annotate_variants_table(vskt, root='va.vsis')", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "### Manhattan plot for VariantSpark \"Importance Score\" (z-score)", "cell_type": "markdown", "metadata": {}}, {"source": "Important_var = caseCtrlVds.filter_intervals(intervals)\n\npdf = caseCtrlVds.filter_variants_expr('va.vsis.vs_rank < 1000', keep=True)\\\n.union(Important_var).deduplicate().variants_table()\\\n.expand_types().flatten()\\\n.rename({'va.vsis.vsis':'vsis'})\\\n.rename({'v.contig':'chr'})\\\n.rename({'v.start':'pos'})\\\n.key_by(['vsis'])\\\n.select(['vsis', 'pos', 'chr'])\\\n.annotate('jk = chr + \"_\" + pos').to_pandas()\n\nkey='zvsis'\n\npdf[key] = (pdf['vsis'] - pdf['vsis'].mean())/pdf['vsis'].std(ddof=0)\n\npdf['xpos'] = pdf.apply(lambda row: ofs[int(row['chr'])] + row['pos'], axis=1)\npdf = ColorImportant(pdf)\npdf2 = pdf.loc[pdf['Important'] == True]\n\nplt.figure(figsize=(15,10))\nplt.scatter(pdf['xpos'], pdf[key], c = pdf['c'])\nplt.scatter(pdf2['xpos'], pdf2[key], c = pdf2['c'], linewidths=5)\nplt.xlim(pdf['xpos'].min(), pdf['xpos'].max())\nplt.ylim(0, pdf[key].max()*1.1)\nplt.xlabel(\"chromosomes\")\nplt.ylabel('-log10(p-value)')\nplt.title(\"Manhattan Plot of \"+ key)\n#legend_entries = [mpatches.Patch(color=c, label=pheno) for pheno, c in colors.items()]\n#plt.legend(handles=legend_entries, loc=2)\nplt.show()", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "%%sh\nstart=`cat time.txt`\nend=`date +%s`\nruntime=$((end-start))\necho \"VriantSpark runtime is \"$runtime \"seconds\"", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "### Traditional GWAS using Logestic Regression (wald test)", "cell_type": "markdown", "metadata": {}}, {"source": "%%sh\ndate +%s > time.txt", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "caseCtrlVds = caseCtrlVds\\\n.logreg(test='wald' , root='va.wald' , y='sa.pheno.isCase', \\\n covariates=['sa.pca.pca_score.PC1', 'sa.pca.pca_score.PC2'])\n", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "### Manhattan plot for traditional GWAS 
{"source": "### Manhattan plot for traditional GWAS with highlighted important SNPs", "cell_type": "markdown", "metadata": {}}, {"source": "Important_var = caseCtrlVds.filter_intervals(intervals)\n\npdf = caseCtrlVds.filter_variants_expr('va.wald.pval < 0.01', keep=True)\\\n.union(Important_var).deduplicate().variants_table()\\\n.expand_types().flatten()\\\n.rename({'va.wald.pval':'wald_pval'})\\\n.rename({'v.contig':'chr'})\\\n.rename({'v.start':'pos'})\\\n.key_by(['wald_pval'])\\\n.select(['wald_pval', 'pos', 'chr'])\\\n.annotate('wald_log10 = -log10(wald_pval)')\\\n.annotate('jk = chr + \"_\" + pos').to_pandas()\n\nkey = 'wald_log10'\n\npdf['xpos'] = pdf.apply(lambda row: ofs[int(row['chr'])] + row['pos'], axis=1)\npdf = ColorImportant(pdf)\npdf2 = pdf.loc[pdf['Important'] == True]\n\nplt.figure(figsize=(15,10))\nplt.scatter(pdf['xpos'], pdf[key], c = pdf['c'])\nplt.scatter(pdf2['xpos'], pdf2[key], c = pdf2['c'], linewidths=5)\nplt.xlim(pdf['xpos'].min(), pdf['xpos'].max())\nplt.ylim(0, pdf[key].max()*1.1)\nplt.xlabel(\"chromosomes\")\nplt.ylabel('-log10(p-value)')\nplt.title(\"Manhattan Plot of \" + key)\n#legend_entries = [mpatches.Patch(color=c, label=pheno) for pheno, c in colors.items()]\n#plt.legend(handles=legend_entries, loc=2)\nplt.show()", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}, {"source": "%%sh\nstart=`cat time.txt`\nend=`date +%s`\nruntime=$((end-start))\necho \"Traditional GWAS runtime is $runtime seconds\"", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {"trusted": true}}], "metadata": {"kernelspec": {"display_name": "JuSpark", "name": "juspark", "language": "python"}, "language_info": {"mimetype": "text/x-python", "nbconvert_exporter": "python", "name": "python", "pygments_lexer": "ipython2", "version": "2.7.16", "file_extension": ".py", "codemirror_mode": {"version": 2, "name": "ipython"}}}}