{"cells":[{"cell_type":"markdown","source":["Running importance analysis with Python API\n=====================================\n\nThis is a *VariantSpark* example notebook.\n\n\nOne of the main applications of VariantSpark is the discovery of genomic variants correlated with a response variable (e.g. case vs control) using random forest Gini importance.\n\nThe `chr22_1000.vcf` file is a very small sample of the chromosome 22 VCF file from the 1000 Genomes Project.\n\n`chr22-labels.csv` is a CSV file with sample response variables (labels). In fact, the labels directly represent the number of alternative alleles for each sample at a specific genomic position. For example, column 22_16050408 has labels derived from variants in chromosome 22 position 16050408. We would therefore expect position 22:16050408 in the VCF file to be strongly correlated with the label 22_16050408.\n\nBoth data sets are located in the `..\\data` directory.\n\nThis notebook demonstrates how to run importance analysis on these data with the *VariantSpark* Python API."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"4f8de557-109f-4d63-b3b0-46ea3aa4d2e4"}}},{"cell_type":"markdown","source":["Step 1: Create a Spark session with the VariantSpark jar attached."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"7f5f721b-5286-4c6b-b058-58d0027092d2"}}},{"cell_type":"code","source":["import varspark as vs\nfrom pyspark.sql import SparkSession \nspark = SparkSession.builder.config('spark.jars', vs.find_jar()).getOrCreate()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"33b1984f-7eb6-47b3-a1ba-0858bf7ce12e"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"<div 
class=\"ansiout\"></div>","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"></div>"]}}],"execution_count":0},{"cell_type":"markdown","source":["Step 2: Create a `VarsparkContext` using the `SparkSession` object (here available as `spark`):"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"1aa6a2cf-1c22-4e83-a375-399d1fb7fe59"}}},{"cell_type":"code","source":["vc = vs.VarsparkContext(spark, silent = True)"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"242479c5-d3c2-41d0-b9d3-abb0458f5f66"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"<div class=\"ansiout\"></div>","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"></div>"]}}],"execution_count":0},{"cell_type":"markdown","source":["Step 3: Load the features (`features`) and labels (`labels`) from the data files."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"b484ba90-059a-4c47-976a-99820dd7a403"}}},{"cell_type":"code","source":["features = vc.import_vcf('dbfs:/databricks/Filestore/chr22_1000.vcf')\nlabels = 
vc.load_label('dbfs:/databricks/Filestore/chr22-labels.csv', '22_16050408')"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"c0c7a5c8-3a78-477c-a4b6-4a3112c2b6d0"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"<div class=\"ansiout\"></div>","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div class=\"ansiout\"></div>"]}}],"execution_count":0},{"cell_type":"markdown","source":["Step 4: Run the importance analysis and retrieve the most important variables:"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"2d648334-a27f-4d4d-bee3-589aff4a63f8"}}},{"cell_type":"code","source":["ia = features.importance_analysis(labels, seed = 13, n_trees=500, batch_size=20)\ntop_variables = ia.important_variables()"],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"e7a7e71a-9c33-4d5f-a0b7-21e49f094aa1"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"<div class=\"ansiout\"></div>","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div 
class=\"ansiout\"></div>"]}}],"execution_count":0},{"cell_type":"markdown","source":["Step 5: Display the results."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"b4d91f18-2e28-4511-b3bd-f8d6c42b3ea9"}}},{"cell_type":"code","source":["print(\"%s\\t%s\" % ('Variable', 'Importance'))\nfor var_and_imp in top_variables:\n print(\"%s\\t%s\" % var_and_imp) "],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"21a6dbd7-60d5-47d2-be26-8643f6ae241a"}},"outputs":[{"output_type":"display_data","metadata":{"application/vnd.databricks.v1+output":{"datasetInfos":[],"data":"<div class=\"ansiout\">Variable\tImportance\n22_16050408_T_C\t0.0008736428902538276\n22_16051480_T_C\t0.0007419293893929183\n22_16053435_G_T\t0.0006531820847942653\n22_16050678_C_T\t0.0006184428574495989\n22_16051107_C_A\t0.0006073673092564597\n22_16052656_T_C\t0.0005943510809849819\n22_16051882_C_T\t0.000575291789207231\n22_16053197_G_T\t0.0005011769789499887\n22_16052838_T_A\t0.0004754239609993277\n22_16053509_A_G\t0.0004491742430430418\n</div>","removedWidgets":[],"addedWidgets":{},"metadata":{},"type":"html","arguments":{}}},"output_type":"display_data","data":{"text/html":["<style scoped>\n .ansiout {\n display: block;\n unicode-bidi: embed;\n white-space: pre-wrap;\n word-wrap: break-word;\n word-break: break-all;\n font-family: \"Source Code Pro\", \"Menlo\", monospace;;\n font-size: 13px;\n color: #555;\n margin-left: 4px;\n line-height: 19px;\n }\n</style>\n<div 
class=\"ansiout\">Variable\tImportance\n22_16050408_T_C\t0.0008736428902538276\n22_16051480_T_C\t0.0007419293893929183\n22_16053435_G_T\t0.0006531820847942653\n22_16050678_C_T\t0.0006184428574495989\n22_16051107_C_A\t0.0006073673092564597\n22_16052656_T_C\t0.0005943510809849819\n22_16051882_C_T\t0.000575291789207231\n22_16053197_G_T\t0.0005011769789499887\n22_16052838_T_A\t0.0004754239609993277\n22_16053509_A_G\t0.0004491742430430418\n</div>"]}}],"execution_count":0},{"cell_type":"markdown","source":["For more information on using *VariantSpark* and the Python API please visit the [documentation](http://variantspark.readthedocs.io/en/latest/)."],"metadata":{"application/vnd.databricks.v1+cell":{"title":"","showTitle":false,"inputWidgets":{},"nuid":"d3275383-1d7e-4db3-ab9c-cef1ba0fbc55"}}}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"mimetype":"text/x-python","name":"python","pygments_lexer":"ipython3","codemirror_mode":{"name":"ipython","version":3},"version":"3.6.12","nbconvert_exporter":"python","file_extension":".py"},"application/vnd.databricks.v1+notebook":{"notebookName":"run_importance_chr22.ipynb","dashboards":[],"notebookMetadata":{"pythonIndentUnit":4},"language":"python","widgets":{},"notebookOrigID":122098722592985}},"nbformat":4,"nbformat_minor":0}