
Commit 58ef2a0

rocreguant (Roc Reguant Comellas) and piotrszul authored
Implements p-value calculation from python (#208)
* Updating the READMEs
* updated python requirements
* add missing library
* adding plotting libraries
* initial p-values calculation. Optimization method and parameters to be updated
* adding a p-value computation example
* adding test for pval computation
* added new test to the battery of tests
* finalized the tests, some hidden
* corrected small bug in testing
* fixing a library import CI bug
* fixing the matplotlib imports error for the tests, plus hiding private functions for pval calculation
* fixing typo
* testing different scipy versions
* fixing declarations
* commenting out the plotting libraries before deciding what to do with them
* Removed debugging and updated fitting function selection to best of three
* removed tsv saving file
* fixing styling
* fixing edge cases for the fitting
* fixing style
* integrating pvalue calculation to variant spark
* formatting files
* changing import order
* changing data path fix
* fixing dependency problems
* fixing dependency problems
* cleaning requirements
* remove unused code
* fix path
* adding pvalues readme
* fixing the readme visualization
* link to biorxiv
* Added refactored interface for local FDR calculation.
* refactored into two classes, need to make it work
* saving the new structure
* theoretically local fdr vs done
* last commit of the day
* working local fdr
* everything working
* wip
* almost there
* everything should be working
* final touches
* fixed test
* fixing requirements
* fixing requirements 2
* removing antique file
* removing magic number
* change in nomenclature
* small description
* bug fix
* update test
* Fixing an error in FDR calculation and small refactoring.
* Fixing pvalue tests

Co-authored-by: Roc Reguant Comellas <[email protected]>
Co-authored-by: Piotr Szul <[email protected]>
1 parent 8ddbb12 commit 58ef2a0

14 files changed, +1021 -12 lines changed

README.md

Lines changed: 2 additions & 2 deletions
@@ -100,7 +100,7 @@ variant-spark comes with a few example scripts in the `scripts` directory that d
 
 There is a few small data sets in the `data` directory suitable for running on a single machine. For example
 
-./scripts/local_run-importance-ch22.sh
+./examples/local_run-importance-ch22.sh
 
 runs variable importance command on a small sample of the chromosome 22 vcf file (from 1000 Genomes Project)
 
@@ -120,7 +120,7 @@ You can choose a different location by setting the `VS_DATA_DIR` environment var
 
 After the test data has been successfully copied to HDFS you can run examples scripts, e.g.:
 
-./scripts/yarn_run-importance-ch22.sh
+./examples/yarn_run-importance-ch22.sh
 
 Note: if you installed the data to a non default location the `VS_DATA_DIR` needs to be set accordingly when running the examples

dev/dev-requirements.txt

Lines changed: 4 additions & 0 deletions
@@ -13,3 +13,7 @@ pandas==1.1.4
 typedecorator==0.0.5
 Jinja2==3.0.3
 hail==0.2.74
+numpy==1.21.2
+patsy==0.5.2
+statsmodels==0.13.2
+seaborn==0.11.2

dev/py-test.sh

Lines changed: 1 addition & 0 deletions
@@ -11,4 +11,5 @@ cd "$FWDIR"
 pushd python
 pytest -s -m spark
 pytest -s -m hail
+pytest -s -m pvalues
 popd

examples/compute_local_fdr.ipynb

Lines changed: 524 additions & 0 deletions
Large diffs are not rendered by default.

examples/run_importance_chr22.ipynb

Lines changed: 2 additions & 2 deletions
@@ -135,7 +135,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
@@ -149,7 +149,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.12"
+"version": "3.8.12"
 }
 },
 "nbformat": 4,

examples/run_importance_chr22_with_hail.ipynb

Lines changed: 6 additions & 6 deletions
@@ -77,7 +77,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Step 2: Load labels into Hail table `labels`."
+"Step 3: Load labels into Hail table `labels`."
 ]
 },
 {
@@ -115,7 +115,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Step 3: Annotate dataset samples with labels."
+"Step 4: Annotate dataset samples with labels."
 ]
 },
 {
@@ -170,7 +170,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Step 4: Build the random forest model with `label.x22_16050408` as the respose variable."
+"Step 5: Build the random forest model with `label.x22_16050408` as the respose variable."
 ]
 },
 {
@@ -196,7 +196,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Step 5: Display the results: print OOB error calculated variable importance."
+"Step 6: Display the results: print OOB error calculated variable importance."
 ]
 },
 {
@@ -291,7 +291,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
@@ -305,7 +305,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.12"
+"version": "3.8.12"
 }
 },
 "nbformat": 4,

python/README.md

Lines changed: 3 additions & 1 deletion
@@ -73,7 +73,9 @@ For more information about how the VariantSpark wide random forest algorithm wor
 
 Install VariantSpark for development using this command:
 
-git https://github.com/aehrc/VariantSpark.git
+git clone https://github.com/aehrc/VariantSpark.git
+mvn clean install
+pip install -r dev/dev-requirements.txt
 cd VariantSpark/python
 pip install -e .
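A quick way to confirm that the development install and the dependency pins added above resolved correctly is a short import check, for example (a minimal sketch; `varspark` is the package installed by `pip install -e .`):

    # Verify that the newly pinned p-value dependencies and varspark import cleanly.
    import numpy, patsy, statsmodels, seaborn
    import varspark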

python/readme_pvalues.md

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
# Threshold Values for the Gini Variable Importance

Random Forests are machine learning methods commonly used to model data. They are highly scalable
and robust to overfitting while modelling non-linearities. Using an empirical Bayes approach, we
were able to quantify the importance of the variants in our models, thus improving the
interpretability of such algorithms.

A more detailed explanation can be found in the [manuscript](https://www.biorxiv.org/).

## Requirements

python==3.8.12 \
numpy==1.21.2 \
pandas==1.4.1 \
patsy==0.5.2 \
scipy==1.7.3 \
statsmodels==0.13.2

## Usage

The expected input is a Pandas Series in which the values are the logarithm of the importances.
The method returns a dictionary with the FDR values (array), the estimates for the fitted
function (array of length three), and the p-values for the statistically significant
variants (array).

The code can also be used stand-alone, requiring only the script file, which can be found at:
https://github.com/aehrc/VariantSpark/blob/918c80be28818b8872ce346cbb2092da5c4d2ced/python/varspark/pvalues_calculation.py

A hands-on Jupyter notebook with a step-by-step walkthrough from importances to p-values can be found at:
https://github.com/aehrc/VariantSpark/blob/918c80be28818b8872ce346cbb2092da5c4d2ced/examples/computing_p-value_example.ipynb

However, this method is also integrated with VariantSpark, which enables the p-value calculation
with a single function call. Training a model, calculating the p-values, and retrieving them can be
done in a few lines of code using VariantSpark, as shown in the following snippet:

    vds = hl.import_vcf(os.path.join(PROJECT_DIR, 'data/chr22_1000.vcf'))
    labels = hl.import_table(os.path.join(PROJECT_DIR, 'data/chr22-labels-hail.csv'),
                             impute=True, delimiter=",").key_by('sample')

    vds = vds.annotate_cols(label=labels[vds.s])
    rf_model = vshl.random_forest_model(y=vds.label['x22_16050408'], x=vds.GT.n_alt_alleles(),
                                        seed=13, mtry_fraction=0.05, min_node_size=5,
                                        max_depth=10)
    rf_model.fit_trees(100, 50)

    significant_variants = rf_model.get_significant_variances()

Notes: If you wish to use the VariantSpark implementation, please consider reading more about the
tool [here](https://github.com/aehrc/VariantSpark/blob/master/README.md) and [here](https://github.com/aehrc/VariantSpark/blob/master/python/README.md).

## Citation

If you use this method please consider citing us:

Dunne, R. ... (2022). Threshold Values for the Gini Variable Importance: An Empirical Bayes
Approach. arXiv preprint arXiv:XXXX.XXXXX.
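For the stand-alone usage described above, the following is a minimal sketch based on the refactored `LocalFdr` interface from `varspark.stats.lfdr` that this commit introduces (see `python/varspark/hail/lfdrvs.py` below); the exact entry points of `pvalues_calculation.py` itself may differ, and the input Series here is synthetic:

    import numpy as np
    import pandas as pd
    from varspark.stats.lfdr import LocalFdr

    # Synthetic stand-in for the log-importances produced by a VariantSpark run.
    rng = np.random.default_rng(13)
    log_importances = pd.Series(np.log(rng.gamma(shape=2.0, scale=1e-4, size=10000)),
                                name='logImportance')

    local_fdr = LocalFdr()
    local_fdr.fit(log_importances, 120)            # fit the importance density on 120 bins
    pvals = local_fdr.get_pvalues()                # one p-value per variant
    fdr, is_significant = local_fdr.get_fdr(0.05)  # expected FDR and significance mask at a 5% cutoff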

python/requirements.txt

Lines changed: 8 additions & 1 deletion
@@ -1,7 +1,14 @@
 # varspark dependencies
 # python 3.7
+Jinja2==3.0.3
 pandas==1.1.4
 typedecorator==0.0.5
 hail==0.2.74
 pyspark==3.1.2
-Jinja2==3.0.3
+scipy==1.7.3
+numpy==1.21.2
+patsy==0.5.2
+statsmodels==0.13.2
+seaborn==0.11.2
+
+

python/varspark/hail/lfdrvs.py

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
from varspark.stats.lfdr import *


class LocalFdrVs:
    local_fdr: object
    df_: object

    def __init__(self, df):
        """
        Constructor class
        :param df: Takes a pandas dataframe as argument with three columns: variant_id,
        logImportance and splitCount.
        """
        self.df_ = df.sort_values('logImportance', ascending=True)

    @classmethod
    def from_imp_df(cls, df):
        """
        Alternative class instantiation from a pandas dataframe
        :param cls: LocalFdrVs class
        :param df: Pandas dataframe with columns locus, alleles, importance, and splitCount.
        :return: Initialized class instance.
        """
        df = df.assign(logImportance = np.log(df.importance))
        df['variant_id'] = df.apply(lambda row: str(row['locus'][0])+'_'+str(row['locus'][1])+'_'+ \
                                    str('_'.join(row['alleles'])), axis=1)
        return cls(df[['variant_id', 'logImportance', 'splitCount']])

    @classmethod
    def from_imp_table(cls, impTable):
        """
        Alternative class instantiation from a Hail Table (VariantSpark users).
        :param cls: LocalFdrVs class
        :param impTable: Hail table with locus, alleles, importance, and splitCount.
        :return: Initialized class instance.
        """
        impTable = impTable.filter(impTable.splitCount >= 1)
        return LocalFdrVs.from_imp_df(impTable.to_spark(flatten=False).toPandas())

    def plot_log_densities(self, ax, min_split_count=1, max_split_count=6, palette='Set1',
                           find_automatic_best=False, xLabel='log(importance)', yLabel='density'):
        """
        Plotting the log densities to visually identify the unimodal distributions.
        :param ax: Matplotlib axis as a canvas for this plot.
        :param min_split_count: n>=1, from which the split count plotting starts.
        :param max_split_count: when to stop the split count filtering.
        :param find_automatic_best: The user may let the computer highlight the potential best option.
        :param palette: Matplotlib color palette used for the plotting.
        :param xLabel: Label on the x-axis of the plot.
        :param yLabel: Label on the y-axis of the plot.
        """

        assert min_split_count < max_split_count, 'min_split_count should be smaller than max_split_count'
        assert min_split_count > 0, 'min_split_count should be bigger than 0'
        assert type(palette) == str, 'palette should be a string'
        assert type(xLabel) == str, 'xLabel should be a string'
        assert type(yLabel) == str, 'yLabel should be a string'

        n_lines = max_split_count - min_split_count + 1
        colors = sns.mpl_palette(palette, n_lines)
        df = self.df_
        for i, c in zip(range(min_split_count, max_split_count + 1), colors):
            sns.kdeplot(df.logImportance[df.splitCount >= i],
                        ax=ax, c=c, bw_adjust=0.5)  # bw low show sharper distributions

        if find_automatic_best:
            potential_best = self.find_split_count_th(min_split_count, max_split_count)
            sns.kdeplot(df.logImportance[df.splitCount >= potential_best],
                        ax=ax, c=colors[potential_best-1], bw_adjust=0.5, lw=8, linestyle=':')
            best_split = [str(x) if x != potential_best else str(x)+'*' for x in range(
                min_split_count, max_split_count+1)]
        else:
            best_split = list(range(min_split_count, max_split_count+1))

        ax.legend(title='Minimum split counts in distribution')
        ax.legend(labels=best_split, bbox_to_anchor=(1, 1))
        ax.set_xlabel(xLabel)
        ax.set_ylabel(yLabel)

    def plot_log_hist(self, ax, split_count, bins=120, xLabel='log(importance)', yLabel='count'):
        """
        Ploting the log histogram for the chosen split_count
        :param ax: Matplotlib axis as a canvas for this plot.
        :param split_count: Minimum split count threshold for the plot.
        :param bins: Number of bins in the histogram
        :param xLabel: Label on the x-axis of the plot.
        :param yLabel: Label on the y-axis of the plot.
        """

        assert bins > 0, 'bins should be bigger than 0'
        assert split_count > 0, 'split_count should be bigger than 0'
        assert type(xLabel) == str, 'xLabel should be a string'
        assert type(yLabel) == str, 'yLabel should be a string'

        df = self.df_
        sns.histplot(df.logImportance[df.splitCount >= split_count], ax=ax, bins=bins)
        ax.set_xlabel(xLabel)
        ax.set_ylabel(yLabel)

    def plot(self, ax):
        self.local_fdr.plot(ax)

    def compute_fdr(self, countThreshold=2, local_fdr_cutoff=0.05, bins=120):
        """
        Compute the FDR and p-values of the SNPs.
        :param countThreshold: The split count threshold for the SNPs to be considered.
        :param local_fdr_cutoff: Threshold of False positives over total of genes
        :param bins: number of bins to which the log importances will be aggregated
        :return: A tuple with a dataframe containing the SNPs and their p-values,
        and the expected FDR for the significant genes.
        """

        assert countThreshold > 0, 'countThreshold should be bigger than 0'
        assert 0 < local_fdr_cutoff < 1, 'local_fdr_cutoff threshold should be between 0 and 1'

        impDfWithLog = self.df_[self.df_.splitCount >= countThreshold]
        impDfWithLog = impDfWithLog[['variant_id', 'logImportance']].set_index('variant_id').squeeze()

        self.local_fdr = LocalFdr()
        self.local_fdr.fit(impDfWithLog, bins)
        pvals = self.local_fdr.get_pvalues()
        fdr, mask = self.local_fdr.get_fdr(local_fdr_cutoff)
        return (
            impDfWithLog.reset_index().assign(pvalue=pvals, is_significant=mask),
            fdr
        )
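To show how the class above fits together, here is a usage sketch on synthetic data (the column layout is the one `from_imp_df` expects; with a Hail importance table, `from_imp_table` would be used instead, and a real run would pass genuine VariantSpark importances rather than random values):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from varspark.hail.lfdrvs import LocalFdrVs

    # Synthetic importance table with the columns expected by from_imp_df.
    rng = np.random.default_rng(13)
    n = 5000
    df = pd.DataFrame({
        'locus': [('22', 16050000 + i) for i in range(n)],
        'alleles': [['A', 'C']] * n,
        'importance': rng.gamma(shape=2.0, scale=1e-4, size=n),
        'splitCount': rng.integers(1, 8, size=n),
    })

    lfdr_vs = LocalFdrVs.from_imp_df(df)

    # Inspect the log-importance densities for different split-count thresholds.
    fig, ax = plt.subplots()
    lfdr_vs.plot_log_densities(ax, min_split_count=1, max_split_count=6)

    # P-values and the expected FDR at a 5% local-FDR cutoff; real importances
    # give a better-behaved density fit than this toy example.
    result_df, fdr = lfdr_vs.compute_fdr(countThreshold=2, local_fdr_cutoff=0.05, bins=120)
    print(result_df.head(), fdr)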
