
Commit 58ef2a0

rocreguant (Roc Reguant Comellas) and piotrszul authored
Implements p-value calculation from python (#208)
* Updating the READMEs
* updated python requirements
* add missing library
* adding plotting libraries
* initial p-values calculation. Optimization method and parameters to be updated
* adding a p-value computation example
* adding test for pval computation
* added new test to the battery of tests
* finalized the tests, some hidden
* corrected small bug in testing
* fixing a library import CI bug
* fixing the matplotlib imports error for the tests, plus hiding private functions for pval calculation
* fixing typo
* testing different scipy versions
* fixing declarations
* commenting out the plotting libraries before deciding what to do with them
* Removed debugging and updated fitting function selection to best of three
* removed tsv saving file
* fixing styling
* fixing edge cases for the fitting
* fixing style
* integrating pvalue calculation to variant spark
* formatting files
* changing import order
* changing data path fix
* fixing dependency problems
* fixing dependency problems
* cleaning requirements
* remove unused code
* fix path
* adding pvalues readme
* fixing the readme visualization
* link to biorxiv
* Added refactored interface for local FDR calculation.
* refactored into two classes, need to make it work
* saving the new structure
* theoretically local fdr vs done
* last commit of the day
* working local fdr
* everything working
* wip
* almost there
* everything should be working
* final touches
* fixed test
* fixing requirements
* fixing requirements 2
* removing antique file
* removing magic number
* change in nomenclature
* small description
* bug fix
* update test
* Fixing an error in FDR calculation and small refactoring.
* Fixing pvalue tests

Co-authored-by: Roc Reguant Comellas <[email protected]>
Co-authored-by: Piotr Szul <[email protected]>
1 parent 8ddbb12 commit 58ef2a0

14 files changed, +1021 -12 lines changed

README.md

Lines changed: 2 additions & 2 deletions
@@ -100,7 +100,7 @@ variant-spark comes with a few example scripts in the `scripts` directory that d
 
 There is a few small data sets in the `data` directory suitable for running on a single machine. For example
 
-./scripts/local_run-importance-ch22.sh
+./examples/local_run-importance-ch22.sh
 
 runs variable importance command on a small sample of the chromosome 22 vcf file (from 1000 Genomes Project)
 
@@ -120,7 +120,7 @@ You can choose a different location by setting the `VS_DATA_DIR` environment var
 
 After the test data has been successfully copied to HDFS you can run examples scripts, e.g.:
 
-./scripts/yarn_run-importance-ch22.sh
+./examples/yarn_run-importance-ch22.sh
 
 Note: if you installed the data to a non default location the `VS_DATA_DIR` needs to be set accordingly when running the examples

dev/dev-requirements.txt

Lines changed: 4 additions & 0 deletions
@@ -13,3 +13,7 @@ pandas==1.1.4
 typedecorator==0.0.5
 Jinja2==3.0.3
 hail==0.2.74
+numpy==1.21.2
+patsy==0.5.2
+statsmodels==0.13.2
+seaborn==0.11.2

dev/py-test.sh

Lines changed: 1 addition & 0 deletions
@@ -11,4 +11,5 @@ cd "$FWDIR"
 pushd python
 pytest -s -m spark
 pytest -s -m hail
+pytest -s -m pvalues
 popd

examples/compute_local_fdr.ipynb

Lines changed: 524 additions & 0 deletions
Large diffs are not rendered by default.

examples/run_importance_chr22.ipynb

Lines changed: 2 additions & 2 deletions
@@ -135,7 +135,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
@@ -149,7 +149,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.12"
+"version": "3.8.12"
 }
 },
 "nbformat": 4,

examples/run_importance_chr22_with_hail.ipynb

Lines changed: 6 additions & 6 deletions
@@ -77,7 +77,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Step 2: Load labels into Hail table `labels`."
+"Step 3: Load labels into Hail table `labels`."
 ]
 },
 {
@@ -115,7 +115,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Step 3: Annotate dataset samples with labels."
+"Step 4: Annotate dataset samples with labels."
 ]
 },
 {
@@ -170,7 +170,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Step 4: Build the random forest model with `label.x22_16050408` as the respose variable."
+"Step 5: Build the random forest model with `label.x22_16050408` as the respose variable."
 ]
 },
 {
@@ -196,7 +196,7 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Step 5: Display the results: print OOB error calculated variable importance."
+"Step 6: Display the results: print OOB error calculated variable importance."
 ]
 },
 {
@@ -291,7 +291,7 @@
 ],
 "metadata": {
 "kernelspec": {
-"display_name": "Python 3",
+"display_name": "Python 3 (ipykernel)",
 "language": "python",
 "name": "python3"
 },
@@ -305,7 +305,7 @@
 "name": "python",
 "nbconvert_exporter": "python",
 "pygments_lexer": "ipython3",
-"version": "3.6.12"
+"version": "3.8.12"
 }
 },
 "nbformat": 4,

python/README.md

Lines changed: 3 additions & 1 deletion
@@ -73,7 +73,9 @@ For more information about how the VariantSpark wide random forest algorithm wor
 
 Install VariantSpark for development using this command:
 
-git https://github.com/aehrc/VariantSpark.git
+git clone https://github.com/aehrc/VariantSpark.git
+mvn clean install
+pip install -r dev/dev-requirements.txt
 cd VariantSpark/python
 pip install -e .
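A quick way to confirm that the development install and the dependency pins added above resolved correctly is a short import check, for example (a minimal sketch; `varspark` is the package installed by `pip install -e .`):

    # Verify that the newly pinned p-value dependencies and varspark import cleanly.
    import numpy, patsy, statsmodels, seaborn
    import varspark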

python/readme_pvalues.md

Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
# Threshold Values for the Gini Variable Importance

Random Forests are machine learning methods commonly used to model data. They are highly scalable
and robust to overfitting while modelling non-linearities. Using an empirical Bayes approach, we
were able to quantify the importance of the variants in our models, thus improving the
interpretability of such algorithms.

A more detailed explanation can be found in the [manuscript](https://www.biorxiv.org/).

## Requirements

python==3.8.12 \
numpy==1.21.2 \
pandas==1.4.1 \
patsy==0.5.2 \
scipy==1.7.3 \
statsmodels==0.13.2

## Usage

The expected input is a Pandas Series in which the values are the logarithm of the importances.
The method returns a dictionary with the FDR values (array), the estimates for the fitted
function (array of length three), and the p-values for the statistically significant
variants (array).

The code can also be used stand-alone, requiring only the script file, which can be found at:
https://github.com/aehrc/VariantSpark/blob/918c80be28818b8872ce346cbb2092da5c4d2ced/python/varspark/pvalues_calculation.py

A hands-on Jupyter notebook with a step-by-step walkthrough from importances to p-values can be found at:
https://github.com/aehrc/VariantSpark/blob/918c80be28818b8872ce346cbb2092da5c4d2ced/examples/computing_p-value_example.ipynb

However, this method is also integrated with VariantSpark, which enables the p-value calculation
with a single function call. Training a model, calculating the p-values, and retrieving them can be
done in a few lines of code using VariantSpark, as shown in the following snippet:

    vds = hl.import_vcf(os.path.join(PROJECT_DIR, 'data/chr22_1000.vcf'))
    labels = hl.import_table(os.path.join(PROJECT_DIR, 'data/chr22-labels-hail.csv'),
                             impute=True, delimiter=",").key_by('sample')

    vds = vds.annotate_cols(label=labels[vds.s])
    rf_model = vshl.random_forest_model(y=vds.label['x22_16050408'], x=vds.GT.n_alt_alleles(),
                                        seed=13, mtry_fraction=0.05, min_node_size=5,
                                        max_depth=10)
    rf_model.fit_trees(100, 50)

    significant_variants = rf_model.get_significant_variances()

Notes: If you wish to use the VariantSpark implementation, please consider reading more about the
tool [here](https://github.com/aehrc/VariantSpark/blob/master/README.md) and [here](https://github.com/aehrc/VariantSpark/blob/master/python/README.md).

## Citation

If you use this method please consider citing us:

Dunne, R. ... (2022). Threshold Values for the Gini Variable Importance: An Empirical Bayes
Approach. arXiv preprint arXiv:XXXX.XXXXX.
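For the stand-alone usage described above, the following is a minimal sketch based on the refactored `LocalFdr` interface from `varspark.stats.lfdr` that this commit introduces (see `python/varspark/hail/lfdrvs.py` below); the exact entry points of `pvalues_calculation.py` itself may differ, and the input Series here is synthetic:

    import numpy as np
    import pandas as pd
    from varspark.stats.lfdr import LocalFdr

    # Synthetic stand-in for the log-importances produced by a VariantSpark run.
    rng = np.random.default_rng(13)
    log_importances = pd.Series(np.log(rng.gamma(shape=2.0, scale=1e-4, size=10000)),
                                name='logImportance')

    local_fdr = LocalFdr()
    local_fdr.fit(log_importances, 120)            # fit the importance density on 120 bins
    pvals = local_fdr.get_pvalues()                # one p-value per variant
    fdr, is_significant = local_fdr.get_fdr(0.05)  # expected FDR and significance mask at a 5% cutoff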

python/requirements.txt

Lines changed: 8 additions & 1 deletion
@@ -1,7 +1,14 @@
 # varspark dependencies
 # python 3.7
+Jinja2==3.0.3
 pandas==1.1.4
 typedecorator==0.0.5
 hail==0.2.74
 pyspark==3.1.2
-Jinja2==3.0.3
+scipy==1.7.3
+numpy==1.21.2
+patsy==0.5.2
+statsmodels==0.13.2
+seaborn==0.11.2
+
+

python/varspark/hail/lfdrvs.py

Lines changed: 131 additions & 0 deletions
@@ -0,0 +1,131 @@
from varspark.stats.lfdr import *


class LocalFdrVs:
    local_fdr: object
    df_: object

    def __init__(self, df):
        """
        Constructor class
        :param df: Takes a pandas dataframe as argument with three columns: variant_id,
        logImportance and splitCount.
        """
        self.df_ = df.sort_values('logImportance', ascending=True)

    @classmethod
    def from_imp_df(cls, df):
        """
        Alternative class instantiation from a pandas dataframe
        :param cls: LocalFdrVs class
        :param df: Pandas dataframe with columns locus, alleles, importance, and splitCount.
        :return: Initialized class instance.
        """
        df = df.assign(logImportance = np.log(df.importance))
        df['variant_id'] = df.apply(lambda row: str(row['locus'][0])+'_'+str(row['locus'][1])+'_'+ \
                                    str('_'.join(row['alleles'])), axis=1)
        return cls(df[['variant_id', 'logImportance', 'splitCount']])

    @classmethod
    def from_imp_table(cls, impTable):
        """
        Alternative class instantiation from a Hail Table (VariantSpark users).
        :param cls: LocalFdrVs class
        :param impTable: Hail table with locus, alleles, importance, and splitCount.
        :return: Initialized class instance.
        """
        impTable = impTable.filter(impTable.splitCount >= 1)
        return LocalFdrVs.from_imp_df(impTable.to_spark(flatten=False).toPandas())

    def plot_log_densities(self, ax, min_split_count=1, max_split_count=6, palette='Set1',
                           find_automatic_best=False, xLabel='log(importance)', yLabel='density'):
        """
        Plotting the log densities to visually identify the unimodal distributions.
        :param ax: Matplotlib axis as a canvas for this plot.
        :param min_split_count: n>=1, from which the split count plotting starts.
        :param max_split_count: when to stop the split count filtering.
        :param find_automatic_best: The user may let the computer highlight the potential best option.
        :param palette: Matplotlib color palette used for the plotting.
        :param xLabel: Label on the x-axis of the plot.
        :param yLabel: Label on the y-axis of the plot.
        """

        assert min_split_count < max_split_count, 'min_split_count should be smaller than max_split_count'
        assert min_split_count > 0, 'min_split_count should be bigger than 0'
        assert type(palette) == str, 'palette should be a string'
        assert type(xLabel) == str, 'xLabel should be a string'
        assert type(yLabel) == str, 'yLabel should be a string'

        n_lines = max_split_count - min_split_count + 1
        colors = sns.mpl_palette(palette, n_lines)
        df = self.df_
        for i, c in zip(range(min_split_count, max_split_count + 1), colors):
            sns.kdeplot(df.logImportance[df.splitCount >= i],
                        ax=ax, c=c, bw_adjust=0.5)  # bw low show sharper distributions

        if find_automatic_best:
            potential_best = self.find_split_count_th(min_split_count, max_split_count)
            sns.kdeplot(df.logImportance[df.splitCount >= potential_best],
                        ax=ax, c=colors[potential_best-1], bw_adjust=0.5, lw=8, linestyle=':')
            best_split = [str(x) if x != potential_best else str(x)+'*' for x in range(
                min_split_count, max_split_count+1)]
        else:
            best_split = list(range(min_split_count, max_split_count+1))

        ax.legend(title='Minimum split counts in distribution')
        ax.legend(labels=best_split, bbox_to_anchor=(1, 1))
        ax.set_xlabel(xLabel)
        ax.set_ylabel(yLabel)

    def plot_log_hist(self, ax, split_count, bins=120, xLabel='log(importance)', yLabel='count'):
        """
        Ploting the log histogram for the chosen split_count
        :param ax: Matplotlib axis as a canvas for this plot.
        :param split_count: Minimum split count threshold for the plot.
        :param bins: Number of bins in the histogram
        :param xLabel: Label on the x-axis of the plot.
        :param yLabel: Label on the y-axis of the plot.
        """

        assert bins > 0, 'bins should be bigger than 0'
        assert split_count > 0, 'split_count should be bigger than 0'
        assert type(xLabel) == str, 'xLabel should be a string'
        assert type(yLabel) == str, 'yLabel should be a string'

        df = self.df_
        sns.histplot(df.logImportance[df.splitCount >= split_count], ax=ax, bins=bins)
        ax.set_xlabel(xLabel)
        ax.set_ylabel(yLabel)

    def plot(self, ax):
        self.local_fdr.plot(ax)

    def compute_fdr(self, countThreshold=2, local_fdr_cutoff=0.05, bins=120):
        """
        Compute the FDR and p-values of the SNPs.
        :param countThreshold: The split count threshold for the SNPs to be considered.
        :param local_fdr_cutoff: Threshold of False positives over total of genes
        :param bins: number of bins to which the log importances will be aggregated
        :return: A tuple with a dataframe containing the SNPs and their p-values,
        and the expected FDR for the significant genes.
        """

        assert countThreshold > 0, 'countThreshold should be bigger than 0'
        assert 0 < local_fdr_cutoff < 1, 'local_fdr_cutoff threshold should be between 0 and 1'

        impDfWithLog = self.df_[self.df_.splitCount >= countThreshold]
        impDfWithLog = impDfWithLog[['variant_id', 'logImportance']].set_index('variant_id').squeeze()

        self.local_fdr = LocalFdr()
        self.local_fdr.fit(impDfWithLog, bins)
        pvals = self.local_fdr.get_pvalues()
        fdr, mask = self.local_fdr.get_fdr(local_fdr_cutoff)
        return (
            impDfWithLog.reset_index().assign(pvalue=pvals, is_significant=mask),
            fdr
        )
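To show how the class above fits together, here is a usage sketch on synthetic data (the column layout is the one `from_imp_df` expects; with a Hail importance table, `from_imp_table` would be used instead, and a real run would pass genuine VariantSpark importances rather than random values):

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from varspark.hail.lfdrvs import LocalFdrVs

    # Synthetic importance table with the columns expected by from_imp_df.
    rng = np.random.default_rng(13)
    n = 5000
    df = pd.DataFrame({
        'locus': [('22', 16050000 + i) for i in range(n)],
        'alleles': [['A', 'C']] * n,
        'importance': rng.gamma(shape=2.0, scale=1e-4, size=n),
        'splitCount': rng.integers(1, 8, size=n),
    })

    lfdr_vs = LocalFdrVs.from_imp_df(df)

    # Inspect the log-importance densities for different split-count thresholds.
    fig, ax = plt.subplots()
    lfdr_vs.plot_log_densities(ax, min_split_count=1, max_split_count=6)

    # P-values and the expected FDR at a 5% local-FDR cutoff; real importances
    # give a better-behaved density fit than this toy example.
    result_df, fdr = lfdr_vs.compute_fdr(countThreshold=2, local_fdr_cutoff=0.05, bins=120)
    print(result_df.head(), fdr)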
