FlowGrid is a density-based clustering algorithm that performs fast and accurate clustering on very large scRNA-seq data sets. It can be used with Scanpy for fast clustering of Scanpy AnnData objects.
FlowGrid supports pip installation.
pip install FlowGrid / pip3 install FlowGrid
Running FlowGrid within Scanpy for scRNA-seq analysis
requirement | location |
---|---|
Package: Scanpy | https://scanpy.readthedocs.io/en/stable/ |
Data: Mouse Brain data set [https://www.nature.com/articles/s41593-017-0029-5?WT.feed_name=subjects_molecular-biology] | https://storage.googleapis.com/h5ad/2017-12-Hrvatin-et-al-NNeuroscience/GSE102827_merged_all_raw.h5ad |
The results of the steps below, along with a detailed workflow, can be found in FlowGrid_Example.ipynb.
pip install FlowGrid
pip install scanpy
import FlowGrid
import scanpy as sc
#You can change your file location here
adata = sc.read('~/GSE102827_merged_all_raw.h5ad')
#Normalization
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)
sc.pp.log1p(adata)
adata.raw = adata
#Highly variable genes selection
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
adata = adata[:, adata.var['highly_variable']].copy()
#PCA to 5 dimensions
sc.tl.pca(adata, n_comps=5)
You can use autoFlowGrid to cluster the data automatically.
#recomm_parameters = FlowGrid.autoFlowGrid(adata, int(set_n), list(binN_range), list(eps_range), list(MinDenB_range), list(MinDenC_range))
FlowGrid scales well, so autoFlowGrid can search a wide parameter space of bin_n and eps by default: eps = [1,2,3,4,5] and bin_n = [6,7,...,25]. autoFlowGrid iterates over the promising combinations of bin_n and eps with an effective pruning strategy. Users can also specify binN_range and eps_range to narrow the search and reduce computation time.
Sample usage is as follows:
recomm_parameters, CHI_reports = FlowGrid.autoFlowGrid(adata, 5)
#neighbor graph
sc.pp.neighbors(adata, n_neighbors=30, n_pcs=5)
#umap
sc.tl.umap(adata)
#results of recommended parameters
for i in range(len(recomm_parameters)):
    sc.pl.umap(adata, color=recomm_parameters[i], frameon=False)
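To give a sense of the size of the default search space described above, the following sketch enumerates every (bin_n, eps) pair. It only illustrates the exhaustive grid; it does not reproduce autoFlowGrid's pruning strategy, and the variable names are illustrative rather than part of the FlowGrid API.

```python
from itertools import product

# Default ranges quoted in the text above (illustrative variable names).
eps_range = [1, 2, 3, 4, 5]
bin_n_range = list(range(6, 26))  # 6, 7, ..., 25

# Exhaustive grid of candidate settings. autoFlowGrid prunes this space
# internally; that pruning is not modelled here.
param_grid = list(product(bin_n_range, eps_range))

print(len(param_grid))  # number of (bin_n, eps) combinations
print(param_grid[:3])   # first few candidates
```

Passing shorter binN_range and eps_range lists shrinks this grid proportionally, which is why narrowing them reduces run time.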
You can also specify the parameters yourself to run the clustering.
#FlowGrid.cluster(adata, int(binN), float(eps), int(MinDenB), int(MinDenC))
binN is the number of bins used for the grid. The recommended range for binN is 10 to 25; a larger binN should result in more clusters.
eps is the maximum distance between two bins. The recommended range for eps is 1.0 to 2.5; a larger eps should result in fewer clusters.
Sample usage is as follows:
FlowGrid.cluster(adata, 10, 1.2)
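As a rough illustration of how binN and eps interact, the sketch below bins toy 2-D points into a grid and then links occupied bins that lie within eps of each other. It is a simplified stand-in written for this README, not FlowGrid's actual implementation: it assumes binN bins per dimension after min-max scaling, uses Euclidean distance between bin indices, and ignores MinDenB and MinDenC. The helper names `bin_points` and `count_groups` are hypothetical.

```python
import math
from itertools import combinations

def bin_points(points, bin_n):
    """Map each point to integer bin coordinates in [0, bin_n - 1] per dimension."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    binned = []
    for p in points:
        coords = []
        for d in range(dims):
            span = (highs[d] - lows[d]) or 1.0  # avoid division by zero
            idx = int((p[d] - lows[d]) / span * bin_n)
            coords.append(min(idx, bin_n - 1))  # the maximum goes in the last bin
        binned.append(tuple(coords))
    return binned

def count_groups(bins, eps):
    """Count connected groups of occupied bins; bins at distance <= eps are linked."""
    occupied = sorted(set(bins))
    parent = {b: b for b in occupied}

    def find(b):
        # Union-find lookup with path halving.
        while parent[b] != b:
            parent[b] = parent[parent[b]]
            b = parent[b]
        return b

    for a, b in combinations(occupied, 2):
        if math.dist(a, b) <= eps:
            parent[find(a)] = find(b)
    return len({find(b) for b in occupied})

# Two small, well-separated blobs of toy data.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0),
          (1.0, 1.0), (0.9, 0.9), (0.8, 1.0)]

for bin_n, eps in [(5, 1.0), (10, 1.0), (10, 1.5)]:
    groups = count_groups(bin_points(points, bin_n), eps)
    print(f"binN={bin_n}, eps={eps}: {groups} group(s)")
```

On this toy data, increasing binN at a fixed eps splits the points into more groups, while increasing eps at a fixed binN merges them into fewer, which matches the trends described above.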
The adjusted Rand index (ARI) can be calculated when reference labels are available. It can also be used to compare FlowGrid results against Louvain, or against FlowGrid runs with different parameters.
#FlowGrid.AdjustedRandScore(adata, list[predlabel_list], list[reflabel_list])
predlabel_list is the list of cluster labels to evaluate.
reflabel_list is the list of labels to use as the reference.
Sample usage is as follows:
FlowGrid.AdjustedRandScore(adata, ['binN_10_eps_1.0_FlowGrid', 'louvain'], ['maintype', 'celltype'])
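For reference, the adjusted Rand index compares two labelings through their contingency table and corrects the raw Rand index for chance agreement. The function below is a minimal pure-Python sketch of the standard definition. It is not taken from FlowGrid's source, and `adjusted_rand_index` is a hypothetical name used only here.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treat as perfect agreement

    # Pairs of items placed together in both labelings, in A only, in B only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / total_pairs  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (sum_cells - expected) / (max_index - expected)

# Label names do not matter, only which items are grouped together.
print(adjusted_rand_index([0, 0, 1, 1], ["a", "a", "b", "b"]))
print(adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1]))
```

A score of 1 means the two labelings agree exactly up to renaming of clusters, and scores near 0 mean the agreement is no better than chance.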
Unnecessary results can be removed to keep AnnData.obs clean.
#FlowGrid.keep_labels(adata, list[remain_list])
remain_list is the list of FlowGrid clustering results you want to keep.
Sample usage is as follows:
FlowGrid.keep_labels(adata, ['binN_9_eps_1.1_FlowGrid', 'binN_10_eps_1.0_FlowGrid'])
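Conceptually, this amounts to dropping every FlowGrid result that is not in remain_list while leaving other annotations untouched. The sketch below mimics that on a plain dictionary standing in for adata.obs. The `_FlowGrid` suffix test and the helper name `keep_flowgrid_labels` are assumptions inferred from the result names shown above, not FlowGrid's actual code.

```python
def keep_flowgrid_labels(obs, remain_list, suffix="_FlowGrid"):
    """Return a copy of obs without FlowGrid columns that are not in remain_list."""
    keep = set(remain_list)
    return {
        name: values
        for name, values in obs.items()
        # Non-FlowGrid columns are always kept (assumed behaviour).
        if not name.endswith(suffix) or name in keep
    }

# Plain dict used as a stand-in for adata.obs (hypothetical values).
obs = {
    "louvain": ["0", "1", "1"],
    "binN_9_eps_1.1_FlowGrid": [1, 1, 2],
    "binN_10_eps_1.0_FlowGrid": [1, 2, 2],
    "binN_12_eps_1.5_FlowGrid": [1, 2, 3],
}

cleaned = keep_flowgrid_labels(
    obs, ["binN_9_eps_1.1_FlowGrid", "binN_10_eps_1.0_FlowGrid"]
)
print(sorted(cleaned))
```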
ConsensusFlowGrid can be used for high-dimensional data.
Sample usage is as follows:
sc.tl.pca(adata, n_comps=20)
consensusResult = FlowGrid.consensusFlowGrid(adata, nDims=20)
MIT