xiayuan-huang/FlowGrid

 
 

FlowGrid

FlowGrid is a density-based clustering algorithm that performs fast and accurate clustering on very large scRNA-seq data sets. It can be used together with Scanpy to cluster Scanpy AnnData objects quickly.

Installation

FlowGrid supports pip installation.

pip install FlowGrid    # or: pip3 install FlowGrid

Example 1:

Running FlowGrid within Scanpy for scRNA-seq analysis

Requirements and where to find them:
Package: Scanpy (https://scanpy.readthedocs.io/en/stable/)
Data: Mouse Brain data set (paper: https://www.nature.com/articles/s41593-017-0029-5?WT.feed_name=subjects_molecular-biology, file: https://storage.googleapis.com/h5ad/2017-12-Hrvatin-et-al-NNeuroscience/GSE102827_merged_all_raw.h5ad)

Reminder

The results of the steps below and the detailed workflow can be found in FlowGrid_Example.ipynb.

Install the packages

pip install FlowGrid
pip install scanpy

Import the packages

import FlowGrid
import scanpy as sc

Load the data

# Change this path to wherever you saved the downloaded file
adata = sc.read('~/GSE102827_merged_all_raw.h5ad')

Preprocess

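# Normalize every cell to 10,000 total counts, then log-transform
sc.pp.normalize_per_cell(adata, counts_per_cell_after=1e4)
sc.pp.log1p(adata)
# Keep the normalized, log-transformed data for later reference
adata.raw = adata
# Select highly variable genes
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
# .copy() turns the subset into a standalone object instead of a view
adata = adata[:, adata.var['highly_variable']].copy()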
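
To make the normalization step concrete, here is a minimal pure-Python sketch of the arithmetic it performs on a single cell: scale the counts so they sum to 10,000, then apply log(1 + x). The helper name normalize_and_log and the two toy cells are hypothetical and only illustrate the idea; this is not Scanpy's implementation.

```python
from math import log1p

def normalize_and_log(counts, target_sum=1e4):
    """Scale one cell's counts to sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # A cell with no counts has nothing to scale; return zeros.
        return [0.0 for _ in counts]
    return [log1p(c * target_sum / total) for c in counts]

# Two hypothetical cells with the same expression profile
# but a tenfold difference in sequencing depth.
shallow_cell = [1, 3, 0, 6]
deep_cell = [10, 30, 0, 60]

shallow_norm = normalize_and_log(shallow_cell)
deep_norm = normalize_and_log(deep_cell)

# After normalization the two cells are directly comparable.
print(shallow_norm)
print(deep_norm)
```

Both cells end up with the same values, which is the point of per-cell normalization: differences in sequencing depth should not drive the clustering.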

PCA for dimensionality reduction

# Reduce the data to 5 principal components
sc.tl.pca(adata, n_comps=5)

Cluster using FlowGrid

You can use autoFlowGrid to cluster the data automatically.

#recomm_parameters, CHI_reports = FlowGrid.autoFlowGrid(adata, int(set_n), list(binN_range), list(eps_range), list(MinDenB_range), list(MinDenC_range))

FlowGrid scales very well, so autoFlowGrid can search a wide parameter space of bin_n and eps, where eps = [1, 2, 3, 4, 5] and bin_n takes every integer from 6 to 25. autoFlowGrid iterates over the promising combinations of bin_n and eps using an effective pruning strategy. Users can also specify binN_range and eps_range to reduce computation time.

Sample usage is as follows:

recomm_parameters, CHI_reports = FlowGrid.autoFlowGrid(adata, 5)
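
As a rough illustration of the parameter space described above, the sketch below enumerates the full bin_n and eps grid and a narrower, hypothetical one. It does not reproduce autoFlowGrid's pruning strategy; it only shows how restricting the ranges shrinks the number of candidate combinations. The commented-out call follows the positional order of the signature shown earlier and is an assumption about how custom ranges would be passed.

```python
from itertools import product

# Full search space described in the text.
eps_range = [1, 2, 3, 4, 5]
bin_n_range = list(range(6, 26))  # every integer from 6 to 25

full_grid = list(product(bin_n_range, eps_range))

# A narrower, hypothetical search space chosen only for illustration.
narrow_bin_n_range = list(range(10, 16))
narrow_eps_range = [1, 2]
narrow_grid = list(product(narrow_bin_n_range, narrow_eps_range))

print(len(full_grid), len(narrow_grid))

# Assumed usage with custom ranges, based on the signature above:
# recomm_parameters, CHI_reports = FlowGrid.autoFlowGrid(
#     adata, 5, narrow_bin_n_range, narrow_eps_range)
```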

Visualize the result

# Build the neighbor graph
sc.pp.neighbors(adata, n_neighbors=30, n_pcs=5)
# Compute the UMAP embedding
sc.tl.umap(adata)

# Plot the clustering for each recommended parameter set
for label in recomm_parameters:
    sc.pl.umap(adata, color=label, frameon=False)

NOTE

Run FlowGrid with specified parameters

You can also run the clustering with parameters you specify yourself.

#FlowGrid.cluster(adata, int(binN), float(eps), int(MinDenB), int(MinDenC))

binN is the number of bins in the grid. The recommended range for binN is 10 to 25; a larger binN should produce more clusters.
eps is the maximum distance between two bins. The recommended range for eps is 1.0 to 2.5; a larger eps should produce fewer clusters.
Sample usage is as follows:

FlowGrid.cluster(adata, 10, 1.2)
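
To give some intuition for binN and eps, here is a minimal pure-Python sketch of grid binning: each point is mapped to a tuple of bin indices, bin densities are counted, and two bins are treated as connected when the Euclidean distance between their indices is at most eps. The function names bin_points and bins_are_neighbours and the toy points are hypothetical, and the connectivity rule is an assumption made for illustration; this is not FlowGrid's actual implementation.

```python
from collections import Counter
from math import dist

def bin_points(points, bin_n):
    """Map each point to a tuple of integer bin indices in [0, bin_n - 1]."""
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    binned = []
    for p in points:
        idx = []
        for d in range(dims):
            span = highs[d] - lows[d]
            # Scale to [0, 1]; a constant dimension falls into bin 0.
            frac = (p[d] - lows[d]) / span if span > 0 else 0.0
            # The maximum value would land in bin_n, so clamp it to the last bin.
            idx.append(min(int(frac * bin_n), bin_n - 1))
        binned.append(tuple(idx))
    return binned

def bins_are_neighbours(a, b, eps):
    """Assumed rule: bins are connected if their index distance is at most eps."""
    return dist(a, b) <= eps

# Toy 2-D points forming two well-separated groups.
points = [(0.0, 0.0), (0.05, 0.05), (1.0, 1.0), (0.95, 1.0)]
binned = bin_points(points, bin_n=10)
density = Counter(binned)

print(density)
print(bins_are_neighbours((0, 0), (1, 1), eps=1.5))
print(bins_are_neighbours((0, 0), (1, 1), eps=1.0))
```

With more bins the same points spread over more grid cells, which tends to split groups apart, while a larger eps connects bins that are further apart and so tends to merge groups.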

Compute adjusted Rand index when there are reference labels

The adjusted Rand index can be calculated when reference labels are available. It can also be used to compare FlowGrid results with Louvain results, or to compare results obtained with different parameters.

#FlowGrid.AdjustedRandScore(adata, list[predlabel_list], list[reflabel_list])

predlabel_list is the list of cluster labels to evaluate.
reflabel_list is the list of labels to use as the reference.
Sample usage is as follows:

FlowGrid.AdjustedRandScore(adata, ['binN_10_eps_1.0_FlowGrid', 'louvain'], ['maintype', 'celltype'])
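
For readers unfamiliar with the metric, the sketch below computes the adjusted Rand index with the standard pair-counting formula in pure Python. The function name adjusted_rand_index and the toy label lists are hypothetical, and this is not necessarily how FlowGrid.AdjustedRandScore is implemented.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(pred, ref):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(pred)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        # Fewer than two items: the labelings trivially agree.
        return 1.0
    # Pairs of items grouped together in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(pred, ref)).values())
    # Pairs grouped together in each labeling on its own.
    sum_pred = sum(comb(c, 2) for c in Counter(pred).values())
    sum_ref = sum(comb(c, 2) for c in Counter(ref).values())
    expected = sum_pred * sum_ref / total_pairs
    max_index = (sum_pred + sum_ref) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

# Same grouping under different label names.
same = adjusted_rand_index([0, 0, 1, 1], ['a', 'a', 'b', 'b'])
# Groupings that cut across each other.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same, crossed)
```

A score of 1 means the two labelings define the same partition, while values near 0 or below indicate agreement no better than chance.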

Keep only valuable results

Unnecessary results can be removed to keep AnnData.obs clean.

#FlowGrid.keep_labels(adata, list[remain_list])

remain_list is the list of FlowGrid clustering results you want to keep.
Sample usage is as follows:

FlowGrid.keep_labels(adata,  ['binN_9_eps_1.1_FlowGrid', 'binN_10_eps_1.0_FlowGrid'])
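
The label names in the samples above appear to follow a binN_<n>_eps_<e>_FlowGrid pattern. Assuming that pattern holds, a small helper like the one below could build remain_list programmatically instead of typing the names by hand. The helper names, the max_bin_n filter, and the example column list are hypothetical.

```python
import re

# Assumed naming pattern, inferred from the sample labels in this README.
LABEL_PATTERN = re.compile(r"^binN_(\d+)_eps_(\d+(?:\.\d+)?)_FlowGrid$")

def parse_flowgrid_label(label):
    """Return (binN, eps) for a FlowGrid label, or None if it does not match."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))

def select_labels(labels, max_bin_n):
    """Keep FlowGrid labels whose binN is at most max_bin_n."""
    kept = []
    for label in labels:
        parsed = parse_flowgrid_label(label)
        if parsed is not None and parsed[0] <= max_bin_n:
            kept.append(label)
    return kept

# Hypothetical column names, standing in for list(adata.obs.columns).
obs_columns = [
    'binN_9_eps_1.1_FlowGrid',
    'binN_10_eps_1.0_FlowGrid',
    'binN_20_eps_2.0_FlowGrid',
    'louvain',
]
remain_list = select_labels(obs_columns, max_bin_n=10)
print(remain_list)
```

The resulting remain_list could then be passed to FlowGrid.keep_labels as in the sample usage.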

consensusFlowGrid

ConsensusFlowGrid can be used for high-dimensional data.

Sample usage is as follows:

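# Reduce to 20 principal components, matching nDims below
sc.tl.pca(adata, n_comps=20)
# Assumes consensusFlowGrid is exposed on the FlowGrid module like the functions above
consensusResult = FlowGrid.consensusFlowGrid(adata, nDims=20)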

License

MIT

About

Ultra-fast clustering of very large single cell RNA-seq data
