Learning for single cell assignment • scLearn

Overview

scLearn is a learning-based framework designed to automatically infer quantitative measurements and similarity thresholds for single-cell assignment tasks. It achieves well-generalized performance across different single-cell types and is particularly robust in identifying novel cell types not present in reference datasets. scLearn introduces a multi-label single-cell assignment strategy for the first time, allowing simultaneous assignment of cell type and developmental stage, which is highly effective for cell development and lineage analysis.

The Overview of scLearn

☀️ Key Features

Robustness and Generalization: scLearn is designed to be robust across various single-cell assignment tasks, providing consistent performance regardless of the cell type or dataset used.
Efficiency in Novel Cell Type Identification: scLearn efficiently identifies novel cell types that are absent in the reference datasets, overcoming limitations of traditional methods that rely on predefined similarity thresholds.
Multi-Label Assignment: scLearn proposes a multi-label assignment strategy, allowing simultaneous assignment of cell type and developmental stage. This dual assignment is particularly useful for understanding cell lineage and development.
Pre-trained Models: scLearn comes with pre-trained models and comprehensive reference datasets for human and mammalian single cells, facilitating broad applications in single-cell assignment.

⏬ Installation

Release Version

BiocManager::install("scLearn")

Development Version

# install.packages("devtools")
devtools::install_github("DuanLab1/scLearn", dependencies = c("Depends", "Imports", "LinkingTo"))

library(scLearn)
library(tidyverse)
library(SingleCellExperiment)
library(M3Drop)
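Before loading the packages, it can help to confirm that they are installed. The following is a minimal sketch using only base R; the package list mirrors the library() calls above, and the variable names are illustrative rather than part of scLearn.

```r
# Packages loaded in this guide (taken from the library() calls above).
required_pkgs <- c("scLearn", "tidyverse", "SingleCellExperiment", "M3Drop")

# requireNamespace() returns FALSE instead of raising an error when a
# package is absent (it loads the namespace when the package is present).
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All required packages are available.")
}
```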

🚀 Quick Start

Single-label single cell assignment

Data preparation

Reference Cell Database: baron-human
Query Cell Data: Muraro-human

data(RefCellData)
RefCellData

## class: SingleCellExperiment 
## dim: 20125 1033 
## metadata(0):
## assays(2): counts logcounts
## rownames(20125): A1BG A1CF ... ZZZ3 pk
## rowData names(10): feature_symbol is_feature_control ... total_counts
##   log10_total_counts
## colnames(1033): human3_lib2.final_cell_0129 human3_lib2.final_cell_0359
##   ... human3_lib3.final_cell_0896 human4_lib1.final_cell_0568
## colData names(30): human cell_type1 ... pct_counts_ERCC is_cell_control
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

data(QueryCellData)
QueryCellData

## class: SingleCellExperiment 
## dim: 19127 1005 
## metadata(0):
## assays(2): normcounts logcounts
## rownames(19127): A1BG-AS1__chr19 A1BG__chr19 ... ZZEF1__chr17
##   ZZZ3__chr1
## rowData names(1): feature_symbol
## colnames(1005): D31.6_62 D30.4_62 ... D30.4_58 D31.7_42
## colData names(3): cell_type1 donor batch
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

Cell quality control

RefRawcounts <- assays(RefCellData)[[1]]
RefAnn <- as.character(RefCellData$cell_type1)
names(RefAnn) <- colnames(RefCellData)

RefDataQC <- Cell_qc(
  expression_profile = RefRawcounts, 
  sample_information_cellType = RefAnn, 
  sample_information_timePoint = NULL, 
  species = "Hs",
  gene_low = 500, 
  gene_high = 10000, 
  mito_high = 0.1, 
  umi_low = 1500, 
  umi_high = Inf, 
  logNormalize = TRUE, 
  plot = FALSE, 
  plot_path = "./quality_control.pdf")
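To make the threshold arguments more concrete, here is a base R sketch of per-cell filtering on a toy count matrix. It assumes that gene_low, umi_low, and mito_high act as bounds on the number of detected genes, the total UMI count, and the mitochondrial fraction per cell; the source does not spell this out, so treat it as an illustration of the general idea and not as the internals of Cell_qc. The toy data, the "MT-" gene prefix, and the threshold values are all made up.

```r
set.seed(1)

# Toy count matrix: 200 genes x 6 cells (hypothetical data).
counts <- matrix(
  rpois(200 * 6, lambda = 5),
  nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("cell", 1:6))
)
rownames(counts)[1:5] <- paste0("MT-", 1:5)  # pretend these are mitochondrial genes

# Turn the last cell into a low-quality cell: only three counts, all mitochondrial.
counts[, 6] <- 0L
counts[1:3, 6] <- 1L

# Per-cell quality metrics.
genes_detected <- colSums(counts > 0)
umi_total <- colSums(counts)
is_mito <- grepl("^MT-", rownames(counts))
mito_frac <- colSums(counts[is_mito, , drop = FALSE]) / umi_total

# Assumed meaning of the thresholds (toy values, not the ones used above).
gene_low <- 100
umi_low <- 500
mito_high <- 0.1
keep <- genes_detected >= gene_low & umi_total >= umi_low & mito_frac <= mito_high

counts_qc <- counts[, keep, drop = FALSE]
print(keep)
```

In this toy example only the deliberately degraded cell is dropped; with real data, inspecting the three metrics before choosing thresholds is usually worthwhile.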

Rare cell type filtering

RefDataQC_filtered <- Cell_type_filter(
  expression_profile = RefDataQC$expression_profile, 
  sample_information_cellType = RefDataQC$sample_information_cellType,
  sample_information_timePoint = NULL,
  min_cell_number = 10)
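The idea behind min_cell_number can be sketched in base R on a toy label vector. This is an assumption-laden illustration: the labels are hypothetical, and whether Cell_type_filter treats the boundary as inclusive is not stated in the source.

```r
# Toy annotation: one label per cell, named by cell ID (hypothetical data).
labels <- c(rep("alpha", 25), rep("beta", 12), rep("epsilon", 3))
names(labels) <- paste0("cell", seq_along(labels))

min_cell_number <- 10

# Count cells per type and keep only types with enough cells.
# Using ">=" here is an assumption about the boundary.
type_counts <- table(labels)
kept_types <- names(type_counts)[type_counts >= min_cell_number]
labels_filtered <- labels[labels %in% kept_types]

print(type_counts)
print(table(labels_filtered))
```

The expression matrix would then be subset to names(labels_filtered) so that cells and labels stay aligned.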

Feature selection

RefDataQC_HVG_names <- Feature_selection_M3Drop(
  expression_profile = RefDataQC_filtered$expression_profile,
  log_normalized = TRUE, 
  threshold = 0.05)

## Warning in bg__calc_variables(expr_mat): Warning: Removing 4382 undetected
## genes.


Model learning

Train the model. To improve accuracy for “unassigned” cells, you can increase “bootstrap_times”, at the cost of a longer running time. The default value of “bootstrap_times” is 10; the example below sets it to 1.

scLearn_model_learning_result <- scLearn_model_learning(
  high_varGene_names = RefDataQC_HVG_names,
  expression_profile = RefDataQC_filtered$expression_profile,
  sample_information_cellType = RefDataQC_filtered$sample_information_cellType,
  sample_information_timePoint = NULL,
  bootstrap_times = 1,
  cutoff = 0.01,
  dim_para = 0.999
)

## [1] "Bootstrapying 1"

Results: scLearn_model_learning_result is the final result object containing the processed reference cell matrix. It holds the following elements:

high_varGene_names: The set of highly variable genes (693 HVGs)
simi_threshold_learned: The similarity thresholds learned for the cell types
trans_matrix_learned: The learned transformation matrix (23 features × 693 HVGs)
feature_matrix_learned: The cell type feature matrix after DCA, obtained by dimensionality reduction of the highly variable gene set (12 cell types × 23 features)

Cell assignment

Assign the query cells with the model trained above. To get a less strict result for “unassigned” cells, you can decrease “diff” and “vote_rate”. If you are sure that every query cell type is present in the reference dataset, you can set “threshold_use” to FALSE, which skips the thresholds learned by scLearn.

QueryRawcounts <- assays(QueryCellData)[[1]]
QueryDataQC <- Cell_qc(
  expression_profile = QueryRawcounts, 
  species = "Hs", 
  gene_low = 50, 
  umi_low = 50)

rownames(QueryDataQC$expression_profile) <- gsub("__\\w+\\d+", "", rownames(QueryDataQC$expression_profile))

scLearn_predict_result <- scLearn_cell_assignment(
  scLearn_model_learning_result = scLearn_model_learning_result, 
  expression_profile_query = QueryDataQC$expression_profile, 
  diff = 0.05, 
  threshold_use = TRUE, 
  vote_rate = 0.6)

## [1] "The number of missing features in the query data is  36 "
## [1] "The rate of missing features in the query data is  0.051948051948052 "
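Because features are matched by gene name, it is worth checking what the gsub() pattern used above actually strips. The sketch below uses base R only. The first two identifiers come from the QueryCellData row names shown earlier; "GENE__chrX" is a hypothetical identifier added to show a suffix without trailing digits, and the alternative pattern is a suggestion, not something taken from scLearn.

```r
# Two identifiers in the style of the query row names, plus one hypothetical
# identifier whose chromosome suffix has no trailing digit.
query_gene_ids <- c("A1BG__chr19", "A1BG-AS1__chr19", "GENE__chrX")

# Pattern used in this guide: "__", word characters, then at least one digit.
stripped_original <- gsub("__\\w+\\d+", "", query_gene_ids)

# A more permissive alternative (assumption): drop everything from "__chr" onward.
stripped_general <- sub("__chr.*$", "", query_gene_ids)

print(stripped_original)
print(stripped_general)

# Genes that a hypothetical reference expects but the stripped names lack.
reference_genes <- c("A1BG", "GENE")
print(setdiff(reference_genes, stripped_original))
print(setdiff(reference_genes, stripped_general))
```

The original pattern leaves "GENE__chrX" unchanged because the suffix does not end in a digit, so such a gene would not match its reference name. Whether this affects a given dataset depends on its naming scheme.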

head(scLearn_predict_result)

##          Query_cell_id Predict_cell_type Additional_information
## D31.6_62      D31.6_62            acinar      0.910551385257761
## D30.4_62      D30.4_62            ductal      0.797792782512641
## D28.4_85      D28.4_85            acinar      0.898705438318699
## D28.2_87      D28.2_87            acinar      0.855877643727897
## D30.5_79      D30.5_79            ductal      0.846464074908766
## D28.2_42      D28.2_42            acinar      0.908734384816745

Results: The output has three columns. Query_cell_id is the cell ID, Predict_cell_type is the predicted cell type, and Additional_information gives the similarity score behind the prediction.

Cells whose Predict_cell_type is unassigned are those that could not be matched to any cell type in the reference set.
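The effect of threshold_use can be illustrated with a small base R sketch of threshold-based assignment. This is not scLearn's implementation: it uses Pearson correlation as a stand-in for the learned similarity measurement, the centroids, thresholds, and query values are made up, and the diff and vote_rate arguments are not modelled. The helper assign_cells is hypothetical.

```r
# Toy cell type features (cell types x features) and per-type thresholds.
centroids <- rbind(acinar = c(1, 0, 0), ductal = c(0, 1, 0))
thresholds <- c(acinar = 0.8, ductal = 0.8)

# Toy query cells in the same feature space; q3 resembles neither type.
query <- rbind(
  q1 = c(0.9, 0.1, 0),
  q2 = c(0.1, 0.95, 0.05),
  q3 = c(0.3, 0.3, 0.9)
)

# Similarity of every query cell to every cell type (cells x types).
sim <- cor(t(query), t(centroids))

# Hypothetical helper: pick the most similar type, then optionally reject
# cells whose best similarity falls below that type's threshold.
assign_cells <- function(sim, thresholds, threshold_use = TRUE) {
  best_idx <- max.col(sim, ties.method = "first")
  best_type <- colnames(sim)[best_idx]
  best_sim <- sim[cbind(seq_len(nrow(sim)), best_idx)]
  below <- best_sim < unname(thresholds[best_type])
  labels <- if (threshold_use) ifelse(below, "unassigned", best_type) else best_type
  setNames(labels, rownames(sim))
}

print(assign_cells(sim, thresholds, threshold_use = TRUE))
print(assign_cells(sim, thresholds, threshold_use = FALSE))
```

With thresholds enabled, q3 is reported as unassigned; with threshold_use = FALSE every cell is forced into its closest type, which is only appropriate when the query cannot contain novel cell types.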

Accuracy

QueryData_trueLabel <- as.character(QueryCellData$cell_type1)
names(QueryData_trueLabel) <- colnames(QueryCellData)

QueryData_CellType <- QueryData_trueLabel |> 
  as.data.frame() |>
  tibble::rownames_to_column("CellID") |>
  dplyr::rename(trueLabel = QueryData_trueLabel) |>  
  dplyr::inner_join(scLearn_predict_result |>
                      as.data.frame() |>
                      dplyr::rename(predLabel = Predict_cell_type),
                    by = c("CellID" = "Query_cell_id")) |>
  dplyr::select(CellID, trueLabel, predLabel, everything())

print(
  paste("Final Accuracy =", 
    sprintf("%1.2f%%",
      100 * sum(QueryData_CellType$predLabel == QueryData_CellType$trueLabel) / nrow(QueryData_CellType))))

## [1] "Final Accuracy = 88.99%"
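A single accuracy figure hides which cell types are confused and how "unassigned" cells are counted. The following base R sketch, on made-up labels, shows a confusion table plus accuracy computed with and without unassigned cells. The label vectors are hypothetical stand-ins for the trueLabel and predLabel columns built above.

```r
# Hypothetical true and predicted labels for six cells.
true_label <- c("acinar", "acinar", "ductal", "ductal", "alpha", "alpha")
pred_label <- c("acinar", "acinar", "ductal", "unassigned", "alpha", "beta")

# Confusion table: rows are true labels, columns are predictions.
confusion <- table(true = true_label, predicted = pred_label)
print(confusion)

# Overall accuracy: unassigned cells count as errors.
accuracy_all <- mean(pred_label == true_label)

# Accuracy restricted to cells that received a cell type.
assigned <- pred_label != "unassigned"
accuracy_assigned <- mean(pred_label[assigned] == true_label[assigned])

cat(sprintf("Accuracy (all cells) = %1.2f%%\n", 100 * accuracy_all))
cat(sprintf("Accuracy (assigned only) = %1.2f%%\n", 100 * accuracy_assigned))
```

Reporting both numbers makes it clear whether errors come from wrong labels or from cells left unassigned.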

Multi-label single cell assignment

Data preprocessing
Model learning
Cell assignment: For this demonstration, ESC.rds itself is reused as the query data to test multi-label single cell assignment.

Download ESC.rds

# loading the reference dataset
RefData_MLab <- readRDS("ESC.rds")
RefRawcounts_MLab <- assays(RefData_MLab)[[1]]
RefAnn_MLab <- as.character(RefData_MLab$cell_type1)
names(RefAnn_MLab) <- colnames(RefData_MLab)
RefAnn2_MLab <- as.character(RefData_MLab$cell_type2)
names(RefAnn2_MLab) <- colnames(RefData_MLab)

# cell quality control, rare cell type filtering, and feature selection
RefDataQC_MLab <- Cell_qc(
  expression_profile = RefRawcounts_MLab, 
  sample_information_cellType = RefAnn_MLab, 
  sample_information_timePoint = RefAnn2_MLab, 
  species = "Hs",
  gene_low = 500, 
  gene_high = 10000, 
  mito_high = 0.1, 
  umi_low = 1500, 
  umi_high = Inf, 
  logNormalize = TRUE, 
  plot = FALSE, 
  plot_path = "./quality_control.pdf")

RefDataQC_filtered_MLab <- Cell_type_filter(
  expression_profile = RefDataQC_MLab$expression_profile, 
  sample_information_cellType = RefDataQC_MLab$sample_information_cellType, 
  sample_information_timePoint = RefDataQC_MLab$sample_information_timePoint, 
  min_cell_number = 10)

RefDataQC_HVG_names_MLab <- Feature_selection_M3Drop(
  expression_profile = RefDataQC_filtered_MLab$expression_profile,
  log_normalized = TRUE, 
  threshold = 0.05)

# training the model
scLearn_model_learning_result <- scLearn_model_learning(
  high_varGene_names = RefDataQC_HVG_names_MLab, 
  expression_profile = RefDataQC_filtered_MLab$expression_profile, 
  sample_information_cellType = RefDataQC_filtered_MLab$sample_information_cellType, 
  sample_information_timePoint = RefDataQC_filtered_MLab$sample_information_timePoint, 
  bootstrap_times = 10, 
  cutoff = 0.01,
  dim_para = 0.999)

# loading the query cells and performing cell quality control
QueryData_MLab <- readRDS("ESC.rds")
QueryRawcounts_MLab <- assays(QueryData_MLab)[[1]]
### the true labels of this test dataset (optional)
# QueryAnn_MLab <- as.character(QueryData_MLab$cell_type1)
# names(QueryAnn_MLab) <- colnames(QueryData_MLab)
# QueryAnn2_MLab <- as.character(QueryData_MLab$cell_type2)
# names(QueryAnn2_MLab) <- colnames(QueryData_MLab)
# QueryRawcounts_MLab <- QueryRawcounts_MLab[, names(QueryAnn_MLab)]
# QueryDataQC_MLab <- Cell_qc(QueryRawcounts_MLab, QueryAnn_MLab, QueryAnn2_MLab, species = "Hs")
QueryDataQC_MLab <- Cell_qc(
  expression_profile = QueryRawcounts_MLab, 
  species = "Hs", 
  gene_low = 50, 
  umi_low = 50)
# Assignment with trained model above
scLearn_predict_result_MLab <- scLearn_cell_assignment(
  scLearn_model_learning_result = scLearn_model_learning_result, 
  expression_profile_query = QueryDataQC_MLab$expression_profile)

head(scLearn_predict_result_MLab)
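In the multi-label setting each cell carries two annotations, so both vectors must be named by the same cell IDs, and it is useful to see which cell type and time point combinations occur at all. Here is a base R sketch on toy data; the labels "ESC", "progenitor", "day0", and "day4" are hypothetical and not taken from ESC.rds.

```r
# Toy annotations for eight cells (hypothetical labels).
cell_ids <- paste0("cell", 1:8)
cell_type <- setNames(rep(c("ESC", "progenitor"), each = 4), cell_ids)
time_point <- setNames(rep(c("day0", "day4"), times = 4), cell_ids)

# Both label vectors should refer to the same cells in the same order.
stopifnot(identical(names(cell_type), names(time_point)))

# Cross-tabulate the two labels to see how many cells each combination has.
combo_counts <- table(cell_type = cell_type, time_point = time_point)
print(combo_counts)
```

Combinations with very few cells are worth noting before training, since a minimum cell number is applied during rare cell type filtering.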

Pre-trained Models

scLearn provides pre-trained models for 30 datasets and for 20 mouse organ datasets, covering a wide range of commonly used cell types and tissues. These models can be used directly for single-cell assignment tasks.

Pre-trained scLearn models for the 30 datasets

Pre-trained model names	Description	No. of cell types	Corresponding dataset (journal, year)
pancreas_mouse_baron.rds	Mouse pancreas	9	Baron_mouse(Cell Systems, 2016)
pancreas_human_baron.rds	Human pancreas	13	Baron_human(Cell Systems, 2016)
pancreas_human_muraro.rds	Human pancreas	8	Muraro(Cell Systems, 2016)
pancreas_human_segerstolpe.rds	Human pancreas	8	Segerstolpe(Cell Metabolism, 2016)
pancreas_human_xin.rds	Human pancreas	4	Xin(Cell Metabolism, 2016)
embryo_development_mouse_deng.rds	Mouse embryo development	4	Deng(Science, 2014)
cerebral_cortex_human_pollen.rds	Human cerebral cortex	9	Pollen(Nature biotechnology, 2014)
colorectal_tumor_human_li.rds	Human colorectal tumors	5	Li(Nature genetics, 2017)
brain_mouse_usoskin.rds	Mouse brain	4	Usoskin(Nature neuroscience, 2015)
cortex_mouse_tasic.rds	Mouse cortex	17	Tasic(Nature neuroscience, 2016)
embryo_stem_cells_mouse_klein.rds	Mouse embryo stem cells	4	Klein(Cell, 2015)
brain_mouse_zeisel.rds	Mouse brain	9	Zeisel(Science, 2015)
retina_mouse_shekhar_coarse-grained_annotation.rds	Mouse retina	4	Shekhar(Cell, 2016)
retina_mouse_shekhar_fine-grained_annotation.rds	Mouse retina	17	Shekhar(Cell, 2016)
retina_mouse_macosko.rds	Mouse retina	12	Macosko(Cell, 2015)
lung_cancer_cell_lines_human_cellbench10X.rds	Mixture of five human lung cancer cell lines	5	CellBench_10X(Nature methods, 2019)
lung_cancer_cell_lines_human_cellbenchCelSeq.rds	Mixture of five human lung cancer cell lines	5	CellBench_CelSeq2(Nature methods, 2019)
whole_mus_musculus_mouse_TM.rds	Whole Mus musculus	55	TM(Nature, 2018)
primary_visual_cortex_mouse_AMB_coarse-grained_annotation_3.rds	Primary mouse visual cortex	3	AMB(Nature, 2018)
primary_visual_cortex_mouse_AMB_fine-grained_annotation_14.rds	Primary mouse visual cortex	14	AMB(Nature, 2018)
primary_visual_cortex_mouse_AMB_fine-grained_annotation_68.rds	Primary mouse visual cortex	68	AMB(Nature, 2018)
PBMC_human_zheng_sorted.rds	FACS-sorted PBMC	10	Zheng sorted(Nature communications, 2017)
PBMC_human_zheng_68K.rds	PBMC	11	Zheng 68k(Nature communications, 2017)
primary_visual_cortex_mouse_VISP_coarse-grained_annotation.rds	Mouse primary visual cortex	3	VISp(Nature, 2018)
primary_visual_cortex_mouse_VISP_fine-grained_annotation.rds	Mouse primary visual cortex	33	VISp(Nature, 2018)
anterior_lateral_motor_area_mouse_ALM_coarse-grained_annotation.rds	Mouse anterior lateral motor area	3	ALM(Nature, 2018)
anterior_lateral_motor_area_mouse_ALM_fine-grained_annotation.rds	Mouse anterior lateral motor area	32	ALM(Nature, 2018)
middle_temporal_gyrus_human_MTG_coarse-grained_annotation.rds	Human middle temporal gyrus	3	MTG(Nature, 2019)
middle_temporal_gyrus_human_MTG_fine-grained_annotation.rds	Human middle temporal gyrus	34	MTG(Nature, 2019)
PBMC_human_a10Xv2.rds	Human PBMC	9	PbmcBench_a10Xv2(bioRxiv, 2019)
PBMC_human_a10Xv3.rds	Human PBMC	8	PbmcBench_a10Xv3(bioRxiv, 2019)
PBMC_human_CL.rds	Human PBMC	7	PbmcBench_CL(bioRxiv, 2019)
PBMC_human_DR.rds	Human PBMC	9	PbmcBench_DR(bioRxiv, 2019)
PBMC_human_iD.rds	Human PBMC	7	PbmcBench_iD(bioRxiv, 2019)
PBMC_human_SM2.rds	Human PBMC	6	PbmcBench_SM2(bioRxiv, 2019)
PBMC_human_SW.rds	Human PBMC	7	PbmcBench_SW(bioRxiv, 2019)

Pre-trained scLearn models for the 20 mouse organ datasets

Pre-trained model names	Description	No. of cell types
Aorta_mouse_FACS.rds	Mouse aorta	4
Bladder_mouse_FACS.rds	Mouse bladder	2
Brain_Myeloid_mouse_FACS.rds	Mouse brain myeloid	2
Brain_Non-Myeloid_mouse_FACS.rds	Mouse brain non-myeloid	7
Diaphragm_mouse_FACS.rds	Mouse diaphragm	5
Fat_mouse_FACS.rds	Mouse fat	6
Heart_mouse_FACS.rds	Mouse heart	10
Kidney_mouse_FACS.rds	Mouse kidney	5
Large_Intestine_mouse_FACS.rds	Mouse large intestine	5
Limb_Muscle_mouse_FACS.rds	Mouse limb muscle	8
Liver_mouse_FACS.rds	Mouse liver	5
Lung_mouse_FACS.rds	Mouse lung	11
Mammary_Gland_mouse_FACS.rds	Mouse mammary gland	4
Marrow_mouse_FACS.rds	Mouse marrow	21
Pancreas_mouse_FACS.rds	Mouse pancreas	9
Skin_mouse_FACS.rds	Mouse skin	5
Spleen_mouse_FACS.rds	Mouse spleen	3
Thymus_mouse_FACS.rds	Mouse thymus	3
Tongue_mouse_FACS.rds	Mouse tongue	2
Trachea_mouse_FACS.rds	Mouse trachea	4
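The pre-trained models are distributed as .rds files, which base R reads with readRDS(). The sketch below shows the save and reload round trip on a toy list. It assumes that a stored model is a list with the same four elements as scLearn_model_learning_result; the element contents here are placeholders, and the actual structure of the distributed files is not confirmed by this page.

```r
# Placeholder object with the element names described for
# scLearn_model_learning_result (contents are made up).
toy_model <- list(
  high_varGene_names = c("INS", "GCG"),
  simi_threshold_learned = c(alpha = 0.8, beta = 0.8),
  trans_matrix_learned = matrix(0, nrow = 1, ncol = 2),
  feature_matrix_learned = matrix(0, nrow = 2, ncol = 1)
)

# Save to a temporary .rds file and load it back.
model_path <- tempfile(fileext = ".rds")
saveRDS(toy_model, model_path)
reloaded_model <- readRDS(model_path)

# Inspect the element names before using a loaded model.
print(names(reloaded_model))
```

Under that assumption, a downloaded file such as pancreas_human_baron.rds would be loaded the same way with readRDS() and passed as the scLearn_model_learning_result argument of scLearn_cell_assignment.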

📖 Vignette

Run the following command and choose the HTML version for more details.

utils::browseVignettes(package = "scLearn")

💖 Contributing

Contributions and comments are welcome; you can file them here.

🏆 Acknowledgement

Thanks to all the developers of the methods integrated into scLearn.

✴️ Citation

B. Duan, C. Zhu, G. Chuai, C. Tang, X. Chen, S. Chen, S. Fu, G. Li, Q. Liu, Learning for single-cell assignment. Sci. Adv. 6, eabd0855 (2020).

For further inquiries or support, please contact bioinfo_db@163.com (binduan@sjtu.edu.cn) or qiliu@tongji.edu.cn.
