scLearn tutorial: a learning-based framework for single cell assignment

Introduction

scLearn is a learning-based framework designed to automatically infer quantitative measurements and similarity thresholds for single-cell assignment tasks. It achieves well-generalized performance across different single-cell types and is particularly robust in identifying novel cell types not present in reference datasets. scLearn introduces a multi-label single-cell assignment strategy for the first time, allowing simultaneous assignment of cell type and developmental stage, which is highly effective for cell development and lineage analysis.

The Overview of scLearn

The scLearn workflow. scLearn comprises three steps: data preprocessing, model learning, and cell assignment. (i) In the first step, the main processes comprise routine normalization, cell quality control, rare cell type filtering, and feature selection; nGene, number of genes; nUMI, number of unique molecular identifiers; P-mitGene, percentage of mitochondrial genes; and G, cell group. (ii) In the second step, for single-label single-cell assignment, discriminative component analysis (DCA) is applied to learn the transformation matrix. For multilabel single-cell assignment, MDDM (multilabel dimension reduction via dependence maximization) is applied to learn the transformation matrix. Then, with the learned transformation matrix, the transformed reference cell samples are obtained for the following assignment. The thresholds for labeling a cell as unassigned for each cell type are also automatically learned. LTM, learned transformation matrix, which can be calculated as the optimal transformation matrix for single-label single-cell assignment or by Eq. 6 for multilabel single-cell assignment, respectively (see Materials and Methods); TRCM, transformed reference cell matrix, which can be calculated using Eq. 1 (see Materials and Methods). (iii) In the third step, the transformed query cell samples are obtained on the basis of LTM with an available optional cell quality control procedure. The transformed query samples are compared against the TRCM to derive the measurement fulfilling the cell-type assignment with the rejection task. TQCM, transformed query cell matrix, which can be calculated using Eq. 2

Installation

if (!require("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}

# # Install suggested packages
# BiocManager::install(c(
#   "SingleCellExperiment",
#   "M3Drop",
#   "dml"
# ))
 
# devtools::install_github("hemberg-lab/scmap")

# BiocManager::install("scLearn")

# install.packages("devtools")
# devtools::install_github("DuanLab1/scLearn", dependencies = c("Depends", "Imports", "LinkingTo"))

library(scLearn)
library(tidyverse) 
library(SingleCellExperiment)
library(M3Drop)

Single-label single cell assignment

Data preparation

Reference Cell Database: baron-human
Query Cell Data: Muraro-human

data(RefCellData)
data(QueryCellData)
stratified_sample <- function(cell_types, n_total = 300, min_per_type = 10) {
  sampled <- lapply(split(seq_along(cell_types), cell_types), function(idx) {
    if(length(idx) > min_per_type) {
      sample(idx, max(min_per_type, round(n_total * length(idx)/length(cell_types))))
    } else {
      idx
    }
  })
   unlist(sampled)
}


set.seed(123)

cell_types <- colData(RefCellData)$cell_type1
sampled_cells <- stratified_sample(cell_types)
RefCellData <- RefCellData[, sampled_cells]
RefCellData

## class: SingleCellExperiment 
## dim: 20125 342 
## metadata(0):
## assays(2): counts logcounts
## rownames(20125): A1BG A1CF ... ZZZ3 pk
## rowData names(10): feature_symbol is_feature_control ... total_counts
##   log10_total_counts
## colnames(342): human3_lib3.final_cell_0081 human3_lib1.final_cell_0121
##   ... human3_lib3.final_cell_0896 human4_lib1.final_cell_0568
## colData names(30): human cell_type1 ... pct_counts_ERCC is_cell_control
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

cell_types <- colData(QueryCellData)$cell_type1
sampled_cells <- stratified_sample(cell_types)
QueryCellData <- QueryCellData[, sampled_cells]
QueryCellData

## class: SingleCellExperiment 
## dim: 19127 311 
## metadata(0):
## assays(2): normcounts logcounts
## rownames(19127): A1BG-AS1__chr19 A1BG__chr19 ... ZZEF1__chr17
##   ZZZ3__chr1
## rowData names(1): feature_symbol
## colnames(311): D31.7_38 D28.3_4 ... D30.4_58 D31.7_42
## colData names(3): cell_type1 donor batch
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):

cell quality control

RefRawcounts <- assays(RefCellData)[[1]]
RefAnn <- as.character(RefCellData$cell_type1)
names(RefAnn) <- colnames(RefCellData)

RefDataQC <- Cell_qc(
  expression_profile = RefRawcounts, 
  sample_information_cellType = RefAnn, 
  sample_information_timePoint = NULL, 
  species = "Hs",
  gene_low = 500, 
  gene_high = 10000, 
  mito_high = 0.1, 
  umi_low = 1500, 
  umi_high = Inf, 
  logNormalize = TRUE, 
  plot = FALSE, 
  plot_path = "./quality_control.pdf")

rare cell type filtered

RefDataQC_filtered <- Cell_type_filter(
  expression_profile = RefDataQC$expression_profile, 
  sample_information_cellType = RefDataQC$sample_information_cellType,
  sample_information_timePoint = NULL,
  min_cell_number = 10)

feature selection

RefDataQC_HVG_names <- Feature_selection_M3Drop(
  expression_profile = RefDataQC_filtered$expression_profile,
  log_normalized = TRUE, 
  threshold = 0.05)

Model learning

Training the model. To improve the accuracy for “unassigned” cell, you can increase “bootstrap_times”, but it will takes longer time. The default value of “bootstrap_times” is 10.

DCA: Discriminative Component Analysis (DCA) Transformation

scLearn_model_learning_result <- scLearn_model_learning(
  high_varGene_names = RefDataQC_HVG_names,
  expression_profile = RefDataQC_filtered$expression_profile,
  sample_information_cellType = RefDataQC_filtered$sample_information_cellType,
  sample_information_timePoint = NULL,
  method = "DCA",
  bootstrap_times = 1,
  cutoff = 0.01,
  dim_para = 0.999
)

Results：scLearn_model_learning_resultis the final result file containing the processed reference cell matrix.

high_varGene_names: The set of highly variable genes (324 HVGs)
simi_threshold_learned: The similarity results between cell types
trans_matrix_learned: The transformed matrix (23 features * 324 HVGs)
feature_matrix_learned: After DCA, the matrix of cell type features (these features are obtained by dimensionality reduction of the highly variable gene set) (12 celltype * 23 features)

Cell assignment

Assignment with trained model above. To get a less strict result for “unassigned” cells, you can decrease “diff” and “vote_rate”. If you are sure that the cell type of query cells must be in the reference dataset, you can set “threshold_use” as FALSE. It means you don’t want to use the thresholds learned by scLearn.

QueryRawcounts <- assays(QueryCellData)[[1]]
QueryDataQC <- Cell_qc(
  expression_profile = QueryRawcounts, 
  species = "Hs", 
  gene_low = 50, 
  umi_low = 50)

rownames(QueryDataQC$expression_profile) <- gsub("__\\w+\\d+", "", rownames(QueryDataQC$expression_profile))

scLearn_predict_result <- scLearn_cell_assignment(
  scLearn_model_learning_result = scLearn_model_learning_result, 
  expression_profile_query = QueryDataQC$expression_profile, 
  diff = 0.05, 
  threshold_use = TRUE, 
  vote_rate = 0.6)

## [1] "The number of missing features in the query data is  17 "
## [1] "The rate of missing features in the query data is  0.0524691358024691 "

head(scLearn_predict_result)

##          Query_cell_id Predict_cell_type Additional_information
## D31.7_38      D31.7_38        unassigned             Novel_Cell
## D28.3_4        D28.3_4        unassigned             Novel_Cell
## D30.1_63      D30.1_63            ductal      0.877513396467364
## D31.6_53      D31.6_53            ductal      0.898189213920747
## D30.7_39      D30.7_39            ductal      0.828981549060616
## D31.1_22      D31.1_22            ductal      0.839213242283002

Results: The output consists of three columns of information. The Query_cell_id represents the cell ID, Predict_cell_type indicates the predicted cell type, and Additional_information provides the similarity results.

Cells with a Predict_cell_type of unassigned are those that do not intersect with the reference cell set.

Accuracy

QueryData_trueLabel <- as.character(QueryCellData$cell_type1)
names(QueryData_trueLabel) <- colnames(QueryCellData)

QueryData_CellType <- QueryData_trueLabel |> 
  as.data.frame() |>
  tibble::rownames_to_column("CellID") |>
  dplyr::rename(trueLabel = QueryData_trueLabel) |>  
  dplyr::inner_join(scLearn_predict_result |>
                      as.data.frame() |>
                      dplyr::rename(predLabel = Predict_cell_type),
                    by = c("CellID" = "Query_cell_id")) |>
  dplyr::select(CellID, trueLabel, predLabel, everything())

print(
  paste("Final Accuracy =", 
    sprintf("%1.2f%%",
      100 * sum(QueryData_CellType$predLabel == QueryData_CellType$trueLabel) / nrow(QueryData_CellType))))

## [1] "Final Accuracy = 81.23%"

Multi-label single cell assignment

Data preprocessing
Model learning
Cell assignment: We just use ESC.rds itself to test the multi-label single cell assignment here.

Download ESC.rds

# loading the reference dataset
RefData_MLab <- readRDS("ESC.rds")
RefRawcounts_MLab <- assays(RefData_MLab)[[1]]
RefAnn_MLab <- as.character(RefData_MLab$cell_type1)
names(RefAnn_MLab) <- colnames(RefData_MLab)
RefAnn2_MLab <- as.character(RefData_MLab$cell_type2)
names(RefAnn2_MLab) <- colnames(RefData_MLab)

# cell quality control and rare cell type filtered and feature selection
RefDataQC_MLab <- Cell_qc(
  expression_profile = RefRawcounts_MLab, 
  sample_information_cellType = RefAnn_MLab, 
  sample_information_timePoint = RefAnn2_MLab, 
  species = "Hs",
  gene_low = 500, 
  gene_high = 10000, 
  mito_high = 0.1, 
  umi_low = 1500, 
  umi_high = Inf, 
  logNormalize = TRUE, 
  plot = FALSE, 
  plot_path = "./quality_control.pdf")

RefDataQC_filtered_MLab <- Cell_type_filter(
  expression_profile = RefDataQC_MLab$expression_profile, 
  sample_information_cellType = RefDataQC_MLab$sample_information_cellType, 
  sample_information_timePoint = RefDataQC_MLab$sample_information_timePoint, 
  min_cell_number = 10)

RefDataQC_HVG_names_MLab <- Feature_selection_M3Drop(
  expression_profile = RefDataQC_filtered_MLab$expression_profile,
  log_normalized = TRUE, 
  threshold = 0.05)

# training the model
scLearn_model_learning_result <- scLearn_model_learning(
  high_varGene_names = RefDataQC_HVG_names_MLab, 
  expression_profile = RefDataQC_filtered_MLab$expression_profile, 
  sample_information_cellType = RefDataQC_filtered_MLab$sample_information_cellType, 
  sample_information_timePoint = RefDataQC_filtered_MLab$sample_information_timePoint, 
  bootstrap_times = 10, 
  cutoff = 0.01,
  dim_para = 0.999)

# loading the quary cell and performing cell quality control
QueryData_MLab <- readRDS("ESC.rds")
QueryRawcounts_MLab <- assays(QueryData_MLab)[[1]]
### the true labels of this test dataset
# query_ann1 <- as.character(data2$cell_type1)
# names(query_ann1) <- colnames(data2)
# query_ann2 <- as.character(data2$cell_type2)
# names(query_ann2) <- colnames(data2)
# rawcounts2 <- rawcounts2[, names(query_ann1)]
# data_qc_query <- Cell_qc(rawcounts2, query_ann1, query_ann2, species = "Hs")
QueryDataQC_MLab <- Cell_qc(
  expression_profile = QueryRawcounts_MLab, 
  species = "Hs", 
  gene_low = 50, 
  umi_low = 50)
# Assignment with trained model above
scLearn_predict_result_MLab <- scLearn_cell_assignment(
  scLearn_model_learning_result = scLearn_model_learning_result, 
  expression_profile_query = QueryDataQC_MLab$expression_profile)

head(scLearn_predict_result_MLab)

Pre-trained Models

scLearn provides pre-trained models for 30 datasets and 20 mouse organs datasets, covering a wide range of commonly used cell types and tissues. These models can be directly used for single-cell categorization tasks.

The information of pre-trained scLearn models of the 30 datasets

Pre-trained model names	Description	No. of cell types	Corresponding dataset(Journal, date)
pancreas_mouse_baron.rds	Mouse pancreas	9	Baron_mouse(Cell System, 2016)
pancreas_human_baron.rds	Human pancreas	13	Baron_human(Cell System, 2016)
pancreas_human_muraro.rds	Human pancreas	8	Muraro(Cell System, 2016)
pancreas_human_segerstolpe.rds	Human pancreas	8	Segerstolpe(Cell Metabolism, 2016)
pancreas_human_xin.rds	Human pancreas	4	Xin(Cell Metabolism, 2016)
embryo_development_mouse_deng.rds	Mouse embryo development	4	Deng(Science, 2014)
cerebral_cortex_human_pollen.rds	Human cerebral cortex	9	Pollen(Nature biotechnology, 2014)
colorectal_tumor_human_li.rds	Human colorectal tumors	5	Li(Nature genetics, 2017)
brain_mouse_usoskin.rds	Mouse brain	4	Usoskin(Nature neuroscience,2015)
cortex_mouse_tasic.rds	Mouse cortex	17	Tasic(Nature neuroscience, 2016)
embryo_stem_cells_mouse_klein.rds	Mouse embryo stem cells	4	Klein(Cell, 2015)
brain_mouse_zeisel.rds	Mouse brain	9	Zeisel(Science, 2015)
retina_mouse_shekhar_coarse-grained_annotation.rds	Mouse retina	4	Shekhar(Cell, 2016)
retina_mouse_shekhar_fine-grained_annotation.rds	Mouse retina	17	Shekhar(Cell, 2016)
retina_mouse_macosko.rds	Mouse retina	12	Macosko(Cell, 2015)
lung_cancer_cell_lines_human_cellbench10X.rds	Mixture of five human lung cancer cell lines	5	CellBench_10X(Nature methods, 2019)
lung_cancer_cell_lines_human_cellbenchCelSeq.rds	Mixture of five human lung cancer cell lines	5	CellBench_CelSeq2(Nature methods, 2019)
whole_mus_musculus_mouse_TM.rds	Whole Mus musculus	55	TM(Nature, 2018)
primary_visual_cortex_mouse_AMB_coarse-grained_annotation_3.rds	Primary mouse visual cortex	3	AMB(Nature, 2018)
primary_visual_cortex_mouse_AMB_fine-grained_annotation_14.rds	Primary mouse visual cortex	14	AMB(Nature, 2018)
primary_visual_cortex_mouse_AMB_fine-grained_annotation_68.rds	Primary mouse visual cortex	68	AMB(Nature, 2018)
PBMC_human_zheng_sorted.rds	FACS-sorted PBMC	10	Zheng sorted(Nature communications ,2017)
PBMC_human_zheng_68K.rds	PBMC	11	Zheng 68k(Nature communications, 2017)
primary_visual_cortex_mouse_VISP_coarse-grained_annotation.rds	Mouse primary visual cortex	3	VISp(Nature, 2018)
primary_visual_cortex_mouse_VISP_fine-grained_annotation.rds	Mouse primary visual cortex	33	VISp(Nature, 2018)
anterior_lateral_motor_area_mouse_ALM_coarse-grained_annotation.rds	Mouse anterior lateral motor area	3	ALM(Nature, 2018)
anterior_lateral_motor_area_mouse_ALM_fine-grained_annotation.rds	Mouse anterior lateral motor area	32	ALM(Nature, 2018)
middle_temporal_gyrus_human_MTG_coarse-grained_annotation.rds	Human middle temporal gyrus	3	MTG(Nature, 2019)
middle_temporal_gyrus_human_MTG_fine-grained_annotation.rds	Human middle temporal gyrus	34	MTG(Nature, 2019)
PBMC_human_a10Xv2.rds	Human PBMC	9	PbmcBench_a10Xv2(bioRxiv, 2019)
PBMC_human_a10Xv3.rds	Human PBMC	8	PbmcBench a10Xv3(bioRxiv, 2019)
PBMC_human_CL.rds	Human PBMC	7	PbmcBench_CL(bioRxiv, 2019)
PBMC_human_DR.rds	Human PBMC	9	PbmcBench_DR(bioRxiv, 2019)
PBMC_human_iD.rds	Human PBMC	7	PbmcBench_iD(bioRxiv, 2019)
PBMC_human_SM2.rds	Human PBMC	6	PbmcBench_SM2(bioRxiv, 2019)
PBMC_human_SW.rds	Human PBMC	7	PbmcBench_SW(bioRxiv, 2019)

The information of pre-trained scLearn models for the 20 mouse organs datasets

Trained model names	Description	No. of cell types
Aorta_mouse_FACS.rds	Mouse aorta	4
Bladder_mouse_FACS.rds	Mouse bladder	2
Brain_Myeloid_mouse_FACS.rds	Mouse brain myeloid	2
Brain_Non-Myeloid_mouse_FACS.rds	Mouse brain non-myeloid	7
Diaphragm_mouse_FACS.rds	Mouse diaphragm	5
Fat_mouse_FACS.rds	Mouse fat	6
Heart_mouse_FACS.rds	Mouse heart	10
Kidney_mouse_FACS.rds	Mouse kidney	5
Large_Intestine_mouse_FACS.rds	Mouse large intestine	5
Limb_Muscle_mouse_FACS.rds	Mouse limb muscle	8
Liver_mouse_FACS.rds	Mouse liver	5
Lung_mouse_FACS.rds	Mouse lung	11
Mammary_Gland_mouse_FACS.rds	Mouse mammary gland	4
Marrow_mouse_FACS.rds	Mouse marrow	21
Pancreas_mouse_FACS.rds	Mouse pancreas	9
Skin_mouse_FACS.rds	Mouse skin	5
Spleen_mouse_FACS.rds	Mouse spleen	3
Thymus_mouse_FACS.rds	Mouse thymus	3
Tongue_mouse_FACS.rds	Mouse tongue	2
Trachea_mouse_FACS.rds	Mouse trachea	4

Summary

Key Features of scLearn

Robustness and Generalization: scLearn is designed to be robust across various single-cell assignment tasks, providing consistent performance regardless of the cell type or dataset used.
Efficiency in Novel Cell Type Identification: scLearn efficiently identifies novel cell types that are absent in the reference datasets, overcoming limitations of traditional methods that rely on predefined similarity thresholds.
Multi-Label Assignment: scLearn proposes a multi-label assignment strategy, allowing simultaneous assignment of cell type and developmental stage. This dual assignment is particularly useful for understanding cell lineage and development.
Pre-trained Models: scLearn comes with pre-trained models and comprehensive reference datasets for human and mammalian single cells, facilitating broad applications in single-cell assignment.

Citation

B. Duan, C. Zhu, G. Chuai, C. Tang, X. Chen, S. Chen, S. Fu, G. Li, Q. Liu, Learning for single-cell assignment. Sci. Adv. 6, eabd0855 (2020).

For further inquiries or support, please contact bioinfo_db@163.com (binduan@sjtu.edu.cn) or qiliu@tongji.edu.cn.

System information

sessionInfo()

## R version 4.4.3 (2025-02-28)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.4.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Shanghai
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] M3Drop_1.32.0               numDeriv_2016.8-1.1        
##  [3] SingleCellExperiment_1.28.1 SummarizedExperiment_1.36.0
##  [5] Biobase_2.66.0              GenomicRanges_1.58.0       
##  [7] GenomeInfoDb_1.42.3         IRanges_2.40.1             
##  [9] S4Vectors_0.44.0            BiocGenerics_0.52.0        
## [11] MatrixGenerics_1.18.1       matrixStats_1.5.0          
## [13] lubridate_1.9.4             forcats_1.0.0              
## [15] stringr_1.5.1               dplyr_1.1.4                
## [17] purrr_1.0.4                 readr_2.1.5                
## [19] tidyr_1.3.1                 tibble_3.2.1               
## [21] ggplot2_3.5.1               tidyverse_2.0.0            
## [23] scLearn_2.0.0               BiocManager_1.30.25        
## [25] BiocStyle_2.34.0           
## 
## loaded via a namespace (and not attached):
##   [1] bitops_1.0-9            gridExtra_2.3           inline_0.3.21          
##   [4] rlang_1.1.5             magrittr_2.0.3          compiler_4.4.3         
##   [7] loo_2.8.0               mgcv_1.9-1              systemfonts_1.2.1      
##  [10] vctrs_0.6.5             pkgconfig_2.0.3         crayon_1.5.3           
##  [13] fastmap_1.2.0           backports_1.5.0         XVector_0.46.0         
##  [16] caTools_1.18.3          rmarkdown_2.29          tzdb_0.5.0             
##  [19] UCSC.utils_1.2.0        ragg_1.3.3              densEstBayes_1.0-2.2   
##  [22] xfun_0.51               zlibbioc_1.52.0         cachem_1.1.0           
##  [25] jsonlite_2.0.0          DelayedArray_0.32.0     parallel_4.4.3         
##  [28] irlba_2.3.5.1           cluster_2.1.8           R6_2.6.1               
##  [31] dml_1.1.0               StanHeaders_2.32.10     bslib_0.9.0            
##  [34] stringi_1.8.4           RColorBrewer_1.1-3      rpart_4.1.24           
##  [37] jquerylib_0.1.4         Rcpp_1.0.14             bookdown_0.42          
##  [40] rstan_2.32.7            knitr_1.50              base64enc_0.1-3        
##  [43] timechange_0.3.0        Matrix_1.7-2            splines_4.4.3          
##  [46] nnet_7.3-20             tidyselect_1.2.1        rstudioapi_0.17.1      
##  [49] abind_1.4-8             yaml_2.3.10             gplots_3.2.0           
##  [52] codetools_0.2-20        curl_6.2.2              pkgbuild_1.4.7         
##  [55] lattice_0.22-6          withr_3.0.2             rARPACK_0.11-0         
##  [58] evaluate_1.0.3          foreign_0.8-88          desc_1.4.3             
##  [61] RcppParallel_5.1.10     pillar_1.10.1           KernSmooth_2.23-26     
##  [64] checkmate_2.3.2         generics_0.1.3          hms_1.1.3              
##  [67] rstantools_2.4.0        munsell_0.5.1           scales_1.3.0           
##  [70] gtools_3.9.5            glue_1.8.0              Hmisc_5.2-3            
##  [73] tools_4.4.3             data.table_1.17.0       RSpectra_0.16-2        
##  [76] fs_1.6.5                mvtnorm_1.3-3           grid_4.4.3             
##  [79] bbmle_1.0.25.1          QuickJSR_1.7.0          bdsmatrix_1.3-7        
##  [82] colorspace_2.1-1        nlme_3.1-167            GenomeInfoDbData_1.2.13
##  [85] lfda_1.1.3              htmlTable_2.4.3         Formula_1.2-5          
##  [88] cli_3.6.4               textshaping_1.0.0       S4Arrays_1.6.0         
##  [91] V8_6.0.2                gtable_0.3.6            sass_0.4.9             
##  [94] digest_0.6.37           reldist_1.7-2           SparseArray_1.6.2      
##  [97] htmlwidgets_1.6.4       htmltools_0.5.8.1       pkgdown_2.1.1          
## [100] lifecycle_1.0.4         httr_1.4.7              statmod_1.5.0          
## [103] MASS_7.3-64

Bin Duan

2025-05-12