Contents

1 Introduction
2 Setup
3 Running the pipeline
4 Results
5 Plotting
6 Summary

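1 Introduction

This vignette covers a minimal example of how to use CellBench to benchmark single-cell RNA-seq methods. We will use the single-cell data packaged with CellBench and external wrappers from packages on GitHub. For 2 datasets, 3 normalisation methods, 2 imputation methods and 3 clustering methods will be tested, and the final clustering results will be evaluated using the adjusted Rand index.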

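2 Setup

    library(CellBench)

    # These need to be installed from GitHub using install_github() from the
    # devtools or remotes package
    library(NormCBM)    # install_github("shians/NormCBM")
    library(ClusterCBM) # install_github("shians/ClusterCBM")
    library(ImputeCBM)  # install_github("shians/ImputeCBM")

    # Tidyverse packages
    library(dplyr)
    library(purrr)
    library(forcats)
    library(ggplot2)

We load the data using load_sc_data(), which returns a list of SingleCellExperiment objects ready for use with CellBench. We choose the CelSeq and DropSeq datasets because they are relatively small and can be run in a reasonable time.

    # take the smaller CelSeq and DropSeq single-cell datasets from the mixology dataset
    datasets <- load_sc_data()[c("sc_celseq", "sc_dropseq")]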

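We create lists of methods from a selection of wrappers in the NormCBM, ImputeCBM and ClusterCBM GitHub packages. This set of methods was chosen for short run time rather than performance; note, however, that a longer run time does not guarantee better performance.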

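Each of these methods takes a SingleCellExperiment argument and returns a SingleCellExperiment result with the new data added to the appropriate slot. This approach makes the wrappers more general and preserves as much data as possible for downstream use, at the cost of greater memory usage.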
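
The wrapper contract described above can be sketched with a small hypothetical wrapper. The function name norm_logcpm_example and the toy data are assumptions made for illustration and are not part of NormCBM; only counts(), logcounts(), assayNames() and SingleCellExperiment() come from the SingleCellExperiment package.

```r
library(SingleCellExperiment)

# Hypothetical wrapper (not part of NormCBM): takes a SingleCellExperiment,
# stores log2 counts-per-million in the logcounts slot and returns the object.
norm_logcpm_example <- function(sce) {
    raw <- counts(sce)
    lib_size <- colSums(raw)            # total counts per cell
    cpm <- t(t(raw) / lib_size) * 1e6   # scale every cell to one million counts
    logcounts(sce) <- log2(cpm + 1)
    sce
}

# Toy data: 3 genes by 2 cells
toy_counts <- matrix(
    c(1, 0, 3, 2, 5, 1),
    nrow = 3,
    dimnames = list(paste0("gene", 1:3), paste0("cell", 1:2))
)
toy_sce <- SingleCellExperiment(assays = list(counts = toy_counts))
toy_sce <- norm_logcpm_example(toy_sce)
assayNames(toy_sce)
```

Because the raw counts are left in place and the new values are added alongside them, a later step can still choose which assay to read.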

    # for normalisation and imputation we also have a "none" method, which uses the
    # identity function to simply return the object without any changes
    normalisation <- list(
        "none" = identity,
        "linnorm" = norm_linnorm,
        "scran" = norm_scran,
        "tmmwzp" = norm_tmmwzp
    )

    imputation <- list(
        "none" = identity,
        "drimpute" = impute_drimpute,
        "knn_smooth" = impute_knn_smooth
    )

    cluster <- list(
        "race_id" = cluster_race_id,
        "seurat" = cluster_seurat,
        "tscan" = cluster_tscan
    )

    # wrapper for the adjusted Rand index, written for the generic case
    rand_index <- function(sce, cluster_col, truth_col) {
        cluster <- colData(sce)[, cluster_col]
        truth <- colData(sce)[, truth_col]

        mclust::adjustedRandIndex(cluster, truth)
    }

    # metric created for this specific case
    metric <- list(
        ARI = function(x) { rand_index(x, "cluster_id", "cell_line") }
    )
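
To make the metric concrete, the following is a base R sketch of the adjusted Rand index computed from a contingency table of the two labelings. The function name adjusted_rand_index and the toy labels are illustrative assumptions; the vignette itself relies on mclust::adjustedRandIndex, which is the better choice in practice.

```r
# Adjusted Rand index from two label vectors, using only base R.
# Sketch only: degenerate inputs (for example a single cluster in both
# labelings) give a zero denominator and are not handled here.
adjusted_rand_index <- function(x, y) {
    stopifnot(length(x) == length(y))
    tab <- table(x, y)                     # contingency table of label pairs
    pairs <- function(n) n * (n - 1) / 2   # "n choose 2"

    sum_cells <- sum(pairs(tab))
    sum_rows <- sum(pairs(rowSums(tab)))
    sum_cols <- sum(pairs(colSums(tab)))

    expected <- sum_rows * sum_cols / pairs(length(x))
    max_index <- (sum_rows + sum_cols) / 2

    (sum_cells - expected) / (max_index - expected)
}

truth <- c("A", "A", "A", "B", "B", "B")
ari_identical <- adjusted_rand_index(truth, c(1, 1, 1, 2, 2, 2))
ari_partial <- adjusted_rand_index(truth, c(1, 1, 2, 2, 3, 3))
```

A value of 1 means the clustering reproduces the known cell lines exactly, while values near 0 indicate agreement no better than chance.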

We can look at an example of a very simple wrapper:

    normalisation$tmmwzp

    ## function(sce) {
    ##     sizeFactors(sce) <- edgeR::calcNormFactors(counts(sce), method = "TMMwzp")
    ##     sce <- scater::normalize(sce)
    ##
    ##     return(sce)
    ## }
    ## <environment: namespace:NormCBM>

or a slightly more complex wrapper:

    cluster$tscan

    ## function (sce)
    ## {
    ##     expr <- logcounts(sce)
    ##     res <- TSCAN::exprmclust(expr)
    ##     res <- res$clusterid
    ##     colData(sce)$cluster_id <- factor(res)
    ##     return(sce)
    ## }
    ## <environment: namespace:ClusterCBM>

3 Running the pipeline

With the wrappers set up, we can run our pipeline with very little code.

    res_metric <- datasets %>%
        apply_methods(normalisation) %>%
        apply_methods(imputation) %>%
        apply_methods(cluster) %>%
        apply_methods(metric)
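
Each step of the pipeline is applied to every result of the previous step, so the number of pipelines grows multiplicatively. The base R sketch below only mirrors that bookkeeping with expand.grid(); it is not how CellBench is implemented, and the variable names are chosen for illustration.

```r
dataset_names <- c("sc_celseq", "sc_dropseq")
normalisation_names <- c("none", "linnorm", "scran", "tmmwzp")
imputation_names <- c("none", "drimpute", "knn_smooth")
cluster_names <- c("race_id", "seurat", "tscan")

# expand.grid() varies its first argument fastest, so the last pipeline
# step is listed first and the columns are reordered afterwards.
combos <- expand.grid(
    cluster = cluster_names,
    imputation = imputation_names,
    normalisation = normalisation_names,
    data = dataset_names,
    stringsAsFactors = FALSE
)
combos <- combos[, c("data", "normalisation", "imputation", "cluster")]

nrow(combos)      # 2 * 4 * 3 * 3 combinations
head(combos, 4)
```

The single ARI metric adds a column but no further rows, which is why the result table has one row per combination of dataset, normalisation, imputation and clustering method.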

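4 Results

Looking at the results, we see columns corresponding to the methods applied, and a final column containing the list of results from the pipelines.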

    res_metric

    ## # A tibble: 72 x 6
    ##    data     normalisation imputation cluster metric result
    ##    <fct>    <fct>         <fct>      <fct>   <fct>  <list>
    ##  1 sc_cels… none          none       race_id ARI    <dbl [1…
    ##  2 sc_cels… none          none       seurat  ARI    <dbl [1…
    ##  3 sc_cels… none          none       tscan   ARI    <task_r…
    ##  4 sc_cels… none          drimpute   race_id ARI    <dbl [1…
    ##  5 sc_cels… none          drimpute   seurat  ARI    <dbl [1…
    ##  6 sc_cels… none          drimpute   tscan   ARI    <dbl [1…
    ##  7 sc_cels… none          knn_smooth race_id ARI    <dbl [1…
    ##  8 sc_cels… none          knn_smooth seurat  ARI    <dbl [1…
    ##  9 sc_cels… none … linnorm none race_id ARI


We will see that some elements are returned as "task_error" objects, which indicates that the computation failed, usually because the large number of zero values in single-cell data causes problems for the numerical methods. Most results returned successfully, however, so we filter out the errors and tidy the results into a nicer form.

    # helper function for use with dplyr::filter: because the result column is a
    # list, it is not compatible with the usual vectorised functions
    is_task_error <- function(x) {
        purrr::map_lgl(x, function(obj) is(obj, "task_error"))
    }

    # filter out the errored entries, convert the results to a numeric column
    # and sort by ARI
    res_metric %>%
        filter(!is_task_error(result)) %>%
        mutate(result = as.numeric(result)) %>%
        arrange(desc(result)) %>%
        print(n = 20)

    ## # A tibble: 64 x 6
    ##    data       normalisation imputation cluster metric result
    ##    <fct>      <fct>         <fct>      <fct>   <fct>   <dbl>
    ##  1 sc_celseq  none          none       seurat  ARI     0.977
    ##  2 sc_celseq  none          drimpute   seurat  ARI     0.977
    ##  3 sc_celseq  none          knn_smooth seurat  ARI     0.977
    ##  4 sc_celseq  linnorm       none       seurat  ARI     0.977
    ##  5 sc_celseq  linnorm       drimpute   seurat  ARI     0.977
    ##  6 sc_celseq  linnorm       knn_smooth seurat  ARI     0.977
    ##  7 sc_celseq  scran         none       seurat  ARI     0.977
    ##  8 sc_celseq  scran         drimpute   seurat  ARI     0.977
    ##  9 sc_celseq  scran         knn_smooth seurat  ARI     0.977
    ## 10 sc_celseq  tmmwzp        none       seurat  ARI     0.977
    ## 11 sc_celseq  tmmwzp        drimpute   seurat  ARI     0.977
    ## 12 sc_celseq  tmmwzp        knn_smooth seurat  ARI     0.977
    ## 13 sc_celseq  tmmwzp        drimpute   tscan   ARI     0.833
    ## 14 sc_dropseq none          none       tscan   ARI     0.818
    ## 15 sc_dropseq scran         none       tscan   ARI     0.818
    ## 16 sc_dropseq none          knn_smooth tscan   ARI     0.774
    ## 17 sc_dropseq none          none       seurat  ARI     0.763
    ## 18 sc_dropseq none          knn_smooth seurat  ARI     0.763
    ## 19 sc_dropseq linnorm       none       seurat  ARI     0.763
    ## 20 sc_dropseq linnorm       drimpute   seurat  ARI     0.763
    ## # … with 44 more rows


   
5 Plotting

We perform some table manipulation to get the data into a form suitable for plotting.

    # filter out errors, convert results to numeric and drop the metric column,
    # since we only have one metric
    res_metric_filtered <- res_metric %>%
        filter(!is_task_error(result)) %>%
        mutate(result = as.numeric(result)) %>%
        select(-metric)

    # pipeline_collapse() creates a new column that summarises the pipeline as a
    # single string
    res_metric_summarised <- res_metric_filtered %>%
        pipeline_collapse(drop.steps = FALSE, sep = " > ")

    res_metric_summarised

    ## # A tibble: 64 x 6
    ##    data   normalisation imputation cluster pipeline   result
    ##    <fct>  <fct>         <fct>      <fct>   <fct>       <dbl>
    ##  1 sc_ce… none          none       race_id sc_celseq…  0.372
    ##  2 sc_ce… none          none       seurat  sc_celseq…  0.977
    ##  3 sc_ce… none          drimpute   race_id sc_celseq…  0.372
    ##  4 sc_ce… none          drimpute   seurat  sc_celseq…  0.977
    ##  5 sc_ce… none          drimpute   tscan   sc_celseq…  0.531
    ##  6 sc_ce… none          knn_smooth race_id sc_celseq…  0.372
    ##  7 sc_ce… none          knn_smooth seurat  sc_celseq…  0.977
    ##  8 sc_ce… linnorm       none       race_id sc_celseq…  0.372
    ##  9 sc_ce… linnorm       none       seurat  sc_celseq…  0.977
    ## 10 sc_ce… linnorm       drimpute   race_id sc_celseq…  0.372
    ## # … with 54 more rows

Now we can plot all the results. We will see that seurat and race_id are invariant to the upstream normalisation and imputation. This is because both methods perform their own normalisation and ignore the expression values passed in from the upstream steps.

    # reorder the factor levels so that the bars are ranked correctly
    plot_data <- res_metric_summarised %>%
        arrange(result) %>%
        mutate(
            pipeline = fct_inorder(as.character(pipeline)),
            cluster = fct_inorder(as.character(cluster))
        )

    plot_data %>%
        ggplot(aes(x = pipeline, y = result, fill = cluster)) +
        geom_bar(stat = "identity") +
        theme_classic() +
        xlab("Pipeline (normalisation > imputation > clustering)") +
        ylab("Adjusted Rand Index") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 15))
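
The reordering matters because ggplot2 places bars in the order of the factor levels, and levels default to alphabetical order. A base R sketch of the same idea follows, with made-up pipeline labels and scores; factor(x, levels = unique(x)) plays the role that fct_inorder() plays in the code above.

```r
# Made-up labels and scores, for illustration only
scores <- data.frame(
    pipeline = c("a > x", "b > y", "c > z"),
    result = c(0.9, 0.3, 0.6),
    stringsAsFactors = FALSE
)

# Default factor levels are alphabetical, regardless of the scores
default_levels <- levels(factor(scores$pipeline))

# Sort by score, then fix the levels in order of first appearance
ranked <- scores[order(scores$result), ]
ranked$pipeline <- factor(ranked$pipeline, levels = unique(ranked$pipeline))
ranked_levels <- levels(ranked$pipeline)
```

With the levels set this way, the x axis runs from the lowest-scoring pipeline to the highest-scoring one.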
    
Plotting all the data can be overwhelming, and we are usually only interested in the best-performing (and perhaps the worst-performing) pipelines. Because our data is in a tidy format, we can use dplyr functions to filter down to the data of interest.

    # keep only the CelSeq data
    plot_data <- res_metric_summarised %>%
        filter(data == "sc_celseq")

    # keep only the 4 best results for each clustering method
    plot_data <- plot_data %>%
        group_by(cluster) %>%
        arrange(desc(result)) %>%
        slice(1:4) %>%
        ungroup()

    plot_data <- plot_data %>%
        arrange(result) %>%
        mutate(
            pipeline = fct_inorder(as.character(pipeline)),
            cluster = fct_inorder(as.character(cluster))
        )

    plot_data %>%
        ggplot(aes(x = pipeline, y = result, fill = cluster)) +
        theme_classic() +
        geom_bar(stat = "identity") +
        xlab("Pipeline (normalisation > imputation > clustering)") +
        ylab("Adjusted Rand Index") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 15)) +
        guides(fill = guide_legend(title = "Clustering Method")) +
        ggtitle("Four highest ranked pipelines for each clustering method (CelSeq)")

    # keep only the DropSeq data
    plot_data <- res_metric_summarised %>%
        filter(data == "sc_dropseq")

    # keep only the 4 best results for each clustering method
    plot_data <- plot_data %>%
        group_by(cluster) %>%
        arrange(desc(result)) %>%
        slice(1:4) %>%
        ungroup()

    plot_data <- plot_data %>%
        arrange(result) %>%
        mutate(
            pipeline = fct_inorder(as.character(pipeline)),
            cluster = fct_inorder(as.character(cluster))
        )

    plot_data %>%
        ggplot(aes(x = pipeline, y = result, fill = cluster)) +
        theme_classic() +
        geom_bar(stat = "identity") +
        xlab("Pipeline (normalisation > imputation > clustering)") +
        ylab("Adjusted Rand Index") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 15)) +
        guides(fill = guide_legend(title = "Clustering Method")) +
        ggtitle("Four highest ranked pipelines for each clustering method (DropSeq)")
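
The "best n per clustering method" selection used for both datasets can also be written as a small reusable helper. The base R sketch below is one possible way to do it; the function name top_n_per_group and the toy scores are assumptions for illustration and are not part of CellBench or dplyr.

```r
# Keep the n highest-scoring rows within each group (hypothetical helper)
top_n_per_group <- function(df, group_col, value_col, n = 4) {
    pieces <- split(df, df[[group_col]])
    top <- lapply(pieces, function(piece) {
        piece <- piece[order(piece[[value_col]], decreasing = TRUE), ]
        head(piece, n)
    })
    out <- do.call(rbind, top)
    rownames(out) <- NULL
    out
}

# Toy scores, for illustration only
toy <- data.frame(
    cluster = c("seurat", "seurat", "seurat", "tscan", "tscan"),
    result = c(0.5, 0.9, 0.7, 0.2, 0.8),
    stringsAsFactors = FALSE
)
best_two <- top_n_per_group(toy, "cluster", "result", n = 2)
```

Groups with fewer than n rows are returned in full, since head() simply returns whatever rows are available.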
    
   
   
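6 Summary

The structure of CellBench allows the core benchmarking code to be presented very cleanly, as in the "Running the pipeline" section. The complexity lives elsewhere, in the wrappers. Because results are returned in a tidy format, they can easily be manipulated with the rich set of tools provided by the tidyverse to investigate more specific details.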

CellBench Case Study

2019-09-30
