Discover and download datasets and files from the cellxgene data portal

Martin Morgan1*

1Roswell Park Comprehensive Cancer Center

9 November 2022

Abstract

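The cellxgene data portal (https://cellxgene.cziscience.com/) provides a graphical user interface to collections of single-cell sequence data processed in standard ways to 'count matrix' summaries. The cellxgenedp package provides an alternative, R-based interface, allowing flexible data discovery, viewing, and downloading.

Package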

cellxgenedp 1.3.1

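Contents

1 Installation and use
2 cxg() provides a 'shiny' interface
3 Collections, datasets and files
4 Visualizing data in cellxgene
5 File download and use
6 Next steps
Session info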

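1 Installation and use

This package is available in Bioconductor version 3.15 and later. The following code installs cellxgenedp together with the other packages required for this vignette.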

pkgs <- c("cellxgenedp", "zellkonverter", "SingleCellExperiment", "HDF5Array")
# install only the packages that are not already present
required_pkgs <- pkgs[!pkgs %in% rownames(installed.packages())]
BiocManager::install(required_pkgs)

Use the following pkgs vector instead to install from GitHub (the latest, unchecked, development version)

pkgs <- c("mtmorgan/cellxgenedp", "zellkonverter", "SingleCellExperiment", "HDF5Array")

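Load the package into your current R session. We make extensive use of the dplyr package, and at the end of the vignette use SingleCellExperiment and zellkonverter, so load those as well.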

suppressPackageStartupMessages({
    library(zellkonverter)
    library(SingleCellExperiment) # load early to avoid masking dplyr::count()
    library(dplyr)
    library(cellxgenedp)
})

2 `cxg()` provides a 'shiny' interface

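The following sections outline how to use the cellxgenedp package in an R script; most of the functionality is also available in the cxg() shiny application, which provides an easy way to identify, download, and visualize one or several datasets. Start the app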

cxg()

Choose a project on the first tab, then select one dataset for visualization, or one or more datasets for download!

3 Collections, datasets and files

Retrieve metadata about the resources available at the cellxgene data portal using db():

db <- db()

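Printing the db object provides a brief overview of the available data, as well as hints, in the form of functions like collections(), for further exploration.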

db

## number of collections(): 106
## number of datasets(): 544
## number of files(): 1624

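The portal organizes data hierarchically, with 'collections' (roughly, research studies), 'datasets', and 'files'. Discover data using the corresponding functions.

collections(db)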

## # A tibble: 106 × 16
##    collec…¹ acces…² conta…³ conta…⁴ curat…⁵ data_…⁶ descr…⁷ genes…⁸ links  name
##    <chr>    <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <lgl>   <list> <chr>
##  1 03f821b… READ    km16@s… Kersti… Batuha… 2.0     …       NA      <list> …
##  2 43d4bb3… READ    raymon… raymon… Batuha… 2.0     Pertur… NA      <list> Tran…
##  3 0434a9d… READ    avilla… Alexan… Batuha… 2.0     …       NA      <list> …
##  4 3472f32… READ    wongcb… raymon… Batuha… 2.0     …       NA      <list> si…
##  5 2902f08… READ    lopes@… …       Wei Kh… 2.0     …       NA      <list> Sing…
##  6 83ed3be… READ    tom.ta… Tom Ta… Jennif… 2.0     During… NA      <list> Inte…
##  7 2b02dff… READ    miriam… Miriam… Batuha… 2.0     Clinic… NA      <list> Sing…
##  8 eb735cc… READ    rv4@sa… Roser … Batuha… 2.0     Human … NA      <list> Samp…
##  9 44531dd… READ    tallul… Tallul… Jennif… 2.0     The cr… NA      <list> Sing…
## 10 e75342a… READ    nhuebn… Norber… Jennif… 2.0     Pathog… NA      <list> Path…
## # … with 96 more rows, 6 more variables: publisher_metadata, visibility,
## #   created_at, published_at, revised_at, updated_at, and abbreviated
## #   variable names ¹collection_id, ²access_type, ³contact_email,
## #   ⁴contact_name, ⁵curator_name, ⁶data_submission_policy_version,
## #   ⁷description, ⁸genesets

datasets(db)

## # A tibble: 544 × 28
##    dataset_id     colle…¹ donor…² assay  cell_…³ cell_…⁴ datas…⁵ devel…⁶ disease
##    <chr>          <chr>   <list>  <list>   <int> <list>  <chr>   <list>  <list>
##  1 edc8d3fe-153c… 03f821… <list>  <list>  236977 <list>  https:… <list>  <list>
##  2 2a498ace-872…  03f821… <list>  <list>  422220 <list>  https:… <list>  <list>
##  3 f512b8b6-369d… 43d4bb… <list>  <list>   68036 <list>  https:… <list>  <list>
##  4 fa8605cf-f27e… 0434a9… <list>  <list>   59506 <list>  https:… <list>  <list>
##  5 d5c67a4e-a8d9… 3472f3… <list>  <list>   19694 <list>  https:… <list>  <list>
##  6 1f1c5c14-5949… 2902f0… <list>  <list>   20676 <list>  …       <list>  <list>
##  7 …              …       <list>  <list>   71732 <list>  https:… <list>  <list>
##  8 36c867a7-be10… 2b02df… <list>  <list>   32458 <list>  https:… <list>  <list>
##  9 c2a461b1-0c15… eb735c… <list>  <list>   97499 <list>  …       <list>  <list>
## 10 0895c838-e550… 44531d… <list>  <list>   146…  <list>  …       <list>  <list>
## # … with 534 more rows, 19 more variables: is_primary_data, is_valid,
## #   linked_genesets, mean_genes_per_cell, name, organism, processing_status,
## #   published, revision, schema_version, self_reported_ethnicity, sex,
## #   suspension_type, tissue, tombstone, created_at, published_at,
## #   revised_at, updated_at, and abbreviated variable names ¹collection_id, …

files(db)

## # A tibble: 1,624 × 8
##    file_id          datas…¹ filen…² filet…³ s3_uri user_…⁴ created_at updated_at
##    <chr>            <chr>   <chr>   <chr>   <chr>  <lgl>   <date>     <date>
##  1 8c4737ab-cd8d-4… edc8d3… explor… CXG     s3://… TRUE    2022-10-18 2022-10-18
##  2 15fda108-90fd-4… edc8d3… local.… RDS     s3://… TRUE    2022-10-18 2022-10-18
##  3 4a052f7b-7de0-4… edc8d3… local.… H5AD    s3://… TRUE    2022-10-18 2022-10-18
##  4 e6fd765c-bcfe-4… 2a498a… local.… H5AD    s3://… TRUE    2022-10-18 2022-10-18
##  5 9a8737f1-775f-4… 2a498a… explor… CXG     s3://… TRUE    2022-10-18 2022-10-18
##  6 aeb40efc-77ae-4… 2a498a… local.… RDS     s3://… TRUE    2022-10-18 2022-10-18
##  7 8eabe485-be71-4… f512b8… local.… H5AD    s3://… TRUE    2022-10-28 2022-10-28
##  8 33dfa406-132a-4… f512b8… local.… RDS     s3://… TRUE    2022-10-28 2022-10-28
##  9 454e0b3c-e207-4… f512b8… explor… CXG     s3://… TRUE    2022-10-28 2022-10-28
## 10 b50c48f1-ef5e-4… fa8605… local.… H5AD    s3://… TRUE    2022-10-20 2022-10-20
## # … with 1,614 more rows, and abbreviated variable names ¹dataset_id,
## #   ²filename, ³filetype, ⁴user_submitted

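Each of these resources has a unique primary identifier (e.g., file_id) as well as an identifier describing the relationship of the resource to other components of the database (e.g., dataset_id). These identifiers can be used to 'join' information across tables.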
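As a minimal sketch of how shared identifiers link the tables, the toy example below joins two small, made-up tibbles on dataset_id. The table contents (toy_datasets, toy_files, and identifiers such as "c1", "d1", "f1") are invented for illustration and do not come from the portal; only the column names mirror those shown above.

```r
library(dplyr)

# Made-up stand-ins for datasets(db) and files(db); values are illustrative only
toy_datasets <- tibble(
    dataset_id = c("d1", "d2"),
    collection_id = c("c1", "c1")
)
toy_files <- tibble(
    file_id = c("f1", "f2", "f3"),
    dataset_id = c("d1", "d1", "d2"),
    filetype = c("H5AD", "CXG", "H5AD")
)

# Follow collection -> dataset -> file by joining on the shared identifier
files_in_collection <- toy_datasets |>
    filter(collection_id == "c1") |>
    left_join(toy_files, by = "dataset_id")

files_in_collection
```

Each dataset row is repeated once per matching file, so the result has one row per file while still carrying the collection_id it belongs to.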

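3.1 Using `dplyr` to navigate data

A collection may have several datasets, and datasets may have several files. For instance, here is the collection with the most datasets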

collection_with_most_datasets <-
    datasets(db) |>
    count(collection_id, sort = TRUE) |>
    slice(1)

We can find out about this collection by joining with the collections() table

left_join(
    collection_with_most_datasets |> select(collection_id),
    collections(db),
    by = "collection_id"
) |> glimpse()

## Rows: 1
## Columns: 16
## $ collection_id                  <chr> "8e880741-bf9a-4c8e-9227-934204631d2a"
## $ access_type                    <chr> "READ"
## $ contact_email                  <chr> "jmarshal@broadinstitute.org"
## $ contact_name                   <chr> "Jamie L Marshall"
## $ curator_name                   <chr> "Jennifer Yu-Sheng Chien"
## $ data_submission_policy_version <chr> "2.0"
## $ description                    <chr> "High resolution spatial transcriptomics…
## $ genesets                       <lgl> NA
## $ links                          <list> [["", "RAW_DATA", "https://www.ncbi.nlm…
## $ name                           <chr> "High Resolution Slide-seqV2 Spatial Tr…
## $ publisher_metadata             <list> [[["Marshall", "Jamie L."], ["Noel", "T…
## $ visibility                     <chr> "PUBLIC"
## $ created_at                     <date> 2021-05-28
## $ published_at                   <date> 2021-12-09
## $ revised_at                     <date> 2022-10-24
## $ updated_at                     <date> 2022-10-24

We can take a similar strategy to identify all datasets belonging to this collection

left_join(
    collection_with_most_datasets |> select(collection_id),
    datasets(db),
    by = "collection_id"
)

## # A tibble: 129 × 28
##    collection_id  datas…¹ donor…² assay  cell_…³ cell_…⁴ datas…⁵ devel…⁶ disease
##    <chr>          <chr>   <list>  <list>   <int> <list>  <chr>   <list>  <list>
##  1 8e880741-bf9a… ff77ee… <list>  <list>   38024 <list>  https:… <list>  <list>
##  2 8e880741-bf9a… 5c451b… <list>  <list>   13147 <list>  https:… <list>  <list>
##  3 8e880741-bf9a… 4ebe33… <list>  <list>   17909 <list>  https:… <list>  <list>
##  4 8e880741-bf9a… 88b7da… <list>  <list>   44588 <list>  https:… <list>  <list>
##  5 8e880741-bf9a… 230eee… <list>  <list>   22430 <list>  https:… <list>  <list>
##  6 8e880741-bf9a… 1831d8… <list>  <list>       … <list>  https:… <list>  <list>
##  7 8e880741-bf9a… 868026… <list>  <list>   31194 <list>  …       <list>  <list>
##  8 …              …       <list>  <list>   22502 <list>  https:… <list>  <list>
##  9 8e880741-bf9a… 348383… <list>  <list>   27814 <list>  https:… <list>  <list>
## 10 8e880741-bf9a… 4f420b… <list>  <list>   19886 <list>  https:… <list>  <list>
## # … with 119 more rows, 19 more variables: is_primary_data, is_valid,
## #   linked_genesets, mean_genes_per_cell, name, organism, processing_status,
## #   published, revision, schema_version, self_reported_ethnicity, sex,
## #   suspension_type, tissue, tombstone, created_at, published_at,
## #   revised_at, updated_at, and abbreviated variable names ¹dataset_id,
## #   ²donor_id, …

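3.2 `facets()` provides information on 'levels' present in specific columns

Notice that some columns are 'lists' rather than atomic vectors like 'character' or 'integer'.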

datasets(db) |>
    select(where(is.list))

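## # A tibble: 544 × 11
##    donor_id   assay  cell_…¹ devel…² disease organ…³ proce…⁴      self_…⁵ sex
##    <list>     <list> <list>  <list>  <list>  <list>  <list>       <list>  <list>
##  1 <list>     <list> <list>  <list>  <list>  <list>  <named list> <list>  <list>
##  2 <list>     <list> <list>  <list>  <list>  <list>  <named list> <list>  <list>
##  3 <list [9]> <list> <list>  <list>  <list>  <list>  <named list> <list>  <list>
##  4 <list>     <list> <list>  <list>  <list>  <list>  <named list> <list>  <list>
##  5 <list [3]> <list> <list>  <list>  <list>  <list>  <named list> <list>  <list>
##  6 <list>     <list> <list>  <list>  <list>  <list>  <named list> <list>  <list>
##  7 <list>     <list> <list>  <list>  <list>  <list>  <named list> <list>  <list>
##  8 <list>     <list> <list>  <list>  <list>  <list>  <named list> <list>  <list>
##  9 <list>     <list> <list>  <list>  <list>  <list>  <named list> <list>  <list>
## 10 <list>     <list> <list>  <list>  <list>  <list>  <named list> <list>  <list>
## # … with 534 more rows, 2 more variables: suspension_type, tissue, and
## #   abbreviated variable names ¹cell_type, ²development_stage, ³organism,
## #   ⁴processing_status, ⁵self_reported_ethnicity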

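This indicates that at least some of the datasets have more than one type of assay, cell_type, etc. The facets() function provides a convenient way of discovering the possible levels of each column, e.g., assay, organism, self_reported_ethnicity, or sex, and the number of datasets with each label.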

facets(db, "assay")

## # A tibble: 32 × 4
##    facet label                          ontology_term_id     n
##    <chr> <chr>                          <chr>            <int>
##  1 assay 10x 3' v3                      EFO:0009922        192
##  2 assay 10x 3' v2                      EFO:0009899        165
##  3 assay Slide-seqV2                    EFO:0030062        129
##  4 assay 10x 5' v1                      EFO:0011025         45
##  5 assay Smart-seq2                     EFO:0008931         36
##  6 assay Visium Spatial Gene Expression EFO:0010961         35
##  7 assay 10x multiome                   EFO:0030059         24
##  8 assay Drop-seq                       EFO:0008722         12
##  9 assay 10x 3' transcription profiling EFO:0030003          9
## 10 assay 10x 5' v2                      EFO:0009900          9
## # … with 22 more rows

facets(db, "self_reported_ethnicity")

## # A tibble: 18 × 4
##    facet                   label                                  ontol…¹     n
##    <chr>                   <chr>                                  <chr>   <int>
##  1 self_reported_ethnicity unknown                                unknown   206
##  2 self_reported_ethnicity European                               HANCES…   190
##  3 self_reported_ethnicity na                                     na        176
##  4 self_reported_ethnicity Asian                                  HANCES…    76
##  5 self_reported_ethnicity African American                       HANCES…    37
##  6 self_reported_ethnicity multiethnic                            multie…    25
##  7 self_reported_ethnicity Greater Middle Eastern (Middle Easter… HANCES…    21
##  8 self_reported_ethnicity Hispanic or Latin American             HANCES…    16
##  9 self_reported_ethnicity African American or Afro-Caribbean     HANCES…     5
## 10 self_reported_ethnicity East Asian                             HANCES…     4
## 11 self_reported_ethnicity African                                HANCES…     3
## 12 self_reported_ethnicity South Asian                            HANCES…     2
## 13 self_reported_ethnicity Chinese                                HANCES…     1
## 14 self_reported_ethnicity Eskimo                                 HANCES…     1
## 15 self_reported_ethnicity Finnish                                HANCES…     1
## 16 self_reported_ethnicity Han Chinese                            HANCES…     1
## 17 self_reported_ethnicity Oceanian                               HANCES…     1
## 18 self_reported_ethnicity Pacific Islander                       HANCES…     1
## # … with abbreviated variable name ¹ontology_term_id

facets(db, "sex")

## # A tibble: 3 × 4
##   facet label   ontology_term_id     n
##   <chr> <chr>   <chr>            <int>
## 1 sex   male    PATO:0000384       461
## 2 sex   female  PATO:0000383       321
## 3 sex   unknown unknown             52
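To make the idea of counting 'levels' in a list-column concrete, here is a minimal sketch on a toy table. It assumes that each cell of a faceted column holds a list of records with label and ontology_term_id fields; that nested layout and the toy values are assumptions made for illustration, and the code is not how facets() itself is implemented.

```r
library(dplyr)

# Toy stand-in for datasets(db): each 'sex' cell is a list of records
# (assumed layout; the dataset ids are made up)
toy <- tibble(
    dataset_id = c("d1", "d2", "d3"),
    sex = list(
        list(list(label = "female", ontology_term_id = "PATO:0000383")),
        list(
            list(label = "female", ontology_term_id = "PATO:0000383"),
            list(label = "male", ontology_term_id = "PATO:0000384")
        ),
        list(list(label = "male", ontology_term_id = "PATO:0000384"))
    )
)

# Extract the label of every record, keeping track of the owning dataset
labels_per_dataset <- lapply(toy$sex, function(records) {
    vapply(records, function(record) record$label, character(1))
})
long <- tibble(
    dataset_id = rep(toy$dataset_id, lengths(labels_per_dataset)),
    label = unlist(labels_per_dataset)
)

# Number of datasets carrying each label
sex_counts <- long |> count(label, sort = TRUE)
sex_counts
```

Because a dataset can carry several labels, the counts can sum to more than the number of datasets, which is also why the n column in the sex facet above adds up to more than 544.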

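3.3 Filtering faceted columns

Suppose we are interested in finding datasets from the 10x 3' v3 assay (ontology_term_id of EFO:0009922) that contain individuals of African American ethnicity and female sex. Use the facets_filter() utility function to filter datasets as needed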

african_american_female <-
    datasets(db) |>
    filter(
        facets_filter(assay, "ontology_term_id", "EFO:0009922"),
        facets_filter(self_reported_ethnicity, "label", "African American"),
        facets_filter(sex, "label", "female")
    )

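Use nrow(african_american_female) to find the number of datasets satisfying our criteria. It looks like there are up to

african_american_female |>
    summarise(total_cell_count = sum(cell_count))

## # A tibble: 1 × 1
##   total_cell_count
##              <int>
## 1          2608650

cells sequenced (each dataset may contain cells from several ethnicities, as well as males or individuals of unknown gender, so we do not know the actual number of cells available without downloading files). Use left_join to identify the corresponding collections: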

## collections
left_join(
    african_american_female |> select(collection_id) |> distinct(),
    collections(db),
    by = "collection_id"
)

## # A tibble: 7 × 16
##   collec…¹ acces…² conta…³ conta…⁴ curat…⁵ data_…⁶ descr…⁷ genes…⁸ links  name
##   <chr>    <chr>   <chr>   <chr>   <chr>   <chr>   <chr>   <lgl>   <list> <chr>
## 1 c9706a92… READ   hnaksh… Harikr… Jennif… 2.0     "Singl… NA      <list> si…
## 2 2f75d249… READ   rsatij… Rahul … Jennif… 2.0     "…      NA      <list> Azim…
## 3 b9fc3d70… READ   bruce.… Bruce … Jennif… 2.0     "…      NA      <list> A …
## 4 62e8f058… READ   chanj3… Joseph… Jennif… 2.0     "155,0… NA      <list> HTAN…
## 5 625f6bf4… READ   a5wang… Allen … Jennif… 2.0     "…      NA      <list> Lung…
## 6 b953c942… READ   icobos… Inma C… Jennif… 2.0     "Tau a… NA      <list> Sing…
## 7 bcb61471… READ   info@k… KPMP    Jennif… 2.0     "…      NA      <list> An …
## # … with 6 more variables: publisher_metadata, visibility, created_at,
## #   published_at, revised_at, updated_at, and abbreviated variable names
## #   ¹collection_id, ²access_type, ³contact_email, ⁴contact_name,
## #   ⁵curator_name, ⁶data_submission_policy_version, ⁷description, ⁸genesets
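The filtering in this section can be pictured as keeping every dataset whose list-column contains at least one matching record. The sketch below illustrates that idea on a toy table; has_facet_value() is a hypothetical helper written for this example (it is not part of cellxgenedp), and the nested record layout and all values are assumptions.

```r
library(dplyr)

# Toy stand-in for datasets(db); ids, counts and record layout are made up
toy <- tibble(
    dataset_id = c("d1", "d2", "d3"),
    cell_count = c(100L, 250L, 75L),
    sex = list(
        list(list(label = "female", ontology_term_id = "PATO:0000383")),
        list(
            list(label = "female", ontology_term_id = "PATO:0000383"),
            list(label = "male", ontology_term_id = "PATO:0000384")
        ),
        list(list(label = "male", ontology_term_id = "PATO:0000384"))
    )
)

# Hypothetical helper: TRUE for rows where any record has record[[key]] == value
has_facet_value <- function(column, key, value) {
    vapply(column, function(records) {
        any(vapply(
            records,
            function(record) identical(record[[key]], value),
            logical(1)
        ))
    }, logical(1))
}

female <- toy |> filter(has_facet_value(sex, "label", "female"))

# Upper bound on the number of cells: dataset "d2" also contains male donors
total <- female |> summarise(total_cell_count = sum(cell_count))
total
```

Dataset "d2" passes the filter even though it also contains male donors, which mirrors why the total cell count above is only an upper bound on the number of cells of interest.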

4 Visualizing data in `cellxgene`

Discover the files associated with our selected datasets

selected_files <-
    left_join(
        african_american_female |> select(dataset_id),
        files(db),
        by = "dataset_id"
    )
selected_files

## # A tibble: 63 × 8
##    dataset_id       file_id filen…¹ filet…² s3_uri user_…³ created_at updated_at
##    <chr>            <chr>   <chr>   <chr>   <chr>  <lgl>   <date>     <date>
##  1 de985818-285f-4… 15e9d9… local.… H5AD    s3://… TRUE    2022-10-21 2022-10-21
##  2 de985818-285f-4… 0d3974… explor… CXG     s3://… TRUE    2022-10-21 2022-10-21
##  3 de985818-285f-4… e254f9… local.… RDS     s3://… TRUE    2022-10-21 2022-10-21
##  4 f72958f5-7f42-4… 59bd46… local.… RDS     s3://… TRUE    2022-10-18 2022-10-18
##  5 f72958f5-7f42-4… 3a2467… explor… CXG     s3://… TRUE    2022-10-18 2022-10-18
##  6 f72958f5-7f42-4… d9f9d0… local.… H5AD    s3://… TRUE    2022-10-18 2022-10-18
##  7 bc2a7b3d-f04e-4… f6d9f2… local.… H5AD    s3://… TRUE    2022-10-18 2022-10-18
##  8 bc2a7b3d-f04e-4… 46de9c… explor… CXG     s3://… TRUE    2022-10-18 2022-10-18
##  9 bc2a7b3d-f04e-4… 5331f2… local.… RDS     s3://… TRUE    2022-10-18 2022-10-18
## 10 96a3f64b-0ee9-4… b77452… local.… H5AD    s3://… TRUE    2022-10-18 2022-10-18
## # … with 53 more rows, and abbreviated variable names ¹filename, ²filetype,
## #   ³user_submitted

The filetype column lists the type of each file. The cellxgene service can be used to visualize datasets that have CXG files.

selected_files |>
    filter(filetype == "CXG") |>
    slice(1) |> # visualize a single dataset
    datasets_visualize()

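Visualization is an interactive process, so datasets_visualize() will open at most 5 browser tabs per call.

5 File download and use

Datasets usually contain CXG (cellxgene visualization), H5AD (files produced by the python AnnData module), and Rds (serialized files produced by the R Seurat package) files. There are no public parsers for CXG, and the Rds files may be unreadable if the version of Seurat used to create the file differs from the version used to read it. We therefore focus on the H5AD files. For illustration, we download one of our selected files.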

local_file <-
    selected_files |>
    filter(
        dataset_id == "3de0ad6d-4378-4f62-b37b-ec0b75a50d94",
        filetype == "H5AD"
    ) |>
    files_download(dry.run = FALSE)
basename(local_file)

## [1] "f69ba4b3-fc45-483c-8a7c-434fd056aeed.H5AD"

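These are downloaded to a local cache (use the internal function cellxgenedp:::.cellxgenedb_cache_path() for the location of the cache), so the process is only time-consuming the first time.

H5AD files can be converted to R / Bioconductor objects using the zellkonverter package.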

h5ad <- readH5AD(local_file, use_hdf5 = TRUE)

## Warning: 'X' matrix does not support transposition and has been skipped

h5ad

## class: SingleCellExperiment
## dim: 26329 46500
## metadata(3): cell_type_ontology_term_id_colors schema_version title
## assays(1): X
## rownames(26329): ENSG00000182308 ENSG00000124827 ... ENSG00000155229
##   ENSG00000105609
## rowData names(4): feature_is_filtered feature_name feature_reference
##   feature_biotype
## colnames(46500): D032_AACAAGACAGCCCACA D032_AACAGGGGTCCAGCGT ...
##   D231_TTCGCTGAGGAACATT
## colData names(26): nCount_RNA nFeature_RNA ... self_reported_ethnicity
##   development_stage
## reducedDimNames(1): X_umap
## mainExpName: NULL
## altExpNames(0):

The SingleCellExperiment object is a matrix-like object with rows corresponding to genes and columns to cells. Thus we can easily explore the cells present in the data.

h5ad |>
    colData() |>
    as_tibble() |>
    count(sex, donor_id)

## # A tibble: 9 × 3
##   sex    donor_id     n
## 1 female D088      5903
## 2 female D139      5217
## 3 female D175      1778
## 4 female D231      4680
## 5 male   D032      4970
## 6 male   D046      8894
## 7 male   D062      4852
## 8 male   D122      3935
## 9 male   D150      6271

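6 Next steps

The Orchestrating Single-Cell Analysis with Bioconductor online resource provides an excellent introduction to the analysis and visualization of single-cell data in R / Bioconductor. Extensive opportunities for working with AnnData objects in R, but using the native python interface, are briefly described in, e.g., the ?AnnData2SCE help page of zellkonverter.

The hca package provides programmatic access to the Human Cell Atlas data portal, allowing retrieval of primary as well as derived single-cell data files.

Session info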

## R Under development (unstable) (2022-10-25 r83175)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_GB              LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods
## [8] base
##
## other attached packages:
##  [1] cellxgenedp_1.3.1           dplyr_1.0.10
##  [3] SingleCellExperiment_1.21.0 SummarizedExperiment_1.29.1
##  [5] Biobase_2.59.0              GenomicRanges_1.51.1
##  [7] GenomeInfoDb_1.35.2         IRanges_2.33.0
##  [9] S4Vectors_0.37.0            BiocGenerics_0.45.0
## [11] MatrixGenerics_1.11.0       matrixStats_0.62.0
## [13] zellkonverter_1.9.0         BiocStyle_2.27.0
##
## loaded via a namespace (and not attached):
##  [1] dir.expiry_1.7.0       xfun_0.34              bslib_0.4.1
##  [4] htmlwidgets_1.5.4      lattice_0.20-45        rjsoncons_1.0.0
##  [7] vctrs_0.5.0            tools_4.3.0            bitops_1.0-7
## [10] generics_0.1.3         curl_4.3.3             parallel_4.3.0
## [13] tibble_3.1.8           fansi_1.0.3            pkgconfig_2.0.3
## [16] Matrix_1.5-1           assertthat_0.2.1       lifecycle_1.0.3
## [19] GenomeInfoDbData_1.2.9 compiler_4.3.0         stringr_1.4.1
## [22] httpuv_1.6.6           htmltools_0.5.3        sass_0.4.2
## [25] RCurl_1.98-1.9         yaml_2.3.6             later_1.3.0
## [28] pillar_1.8.1           jquerylib_0.1.4        ellipsis_0.3.2
## [31] DT_0.26                DelayedArray_0.25.0    cachem_1.0.6
## [34] mime_0.12              tidyselect_1.2.0
## [37] digest_0.6.30          stringi_1.7.8          bookdown_0.30
## [40] rprojroot_2.0.3        fastmap_1.1.0          grid_4.3.0
## [43] here_1.0.1             cli_3.4.1              magrittr_2.0.3
## [46] utf8_1.2.2             withr_2.5.0            promises_1.2.0.1
## [49] filelock_1.0.2         httr_1.4.4             rmarkdown_2.18
## [52] XVector_0.39.0         reticulate_1.26        png_0.1-7
## [55] shiny_1.7.3            evaluate_0.18          knitr_1.40
## [58] basilisk.utils_1.11.0  rlang_1.0.6            Rcpp_1.0.9
## [61] xtable_1.8-4           glue_1.6.2             DBI_1.1.3
## [64] BiocManager_1.30.19    jsonlite_1.8.3         R6_2.5.1
## [67] zlibbioc_1.45.0