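Analysing long read RNA-Seq data with bambu

Ying Chen, Andre Sim, Yuk Kei Wan, Jonathan Goke

Introduction

Bambu is a method for transcript discovery and quantification from long read RNA-Seq data. Bambu uses aligned reads and genomic reference annotations as input, and returns abundance estimates for all known transcripts and for newly discovered transcripts. Bambu uses the information from the reference annotations to correct misalignment at splice junctions, then reduces the aligned reads to read equivalence classes, and uses this information to identify novel transcripts across all samples of interest. Reads are then assigned to transcripts, and expression estimates are obtained using an expectation-maximisation algorithm. Here, we present an example workflow for analysing Nanopore long read RNA-Sequencing data from two human cancer cell lines from the Singapore Nanopore Expression Project (SG-NEx).

Quick start: transcript discovery and quantification with bambu

Installation

You can install bambu from GitHub: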
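A minimal sketch of the installation, assuming the remotes package is used as the installer (the text only names GitHub as the source; `devtools::install_github` would work the same way). The repository name is taken from the bambu GitHub URL given under Getting help.

```r
# Install the helper package first if it is missing (assumption: remotes is used).
if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
}

# Install bambu from the GoekeLab repository on GitHub.
remotes::install_github("GoekeLab/bambu")
```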

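General usage

The default mode to run bambu is using a set of aligned reads (bam files), reference genome annotations (gtf file, TxDb object, or bambuAnnotation object), and a reference genome sequence (fasta file or BSgenome). bambu returns a SummarizedExperiment object with the genomic coordinates of annotated and new transcripts and the transcript expression estimates. We highly recommend using the same annotations that were used for genome alignment. If you have a gtf file and a fasta file, you can run bambu with the following options: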

    library(bambu)

    # Paths to the example files shipped with the bambu package
    test.bam <- system.file("extdata", "SGNex_A549_directRNA_replicate5_run1_chr9_1_1000000.bam", package = "bambu")
    fa.file <- system.file("extdata", "Homo_sapiens.GRCh38.dna_sm.primary_assembly_chr9_1_1000000.fa", package = "bambu")
    gtf.file <- system.file("extdata", "Homo_sapiens.GRCh38.91_chr9_1_1000000.gtf", package = "bambu")

    # Prepare the annotation object, then run discovery and quantification
    bambuAnnotations <- prepareAnnotations(gtf.file)
    se <- bambu(reads = test.bam, annotations = bambuAnnotations, genome = fa.file)

bambu returns a SummarizedExperiment object, which can be accessed as follows:
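The accessors below come from the SummarizedExperiment package. This is a short sketch that assumes `se` is the object returned by the `bambu()` call in the quick start; the assay names in the comments are examples rather than a complete list.

```r
library(SummarizedExperiment)

# `se` is assumed to be the object returned by bambu() in the previous step.
assays(se)     # transcript abundance estimates (for example counts and CPM)
rowRanges(se)  # GRangesList with annotated and newly discovered transcripts
rowData(se)    # additional information about each transcript
colData(se)    # sample-level information
```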

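A complete workflow to identify and quantify transcript expression from Nanopore RNA-Seq data

To demonstrate the usage of bambu, we use long read RNA-Seq data generated with Oxford Nanopore sequencing from the NanoporeRNASeq package, which consists of 6 samples from two human cell lines (K562 and MCF7) generated by the SG-NEx project. Each cell line has three replicates: 1 direct RNA sequencing run and 2 cDNA sequencing runs. Reads are aligned to chromosome 22 (GRCh38) and stored as bam files. In this workflow, we demonstrate how to apply bambu to these bam files to identify novel transcripts and estimate transcript expression, visualise the results, and identify differentially expressed genes and transcripts.

Input data

Genome sequence (fasta file / BSgenome object)

bambu additionally requires a genome sequence, which is used to correct splice junctions in the read alignments. Ideally, the genome sequence file that was used for alignment should also be used for bambu.

As an option, users can also choose to use a BSgenome object: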
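A small sketch of the BSgenome option, assuming the GRCh38 data package listed in the session information is installed. The variable name `genomeSequenceBsgenome` is only illustrative.

```r
# Assumes the data package is installed, for example with
# BiocManager::install("BSgenome.Hsapiens.NCBI.GRCh38").
library(BSgenome.Hsapiens.NCBI.GRCh38)

# The BSgenome object can be passed to the `genome` argument of bambu()
# in place of the path to a fasta file.
genomeSequenceBsgenome <- BSgenome.Hsapiens.NCBI.GRCh38
```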

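Genome annotations (bambu annotation object / gtf file / TxDb object)

bambu also requires a reference transcript annotation object, which is used to correct read alignments, to determine the type of transcripts and genes (including the type of novel transcripts), and for quantification. The annotation object can be created from a gtf file: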
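As a sketch, the annotation object can be built from the example gtf file that ships with bambu (the same file used in the quick start). The variable name `annotation` is illustrative.

```r
library(bambu)

# Example gtf shipped with bambu (chromosome 9, first megabase).
gtf.file <- system.file("extdata",
    "Homo_sapiens.GRCh38.91_chr9_1_1000000.gtf", package = "bambu")

# Parse the gtf into a bambu annotation object (a GRangesList of transcripts).
annotation <- prepareAnnotations(gtf.file)
annotation
```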

The annotation object can also be created from a TxDb object:
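A hedged sketch of the TxDb route. Building the TxDb with `makeTxDbFromGFF` from GenomicFeatures is an assumption made here for illustration (in recent Bioconductor releases this function lives in the txdbmaker package); any existing TxDb object with transcript names can be passed instead.

```r
library(bambu)
library(GenomicFeatures)

gtf.file <- system.file("extdata",
    "Homo_sapiens.GRCh38.91_chr9_1_1000000.gtf", package = "bambu")

# Build a TxDb from the example gtf (illustrative; use your own TxDb if you have one).
txdb <- makeTxDbFromGFF(gtf.file, format = "gtf")

# prepareAnnotations() accepts the TxDb object directly.
annotationFromTxdb <- prepareAnnotations(txdb)
```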

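The annotation object can be stored and used again when re-running bambu. Here we use the annotation object from the NanoporeRNASeq package, which was prepared from a gtf file using the prepareAnnotations function in bambu.

Transcript discovery and quantification

Running bambu

Next, we apply bambu to the input data (bam files, annotations, genome sequence). bambu performs isoform discovery to extend the provided annotations, and then quantifies transcript expression for these extended annotations using an expectation-maximisation algorithm. Here we use 1 core; this can be increased to process multiple files in parallel.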
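A sketch of the call, under these assumptions: `bamFiles` holds the six bam files and `genomeSequence` the fasta path or BSgenome object (both names appear later in this document), and the annotation is loaded from NanoporeRNASeq under the dataset name `HsChr22BambuAnnotation`.

```r
library(bambu)
library(NanoporeRNASeq)

# Annotation object for chromosome 22 shipped with NanoporeRNASeq
# (dataset name assumed here).
data("HsChr22BambuAnnotation")

# `bamFiles` and `genomeSequence` are assumed to have been prepared already.
se <- bambu(reads = bamFiles, annotations = HsChr22BambuAnnotation,
    genome = genomeSequence, ncore = 1)
se
```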

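For the downstream analysis, we add the condition of interest to the part of the object that describes the samples. Here we are interested in a comparison of the 2 cell lines: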
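One way to do this is sketched below. It assumes that the sample sheet `SGNexSamples` from NanoporeRNASeq has a `cellLine` column and that its rows are in the same order as the columns of `se`; both should be checked before use.

```r
library(SummarizedExperiment)
library(NanoporeRNASeq)

# Sample sheet of the six SG-NEx samples (column name `cellLine` is assumed).
data("SGNexSamples")

# Store the cell line as the condition of interest in the sample annotation.
colData(se)$condition <- as.factor(SGNexSamples$cellLine)
colData(se)
```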

Optionally, users can choose to apply bambu for quantification only (without isoform discovery):
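A sketch of quantification without discovery, using the `discovery` argument of `bambu()`. The input objects are the same assumed placeholders as in the main call (`bamFiles`, `HsChr22BambuAnnotation`, `genomeSequence`), and `seUnextended` is an illustrative name.

```r
library(bambu)

# discovery = FALSE skips isoform discovery, so only the transcripts in the
# provided annotation are quantified.
seUnextended <- bambu(reads = bamFiles, annotations = HsChr22BambuAnnotation,
    genome = genomeSequence, discovery = FALSE)
```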

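Visualise results

bambu provides functions to visualise and explore the results. When multiple samples are used, we can visualise the correlation and clustering of all samples with a heatmap:

Additionally, we can visualise the correlation between samples with a 2-dimensional PCA plot.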
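Both views can be produced with `plotBambu`, as sketched here. The block assumes `se` is the six-sample result of `bambu()`; these plots need more than one sample.

```r
library(bambu)

# `se` is assumed to be the multi-sample object returned by bambu().
plotBambu(se, type = "heatmap")  # sample correlation heatmap with clustering
plotBambu(se, type = "pca")      # samples on the first two principal components
```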

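In addition to visualising the correlation between samples, bambu also provides a function to visualise the extended annotations and expression estimates for individual genes. Here we look at gene ENSG00000099968 and visualise the transcript coordinates of annotated and novel isoforms together with the expression levels of these isoforms across all samples.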
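A sketch of the gene-level plot, again assuming `se` is the result of the `bambu()` run on the six samples:

```r
library(bambu)

# Plot the annotated and novel isoforms of one gene with their expression levels.
plotBambu(se, type = "annotation", gene_id = "ENSG00000099968")
```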

    #> [[1]]
    #> TableGrob (3 x 1) "arrange": 3 grobs
    #>   z     cells    name                 grob
    #> 1 1 (2-2,1-1) arrange       gtable[layout]
    #> 2 2 (3-3,1-1) arrange       gtable[layout]
    #> 3 3 (1-1,1-1) arrange text[GRID.text.250]

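Obtain gene expression estimates from transcript expression

Gene expression can be calculated from the transcript expression estimates returned by bambu using the transcriptToGeneExpression function. Looking at the output, we can see that novel genes have been identified as well.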
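A sketch of the aggregation step, assuming `se` is the transcript-level result from `bambu()`. The name `seGene` matches the object used in the differential expression code later in this document.

```r
library(bambu)

# Sum transcript-level estimates to gene level; the result is again a
# SummarizedExperiment, now with one row per gene.
seGene <- transcriptToGeneExpression(se)
seGene
```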

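We can again use the plotBambu function to visualise the gene expression data across the 6 samples with a heatmap or PCA plot. As expected, samples from the same cell line show higher correlation than samples from different cell lines.

Save data (gtf / text)

bambu includes a function to write the extended annotations and the transcript and gene expression estimates, including any newly discovered genes and transcripts, to text files.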
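A sketch using `writeBambuOutput`. The output directory and the file prefix are illustrative choices, and `se` is assumed to be the result of `bambu()`.

```r
library(bambu)

# Write the extended annotations and the transcript and gene level estimates
# to a directory of your choice (a temporary directory here).
save.dir <- tempdir()
writeBambuOutput(se, path = save.dir, prefix = "NanoporeRNASeq_")
list.files(save.dir, pattern = "^NanoporeRNASeq_")
```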

bambu also includes a function that exports only the extended annotations to a gtf file:
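A sketch using `writeToGTF`, with an illustrative output path; `se` is assumed to be the result of `bambu()`.

```r
library(bambu)
library(SummarizedExperiment)

# Export only the extended annotations (annotated plus novel transcripts).
gtfOut <- file.path(tempdir(), "NanoporeRNASeq_extended.gtf")  # illustrative path
writeToGTF(rowRanges(se), file = gtfOut)
```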

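Identifying differentially expressed genes

One of the most common tasks when analysing RNA-Seq data is the analysis of differential gene expression across a condition of interest. Here we use DESeq2 to find the differentially expressed genes between the MCF7 and K562 cell lines. As with estimates from Salmon, the estimates from bambu are first rounded.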

    library(DESeq2)

    # Round the gene-level estimates to integer counts, as required by DESeq2
    dds <- DESeqDataSetFromMatrix(round(assays(seGene)$counts), colData = colData(se), design = ~condition)
    dds.deseq <- DESeq(dds)
    deGeneRes <- DESeq2::results(dds.deseq, independentFiltering = FALSE)
    head(deGeneRes[order(deGeneRes$padj), ])
    #> log2 fold change (MLE): condition MCF7 vs K562
    #> Wald test p-value: condition MCF7 vs K562
    #> DataFrame with 6 rows and 6 columns
    #>                  baseMean log2FoldChange     lfcSE      stat
    #>                 <numeric>      <numeric> <numeric> <numeric>
    #> ENSG00000185686  500.6470       -7.15159  0.500364 -14.29278
    #> ENSG00000283633   95.7518       -9.10519  1.246517  -7.30451
    #> ENSG00000197077   26.9189        9.17563  1.320390   6.94918
    #> ENSG00000240972 2443.3934        2.47082  0.357369   6.91392
    #> ENSG00000099977  235.8601        1.92963  0.306609   6.29347
    #> ENSG00000169635   43.5245       -3.36897  0.569649  -5.91411
    #>                 pvalue
    #>                 <numeric>
    #> ENSG00000185686 0.000000000000000000000000000000000000000000000242728
    #> ENSG00000283633 0.000000000000278275679947602694397391918138592006762
    #> ENSG00000197077 0.000000000003674139901044544811771829869715655608568
    #> ENSG00000240972 0.000000000004714473225560826484330951067763324359265
    #> ENSG00000099977 0.000000000310450546572332925146307722118438170155752
    #> ENSG00000169635 0.000000003336692881462437529443458496432317605950857
    #>                 padj
    #>                 <numeric>
    #> ENSG00000185686 0.0000000000000000000000000000000000000000000631093
    #> ENSG00000283633 0.0000000000361758383931883483531512278605162501177
    #> ENSG00000197077 0.0000000003064407596614537449075252608477826568589
    #> ENSG00000240972 0.0000000003064407596614537449075252608477826568589
    #> ENSG00000099977 0.0000000161434284217613106600419295823603538231339
    #> ENSG00000169635 0.0000001445900248633722995599947686029551618958067

A quick summary of the differentially expressed genes:
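The summary can be obtained with the `summary` method that DESeq2 provides for results objects. This sketch assumes `deGeneRes` is the results object created above.

```r
library(DESeq2)

# Counts of up- and down-regulated genes at the default adjusted p-value cutoff.
summary(deGeneRes)
```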

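We can also visualise the differentially expressed genes with an MA-plot. However, an MA-plot of the original log2 fold changes is affected by the noise associated with low-count genes. As recommended in the DESeq2 tutorial (//www.andersvercelli.com/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#alternative-shrinkage-estimators), we apply shrinkage to the effect sizes, which improves the visualisation without requiring arbitrary filtering thresholds.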
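A sketch of the shrinkage step. The use of the apeglm estimator is an assumption based on the package list in the session information, the coefficient name follows the "MCF7 vs K562" comparison shown in the results above, and the y-axis limits are illustrative.

```r
library(DESeq2)
library(apeglm)

# `dds.deseq` is assumed to be the fitted DESeqDataSet from the previous step.
# resultsNames(dds.deseq) lists the available coefficient names.
resLFC <- lfcShrink(dds.deseq, coef = "condition_MCF7_vs_K562", type = "apeglm")

# MA-plot of the shrunken log2 fold changes.
plotMA(resLFC, ylim = c(-3, 3))
```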

Identifying differential transcript usage

We use DEXSeq to detect differentially used isoforms.
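A sketch of how transcript-level estimates could be passed to DEXSeq, treating each transcript as a feature and each gene as a group. It assumes `se` carries the `condition` column in its sample data and the `TXNAME` and `GENEID` columns in its row data; the design formula is the DEXSeq default.

```r
library(DEXSeq)
library(SummarizedExperiment)

# Transcripts play the role of DEXSeq "exons", genes the role of groups.
dxd <- DEXSeqDataSet(
    countData = round(assays(se)$counts),
    sampleData = as.data.frame(colData(se)),
    design = ~ sample + exon + condition:exon,
    featureID = rowData(se)$TXNAME,
    groupID = rowData(se)$GENEID)

# Run the full DEXSeq pipeline and look at the first results.
dxr <- DEXSeq(dxd)
head(dxr)
```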

We can visualise the results with an MA-plot:
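A sketch assuming `dxr` is the DEXSeq results object for the transcripts quantified by bambu; the point size is an illustrative choice.

```r
library(DEXSeq)

# MA-plot of the DEXSeq results, one point per transcript.
plotMA(dxr, cex = 0.8)
```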

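Running bambu with a large number of samples

For larger numbers of samples we recommend writing the processed data to a file. This can be done by providing an output directory with rcOutDir: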

    se <- bambu(reads = bamFiles, rcOutDir = "./bambu/", annotations = annotation, genome = genomeSequence)

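Bambu parameters

Advanced options

For transcript discovery, we recommend adjusting the parameters according to the number of replicates and the sequencing throughput. The most relevant parameters are explained here. You can use any combination of these parameters.

More stringent filtering thresholds imposed on potential novel transcripts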
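A sketch of stricter discovery settings passed through `opt.discovery`. The threshold values are illustrative only, and `bamFiles`, `annotation` and `genomeSequence` are placeholders for your own inputs; see `?bambu` for the full list of options and their defaults.

```r
library(bambu)

# Illustrative thresholds, to be tuned to replicate number and throughput:
# - min.readCount: minimum read count for a candidate transcript
# - min.sampleNumber: minimum number of samples that must pass the thresholds
# - min.readFractionByGene: minimum fraction of a gene's reads on the candidate
se <- bambu(reads = bamFiles, annotations = annotation, genome = genomeSequence,
    opt.discovery = list(min.readCount = 5,
        min.sampleNumber = 2,
        min.readFractionByGene = 0.1))
```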

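Quantification without bias correction

By default, bias correction is applied automatically when estimating expression. However, you can choose to perform the quantification without bias correction.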
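A sketch of switching off bias correction through `opt.em`. The option name `degradationBias` is an assumption about recent bambu releases (earlier releases used a different name), so check `?bambu` for the installed version. The input objects are placeholders.

```r
library(bambu)

# Assumption: the bias correction switch is called `degradationBias`.
se <- bambu(reads = bamFiles, annotations = annotation, genome = genomeSequence,
    opt.em = list(degradationBias = FALSE))
```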

Parallel computation

bambu allows parallel computation.
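A sketch using the `ncore` argument, which also appears in the workflow above with a value of 1. The number of cores is illustrative and the input objects are placeholders.

```r
library(bambu)

# Process several samples in parallel; pick a value that fits your machine.
se <- bambu(reads = bamFiles, annotations = annotation, genome = genomeSequence,
    ncore = 4)
```

Since parallelisation is described above as a way to process multiple files at once, more cores than input files are unlikely to help.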

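See the bambu GitHub page (https://github.com/GoekeLab/bambu) for details on customising other options.

Getting help

Questions and issues can be raised at the Bioconductor support site (once bambu is available through Bioconductor): https://support.bioconductor.org. Please tag your posts with bambu.

Alternatively, issues can be raised at the bambu GitHub repository: https://github.com/GoekeLab/bambu

Citing bambu

A manuscript describing bambu is currently in preparation. If you use bambu for your research, please cite using the following doi: 10.5281/zenodo.3900025.

Session information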

    sessionInfo()
    #> R version 4.2.0 RC (2022-04-21 r82226)
    #> Platform: x86_64-pc-linux-gnu (64-bit)
    #> Running under: Ubuntu 20.04.4 LTS
    #>
    #> Matrix products: default
    #> BLAS:   /home/biocbuild/bbs-3.16-bioc/R/lib/libRblas.so
    #> LAPACK: /home/biocbuild/bbs-3.16-bioc/R/lib/libRlapack.so
    #>
    #> locale:
    #>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
    #>  [3] LC_TIME=en_GB              LC_COLLATE=C
    #>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
    #>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
    #>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
    #> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
    #>
    #> attached base packages:
    #> [1] stats4    stats     graphics  grDevices utils     datasets  methods
    #> [8] base
    #>
    #> other attached packages:
    #>  [1] DEXSeq_1.43.0
    #>  [2] RColorBrewer_1.1-3
    #>  [3] AnnotationDbi_1.59.0
    #>  [4] BiocParallel_1.31.0
    #>  [5] apeglm_1.19.0
    #>  [6] DESeq2_1.37.0
    #>  [7] ggplot2_3.3.5
    #>  [8] BSgenome.Hsapiens.NCBI.GRCh38_1.3.1000
    #>  [9] Rsamtools_2.13.0
    #> [10] NanoporeRNASeq_1.7.0
    #> [11] ExperimentHub_2.5.0
    #> [12] AnnotationHub_3.5.0
    #> [13] BiocFileCache_2.5.0
    #> [14] dbplyr_2.1.1
    #> [15] bambu_2.3.0
    #> [16] BSgenome_1.65.0
    #> [17] rtracklayer_1.57.0
    #> [18] Biostrings_2.65.0
    #> [19] XVector_0.37.0
    #> [20] SummarizedExperiment_1.27.1
    #> [21] Biobase_2.57.0
    #> [22] GenomicRanges_1.49.0
    #> [23] GenomeInfoDb_1.33.1
    #> [24] IRanges_2.31.0
    #> [25] S4Vectors_0.35.0
    #> [26] BiocGenerics_0.43.0
    #> [27] MatrixGenerics_1.9.0
    #> [28] matrixStats_0.62.0
    #>
    #> loaded via a namespace (and not attached):
    #>   [1] utf8_1.2.2                    tidyselect_1.1.2
    #>   [3] RSQLite_2.2.13                htmlwidgets_1.5.4
    #>   [5] grid_4.2.0                    munsell_0.5.0
    #>   [7] codetools_0.2-18              statmod_1.4.36
    #>   [9] xgboost_1.6.0.1               withr_2.5.0
    #>  [11] colorspace_2.0-3              filelock_1.0.2
    #>  [13] OrganismDbi_1.39.0            highr_0.9
    #>  [15] knitr_1.39                    rstudioapi_0.13
    #>  [17] labeling_0.4.2                bbmle_1.0.24
    #>  [19] GenomeInfoDbData_1.2.8        hwriter_1.3.2.1
    #>  [21] bit64_4.0.5                   farver_2.1.0
    #>  [23] coda_0.19-4                   vctrs_0.4.1
    #>  [25] generics_0.1.2                xfun_0.30
    #>  [27] biovizBase_1.45.0             R6_2.5.1
    #>  [29] doParallel_1.0.17             clue_0.3-60
    #>  [31] locfit_1.5-9.5                AnnotationFilter_1.21.0
    #>  [33] bitops_1.0-7                  cachem_1.0.6
    #>  [35] reshape_0.8.9                 DelayedArray_0.23.0
    #>  [37] assertthat_0.2.1              promises_1.2.0.1
    #>  [39] BiocIO_1.7.0                  scales_1.2.0
    #>  [41] nnet_7.3-17                   gtable_0.3.0
    #>  [43] Cairo_1.5-15                  ggbio_1.45.0
    #>  [45] ensembldb_2.21.1              rlang_1.0.2
    #>  [47] genefilter_1.79.0             GlobalOptions_0.1.2
    #>  [49] splines_4.2.0                 lazyeval_0.2.2
    #>  [51] dichromat_2.0-0               checkmate_2.1.0
    #>  [53] BiocManager_1.30.17           yaml_2.3.5
    #>  [55] reshape2_1.4.4                GenomicFeatures_1.49.1
    #>  [57] backports_1.4.1               httpuv_1.6.5
    #>  [59] Hmisc_4.7-0                   RBGL_1.73.0
    #>  [61] tools_4.2.0                   ellipsis_0.3.2
    #>  [63] jquerylib_0.1.4               Rcpp_1.0.8.3
    #>  [65] plyr_1.8.7                    base64enc_0.1-3
    #>  [67] progress_1.2.2                zlibbioc_1.43.0
    #>  [69] purrr_0.3.4                   RCurl_1.98-1.6
    #>  [71] prettyunits_1.1.1             rpart_4.1.16
    #>  [73] GetoptLong_1.0.5              cluster_2.1.3
    #>  [75] magrittr_2.0.3                data.table_1.14.2
    #>  [77] magick_2.7.3                  circlize_0.4.14
    #>  [79] mvtnorm_1.1-3                 ProtGenerics_1.29.0
    #>  [81] hms_1.1.1                     mime_0.12
    #>  [83] evaluate_0.15                 xtable_1.8-4
    #>  [85] XML_3.99-0.9                  emdbook_1.3.12
    #>  [87] jpeg_0.1-9                    gridExtra_2.3
    #>  [89] shape_1.4.6                   bdsmatrix_1.3-4
    #>  [91] compiler_4.2.0                biomaRt_2.53.0
    #>  [93] tibble_3.1.6                  crayon_1.5.1
    #>  [95] htmltools_0.5.2               later_1.3.0
    #>  [97] Formula_1.2-4                 tidyr_1.2.0
    #>  [99] geneplotter_1.75.0            DBI_1.1.2
    #> [101] formatR_1.12                  ComplexHeatmap_2.13.0
    #> [103] MASS_7.3-57                   rappdirs_0.3.3
    #> [105] Matrix_1.4-1                  cli_3.3.0
    #> [107] parallel_4.2.0                pkgconfig_2.0.3
    #> [109] GenomicAlignments_1.33.0      numDeriv_2016.8-1.1
    #> [111] foreign_0.8-82                xml2_1.3.3
    #> [113] foreach_1.5.2                 annotate_1.75.0
    #> [115] bslib_0.3.1                   stringr_1.4.0
    #> [117] VariantAnnotation_1.43.0      digest_0.6.29
    #> [119] graph_1.75.0                  rmarkdown_2.14
    #> [121] htmlTable_2.4.0               restfulr_0.0.13
    #> [123] curl_4.3.2                    shiny_1.7.1
    #> [125] rjson_0.2.21                  lifecycle_1.0.1
    #> [127] jsonlite_1.8.0                fansi_1.0.3
    #> [129] pillar_1.7.0                  lattice_0.20-45
    #> [131] GGally_2.1.2                  KEGGREST_1.37.0
    #> [133] fastmap_1.1.0                 httr_1.4.2
    #> [135] survival_3.3-1                interactiveDisplayBase_1.35.0
    #> [137] glue_1.6.2                    png_0.1-7
    #> [139] iterators_1.0.14              BiocVersion_3.16.0
    #> [141] bit_4.0.4                     stringi_1.7.6
    #> [143] sass_0.4.1                    blob_1.2.3
    #> [145] latticeExtra_0.6-29           memoise_2.0.1
    #> [147] dplyr_1.0.9