Package: AnnotationHub
Authors: Bioconductor Package Maintainer [cre], Martin Morgan [aut], Marc Carlson [ctb], Dan Tenenbaum [ctb], Sonali Arora [ctb], Valerie Oberchain [ctb], Kayla Morrell [ctb], Lori Shepherd [aut]
Modified: Sun Jun 28 10:41:23 2015
Compiled: Tue Apr 26 15:58:42 2022
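Bioconductor offers pre-built org.* annotation packages for model organisms; their use is described in the OrgDb section of the Annotation workflow. Here we discover the OrgDb objects available for less-studied (non-model) organisms.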
library(AnnotationHub)
ah <- AnnotationHub()
## snapshotDate(): 2022-04-21
query(ah, "OrgDb")
## AnnotationHub with 1759 records
## # snapshotDate(): 2022-04-21
## # $dataprovider: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/
## # $species: Escherichia coli, greater Indian fruit bat, Zythia rabiei, Zymoseptoria tritici_IPO3...
## # $rdataclass: OrgDb
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## #   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH95973"]]'
##
##              title
##   AH95973  | org.Brassica_napus.eg.sqlite
##   AH95974  | org.Arachis_hypogaea.eg.sqlite
##   AH95975  | org.Hibiscus_syriacus.eg.sqlite
##   AH95976  | org.Triticum_dicoccoides.eg.sqlite
##   AH95977  | org.Triticum_turgidum_subsp._dicoccoides.eg.sqlite
##   ...        ...
##   AH100412 | org.Mmu.eg.db.sqlite
##   AH100413 | org.Ce.eg.db.sqlite
##   AH100414 | org.Xl.eg.db.sqlite
##   AH100415 | org.Sc.sgd.db.sqlite
##   AH100416 | org.Dr.eg.db.sqlite
orgdb <- query(ah, c("OrgDb", "maintainer@bioconductor.org"))[[1]]
## loading from cache
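The object returned by AnnotationHub can be used directly with the select() interface, e.g., to discover the keytypes available for querying the object, the columns those keytypes can map to, and finally to select the SYMBOL and GENENAME corresponding to the first 6 ENTREZID values.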
keytypes(orgdb)
##  [1] "ACCNUM"      "ALIAS"       "ENTREZID"    "EVIDENCE"    "EVIDENCEALL" "GENENAME"
##  [7] "GID"         "GO"          "GOALL"       "ONTOLOGY"    "ONTOLOGYALL" "PMID"
## [13] "REFSEQ"      "SYMBOL"      "UNIGENE"
columns(orgdb)
##  [1] "ACCNUM"      "ALIAS"       "CHR"         "ENTREZID"    "EVIDENCE"    "EVIDENCEALL"
##  [7] "GENENAME"    "GID"         "GO"          "GOALL"       "ONTOLOGY"    "ONTOLOGYALL"
## [13] "PMID"        "REFSEQ"      "SYMBOL"      "UNIGENE"
egid <- head(keys(orgdb, "ENTREZID"))
select(orgdb, egid, c("SYMBOL", "GENENAME"), "ENTREZID")
## 'select()' returned 1:1 mapping between keys and columns
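##    ENTREZID       SYMBOL                                               GENENAME
## 1 106345161 LOC106345161                                      protein LAZ1-like
## 2 106345162 LOC106345162                           uncharacterized LOC106345162
## 3 106345163 LOC106345163               upstream activation factor subunit UAF30
## 4 106345164 LOC106345164                           uncharacterized LOC106345164
## 5 106345165 LOC106345165                               protein short-chain-like
## 6 106345166 LOC106345166 glutamyl-tRNA reductase-binding protein, chloroplastic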
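All Roadmap Epigenomics files are hosted on the project's web server. Someone who had to download these files by hand would navigate the web interface to find the useful files, then use something like the following R code.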
library(rtracklayer)  # provides import()
url <- "http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/E001-H3K4me1.broadPeak.gz"
filename <- basename(url)
download.file(url, destfile=filename)
# import only if the download actually produced a file
if (file.exists(filename))
    data <- import(filename, format="bed")
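This would have to be repeated for every file, and the burden of identifying, downloading, importing, and managing the local disk location of these files would fall on the user.
AnnotationHub reduces this task to just a few lines of R code
library(AnnotationHub)
ah = AnnotationHub()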
## snapshotDate(): 2022-04-21
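epiFiles <- query(ah, "EpigenomeRoadMap")
A look at the value returned by epiFiles shows that 18248 Roadmap resources are available via AnnotationHub. Additional information about the files is also available, e.g., where the files came from (dataprovider), genome, species, sourceurl, and sourcetype.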
epiFiles
## AnnotationHub with 18248 records
## # snapshotDate(): 2022-04-21
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: BigWigFile, GRanges, data.frame
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## #   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH28856"]]'
##
##             title
##   AH28856 | E001-H3K4me1.broadPeak.gz
##   AH28857 | E001-H3K4me3.broadPeak.gz
##   AH28858 | E001-H3K9ac.broadPeak.gz
##   AH28859 | E001-H3K9me3.broadPeak.gz
##   AH28860 | E001-H3K9me3.broadPeak.gz
##   ...       ...
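A good sanity check that we have files only from the Roadmap Epigenomics project is to confirm that all the files in the returned, smaller hub object come from Homo sapiens and the hg19 genome
unique(epiFiles$species)
## [1] "Homo sapiens"
unique(epiFiles$genome)
## [1] "hg19"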
Broadly, one can get an idea of the different kinds of files in this project by looking at the sourcetype
table(epiFiles$sourcetype)
##
##    BED BigWig    GTF    Zip    tab
##   8298   9932      3     14      1
To get a more descriptive idea of these different files one can use:
sort(table(epiFiles$description), decreasing=TRUE)
##
## Bigwig File containing -log10(p-value) signal tracks from EpigenomeRoadMap Project
##   6881
## Bigwig File containing fold enrichment signal tracks from EpigenomeRoadMap Project
##   2947
## Narrow ChIP-seq peaks for consolidated epigenomes from EpigenomeRoadMap Project
##   2894
## Broad ChIP-seq peaks for consolidated epigenomes from EpigenomeRoadMap Project
##   2534
## Gapped ChIP-seq peaks for consolidated epigenomes from EpigenomeRoadMap Project
##   2534
## Narrow DNasePeaks for consolidated epigenomes from EpigenomeRoadMap Project
##   131
## 15 state chromatin segmentations from EpigenomeRoadMap Project
##   127
## Broad domains on enrichment for DNase-seq for consolidated epigenomes from EpigenomeRoadMap Project
##   78
## RRBS fractional methylation calls from EpigenomeRoadMap Project
##   51
## Whole genome bisulphite fractional methylation calls from EpigenomeRoadMap Project
##   37
## MeDIP/MRE(mCRF) fractional methylation calls from EpigenomeRoadMap Project
##   16
## GencodeV10 gene/transcript coordinates and annotations corresponding to hg19 version of the human genome
##   3
## RNA-seq read count matrix for intronic protein-coding RNA elements
##   2
## RNA-seq read counts matrix for ribosomal gene exons
##   2
## RPKM expression matrix for ribosomal gene exons
##   2
## Metadata for EpigenomeRoadMap Project
##   1
## RNA-seq read counts matrix for non-coding RNAs
##   1
## RNA-seq read counts matrix for protein coding exons
##   1
## RNA-seq read counts matrix for protein coding genes
##   1
## RNA-seq read counts matrix for ribosomal genes
##   1
## RPKM expression matrix for non-coding RNAs
##   1
## RPKM expression matrix for protein coding exons
##   1
## RPKM expression matrix for protein coding genes
##   1
## RPKM expression matrix for ribosomal RNAs
##   1
The 'metadata' provided by the Roadmap Epigenomics Project is also available. Note that the information displayed for a hub with a single resource is quite different from the information displayed when the hub references more than one resource.
metadata.tab <- query(ah, c("EpigenomeRoadMap", "Metadata"))
metadata.tab
## AnnotationHub with 1 record
## # snapshotDate(): 2022-04-21
## # names(): AH41830
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: data.frame
## # $title: EID_metadata.tab
## # $description: Metadata for EpigenomeRoadMap Project
## # $taxonomyid: 9606
## # $genome: hg19
## # $sourcetype: tab
## # $sourceurl: http://egg2.wustl.edu/roadmap/data/byFileType/metadata/EID_metadata.tab
## # $sourcesize: 18035
## # $tags: c("EpigenomeRoadMap", "Metadata")
## # retrieve record with 'object[["AH41830"]]'
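So far we have been exploring information about resources without downloading any resource to the local cache or importing it into R. A resource can be retrieved with [[, as indicated at the end of the show method
## loading from cache
metadata.tab <- ah[["AH41830"]]
## loading from cache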
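The metadata.tab file is returned as a data.frame. The first 6 rows of the first 5 columns are shown here:
metadata.tab[1:6, 1:5]
##    EID    GROUP   COLOR          MNEMONIC                                   STD_NAME
## 1 E001      ESC #924965            ESC.I3                                ES-I3 Cells
## 2 E002      ESC #924965           ESC.WA7                               ES-WA7 Cells
## 3 E003      ESC #924965            ESC.H1                                   H1 Cells
## 4 E004 ES-deriv #4178AE ESDR.H1.BMP4.MESO H1 BMP4 Derived Mesendoderm Cultured Cells
## 5 E005 ES-deriv #4178AE ESDR.H1.BMP4.TROP H1 BMP4 Derived Trophoblast Cultured Cells
## 6 E006 ES-deriv #4178AE       ESDR.H1.MSC          H1 Derived Mesenchymal Stem Cells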
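One can keep constructing different queries with multiple search terms to trim these 18248 records down to the files of interest. For example, to get the ChIP-Seq files for consolidated epigenomes, one could use
bpChipEpi <- query(ah, c("EpigenomeRoadMap", "broadPeak", "chip", "consolidated"))
To get all the bigWig signal files, one can query the hub using
allBigWigFiles <- query(ah, c("EpigenomeRoadMap", "BigWig"))
To access the 15 state chromatin segmentations, one can use
seg <- query(ah, c("EpigenomeRoadMap", "segmentations"))
If one is interested in getting all the files related to one sample
E126 <- query(ah, c("EpigenomeRoadMap", "E126", "H3K4ME2"))
E126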
## AnnotationHub with 6 records
## # snapshotDate(): 2022-04-21
## # $dataprovider: BroadInstitute
## # $species: Homo sapiens
## # $rdataclass: GRanges, BigWigFile
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## #   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH29817"]]'
##
##             title
##   AH29817 | E126-H3K4me2.broadPeak.gz
##   AH30868 | E126-H3K4me2.narrowPeak.gz
##   AH31801 | E126-H3K4me2.gappedPeak.gz
##   AH32990 | E126-H3K4me2.fc.signal.bigwig
##   AH34022 | E126-H3K4me2.pval.signal.bigwig
##   AH40177 | E126-H3K4me2.imputed.pval.signal.bigwig
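Each additional search term narrows the result, because a record is kept only when every term matches somewhere in its metadata, regardless of case. The following base-R sketch mimics that behaviour on a toy table; it is an illustration only, not the package's implementation. The helper match_all() and the tags values are hypothetical, and the titles are copied from the listings above.

```r
# Toy metadata: titles come from the listings above; the 'tags' values are hypothetical.
toy <- data.frame(
  title = c("E126-H3K4me2.broadPeak.gz",
            "E126-H3K4me2.fc.signal.bigwig",
            "E001-H3K4me1.broadPeak.gz"),
  tags = c("EpigenomeRoadMap, broadPeak",
           "EpigenomeRoadMap, BigWig",
           "EpigenomeRoadMap, broadPeak"),
  row.names = c("AH29817", "AH32990", "AH28856")
)

# Hypothetical helper: TRUE for rows in which every pattern matches
# somewhere in the row's text, ignoring case (terms are combined with AND).
match_all <- function(df, patterns) {
  row_text <- do.call(paste, c(df, sep = " "))
  hits <- lapply(patterns, grepl, x = row_text, ignore.case = TRUE)
  Reduce(`&`, hits)
}

two_terms <- rownames(toy)[match_all(toy, c("EpigenomeRoadMap", "E126"))]
narrower  <- rownames(toy)[match_all(toy, c("E126", "bigwig"))]
two_terms
narrower
```

Adding "bigwig" to the search terms drops the broadPeak record, in the same way that adding "H3K4ME2" to the query above reduces the hub to six records.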
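Hub resources can also be selected using $, subset(), and display(); see the main AnnotationHub vignette for additional detail.
Hub resources are imported as the appropriate Bioconductor object for use in further analysis. For example, peak files are returned as GRanges objects.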
## loading from cache
## require("rtracklayer")
peaks <- E126[['AH29817']]
## loading from cache
seqinfo(peaks)
## Seqinfo object with 298 sequences (2 circular) from hg19 genome:
##   seqnames       seqlengths isCircular genome
##   chr1            249250621      FALSE   hg19
##   chr2            243199373      FALSE   hg19
##   chr3            198022430      FALSE   hg19
##   chr4            191154276      FALSE   hg19
##   chr5            180915260      FALSE   hg19
##   ...                   ...        ...    ...
##   chrUn_gl000245      36651      FALSE   hg19
##   chrUn_gl000246      38154      FALSE   hg19
##   chrUn_gl000247      36422      FALSE   hg19
##   chrUn_gl000248      39786      FALSE   hg19
##   chrUn_gl000249      38502      FALSE   hg19
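BigWig files are returned as BigWigFile objects. A BigWigFile is a reference to a file on disk; the data in the file can be read in using rtracklayer::import(), perhaps querying these large files for particular genomic regions of interest as described on the help page ?import.bw.
Each record inside AnnotationHub is associated with a unique identifier. Most GRanges objects returned by AnnotationHub carry the unique AnnotationHub identifier of the resource from which the GRanges is derived. This comes in handy when one has been working with a GRanges object for a while and needs additional information about it (e.g., the name of the file in the cache, or the original sourceurl of the data underlying the resource).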
metadata(peaks)
## $AnnotationHubName
## [1] "AH29817"
##
## $`File Name`
## [1] "E126-H3K4me2.broadPeak.gz"
##
## $`Data Source`
## [1] "http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/E126-H3K4me2.broadPeak.gz"
##
## $Provider
## [1] "BroadInstitute"
##
## $Organism
## [1] "Homo sapiens"
##
## $`Taxonomy ID`
## [1] 9606
ah[metadata(peaks)$AnnotationHubName]$sourceurl
## [1] "http://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/E126-H3K4me2.broadPeak.gz"
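Bioconductor represents gene models using 'transcript' databases. These are available via packages such as TxDb.Hsapiens.UCSC.hg38.knownGene, or can be constructed using functions such as GenomicFeatures::makeTxDbFromBiomart().
AnnotationHub provides an easy way to work with gene models published by Ensembl. Let's see what Ensembl's Release-94 has in terms of data for the pufferfish, Takifugu rubripes.
query(ah, c("Takifugu", "release-94"))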
## AnnotationHub with 7 records
## # snapshotDate(): 2022-04-21
## # $dataprovider: Ensembl
## # $species: Takifugu rubripes
## # $rdataclass: TwoBitFile, GRanges
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## #   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH64856"]]'
##
##             title
##   AH64856 | Takifugu_rubripes.FUGU5.94.abinitio.gtf
##   AH64857 | Takifugu_rubripes.FUGU5.94.chr.gtf
##   AH64858 | Takifugu_rubripes.FUGU5.94.gtf
##   AH66114 | Takifugu_rubripes.FUGU5.cdna.all.2bit
##   AH66115 | Takifugu_rubripes.FUGU5.dna_rm.toplevel.2bit
##   AH66116 | Takifugu_rubripes.FUGU5.dna_sm.toplevel.2bit
##   AH66117 | Takifugu_rubripes.FUGU5.ncrna.2bit
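We see that there is a GTF file describing gene models, as well as various DNA sequences. Let's retrieve the GTF and the top-level DNA sequence files. The GTF file is imported as a GRanges instance, the DNA sequence as a TwoBitFile.
gtf <- ah[["AH64858"]]
## loading from cache
## Importing File into R ..
dna <- ah[["AH66116"]]
## loading from cache
head(gtf, 3)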
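## GRanges object with 3 ranges and 22 metadata columns:
##       seqnames        ranges strand |   source       type     score     phase            gene_id
##          <Rle>     <IRanges>  <Rle> | <factor>   <factor> <numeric> <integer>        <character>
##   [1]        1 217531-252954      + |  ensembl       gene        NA      <NA> ENSTRUG00000009922
##   [2]        1 217531-252954      + |  ensembl transcript        NA      <NA> ENSTRUG00000009922
##   [3]        1 217531-217702      + |  ensembl       exon        NA      <NA> ENSTRUG00000009922
##       gene_version   gene_name gene_source   gene_biotype      transcript_id transcript_version
##        <character> <character> <character>    <character>        <character>        <character>
##   [1]            2       sdk2b     ensembl protein_coding               <NA>               <NA>
##   [2]            2       sdk2b     ensembl protein_coding ENSTRUT00000025027                  2
##   [3]            2       sdk2b     ensembl protein_coding ENSTRUT00000025027                  2
##       transcript_name transcript_source transcript_biotype exon_number            exon_id
##           <character>       <character>        <character> <character>        <character>
##   [1]            <NA>              <NA>               <NA>        <NA>               <NA>
##   [2]       sdk2b-201           ensembl     protein_coding        <NA>               <NA>
##   [3]       sdk2b-201           ensembl     protein_coding           1 ENSTRUE00000325931
##       exon_version  protein_id protein_version projection_parent_gene projection_parent_transcript
##        <character> <character>     <character>            <character>                  <character>
##   [1]         <NA>        <NA>            <NA>                   <NA>                         <NA>
##   [2]         <NA>        <NA>            <NA>                   <NA>                         <NA>
##   [3]            1        <NA>            <NA>                   <NA>                         <NA>
##               tag
##       <character>
##   [1]        <NA>
##   [2]        <NA>
##   [3]        <NA>
##   -------
##   seqinfo: 1627 sequences (1 circular) from FUGU5 genome; no seqlengths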
dna
## resource: /home/biocbuild/.cache/R/AnnotationHub/de9225196f0c4_72862
head(seqlevels(dna))
## [1] "1" "2" "3" "4" "5" "6"
Let's identify the 25 longest DNA sequences, and keep just the annotations on these scaffolds.
keep <- names(tail(sort(seqlengths(dna)), 25))
gtf_subset <- gtf[seqnames(gtf) %in% keep]
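Subsetting by sequence name keeps the selected ranges but leaves the unused sequence levels attached to the object. The sketch below, on hypothetical toy data and assuming the GenomicRanges and GenomeInfoDb packages are installed, compares that approach with keepSeqlevels(), which also prunes the levels. The names toy_lengths, toy_gtf and keep_toy are stand-ins for seqlengths(dna), gtf and the vector of retained scaffold names.

```r
library(GenomicRanges)
library(GenomeInfoDb)   # keepSeqlevels()

# Hypothetical stand-ins for seqlengths(dna) and the imported GTF ranges.
toy_lengths <- c(s1 = 500L, s2 = 100L, s3 = 900L, s4 = 50L)
toy_gtf <- GRanges(c("s1", "s2", "s3", "s4", "s3"), IRanges(1:5, width = 10))

# Names of the two longest sequences, in increasing order of length.
keep_toy <- names(tail(sort(toy_lengths), 2))

# Approach used in the text: logical subsetting; all four seqlevels remain.
by_index <- toy_gtf[seqnames(toy_gtf) %in% keep_toy]

# Alternative: drop ranges on other sequences and prune the seqlevels as well.
pruned <- keepSeqlevels(toy_gtf, keep_toy, pruning.mode = "coarse")

seqlevels(by_index)
seqlevels(pruned)
```

Both objects hold the same three ranges; they differ only in the sequence levels they still advertise, which can matter when objects are later compared or combined.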
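It is straightforward to make a TxDb instance from this subset (or from the entire gtf).
library(GenomicFeatures)         # for makeTxDbFromGRanges
txdb <- makeTxDbFromGRanges(gtf_subset)
## Warning in .get_cds_IDX(mcols0$type, mcols0$phase): The "phase" metadata column contains non-NA values for features of type
##   stop_codon. This information was ignored.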
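The TxDb can then be used in conjunction with the DNA sequences, e.g., to find the exon sequences of all annotated genes.
library(Rsamtools)               # for getSeq,FaFile-method
exons <- exons(txdb)
length(exons)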
## [1] 178769
getSeq(dna, exons)
## DNAStringSet object of length 178769:
##          width seq
##      [1]   172 CGATACGGCGCGCTCCGTTTGCCTCCGCCCCCCCCGTGGCG...
##      [2]    28 TTGGGATTATTATTCTCACACGTGATCGGT
##      [3]   160 ...
##      [4]   107 GTACGTGATCCCGTCTTTGGACCGCTCCCACGCCGGATTCT...GGGCGCCCTGCTGCAGAGACGCACCGAAGTCCAGGTGGT
##      [5]   148 TTATGGGAAGCTTCGAGGAGGGCCCCAGTCCGTC...TGGTACCGGGGGGGGACGCAAGATTCCCCCGAGCAGCCGCAT
##      ...   ... ...
## [178765]    54 ATGCCCTCAATTACACTACCGCAGAAGGAGAACGCTCTCTTCAAAAGAATATTG
## [178766]   863 CTCTTGGTGAGGGGAGAGAGATGATATCCAGTG...GTGATATAAGTTTTAGAGAGAGCCCCATAGGCTGATGTAG
## [178767]   270 TTTGTGCAATGGGTGGCACCAGCAGCACCAGCAGGTTGTTGTTTTT...CCCGTCTATCCGGATCATGCAGTGGAACATACTGGCACAAG
## [178768]   982 CAGTTGTACAGAAATCGTTGGAGCAGACCTGGAGCTGTTG...CCCGTCTATCCGGATCATGCAGTGGAACATACTGGCACAAG
## [178769]   627 GGGGGAGATTCCGATGGGGTATATTTAAAAAGTTGAAACT...
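There is a one-to-one mapping between the genomic ranges contained in exons and the DNA sequences returned by getSeq().
Some difficulties arise when working with this partly assembled genome, and they require more advanced GenomicRanges skills; see the GenomicRanges vignettes, especially "GenomicRangesHOWTOs" and "An Introduction to GenomicRanges".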
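Suppose we wanted to lift features from one genome build to another, e.g., because annotations were generated for hg19 but our experimental analysis used hg18. We know that UCSC provides 'liftover' files for mapping between genome builds.
In this example, we will take our broadPeak GRanges from E126, which comes from the 'hg19' genome, and lift these features over to their 'hg38' coordinates.
chainfiles <- query(ah, c("hg38", "hg19", "chainfile"))
chainfiles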
## AnnotationHub with 4 records
## # snapshotDate(): 2022-04-21
## # $dataprovider: UCSC, NCBI
## # $species: Homo sapiens
## # $rdataclass: ChainFile
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based, maintainer,
## #   rdatadateadded, preparerclass, tags, rdatapath, sourceurl, sourcetype
## # retrieve records with, e.g., 'object[["AH14108"]]'
##
##             title
##   AH14108 | hg38ToHg19.over.chain.gz
##   AH14150 | hg19ToHg38.over.chain.gz
##   AH78916 | Chain file for Homo sapiens rRNA hg19 to hg19
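We are interested in the file that lifts features over from hg19 to hg38, so let's download that using
## loading from cache
chain <- chainfiles[['AH14150']]
## loading from cache
chain
## Chain of length 25
## names(25): chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 ... chr18 chr19 chr20 chr21 chr22 chrX chrY chrM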
Perform the liftOver operation using rtracklayer::liftOver():
library(rtracklayer)
gr38 <- liftOver(peaks, chain)
This returns a GRangesList; update the genome of the result to get the final result
genome(gr38) <- "hg38"
gr38
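## GRangesList object of length 153266:
## [[1]]
## GRanges object with 1 range and 5 metadata columns:
##       seqnames            ranges strand |        name     score signalValue    pValue    qValue
##          <Rle>         <IRanges>  <Rle> | <character> <numeric>   <numeric> <numeric> <numeric>
##   [1]     chr1 28667912-28670147      * |      Rank_1       189     10.5585   22.0132   18.9991
##   -------
##   seqinfo: 23 sequences from hg38 genome; no seqlengths
##
## [[2]]
## GRanges object with 1 range and 5 metadata columns:
##       seqnames            ranges strand |        name     score signalValue    pValue    qValue
##          <Rle>         <IRanges>  <Rle> | <character> <numeric>   <numeric> <numeric> <numeric>
##   [1]     chr4 54090990-54092984      * |      Rank_2       188     8.11483   21.8044   18.8066
##   -------
##   seqinfo: 23 sequences from hg38 genome; no seqlengths
##
## [[3]]
## GRanges object with 1 range and 5 metadata columns:
##       seqnames            ranges strand |        name     score signalValue    pValue    qValue
##          <Rle>         <IRanges>  <Rle> | <character> <numeric>   <numeric> <numeric> <numeric>
##   [1]    chr14 75293392-75296621      * |      Rank_3       180     8.89834   20.9771   18.0282
##   -------
##   seqinfo: 23 sequences from hg38 genome; no seqlengths
##
## ...
## <153263 more elements>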
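The result is a list because a single input range may map to zero, one, or several ranges in the target build. The following sketch, on a hypothetical toy GRangesList standing in for gr38 and assuming the GenomicRanges package is installed, shows one way to tabulate how each element mapped and to flatten the one-to-one cases into an ordinary GRanges. The name toy_lifted and its coordinates are made up for illustration.

```r
library(GenomicRanges)

# Hypothetical stand-in for the liftOver() result: one list element per input range.
toy_lifted <- GRangesList(
  GRanges("chr1", IRanges(28667912, 28670147)),       # mapped to a single range
  GRanges(),                                          # could not be mapped
  GRanges("chr4", IRanges(c(100, 300), width = 50))   # split into two ranges
)

# Number of target ranges obtained for each input range.
n_mapped <- lengths(toy_lifted)
table(n_mapped)

# Keep only the inputs that mapped to exactly one range, as a flat GRanges.
one_to_one <- unlist(toy_lifted[n_mapped == 1L])
one_to_one
```

Whether to drop, keep, or merge the split ranges depends on the downstream analysis; the sketch only makes the three cases visible.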
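One may also be interested in working with common germline variants with evidence of medical interest. This information is available at NCBI.
Query the dbSNP files in the hub:
This returns a VcfFile, which can be read in using the VariantAnnotation package; because VCF files can be large, readVcf() supports several strategies for importing only the relevant parts of the file (e.g., particular genomic locations, particular features of the variants), see ?readVcf for additional information.
# 'vcf' is the VcfFile retrieved from the hub in the previous step
variants <- readVcf(vcf, genome="hg19")
variants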
## class: CollapsedVCF
## dim: 111138 0
## rowRanges(vcf):
##   GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
##   DataFrame with 58 columns: RS, RSPOS, RV, VP, GENEINFO, dbSNPBuildID, SAO, SSR, WGT, VC, PM, T...
## info(header(vcf)):
##                 Number Type    Description
##    RS           1      Integer dbSNP ID (i.e. rs number)
##    RSPOS        1      Integer Chr position in dbSNP
##    RV           0      Flag    RS orientation is reversed
##    VP           1      String  Variation Property.  Documentation is at ftp://ftp.ncbi.nlm.nih.g...
##    GENEINFO     1      String  Pairs each of gene symbol:gene id.  The gene symbol and id are de...
##    dbSNPBuildID 1      Integer First dbSNP Build for RS
##    SAO          1      Integer Variant Allele Origin: 0 - unspecified, 1 - Germline, 2 - Somatic...
##    SSR          1      Integer Variant Suspect Reason Codes (may be more than one value added to...
##    WGT          1      Integer Weight, 00 - unmapped, 1 - weight 1, 2 - weight 2, 3 - weight 3 o...
##    VC           1      String  Variation Class
##    PM           0      Flag    Variant is Precious(Clinical,Pubmed Cited)
##    TPA          0      Flag    Provisional Third Party Annotation(TPA) (currently rs from PHARMG...
##    PMC          0      Flag    Links exist to PubMed Central article
##    S3D          0      Flag    Has 3D structure - SNP3D table
##    SLO          0      Flag    Has SubmitterLinkOut - From SNP->SubSNP->Batch.link_out
##    NSF          0      Flag    Has non-synonymous frameshift A coding region variation where one...
##    NSM          0      Flag    Has non-synonymous missense A coding region variation where one a...
##    NSN          0      Flag    Has non-synonymous nonsense A coding region variation where one a...
##    REF          0      Flag    Has reference A coding region variation where one allele in the s...
##    SYN          0      Flag    Has synonymous A coding region variation where one allele in the ...
##    U3           0      Flag    In 3' UTR Location is in an untranslated region (UTR). FxnCode = 53
##    U5           0      Flag    In 5' UTR Location is in an untranslated region (UTR). FxnCode = 55
##    ASS          0      Flag    In acceptor splice site FxnCode = 73
##    DSS          0      Flag    In donor splice-site FxnCode = 75
##    INT          0      Flag    In Intron FxnCode = 6
##    R3           0      Flag    In 3' gene region FxnCode = 13
##    R5           0      Flag    In 5' gene region FxnCode = 15
##    OTH          0      Flag    Has other variant with exactly the same set of mapped positions o...
##    CFL          0      Flag    Has Assembly conflict. This is for weight 1 and 2 variant that ma...
##    ASP          0      Flag    Is Assembly specific. This is set if the variant only maps to one...
##    MUT          0      Flag    Is mutation (journal citation, explicit fact): a low frequency va...
##    VLD          0      Flag    Is Validated.  This bit is set if the variant has 2+ minor allele...
##    G5A          0      Flag    >5% minor allele frequency in each and all populations
##    G5           0      Flag    >5% minor allele frequency in 1+ populations
##    HD           0      Flag    Marker is on high density genotyping kit (50K density or greater)...
##    GNO          0      Flag    Genotypes available. The variant has individual genotype (in SubI...
##    KGPhase1     0      Flag    1000 Genome phase 1 (incl. June Interim phase 1)
##    KGPhase3     0      Flag    1000 Genome phase 3
##    CDA          0      Flag    Variation is interrogated in a clinical diagnostic assay
##    LSD          0      Flag    Submitted from a locus-specific database
##    MTP          0      Flag    Microattribution/third-party annotation(TPA:GWAS,PAGE)
##    OM           0      Flag    Has OMIM/OMIA
##    NOC          0      Flag    Contig allele not present in variant allele list. The reference s...
##    WTD          0      Flag    Is Withdrawn by submitter If one member ss is withdrawn by submit...
##    NOV          0      Flag    Rs cluster has non-overlapping allele sets. True when rs set has ...
##    CAF          .      String  An ordered, comma delimited list of allele frequencies based on 1...
##    COMMON       1      Integer RS is a common SNP.  A common SNP is one that has at least one 10...
##    CLNHGVS      .      String  Variant names from HGVS.    The order of these variants correspon...
##    CLNALLE      .      Integer Variant alleles from REF or ALT columns.  0 is REF, 1 is the firs...
##    CLNSRC       .      String  Variant Clinical Chanels
##    CLNORIGIN    .      String  Allele Origin. One or more of the following values may be added: ...
##    CLNSRCID     .      String  Variant Clinical Channel IDs
##    CLNSIG       .      String  Variant Clinical Significance, 0 - Uncertain significance, 1 - no...
##    CLNDSDB      .      String  Variant disease database name
##    CLNDSDBID    .      String  Variant disease database ID
##    CLNDBN       .      String  Variant disease name
##    CLNREVSTAT   .      String  no_assertion - No assertion provided, no_criteria - No assertion ...
##    CLNACC       .      String  Variant Accession and Versions
## geno(vcf):
##   List of length 0:
rowRanges() returns information from the CHROM, POS and ID fields of the VCF file, represented as a GRanges instance
rowRanges(variants)
## GRanges object with 111138 ranges and 5 metadata columns:
##               seqnames    ranges strand | paramRangeID            REF                ALT      QUAL
##                  <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet> <DNAStringSetList> <numeric>
##   rs786201005        1   1014143      * |           NA              C                  T        NA
##   rs672601345        1   1014316      * |           NA              C                 CG        NA
##   rs672601312        1   1014359      * |           NA              G                  T        NA
##   rs115173026        1   1020217      * |           NA              G                  T        NA
##   rs201073369        1   1020239      * |           NA              G                  C        NA
##           ...      ...       ...    ... .          ...            ...                ...       ...
##   rs527236200       MT     15943      * |           NA              T                  C        NA
##   rs118203890       MT     15950      * |           NA              G                  A        NA
##   rs199474700       MT     15965      * |           NA              A                  G        NA
##   rs199474701       MT     15967      * |           NA              G                  A        NA
##   rs199474699       MT     15990      * |           NA              C                  T        NA
##                    FILTER
##               <character>
##   rs786201005           .
##   rs672601345           .
##   rs672601312           .
##   rs115173026           .
##   rs201073369           .
##           ...         ...
##   rs527236200           .
##   rs118203890           .
##   rs199474700           .
##   rs199474701           .
##   rs199474699           .
##   -------
##   seqinfo: 25 sequences from hg19 genome; no seqlengths
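Note that the broadPeaks files follow the UCSC chromosome naming convention, while the vcf data follow the NCBI style of chromosome naming. To bring these ranges into the same chromosome naming convention (i.e., UCSC), we would use
seqlevelsStyle(variants) <- seqlevelsStyle(peaks)
And finally, to find which variants overlap these broadPeaks, we would use:
overlap <- findOverlaps(variants, peaks)
## Warning in .Seqinfo.mergexy(x, y): The 2 combined objects have no sequence levels in common. (Use
##   suppressWarnings() to suppress this warning.)
overlap
## Hits object with 0 hits and 0 metadata columns:
##    queryHits subjectHits
##    <integer>   <integer>
##   -------
##   queryLength: 111138 / subjectLength: 153266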
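The warning above reports that the two objects share no sequence levels, and in that situation findOverlaps() cannot return any hits. The sketch below, on hypothetical toy ranges and assuming the GenomicRanges and GenomeInfoDb packages are installed, shows how one might check the shared sequence levels before and after harmonising the naming style. The names toy_variants and toy_peaks are stand-ins for variants and peaks, and the variant position is made up.

```r
library(GenomicRanges)
library(GenomeInfoDb)   # seqlevelsStyle()

# Hypothetical stand-ins: an NCBI-style variant ("22") and a UCSC-style peak ("chr22").
toy_variants <- GRanges("22", IRanges(50623000, width = 1))
toy_peaks <- GRanges("chr22", IRanges(50622494, 50626143))

# With no sequence level in common, no overlap can be reported.
shared_before <- intersect(seqlevels(toy_variants), seqlevels(toy_peaks))
hits_before <- suppressWarnings(findOverlaps(toy_variants, toy_peaks))

# Rename the variant sequence levels to the UCSC style, then repeat the check.
seqlevelsStyle(toy_variants) <- "UCSC"
shared_after <- intersect(seqlevels(toy_variants), seqlevels(toy_peaks))
hits_after <- findOverlaps(toy_variants, toy_peaks)

shared_before
shared_after
length(hits_before)
length(hits_after)
```

An empty intersect() result is a quick signal that the renaming step did not take effect and that a zero-hit result should not be read as a biological finding.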
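Some insight into how these results can be interpreted comes from looking at a particular peak, e.g., the 3852nd peak
idx <- subjectHits(overlap) == 3852
overlap[idx]
## Hits object with 0 hits and 0 metadata columns:
##    queryHits subjectHits
##    <integer>   <integer>
##   -------
##   queryLength: 111138 / subjectLength: 153266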
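Three variants are expected to overlap this peak; in the run shown here, however, no hits are returned, consistent with the warning above that the two objects share no sequence levels. The coordinates of the peak and of the overlapping variants are
peaks[3852]
## GRanges object with 1 range and 5 metadata columns:
##       seqnames            ranges strand |        name     score signalValue    pValue    qValue
##          <Rle>         <IRanges>  <Rle> | <character> <numeric>   <numeric> <numeric> <numeric>
##   [1]    chr22 50622494-50626143      * |   Rank_3852        79     6.06768   10.1894   7.99818
##   -------
##   seqinfo: 298 sequences (2 circular) from hg19 genome
rowRanges(variants)[queryHits(overlap[idx])]
## GRanges object with 0 ranges and 5 metadata columns:
##    seqnames    ranges strand | paramRangeID            REF                ALT      QUAL      FILTER
##       <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet> <DNAStringSetList> <numeric> <character>
##   -------
##   seqinfo: 25 sequences from hg19 genome; no seqlengths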
sessionInfo()
## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB
##  [4] LC_COLLATE=C               LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
##  [1] BSgenome.Hsapiens.UCSC.hg19_1.4.3 BSgenome_1.64.0
##  [3] rtracklayer_1.56.0                VariantAnnotation_1.42.0
##  [5] SummarizedExperiment_1.26.0       MatrixGenerics_1.8.0
##  [7] matrixStats_0.62.0                Rsamtools_2.12.0
##  [9] Biostrings_2.64.0                 XVector_0.36.0
## [11] GenomicFeatures_1.48.0            AnnotationDbi_1.58.0
## [13] Biobase_2.56.0                    GenomicRanges_1.48.0
## [15] GenomeInfoDb_1.32.0               IRanges_2.30.0
## [17] S4Vectors_0.34.0                  AnnotationHub_3.4.0
## [19] BiocFileCache_2.4.0               dbplyr_2.1.1
## [21] BiocGenerics_0.42.0               BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7           bit64_4.0.5              filelock_1.0.2
##  [7] bslib_1.2.2            utf8_1.2.2               R6_2.5.1
## [10] DBI_1.1.2              withr_2.5.0              tidyselect_1.1.2
## [13] prettyunits_1.1.1      bit_4.0.4                curl_4.3.2
## [13] DelayedArray_0.22.0    bookdown_0.26            sass_0.4.1
## [22] rappdirs_0.3.3         stringr_1.4.0            digest_0.6.29
## [25] rmarkdown_2.14         pkgconfig_2.0.3          htmltools_0.5.2
## [28] fastmap_1.1.0          rlang_1.0.2              RSQLite_2.2.12
## [31] shiny_1.7.1            jquerylib_0.1.4          BiocIO_1.6.0
## [34] generics_0.1.2         jsonlite_1.8.0           BiocParallel_1.30.0
## [37] GenomeInfoDbData_1.2.8 Matrix_1.4-1             Rcpp_1.0.8.3
## [43] fansi_1.0.3            lifecycle_1.0.1          stringi_1.7.6
## [46] yaml_2.3.5             zlibbioc_1.42.0          grid_4.2.0
## [49] blob_1.2.3             parallel_4.2.0           promises_1.2.0.1
## [55] KEGGREST_1.36.0        knitr_1.38               pillar_1.7.0
## [58] rjson_0.2.21           biomaRt_2.52.0           XML_3.99-0.9
## [61] glue_1.6.2             BiocVersion_3.15.2       evaluate_0.15
## [64] BiocManager_1.30.17    png_0.1-7                vctrs_0.4.1
## [67] httpuv_1.6.5           purrr_0.3.4              assertthat_0.2.1
## [70] cachem_1.0.6           xfun_0.30                mime_0.12
## [73] xtable_1.8-4           restfulr_0.0.13          later_1.3.0
## [76] tibble_3.1.6           GenomicAlignments_1.32.0 memoise_2.0.1
## [79] ellipsis_0.3.2         interactiveDisplayBase_1.34.0