nanotatoR: Structural Variant Annotation and Classification

Surajit Bhattacharya, Hayk Barseghyan, Emmanuele C Delot and Eric Vilain

# Introduction

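Short-read sequencing (SRS) is the predominant DNA sequencing technique used in clinical diagnostics. It uses flow cells covered with millions of surface-bound oligonucleotides, which allow hundreds of millions of independent short reads to be sequenced in parallel. However, because the average read length is approximately 150 bp, larger structural variants (SVs) and copy number variants (CNVs) are challenging to observe. This creates a diagnostic gap between clinical phenotypes and the underlying genetic mechanisms. Newer technologies such as optical genome mapping and long-read sequencing have partially solved the problem of SV and CNV detection; however, identifying the pathogenic variants among the thousands of SVs/CNVs called across a genome remains challenging, because the analysis pipelines available for single nucleotide variants are not applicable to SV analysis. We have therefore built an R-based annotation package, "nanotatoR", specifically for structural variants. It provides a large number of critical functional annotations for large genomic variants.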

# Package Installation

nanotatoR is currently available from the GitHub repository. It can be installed as follows:

    if (!requireNamespace("BiocManager", quietly = TRUE))
        install.packages("BiocManager")
    BiocManager::install("nanotatoR", version = "3.8")
    library("nanotatoR")

nanotatoR requires R version ≥ 3.6.

# Package Functionalities

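Given a list of structural variants containing the chromosome number, SV type and variant start/end positions, nanotatoR can perform the following functions (both GRCh37 and GRCh38 are supported):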

## Structural variant frequency

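Genomic variant frequency is one of the most important filtering criteria for identifying rare, possibly pathogenic, variants. nanotatoR uses the Database of Genomic Variants (DGV) and DECIPHER, both publicly available reference control databases, to estimate structural variant frequency in the general population. Compared with traditional single nucleotide variant frequency calculations, estimating structural variant frequency is more difficult because breakpoint variability between "same" structural variants is much higher.

To provide accurate population frequency estimates, nanotatoR recognizes five categories of SVs: insertions, deletions, duplications, inversions and translocations (for example, a "gain" of DNA material is considered an "insertion/duplication", and a "loss" of DNA material is considered a "deletion"). For two SVs to be considered the "same", nanotatoR by default checks that they belong to the same category (e.g. deletions), that their size similarity is greater than 50%, and that the distance between the SV breakpoints (start and end positions) is within 10 kbp. The 50% size similarity is currently not applied to duplications, inversions and translocations, because most SV callers do not compute sizes for these variant types; instead, a breakpoint cut-off of 50 kbp is applied to identify matching variants (note: the breakpoint cut-off for duplications is 10 kbp). The default breakpoint cut-offs and percentage similarity were determined by Bionano Genomics (see References).

To estimate SV frequency, nanotatoR requires the following input files: the DGV file (hgpath) or the DECIPHER file (decipherpath), and an "smap" file (smappath) containing the structural variant information generated by the Bionano Genomics Solve/Tools pipeline for optical-mapping-based SV calls. The default input parameters for SV breakpoints and percentage similarity are: 10,000 bases for insertions/duplications/deletions (win_indel), 50,000 bases for inversions and translocations (win_inv_trans), and 0.5 (i.e. 50%) for percentage similarity (perc_similarity). The output of this function can be of two types: an R object (dataFrame) or a text file (Text).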

    hgpath = system.file("extdata", "GRCh37_hg19_variants_2016-05-15.txt", package = "nanotatoR")
    smappath = system.file("extdata", "GM24385_Ason_DLE1_VAP_trio5.smap", package = "nanotatoR")
    # Estimate DGV frequencies for every SV in the smap file
    datDGV <- DGVfrequency(hgpath = hgpath,
        smap = smappath,
        EnzymeType = "SE",
        win_indel_DGV = 10000,
        win_inv_trans_DGV = 50000,
        input_fmt_SV = "Text",
        perc_similarity_DGV = 0.5,
        returnMethod = "dataFrame")
    ## [1] "Chrom:1"
    datDGV[1, ]
    ##       SampleID SmapEntryID QryContigID RefcontigID1 RefcontigID2 QryStartPos
    ## 1 GM24385_Ason         343         302            1            1    19362181
    ##   QryEndPos RefStartPos RefEndPos Confidence     Type XmapID1 XmapID2 LinkID
    ## 1  19371069   103038897 103048124         -1 deletion      21      21     -1
    ##   QryStartIdx QryEndIdx RefStartIdx RefEndIdx   Zygosity Genotype GenotypeGroup
    ## 1        3763      3764       19515     19516 homozygous        1           143
    ##   RawConfidence RawConfidenceLeft RawConfidenceRight RawConfidenceCenter SVsize
    ## 1          1.85           4609.43            1541.23                1.85  339.6
    ##   SVfreq orientation                      Sample           Algorithm Size
    ## 1  0.514          NA GM24385_Ason_DLE1_assembly5 assembly_comparison  340
    ##   Present_in_._of_BNG_control_samples
    ## 1                                22.5
    ##   Present_in_._of_BNG_control_samples_with_the_same_enzyme
    ## 1                                                     80.7
    ##   Fail_assembly_chimeric_score OverlapGenes NearestNonOverlapGene
    ## 1               not_applicable            -               COL11A1
    ##   NearestNonOverlapGeneDistance PutativeGeneFusion Found_in_parents_assemblies
    ## 1                        293899                  -                        both
    ##   Found_in_parents_molecules Found_in_self_molecules Mother_molecule_count
    ## 1                       both                     yes                    17
    ##   Father_molecule_count Self_molecule_count DGV_Count DGV_Freq_Perc
    ## 1                    19                  31         0             0

    decipherpath = system.file("extdata", "population_cnv.txt", package = "nanotatoR")
    smappath = system.file("extdata", "GM24385_Ason_DLE1_VAP_trio5.smap", package = "nanotatoR")
    # Estimate DECIPHER frequencies for the same smap file
    datdecipher <- Decipherfrequency(decipherpath = decipherpath,
        smap = smappath,
        win_indel = 10000,
        perc_similarity = 0.5,
        returnMethod = "dataFrame",
        EnzymeType = "SE",
        input_fmt_SV = "Text")
    ## [1] "###Calculating Decipher Frequency###"
    datdecipher[1, ]
    ##       SampleID SmapEntryID QryContigID RefcontigID1 RefcontigID2 QryStartPos
    ## 1 GM24385_Ason         343         302            1            1    19362181
    ##   QryEndPos RefStartPos RefEndPos Confidence     Type XmapID1 XmapID2 LinkID
    ## 1  19371069   103038897 103048124         -1 deletion      21      21     -1
    ##   QryStartIdx QryEndIdx RefStartIdx RefEndIdx   Zygosity Genotype GenotypeGroup
    ## 1        3763      3764       19515     19516 homozygous        1           143
    ##   RawConfidence RawConfidenceLeft RawConfidenceRight RawConfidenceCenter SVsize
    ## 1          1.85           4609.43            1541.23                1.85  339.6
    ##   SVfreq orientation                      Sample           Algorithm Size
    ## 1  0.514          NA GM24385_Ason_DLE1_assembly5 assembly_comparison  340
    ##   Present_in_._of_BNG_control_samples
    ## 1                                22.5
    ##   Present_in_._of_BNG_control_samples_with_the_same_enzyme
    ## 1                                                     80.7
    ##   Fail_assembly_chimeric_score OverlapGenes NearestNonOverlapGene
    ## 1               not_applicable            -               COL11A1
    ##   NearestNonOverlapGeneDistance PutativeGeneFusion Found_in_parents_assemblies
    ## 1                        293899                  -                        both
    ##   Found_in_parents_molecules Found_in_self_molecules Mother_molecule_count
    ## 1                       both                     yes                    17
    ##   Father_molecule_count Self_molecule_count DECIPHER_Frequency
    ## 1                    19                  31                  0
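
To make the default matching rule from this section concrete, here is a minimal base R sketch. It is not nanotatoR's implementation: `same_sv()` and the list fields `type`, `start`, `end` and `size` are hypothetical names, and size similarity is assumed to be the ratio of the smaller to the larger SV size.

```r
# Sketch of the default "same SV" rule described in the text.
# same_sv() and its field names are hypothetical, not part of nanotatoR.
same_sv <- function(query, ref, win_indel = 10000, win_inv_trans = 50000,
                    perc_similarity = 0.5) {
  # Both variants must belong to the same category (e.g. both deletions).
  if (query$type != ref$type) return(FALSE)

  # Insertions, deletions and duplications use the 10 kbp window;
  # inversions and translocations use the 50 kbp window.
  window <- if (query$type %in% c("insertion", "deletion", "duplication")) {
    win_indel
  } else {
    win_inv_trans
  }
  breakpoints_close <- abs(query$start - ref$start) <= window &&
    abs(query$end - ref$end) <= window
  if (!breakpoints_close) return(FALSE)

  # Size similarity is only applied to insertions and deletions.
  # Assumption: similarity = smaller size / larger size.
  if (query$type %in% c("insertion", "deletion")) {
    similarity <- min(query$size, ref$size) / max(query$size, ref$size)
    return(similarity > perc_similarity)
  }
  TRUE
}

query <- list(type = "deletion", start = 103038897, end = 103048124, size = 340)
hit   <- list(type = "deletion", start = 103040000, end = 103049000, size = 500)
miss  <- list(type = "deletion", start = 103040000, end = 103049000, size = 5000)

same_sv(query, hit)   # TRUE: breakpoints within 10 kbp, similarity 340/500 = 0.68
same_sv(query, miss)  # FALSE: similarity 340/5000 = 0.068
```

The database entries `hit` and `miss` are invented; only the query coordinates come from the example row above.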

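In addition to DGV and DECIPHER, nanotatoR can also take as input a variant dataset of 204 samples created by Bionano Genomics. The input parameters are the same as for DECIPHER, with the following additions: confidence score thresholds for insertions and deletions (indelconf; default 0.5), inversions (invconf; default 0.01) and translocations (transconf; default 0.1), as determined by Bionano Genomics. Note: the Bionano Genomics reference database can be downloaded with wget http://bnxinstall.com/solve/Solve3.3_10252018.tar.gz and then unpacked with tar -xvzf Solve3.3_10252018.tar.gz. The database files are in the config folder, which can be found at $PWD/Solve3.3_10252018/VariantAnnotation/10252018/config/. Reference files are named by variant type and reference genome. For example, current_ctrl_dup_hg19_anonymize.txt is the duplication reference variant file for the hg19 reference genome.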

    path <- system.file("extdata", "Bionano_config/", package = "nanotatoR")
    pattern <- "*_hg19_*"
    smapName = "GM24385_Ason_DLE1_VAP_trio5.smap"
    smap = system.file("extdata", smapName, package = "nanotatoR")
    # outpath and fname (output location and file name) are passed
    # positionally, as in the original example
    datbndb <- BNDBfrequency(smap = smap,
        buildBNInternalDB = TRUE,
        input_fmt_SV = "Text",
        dbOutput = "dataframe",
        BNDBpath = path,
        BNDBpattern = pattern,
        outpath, fname,
        win_indel = 10000,
        win_inv_trans = 50000,
        perc_similarity = 0.5,
        indelconf = 0.5,
        invconf = 0.01,
        limsize = 1000,
        transconf = 0.1,
        returnMethod = c("dataFrame"),
        EnzymeType = c("SE"))
    ## [1] "###Cohort Frequency Calculation###"
    ## [1] "###Building Database###"
    ## [1] "current_ctrl_dup_hg19_anonymize.txt"
    ## [1] 198  10
    ## [1] "current_ctrl_ins_del_hg19_anonymize.txt"
    ## [1] 200  10
    ## [1] "current_ctrl_inv_hg19_anonymize.txt"
    ## [1] 200  11
    ## [1] "current_ctrl_trans_hg19_anonymize.txt"
    ## [1] 199
    ## [1] "Chrom:1"
    datbndb[1, ]
    ##       SampleID SmapEntryID QryContigID RefcontigID1 RefcontigID2 QryStartPos
    ## 1 GM24385_Ason         343         302            1            1    19362181
    ##   QryEndPos RefStartPos RefEndPos Confidence     Type XmapID1 XmapID2 LinkID
    ## 1  19371069   103038897 103048124         -1 deletion      21      21     -1
    ##   QryStartIdx QryEndIdx RefStartIdx RefEndIdx   Zygosity Genotype GenotypeGroup
    ## 1        3763      3764       19515     19516 homozygous        1           143
    ##   RawConfidence RawConfidenceLeft RawConfidenceRight RawConfidenceCenter SVsize
    ## 1          1.85           4609.43            1541.23                1.85  339.6
    ##   SVfreq orientation                      Sample           Algorithm Size
    ## 1  0.514          NA GM24385_Ason_DLE1_assembly5 assembly_comparison  340
    ##   Present_in_._of_BNG_control_samples
    ## 1                                22.5
    ##   Present_in_._of_BNG_control_samples_with_the_same_enzyme
    ## 1                                                     80.7
    ##   Fail_assembly_chimeric_score OverlapGenes NearestNonOverlapGene
    ## 1               not_applicable            -               COL11A1
    ##   NearestNonOverlapGeneDistance PutativeGeneFusion Found_in_parents_assemblies
    ## 1                        293899                  -                        both
    ##   Found_in_parents_molecules Found_in_self_molecules Mother_molecule_count
    ## 1                       both                     yes                    17
    ##   Father_molecule_count Self_molecule_count BNG_Freq_Perc_Filtered
    ## 1                    19                  31                      0
    ##   BNG_Freq_Perc_UnFiltered BNG_Homozygotes
    ## 1                        0               0
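
The type-specific confidence thresholds (indelconf, invconf, transconf) can be pictured as a simple filter over the control variants. The sketch below is an illustration only: `passes_confidence()` is a hypothetical helper, the toy `controls` table is invented, and it assumes that a variant passes when its confidence is greater than or equal to the threshold for its type and that types without a threshold are kept unchanged.

```r
# Hypothetical helper: does each control variant pass the confidence
# threshold for its SV type? (Not a nanotatoR function.)
passes_confidence <- function(type, confidence,
                              indelconf = 0.5, invconf = 0.01, transconf = 0.1) {
  threshold <- ifelse(type %in% c("insertion", "deletion"), indelconf,
               ifelse(type == "inversion", invconf,
               ifelse(type == "translocation", transconf, NA_real_)))
  # Assumption: types without a threshold (e.g. duplications) are kept.
  is.na(threshold) | confidence >= threshold
}

# Toy control variants (invented values).
controls <- data.frame(
  type = c("deletion", "insertion", "inversion", "translocation", "duplication"),
  confidence = c(0.9, 0.2, 0.05, 0.05, -1),
  stringsAsFactors = FALSE
)
controls$pass <- passes_confidence(controls$type, controls$confidence)
controls
```

Under these assumptions the deletion, inversion and duplication rows pass, while the insertion (0.2 < 0.5) and translocation (0.05 < 0.1) rows are dropped. Presumably the "Filtered" and "UnFiltered" frequency columns in the output above differ in whether such a filter was applied before counting.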

## Cohort Analysis

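Cohort analysis provides internal variant frequencies within the cohort, together with parental zygosity. The function first merges all available individual smaps to form an "internal reference" (buildSVInternalDB = TRUE) or, if an internal SV database is already available, uses the existing file (buildSVInternalDB = FALSE). The function requires the query smap file, the path to the merged internal database file, and the following parameters: confidence score thresholds for insertions and deletions (indelconf; default 0.5), inversions (invconf; default 0.01) and translocations (transconf; default 0.1), as determined by Bionano Genomics. The output can be an R object (dataFrame) or a text file (Text).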

    smapName = "GM24385_Ason_DLE1_VAP_trio5.smap"
    smap = system.file("extdata", smapName, package = "nanotatoR")
    indelconf = 0.5; invconf = 0.01; transconf = 0.1; input_fmt = "Text"
    # Defined up front because it is also used to build the indexfile path
    mergedKeyFname = "Sample_index.csv"
    datInf <- internalFrequencyTrio_Duo(smap = smap,
        buildSVInternalDB = TRUE,
        win_indel = 10000,
        win_inv_trans = 50000,
        labelType = c("SE"),
        SE_path = system.file("extdata", "SoloFile/", package = "nanotatoR"),
        SE_pattern = "*_DLE1_*",
        Samplecodes = system.file("extdata", "nanotatoR_sample_codes.csv", package = "nanotatoR"),
        mergeKey = system.file("extdata", "nanotatoR_control_sample_codes.csv", package = "nanotatoR"),
        mergedKeyoutpath = system.file("extdata", package = "nanotatoR"),
        mergedKeyFname = mergedKeyFname,
        EnzymeType = "SE",
        indexfile = system.file("extdata", mergedKeyFname, package = "nanotatoR"),
        perc_similarity = 0.5,
        indelconf = 0.5,
        invconf = 0.01,
        transconf = 0.1,
        limsize = 1000,
        win_indel_parents = 5000,
        win_inv_trans_parents = 40000,
        returnMethod = "dataFrame",
        input_fmt_SV = "Text")
    ## [1] "###Calculating Internal Frequency###"
    ## [1] "Sister not present"
    ## [1] "Brother not present"
    ## [1] "Sibling not present"
    ## [1] "Cousin not present"
    ## [1] "Husband not present"
    ## [1] "Wife not present"
    ## [1] "Son not present"
    ## [1] "Daughter not present"
    ## [1] "Maternal Grand Mother not present"
    ## [1] "Paternal Grand Mother not present"
    ## [1] "Maternal Grand Father not present"
    ## [1] "Paternal Grand Father not present"
    ## [1] "Maternal Uncle not present"
    ## [1] "Maternal Aunt not present"
    ## [1] "Paternal Uncle not present"
    ## [1] "Paternal Grand Father not present"
    ## [1] "Mother not present"
    ## [1] "Father not present"
    ## [1] "Sister not present"
    ## [1] "Brother not present"
    ## [1] "Sibling not present"
    ## [1] "Cousin not present"
    ## [1] "Husband not present"
    ## [1] "Wife not present"
    ## [1] "Son not present"
    ## [1] "Daughter not present"
    ## [1] "Maternal Grand Mother not present"
    ## [1] "Paternal Grand Mother not present"
    ## [1] "Maternal Grand Father not present"
    ## [1] "Paternal Grand Father not present"
    ## [1] "Maternal Uncle not present"
    ## [1] "Paternal Uncle not present"
    ## [1] "Paternal Grand Father not present"
    ## [1] 1
    ## [1] 2
    ## [1] 3
    ## [1] 4
    ## [1] "GM24385_Ason"
    ## [1] "GM24143_Amother"
    ## [1] "GM24149_Afather"
    ## [1] "NA12878"
    ## [1] "Chrom:1"
    datInf[1, ]
    ##       SampleID SmapEntryID QryContigID RefcontigID1 RefcontigID2 QryStartPos
    ## 1 GM24385_Ason         343         302            1            1    19362181
    ##   QryEndPos RefStartPos RefEndPos Confidence     Type XmapID1 XmapID2 LinkID
    ## 1  19371069   103038897 103048124         -1 deletion      21      21     -1
    ##   QryStartIdx QryEndIdx RefStartIdx RefEndIdx   Zygosity Genotype GenotypeGroup
    ## 1        3763      3764       19515     19516 homozygous        1           143
    ##   RawConfidence RawConfidenceLeft RawConfidenceRight RawConfidenceCenter SVsize
    ## 1          1.85           4609.43            1541.23                1.85  339.6
    ##   SVfreq orientation                      Sample           Algorithm Size
    ## 1  0.514          NA GM24385_Ason_DLE1_assembly5 assembly_comparison  340
    ##   Present_in_._of_BNG_control_samples
    ## 1                                22.5
    ##   Present_in_._of_BNG_control_samples_with_the_same_enzyme
    ## 1                                                     80.7
    ##   Fail_assembly_chimeric_score OverlapGenes NearestNonOverlapGene
    ## 1               not_applicable            -               COL11A1
    ##   NearestNonOverlapGeneDistance PutativeGeneFusion Found_in_parents_assemblies
    ## 1                        293899                  -                        both
    ##   Found_in_parents_molecules Found_in_self_molecules Mother_molecule_count
    ## 1                       both                     yes                    17
    ##   Father_molecule_count Self_molecule_count Internal_Freq_Perc_Filtered
    ## 1                    19                  31                           0
    ##   Internal_Freq_Perc_Unfiltered Internal_Homozygotes MotherZygosity FatherZygosity
    ## 1 ... homozygous
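
The idea of merging individual smaps into an internal reference and then deriving a cohort frequency can be sketched in base R as follows. This is an assumption-laden illustration, not the package's code: the per-sample tables, the `internal_freq_perc()` helper, and the definition of frequency as "percentage of cohort samples carrying a matching variant" are all hypothetical.

```r
# Toy per-sample SV tables standing in for individual smap files (invented data).
solo <- list(
  S1 = data.frame(type = c("deletion", "insertion"),
                  start = c(1000500, 5000000), end = c(1010500, 5000100),
                  stringsAsFactors = FALSE),
  S2 = data.frame(type = "deletion", start = 1002000, end = 1012000,
                  stringsAsFactors = FALSE),
  S3 = data.frame(type = "deletion", start = 9000000, end = 9005000,
                  stringsAsFactors = FALSE),
  S4 = data.frame(type = "inversion", start = 1000000, end = 1010000,
                  stringsAsFactors = FALSE)
)

# Merge the tables into one internal reference, tagging each row with its sample.
internal_db <- do.call(rbind, lapply(names(solo), function(id) {
  cbind(sample = id, solo[[id]])
}))

# Hypothetical helper: percentage of cohort samples that carry a variant of the
# same type with both breakpoints within `window` bases of the query.
internal_freq_perc <- function(query, db, window = 10000) {
  matched <- db$type == query$type &
    abs(db$start - query$start) <= window &
    abs(db$end - query$end) <= window
  carriers <- unique(db$sample[matched])
  100 * length(carriers) / length(unique(db$sample))
}

query <- list(type = "deletion", start = 1000000, end = 1010000)
internal_freq_perc(query, internal_db)  # 50: samples S1 and S2 out of 4
```

Counting distinct samples rather than rows keeps a sample with several matching calls from inflating the frequency.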

## Gene Overlap

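This step annotates known gene and non-coding RNA genomic locations that overlap with, or lie near, the identified structural variants. nanotatoR automatically determines the number of genes overlapping a given structural variant and reports the gene strand (+ or -) and the percentage overlap of the SV with the gene. The function also reports the names of the nearest upstream and downstream genes (user-selected; default 3 on each side) and their distances from the SV. The function requires an input BED file (inputfmt "bed") or a Bionano-compatible BED file (inputfmt "BNBED"), in which the X and Y chromosomes are recoded as 23 and 24, respectively. The BED file is overlapped with the query smap file (smap) to find the overlapping genes and the nearest non-overlapping upstream and downstream genes (n). The output of this function can be of two types: an R object (dataFrame) or a text file (Text).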

    smapName = "GM24385_Ason_DLE1_VAP_trio5.smap"
    smap = system.file("extdata", smapName, package = "nanotatoR")
    bedFile <- system.file("extdata", "HomoSapienGRCH19_lift37.bed", package = "nanotatoR")
    outpath <- system.file("extdata", package = "nanotatoR")
    # Annotate each SV with overlapping genes and the n nearest genes on each side
    datcomp <- overlapnearestgeneSearch(smap = smap,
        bed = bedFile,
        inputfmtBed = "bed",
        outpath,
        n = 3,
        returnMethod_bedcomp = c("dataFrame"),
        input_fmt_SV = "Text",
        EnzymeType = "SE",
        bperrorindel = 3000,
        bperrorinvtrans = 50000)
    ## [1] "###Comparing SVs and Beds###"
    ## [1] "Genome does not have any sex chromosome"
    ## [1] "***Overlap Genes***"
    ## [1] "***Non-Overlap Genes***"
    datcomp[1, ]
    ##       SampleID SmapEntryID QryContigID RefcontigID1 RefcontigID2 QryStartPos
    ## 1 GM24385_Ason         343         302            1            1    19362181
    ##   QryEndPos RefStartPos RefEndPos Confidence     Type XmapID1 XmapID2 LinkID
    ## 1  19371069   103038897 103048124         -1 deletion      21      21     -1
    ##   QryStartIdx QryEndIdx RefStartIdx RefEndIdx   Zygosity Genotype GenotypeGroup
    ## 1        3763      3764       19515     19516 homozygous        1           143
    ##   RawConfidence RawConfidenceLeft RawConfidenceRight RawConfidenceCenter SVsize
    ## 1          1.85           4609.43            1541.23                1.85  339.6
    ##   SVfreq orientation                      Sample           Algorithm Size
    ## 1  0.514          NA GM24385_Ason_DLE1_assembly5 assembly_comparison  340
    ##   Present_in_._of_BNG_control_samples
    ## 1                                22.5
    ##   Present_in_._of_BNG_control_samples_with_the_same_enzyme
    ## 1                                                     80.7
    ##   Fail_assembly_chimeric_score OverlapGenes NearestNonOverlapGene
    ## 1               not_applicable            -               COL11A1
    ##   NearestNonOverlapGeneDistance PutativeGeneFusion Found_in_parents_assemblies
    ## 1                        293899                  -                        both
    ##   Found_in_parents_molecules Found_in_self_molecules Mother_molecule_count
    ## 1                       both                     yes                    17
    ##   Father_molecule_count Self_molecule_count OverlapGenes_strand_perc
    ## 1                    19                  31                        -
    ##          Upstream_nonOverlapGenes_dist_kb
    ## 1 MXRA8(-:101747.828);DVL1(-:101765.241)
    ##   Downstream_nonOverlapGenes_dist_kb
    ## 1                                  -
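
The gene overlap step can be illustrated with a small base R sketch over a toy gene table. Everything here is hypothetical (gene names, coordinates, and the helpers `recode_chrom()`, `format_genes()` and `annotate_sv()`), and it makes two assumptions that the text does not confirm: the overlap percentage is taken as the fraction of the gene covered by the SV, and "upstream"/"downstream" are defined by genomic coordinate rather than by strand.

```r
# Toy gene table in BED-like form; names and coordinates are invented.
genes <- data.frame(
  chrom  = c("1", "1", "1", "X"),
  start  = c(1000, 5000, 9000, 2000),
  end    = c(2000, 7000, 9500, 3000),
  gene   = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
  strand = c("+", "-", "+", "-"),
  stringsAsFactors = FALSE
)

# Recode sex chromosomes as 23 (X) and 24 (Y), as described in the text.
recode_chrom <- function(chrom) {
  chrom[chrom == "X"] <- "23"
  chrom[chrom == "Y"] <- "24"
  as.integer(chrom)
}
genes$chrom <- recode_chrom(genes$chrom)

# Format genes as "NAME(strand:value)" joined by ";", or "-" when none.
format_genes <- function(g, value) {
  if (nrow(g) == 0) return("-")
  paste0(g$gene, "(", g$strand, ":", value, ")", collapse = ";")
}

annotate_sv <- function(sv, genes, n = 3) {
  g <- genes[genes$chrom == sv$chrom, ]
  overlap_bp <- pmin(g$end, sv$end) - pmax(g$start, sv$start)
  hit <- g[overlap_bp > 0, ]
  # Assumption: percentage of the gene that is covered by the SV.
  hit_perc <- round(100 * overlap_bp[overlap_bp > 0] / (hit$end - hit$start), 1)

  up <- g[g$end <= sv$start, ]
  up_dist <- (sv$start - up$end) / 1000        # distance in kb
  keep_up <- head(order(up_dist), n)           # n nearest genes

  down <- g[g$start >= sv$end, ]
  down_dist <- (down$start - sv$end) / 1000
  keep_down <- head(order(down_dist), n)

  list(
    overlap    = format_genes(hit, hit_perc),
    upstream   = format_genes(up[keep_up, ], up_dist[keep_up]),
    downstream = format_genes(down[keep_down, ], down_dist[keep_down])
  )
}

sv <- list(chrom = 1, start = 5500, end = 8000)
annotate_sv(sv, genes)
```

For this toy SV the sketch reports GENE_B as overlapping (75% of the gene), GENE_A 3.5 kb upstream and GENE_C 1 kb downstream.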

## Entrez Extract

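This step generates a list of genes associated with the patient's phenotype and overlaps it with the names of genes spanning structural variants. User-provided phenotype keywords are used to generate the gene list from selected databases such as ClinVar, OMIM, GTR and the Gene Registry. The resulting list is used to rank structural variants that occur in genes known to be associated with the patient's phenotype.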

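The input to the function is a term, which can be provided as a single character input (method_entrez = "Single"), as a vector of terms (method_entrez = "Multiple") or as a text file (method_entrez = "Text"). The output can be an R object (dataFrame) or a text file (Text).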

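    terms = "CIRRHOSIS, FAMILIAL"
    # Build the phenotype gene list from local copies of the databases
    gene <- gene_list_generation(
        method_entrez = c("Single"),
        term = terms,
        returnMethod = c("dataFrame"),
        omimID = "OMIM:118980",
        omim = system.file("extdata", "mim2gene.txt", package = "nanotatoR"),
        clinvar = system.file("extdata", "localPDB/", package = "nanotatoR"),
        gtr = system.file("extdata", "gtrDatabase.txt", package = "nanotatoR"),
        downloadClinvar = FALSE,
        downloadGTR = FALSE)
    gene[1:10, ]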
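
How a phenotype gene list might be used to prioritise SVs can be sketched as below. This is a hedged illustration: `flag_phenotype_genes()` is a hypothetical helper, the gene list and most SV rows are invented, and the code assumes that overlapping genes are stored as a ";"-separated string with an optional "(strand:percent)" suffix and "-" for no gene.

```r
# Hypothetical helper: for each SV, report which overlapping genes appear in
# the phenotype gene list ("-" when none do).
flag_phenotype_genes <- function(overlap_genes, gene_list) {
  vapply(strsplit(overlap_genes, ";", fixed = TRUE), function(genes) {
    genes <- sub("\\(.*$", "", genes)   # drop a "(strand:percent)" suffix if present
    hits <- intersect(genes, gene_list)
    if (length(hits) == 0) "-" else paste(hits, collapse = ";")
  }, character(1))
}

# Toy SV table and gene list, invented for illustration.
svs <- data.frame(
  SmapEntryID  = c(343, 397, 512),
  OverlapGenes = c("-", "AGL", "GENE_X(+:40);GENE_Y(-:100)"),
  stringsAsFactors = FALSE
)
gene_list <- c("GENE_Y", "GENE_Z", "AGL")

svs$PhenotypeGenes <- flag_phenotype_genes(svs$OverlapGenes, gene_list)

# SVs that hit a phenotype gene are listed first.
svs[order(svs$PhenotypeGenes == "-"), ]
```

Sorting on the logical `PhenotypeGenes == "-"` puts rows with a phenotype gene (FALSE) ahead of rows without one (TRUE).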

## Expression Data Integration

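The SVexpression_solo and SVexpression_duo_trio functions, for solo and duo/trio analyses respectively, give users a way to integrate tissue-specific gene expression values with SVs. The function takes as input a matrix of gene names and the corresponding expression values for each sample (individual files can be merged with the RNAseqcombine function for duos/trios and with the RNAseqcombine_solo function for solos).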

    RNASeqDir = system.file("extdata", package = "nanotatoR")
    returnMethod = "dataFrame"
    # Combine the RNA-Seq result files found in RNASeqDir
    datRNASeq <- RNAseqcombine_solo(RNASeqDir = RNASeqDir,
        returnMethod = returnMethod)
    ## 'select()' returned 1:1 mapping between keys and columns

    smapName = "NA12878_DLE1_VAP_solo5.smap"
    smap = system.file("extdata", smapName, package = "nanotatoR")
    bedFile <- system.file("extdata", "HomoSapienGRCH19_lift37.bed", package = "nanotatoR")
    outpath <- system.file("extdata", package = "nanotatoR")
    datcomp <- overlapnearestgeneSearch(smap = smap,
        bed = bedFile,
        inputfmtBed = "bed",
        outpath,
        n = 3,
        returnMethod_bedcomp = c("dataFrame"),
        input_fmt_SV = "Text",
        EnzymeType = "SE",
        bperrorindel = 3000,
        bperrorinvtrans = 50000)
    ## [1] "###Comparing SVs and Beds###"
    ## [1] "Genome does not have any sex chromosome"
    ## [1] "***Overlap Genes***"
    ## [1] "***Non-Overlap Genes***"

    # Attach expression values to the gene-annotated SVs
    datRNASeq1 <- SVexpression_solo(input_fmt_SV = c("dataFrame"),
        smapdata = datcomp,
        input_fmt_RNASeq = c("dataFrame"),
        RNASeqData = datRNASeq,
        outputfmt = c("dataFrame"),
        pattern_Proband = "*_P_*",
        EnzymeType = c("SE"))
    ## [1] "###OverlapGenes###"
    ## [1] "###NonOverlapUPStreamGenes###"
    ## [1] "###NonOverlapDNStreamGenes###"
    datRNASeq1[1, ]
    ##   SampleID SmapEntryID QryContigID RefcontigID1 RefcontigID2 QryStartPos
    ## 1  NA12878         397        3451            1            1    16678749
    ##   QryEndPos RefStartPos RefEndPos Confidence     Type XmapID1 XmapID2 LinkID
    ## 1  16683839   100343445 100348831         -1 deletion      24      24     -1
    ##   QryStartIdx QryEndIdx RefStartIdx RefEndIdx     Zygosity Genotype
    ## 1        3260      3261       18927     18928 heterozygous        1
    ##   GenotypeGroup RawConfidence RawConfidenceLeft RawConfidenceRight
    ## 1             1          2.96           4105.98            1852.86
    ##   RawConfidenceCenter SVsize SVfreq orientation                 Sample
    ## 1                2.96  296.1   0.51          NA NA12878_DLE1_assembly5
    ##             Algorithm Size Present_in_._of_BNG_control_samples
    ## 1 assembly_comparison  296                                   0
    ##   Present_in_._of_BNG_control_samples_with_the_same_enzyme
    ## 1                                                        0
    ##   Fail_assembly_chimeric_score OverlapGenes NearestNonOverlapGene
    ## 1               not_applicable          AGL              BC112312
    ##   NearestNonOverlapGeneDistance PutativeGeneFusion Found_in_self_molecules
    ## 1                         85170                  -                     yes
    ##   Self_molecule_count OverlapGenes_strand_perc
    ## 1                  41                        -
    ##                              Upstream_nonOverlapGenes_dist_kb
    ## 1 AURKAIP1(-:99031.335);MXRA8(-:99052.376);DVL1(-:99069.789)
    ##   Downstream_nonOverlapGenes_dist_kb OverlapProbandEXP
    ## 1                                  -                 -
    ##         NonOverlapUPprobandEXP NonOverlapDNprobandEXP
    ## 1 AURKAIP1(-);MXRA8(-);DVL1(-)                      -
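
The expression columns in the output above pair each gene with a value in the form "GENE(value)", with "-" where no value is available. A minimal sketch of that kind of lookup is shown below; `add_expression()` is a hypothetical helper, the expression values are invented, and the exact format nanotatoR uses internally is assumed from the printed row.

```r
# Hypothetical helper: attach expression values to the genes listed for one SV.
# `expression` is a named numeric vector (gene name -> expression value).
add_expression <- function(gene_field, expression) {
  if (gene_field == "-") return("-")
  genes <- strsplit(gene_field, ";", fixed = TRUE)[[1]]
  genes <- sub("\\(.*$", "", genes)            # drop "(strand:distance)" suffix
  values <- unname(expression[genes])           # NA for genes without a value
  values <- ifelse(is.na(values), "-", as.character(values))
  paste0(genes, "(", values, ")", collapse = ";")
}

expression <- c(AGL = 12.4, MXRA8 = 3.1)        # toy values, not real data

add_expression("AURKAIP1(-:99031.335);MXRA8(-:99052.376);DVL1(-:99069.789)",
               expression)
# "AURKAIP1(-);MXRA8(3.1);DVL1(-)"
add_expression("-", expression)
# "-"
```

Genes missing from the expression table keep a "-" placeholder, which matches the pattern seen in the NonOverlapUPprobandEXP column above.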

## Filter Variants

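This step provides easy-to-use, user-selected filtering criteria that segregate variants into corresponding groups (such as de novo, inherited, or present in a gene list). In this step, the generated or pre-existing gene list is appended to the smap file. The input to this function can be a file (smap) or a data frame. Both raw and nanotatoR-annotated smaps can be used as input. The function has options to accept the smap (input_fmt_svMap) and the gene list (input_fmt_geneList) as a dataFrame or as text files, and it produces an Excel file as output.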

    smapName <- "GM24385_Ason_DLE1_VAP_trio5.smap"
    outputFilename <- "GM24385_Ason_DLE1_VAP_trio5_out"
    smappath <- system.file("extdata", smapName, package = "nanotatoR")
    RZIPpath <- system.file("extdata", "zip.exe", package = "nanotatoR")
    smap = system.file("extdata", smapName, package = "nanotatoR")
    bedFile <- system.file("extdata", "HomoSapienGRCH19_lift37.bed", package = "nanotatoR")
    outpath <- system.file("extdata", package = "nanotatoR")
    directoryName <- system.file("extdata", package = "nanotatoR")

    # Step 1: gene overlap
    datcomp <- overlapnearestgeneSearch(smap = smap,
        bed = bedFile,
        inputfmtBed = "bed",
        outpath,
        n = 3,
        returnMethod_bedcomp = c("dataFrame"),
        input_fmt_SV = "Text",
        EnzymeType = "SE",
        bperrorindel = 3000,
        bperrorinvtrans = 50000)
    ## [1] "###Comparing SVs and Beds###"
    ## [1] "Genome does not have any sex chromosome"
    ## [1] "***Overlap Genes***"
    ## [1] "***Non-Overlap Genes***"

    # Step 2: DGV frequency
    hgpath = system.file("extdata", "GRCh37_hg19_variants_2016-05-15.txt", package = "nanotatoR")
    datDGV <- DGVfrequency(hgpath = hgpath,
        smap_data = datcomp,
        win_indel_DGV = 10000,
        input_fmt_SV = "dataFrame",
        EnzymeType = "SE",
        perc_similarity_DGV = 0.5,
        returnMethod = "dataFrame")
    ## [1] "Chrom:1"

    # Step 3: internal (cohort) frequency
    indelconf = 0.5; invconf = 0.01; transconf = 0.1
    mergedKeyFname = "Sample_index.csv"
    datInf <- internalFrequencyTrio_Duo(smapdata = datDGV,
        buildSVInternalDB = TRUE,
        win_indel = 10000,
        win_inv_trans = 50000,
        labelType = c("SE"),
        EnzymeType = "SE",
        SE_path = system.file("extdata", "SoloFile/", package = "nanotatoR"),
        SE_pattern = "*_DLE1_*",
        perc_similarity_parents = 0.9,
        Samplecodes = system.file("extdata", "nanotatoR_sample_codes.csv", package = "nanotatoR"),
        mergeKey = system.file("extdata", "nanotatoR_control_sample_codes.csv", package = "nanotatoR"),
        mergedKeyoutpath = system.file("extdata", package = "nanotatoR"),
        mergedKeyFname = mergedKeyFname,
        indexfile = system.file("extdata", mergedKeyFname, package = "nanotatoR"),
        perc_similarity = 0.5,
        indelconf = 0.5,
        invconf = 0.01,
        transconf = 0.1,
        limsize = 1000,
        win_indel_parents = 5000,
        win_inv_trans_parents = 40000,
        returnMethod = "dataFrame",
        input_fmt_SV = "dataFrame")
    ## [1] "###Calculating Internal Frequency###"
    ## [1] "Sister not present"
    ## [1] "Brother not present"
    ## [1] "Sibling not present"
    ## [1] "Cousin not present"
    ## [1] "Husband not present"
    ## [1] "Wife not present"
    ## [1] "Son not present"
    ## [1] "Daughter not present"
    ## [1] "Maternal Grand Mother not present"
    ## [1] "Paternal Grand Mother not present"
    ## [1] "Maternal Grand Father not present"
    ## [1] "Paternal Grand Father not present"
    ## [1] "Maternal Uncle not present"
    ## [1] "Maternal Aunt not present"
    ## [1] "Paternal Uncle not present"
    ## [1] "Paternal Grand Father not present"
    ## [1] "Mother not present"
    ## [1] "Father not present"
    ## [1] "Sister not present"
    ## [1] "Brother not present"
    ## [1] "Sibling not present"
    ## [1] "Cousin not present"
    ## [1] "Husband not present"
    ## [1] "Wife not present"
    ## [1] "Son not present"
    ## [1] "Daughter not present"
    ## [1] "Maternal Grand Mother not present"
    ## [1] "Paternal Grand Mother not present"
    ## [1] "Maternal Grand Father not present"
    ## [1] "Paternal Grand Father not present"
    ## [1] "Maternal Uncle not present"
    ## [1] "Paternal Uncle not present"
    ## [1] "Paternal Grand Father not present"
    ## [1] 1
    ## [1] 2
    ## [1] 3
    ## [1] 4
    ## [1] "GM24385_Ason"
    ## [1] "GM24143_Amother"
    ## [1] "GM24149_Afather"
    ## [1] "NA12878"
    ## [1] "Chrom:1"

    # Step 4: Bionano control database frequency
    path <- system.file("extdata", "Bionano_config/", package = "nanotatoR")
    pattern <- "*_hg19_*"
    datBNDB <- BNDBfrequency(smapdata = datInf,
        buildBNInternalDB = TRUE,
        input_fmt_SV = "dataFrame",
        dbOutput = "dataframe",
        BNDBpath = path,
        BNDBpattern = pattern,
        outpath, fname,
        win_indel = 10000,
        win_inv_trans = 50000,
        perc_similarity = 0.5,
        indelconf = 0.5,
        invconf = 0.01,
        limsize = 1000,
        transconf = 0.1,
        returnMethod = c("dataFrame"),
        EnzymeType = c("SE"))
    ## [1] "###Cohort Frequency Calculation###"
    ## [1] "###Building Database###"
    ## [1] "current_ctrl_dup_hg19_anonymize.txt"
    ## [1] 198  10
    ## [1] "current_ctrl_ins_del_hg19_anonymize.txt"
    ## [1] 200  10
    ## [1] "current_ctrl_inv_hg19_anonymize.txt"
    ## [1] 200  11
    ## [1] "current_ctrl_trans_hg19_anonymize.txt"
    ## [1] 199
    ## [1] "Chrom:1"

    # Step 5: DECIPHER frequency
    decipherpath = system.file("extdata", "population_cnv.txt", package = "nanotatoR")
    datdecipher <- Decipherfrequency(decipherpath = decipherpath,
        smap_data = datBNDB,
        win_indel = 10000,
        perc_similarity = 0.5,
        returnMethod = "dataFrame",
        input_fmt_SV = "dataFrame",
        EnzymeType = c("SE"))
    ## [1] "###Calculating Decipher Frequency###"

    # Step 6: filter the fully annotated variants and write the output
    run_bionano_filter_SE_Trio(input_fmt_geneList = c("Text"),
        input_fmt_SV = c("dataFrame"),
        svData = datdecipher,
        dat_geneList = dat_geneList,
        RZIPpath = RZIPpath,
        EnzymeType = c("SE"),
        outputType = c("csv"),
        primaryGenesPresent = FALSE,
        fileprefix = "AnnotatedSamplesGM24385",
        outputFilename = outputFilename,
        directoryName = directoryName,
        outpath = outpath)
    ## [1] "fname <- paste(outputFilename, \"_dl.xlsx\", sep = \"\")\n write.xlsx(list_of_datasets, file = file.path(outpath, fname), keepNA = TRUE)"
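
The grouping of variants into classes such as de novo and inherited can be illustrated with the parental columns shown in the earlier outputs. The sketch below is an assumption, not the package's filtering logic: `classify_inheritance()` is hypothetical, the table is invented, and the values "none", "mother" and "father" are assumed to exist alongside the "both" value visible in the output above.

```r
# Hypothetical classification: a variant is treated as inherited if it was
# found in either parent's assembly or molecules, and as de novo otherwise.
# Assumption: the columns take the values "both", "mother", "father" or "none".
classify_inheritance <- function(found_in_parents_assemblies,
                                 found_in_parents_molecules) {
  in_parents <- found_in_parents_assemblies != "none" |
    found_in_parents_molecules != "none"
  ifelse(in_parents, "inherited", "de novo")
}

# Toy annotated SV table (invented rows).
svs <- data.frame(
  SmapEntryID = c(343, 401, 512),
  Found_in_parents_assemblies = c("both", "none", "mother"),
  Found_in_parents_molecules  = c("both", "none", "mother"),
  stringsAsFactors = FALSE
)
svs$Group <- classify_inheritance(svs$Found_in_parents_assemblies,
                                  svs$Found_in_parents_molecules)

# One data frame per group, e.g. one sheet per group in a spreadsheet.
list_of_datasets <- split(svs, svs$Group)
names(list_of_datasets)  # "de novo" "inherited"
```

A named list of data frames like `list_of_datasets` is the shape that spreadsheet writers commonly accept for multi-sheet output, which appears to be what the logged `write.xlsx(list_of_datasets, ...)` call above relies on.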

## Main

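The main function runs the available nanotatoR sub-functions sequentially, appending the output of each function. It takes as input the smap file, DGV file, BED file, internal database file and phenotype term list. It also takes the output path and file name for the final Excel file. The input parameters of the individual sub-functions can also be set by the user.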

    smapName = "NA12878_DLE1_VAP_solo5.smap"
    smap = system.file("extdata", smapName, package = "nanotatoR")
    bedFile <- system.file("extdata", "HomoSapienGRCH19_lift37.bed", package = "nanotatoR")
    hgpath = system.file("extdata", "GRCh37_hg19_variants_2016-05-15.txt", package = "nanotatoR")
    labelType = c("SE")
    SE_path = system.file("extdata", "SoloFile/", package = "nanotatoR")
    SE_pattern = "*_DLE1_*"
    Samplecodes = system.file("extdata", "nanotatoR_sample_codes.csv", package = "nanotatoR")
    mergeKey = system.file("extdata", "nanotatoR_control_sample_codes.csv", package = "nanotatoR")
    mergedKeyoutpath = system.file("extdata", package = "nanotatoR")
    mergedKeyFname = "Sample_index.csv"
    RNASeqDir = system.file("extdata", "NA12878_P_Blood_S1.genes.results", package = "nanotatoR")
    path = system.file("extdata", "Bionano_config/", package = "nanotatoR")
    pattern = "_hg19.txt"
    outputFilename <- "NA12878_DLE1_VAP_solo5_out"
    outpath <- system.file("extdata", smapName, package = "nanotatoR")
    RZIPpath <- system.file("extdata", "zip.exe", package = "nanotatoR")
    directoryName <- system.file("extdata", package = "nanotatoR")
    # Run the full solo, single-enzyme pipeline
    nanotatoR_main_Solo_SE(
        smap = smap, bed = bedFile, inputfmt = c("bed"), n = 3,
        buildBNInternalDB = TRUE, path = path, pattern = pattern,
        buildSVInternalDB = TRUE, EnzymeType = c("SE"), labelType = c("SE"),
        SE_path = SE_path, SE_pattern = SE_pattern,
        win_indel_INF = 10000, win_inv_trans_INF = 50000,
        perc_similarity_INF = 0.5,
        indelconf = 0.5, invconf = 0.01, transconf = 0.1,
        hgpath = hgpath, win_indel_DGV = 10000, win_inv_trans_DGV = 50000,
        perc_similarity_DGV = 0.5,
        RNAseqcombo = TRUE, RNASeqDir = RNASeqDir,
        returnMethod = "dataFrame", pattern_Proband = "*_P_*",
        outpath = outpath, outputFilename = outputFilename,
        termListPresent = FALSE, datGeneListPresent = FALSE,
        InternaldatabasePresent = FALSE, primaryGenesPresent = FALSE,
        fileprefix = "AnnotatedNA12878", directoryName = directoryName,
        outputType = c("csv"))
    ## [1] "####PipeLine Starts####"
    ## [1] "Time taken to run gene_list_generation is: -1.52587890625e-05"
    ## [1] "###Comparing SVs and Beds###"
    ## [1] "Genome does not have any sex chromosome"
    ## [1] "***Overlap Genes***"
    ## [1] "***Non-Overlap Genes***"
    ## [1] "Time taken to run compSmapbed is: -0.0210995674133301"
    ## [1] "Chrom:1"
    ## [1] "Time taken to run DGVfrequency is: -0.0202255249023438"
    ## [1] "###Calculating Internal Frequency###"
    ## [1] "internalFrequency is not working"
    ## [1] "Time taken to run internalFrequency is: -0.00423622131347656"
    ## [1] "###Cohort Frequency Calculation###"
    ## [1] "###Building Database###"
    ## [1] "BNDBfrequency is not working"
    ## [1] "Time taken to run BNDBfrequency is: -0.000319242477416992"
    ## [1] "###Calculating Decipher Frequency###"
    ## [1] "Decipherfrequency is not working"
    ## [1] "Time taken to run Decipherfrequency is: -0.000112056732177734"
    ## [1] "RNAseqcombo is not working"
    ## [1] "###OverlapGenes###"
    ## [1] "###NonOverlapUPStreamGenes###"
    ## [1] "Time taken to run SmapRNAseqquery is: -0.00465917587280273"
    ## [1] "run_bionano_filter_SE_solo is not working"
    ## [1] "Time taken to run run_bionano_filter_SE_solo is: -0.000916242599487305"
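
The log above suggests a pipeline in which each step is timed and a failing step is reported without stopping the run. A minimal sketch of that pattern in base R follows; it is not the package's code. `run_step()` and the three toy steps are hypothetical, and the fallback of passing the unmodified data on after a failure is an assumption.

```r
# Hypothetical runner: time one annotation step and keep going if it fails.
run_step <- function(name, step, data) {
  started <- Sys.time()
  result <- tryCatch(step(data), error = function(e) {
    message(name, " is not working: ", conditionMessage(e))
    data                                  # assumption: fall back to the input
  })
  elapsed <- as.numeric(difftime(Sys.time(), started, units = "secs"))
  message("Time taken to run ", name, " is: ", round(elapsed, 4), " s")
  result
}

# Toy steps; each takes a data frame and returns it with extra columns.
steps <- list(
  add_overlap   = function(d) { d$OverlapGenes <- "-"; d },
  add_frequency = function(d) stop("database file not found"),
  add_size_kb   = function(d) { d$Size_kb <- d$Size / 1000; d }
)

annotated <- data.frame(SmapEntryID = c(343, 397), Size = c(340, 296))
for (name in names(steps)) {
  annotated <- run_step(name, steps[[name]], annotated)
}
names(annotated)  # "SmapEntryID" "Size" "OverlapGenes" "Size_kb"
```

Because the failing step returns its input unchanged, later steps still run and the final table simply lacks the columns that step would have added.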

# References

  1. Hayk Barseghyan, Wilson Tang, Richard T. Wang, Miguel Almalvez, Eva Segura, Matthew S. Bramble, Allen Lipson, Emilie D. Douine, Hane Lee, Emmanuèle C. Délot, Stanley F. Nelson and Eric Vilain. Next-generation mapping: a novel approach for detection of pathogenic structural variants with a potential utility in clinical diagnosis. Genome Medicine 2017 9:90. https://doi.org/10.1186/s13073-017-0479-0

  2. Winter, D. J. rentrez: An R package for the NCBI eUtils API. The R Journal 2017, 9(2):520-526.

  3. Christopher Brown. hash: Full feature implementation of hash/associated arrays/dictionaries. R package version 2.2.6. https://CRAN.R-project.org/package=hash

  4. Hadley Wickham. stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.3.1. https://CRAN.R-project.org/package=stringr

  5. Bionano Genomics. Theory of Operation - Structural Variant Calling. https://bionanogenomics.com/wp-content/uploads/2018/04/30110-Bionano-Solve-Theory-of-Operation-Structural-Variant-Calling.pdf

  6. Bionano Genomics. Theory of Operation - Variant Annotation Pipeline. https://bionanogenomics.com/wp-content/uploads/2018/04/30190-Bionano-Solve-Theory-of-Operation-Variant-Annotation-Pipeline.pdf