R version: R version 4.2.0 RC (2022-04-19 r82224)
Bioconductor version: 3.15
Package version: 1.10.0
生物信息编码的DNA序列is often organized into independent modules or clusters. For instance, the eukaryotic system of expression relies on combinations of homotypic or heterotypic transcription factors (TFs) which play an important role in activating and repressing target genes. Identifying clusters of genomic features is a frequent task and includes application such as genomic regions enriched for the presence of a combination of transcription factor binding sites or enriched for the presence of mutations often referred as regions with increase mutational hotspots.
fcScan is designed to detect clusters of genomic features, based on user defined search criteria. Such criteria include:
fcScan is designed to handle large number of genomic features (i.e data generated by High Throughput Sequencing).
fcScan depends on the following packages:
Currently, fcScan has one main function, thegetCluster
function. Additional functionality will be added for future releases including cross-species identification of orthologous clusters.
The input forgetCluster
给出through the parameterx
. This function accepts input in data frame, GRanges as well as vector of files in BED and VCF (compressed or not) formats. BED and VCF files are loaded by packagesrtracklayer
andVariantAnnotation
respectively. There is no limit to the number of files the user can define.
When input is data frame or GRanges objects are given, they should contain the following “named” 5 columns:
seqnames | start | end | strand | site |
---|---|---|---|---|
chr1 | 10 | 15 | + | a |
chr1 | 17 | 20 | + | b |
chr1 | 25 | 30 | + | b |
chr1 | 27 | 35 | + | a |
chr1 | 32 | 40 | + | c |
chr1 | 41 | 48 | + | b |
Theseqnames,startandendcolumns define the name of the chromosome, the starting and ending position of the feature in the chromosome, respectively. Thestrandcolumn defines the strand. Thesitecolumn contains the ID of the site and will be used for clustering. Thestartandendcolumns are numeric while the remaining columns are characters.
Note: when input is data frame, the data is consideredzero-based
as in BED format.
Window size is set usingw
. It defines themaximum
size of the clusters.
The clustering conditionc
defines the number and name of genomic features to search for based on features names defined in thesite
columnc = c("a" = 1, "b" = 2)
This searches for clusters with 3 sites, Oneasite and twobsites.
Another way of writing the condition, only if input is a vector to file paths, is the followingx = ("a.bed", "b.bed"), c = c(1,2)
Given 2 files,a.bedandb.bed, this condition states that the user is looking for clusters made from 1 “a” site and 2 “b” sites. In this case, the order of sites defined inc
is relative to the order of files.
When input is a data frame or GRanges object (instead of files), the user needs toexplicitlydefine the site names along with the desired number relative to each site. For instance, giving the condition asc = c(1,2)
for a data frame or GRanges is not allowed.
x = dataframe, c = c("a" = 1, "b" = 2)
wherea
andb
are valid site names insite
column in the dataframe/GRanges
Users can exclude clusters containing a specific site(s). This is done by specifying zero0
in the condition asc = c("a" = 1, "b" = 2, "c" = 0)
. In this case, any cluster(s) containingc
site will be excluded even if it has 1a
and 2b
sites.
By default, clustering will be performed on both strands and on all seqnames unless specified by the user using thes
andseqnames
arguments to limit the search on a specific strand and/or seqname.
Users can choose to cluster on one specific seqname(seqnames = "chr1")
, or on several seqnames(seqnames = c("chr1","chr3","chr4"))
(Default forseqnames
isNULL) meaning that clustering on all seqnames will be performed.
Fors
, the values allowed are:
The gap/overlap between adjacent clusters, and not sites, can be controled using theoverlap
option. Whenoverlap
is a positive integer, adjacent clusters will be separated by a minimum of the given value. Whenoverlap
is negative, adjacent clusters will be allowed to overlap by a maximum of the given value. (Default is set to0)
greedy
allows the user to control the number of genomic features found in clusters.
Whengreedy = FALSE
,getCluster
will build clusters having the required window size and will labelTRUEthe ones that contain the确切的number of sites provided in the condition argument. Clusters having the user defined window size but not satisfying the condition will be labelled asFALSE.
Whengreedy = TRUE
, additional sites defined in condition will be added to the cluster as long as the cluster size is below the defined window size. (Default is set toFALSE)
Theorder
option defines the desired order of the sites within identified clusters. For instance,order = c("a","b","b")
will search for clusters containing 1a
and 2b
sites and checks if they are in the specified order. Clusters with 1a
and 2b
sites that do not contain the specified order will be rejected. When greedy is set toTRUE
, order can be satisfied if a subcluster contains the desired order. For example if a cluster hasa, a, b, b, b
sites, it satisfies the required order (a, b, b) and therefore will be considered as a correct cluster . (Default is set toNULL)
Thesite_orientation
option defines the orientation or strandness of sites in the found clusters. This option cannot be used iforder
isNULL
.site_orientation
should be specified for each site inorder
. For instance, iforder = c("a","b","b")
, we can definesite_orientation
for each site respectively as follow:site_orientation = c("+","-","-")
. The cluster will be correct if it satisfies the required order and sites orientation. (Default is set toNULL)
Thesite_overlap
option defines the maximum or minimum distance allowed between sites in the cluster. Whensite_overlap
is a positive integer, sites within a cluster should haveminimum distance and above
, then the cluster is consideredTRUE
cluster. Whensite_overlap
is a negative integer, sites within a cluster should havemax distance and below
, then the cluster is consideredTRUE
cluster. Whensite_overlap
is zero, distance between sites is not taken into consideration. (Default is set to0)
Thecluster_by
option allows the usage of different ways in scanning for sites. Using this option, scanning does not ‘jump’ clusters found.cluster_by
can bestartsEnds
,endsStarts
,starts
,ends
, ormiddles
. If we choosestartsEnds
, the clustering begins at the start of the first site and it ends at the end of the last site within the window size. If we chooseendsStarts
, scanning begins at the end of the first site and it ends at the start of the last site within the window size. if we choosestarts
, the clustering begins at the start of the first site and it ends at the start of the last site within the window size. if we chooseends
, the clustering begins at the end of the first site and it ends at the end of the last site within the window size. Also, if we choosemiddles
, the clustering begins at the middle of the first site and it ends at the middle of the last site within the window size. (Default is set tostartsEnds)
Theallow_clusters_overlap
allows the user whether to include overlapping clusters or not. (Default is set to FALSE)
This option can be only used in case where the “cluster_by” choosen is “startsEnds” or “ends”, otherwise it is ignored. If it is set toTRUE
, partially overlapping sites within window size (at the end of the cluster) are allowed. However, if it is set toFALSE
, partially overlapping sites are not included in the cluster found. (Default is set toFALSE)
This option allows the user to choose which partially overlapping site are included. The amount of overlap between the last site found and the window size controls the addition or removal of those sites. For example, ifpartial_overlap_percentage
is set to 0.6, this means that the window size should overlap this site by 60% of its size. However, if it fails, this site won’t be included in the cluster found. This option can be only used in case where the “cluster_by” chosen is “startsEnds” or “ends”, otherwise it is ignored. (Default is set toNULL)
This option allows to distribute the run over multiple threads. It is recommended to use multiple threads when the input dataset is large. Whenthreads
is set to 0,getCluster
reserves the adequate number of threads and performs the run. This will dramatically increase the speed ofgetCluster
. The user can choose the number of threads between 1 and the maximum number of available threads. (Default is set to1)
Theverbose
option allows the printing of additional messages and the list of clusters that failed for lack of correct combination of sites. This option is used mainly for debugging purposes. (Default is set to FALSE)
The output ofgetCluster
is a GRanges object with fields:
The algorithm returns all clusters containing the correct count of sites/features, unless verbose is set toTRUE
. If the combination, overlap and order options are satisfied, the cluster is considered aTRUE
cluster.
Thestatusof a cluster can be either PASS, excludedSites, orderFail, siteOrientation or siteOverlapPASS
is a cluster that satisfied the desired combination, overlap, order and sites orientation.orderFail
is a cluster that had the required combination but did not satisfy the required order of sites.excludedSites
is a cluster that had the required combination and order but it has one or more sites to exclude.siteOrientation
is a cluster that had the required combination and order but it has one or more sites with different orientation than requested.siteOverlap
is a cluster that meets all the requirements but the distance between its sites is not respected.
注意:如果user is usinggreedy = FALSE
andorder
contains values more than in the condition parameter (c
), an error will be raised. However, ifgreedy = TRUE
, then usingorder
with more values than the condition parameter is allowed since the cluster may contain more sites than the requiredc
condition as long as the window size is satisfied.
Example usinggetCluster
:
getCluster
looks for desired genomic regions of interest like transcription factor binding sites, within a window size and specific condition.
This function accepts a data frame and GRanges object. getCluster also accepts BED or VCF (or mix of both) files as input. The output ofgetCluster
is a GRanges object that contains the genomic coordinates (seqnames, ranges and strand) and three metadata columns:
sites: contains clusters of sites that conforms with the conditionc
specified ingetCluster
.
isCluster:TRUE
if the cluster found conform with the conditionc
and theorder
(if indicated in condition) andFALSE
if the cluster fails to conform with the condition or order.
status:PASS
ifisCluster
equalsTRUE
. However, ifisCluster
isFALSE
, status shows why the found cluster is not aTRUE
cluster. If the order of sites is not respected in the found cluster, status would returnOrderFail
. in Addition, if the cluster found contains non desired sites, it returnsexcludedSites
. Moreover, if the sites orientation is not respected in found cluster, status would returnsiteOrientation
. Finally , if the distance between sites in the cluster found does not conform to the condition specified by the user, status would returnsiteOverlap
.
In this example, we askgetCluster
to look for clusters that contains one site “s1”, one site “s2” and zero “s3” sites. In addition, we requested clusters to have sites in the order s1,s2 and having orientation “+”,“+” respectively with minimum distance between sites of 2bp.
x1 = data.frame(seqnames = rep("chr1", times = 17), start = c(1,10,17,25,27,32,41,47,60,70,87,94,99,107,113,121,132), end = c(8,15,20,30,35,40,48,55,68,75,93,100,105,113,120,130,135), strand = c("+","+","+","+","+","+","+","+","+", "+","+","+","+","+","+","+","-"), site = c("s3","s1","s2","s2","s1","s2","s1","s1","s2","s1","s2", "s2","s1","s2","s1","s1","s2")) clusters = getCluster(x1, w = 20, c = c("s1" = 1, "s2" = 1, "s3" = 0), greedy = TRUE, order = c("s1","s2"), site_orientation=c("+","+"), site_overlap = 2, verbose = TRUE) #> 17 entries loaded #> Running getCluster using 1 threads #> Time difference of 1.189334 secs clusters #> GRanges object with 9 ranges and 3 metadata columns: #> seqnames ranges strand | sites isCluster status #> | #> [1] chr1 2-20 * | s3,s1,s2 FALSE excludedSites #> [2] chr1 11-30 * | s1,s2,s2 TRUE PASS #> [3] chr1 33-48 * | s2,s1 FALSE orderFail #> [4] chr1 48-68 * | s1,s2 TRUE PASS #> [5] chr1 88-105 * | s2,s2,s1 FALSE orderFail #> [6] chr1 95-113 * | s2,s1,s2 FALSE siteOverlap #> [7] chr1 100-120 * | s1,s2,s1 FALSE siteOverlap #> [8] chr1 108-120 * | s2,s1 FALSE orderFail #> [9] chr1 122-135 * | s1,s2 FALSE siteOrientation #> ------- #> seqinfo: 1 sequence from an unspecified genome; no seqlengths
Another example but using GRanges as input: in this example, we askgetCluster
to look for clusters that contains one sites1
and two sitess2
within a window size of 25 bp. Also, we requested clusters to be searched as+
strand.
suppressMessages(library(GenomicRanges)) x = GRanges( seqnames = Rle("chr1", 16), ranges = IRanges(start = c(10L,17L,25L,27L,32L,41L,47L, 60L,70L,87L,94L,99L,107L,113L,121L,132L), end = c(15L,20L,30L,35L,40L,48L,55L,68L,75L,93L,100L,105L, 113L,120L,130L,135L)), strand = Rle("+",16), site = c("s1","s2","s2","s1","s2","s1","s1","s2", "s1","s2","s2","s1","s2","s1","s1","s2")) clusters = getCluster(x, w = 25, c = c("s1"=1,"s2"=2), s = "+") #> 16 entries loaded #> Running getCluster using 1 threads #> Time difference of 0.1115539 secs clusters #> GRanges object with 2 ranges and 3 metadata columns: #> seqnames ranges strand | sites isCluster status #> | #> [1] chr1 10-30 + | s1,s2,s2 TRUE PASS #> [2] chr1 87-105 + | s2,s2,s1 TRUE PASS #> ------- #> seqinfo: 1 sequence from an unspecified genome; no seqlengths
sessionInfo() #> R version 4.2.0 RC (2022-04-19 r82224) #> Platform: x86_64-pc-linux-gnu (64-bit) #> Running under: Ubuntu 20.04.4 LTS #> #> Matrix products: default #> BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so #> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so #> #> locale: #> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C #> [3] LC_TIME=en_GB LC_COLLATE=C #> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 #> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C #> [9] LC_ADDRESS=C LC_TELEPHONE=C #> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C #> #> attached base packages: #> [1] stats4 stats graphics grDevices utils datasets methods #> [8] base #> #> other attached packages: #> [1] GenomicRanges_1.48.0 GenomeInfoDb_1.32.0 IRanges_2.30.0 #> [4] S4Vectors_0.34.0 BiocGenerics_0.42.0 fcScan_1.10.0 #> [7] BiocStyle_2.24.0 #> #> loaded via a namespace (and not attached): #> [1] bitops_1.0-7 matrixStats_0.62.0 #> [3] bit64_4.0.5 filelock_1.0.2 #> [5] doParallel_1.0.17 progress_1.2.2 #> [7] httr_1.4.2 tools_4.2.0 #> [9] bslib_0.3.1 utf8_1.2.2 #> [11] R6_2.5.1 DBI_1.1.2 #> [13] tidyselect_1.1.2 prettyunits_1.1.1 #> [15] bit_4.0.4 curl_4.3.2 #> [17] compiler_4.2.0 cli_3.3.0 #> [19] Biobase_2.56.0 xml2_1.3.3 #> [21] DelayedArray_0.22.0 rtracklayer_1.56.0 #> [23] bookdown_0.26 sass_0.4.1 #> [25] rappdirs_0.3.3 stringr_1.4.0 #> [27] digest_0.6.29 Rsamtools_2.12.0 #> [29] rmarkdown_2.14 XVector_0.36.0 #> [31] pkgconfig_2.0.3 htmltools_0.5.2 #> [33] MatrixGenerics_1.8.0 highr_0.9 #> [35] dbplyr_2.1.1 fastmap_1.1.0 #> [37] BSgenome_1.64.0 rlang_1.0.2 #> [39] RSQLite_2.2.12 jquerylib_0.1.4 #> [41] BiocIO_1.6.0 generics_0.1.2 #> [43] jsonlite_1.8.0 BiocParallel_1.30.0 #> [45] dplyr_1.0.8 VariantAnnotation_1.42.0 #> [47] RCurl_1.98-1.6 magrittr_2.0.3 #> [49] GenomeInfoDbData_1.2.8 Matrix_1.4-1 #> [51] Rcpp_1.0.8.3 fansi_1.0.3 #> [53] lifecycle_1.0.1 stringi_1.7.6 #> [55] yaml_2.3.5 SummarizedExperiment_1.26.0 #> [57] zlibbioc_1.42.0 plyr_1.8.7 #> [59] BiocFileCache_2.4.0 grid_4.2.0 #> [61] blob_1.2.3 parallel_4.2.0 #> [63] crayon_1.5.1 lattice_0.20-45 #> [65] Biostrings_2.64.0 GenomicFeatures_1.48.0 #> [67] hms_1.1.1 KEGGREST_1.36.0 #> [69] knitr_1.38 pillar_1.7.0 #> [71] rjson_0.2.21 codetools_0.2-18 #> [73] biomaRt_2.52.0 XML_3.99-0.9 #> [75] glue_1.6.2 evaluate_0.15 #> [77] BiocManager_1.30.17 png_0.1-7 #> [79] vctrs_0.4.1 foreach_1.5.2 #> [81] purrr_0.3.4 assertthat_0.2.1 #> [83] cachem_1.0.6 xfun_0.30 #> [85] restfulr_0.0.13 tibble_3.1.6 #> [87] iterators_1.0.14 GenomicAlignments_1.32.0 #> [89] AnnotationDbi_1.58.0 memoise_2.0.1 #> [91] ellipsis_0.3.2