1Version Info

R version: R version 4.2.0 RC (2022-04-19 r82224)
Bioconductor version: 3.15
Package version: 1.10.0

2Introduction

生物信息编码的DNA序列is often organized into independent modules or clusters. For instance, the eukaryotic system of expression relies on combinations of homotypic or heterotypic transcription factors (TFs) which play an important role in activating and repressing target genes. Identifying clusters of genomic features is a frequent task and includes application such as genomic regions enriched for the presence of a combination of transcription factor binding sites or enriched for the presence of mutations often referred as regions with increase mutational hotspots.

fcScan is designed to detect clusters of genomic features, based on user defined search criteria. Such criteria include:

A Path to BED/VCF files, a dataframe or a GRanges object as input (required)
A window size specifying the maximum cluster size allowed (required)
The combination of features required and/or to exclude (required)
The order of features required within identified clusters (optional)
The allowed distance or overlap between identified clusters (optional)
The strand(s) to build the clusters on (optional)
The seqname(s) to build the clusters on (optional)
The sites orientation to build the clusters on (optional)
The distance between site in the cluster (optional)

fcScan is designed to handle large number of genomic features (i.e data generated by High Throughput Sequencing).

3Dependencies

fcScan depends on the following packages:

stats
plyr
utils
rtracklayer
SummarizedExperiment
IRanges
VariantAnnotation
GenomicRanges

4Overview

Currently, fcScan has one main function, thegetClusterfunction. Additional functionality will be added for future releases including cross-species identification of orthologous clusters.

5Input arguments for getCluster

5.1Input data (x)

The input forgetCluster给出through the parameterx. This function accepts input in data frame, GRanges as well as vector of files in BED and VCF (compressed or not) formats. BED and VCF files are loaded by packagesrtracklayerandVariantAnnotationrespectively. There is no limit to the number of files the user can define.

When input is data frame or GRanges objects are given, they should contain the following “named” 5 columns:

seqnames: Contains the seqname name
start: Contains the start coordinates
end: Contains the end coordinates
strand:包含链相对于每个站点
site: Contains the category name relative to genomic site

seqnames	start	end	strand	site
chr1	10	15	+	a
chr1	17	20	+	b
chr1	25	30	+	b
chr1	27	35	+	a
chr1	32	40	+	c
chr1	41	48	+	b

Theseqnames,startandendcolumns define the name of the chromosome, the starting and ending position of the feature in the chromosome, respectively. Thestrandcolumn defines the strand. Thesitecolumn contains the ID of the site and will be used for clustering. Thestartandendcolumns are numeric while the remaining columns are characters.

Note: when input is data frame, the data is consideredzero-basedas in BED format.

5.2Window Size (w)

Window size is set usingw. It defines themaximumsize of the clusters.

5.3Condition (c)

The clustering conditioncdefines the number and name of genomic features to search for based on features names defined in thesitecolumnc = c("a" = 1, "b" = 2)
This searches for clusters with 3 sites, Oneasite and twobsites.

Another way of writing the condition, only if input is a vector to file paths, is the followingx = ("a.bed", "b.bed"), c = c(1,2)
Given 2 files,a.bedandb.bed, this condition states that the user is looking for clusters made from 1 “a” site and 2 “b” sites. In this case, the order of sites defined incis relative to the order of files.

When input is a data frame or GRanges object (instead of files), the user needs toexplicitlydefine the site names along with the desired number relative to each site. For instance, giving the condition asc = c(1,2)for a data frame or GRanges is not allowed.

x = dataframe, c = c("a" = 1, "b" = 2)whereaandbare valid site names insitecolumn in the dataframe/GRanges

Users can exclude clusters containing a specific site(s). This is done by specifying zero0in the condition asc = c("a" = 1, "b" = 2, "c" = 0). In this case, any cluster(s) containingcsite will be excluded even if it has 1aand 2bsites.

5.4Seqnames (seqnames) and Strand (s).

By default, clustering will be performed on both strands and on all seqnames unless specified by the user using thesandseqnamesarguments to limit the search on a specific strand and/or seqname.

Users can choose to cluster on one specific seqname(seqnames = "chr1"), or on several seqnames(seqnames = c("chr1","chr3","chr4"))
(Default forseqnamesisNULL) meaning that clustering on all seqnames will be performed.

Fors, the values allowed are:

+: Build clusters on positive strand
-在负链:构建集群
*: Clusters are not strand specific
(Default is set to*)

5.5Overlap (overlap)

The gap/overlap between adjacent clusters, and not sites, can be controled using theoverlapoption. Whenoverlapis a positive integer, adjacent clusters will be separated by a minimum of the given value. Whenoverlapis negative, adjacent clusters will be allowed to overlap by a maximum of the given value. (Default is set to0)

5.6Greedy vs Non-Greedy (greedy)

greedyallows the user to control the number of genomic features found in clusters.

Whengreedy = FALSE,getClusterwill build clusters having the required window size and will labelTRUEthe ones that contain the确切的number of sites provided in the condition argument. Clusters having the user defined window size but not satisfying the condition will be labelled asFALSE.

Whengreedy = TRUE, additional sites defined in condition will be added to the cluster as long as the cluster size is below the defined window size. (Default is set toFALSE)

5.7Order (order)

Theorderoption defines the desired order of the sites within identified clusters. For instance,order = c("a","b","b")will search for clusters containing 1aand 2bsites and checks if they are in the specified order. Clusters with 1aand 2bsites that do not contain the specified order will be rejected. When greedy is set toTRUE, order can be satisfied if a subcluster contains the desired order. For example if a cluster hasa, a, b, b, bsites, it satisfies the required order (a, b, b) and therefore will be considered as a correct cluster . (Default is set toNULL)

5.8Sites Orientation (site_orientation)

Thesite_orientationoption defines the orientation or strandness of sites in the found clusters. This option cannot be used iforderisNULL.site_orientationshould be specified for each site inorder. For instance, iforder = c("a","b","b"), we can definesite_orientationfor each site respectively as follow:site_orientation = c("+","-","-"). The cluster will be correct if it satisfies the required order and sites orientation. (Default is set toNULL)

5.9Distance between sites in clusters (site_overlap)

Thesite_overlapoption defines the maximum or minimum distance allowed between sites in the cluster. Whensite_overlapis a positive integer, sites within a cluster should haveminimum distance and above, then the cluster is consideredTRUEcluster. Whensite_overlapis a negative integer, sites within a cluster should havemax distance and below, then the cluster is consideredTRUEcluster. Whensite_overlapis zero, distance between sites is not taken into consideration. (Default is set to0)

5.10Clustering option (cluster_by)

Thecluster_byoption allows the usage of different ways in scanning for sites. Using this option, scanning does not ‘jump’ clusters found.cluster_bycan bestartsEnds,endsStarts,starts,ends, ormiddles. If we choosestartsEnds, the clustering begins at the start of the first site and it ends at the end of the last site within the window size. If we chooseendsStarts, scanning begins at the end of the first site and it ends at the start of the last site within the window size. if we choosestarts, the clustering begins at the start of the first site and it ends at the start of the last site within the window size. if we chooseends, the clustering begins at the end of the first site and it ends at the end of the last site within the window size. Also, if we choosemiddles, the clustering begins at the middle of the first site and it ends at the middle of the last site within the window size. (Default is set tostartsEnds)

5.11Overlapping Clusters option (allow_clusters_overlap)

Theallow_clusters_overlapallows the user whether to include overlapping clusters or not. (Default is set to FALSE)

5.12Include partially overlapping sites (include_partial_sites)

This option can be only used in case where the “cluster_by” choosen is “startsEnds” or “ends”, otherwise it is ignored. If it is set toTRUE, partially overlapping sites within window size (at the end of the cluster) are allowed. However, if it is set toFALSE, partially overlapping sites are not included in the cluster found. (Default is set toFALSE)

5.13Control addition of partially overlapping sites (partial_overlap_percentage)

This option allows the user to choose which partially overlapping site are included. The amount of overlap between the last site found and the window size controls the addition or removal of those sites. For example, ifpartial_overlap_percentageis set to 0.6, this means that the window size should overlap this site by 60% of its size. However, if it fails, this site won’t be included in the cluster found. This option can be only used in case where the “cluster_by” chosen is “startsEnds” or “ends”, otherwise it is ignored. (Default is set toNULL)

5.14多线程(线程)

This option allows to distribute the run over multiple threads. It is recommended to use multiple threads when the input dataset is large. Whenthreadsis set to 0,getClusterreserves the adequate number of threads and performs the run. This will dramatically increase the speed ofgetCluster. The user can choose the number of threads between 1 and the maximum number of available threads. (Default is set to1)

5.15Verbose (verbose)

Theverboseoption allows the printing of additional messages and the list of clusters that failed for lack of correct combination of sites. This option is used mainly for debugging purposes. (Default is set to FALSE)

6Output of getCluster

The output ofgetClusteris a GRanges object with fields:

seqnames: The seqname on which a cluster is found
ranges: The ranges of the cluster
strand: The strand of the cluster, if any
sites: The combination of sites that define the cluster
isCluster: A logical indicating if the cluster is TRUE or FALSE
status: Describes the reason behind the rejection of a cluster

The algorithm returns all clusters containing the correct count of sites/features, unless verbose is set toTRUE. If the combination, overlap and order options are satisfied, the cluster is considered aTRUEcluster.

Thestatusof a cluster can be either PASS, excludedSites, orderFail, siteOrientation or siteOverlapPASSis a cluster that satisfied the desired combination, overlap, order and sites orientation.orderFailis a cluster that had the required combination but did not satisfy the required order of sites.excludedSitesis a cluster that had the required combination and order but it has one or more sites to exclude.siteOrientationis a cluster that had the required combination and order but it has one or more sites with different orientation than requested.siteOverlapis a cluster that meets all the requirements but the distance between its sites is not respected.

注意:如果user is usinggreedy = FALSEandordercontains values more than in the condition parameter (c), an error will be raised. However, ifgreedy = TRUE, then usingorderwith more values than the condition parameter is allowed since the cluster may contain more sites than the requiredccondition as long as the window size is satisfied.

7Example on Clustering

Example usinggetCluster:

getClusterlooks for desired genomic regions of interest like transcription factor binding sites, within a window size and specific condition.

This function accepts a data frame and GRanges object. getCluster also accepts BED or VCF (or mix of both) files as input. The output ofgetClusteris a GRanges object that contains the genomic coordinates (seqnames, ranges and strand) and three metadata columns:

sites: contains clusters of sites that conforms with the conditioncspecified ingetCluster.
isCluster:TRUEif the cluster found conform with the conditioncand theorder(if indicated in condition) andFALSEif the cluster fails to conform with the condition or order.
status:PASSifisClusterequalsTRUE. However, ifisClusterisFALSE, status shows why the found cluster is not aTRUEcluster. If the order of sites is not respected in the found cluster, status would returnOrderFail. in Addition, if the cluster found contains non desired sites, it returnsexcludedSites. Moreover, if the sites orientation is not respected in found cluster, status would returnsiteOrientation. Finally , if the distance between sites in the cluster found does not conform to the condition specified by the user, status would returnsiteOverlap.

In this example, we askgetClusterto look for clusters that contains one site “s1”, one site “s2” and zero “s3” sites. In addition, we requested clusters to have sites in the order s1,s2 and having orientation “+”,“+” respectively with minimum distance between sites of 2bp.

x1 = data.frame(seqnames = rep("chr1", times = 17), start = c(1,10,17,25,27,32,41,47,60,70,87,94,99,107,113,121,132), end = c(8,15,20,30,35,40,48,55,68,75,93,100,105,113,120,130,135), strand = c("+","+","+","+","+","+","+","+","+", "+","+","+","+","+","+","+","-"), site = c("s3","s1","s2","s2","s1","s2","s1","s1","s2","s1","s2", "s2","s1","s2","s1","s1","s2")) clusters = getCluster(x1, w = 20, c = c("s1" = 1, "s2" = 1, "s3" = 0), greedy = TRUE, order = c("s1","s2"), site_orientation=c("+","+"), site_overlap = 2, verbose = TRUE) #> 17 entries loaded #> Running getCluster using 1 threads #> Time difference of 1.189334 secs clusters #> GRanges object with 9 ranges and 3 metadata columns: #> seqnames ranges strand | sites isCluster status #>    |    #> [1] chr1 2-20 * | s3,s1,s2 FALSE excludedSites #> [2] chr1 11-30 * | s1,s2,s2 TRUE PASS #> [3] chr1 33-48 * | s2,s1 FALSE orderFail #> [4] chr1 48-68 * | s1,s2 TRUE PASS #> [5] chr1 88-105 * | s2,s2,s1 FALSE orderFail #> [6] chr1 95-113 * | s2,s1,s2 FALSE siteOverlap #> [7] chr1 100-120 * | s1,s2,s1 FALSE siteOverlap #> [8] chr1 108-120 * | s2,s1 FALSE orderFail #> [9] chr1 122-135 * | s1,s2 FALSE siteOrientation #> ------- #> seqinfo: 1 sequence from an unspecified genome; no seqlengths

Another example but using GRanges as input: in this example, we askgetClusterto look for clusters that contains one sites1and two sitess2within a window size of 25 bp. Also, we requested clusters to be searched as+strand.

suppressMessages(library(GenomicRanges)) x = GRanges( seqnames = Rle("chr1", 16), ranges = IRanges(start = c(10L,17L,25L,27L,32L,41L,47L, 60L,70L,87L,94L,99L,107L,113L,121L,132L), end = c(15L,20L,30L,35L,40L,48L,55L,68L,75L,93L,100L,105L, 113L,120L,130L,135L)), strand = Rle("+",16), site = c("s1","s2","s2","s1","s2","s1","s1","s2", "s1","s2","s2","s1","s2","s1","s1","s2")) clusters = getCluster(x, w = 25, c = c("s1"=1,"s2"=2), s = "+") #> 16 entries loaded #> Running getCluster using 1 threads #> Time difference of 0.1115539 secs clusters #> GRanges object with 2 ranges and 3 metadata columns: #> seqnames ranges strand | sites isCluster status #>    |    #> [1] chr1 10-30 + | s1,s2,s2 TRUE PASS #> [2] chr1 87-105 + | s2,s2,s1 TRUE PASS #> ------- #> seqinfo: 1 sequence from an unspecified genome; no seqlengths

8Session Info

sessionInfo() #> R version 4.2.0 RC (2022-04-19 r82224) #> Platform: x86_64-pc-linux-gnu (64-bit) #> Running under: Ubuntu 20.04.4 LTS #> #> Matrix products: default #> BLAS: /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so #> LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so #> #> locale: #> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C #> [3] LC_TIME=en_GB LC_COLLATE=C #> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 #> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C #> [9] LC_ADDRESS=C LC_TELEPHONE=C #> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C #> #> attached base packages: #> [1] stats4 stats graphics grDevices utils datasets methods #> [8] base #> #> other attached packages: #> [1] GenomicRanges_1.48.0 GenomeInfoDb_1.32.0 IRanges_2.30.0 #> [4] S4Vectors_0.34.0 BiocGenerics_0.42.0 fcScan_1.10.0 #> [7] BiocStyle_2.24.0 #> #> loaded via a namespace (and not attached): #> [1] bitops_1.0-7 matrixStats_0.62.0 #> [3] bit64_4.0.5 filelock_1.0.2 #> [5] doParallel_1.0.17 progress_1.2.2 #> [7] httr_1.4.2 tools_4.2.0 #> [9] bslib_0.3.1 utf8_1.2.2 #> [11] R6_2.5.1 DBI_1.1.2 #> [13] tidyselect_1.1.2 prettyunits_1.1.1 #> [15] bit_4.0.4 curl_4.3.2 #> [17] compiler_4.2.0 cli_3.3.0 #> [19] Biobase_2.56.0 xml2_1.3.3 #> [21] DelayedArray_0.22.0 rtracklayer_1.56.0 #> [23] bookdown_0.26 sass_0.4.1 #> [25] rappdirs_0.3.3 stringr_1.4.0 #> [27] digest_0.6.29 Rsamtools_2.12.0 #> [29] rmarkdown_2.14 XVector_0.36.0 #> [31] pkgconfig_2.0.3 htmltools_0.5.2 #> [33] MatrixGenerics_1.8.0 highr_0.9 #> [35] dbplyr_2.1.1 fastmap_1.1.0 #> [37] BSgenome_1.64.0 rlang_1.0.2 #> [39] RSQLite_2.2.12 jquerylib_0.1.4 #> [41] BiocIO_1.6.0 generics_0.1.2 #> [43] jsonlite_1.8.0 BiocParallel_1.30.0 #> [45] dplyr_1.0.8 VariantAnnotation_1.42.0 #> [47] RCurl_1.98-1.6 magrittr_2.0.3 #> [49] GenomeInfoDbData_1.2.8 Matrix_1.4-1 #> [51] Rcpp_1.0.8.3 fansi_1.0.3 #> [53] lifecycle_1.0.1 stringi_1.7.6 #> [55] yaml_2.3.5 SummarizedExperiment_1.26.0 #> [57] zlibbioc_1.42.0 plyr_1.8.7 #> [59] BiocFileCache_2.4.0 grid_4.2.0 #> [61] blob_1.2.3 parallel_4.2.0 #> [63] crayon_1.5.1 lattice_0.20-45 #> [65] Biostrings_2.64.0 GenomicFeatures_1.48.0 #> [67] hms_1.1.1 KEGGREST_1.36.0 #> [69] knitr_1.38 pillar_1.7.0 #> [71] rjson_0.2.21 codetools_0.2-18 #> [73] biomaRt_2.52.0 XML_3.99-0.9 #> [75] glue_1.6.2 evaluate_0.15 #> [77] BiocManager_1.30.17 png_0.1-7 #> [79] vctrs_0.4.1 foreach_1.5.2 #> [81] purrr_0.3.4 assertthat_0.2.1 #> [83] cachem_1.0.6 xfun_0.30 #> [85] restfulr_0.0.13 tibble_3.1.6 #> [87] iterators_1.0.14 GenomicAlignments_1.32.0 #> [89] AnnotationDbi_1.58.0 memoise_2.0.1 #> [91] ellipsis_0.3.2

fcScan: features cluster Scan

26 April 2022

Contents