1 Introduction

TADCompare is an R package for differential analysis of TAD boundaries. It is designed to work with a wide range of Hi-C data formats and resolutions. The TADCompare package contains four functions: TADCompare, TimeCompare, ConsensusTADs, and DiffPlot. The TADCompare function identifies differential TAD boundaries between two contact matrices. The TimeCompare function takes a set of contact matrices, one matrix per time point, identifies TAD boundaries, and classifies how they change over time. The ConsensusTADs function takes a list of TADs and identifies a consensus of TAD boundaries across all matrices using our novel consensus boundary score. DiffPlot allows for visualization of TAD boundary differences between two matrices. The required input includes matrices in sparse 3-column, \(n \times n\), or \(n \times (n+3)\) formats. This vignette provides a complete overview of input data formats.

2 Getting Started

2.1 Installation

BiocManager::install("TADCompare")

library(dplyr)

##
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
##
##     filter, lag

## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

library(SpectralTAD)
library(TADCompare)

3 Working with different types of data

3.1 Working with \(n \times n\) matrices

\(n \times n\) contact matrices are most commonly associated with data coming from the Bing Ren lab (http://chromosome.sdsc.edu/mouse/hi-c/download.html). These contact matrices are square and symmetric, with entry \(ij\) corresponding to the number of contacts between region \(i\) and region \(j\). Below is an example of a \(5 \times 5\) region of an \(n \times n\) contact matrix derived from the GM12878 cell line (Rao et al. 2014), chromosome 22, 50kb resolution. Note the symmetry around the diagonal, the typical shape of a chromatin interaction matrix. The figure was created using the pheatmap package.
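As a minimal sketch of this layout, the base R snippet below builds a toy \(4 \times 4\) contact matrix. The counts are copied from the 4-bin chromosome 22 excerpt shown for the \(n \times (n+3)\) format in this vignette; using bin start coordinates as row and column names is an assumption made here for illustration.

```r
# Toy 4 x 4 contact matrix; counts copied from the chr22 excerpt in this vignette
bins <- c(18500000L, 18550000L, 18600000L, 18650000L)  # bin starts, 50kb apart
cont_mat <- matrix(
  c(13313,  4817,  1664,  96,
     4817, 15500,  5120, 178,
     1664,  5120, 11242, 316,
       96,   178,   316, 162),
  nrow = 4, byrow = TRUE,
  dimnames = list(bins, bins)  # assumption: names are bin start coordinates
)

# Entry [i, j] equals entry [j, i], so the matrix is symmetric
is_sym <- isSymmetric(cont_mat)

# The resolution can be recovered from the spacing of the bin coordinates
resolution <- unique(diff(as.numeric(colnames(cont_mat))))
```

Here `is_sym` and `resolution` are only convenience checks; they are not part of the TADCompare interface.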

3.2 Working with \(n \times (n+3)\) matrices

\(n \times (n+3)\) matrices are commonly associated with the TopDom TAD caller (http://zhoulab.usc.edu/TopDom/). These matrices consist of an \(n \times n\) matrix with three additional leading columns containing the chromosome, the start of the region, and the end of the region. Regions in this case are determined by the resolution of the data. A subset of a typical \(n \times (n+3)\) matrix is shown below.

##     chr    start      end X18500000 X18550000 X18600000 X18650000
## 1 chr22 18500000 18550000     13313      4817      1664        96
## 2 chr22 18550000 18600000      4817     15500      5120       178
## 3 chr22 18600000 18650000      1664      5120     11242       316
## 4 chr22 18650000 18700000        96       178       316       162
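The relationship between the two dense formats can be sketched in base R as follows. The example builds a toy \(3 \times 3\) matrix from the counts above, prepends the three coordinate columns, and then drops them again; the variable names are illustrative only.

```r
# Toy 3 x 3 contact matrix with bin start coordinates as column names
starts <- c(18500000L, 18550000L, 18600000L)
resolution <- 50000L
mat <- matrix(c(13313,  4817,  1664,
                 4817, 15500,  5120,
                 1664,  5120, 11242),
              nrow = 3, byrow = TRUE,
              dimnames = list(NULL, starts))

# Prepend chromosome, start, and end columns to obtain the n x (n+3) layout
mat_n3 <- data.frame(chr = "chr22",
                     start = starts,
                     end = starts + resolution,
                     mat)

# Dropping the three leading columns recovers the n x n contact counts
recovered <- as.matrix(mat_n3[, -(1:3)])
```

Note that `data.frame()` prefixes the numeric column names with `X`, which is why the printed excerpt shows headers such as `X18500000`.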

3.3 Working with sparse 3-column matrices

Sparse 3-column matrices are matrices where the first and second columns refer to region \(i\) and region \(j\) of the chromosome, and the third column is the number of contacts between them. This style is becoming increasingly popular and is associated with raw data from the Lieberman-Aiden lab (e.g., https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE63525). It is also the output format of the Juicer tool (Durand et al. 2016). 3-column matrices are handled internally in the package by converting them to \(n \times n\) matrices using the HiCcompare package's sparse2full() function. The first six rows of a typical sparse 3-column matrix are shown below.

##     region1  region2 IF
## 1: 16050000 16050000 12
## 2: 16200000 16200000  4
## 3: 16150000 16300000  1
## 4: 16200000 16300000  1
## 5: 16250000 16300000  1
## 6: 16300000 16300000 10
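To make the sparse-to-full conversion concrete, here is a minimal base R sketch applied to the six rows above. The helper `sparse_to_full()` is hypothetical and written only for illustration; it is not the HiCcompare implementation of sparse2full(), and it only creates bins that actually appear in the input.

```r
# The six rows from the excerpt above
sparse <- data.frame(
  region1 = c(16050000L, 16200000L, 16150000L, 16200000L, 16250000L, 16300000L),
  region2 = c(16050000L, 16200000L, 16300000L, 16300000L, 16300000L, 16300000L),
  IF      = c(12, 4, 1, 1, 1, 10)
)

# Hypothetical helper: expand a sparse upper-triangular table into a
# full symmetric matrix (illustration only, not HiCcompare::sparse2full)
sparse_to_full <- function(sp) {
  bins <- sort(unique(c(sp$region1, sp$region2)))
  full <- matrix(0, nrow = length(bins), ncol = length(bins),
                 dimnames = list(bins, bins))
  i <- match(sp$region1, bins)
  j <- match(sp$region2, bins)
  full[cbind(i, j)] <- sp$IF
  full[cbind(j, i)] <- sp$IF  # mirror across the diagonal
  full
}

full_mat <- sparse_to_full(sparse)
```

Pairs of regions that are absent from the sparse table are treated as zero contacts in the resulting matrix.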

3.4 Working with other data types

3.5 Working with .hic files

.hic files are a common file format generally associated with the lab of Erez Lieberman-Aiden (http://aidenlab.org/data.html). To use .hic files, follow these steps.

Download straw from https://github.com/aidenlab/straw/ and follow the installation instructions.
Download .hic data files. Here, we use data from Rao et al. 2014 and download them using the following commands:

wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE63nnn/GSE63525/suppl/GSE63525_GM12878_insitu_primary_30.hic

wget ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE63nnn/GSE63525/suppl/GSE63525_GM12878_insitu_replicate.hic

Extract chromosome 22 at 50kb resolution with no normalization:

./straw NONE GSE63525_GM12878_insitu_primary_30.hic 22 22 BP 50000 > primary.chr22.50kb.txt

./straw NONE GSE63525_GM12878_insitu_replicate.hic 22 22 BP 50000 > replicate.chr22.50kb.txt

Analyze normally:

# Read in data
primary = read.table('primary.chr22.50kb.txt', header = FALSE)
replicate = read.table('replicate.chr22.50kb.txt', header = FALSE)
# Run TADCompare
tad_diff = TADCompare(primary, replicate, resolution = 50000)
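Before passing extracted files to TADCompare, it can help to confirm that they really are sparse 3-column matrices at the expected resolution. The sketch below is one way to do this in base R. It reads inline text in place of a file, reusing three rows from the sparse excerpt earlier in this vignette, and the helper `check_sparse()` is hypothetical rather than part of any package.

```r
# Text standing in for an extracted file (values reused from the sparse
# excerpt earlier in this vignette)
straw_txt <- "16050000\t16050000\t12
16200000\t16200000\t4
16150000\t16300000\t1"

# read.table(text = ...) mimics reading the file without needing it on disk
primary <- read.table(text = straw_txt, header = FALSE,
                      col.names = c("region1", "region2", "IF"))

# Hypothetical helper: basic sanity checks on a sparse 3-column matrix
check_sparse <- function(sp, resolution) {
  ncol(sp) == 3 &&
    all(sp[[1]] %% resolution == 0) &&  # region 1 lies on bin boundaries
    all(sp[[2]] %% resolution == 0) &&  # region 2 lies on bin boundaries
    all(sp[[3]] >= 0)                   # contact counts are non-negative
}

ok_50kb <- check_sparse(primary, resolution = 50000)
ok_40kb <- check_sparse(primary, resolution = 40000)
```

A mismatch such as `ok_40kb` being `FALSE` suggests that the resolution argument does not match the resolution used during extraction.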

3.6 Working with .cool files

Users can also compare TADs using data output by cooler (http://cooler.readthedocs.io/en/latest/index.html) and HiC-Pro (https://github.com/nservant/HiC-Pro) with minor pre-processing using the HiCcompare package.

The cooler software can be downloaded from https://mirnylab.github.io/cooler/. A catalog of popular Hi-C datasets can be found at ftp://cooler.csail.mit.edu/coolers. We can extract chromatin interaction data from .cool files using the following steps:

Follow the instructions to install the cooler software, https://mirnylab.github.io/cooler/
Download the first contact matrix: wget ftp://cooler.csail.mit.edu/coolers/hg19/Zuin2014-HEK293CtcfControl-HindIII-allreps-filtered.50kb.cool
Convert the first matrix to a text file using cooler dump --join Zuin2014-HEK293CtcfControl-HindIII-allreps-filtered.50kb.cool > Zuin.HEK293.50kb.Control.txt
Download the second contact matrix: wget ftp://cooler.csail.mit.edu/coolers/hg19/Zuin2014-HEK293CtcfDepleted-HindIII-allreps-filtered.50kb.cool
Convert the second matrix to a text file using cooler dump --join Zuin2014-HEK293CtcfDepleted-HindIII-allreps-filtered.50kb.cool > Zuin.HEK293.50kb.Depleted.txt
Run the code below

# Read in data
cool_mat1 <- read.table("Zuin.HEK293.50kb.Control.txt")
cool_mat2 <- read.table("Zuin.HEK293.50kb.Depleted.txt")
# Convert to sparse 3-column matrix using cooler2sparse from HiCcompare
sparse_mat1 <- HiCcompare::cooler2sparse(cool_mat1)
sparse_mat2 <- HiCcompare::cooler2sparse(cool_mat2)
# Run TADCompare
diff_tads = lapply(names(sparse_mat1), function(x) {
  TADCompare(sparse_mat1[[x]], sparse_mat2[[x]], resolution = 50000)
})

3.7 Working with HiC-Pro files

HiC-Pro data is represented as two files, the .matrix file and the .bed file. The .bed file contains four columns (chromosome, start, end, ID). The .matrix file is a three-column matrix where the first and second columns contain region IDs that map back to the coordinates in the bed file, and the third column contains the number of contacts between the two regions. In this example, we analyze two matrix files, sample1_100000.matrix and sample2_100000.matrix, and their corresponding bed files, sample1_100000_abs.bed and sample2_100000_abs.bed. We do not include HiC-Pro data in the package, so these serve as placeholders for the traditional files output by HiC-Pro. The steps for analyzing these files are shown below:

# Read in both files
mat1 <- read.table("sample1_100000.matrix")
bed1 <- read.table("sample1_100000_abs.bed")
# Matrix 2
mat2 <- read.table("sample2_100000.matrix")
bed2 <- read.table("sample2_100000_abs.bed")
# Convert to modified bed format
sparse_mats1 <- HiCcompare::hicpro2bedpe(mat1, bed1)
sparse_mats2 <- HiCcompare::hicpro2bedpe(mat2, bed2)
# Remove empty matrices if necessary
# sparse_mats$cis = sparse_mats$cis[sapply(sparse_mats, nrow) != 0]
# Go through all pairwise chromosomes and run TADCompare
sparse_tads = lapply(seq_along(sparse_mats1$cis), function(z) {
  x <- sparse_mats1$cis[[z]]
  y <- sparse_mats2$cis[[z]]
  # Pull out chromosome
  chr <- x[, 1][1]
  # Subset to make three-column matrix
  x <- x[, c(2, 5, 7)]
  y <- y[, c(2, 5, 7)]
  # Run TADCompare
  comp <- TADCompare(x, y, resolution = 100000)
  # Name the elements so that $comp and $chr can be extracted below
  return(list(comp = comp, chr = chr))
})
# Pull out differential TAD results
diff_res <- lapply(sparse_tads, function(x) x$comp)
# Pull out chromosomes
chr <- lapply(sparse_tads, function(x) x$chr)
# Name list by corresponding chr
names(diff_res) <- chr
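The mapping from region IDs in the .matrix file to coordinates in the .bed file can be sketched in base R as follows. The toy values and column names are invented for illustration and do not come from real HiC-Pro output; in practice, the conversion is handled by hicpro2bedpe as shown above.

```r
# Toy stand-in for a HiC-Pro .bed file (chromosome, start, end, ID);
# values are made up for illustration
bed <- data.frame(chr   = c("chr1", "chr1", "chr1"),
                  start = c(0L, 100000L, 200000L),
                  end   = c(100000L, 200000L, 300000L),
                  id    = 1:3)

# Toy stand-in for a .matrix file (ID of region 1, ID of region 2, contacts)
mat <- data.frame(id1 = c(1L, 1L, 2L, 3L),
                  id2 = c(1L, 2L, 3L, 3L),
                  IF  = c(50, 20, 10, 40))

# Look up each region ID in the bed file to recover genomic coordinates
idx1 <- match(mat$id1, bed$id)
idx2 <- match(mat$id2, bed$id)

# Result: a sparse 3-column matrix keyed by bin start coordinates
sparse <- data.frame(region1 = bed$start[idx1],
                     region2 = bed$start[idx2],
                     IF      = mat$IF)
```

The resulting `sparse` table has the same layout as the sparse 3-column matrices described earlier.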

3.8 Effect of matrix type on runtime

The type of matrix passed to the algorithm can affect its runtime. \(n \times n\) matrices require no conversion and are the fastest. \(n \times (n+3)\) matrices take slightly longer to run due to the need to remove the first 3 columns. Sparse 3-column matrices have the highest runtimes due to the complexity of converting them to an \(n \times n\) matrix. The times are summarized below, holding all other parameters constant.

library(microbenchmark)
# Reading in the primary matrix (needed below, if not already loaded)
data("rao_chr22_prim")
# Reading in the second matrix
data("rao_chr22_rep")
# Converting to sparse
prim_sparse <- HiCcompare::full2sparse(rao_chr22_prim)
rep_sparse <- HiCcompare::full2sparse(rao_chr22_rep)
# Converting to nxn+3
# Primary
prim_n_n_3 <- data.frame(chr = "chr22",
                         start = as.numeric(colnames(rao_chr22_prim)),
                         end = as.numeric(colnames(rao_chr22_prim)) + 50000,
                         rao_chr22_prim)
# Replicate
rep_n_n_3 <- data.frame(chr = "chr22",
                        start = as.numeric(colnames(rao_chr22_rep)),
                        end = as.numeric(colnames(rao_chr22_rep)) + 50000,
                        rao_chr22_rep)
# Running each input type once
# Sparse
sparse <- TADCompare(cont_mat1 = prim_sparse, cont_mat2 = rep_sparse, resolution = 50000)
# NxN
n_by_n <- TADCompare(cont_mat1 = rao_chr22_prim, cont_mat2 = rao_chr22_rep, resolution = 50000)
# Nx(N+3)
n_by_n_3 <- TADCompare(cont_mat1 = prim_n_n_3, cont_mat2 = rep_n_n_3, resolution = 50000)
# Benchmarking different input types
bench <- microbenchmark(
  # Sparse
  sparse <- TADCompare(cont_mat1 = prim_sparse, cont_mat2 = rep_sparse, resolution = 50000),
  # NxN
  n_by_n <- TADCompare(cont_mat1 = rao_chr22_prim, cont_mat2 = rao_chr22_rep, resolution = 50000),
  # Nx(N+3)
  n_by_n_3 <- TADCompare(cont_mat1 = prim_n_n_3, cont_mat2 = rep_n_n_3, resolution = 50000),
  times = 5, unit = "s"
)
summary_bench <- summary(bench) %>% dplyr::select(mean, median)
rownames(summary_bench) <- c("sparse", "n_by_n", "n_by_n_3")
summary_bench

##               mean    median
## sparse   0.3675062 0.2253169
## n_by_n   0.1239759 0.1239285
## n_by_n_3 0.1470194 0.1423968

The table above shows the mean and median runtimes, in seconds, for the different types of contact matrices. TADCompare is fast for all three input types, taking well under a second on this chromosome 22 example. However, sparse matrix inputs slow the algorithm down, and this can become more apparent as the size of the contact matrices increases.

4 Session Info

sessionInfo()

## R version 4.2.0 RC (2022-04-19 r82224)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.15-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.15-bioc/R/lib/libRlapack.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_GB              LC_COLLATE=C
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] microbenchmark_1.4.9 TADCompare_1.6.0     SpectralTAD_1.12.0
## [4] dplyr_1.0.8          BiocStyle_2.24.0
##
## loaded via a namespace (and not attached):
##   [1] TH.data_1.1-1         colorspace_2.0-3
##   [3] ggsignif_0.6.3        ellipsis_0.3.2
##   [5] CGHcall_2.58.0        DNAcopy_1.70.0
##   [7] XVector_0.36.0        GenomicRanges_1.48.0
##   [9] ggpubr_0.4.0          listenv_0.8.0
##  [11] mvtnorm_1.1-3         fansi_1.0.3
##  [13] HiCcompare_1.18.0     codetools_0.2-18
##  [15] splines_4.2.0         R.methodsS3_1.8.1
##  [17] impute_1.70.0         knitr_1.38
##  [19] jsonlite_1.8.0        Rsamtools_2.12.0
##  [21] broom_0.8.0           R.oo_1.24.0
##  [23] pheatmap_1.0.12       BiocManager_1.30.17
##  [25] compiler_4.2.0        backports_1.4.1
##  [27] assertthat_0.2.1      Matrix_1.4-1
##  [29] fastmap_1.1.0         limma_3.52.0
##  [31] cli_3.3.0             htmltools_0.5.2
##  [33] tools_4.2.0           gtable_0.3.0
##  [35] glue_1.6.2            GenomeInfoDbData_1.2.8
##  [37] reshape2_1.4.4        Rcpp_1.0.8.3
##  [39] carData_3.0-5         Biobase_2.56.0
##  [41] jquerylib_0.1.4       vctrs_0.4.1
##  [43] Biostrings_2.64.0     rhdf5filters_1.8.0
##  [45] nlme_3.1-157          QDNAseq_1.32.0
##  [47] xfun_0.30             stringr_1.4.0
##  [49] globals_0.14.0        lifecycle_1.0.1
##  [51] gtools_3.9.2          rstatix_0.7.0
##  [53] InteractionSet_1.24.0 future_1.25.0
##  [55] MASS_7.3-57           zoo_1.8-10
##  [57] zlibbioc_1.42.0       scales_1.2.0
##  [59] MatrixGenerics_1.8.0  sandwich_3.0-1
##  [61] parallel_4.2.0        SummarizedExperiment_1.26.0
##  [63] rhdf5_2.40.0          RColorBrewer_1.1-3
##  [65] yaml_2.3.5            gridExtra_2.3
##  [67] ggplot2_3.3.5         sass_0.4.1
##  [69] CGHbase_1.56.0        stringi_1.7.6
##  [71] highr_0.9             S4Vectors_0.34.0
##  [73] BiocGenerics_0.42.0   BiocParallel_1.30.0
##  [75] GenomeInfoDb_1.32.0   rlang_1.0.2
##  [77] pkgconfig_2.0.3       matrixStats_0.62.0
##  [79] bitops_1.0-7          evaluate_0.15
##  [81] lattice_0.20-45       purrr_0.3.4
##  [83] Rhdf5lib_1.18.0       cowplot_1.1.1
##  [85] tidyselect_1.1.2      parallelly_1.31.1
##  [87] plyr_1.8.7            magrittr_2.0.3
##  [89] bookdown_0.26         R6_2.5.1
##  [91] magick_2.7.3          IRanges_2.30.0
##  [93] generics_0.1.2        multcomp_1.4-19
##  [95] DelayedArray_0.22.0   DBI_1.1.2
##  [97] pillar_1.7.0          mgcv_1.8-40
##  [99] survival_3.3-1        abind_1.4-5
## [101] RCurl_1.98-1.6        tibble_3.1.6
## [103] future.apply_1.9.0    PRIMME_3.2-1
## [105] crayon_1.5.1          car_3.0-12
## [107] KernSmooth_2.23-20    utf8_1.2.2
## [109] rmarkdown_2.14        grid_4.2.0
## [111] data.table_1.14.2     marray_1.74.0
## [113] digest_0.6.29         tidyr_1.2.0
## [115] R.utils_2.11.0        stats4_4.2.0
## [117] munsell_0.5.0         bslib_0.3.1

References

Durand, Neva C., Muhammad S. Shamim, Ido Machol, Suhas S.P. Rao, Miriam H. Huntley, Eric S. Lander, and Erez Lieberman Aiden. 2016. “Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments.” Cell Systems 3 (1): 95–98. https://doi.org/10.1016/j.cels.2016.07.002.

Rao, Suhas S.P., Miriam H. Huntley, Neva C. Durand, Elena K. Stamenova, Ivan D. Bochkov, James T. Robinson, Adrian L. Sanborn, et al. 2014. “A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping.” Cell 159 (7): 1665–80. https://doi.org/10.1016/j.cell.2014.11.021.

Input data formats

26 April 2022
