R Package MADSEQ

The MADSEQ package is a group of hierarchical Bayesian model for the detection and quantification of potential mosaic aneuploidy in sample using massive parallel sequencing data.

The MADSEQ package takes two pieces of information for the detection and quantification of mosaic aneuploidy:

MADSEQ works on the whole chromosome resolution. It applies all of the five models (normal, monosomy, mitotic trisomy, meiotic trisomy, loss of heterozygosity) to fit the distribution of the AAF of all the heterozygous sites, and fit the distribution of the coverage from that chromosome. After fitting the same data using all models, it does model comparison using BIC (Bayesian Information Criteria) to select the best model. The model selected tells us whether the chromosome is aneuploid or not, and also the type of mosaic aneuploidy. Then, from the posterior distribution of the best model, we could get the estimation of the fraction of aneuploidy cells.

你需要的数据

Abamfile for your sample: sorted, indexed.
Avcf.gzfile for your sample:
- The vcf should be bgzip compressed (bgzip is a part of tabix package, which can be download fromhere).
- The FORMAT section of the vcf file should contain depth (DP) and allelic depth (AD) field.
Abedfile containing targeted regions: if it’s targeted sequencing data, then the bed file contatining targeted regions (tiled regions is recommended) . If it’s whole genome sequencing, bin the whole genome into equal size bins (i.e. 500kb).

Note:Currently our package only supports one bam and one vcf file per sample. If you have more than one sample, please prepare multiple bam and vcf files for each of them.

Example Data

There are two sets of example data come with the package:

A bam file namedaneuploidy.bamand a vcf file namedaneuploidy.vcf.gzfor a sample containing trisomy in chromosome 18.
A bam file namednormal.bamand a vcf file namesnormal.vcf.gzfor a normal sample.
A bed file namedtarget.bedcontaining the targeted regions.

To access the data use

system.file(“extdata","aneuploidy.bam",package="MADSEQ")system.file(“extdata","aneuploidy.vcf.gz",package="MADSEQ")

Note:This is just a set of example data, only contains a very little region of the genome.

Use The Package

We will start with the bam file, vcf file and bed file in the example data to show you each step for the analysis.

Prepare coverage information

Started with bam file and bed file, you can useprepareCoverageGCfunction to get the coverage and GC information for each targeted regions.

## load the packagesuppressMessages(library("MADSEQ"))## get path to the location of example dataaneuploidy_bam =system.file(“extdata","aneuploidy.bam",package="MADSEQ")normal_bam =system.file(“extdata","normal.bam",package="MADSEQ")target =system.file(“extdata","target.bed",package="MADSEQ")## Note: for your own data, just specify the path to the location## of your file using character.## prepare coverage and GC content for each targeted region# aneuploidy sampleaneuploidy_cov =prepareCoverageGC(target_bed=target,bam=aneuploidy_bam,"hg19")#> 309 non-repeats regions from 24 chromosomes in the bed file.#> calculating depth from BAM...#>#> Attaching package: 'BiocGenerics'#> The following objects are masked from 'package:stats':#>#> IQR, mad, sd, var, xtabs#> The following objects are masked from 'package:base':#>#> Filter, Find, Map, Position, Reduce, anyDuplicated, append,#> as.data.frame, basename, cbind, colnames, dirname, do.call,#> duplicated, eval, evalq, get, grep, grepl, intersect, is.unsorted,#> lapply, mapply, match, mget, order, paste, pmax, pmax.int, pmin,#> pmin.int, rank, rbind, rownames, sapply, setdiff, sort, table,#> tapply, union, unique, unsplit, which.max, which.min#>#> Attaching package: 'S4Vectors'#> The following objects are masked from 'package:base':#>#> I, expand.grid, unname#>#> Attaching package: 'Biostrings'#> The following object is masked from 'package:base':#>#> strsplit#> calculating GC content...# normal samplenormal_cov =prepareCoverageGC(target_bed=target,bam=normal_bam,"hg19")#> 309 non-repeats regions from 24 chromosomes in the bed file.#> calculating depth from BAM...#> calculating GC content...## view the first two rows of prepared coverage data (A GRanges Object)aneuploidy_cov[1:2]#> GRanges object with 2 ranges and 2 metadata columns:# > seqnames范围链|深度GC#>    |  #> [1] chr14 21538016-21538117 * | 21 0.647059#> [2] chr14 22377854-22377955 * | 24 0.362745#> -------#> seqinfo: 24 sequences from an unspecified genome; no seqlengthsnormal_cov[1:2]#> GRanges object with 2 ranges and 2 metadata columns:# > seqnames范围链|深度GC#>    |  #> [1] chr14 21538016-21538117 * | 4 0.647059#> [2] chr14 22377854-22377955 * | 32 0.362745#> -------#> seqinfo: 24 sequences from an unspecified genome; no seqlengths

Normalize coverage information

The normalization function takes prepared coverage GRanges object fromprepareCoverageGCfunction, normalize the coverage and calculate the expected coverage for the sample. If there is only one sample, the function will correct the coverage by GC content, and take the average coverage for the whole genome as expected coverage. If there are more than one samples given, the function will first quantile normalize coverage across samples, then correct the coverage by GC for each sample. If control sample is not specified, the expected coverage is the median coverage across all samples, if a normal control is specified, the average coverage for control sample is taken as expected coverage for further analysis.

Note:

If you have more than one samples, please make sure they are targeted by the same regions.
If you have more than one samples, please separate female samples and male samples, which means normalize them separately; because sex chromosomes can bias the normalization.
One sample is good to go, however multiple samples will help you get a better normalization.

If you only have one sample

If you choose to write the output to file (recommended)

## normalize coverage data## set plot=FALSE here because similar plot will show in the following examplenormalizeCoverage(aneuploidy_cov,writeToFile=TRUE,destination=".",plot=FALSE)#> no control provided#> there are 1 samples#> correct GC bias in sample 'aneuploidy_cov' ...#> normalized depth for sample aneuploidy_cov is written to ./aneuploidy_cov_normed_depth.txt

If you don’t want to write output to file

## normalize coverage dataaneuploidy_normed =normalizeCoverage(aneuploidy_cov,writeToFile=FALSE,plot=FALSE)#> no control provided#> there are 1 samples#> correct GC bias in sample 'aneuploidy_cov' ...# # GRangesList对象将由傅nction, look at it bynames(aneuploidy_normed)#> [1] "aneuploidy_cov"aneuploidy_normed[["aneuploidy_cov"]]#> GRanges object with 154 ranges and 4 metadata columns:# > seqnames范围链|深度GCnormed_depth#>    |   #> [1] chr14 21538016-21538117 * | 21 0.647059 34#> [2] chr14 22377854-22377955 * | 24 0.362745 25#> [3] chr14 26201540-26201641 * | 45 0.362745 46#> [4] chr14 26475535-26475636 * | 8 0.264706 26#> [5] chr14 29080232-29080333 * | 6 0.323529 22#> ... ...... ... . ... ... ...#> [150] chr18 58507475-58507576 * | 57 0.441176 54#> [151] chr18 59650666-59650767 * | 37 0.372549 35#> [152] chr18 60388772-60388873 * | 40 0.245098 52#> [153] chr18 63107118-63107219 * | 93 0.450980 90#> [154] chr18 69135718-69135819 * | 14 0.323529 30#> ref_depth#> #> [1] 0#> [2] 0#> [3] 0#> [4] 0#> [5] 0#> ... ...#> [150] 0#> [151] 0#> [152] 0#> [153] 0#> [154] 0#> -------#> seqinfo: 24 sequences from an unspecified genome; no seqlengths

If you have more than one samples, without a normal control

If you choose to write the output to file (recommended)

## normalize coverage datanormalizeCoverage(aneuploidy_cov, normal_cov,writeToFile =TRUE,destination =".",plot=FALSE)#> no control provided#> there are 2 samples#> Quantile normalizing ...#> correct GC bias in sample' aneuploidy_cov '...#> correct GC bias in sample' normal_cov '...#> normalized depth for sample aneuploidy_cov is written to ./aneuploidy_cov_normed_depth.txt#> normalized depth for sample normal_cov is written to ./normal_cov_normed_depth.txt

If you don’t want to write output to file

## normalize coverage datanormed_without_control =normalizeCoverage(aneuploidy_cov, normal_cov,writeToFile=FALSE,plot=TRUE)#> no control provided#> there are 2 samples#> Quantile normalizing ...#> correct GC bias in sample' aneuploidy_cov '...#> correct GC bias in sample' normal_cov '...

# # GRangesList对象将由傅nctionlength(normed_without_control)#> [1] 2names(normed_without_control)#> [1] "aneuploidy_cov" "normal_cov"## subsettingnormed_without_control[["aneuploidy_cov"]]#> GRanges object with 154 ranges and 5 metadata columns:#> seqnames ranges strand | depth quantiled_depth GC#>    |   #> [1] chr14 21538016-21538117 * | 21 16 0.647059#> [2] chr14 22377854-22377955 * | 24 19 0.362745#> [3] chr14 26201540-26201641 * | 45 36 0.362745#> [4] chr14 26475535-26475636 * | 8 6 0.264706#> [5] chr14 29080232-29080333 * | 6 5 0.323529#> ... ...... ... . ... ... ...#> [150] chr18 58507475-58507576 * | 57 46 0.441176#> [151] chr18 59650666-59650767 * | 37 30 0.372549#> [152] chr18 60388772-60388873 * | 40 32 0.245098#> [153] chr18 63107118-63107219 * | 93 76 0.450980#> [154] chr18 69135718-69135819 * | 14 11 0.323529#> normed_depth ref_depth#>  #> [1] 26 35.5733#> [2] 19 35.5733#> [3] 36 35.5733#> [4] 20 35.5733#> [5] 18 35.5733#> ... ......#> [150] 43 31.7051#> [151] 28 31.7051#> [152] 41 31.7051#> [153] 74 31.7051#> [154] 24 31.7051#> -------#> seqinfo: 24 sequences from an unspecified genome; no seqlengthsnormed_without_control[["normal_cov"]]#> GRanges object with 153 ranges and 5 metadata columns:#> seqnames ranges strand | depth quantiled_depth GC#>    |   #> [1] chr14 21538016-21538117 * | 4 5 0.647059#> [2] chr14 22377854-22377955 * | 32 42 0.362745#> [3] chr14 26201540-26201641 * | 24 31 0.362745#> [4] chr14 26475535-26475636 * | 12 16 0.264706#> [5] chr14 29080232-29080333 * | 11 14 0.323529#> ... ...... ... . ... ... ...#> [149] chr18 58507475-58507576 * | 18 24 0.441176#> [150] chr18 59650666-59650767 * | 23 30 0.372549#> [151] chr18 60388772-60388873 * | 22 28 0.245098#> [152] chr18 63107118-63107219 * | 45 61 0.450980#> [153] chr18 69135718-69135819 * | 13 18 0.323529#> normed_depth ref_depth#>  #> [1] 16 35.5733#> [2] 40 35.5733#> [3] 29 35.5733#> [4] 30 35.5733#> [5] 22 35.5733#> ... ......#> [149] 19 31.7051#> [150] 26 31.7051#> [151] 40 31.7051#> [152] 56 31.7051#> [153] 26 31.7051#> -------#> seqinfo: 24 sequences from an unspecified genome; no seqlengths

If you have more than one samples, with a normal control

If you choose to write the output to file (recommended)

## normalize coverage data, normal_cov is the control samplenormalizeCoverage(aneuploidy_cov,control=normal_cov,writeToFile=TRUE,destination =".",plot=FALSE)#> control: normal_cov#> there are 2 samples#> Quantile normalizing ...#> correct GC bias in sample' normal_cov '...#> correct GC bias in sample' aneuploidy_cov '...#> normalized depth for sample normal_cov is written to ./normal_cov_normed_depth.txt#> normalized depth for sample aneuploidy_cov is written to ./aneuploidy_cov_normed_depth.txt

If you don’t want to write output to file

normed_with_control =normalizeCoverage(aneuploidy_cov,control=normal_cov,writeToFile =FALSE,plot=FALSE)#> control: normal_cov#> there are 2 samples#> Quantile normalizing ...#> correct GC bias in sample' normal_cov '...#> correct GC bias in sample' aneuploidy_cov '...# # GRangesList对象将由傅nctionlength(normed_without_control)#> [1] 2names(normed_with_control)#> [1] "normal_cov" "aneuploidy_cov"

Prepare heterozygous sites

Having vcf.gz file and target bed file ready, useprepareHeterofunction to process the heterozygous sites.

## specify the path to vcf.gz fileaneuploidy_vcf =system.file(“extdata","aneuploidy.vcf.gz",package="MADSEQ")## target bed file specified before## If you choose to write the output to file (recommended)prepareHetero(aneuploidy_vcf, target,genome="hg19",writeToFile=TRUE,destination=".",plot =FALSE)#> reading vcf file#> Scanning file to determine attributes.#> File attributes:#> meta lines: 116#> header_line: 117#> variant count: 387#> column count: 10#>Meta line116read in.#> All meta lines processed.#> gt matrix initialized.#> Character matrix gt created.#> Character matrix gt rows: 387#> Character matrix gt cols: 10#> skip: 0#> nrows: 387#> row_num: 0#>Processed variant:387#> All variants processed#> processing vcf file#> filtering vcf file#> Warning in .Seqinfo.mergexy(x, y): The 2 combined objects have no sequence levels in common. (Use#> suppressWarnings() to suppress this warning.)#> Warning in .Seqinfo.mergexy(x, y): The 2 combined objects have no sequence levels in common. (Use#> suppressWarnings() to suppress this warning.)#> filtered heterozygous sites for sample aneuploidy.vcf.gz is written to ./aneuploidy.vcf.gz_filtered_heterozygous.txt## If you don't want to write output to fileaneuploidy_hetero =prepareHetero(aneuploidy_vcf, target,genome="hg19",writeToFile=FALSE,plot =FALSE)#> reading vcf file#> Scanning file to determine attributes.#> File attributes:#> meta lines: 116#> header_line: 117#> variant count: 387#> column count: 10#>Meta line116read in.#> All meta lines processed.#> gt matrix initialized.#> Character matrix gt created.#> Character matrix gt rows: 387#> Character matrix gt cols: 10#> skip: 0#> nrows: 387#> row_num: 0#>Processed variant:387#> All variants processed#> processing vcf file#> filtering vcf file#> Warning in .Seqinfo.mergexy(x, y): The 2 combined objects have no sequence levels in common. (Use#> suppressWarnings() to suppress this warning.)#> Warning in .Seqinfo.mergexy(x, y): The 2 combined objects have no sequence levels in common. (Use#> suppressWarnings() to suppress this warning.)

Run MadSeq model to detect potential mosaic aneuploidy

The functionrunMadSeqwill run the models and select the best model for the input data.

Note:

Among three normalized coverage sets listed above (one sample, two samples without control, two samples with a control), we will use the two samplew with a control case for the following analysis.
Because these models are based on MCMC sampling, the running process can be very long. Running different chromosomes parallel in background or High Performance Computer Cluster is highly recommended.

If the processed data have been written into files

## specify the path to processed filesaneuploidy_hetero ="./aneuploidy.vcf.gz_filtered_heterozygous.txt"aneuploidy_normed_cov ="./aneuploidy_cov_normed_depth.txt"## run the modelaneuploidy_chr18 =runMadSeq(hetero=aneuploidy_hetero,coverage=aneuploidy_normed_cov,target_chr="chr18",nChain=1,nStep=1000,thinSteps=1,adapt=100,burnin=200)#> total number of heterozygous site: 28#> total number of coverage 39#> module mix loaded#> 1. running normal model#> Compiling model graph#> Resolving undeclared variables#> Allocating nodes#> Graph information:#> Observed stochastic nodes: 67#> Unobserved stochastic nodes: 30#> Total graph size: 196#>#> Initializing model#> 2. running monosomy model#> Compiling model graph#> Resolving undeclared variables#> Allocating nodes#> Graph information:#> Observed stochastic nodes: 69#> Unobserved stochastic nodes: 29#> Total graph size: 221#>#> Initializing model#> 3. running mitotic trisomy model#> Compiling model graph#> Resolving undeclared variables#> Allocating nodes#> Graph information:#> Observed stochastic nodes: 69#> Unobserved stochastic nodes: 29#> Total graph size: 221#>#> Initializing model#> 4. running meiotic trisomy model#> Compiling model graph#> Resolving undeclared variables#> Allocating nodes#> Graph information:#> Observed stochastic nodes: 69#> Unobserved stochastic nodes: 30#> Total graph size: 229#>#> Initializing model#> 5. running loss of heterozygosity model#> Compiling model graph#> Resolving undeclared variables#> Allocating nodes#> Graph information:#> Observed stochastic nodes: 67#> Unobserved stochastic nodes: 33#> Total graph size: 770#>#> Initializing model#> models done, comparing models

#> Order and delta BIC of the preference of models #> BIC_normal BIC_mitotic_trisomy BIC_meiotic_trisomy BIC_monosomy #> 0.00000 10.40332 16.06959 20.92761 #> BIC_LOH #> 25.42663 #> model selected: normal ## An MadSeq object will be returned aneuploidy_chr18 #> MadSeq object with the posterior distribution from normal model #> kappa m_cov mu p_cov r_cov #> [1,] 4.251502 28.28205 0.4740872 0.1304636 4.243386 #> [2,] 6.012156 28.28205 0.4740872 0.1507365 5.019805 #> [3,] 3.452512 28.28205 0.4740872 0.1057616 3.344919 #> [4,] 4.372610 28.28205 0.4740872 0.1231693 3.972809 #> [5,] 5.775688 28.28205 0.4740872 0.1456355 4.820974 #> [6,] 5.918540 28.28205 0.4740872 0.1860932 6.466463 #> ------ #> BIC_normal BIC_mitotic_trisomy BIC_meiotic_trisomy BIC_monosomy #> 0.00000 10.40332 16.06959 20.92761 #> BIC_LOH #> 25.42663

Note:In order to save time, we only run 1 chain with a much less steps compared with default settings. For real cases, the default settings are recommended.

If the processed data haveNOTbeen written into files

## subset normalized coverage for aneuploidy sample from the GRangesList## returned by normalizeCoverage functionaneuploidy_normed_cov =normed_with_control[["aneuploidy_cov"]]## run the modelaneuploidy_chr18 =runMadSeq(hetero=aneuploidy_hetero,coverage=aneuploidy_normed_cov,target_chr="chr18")# # MadSeq对象将被归还aneuploidy_chr18

TheMadSeqobject from therunMadSeqfunction contains:

The posterior distribution of the selected model
The delta BIC value between selected model and other models

Note:The value of delta BIC suggests the strength of the confidence of the selected model against other models. In our model, you can set a threshold to get high confidence result, usually it’s20in our testing cases. We summarize it as follows

deltaBIC	Evidence against higher BIC
[0,10]	Probably noisy data
(10,20]	Could be positive
>20	High confidence

Visualize the selected model

There are a group of plot functions to plot the output MadSeq object from therunMadSeq.

## plot the posterior distribution for all the parameters in selected modelplotMadSeq(aneuploidy_chr18)

## plot the histogram for the estimated fraction of aneuploidyplotFraction(aneuploidy_chr18,prob=0.95)#> selected model is normal, f is estimated to be 0%

## plot the distribution of AAF as estimated by the modelplotMixture(aneuploidy_chr18)

parameters	description
f	Fraction of mosaic aneuploidy
m	The midpoint of the alternative allele frequency (AAF) for all heterozygous sites
mu[1]	Mean AAF of mixture 1: the AAFs of this mixture shifted from midpoint to some higher values
mu[2]	Mean AAF of mixture 2: the AAFs of this mixture shifted from midpoint to some lower values
mu[3] (LOH model)	Mean AAF of mixture 3: In LOH model, mu[3] indicates normal sites without loss of heterozygosity
mu[3] (meiotic trisomy model)	Mean AAF of mixture 3: In meiotic model, the AAFs of this mixture shifted from 0 to some higher value
mu[4]	Mean AAF of mixture 4: the AAFs of this mixture shifted from 1 to some lower value (only in meiotic model)
kappa	Indicate variance of the AAF mixtures: larger kappa means smaller variance
p[1]	Weight of mixture 1: indicate the proportion of heterozygous sites in the mixture 1
p[2]	Weight of mixture 2: indicate the proportion of heterozygous sites in the mixture 2
p[3]	Weight of mixture 3: indicate the proportion of heterozygous sites in the mixture 3 (only in LOH and meiotic model)
p[4]	Weight of mixture 4: indicate the proportion of heterozygous sites in the mixture 4 (only in meiotic model)
p[5]	Weight of outlier component: the AAF of 1% sites might not well behaved, so these sites are treated as noise.
m_cov	Mean coverage of all the sites from the chromosome, estimated from a negative binomial distribution
p_cov	Prob of the negative binomial distribution for the coverage
r_cov	Another parameter (r) for the negative binomial disbribution of the coverage, small r means large variance

Yu Kong

2022-04-26

你需要的数据

Example Data

Use The Package

Prepare coverage information

Normalize coverage information

If you only have one sample

If you have more than one samples, without a normal control

If you have more than one samples, with a normal control

Prepare heterozygous sites

Run MadSeq model to detect potential mosaic aneuploidy

If the processed data have been written into files

If the processed data haveNOTbeen written into files

Visualize the selected model

Description of All Parameters