Bioinformatics analysis of circulating cell-free DNA sequencing data.

Landon L. Chan, Peiyong Jiang

PII: S0009-9120(15)00167-8
DOI: 10.1016/j.clinbiochem.2015.04.022
Reference: CLB 9019

To appear in: Clinical Biochemistry

Received date: 15 December 2014
Revised date: 30 March 2015
Accepted date: 29 April 2015

Please cite this article as: Chan Landon L., Jiang Peiyong, Bioinformatics analysis of circulating cell-free DNA sequencing data, Clinical Biochemistry (2015), doi: 10.1016/j.clinbiochem.2015.04.022

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

ACCEPTED MANUSCRIPT Bioinformatics analysis of circulating cell-free DNA sequencing data

Landon L. Chan1,2, Peiyong Jiang1,2*

1. Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China

2. Department of Chemical Pathology, The Chinese University of Hong Kong, Prince of Wales Hospital, Shatin, New Territories, Hong Kong SAR, China

*To whom correspondence should be addressed. E-mail: [email protected]

Abstract

The discovery of cell-free DNA molecules in plasma has opened up numerous opportunities in noninvasive diagnosis. Cell-free DNA molecules have become increasingly recognized as promising biomarkers for detection and management of many diseases. The advent of next generation sequencing has provided unprecedented opportunities to scrutinize the characteristics of cell-free DNA molecules in plasma in a genome-wide fashion and at single-base resolution. Consequently, clinical applications of circulating cell-free DNA analysis have not only revolutionized noninvasive prenatal diagnosis but also facilitated cancer detection and monitoring toward an era of blood-based personalized medicine. With the rapidly increasing throughput and falling cost of next generation sequencing, bioinformatics analysis is increasingly needed to make sense of the large amount of data generated by these sequencing platforms.

In this Review, we highlight the major bioinformatics algorithms involved in the analysis of cell-free DNA sequencing data. Firstly, we briefly describe the biological properties of these molecules and provide an overview of the general bioinformatics approach for the analysis of cell-free DNA. Then, we discuss the specific upstream bioinformatics considerations concerning the analysis of sequencing data of circulating cell-free DNA, followed by further detailed elaboration on each key clinical situation in noninvasive prenatal diagnosis and cancer management where downstream bioinformatics analysis is heavily involved. We also discuss bioinformatics analysis as well as clinical applications of the newly developed massively parallel bisulfite sequencing of cell-free DNA. Finally, we offer our perspectives on the future development of bioinformatics in noninvasive diagnosis.

Introduction

The discovery of circulating cell-free DNA (cfDNA) in plasma has opened up a new exciting arena in blood-based diagnosis, obviating the need for tissue biopsy in the settings of prenatal and cancer management [1, 2]. Thus far, the main applications of cfDNA are in prenatal diagnosis [3-6] and cancer monitoring [7-11]. A small proportion of fetal-derived and tumor-derived cfDNA was found in pregnant women’s circulation and in cancer patients’ circulation respectively [8, 12]. Because fetal-derived and tumor-derived cfDNA are genetically different from the main background circulating cfDNA, blood samples collected from pregnant women and cancer patients thus provide a ‘liquid biopsy’ consisting of the fetal genetic profile and tumor’s mutational profile. This is the foundation of the development of various noninvasive diagnostic techniques.

Before the maturation of next generation sequencing technologies, analyses of circulating cfDNA were done with PCR assays [12-18]. Conventional PCR assays are useful in qualitative analysis such as fetal sex determination [12, 19] and RhD status determination [13, 20]. However, they are currently still not practical for screening for aneuploidies, sub-chromosomal changes and copy number aberrations because precise quantitative analyses are required.

Recent advances in next generation sequencing have offered a much more efficient platform for the analysis of cfDNA because millions of DNA molecules can be analyzed in parallel. Characteristics of circulating cfDNA can be unveiled at single-base resolution and on a genome-wide scale. With the exponential growth in sequencing data output, sophisticated bioinformatics algorithms for noninvasive prenatal diagnosis and cancer monitoring are increasingly in demand.

To date, many reviews have discussed the application of cfDNA analyses in clinical management [21-25]. However, discussion of the underlying bioinformatics algorithms is lacking. This Review intends to fill this gap by giving an overview of the different bioinformatics algorithms commonly used in the analysis of cfDNA.

This Review will cover four aspects. Firstly, we introduce the key biological properties of cfDNA underpinning the conceptual basis of many bioinformatics algorithms in noninvasive diagnosis. Secondly, we provide a general overview of the bioinformatics approach toward the analysis of cfDNA molecules. Thirdly, we discuss upstream bioinformatics considerations specific to the analysis of cfDNA. Finally, we give a detailed elaboration on the downstream bioinformatics algorithms that are crucial in noninvasive prenatal diagnosis and cancer assessment.

Biological properties of fetal-specific and tumor-specific cfDNA

The presence of cfDNA in the plasma of healthy individuals is thought to be the result of cellular apoptosis [26]. In pregnant women, the main source of fetal-derived cfDNA in maternal plasma is the placenta [27]. In cancer patients, both apoptotic and necrotic events have been suggested as the sources of circulating tumor cfDNA [21, 22]. DNA resulting from these events is naturally fragmented and released into the circulation. Fetal-derived circulating cfDNA in plasma has been reported to be shorter than maternal-derived cfDNA [28, 29]. Interestingly, tumor-derived circulating cfDNA in plasma has also been reported to be shorter than the background circulating cfDNA [30-32]. However, some groups have found longer fragments in the blood circulation of patients with certain types of cancer [33-35].

The concentration of fetal-derived and tumor-derived cfDNA in the circulation has been shown to increase with the size of the fetus and tumor [8, 36]. In one study consisting of 22,384 healthy singleton pregnancies, it was found that the concentration of fetal-derived cfDNA increased by 0.1% per week between 10 and 21 weeks of gestation, and by 1% per week beyond 21 weeks of gestation [36]. The concentration of tumor-derived cfDNA has been reported to increase with tumor stage and size in patients with a variety of cancers [7, 9, 10]. However, the exact trend is likely to vary among cancer types, influenced by cancer-dependent variables such as location, tumor aggressiveness, cancer genotypic aberrations, and other risk and prognostic factors.

Fetal-derived and tumor-derived cfDNA have been demonstrated to share similar clearance kinetics. Following delivery, fetal-derived cfDNA in maternal plasma decreases rapidly [37, 38] in two phases. The initial rapid phase has a mean half-life of one hour. The subsequent slow phase has a mean half-life of 13 hours. Altogether, fetal-derived cfDNA becomes undetectable at about one to two days postpartum. Similarly, tumor-derived cfDNA exhibits rapid clearance with a half-life of two hours [39] following complete resection. Therefore, the presence of tumor-derived cfDNA in postoperative patients may suggest the presence of residual tumor. These biological properties of fetal-derived cfDNA and tumor-derived cfDNA are summarized in Table 1.

General categorization of bioinformatics approach in the analysis of cfDNA

The biological properties of fetal- and tumor-specific cfDNA, that is, the natural fragmentation, persistence and rapid clearance kinetics of cfDNA, allow for a new homeostasis to be achieved within a person’s circulation in the presence of a fetus or tumor. In other words, the new homeostasis can be viewed as an imbalance of circulating genetic materials relative to the baseline (i.e. without the fetus or tumor), as a result of the maintenance of fetal- and tumor-specific cfDNA. The general bioinformatics approach, therefore, is to detect and quantify this imbalance that differentiates the new homeostasis from the old homeostasis. There are three major categories:

A. Imbalances detected via allelic count: The normal allelic ratio within the circulation is disturbed due to the presence of genetic materials derived from the fetus or tumor. Examples of this approach include estimation of fractional fetal/tumor DNA concentration and detection of aneuploidy.

B. Imbalances detected via regional genomic representation: The relative representation of a genomic region can be increased or decreased as a result of the presence of the fetus or tumor. Examples of this approach include detection of aneuploidy and detection of copy number changes in cancer.

C. Imbalances detected via size distribution: The normal size distribution of cfDNA molecules within the circulation is disturbed due to the presence of genetic materials derived from the fetus or tumor. Examples of this approach include estimation of fractional fetal DNA concentration and detection of aneuploidy.

These three types of analyses, together, form the basis of the many algorithms to be discussed in the following sections. Figure 1 summarizes the key categories and applications of bioinformatics in the analysis of cfDNA.

Upstream bioinformatics analysis of circulating cfDNA with sequencing

The quality of sequencing reads could affect downstream analyses. In the analysis of cfDNA, the quality of sequencing reads is particularly vulnerable to two technical challenges: adapter contamination [40-42] and sequencing bias contributed by the non-uniform PCR amplification of GC-rich/poor DNA fragments [5, 43-45].

Adapter contamination

Adapter contamination affects circulating cfDNA molecules that are shorter than the lengths of individual reads generated by next generation sequencing machines. For example, consider an 80 bp DNA molecule loaded onto a sequencer configured to produce 100 bp reads. Without adapter trimming, part of the adapter ligated to the 3’ end of the molecule during library preparation is sequenced as well. As a result, 20 bp of adapter sequence is introduced into the read from this particular molecule. Consequently, this read is either unmappable to the human genome using an end-to-end alignment method or has a low alignment score. Either way, the read is discarded from subsequent analysis. Therefore, with adapter contamination, a substantial proportion of short cfDNA molecules could go unanalyzed.

The aforementioned adapter contamination might lead to inaccurate results in downstream analysis. For example, the classification power of noninvasive prenatal screening tests would be reduced. This is because the fetal-derived cfDNA molecules are generally shorter than the maternal-derived cfDNA molecules [28, 29]. Therefore, the fetal-derived cfDNA molecules have a higher chance of being contaminated by adapters. Adapter contamination thus can result in losing part of the informative fragments (fetal-derived cfDNA molecules).

A number of bioinformatics packages have been developed to tackle this problem. In general, most of these adapter trimming packages utilize dynamic programming such as Smith-Waterman [46] or Needleman-Wunsch algorithms [47] with minor modifications to identify the adapters. These algorithms are also used in conventional pair-wise alignment. A few examples of adapter trimming packages include Cutadapt [41], TrimGalore [42], FastqMcf [48], HTSeq [49], AdapterRemoval [40] and Trimmomatic [50]. Their functionalities are largely similar, but they differ in their ability to work directly on gzip-format files and paired-end reads. A summary of these packages is given in Table 2.
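To make the 80 bp example concrete, the following minimal Python sketch removes a 3’ adapter by scanning for the first read position from which the remaining bases match the start of the adapter. It is a deliberate simplification: it tolerates mismatches but not insertions or deletions, whereas the packages listed above rely on dynamic-programming alignment. The function name, the `min_overlap` and `max_mismatch_rate` parameters, and the adapter string in the usage note are illustrative assumptions and are not taken from any of those tools.

```python
def trim_adapter(read: str, adapter: str,
                 min_overlap: int = 3,
                 max_mismatch_rate: float = 0.1) -> str:
    """Return `read` with a 3' adapter (or a leading part of it) removed.

    Hypothetical helper for illustration only: mismatch-tolerant matching,
    no indel handling, unlike the dynamic-programming tools cited in the text.
    """
    # Try every possible adapter start, leftmost first, so that the whole
    # adapter-derived tail is removed.
    for start in range(len(read) - min_overlap + 1):
        overlap = min(len(adapter), len(read) - start)
        segment = read[start:start + overlap]
        mismatches = sum(a != b for a, b in zip(segment, adapter))
        if mismatches <= max_mismatch_rate * overlap:
            return read[:start]
    # No adapter-like tail found: the read is returned unchanged.
    return read
```

For an 80 bp insert followed by 20 bp of adapter-derived sequence in a 100 bp read, the call `trim_adapter(read, "AGATCGGAAGAGC")` would be expected to return the 80 bp insert, which can then be aligned end-to-end.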

Sequencing bias

The main sequencing bias affecting the analysis of circulating cfDNA is the GC-content bias attributable to PCR amplification [51, 52]. In the process of library preparation, DNA molecules are amplified with PCR. It has been reported that DNA molecules with poor GC-content and shorter length are preferentially amplified [51, 52]. Because only a subset of amplified DNA molecules are sequenced on the flow-cells, regions of poor GC-content or shorter DNA fragments will be over-represented, whereas GC-rich or longer DNA fragments will be under-represented. As a result, this phenomenon can hinder the assessment of aneuploidies, particularly for chromosomes with higher GC-content. A number of algorithms have been proposed to correct GC-bias. These methods have been discussed in detail previously [51].

Downstream bioinformatics analysis of circulating cfDNA with sequencing

The following sections discuss the downstream bioinformatics algorithms in the analysis of circulating cfDNA. Each section describes a clinical application in which various bioinformatics algorithms are applied: 1) estimation of fractional fetal DNA concentration, 2) noninvasive detection of fetal aneuploidy in singleton pregnancies, 3) noninvasive detection of zygosity and aneuploidies in twin pregnancies, 4) noninvasive haplotype based detection of monogenic disorders, 5) estimation of fractional tumor DNA concentration, 6) noninvasive detection of copy number aberrations in cancer patients, 7) noninvasive detection of genomic rearrangements in cancer patients and 8) noninvasive methylomic analysis of circulating cfDNA.

Estimation of fractional fetal DNA concentration

There are two major approaches to estimate the fractional fetal DNA concentration: the polymorphism dependent approach and the polymorphism independent approach. Methods under the polymorphism dependent approach are further categorized as either parental-genotype dependent or parental-genotype independent. Briefly, the parental-genotype dependent method requires both the paternal and maternal genotypes to deduce the fetal genotype and directly estimates the fractional fetal DNA concentration. In contrast, the parental-genotype independent method does not require both the paternal and maternal genotypes. It relies on statistical inference based on the maternal plasma allelic distribution to deduce the fractional fetal DNA concentration. The polymorphism independent approach, as the name suggests, does not rely on the parental genotype at all. It takes advantage of the biological differences between maternal cfDNA molecules and fetal cfDNA molecules to estimate the fractional fetal DNA concentration. A more detailed account of these approaches is given in the following sections.

Polymorphism dependent approach (parental-genotype dependent method): Direct estimation

The parental-genotype dependent method is a polymorphism dependent approach which involves identification of parental SNP loci at which both mother (AA) and father (BB) are homozygous but for a different allele each [28, 43, 53]. The resulting SNP loci in the fetus are obligately heterozygous (AB). The A-allele is termed the shared allele because it is identical between the fetus and the mother. The B-allele is termed the fetal-specific allele because it is present in the fetus but absent in the mother. By investigating the sequence reads originating from these obligately heterozygous SNP loci, the fractional fetal DNA concentration in plasma can be directly deduced by using the equation:

$$\frac{2p}{p+q}$$

where p is the total number of reads aligned to the fetal-specific allele (B), and q is the total number of reads aligned to the shared allele (A) (Figure 2).
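The direct estimate above is a one-line calculation once the allele counts have been pooled over the informative SNP loci. The short Python sketch below shows it; the function and argument names are hypothetical, and pooling the counts across loci before applying the equation is an assumption of this sketch.

```python
def fetal_fraction(fetal_specific_reads: int, shared_reads: int) -> float:
    """Fractional fetal DNA concentration, 2p / (p + q).

    p = reads carrying the fetal-specific allele (B),
    q = reads carrying the shared allele (A),
    both summed over SNP loci where the mother is AA and the father is BB.
    """
    p, q = fetal_specific_reads, shared_reads
    if p + q == 0:
        raise ValueError("no reads at informative SNP loci")
    return 2 * p / (p + q)
```

For instance, 50 fetal-specific reads against 950 shared reads would correspond to a fractional fetal DNA concentration of 10%. The factor of two reflects that only one of the two fetal alleles at these loci is distinguishable from the maternal background.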

Polymorphism dependent approach (parental-genotype independent method): FetalQuant

The parental-genotype dependent method described above determines the fetal contribution in maternal plasma directly. However, its application is limited by the need for paternal genotype information. In an epidemiological study on paternal discrepancy (PD) around the world, it is suggested that the prevalence of PD can be as high as 30% [54]. To overcome this limitation, Jiang et al. developed FetalQuant, a bioinformatics algorithm that applies maximum likelihood to estimate the fractional fetal DNA concentration without the need for paternal genotype information [55]. FetalQuant hypothetically categorizes each SNP into one of the following groups: $AA_{AA}$, $AA_{AB}$, $AB_{AA}$ and $AB_{AB}$, where the main symbols represent the maternal genotypes and the subscripts represent the fetal genotypes. A and B, respectively, refer to the most prevalent and second-most prevalent allele at a particular locus. Each of these genotypes has its own distribution for the expected B-allele counts. In particular, the B-allele occurrences on the $AA_{AB}$ and $AB_{AA}$ genotypes follow binomial distributions, which are directly parameterized by the fractional fetal DNA concentration. For example, at 10% fetal DNA concentration, the chance of observing the B-allele on $AA_{AB}$ would be 5% in plasma. Thus, in turn, the fractional fetal DNA concentration can be deduced by searching for the value that maximizes the likelihood of the binomial mixture model.
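The idea of a binomial mixture whose components are parameterized by the fractional fetal DNA concentration f can be sketched in a few lines of Python. This is not the FetalQuant implementation: the equal mixing weights, the fixed sequencing error rate used for the $AA_{AA}$ class, the B-allele probability of 0.5 - f/2 for $AB_{AA}$ (which follows from the same reasoning as the 5% example) and the grid search are all assumptions made here for illustration.

```python
import math


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the binomial probability of k B-allele reads out of n."""
    p = min(max(p, 1e-9), 1 - 1e-9)  # keep log() finite
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def mixture_log_likelihood(f, snps, error_rate=0.001,
                           weights=(0.25, 0.25, 0.25, 0.25)):
    """Log-likelihood of (b_count, depth) pairs under a 4-class mixture.

    Expected B-allele probabilities per class (assumed for this sketch):
    AA_AA: error_rate, AA_AB: f/2, AB_AA: 0.5 - f/2, AB_AB: 0.5
    """
    b_probs = (error_rate, f / 2, 0.5 - f / 2, 0.5)
    total = 0.0
    for b_count, depth in snps:
        logs = [math.log(w) + log_binom_pmf(b_count, depth, p)
                for w, p in zip(weights, b_probs)]
        top = max(logs)  # log-sum-exp for numerical stability
        total += top + math.log(sum(math.exp(x - top) for x in logs))
    return total


def estimate_fetal_fraction(snps, grid_step=0.001, max_f=0.5):
    """Grid search for the f that maximizes the mixture likelihood."""
    n_steps = round(max_f / grid_step)
    candidates = [i * grid_step for i in range(1, n_steps)]
    return max(candidates, key=lambda f: mixture_log_likelihood(f, snps))
```

With idealized counts at a depth of 1,000 (for example 50 B-allele reads at $AA_{AB}$ loci and 450 at $AB_{AA}$ loci), the search would be expected to settle near f = 0.10. A production implementation would also estimate the mixing weights and use a proper optimizer.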

Polymorphism independent approach: Size-based analysis

Recently, an alternative method based on the size distribution of the cfDNA molecules in maternal plasma was proposed to estimate the fractional fetal DNA concentration [56]. This method is based on the findings that the fetal-derived cfDNA molecules are in general shorter than the maternal-derived cfDNA molecules [28, 29]. As the gestational age progresses, the relative contribution of fetal cfDNA in maternal plasma increases. Consequently, the size ratio, that is, the ratio between shorter fragments and longer fragments in maternal plasma, is expected to increase with gestational age. Using a cross-validation model consisting of 73 euploid pregnancies carrying male fetuses, Yu et al. demonstrated that the size ratio has a linear relationship with the fractional fetal DNA concentration, and showed that the size-based method is highly concordant with the fractional fetal DNA concentration as determined by the proportion of chromosome Y sequences in maternal plasma, with a median absolute difference of 2.3%.

In the estimation of fractional fetal DNA concentration, all three methods provide similar accuracy. However, each method has its own specific requirement for implementation. If the paternal genotype is available, then the parental-genotype dependent method provides a direct estimation of the fractional fetal DNA concentration from the blood sample, and thus would be the method of choice. If the paternal genotype is unavailable, then both FetalQuant and the size-based method can be used. The implementation of FetalQuant requires a high sequencing depth through targeted sequencing because the maximum likelihood depends on the mean coverage of the SNPs. So, it could be costly if a large number of samples are analyzed. The size-based method, though it requires less sequencing depth than FetalQuant, needs a set of healthy samples to establish the range of normal size ratio. It is also comparatively more cost-effective because the normal range can be reused once it is established. However, the size of cfDNA would be affected by pathological and physiological conditions such as systemic lupus erythematosus and cancer [57, 58], and thus the method may only be applicable to healthy subjects.

Noninvasive detection of fetal aneuploidy in singleton pregnancies

One of the most important clinical applications of cfDNA is the noninvasive prenatal diagnosis of fetal aneuploidies: trisomy 21 (Down’s syndrome), trisomy 18 (Edwards syndrome) and trisomy 13 (Patau syndrome). The discovery of fetal DNA in maternal plasma in 1997 [1] has opened up new possibilities for such application. However, circulating cfDNA derived from the fetus only exists at a minor fraction, making it technologically challenging to capture and quantify these molecules. In general, fetal cfDNA amounts to approximately 10% of maternal plasma cfDNA in pregnant women [59]. Chromosome 21 is about 1.5% of the total human genome length. In trisomy 21, the genomic representation of chromosome 21 will then be expected to increase slightly from 1.5% to 1.57%. That is, the perturbation of genomic representation of chromosome 21 in the presence of aneuploidy is extremely small. Therefore, a large number of plasma cfDNA molecules need to be analyzed in order to generate precise quantitative results. Massively parallel sequencing offers a practical solution to this problem because millions of cfDNA molecules can be sequenced simultaneously. Two main approaches to detect fetal aneuploidy in singleton pregnancies noninvasively are discussed here: the whole-genome approach and the targeted approach. The targeted approach was developed to increase the sequencing depth at the targeted site at the expense of reduction in genome-wide coverage.
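The shift from 1.5% to 1.57% quoted above can be reproduced with a short calculation. The derivation below is a sketch under two stated assumptions: the fractional fetal DNA concentration f is taken as 10%, and the fetus contributes 1.5 times the normal dosage of chromosome 21. The symbols r and f are introduced here for illustration.

```latex
% r: share of chromosome 21 in a euploid genome
% f: fractional fetal DNA concentration
\[
r_{\mathrm{T21}}
  = \frac{r\,(1-f) + 1.5\,r\,f}{(1-r) + r\,(1-f) + 1.5\,r\,f}
  = \frac{r\,(1 + f/2)}{1 + r f/2}
\]
\[
r = 0.015,\; f = 0.10:\qquad
r_{\mathrm{T21}} = \frac{0.015 \times 1.05}{1 + 0.015 \times 0.05}
  = \frac{0.01575}{1.00075} \approx 0.0157
\]
```

To first order the relative increase is f/2, that is, about 5% of an already small proportion, which is why precise counting of a very large number of molecules is required.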

Whole-genome approach: Tag-counting based analysis

Chromosome representation refers to the proportion of cfDNA represented by each individual chromosome in the plasma sample. In a trisomic pregnancy, the proportion of DNA molecules from the affected chromosome in maternal plasma would be increased because of the extra chromosome carried by the fetus. This fact has provided the theoretical foundation for the detection of fetal aneuploidy using massively parallel sequencing. In the tag-counting method [43], each sequenced read is mapped to the human genome. The chromosome representation is estimated from the proportion of reads originating from a chromosome of interest with respect to the total number of reads. In the detection of trisomy 21, the mean and standard deviation of the chromosome 21 representation are calculated using a reference group of euploid pregnancies. The z-score is then used to compare the proportional representation of chromosome 21 between the sample and the reference group.
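The tag-counting comparison reduces to two small calculations, sketched below in Python. The function names and the dictionary layout of read counts are hypothetical, and the use of the sample standard deviation of the reference group is an assumption of this sketch.

```python
from statistics import mean, stdev


def chromosome_representation(read_counts: dict, chrom: str) -> float:
    """Proportion of all mapped reads that originate from `chrom`."""
    return read_counts[chrom] / sum(read_counts.values())


def z_score(sample_value: float, reference_values: list) -> float:
    """Standardized deviation of a sample from a euploid reference group."""
    return (sample_value - mean(reference_values)) / stdev(reference_values)
```

A sample whose chromosome 21 representation lies several reference standard deviations above the reference mean would be flagged for trisomy 21. The cutoff itself is a design choice that trades sensitivity against specificity.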

Several groups attempted to directly transfer the tag-counting method to the detection of trisomy 13 and trisomy 18, which are the most clinically important autosomal trisomies apart from trisomy 21 [5]. However, it has been reported that the measurement of genomic representations for chromosome 13 and chromosome 18 tends to be less precise than that of chromosome 21 [43, 60]. This is thought to be due to the GC biases related to chromosomes 13 and 18 [43]. Consequently, one group modified the tag-counting method by implementing GC correction prior to downstream analysis. The GC correction is implemented by fitting a locally weighted scatter plot smoothing regression (LOESS) [5, 51]. First, the reference genome is divided into equal-sized sub-regions of 50 kb, termed bins. LOESS works by fitting a regression between count and GC-content across the bins. The count in a bin is the number of fragments with the 5’ end falling within the corresponding bin, and the GC-content in a bin is the percentage of guanines and cytosines within the corresponding bin. Then, a correction value is calculated by LOESS by measuring the difference between the predicted counts and expected counts in the queried bin. After GC-correction, the group demonstrated that the sensitivity of trisomy 13 and trisomy 18 detection using the tag-counting method improved significantly from 36.0% to 100% and from 73.0% to 91.9%, respectively [5].
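The binning and regression steps can be illustrated with a small, dependency-free Python sketch. It is not the published implementation: the local linear fit with tricube weights is a textbook form of LOESS, the span `frac` is an arbitrary choice, and taking the genome-wide median count as the "expected" count and correcting additively are assumptions made here. The quadratic running time is acceptable only for a toy example.

```python
from statistics import median


def loess_fit(x, y, frac=0.75):
    """Local linear regression with tricube weights, evaluated at each x."""
    n = len(x)
    k = max(2, int(frac * n))  # number of neighbours in each local window
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        bandwidth = dists[k - 1] or 1e-12
        w = [(1 - min(abs(xi - x0) / bandwidth, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


def gc_correct(counts, gc_fractions, frac=0.75):
    """Subtract the GC-dependent trend from per-bin read counts.

    Assumption of this sketch: the expected count is the median over bins,
    and the correction is predicted minus expected.
    """
    predicted = loess_fit(gc_fractions, counts, frac)
    expected = median(counts)
    return [c - (p - expected) for c, p in zip(counts, predicted)]
```

If the counts depend on GC-content alone, the corrected counts become flat across bins, so that any remaining chromosome-level excess is more likely to reflect dosage than GC composition.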

Whole-genome approach: Size-based analysis

Apart from estimating the fractional fetal DNA concentration, the size-based method proposed by Yu et al. is also capable of detecting chromosomal aneuploidies [56]. Early work based on real-time quantitative PCR had demonstrated that fetal cfDNA is in general shorter than maternal cfDNA [61]. The precise difference in the size distribution of these molecules, however, was only revealed later with the use of massively parallel sequencing technologies. The size of each circulating cfDNA fragment is deduced from the outermost coordinates of its corresponding paired-end reads aligned to the human reference genome. Maternal cfDNA has a size profile that shows a predominant peak at 166 bp with a series of small peaks occurring at a periodicity of 10 bp [28, 38]. The size profile of fetal cfDNA at single-base resolution exhibits an apparent reduction of the 166 bp peak and shows a peak at 143 bp with a similar 10 bp periodicity [28], illustrating that fetal cfDNA is shorter than maternal cfDNA. This observation has provided a theoretical basis for the size-based method in the detection of chromosomal aneuploidies. It is expected that the proportion of short molecules (< 150 bp) of a trisomic chromosome relative to a set of reference chromosomes is higher than that in a euploid pregnancy. The group demonstrated that the size-based method could reliably detect aneuploidy with a sensitivity of 95.2% and specificity of 99%.
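The two quantities used by the size-based method, fragment size from paired-end coordinates and the proportion of short molecules, can be sketched as follows. The function names, the 1-based inclusive coordinate convention and the use of a simple difference between the test and reference proportions are assumptions of this sketch; only the 150 bp cutoff comes from the text above.

```python
def fragment_size(leftmost_start: int, rightmost_end: int) -> int:
    """Fragment length from the outermost aligned coordinates of a read pair.

    Coordinates are assumed to be 1-based and inclusive.
    """
    return rightmost_end - leftmost_start + 1


def short_fraction(sizes, cutoff: int = 150) -> float:
    """Proportion of fragments shorter than `cutoff` bp."""
    return sum(s < cutoff for s in sizes) / len(sizes)


def short_fraction_difference(test_sizes, reference_sizes,
                              cutoff: int = 150) -> float:
    """Excess of short fragments on the test chromosome over the reference."""
    return (short_fraction(test_sizes, cutoff)
            - short_fraction(reference_sizes, cutoff))
```

Because the extra chromosome copy is contributed by the fetus, a trisomic chromosome would be expected to show a positive difference relative to euploid reference chromosomes. In practice the difference would be compared with values from a set of euploid pregnancies.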

Targeted approach: allelic ratio analysis

The allelic ratio analysis is similar to the polymorphism dependent approach in calculating the fractional fetal DNA concentration. It also relies on informative SNPs where the mother is homozygous and the fetus is heterozygous [62]. A ratio between the total number of reads containing the fetal-specific allele (F) and the total number of reads containing the shared allele (S), termed the F-S ratio (FSR), is calculated for the targeted chromosome (e.g. chromosome 21) and for a reference chromosome (e.g. chromosome 7). In a euploid pregnancy, these two ratios are expected to be very close. In a pregnancy carrying a paternal-derived trisomy 21, because the extra copy of chromosome 21 in the fetus is from the father, it is expected that the FSR of the targeted chromosome is twice the FSR of the reference chromosome. In a pregnancy carrying a maternal-derived trisomy 21, because the extra copy passed to the fetus is from the mother and its genomic makeup is shared between the fetus and the mother, it is expected that the FSR for the target chromosome is only slightly lower than the FSR for the reference chromosome. Given that sufficient allelic counts are available, both types of trisomy 21 can be detected, but the latter scenario would require many more sequence reads.
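The expected FSR values in the three scenarios follow from simple allele bookkeeping, sketched below in Python. The model is an assumption of this sketch: the mother is AA and contributes two A copies per genome equivalent, the fractional fetal DNA concentration f is measured on a euploid chromosome, and in the paternal-derived trisomy the extra copy carries the fetal-specific B allele.

```python
def expected_fsr(fetal_fraction: float,
                 fetal_b_copies: int,
                 fetal_a_copies: int) -> float:
    """Expected ratio of fetal-specific (B) to shared (A) reads.

    Maternal DNA (AA) contributes 2 A copies with weight (1 - f);
    fetal DNA contributes its own A and B copies with weight f.
    """
    f = fetal_fraction
    fetal_specific = f * fetal_b_copies
    shared = (1 - f) * 2 + f * fetal_a_copies
    return fetal_specific / shared


# Illustrative values at a 10% fractional fetal DNA concentration.
euploid = expected_fsr(0.10, fetal_b_copies=1, fetal_a_copies=1)
paternal_t21 = expected_fsr(0.10, fetal_b_copies=2, fetal_a_copies=1)
maternal_t21 = expected_fsr(0.10, fetal_b_copies=1, fetal_a_copies=2)
```

Under these assumptions the paternal-derived case gives exactly twice the euploid FSR, whereas the maternal-derived case gives an FSR that is lower by a factor of (2 - f)/2, about 5% at f = 0.10. This small difference is why the maternal-derived scenario needs far deeper allelic counts.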

Targeted approach: dosage-type analysis

The targeted dosage-type analysis is similar to the whole-genome tag-counting based analysis. It uses a z-test to assess over-representation of the trisomic chromosome after selectively amplifying and sequencing genomic regions of interest from maternal plasma [44]. In the method developed by Sparks et al., sequence counts were normalized via median polish, which systematically removes sample and genomic location biases. With the normalized counts, a standard z-test was used to classify aneuploidy. They showed that this method is highly specific (99.2%) and sensitive (100%).
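Median polish itself is a generic procedure that can be sketched briefly. The Python function below follows the textbook form (alternately sweeping out row and column medians) on a table whose rows are taken to be samples and whose columns are taken to be genomic loci. That layout, the fixed number of iterations and the function name are assumptions of this sketch and are not details reported for the cited method.

```python
from statistics import median


def median_polish(table, n_iter: int = 10):
    """Decompose table[i][j] into overall + row effect + column effect + residual."""
    rows, cols = len(table), len(table[0])
    resid = [list(row) for row in table]
    row_eff = [0.0] * rows
    col_eff = [0.0] * cols
    overall = 0.0
    for _ in range(n_iter):
        # Sweep out row (sample) medians.
        for i in range(rows):
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
        m = median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Sweep out column (locus) medians.
        for j in range(cols):
            m = median(resid[i][j] for i in range(rows))
            col_eff[j] += m
            for i in range(rows):
                resid[i][j] -= m
        m = median(row_eff)
        overall += m
        row_eff = [r - m for r in row_eff]
    return overall, row_eff, col_eff, resid
```

The residuals are what remain after sample-wide and locus-specific effects have been removed, and it is on such normalized values that a z-test for chromosome dosage could then be applied.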

Targeted approach: Bayesian-based maximal likelihood method

The third method is parental support (PS) [63]. PS is an algorithm that utilizes a Bayesian-based maximal likelihood method to determine the presence of aneuploidy. In this method, samples were first processed by highly multiplexed PCR at 11,000 SNPs on chromosomes 13, 18, 21, X and Y before the application of massively parallel sequencing. In the subsequent bioinformatics analysis, the PS algorithm generates billions of possible genotype combinations by varying the number of chromosomal copies, the fractional fetal DNA concentration and parental genotypes. By comparing the observed allelic distribution generated from the blood sample with the set of genotype combinations provided by PS, it is then possible to deduce the genotype combination that best explains the data. For example, a genotype combination provided by PS under the hypothesis of euploidy would not explain well the set of data derived from a blood sample of a pregnancy with triploidy. This method showed an accuracy of 99.92% in reporting chromosomal copy number. However, a potential limitation of PS is that it requires parental genotypes. As discussed above, this requirement might be difficult to fulfill in certain circumstances.

Noninvasive detection of twin zygosity and twin aneuploidies

Twin pregnancies occur at a small but significant proportion in the population. It was estimated that the average twinning rate across 76 countries is 13.1 per 1,000 births [64]. In the United States of America, the prevalence was higher, at 32 per 1,000 births [65]. More importantly, because most twin pregnancies are associated with advanced maternal age, twin pregnancies are at a higher risk than singleton pregnancies for aneuploidy [66]. Therefore, prenatal screening for twin pregnancies is imperative.

In light of the success of noninvasive prenatal testing for fetal aneuploidies in singleton pregnancies, multiple groups have studied the feasibility of detecting fetal aneuploidies in twin pregnancies [67-69]. One of the key parameters to interpret fetal aneuploidies in twin pregnancies is zygosity. To assess twin zygosity noninvasively, Qu et al. developed the calculation of the apparent fractional fetal DNA concentration [68]. The apparent fractional fetal DNA concentration is calculated at informative SNP loci where the mother is homozygous but at least one of the fetuses is heterozygous with the following equation:

$$\frac{2p}{p+q}$$

where q is the total reads mapped to the highest-count allele and p is the total reads mapped to the second-highest-count allele. In monozygotic twins, because the twins are genetically identical, this fraction should be constant across different chromosomes. In dizygotic twins, because the twins are not genetically identical, it is expected that the fraction would fluctuate across chromosomes at places where the genotypes of the twins are different.
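The zygosity readout can be illustrated by computing the apparent fraction separately for each chromosome or region and looking at how much it varies. The Python sketch below does this; the function names, the dictionary layout and the use of the simple range (maximum minus minimum) as a measure of fluctuation are assumptions made here, and no decision threshold is implied.

```python
def apparent_fraction_by_region(region_counts: dict) -> dict:
    """Apparent fractional fetal DNA concentration, 2p / (p + q), per region.

    region_counts maps a region name to (p, q): reads on the second-highest
    and highest-count alleles at informative SNP loci in that region.
    """
    return {region: 2 * p / (p + q)
            for region, (p, q) in region_counts.items()}


def fraction_spread(fractions: dict) -> float:
    """Range of the apparent fraction across regions."""
    values = list(fractions.values())
    return max(values) - min(values)
```

In this toy setting, counts such as (50, 950) in every region give a nearly constant fraction, consistent with monozygotic twins. A region with roughly half the fetal-specific reads, as would occur where only one of two dizygotic twins is heterozygous, produces a visibly larger spread.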

In addition to determining twin zygosity, the same group pursued the question of detecting aneuploidies in twin pregnancies noninvasively using a three-step algorithm [69]. The first step classifies the twin pregnancy as either euploid or having at least one aneuploid fetus. This is done by comparing the sample’s chromosomal representation of the targeted chromosome (e.g. chromosome 21 or 18) with the reference mean, which was derived from 11 euploid pregnancies (e.g. chromosome 21: 1.311, SD=0.007; chromosome 18: 2.808, SD=0.004), using the classical z-score approach. For instance, a pregnancy with a z-score > 3 is considered as having at least one aneuploid fetus. If at least one of the fetuses is aneuploid, the algorithm proceeds to step two to determine twin zygosity. If the twins are monozygotic, then both twins are aneuploid. If the twins are dizygotic, the algorithm proceeds to step three to determine whether one or both of the twins are affected. In this step, the genomic representation of the chromosome of interest (e.g. chromosome 21 or 18) and the fractional fetal DNA concentration of each fetus are calculated. The values are compared with the genomic representation estimated from a reference group of unaffected singleton pregnancies. The key to this step is that any increment in genomic representation is contributed solely by the trisomic fetus(es). Therefore, if the increment in the observed genomic representation is best explained by the total fractional fetal DNA concentration, both twins are aneuploid. On the other hand, if the increment can only be explained by the contribution of one of the two fetuses, only one twin is aneuploid.
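Step one of the algorithm can be sketched as follows, using the reference means and standard deviations quoted in the text. The dictionary keys, function names and the sample values in the usage lines are hypothetical; steps two and three are not implemented here.

```python
# Reference mean and SD of chromosomal representation quoted in the text
# (derived from 11 euploid pregnancies).
REFERENCE = {
    "chr21": (1.311, 0.007),
    "chr18": (2.808, 0.004),
}


def z_score(sample_representation, reference_mean, reference_sd):
    """Classical z-score of a sample against a reference group."""
    return (sample_representation - reference_mean) / reference_sd


def has_aneuploid_fetus(chrom, sample_representation, threshold=3.0):
    """Step one: flag the pregnancy if the z-score exceeds the threshold."""
    mean, sd = REFERENCE[chrom]
    return z_score(sample_representation, mean, sd) > threshold


# Hypothetical samples: 1.350 gives z of about 5.6, 1.315 gives z of about 0.6.
flagged = has_aneuploid_fetus("chr21", 1.350)
not_flagged = has_aneuploid_fetus("chr21", 1.315)
```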


Noninvasive haplotype based detection of monogenic disorders

With the rapid reduction in cost and the huge amount of data generated by massively parallel sequencing, it has become possible to interrogate the entire fetal genome in maternal plasma [28, 70, 71]. The demonstration that the entire fetal genome is present in maternal plasma means that it is theoretically possible to diagnose any monogenic disease noninvasively. The following section reviews the three canonical bioinformatics algorithms that have been implemented to determine the fetal genome in maternal plasma: relative haplotype dosage analysis (RHDO) [28], the haplotype counting approach [70], and the hidden Markov model (HMM) [71].


RHDO was first introduced by Lo et al. to determine the maternal inheritance of the fetus [28]. β-thalassemia is a monogenic autosomal recessive blood disorder caused by mutations in the haemoglobin-beta (HBB) gene. In that study, both the mother and father were heterozygous carriers of β-thalassemia. Paternal and maternal genotypes were obtained by microarray analysis, and the SNPs were grouped into five categories (Figure 3). Category 3 and category 4 were particularly important in the deduction of the paternal and maternal inheritances of the fetus, respectively. Category 3 SNPs were those for which the mother was homozygous and the father was heterozygous. The paternal inheritance of the fetus could be deduced using this group of SNPs by identifying the presence or absence of the non-maternal allele, i.e. the fetal-specific allele. Category 4 SNPs, for which the mother was heterozygous and the father was homozygous, were used to determine the maternal inheritance of the fetus. Deduction of the maternal inheritance of the fetus is much more challenging because there is no fetal-specific allele that is absent from the maternal genotype. Thus, the determination of the maternal inheritance of the fetus has to be aided by quantitatively comparing the dosage of the two maternal haplotypes.

The central idea of RHDO is to accumulate imbalances of the maternal alleles between the two maternal haplotypes, followed by statistical deduction of the maternal inheritance of the fetus based on these imbalances. There are two maternal haplotypes, Hap I and Hap II (Figure 4). The category 4 SNPs are further divided into two types in accordance with the maternal haplotypes, namely type α and type β. Type α SNPs are defined as those in which the paternal alleles are the same as those on maternal Hap I. If the fetus inherited Hap I from the mother, then an over-representation of Hap I relative to Hap II would be observed in maternal plasma. If the fetus inherited Hap II, then no over-representation would be seen. Type β SNPs are defined as those in which the paternal alleles are the same as those on maternal Hap II. If the fetus inherited Hap I from the mother, then an equal representation of Hap I and Hap II would be maintained in maternal plasma. If the fetus inherited Hap II, then an over-representation of Hap II would be observed; in other words, an under-representation of Hap I would be seen. The proportion of total reads contributed by Hap I relative to Hap II is then subjected to the sequential probability ratio test (SPRT) [3] to determine whether either haplotype is statistically over-represented (Figure 4). An important and unique feature of RHDO is its ability to explore the consensus calls between type α and type β SNPs to further enhance the accuracy. With RHDO, it was demonstrated that the genome-wide parental inheritance could be deduced from maternal plasma. In that study [28], Lo et al. showed that the fetus inherited the mutant haemoglobin-beta (HBB) gene from the father and the wild-type HBB gene from the mother. The fetus was a heterozygous carrier of β-thalassemia, which was concordant with the clinical outcome.
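The dosage comparison described above can be illustrated with a generic Wald SPRT on accumulated read counts. This is a sketch under stated assumptions, not the parameterization used in [3] or [28]: the expected Hap I fractions assume that maternal DNA contributes both haplotypes equally and that the fetal fraction is known, and the error rates, function names and example counts are hypothetical.

```python
import math


def expected_hap1_fraction(snp_type, inherited, fetal_fraction):
    """Expected share of Hap I reads at category 4 SNPs (assumed model)."""
    if snp_type == "alpha" and inherited == "Hap I":
        return (1 + fetal_fraction) / 2  # Hap I over-represented
    if snp_type == "beta" and inherited == "Hap II":
        return (1 - fetal_fraction) / 2  # Hap I under-represented
    return 0.5                           # balanced representation


def sprt_call(hap1_reads, hap2_reads, snp_type, fetal_fraction,
              type1_error=0.001, type2_error=0.001):
    """Wald SPRT: 'fetus inherited Hap I' versus 'fetus inherited Hap II'."""
    p_if_hap1 = expected_hap1_fraction(snp_type, "Hap I", fetal_fraction)
    p_if_hap2 = expected_hap1_fraction(snp_type, "Hap II", fetal_fraction)
    # Binomial log-likelihood ratio of the accumulated counts.
    llr = (hap1_reads * math.log(p_if_hap1 / p_if_hap2)
           + hap2_reads * math.log((1 - p_if_hap1) / (1 - p_if_hap2)))
    upper = math.log((1 - type2_error) / type1_error)
    lower = math.log(type2_error / (1 - type1_error))
    if llr >= upper:
        return "Hap I"
    if llr <= lower:
        return "Hap II"
    return "unclassified"  # keep accumulating SNPs


# Hypothetical accumulated counts over type alpha SNPs, fetal fraction 10%.
call = sprt_call(5500, 4500, "alpha", 0.10)
```

When the result is "unclassified", more SNPs along the haplotype would be accumulated before a call is made; agreement between independent calls on type α and type β SNPs then serves as the consistency check mentioned in the text.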

Haplotype counting is an alternative method proposed by Fan et al. to infer the fetal genome from maternal plasma. This method is based on counting the relative representation of the haplotype pairs of each parent [70]. In a pregnant woman’s plasma, there are three haplotypes: the maternal haplotype that is transmitted to the fetus, the maternal haplotype that is not transmitted to the fetus, and the paternal haplotype that is transmitted to the fetus. If the fractional fetal DNA concentration in maternal plasma and the number of genome equivalents are assumed to be f and G, respectively, then the relative genome equivalent of the un-transmitted maternal haplotype is G*(1-f), and the relative genome equivalent of the transmitted maternal haplotype is G. Analogously, the relative genome equivalent of the transmitted paternal haplotype is G*f, and that of the un-transmitted paternal haplotype is 0. In each pair of parental haplotypes, the transmitted haplotype is over-represented relative to the non-transmitted one. Therefore, with sufficiently high sequencing depth and sufficiently long haplotype blocks, the maternally transmitted haplotype can be deduced by counting the number of alleles that are aligned to each maternal haplotype (Figure 5). For the paternal part, inheritance can be traced by identifying markers that are not present in the maternal genome. Thus the whole fetal genome can be inferred noninvasively.
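The expected dosages stated above can be written out directly. The snippet below is a simplified sketch: the function names and example values are hypothetical, and the haplotype call is a plain comparison of counts without the statistical testing that real data would require.

```python
def expected_haplotype_dosage(genome_equivalents, fetal_fraction):
    """Relative genome equivalents of each parental haplotype in plasma."""
    g, f = genome_equivalents, fetal_fraction
    return {
        "maternal_transmitted": g,
        "maternal_untransmitted": g * (1 - f),
        "paternal_transmitted": g * f,
        "paternal_untransmitted": 0.0,
    }


def call_transmitted_maternal_haplotype(reads_hap1, reads_hap2):
    """Pick the over-represented maternal haplotype within one block."""
    if reads_hap1 == reads_hap2:
        return None  # no imbalance observed
    return "Hap 1" if reads_hap1 > reads_hap2 else "Hap 2"


# Hypothetical values: G = 1000 genome equivalents, f = 10%.
dosage = expected_haplotype_dosage(1000, 0.10)
called = call_transmitted_maternal_haplotype(10450, 9550)
```

With f = 10%, the un-transmitted maternal haplotype is expected at 90% of the dosage of the transmitted one, which is why deep sequencing and long haplotype blocks are needed to resolve the difference.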

Kitzman et al. reported another method to infer the fetal genome from maternal plasma [71]. The maternal inheritance of the fetus was inferred by an HMM (Figure 6). Classically, an HMM has three components: latent states, emission probabilities and transition probabilities. In Kitzman et al.’s HMM implementation, the maternal inheritance at each SNP is determined by two factors: the maternal inheritance at the previous SNP being interrogated (latent state) and the SNP type (α or β) (emission probability). The model also accounts for natural haplotype switching events such as genetic recombination (transition probability). The Viterbi algorithm, a recursive algorithm that searches for the state sequence with the maximum probability, was used to generate the most probable latent state sequence. Altogether, the maternal inheritance of the fetus could be deduced. The paternal inheritance of the fetus was inferred using a method similar to that of Lo et al.
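To make the three HMM components concrete, the following is a minimal two-state Viterbi sketch. It is not Kitzman et al.’s implementation: the emission model (a binomial likelihood of Hap I versus Hap II read counts given the SNP type and a known fetal fraction), the switch probability, the uniform prior and all names are assumptions made for illustration.

```python
import math

# State 0: fetus inherited maternal Hap I; state 1: fetus inherited Hap II.


def expected_hap1_fraction(snp_type, state, fetal_fraction):
    """Assumed expected share of Hap I reads for a SNP type and state."""
    if snp_type == "alpha":
        return (1 + fetal_fraction) / 2 if state == 0 else 0.5
    return 0.5 if state == 0 else (1 - fetal_fraction) / 2


def emission_loglik(snp, state, fetal_fraction):
    """Binomial log-likelihood (constant binomial coefficient omitted)."""
    snp_type, hap1_reads, hap2_reads = snp
    p = expected_hap1_fraction(snp_type, state, fetal_fraction)
    return hap1_reads * math.log(p) + hap2_reads * math.log(1 - p)


def viterbi(snps, fetal_fraction, switch_prob=1e-3):
    """Most probable sequence of inherited haplotypes along the SNPs."""
    stay, switch = math.log(1 - switch_prob), math.log(switch_prob)
    scores = [math.log(0.5) + emission_loglik(snps[0], s, fetal_fraction)
              for s in (0, 1)]
    backpointers = []
    for snp in snps[1:]:
        new_scores, pointers = [], []
        for state in (0, 1):
            candidates = [scores[prev] + (stay if prev == state else switch)
                          for prev in (0, 1)]
            best_prev = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best_prev)
            new_scores.append(candidates[best_prev]
                              + emission_loglik(snp, state, fetal_fraction))
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = 0 if scores[0] >= scores[1] else 1
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Hypothetical data: five SNPs supporting Hap I, then five supporting Hap II,
# as would be seen around a recombination event.
snps = [("alpha", 120, 80)] * 5 + [("beta", 80, 120)] * 5
path = viterbi(snps, fetal_fraction=0.2)
```

The small switch probability plays the role of the transition probability: a single noisy SNP is not enough to change the inferred haplotype, whereas a sustained run of contrary evidence is.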

In summary, RHDO, the haplotype counting method and the HMM method provide similar accuracy in the detection of the maternal inheritance of the fetus. The main differences among these algorithms lie in the experimental protocols used beforehand to obtain the maternal haplotypes and in their mathematical complexity.

For RHDO, as a proof-of-concept demonstration, the maternal haplotype was deduced by microarray genotyping of tissue samples from the family trio (father, mother and fetus). However, it is theoretically possible to obtain the maternal haplotype noninvasively by collecting genotype information from other family members. If familial genotype information is not readily available, then this experimental protocol cannot be used. The experimental protocols for both the haplotype counting method and the HMM are capable of deducing the maternal haplotype without genotype information from other family members. However, these methods require significant technological expertise to obtain the maternal haplotypes from the extracted maternal cells. The experimental protocol for the haplotype counting method relies on a microfluidic device to perform direct deterministic phasing as well as the isolation of blood cells at metaphase. This is both technologically challenging and extremely labor-intensive. The experimental protocol for the HMM obtains the maternal haplotype via clone-pool dilution sequencing. Briefly, the maternal genome was sheared, cloned into fosmids and cultured within Escherichia coli. The clones were subsequently sequenced and the maternal haplotypes were reconstructed. This method is also labor-intensive, and the whole library preparation process takes about a week [72].

In terms of mathematical complexity, RHDO and the haplotype counting method are simpler because both rely on counting alleles and testing for the accumulated imbalance between maternal haplotypes. Additionally, the separation of type α and type β alleles in RHDO allows for a consistency check to reduce false positive calls. In contrast, the HMM is mathematically more challenging. Implementing the algorithm requires an in-depth understanding of probability and Markov chains.

In summary, amongst these three experimental protocols, RHDO should be easier to implement and more cost-effective if familial genotype information is available. If familial genotype information is unavailable, then the experimental protocols for the haplotype counting method and the HMM could be applied, depending on the availability of technology and professional expertise.

Estimation of fractional tumor DNA concentration

Circulating tumor cfDNA (ctDNA) was first detected in cancer patients in 1996 [2]. Since then, numerous efforts have been made to detect and quantify ctDNA in plasma and to investigate the clinical value of these circulating nucleic acids. Earlier efforts using conventional PCR explored the clinical utility of detecting multiple oncogenes in ctDNA [16-18]. However, the heterogeneity of tumor cells poses an extra layer of complexity in the analysis of ctDNA using conventional PCR assays. For instance, some mutations that are only present in subclones of a tumor may be rare, and thus fewer of the tumor DNA molecules carrying these mutations are released into plasma. As a result, mutations detected in tumors might not be detected in the circulating nucleic acids. Conversely, some mutations existing in subclones may be missed during sampling of tumor tissues even though a considerable amount of these mutations is present in the circulating nucleic acids. Therefore, mutations detected in plasma might not be found in the matched tumor tissues. A notable example is the KRAS mutation, which has been frequently detected in tumor tissues such as those of the colon, lung and pancreas, but whose detection in blood samples has not been highly consistent [18, 73, 74].

The maturation of massively parallel sequencing has shed new light on ctDNA research. It allows for the detection of mutations present at lower frequencies in the tumor and in the circulation. This enables the precise quantification of important tumor-related information such as the concentration of ctDNA, which reflects the tumor burden in cancer patients. The concentration of ctDNA can be estimated from sequencing data in two ways: the mutant allele approach and the loss of heterozygosity approach (Figure 7).

Mutant allele approach

The mutant allele approach requires either a predetermined set of mutations (such as those commonly found in certain cancers) [7, 10, 11] or prior analysis of a resected tumor [8]. In the latter case, monitoring of ctDNA is useful for the early detection of tumor recurrence. The concentration of ctDNA in plasma can either be expressed as the number of mutant fragments per volume (e.g. 100 fragments per 5 ml) [7, 10] or be calculated by the following equation: $\frac{2m}{m+w}$, where m is the number of reads containing the mutant alleles and w is the number of reads containing the wild-type alleles [8] (Figure 7). The advantage of this method is that the calculation is relatively straightforward. The downside of relying on prior tumor analysis is that it is invasive, because a tumor sample is required for the detection of lower-frequency alleles. With the use of a predetermined set of mutations, on the other hand, there is no need to analyze a tumor biopsy. However, the major drawback of using a predetermined set of mutations is that the calculated concentration of ctDNA might be less precise because individual-specific mutations might be missed by the preset panel.
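The read-based calculation, 2m/(m+w), can be sketched in a few lines. The function name and the read counts are hypothetical. The factor of 2 is consistent with assuming a heterozygous mutation in a copy-neutral region, so that only half of the tumor-derived molecules at the locus carry the mutant allele; that interpretation is an assumption stated here rather than a claim from the source.

```python
def tumor_fraction_from_mutant_reads(mutant_reads, wildtype_reads):
    """Fractional tumor DNA concentration as 2m / (m + w)."""
    total = mutant_reads + wildtype_reads
    if total == 0:
        raise ValueError("no reads covering the mutation sites")
    return 2 * mutant_reads / total


# Hypothetical counts pooled over the tracked mutations.
fraction = tumor_fraction_from_mutant_reads(mutant_reads=50,
                                            wildtype_reads=950)
```

With 50 mutant reads out of 1,000, the mutant allele frequency is 5% and the estimated fractional tumor DNA concentration is 10%.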

Loss of heterozygosity approach

Chan et al. developed an approach, termed genome-wide aggregated allelic loss (GAAL), that utilizes the heterozygous SNPs residing in regions exhibiting loss of heterozygosity (LOH) [8]. For GAAL, LOH-associated heterozygous SNPs were first identified by microarray analysis of the tumor biopsy. Alleles that were deleted in the tumor would exist at a lower frequency in the patient’s plasma compared with the non-deleted alleles. The proportion of reads carrying the deleted alleles would thus be more reflective of the non-tumor fraction of the circulating cfDNA, while the proportion of reads harboring the non-deleted alleles would be more enriched for tumor-derived DNA. Therefore, the fractional tumor DNA concentration could be deduced by the following equation: $\frac{N_{nondel} - N_{del}}{N_{nondel}}$, where $N_{nondel}$ represents the number of sequenced reads carrying the non-deleted alleles and $N_{del}$ represents the number of sequenced reads carrying the deleted alleles (Figure 7).
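The GAAL equation aggregates reads over all LOH-associated SNPs before taking the ratio. A minimal sketch, with a hypothetical function name and made-up read counts:

```python
def gaal_tumor_fraction(snp_counts):
    """Fractional tumor DNA concentration as (N_nondel - N_del) / N_nondel.

    snp_counts is a list of (nondeleted_reads, deleted_reads) pairs,
    one pair per heterozygous SNP located in an LOH region.
    """
    n_nondel = sum(nondel for nondel, _ in snp_counts)
    n_del = sum(deleted for _, deleted in snp_counts)
    if n_nondel == 0:
        raise ValueError("no reads carrying non-deleted alleles")
    return (n_nondel - n_del) / n_nondel


# Hypothetical counts at three LOH-associated SNPs.
loh_snps = [(520, 480), (510, 470), (530, 490)]
fraction = gaal_tumor_fraction(loh_snps)
```

Here 1,560 reads carry non-deleted alleles and 1,440 carry deleted alleles, giving an estimate of about 7.7%. Aggregating across many SNPs reduces the sampling noise that any single SNP would show at this depth.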

The fractional tumor DNA concentration has been demonstrated to correlate well with tumor size [7, 8, 10, 11]. As a result, it can be a useful parameter for comparing tumor load across different types of cancers. Furthermore, multiple studies have shown that the tumor fraction falls to an extremely low level after surgery [7, 8, 11]. Therefore, the tumor fraction can be useful in monitoring cancer patients in both the short term and the long term. In cancer patients immediately after surgery, persistence of a small tumor fraction in plasma might suggest the existence of residual tumor. In the long term, a raised tumor fraction might indicate recurrence. ctDNA thus offers potential for the early detection of recurrent tumors and for early intervention.

Noninvasive detection of copy number aberrations in cancer patients

Besides measuring the fractional tumor DNA concentration to assess tumor load in patients’ plasma, copy number changes are another possible biomarker for the noninvasive detection of tumors. Copy number aberrations have been found in almost all tumors [75-78]. In some cases, amplification of oncogenes was found to be the culprit for tumorigenesis [79, 80]. Therefore, noninvasive detection of copy number aberrations is increasingly recognized as a desirable approach for cancer detection and monitoring [8, 9, 81-83]. However, in contrast to the noninvasive detection of fetal aneuploidy, the characteristics and exact regions of the human genome affected by copy number aberrations are generally not known in advance. For example, the sizes, genomic locations, and amplitudes of such aberrations vary among different cancer types. Thus, for a cancer patient, genome-wide scanning of copy number aberrations would be extremely useful.

Chan et al. developed an approach for this purpose by dividing the human genome into equally sized bins of 1 Mb in length [8, 9]. That is, approximately 3,000 bins are analyzed simultaneously across the whole human genome. The genomic representation of each 1-Mb bin is calculated and adjusted according to its GC content to improve the detection power. Copy number gains and losses in the tumor would then be reflected by increased and decreased genomic representation in the circulation, respectively. Their method compares the genomic representation of cancer patients with that of healthy subjects using the z-score approach for each 1-Mb bin. One technical issue with this method is that it may introduce false positives due to multiple comparisons. This issue can be minimized by adopting a more stringent threshold for positive calls, for example by classifying a patient as cancer positive only when a certain proportion of bins show aberrations [9], or by applying Bonferroni correction to the p-values [83].
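The per-bin comparison and the two ways of handling multiple comparisons can be sketched as below. This is an illustration only: GC correction is omitted, the cutoffs (a z-score of 3 and a family-wise alpha of 0.05) are assumed rather than taken from [9] or [83], and the three-bin example is far smaller than a real genome-wide scan.

```python
from statistics import NormalDist, mean, stdev


def bin_z_scores(patient_bins, reference_bins):
    """z-score of each bin against the same bin in healthy subjects.

    reference_bins holds one list of per-bin values per healthy subject.
    """
    scores = []
    for i, value in enumerate(patient_bins):
        ref = [subject[i] for subject in reference_bins]
        scores.append((value - mean(ref)) / stdev(ref))
    return scores


def bonferroni_aberrant_bins(z_scores, family_alpha=0.05):
    """Indices of bins whose two-sided p-value survives Bonferroni."""
    per_bin_alpha = family_alpha / len(z_scores)
    std_normal = NormalDist()
    flagged = []
    for i, z in enumerate(z_scores):
        p_value = 2 * (1 - std_normal.cdf(abs(z)))
        if p_value < per_bin_alpha:
            flagged.append(i)
    return flagged


def fraction_of_aberrant_bins(z_scores, z_cutoff=3.0):
    """Alternative criterion: proportion of bins beyond the cutoff."""
    return sum(abs(z) > z_cutoff for z in z_scores) / len(z_scores)


# Hypothetical genomic representations for three bins.
reference = [[0.98, 1.00, 1.02], [1.00, 1.02, 0.98], [1.02, 0.98, 1.00]]
patient = [1.00, 1.20, 1.01]
z = bin_z_scores(patient, reference)
gained_or_lost = bonferroni_aberrant_bins(z)
proportion = fraction_of_aberrant_bins(z)
```

With roughly 3,000 bins, a Bonferroni-corrected family-wise alpha of 0.05 corresponds to a per-bin threshold of about 1.7e-5, which shows how much stricter the per-bin criterion becomes.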

Another approach to reducing false positives from the effect of multiple comparisons is to assess copy number aberrations at the chromosomal arm level. Leary et al. introduced the plasma aneuploidy (PA) score to look at chromosomal arm changes in breast cancer and colon cancer patients [83]. This approach combined information from the five chromosomal arms that demonstrated the most aberrant changes to determine both the cancer status and the tumor load in patients.

Comparing the two methods, the bin method is more generalized and can be applied to different cancer types, whereas the PA score would have a more limited spectrum of application, namely cancer types that primarily affect entire chromosomal arms. However, the bin method is likely to be less specific than the PA score because of the effect of multiple comparisons. Nonetheless, both methods demonstrate the important clinical utility of copy number changes in noninvasive cancer detection. Furthermore, Chan et al. demonstrated that, with the bin method, copy number changes virtually disappeared from the plasma of postoperative cancer patients [8], suggesting that their method has additional utility in monitoring tumor clearance and recurrence.

Noninvasive detection of genomic rearrangements in cancer patients

Genomic rearrangement is common in many cancers. A notable example is the formation of the fusion gene BCR-ABL as a result of a chromosomal translocation (Philadelphia chromosome) responsible for the development of chronic myeloid leukemia [84]. Other recurrent genomic rearrangements involving the immunoglobulin genes, T-cell receptor genes and retinoic acid receptor alpha gene have been reported to be associated with haematological malignancies [85-87]. Yet, recurrent genomic rearrangements are not common among solid tumors [88]. Solid tumors often harbor genomic rearrangements that are individualized. Therefore, detection of these genomic rearrangements from solid tumors could aid in the understanding of tumor characteristics as well as the development of more personalized treatment.

Leary et al. recently developed a noninvasive version of personalized analysis of rearranged ends (PARE) [83, 88] to detect genomic rearrangements in the plasma of cancer patients. Originally, PARE was developed to identify patient-specific genomic rearrangements in solid tumors [88]. These specific genomic rearrangements were useful in monitoring cancer progression and remission and in guiding personalized care. The method utilized mate-pair library sequencing to identify reads that contained genomic rearrangement breakpoints. These genomic rearrangements were then further evaluated by PCR amplification across the rearrangement junctions. For noninvasive PARE, additional bioinformatics filters were applied to remove alignment artifacts and common germline polymorphisms. In their study, they analyzed plasma samples from 10 cancer patients and 10 normal subjects, revealing 14 rearrangements in 9 of the 10 plasma samples from cancer patients that were not found in normal subjects.
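The general idea of finding rearrangement candidates from paired reads can be illustrated with a generic discordant-pair filter. This is not the PARE pipeline: the expected mate distance, the tolerance, the tuple layout and the function names are all hypothetical, read orientation is ignored, and the artifact and germline filters mentioned above are not modeled.

```python
def is_discordant(pair, expected_distance=3000, tolerance=1500):
    """Flag a mate pair that maps inconsistently with the library design.

    pair is (chrom_1, position_1, chrom_2, position_2).
    """
    chrom1, pos1, chrom2, pos2 = pair
    if chrom1 != chrom2:
        return True  # mates on different chromosomes
    return abs(abs(pos2 - pos1) - expected_distance) > tolerance


def candidate_rearrangements(pairs, expected_distance=3000, tolerance=1500):
    """Keep only the pairs that suggest a rearrangement breakpoint."""
    return [pair for pair in pairs
            if is_discordant(pair, expected_distance, tolerance)]


# Hypothetical alignments of three mate pairs.
pairs = [
    ("chr1", 1000, "chr1", 4100),    # consistent with the library
    ("chr1", 1000, "chr5", 2000),    # inter-chromosomal
    ("chr2", 1000, "chr2", 900000),  # mates far too distant
]
candidates = candidate_rearrangements(pairs)
```

In practice, candidates supported by several independent pairs would then be confirmed experimentally, as the text describes for PCR across the junctions.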

Although Leary et al.’s method for detecting genomic rearrangements in cancer patients has high specificity, it is not suitable for screening purposes because only a handful of rearrangements exist in the cancer genome. Moreover, the sequencing depth required for the detection of genomic rearrangements is high. With the current cost of sequencing, it is still not cost-effective to use this method for screening in the general population. However, this method is suitable for follow-up purposes because the rearrangements would presumably have already been identified from the excised tumor samples. Noninvasive targeted sequencing could then be applied and would be comparatively more cost-effective.

Noninvasive methylomic analysis of circulating cfDNA

A recent application of massively parallel sequencing of circulating cfDNA is methylomic analysis. Epigenetic mechanisms play an important role in fetal development, postnatal consequences [89, 90], tumorigenesis and tumor progression [91, 92]. Multiple fetal-specific [93-95] and tumor-specific [96-98] cfDNA methylation-based biomarkers have been discovered with conventional PCR assays. Yet, these markers only provide a limited picture of the fetal and tumor methylomes. The whole-genome methylome in plasma has recently become accessible with the advent of massively parallel bisulfite sequencing. In the following section, we briefly discuss the bioinformatics algorithms commonly used in the methylomic analysis of cfDNA and give a more in-depth overview of the clinical applications in which bioinformatics is applied.

Integrated bioinformatics packages for methylomic analysis

The methylome can be assessed by first treating the DNA molecules with bisulfite [99, 100], which converts unmethylated cytosines into uracils while leaving methylated cytosines unchanged. The uracils are then converted into thymines during PCR amplification. As a result, a modified DNA sequence is obtained that reveals the methylation status.
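The conversion logic can be demonstrated in silico. The sketch below treats a single forward strand only and assumes complete conversion and error-free reads; the function names and the toy sequence are hypothetical and do not correspond to any of the alignment packages discussed in this review.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Simulate bisulfite treatment followed by PCR on one strand.

    Unmethylated C is read as T; methylated C (0-based positions given
    in methylated_positions) stays C.
    """
    converted = []
    for i, base in enumerate(sequence):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference, read):
    """Compare a converted read with the reference at every cytosine.

    Returns {position: True if methylated, False if unmethylated}.
    """
    return {i: read[i] == "C"
            for i, base in enumerate(reference)
            if base == "C" and read[i] in "CT"}


reference = "ACGTCCG"
read = bisulfite_convert(reference, methylated_positions={1})
calls = call_methylation(reference, read)
```

The same C-to-T ambiguity is what makes alignment of bisulfite-treated reads harder than ordinary alignment, since a T in a read may correspond to either a T or an unmethylated C in the reference.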

Multiple bisulfite sequencing alignment tools have been developed [101-106], and extensive reviews covering both the alignment methods [107] and the packages [108] are available elsewhere. Of note, while most of these alignment tools can align bisulfite-treated fragments efficiently and accurately, very few of them provide downstream analysis such as the identification of differentially methylated regions (DMRs). To date, BSmooth [105] and Methy-Pipe [106] have been developed to perform an integrative analysis including alignment, methylation level determination, DMR identification and DMR annotation in one package. In terms of alignment speed, Methy-Pipe outperforms Bismark [104], which has been shown to be faster than most other packages. Both BSmooth and Methy-Pipe can identify DMRs, which are central to many methylomic analyses nowadays. The integrative nature of these two bioinformatics packages renders them well-suited for the comprehensive analysis of bisulfite sequencing data.


Methylomic analysis in plasma of pregnant women

The availability of bisulfite sequencing and the corresponding bioinformatics software has enabled noninvasive prenatal methylomic analysis. Lun et al. explored this possibility using whole-genome bisulfite sequencing of maternal plasma cfDNA [109]. They were able to deduce the placental methylome noninvasively. One approach was based on the analysis of fetal-specific polymorphic alleles. Another approach relied on knowing the fractional fetal DNA concentration and the methylome of blood cells to deduce the placental methylome in reverse. An example of a clinical application of noninvasive prenatal methylomic analysis would be the detection of trisomy 21. These technologies could provide greater insight into the underlying mechanisms and locations of placental- or fetal-specific methylation changes at individual CpG residues and may aid in the further identification of potential epigenetic biomarkers for prenatal diagnosis.

Methylomic analysis in plasma of cancer patients

Chan et al. assessed the whole-genome methylome in the plasma of cancer patients [9]. Detection was based on comparing the methylation density of cancer patients in 1-Mb bins with that of a group of healthy individuals using the z-score approach. They showed that, in cancer patients, genome-wide hypomethylation could be reliably detected with good sensitivity (68%) and high specificity (94%) even at a relatively low sequencing depth (~10 million reads per case).

As illustrated by the examples above, noninvasive methylomic analysis of cfDNA is a promising tool for personalized medicine. In particular, methylation plays an instrumental role in cancer development. DNA hypomethylation is common in cancer cells and is known to promote tumorigenesis through the transcriptional activation of proto-oncogenes. Noninvasive methylomic analysis thus offers the potential for more precise stratification of cancer subtypes, thereby enabling more personalized therapeutic treatment for cancer patients in the future.

Conclusion

In the past decade, we have witnessed rapid advancements in the technologies and bioinformatics algorithms available for the analysis of circulating cfDNA. With the availability of massively parallel sequencing and the development of sophisticated bioinformatics, noninvasive prenatal testing has become increasingly mature and has established itself as a leading field in translational research. Of note, a couple of large-scale clinical validation studies have been conducted to demonstrate the high sensitivity of noninvasive prenatal screening for fetal aneuploidies [4, 110]. Noninvasive prenatal testing has already become an integral part of clinical practice in obstetric clinics.

Building on the success of noninvasive prenatal diagnosis, noninvasive cancer assessment has gained much popularity in recent years. Because of the added complexity of cancer genetics, bioinformatics algorithms developed for noninvasive prenatal testing can be transferred conceptually but require significant technical adaptation. Some of these algorithms appear promising in assessing tumor load and remission. Recently, the development of bisulfite sequencing and dedicated bioinformatics software has opened up another area for further research in prenatal diagnosis and cancer management. The feasibility of obtaining the genome-wide methylome noninvasively has enabled the identification of differentially methylated regions that could be particularly useful in cancer screening, monitoring and the development of personalized therapeutic programs.

Even though numerous applications of noninvasive prenatal and cancer diagnosis have been demonstrated to be feasible in clinical practice, the high sequencing cost currently necessary for noninvasive testing is still a bottleneck for routine clinical implementation. Most applications mentioned in our review are still very expensive, except for prenatal aneuploidy screening tests, which have already been adopted universally. For example, the throughput of HiSeq is currently around 3.2 billion reads per flow cell, and one flow cell costs around US $40,000. For aneuploidy testing, it is possible to combine hundreds of samples in one flow cell, so the cost can be averaged out. However, other applications such as whole fetal genome assembly would require an entire flow cell for a single sample. This would be too costly for routine use in clinical settings.
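The averaging argument can be made explicit with the figures quoted above. The multiplexing level of 200 samples is a hypothetical value chosen only to illustrate "hundreds of samples"; the calculation ignores library preparation and other per-sample costs.

```python
READS_PER_FLOW_CELL = 3.2e9       # figure quoted in the text
COST_PER_FLOW_CELL_USD = 40_000   # figure quoted in the text


def per_sample(samples_per_flow_cell):
    """Reads and sequencing cost per sample when a flow cell is shared."""
    reads = READS_PER_FLOW_CELL / samples_per_flow_cell
    cost = COST_PER_FLOW_CELL_USD / samples_per_flow_cell
    return reads, cost


# Hypothetical multiplexing for aneuploidy testing versus one whole genome.
aneuploidy_reads, aneuploidy_cost = per_sample(200)
genome_reads, genome_cost = per_sample(1)
```

At 200 samples per flow cell, each sample receives 16 million reads at a sequencing cost of US $200, compared with US $40,000 when a single sample occupies the whole flow cell.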

As the sequencing costs continue to drop, we expect to see a proliferation of noninvasive testing offered at the clinic. The role of bioinformatics in noninvasive testing to efficiently analyze these data is thus of increasing importance in future research.


References

1. Lo YMD, Corbetta N, Chamberlain PF, Rai V, Sargent IL, Redman CW, et al. Presence of fetal DNA in maternal plasma and serum. Lancet. 1997;350:485-7.
2. Chen XQ, Stroun M, Magnenat JL, Nicod LP, Kurt AM, Lyautey J, et al. Microsatellite alterations in plasma DNA of small cell lung cancer patients. Nature medicine. 1996;2:1033-5.
3. Lo YMD, Lun FM, Chan KCA, Tsui NB, Chong KC, Lau TK, et al. Digital PCR for the molecular detection of fetal chromosomal aneuploidy. Proceedings of the National Academy of Sciences of the United States of America. 2007;104:13116-21.
4. Chiu RWK, Akolekar R, Zheng YW, Leung TY, Sun H, Chan KC, et al. Non-invasive prenatal assessment of trisomy 21 by multiplexed maternal plasma DNA sequencing: Large scale validity study. BMJ. 2011;342:c7401.
5. Chen EZ, Chiu RWK, Sun H, Akolekar R, Chan KC, Leung TY, et al. Noninvasive prenatal diagnosis of fetal trisomy 18 and trisomy 13 by maternal plasma DNA sequencing. PLoS One. 2011;6:e21791.
6. Ehrich M, Deciu C, Zwiefelhofer T, Tynan JA, Cagasan L, Tim R, et al. Noninvasive detection of fetal trisomy 21 by sequencing of DNA in maternal blood: A study in a clinical setting. American journal of obstetrics and gynecology. 2011;204:205 e1-11.
7. Newman AM, Bratman SV, To J, Wynne JF, Eclov NC, Modlin LA, et al. An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nature medicine. 2014;20:548-54.
8. Chan KCA, Jiang P, Zheng YW, Liao GJ, Sun H, Wong J, et al. Cancer genome scanning in plasma: Detection of tumor-associated copy number aberrations, single-nucleotide variants, and tumoral heterogeneity by massively parallel sequencing. Clinical chemistry. 2013;59:211-24.
9. Chan KCA, Jiang P, Chan CW, Sun K, Wong J, Hui EP, et al. Noninvasive detection of cancer-associated genome-wide hypomethylation and copy number aberrations by plasma DNA bisulfite sequencing. Proceedings of the National Academy of Sciences of the United States of America. 2013;110:18761-8.
10. Bettegowda C, Sausen M, Leary RJ, Kinde I, Wang Y, Agrawal N, et al. Detection of circulating tumor DNA in early- and late-stage human malignancies. Science translational medicine. 2014;6:224ra24.
11. Forshew T, Murtaza M, Parkinson C, Gale D, Tsui DW, Kaper F, et al. Noninvasive identification and monitoring of cancer mutations by targeted deep sequencing of plasma DNA. Science translational medicine. 2012;4:136ra68.
12. Lo YMD, Tein MS, Lau TK, Haines CJ, Leung TN, Poon PM, et al. Quantitative analysis of fetal DNA in maternal plasma and serum: Implications for noninvasive prenatal diagnosis. American journal of human genetics. 1998;62:768-75.
13. Lo YMD, Hjelm NM, Fidler C, Sargent IL, Murphy MF, Chamberlain PF, et al. Prenatal diagnosis of fetal RhD status by molecular analysis of maternal plasma. The New England journal of medicine. 1998;339:1734-8.
14. Lo YMD, Zhang J, Leung TN, Lau TK, Chang AM, Hjelm NM. Rapid clearance of fetal DNA from maternal plasma. American journal of human genetics. 1999;64:218-24.
15. Lui YY, Chik KW, Chiu RWK, Ho CY, Lam CW, Lo YMD. Predominant hematopoietic origin of cell-free DNA in plasma and serum after sex-mismatched bone marrow transplantation. Clinical chemistry. 2002;48:421-7.
16. Su YH, Wang M, Brenner DE, Norton PA, Block TM. Detection of mutated K-ras DNA in urine, plasma, and serum of patients with colorectal carcinoma or adenomatous polyps. Annals of the New York Academy of Sciences. 2008;1137:197-206.
17. Shinozaki M, O'Day SJ, Kitago M, Amersi F, Kuo C, Kim J, et al. Utility of circulating B-RAF DNA mutation in serum for monitoring melanoma patients receiving biochemotherapy. Clinical cancer research : an official journal of the American Association for Cancer Research. 2007;13:2068-74.
18. Wang S, An T, Wang J, Zhao J, Wang Z, Zhuo M, et al. Potential clinical significance of a plasma-based KRAS mutation analysis in patients with advanced non-small cell lung cancer. Clinical cancer research : an official journal of the American Association for Cancer Research. 2010;16:1324-30.
19. Rijnders RJ, van der Schoot CE, Bossers B, de Vroede MA, Christiaens GC. Fetal sex determination from maternal plasma in pregnancies at risk for congenital adrenal hyperplasia. Obstetrics and gynecology. 2001;98:374-8.


20. Geifman-Holtzman O, Grotegut CA, Gaughan JP. Diagnostic accuracy of noninvasive fetal Rh genotyping from maternal blood--a meta-analysis. American journal of obstetrics and gynecology. 2006;195:1163-73.
21. Jung K, Fleischhacker M, Rabien A. Cell-free DNA in the blood as a solid tumor biomarker--a critical appraisal of the literature. Clinica chimica acta; international journal of clinical chemistry. 2010;411:1611-24.
22. Schwarzenbach H, Hoon DS, Pantel K. Cell-free nucleic acids as biomarkers in cancer patients. Nature reviews Cancer. 2011;11:426-37.
23. Lo YMD, Chiu RWK. Genomic analysis of fetal nucleic acids in maternal blood. Annual review of genomics and human genetics. 2012;13:285-306.
24. De Mattos-Arruda L, Cortes J, Santarpia L, Vivancos A, Tabernero J, Reis-Filho JS, et al. Circulating tumour cells and cell-free DNA as tools for managing breast cancer. Nature reviews Clinical oncology. 2013;10:377-89.
25. Bianchi DW. Circulating fetal DNA: Its origin and diagnostic potential--a review. Placenta. 2004;25 Suppl A:S93-S101.
26. Stroun M, Lyautey J, Lederrey C, Olson-Sand A, Anker P. About the possible origin and mechanism of circulating DNA apoptosis and active DNA release. Clinica chimica acta; international journal of clinical chemistry. 2001;313:139-42.
27. Alberry M, Maddocks D, Jones M, Abdel Hadi M, Abdel-Fattah S, Avent N, et al. Free fetal DNA in maternal plasma in anembryonic pregnancies: Confirmation that the origin is the trophoblast. Prenatal diagnosis. 2007;27:415-8.
28. Lo YMD, Chan KCA, Sun H, Chen EZ, Jiang P, Lun FM, et al. Maternal plasma DNA sequencing reveals the genome-wide genetic and mutational profile of the fetus. Science translational medicine. 2010;2:61ra91.
29. Fan HC, Blumenfeld YJ, Chitkara U, Hudgins L, Quake SR. Analysis of the size distributions of fetal and maternal cell-free DNA by paired-end sequencing. Clinical chemistry. 2010;56:1279-86.
30. Mouliere F, Robert B, Arnau Peyrotte E, Del Rio M, Ychou M, Molina F, et al. High fragmentation characterizes tumour-derived circulating DNA. PloS one. 2011;6:e23418.
31. Mouliere F, El Messaoudi S, Gongora C, Guedj AS, Robert B, Del Rio M, et al. Circulating cell-free DNA from colorectal cancer patients may reveal high KRAS or BRAF mutation load. Translational oncology. 2013;6:319-28.
32. Mouliere F, El Messaoudi S, Pang D, Dritschilo A, Thierry AR. Multi-marker analysis of circulating cell-free DNA toward personalized medicine for colorectal cancer. Molecular oncology. 2014;8:927-41.
33. Umetani N, Kim J, Hiramatsu S, Reber HA, Hines OJ, Bilchik AJ, et al. Increased integrity of free circulating DNA in sera of patients with colorectal or periampullary cancer: Direct quantitative PCR for ALU repeats. Clinical chemistry. 2006;52:1062-9.
34. Umetani N, Giuliano AE, Hiramatsu SH, Amersi F, Nakagawa T, Martino S, et al. Prediction of breast tumor progression by integrity of free circulating DNA in serum. Journal of clinical oncology : official journal of the American Society of Clinical Oncology. 2006;24:4270-6.
35. Gao YJ, He YJ, Yang ZL, Shao HY, Zuo Y, Bai Y, et al. Increased integrity of circulating cell-free DNA in plasma of patients with acute leukemia. Clinical chemistry and laboratory medicine : CCLM / FESCC. 2010;48:1651-6.
36. Wang E, Batey A, Struble C, Musci T, Song K, Oliphant A. Gestational age and maternal weight effects on fetal cell-free DNA in maternal plasma. Prenatal diagnosis. 2013;33:662-6.
37. Smid M, Galbiati S, Vassallo A, Gambini D, Ferrari A, Viora E, et al. No evidence of fetal DNA persistence in maternal plasma after pregnancy. Human genetics. 2003;112:617-8.
38. Yu SC, Lee SW, Jiang P, Leung TY, Chan KCA, Chiu RWK, et al. High-resolution profiling of fetal DNA clearance from maternal plasma by massively parallel sequencing. Clinical chemistry. 2013;59:1228-37.
39. Diehl F, Schmidt K, Choti MA, Romans K, Goodman S, Li M, et al. Circulating mutant DNA to assess tumor dynamics. Nature medicine. 2008;14:985-90.
40. Lindgreen S. AdapterRemoval: Easy cleaning of next-generation sequencing reads. BMC research notes. 2012;5:337.
41. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17.
42. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/.
43. Chiu RWK, Chan KCA, Gao Y, Lau VY, Zheng W, Leung TY, et al. Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma. Proceedings of the National Academy of Sciences of the United States of America. 2008;105:20458-63.
44. Sparks AB, Wang ET, Struble CA, Barrett W, Stokowski R, McBride C, et al. Selective analysis of cell-free DNA in maternal blood for evaluation of fetal trisomy. Prenatal diagnosis. 2012;32:3-9.
45. Sehnert AJ, Rhees B, Comstock D, de Feo E, Heilek G, Burke J, et al. Optimal detection of fetal chromosomal abnormalities by massively parallel DNA sequencing of cell-free fetal DNA from maternal blood. Clinical chemistry. 2011;57:1042-9.
46. Smith TF, Waterman MS. Identification of common molecular subsequences. Journal of molecular biology. 1981;147:195-7.
47. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology. 1970;48:443-53.
48. Aronesty E. ea-utils: Command-line tools for processing biological sequencing data; http://code.google.com/p/ea-utils. 2011.
49. Anders S, Pyl PT, Huber W. HTSeq--a Python framework to work with high-throughput sequencing data. Bioinformatics. 2014.
50. Bolger AM, Lohse M, Usadel B. Trimmomatic: A flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30:2114-20.
51. Benjamini Y, Speed TP. Summarizing and correcting the GC content bias in high-throughput sequencing. Nucleic acids research. 2012;40:e72.
52. Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, et al. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome biology. 2011;12:R18.
53. Chu T, Bunce K, Hogge WA, Peters DG. A novel approach toward the challenge of accurately quantifying fetal DNA in maternal plasma. Prenatal diagnosis. 2010;30:1226-9.
54. Bellis MA, Hughes K, Hughes S, Ashton JR. Measuring paternal discrepancy and its public health consequences. Journal of epidemiology and community health. 2005;59:749-54.
55. Jiang P, Chan KCA, Liao GJ, Zheng YW, Leung TY, Chiu RWK, et al. FetalQuant: Deducing fractional fetal DNA concentration from massively parallel sequencing of DNA in maternal plasma. Bioinformatics. 2012;28:2883-90.
56. Yu SC, Chan KCA, Zheng YW, Jiang P, Liao GJ, Sun H, et al. Size-based molecular diagnostics using plasma DNA for noninvasive prenatal testing. Proceedings of the National Academy of Sciences of the United States of America. 2014;111:8583-8.
57. Chan RW, Jiang P, Peng X, Tam LS, Liao GJ, Li EK, et al. Plasma DNA aberrations in systemic lupus erythematosus revealed by genomic and methylomic sequencing. Proceedings of the National Academy of Sciences of the United States of America. 2014.
58. Jiang P, Chan CW, Chan KC, Cheng SH, Wong J, Wong VW, et al. Lengthening and shortening of plasma DNA in hepatocellular carcinoma patients. Proceedings of the National Academy of Sciences of the United States of America. 2015;112:E1317-25.
59. Norton ME, Brar H, Weiss J, Karimi A, Laurent LC, Caughey AB, et al. Non-invasive chromosomal evaluation (NICE) study: Results of a multicenter prospective cohort study for detection of fetal trisomy 21 and trisomy 18. American journal of obstetrics and gynecology. 2012;207:137 e1-8.
60. Chiu RWK, Sun H, Akolekar R, Clouser C, Lee C, McKernan K, et al. Maternal plasma DNA analysis with massively parallel sequencing by ligation for noninvasive prenatal diagnosis of trisomy 21. Clinical chemistry. 2010;56:459-63.
61. Chan KCA, Zhang J, Hui AB, Wong N, Lau TK, Leung TN, et al. Size distributions of maternal and fetal DNA in maternal plasma. Clinical chemistry. 2004;50:88-92.
62. Liao GJ, Chan KCA, Jiang P, Sun H, Leung TY, Chiu RWK, et al. Noninvasive prenatal diagnosis of fetal trisomy 21 by allelic ratio analysis using targeted massively parallel sequencing of maternal plasma DNA. PloS one. 2012;7:e38154.
63. Zimmermann B, Hill M, Gemelos G, Demko Z, Banjevic M, Baner J, et al. Noninvasive prenatal aneuploidy testing of chromosomes 13, 18, 21, X, and Y, using targeted sequencing of polymorphic loci. Prenatal diagnosis. 2012;32:1233-41.
64. Smits J, Monden C. Twinning across the developing world. PloS one. 2011;6:e25239.
65. Chauhan SP, Scardo JA, Hayes E, Abuhamad AZ, Berghella V. Twins: Prevalence, problems, and preterm births. American journal of obstetrics and gynecology. 2010;203:305-15.
66. Audibert F, Gagnon A, Genetics Committee of the Society of Obstetricians and Gynaecologists of Canada, Prenatal Diagnosis Committee of the Canadian College of Medical Geneticists. Prenatal screening for and diagnosis of aneuploidy in twin pregnancies. Journal of obstetrics and gynaecology Canada : JOGC = Journal d'obstetrique et gynecologie du Canada : JOGC. 2011;33:754-67.


67. Huang X, Zheng J, Chen M, Zhao Y, Zhang C, Liu L, et al. Noninvasive prenatal testing of trisomies 21 and 18 by massively parallel sequencing of maternal plasma DNA in twin pregnancies. Prenatal diagnosis. 2014;34:335-40.
68. Qu JZ, Leung TY, Jiang P, Liao GJ, Cheng YK, Sun H, et al. Noninvasive prenatal determination of twin zygosity by maternal plasma DNA analysis. Clinical chemistry. 2013;59:427-35.
69. Leung TY, Qu JZ, Liao GJ, Jiang P, Cheng YK, Chan KCA, et al. Noninvasive twin zygosity assessment and aneuploidy detection by maternal plasma DNA sequencing. Prenatal diagnosis. 2013;33:675-81.
70. Fan HC, Gu W, Wang J, Blumenfeld YJ, El-Sayed YY, Quake SR. Non-invasive prenatal measurement of the fetal genome. Nature. 2012;487:320-4.
71. Kitzman JO, Snyder MW, Ventura M, Lewis AP, Qiu R, Simmons LE, et al. Noninvasive whole-genome sequencing of a human fetus. Science translational medicine. 2012;4:137ra76.
72. Kitzman JO, Mackenzie AP, Adey A, Hiatt JB, Patwardhan RP, Sudmant PH, et al. Haplotype-resolved genome sequencing of a Gujarati Indian individual. Nature biotechnology. 2011;29:59-63.
73. Castells A, Puig P, Mora J, Boadas J, Boix L, Urgell E, et al. K-ras mutations in DNA extracted from the plasma of patients with pancreatic carcinoma: Diagnostic utility and prognostic significance. Journal of clinical oncology : official journal of the American Society of Clinical Oncology. 1999;17:578-84.
74. Ryan BM, Lefort F, McManus R, Daly J, Keeling PW, Weir DG, et al. A prospective study of circulating mutant KRAS2 in the serum of patients with colorectal neoplasia: Strong prognostic indicator in postoperative follow up. Gut. 2003;52:101-8.
75. Fridlyand J, Snijders AM, Ylstra B, Li H, Olshen A, Segraves R, et al. Breast tumor copy number aberration phenotypes and genomic instability. BMC cancer. 2006;6:96.
76. Albertson DG, Collins C, McCormick F, Gray JW. Chromosome aberrations in solid tumors. Nature genetics. 2003;34:369-76.
77. Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, et al. The landscape of somatic copy-number alteration across human cancers. Nature. 2010;463:899-905.
78. Zack TI, Schumacher SE, Carter SL, Cherniack AD, Saksena G, Tabak B, et al. Pan-cancer patterns of somatic copy number alteration. Nature genetics. 2013;45:1134-40.
79. Alitalo K, Schwab M, Lin CC, Varmus HE, Bishop JM. Homogeneously staining chromosomal regions contain amplified copies of an abundantly expressed cellular oncogene (c-myc) in malignant neuroendocrine cells from a human colon carcinoma. Proceedings of the National Academy of Sciences of the United States of America. 1983;80:1707-11.
80. Hinds PW, Dowdy SF, Eaton EN, Arnold A, Weinberg RA. Function of a human cyclin gene as an oncogene. Proceedings of the National Academy of Sciences of the United States of America. 1994;91:709-13.
81. Heitzer E, Auer M, Hoffmann EM, Pichler M, Gasch C, Ulz P, et al. Establishment of tumor-specific copy number alterations from plasma DNA of patients with cancer. International journal of cancer Journal international du cancer. 2013;133:346-56.
82. Heitzer E, Ulz P, Belic J, Gutschi S, Quehenberger F, Fischereder K, et al. Tumor-associated copy number changes in the circulation of patients with prostate cancer identified through whole-genome sequencing. Genome medicine. 2013;5:30.
83. Leary RJ, Sausen M, Kinde I, Papadopoulos N, Carpten JD, Craig D, et al. Detection of chromosomal alterations in the circulation of cancer patients with whole-genome sequencing. Science translational medicine. 2012;4:162ra54.
84. Sawyers CL. Chronic myeloid leukemia. The New England journal of medicine. 1999;340:1330-40.
85. Warrell RP, Jr. Retinoid resistance in acute promyelocytic leukemia: New mechanisms, strategies, and implications. Blood. 1993;82:1949-53.
86. Szczepanski T, van der Velden VH, Raff T, Jacobs DC, van Wering ER, Bruggemann M, et al. Comparative analysis of T-cell receptor gene rearrangements at diagnosis and relapse of T-cell acute lymphoblastic leukemia (T-ALL) shows high stability of clonal markers for monitoring of minimal residual disease and reveals the occurrence of second T-ALL. Leukemia. 2003;17:2149-56.
87. Korsmeyer SJ, Arnold A, Bakhshi A, Ravetch JV, Siebenlist U, Hieter PA, et al. Immunoglobulin gene rearrangement and cell surface antigen expression in acute lymphocytic leukemias of T cell and B cell precursor origins. The Journal of clinical investigation. 1983;71:301-13.
88. Leary RJ, Kinde I, Diehl F, Schmidt K, Clouser C, Duncan C, et al. Development of personalized tumor biomarkers using massively parallel sequencing. Science translational medicine. 2010;2:20ra14.
89. Tomizawa S, Sasaki H. Genomic imprinting and its relevance to congenital disease, infertility, molar pregnancy and induced pluripotent stem cell. Journal of human genetics. 2012;57:84-91.


90. Banister CE, Koestler DC, Maccani MA, Padbury JF, Houseman EA, Marsit CJ. Infant growth restriction is associated with distinct patterns of DNA methylation in human placentas. Epigenetics : official journal of the DNA Methylation Society. 2011;6:920-7.
91. Esteller M, Herman JG. Cancer as an epigenetic disease: DNA methylation and chromatin alterations in human tumours. The Journal of pathology. 2002;196:1-7.
92. Egger G, Liang G, Aparicio A, Jones PA. Epigenetics in human disease and prospects for epigenetic therapy. Nature. 2004;429:457-63.
93. Chim SS, Tong YK, Chiu RWK, Lau TK, Leung TN, Chan LY, et al. Detection of the placental epigenetic signature of the maspin gene in maternal plasma. Proceedings of the National Academy of Sciences of the United States of America. 2005;102:14753-8.
94. Tsui DW, Lam YM, Lee WS, Leung TY, Lau TK, Lau ET, et al. Systematic identification of placental epigenetic signatures for the noninvasive prenatal detection of Edwards syndrome. PloS one. 2010;5:e15069.
95. Chim SS, Jin S, Lee TY, Lun FM, Lee WS, Chan LY, et al. Systematic search for placental DNA-methylation markers on chromosome 21: Toward a maternal plasma-based epigenetic test for fetal trisomy 21. Clinical chemistry. 2008;54:500-11.
96. Chan KCA, Lai PB, Mok TS, Chan HL, Ding C, Yeung SW, et al. Quantitative analysis of circulating methylated DNA as a biomarker for hepatocellular carcinoma. Clinical chemistry. 2008;54:1528-36.
97. An Q, Liu Y, Gao Y, Huang J, Fong X, Li L, et al. Detection of p16 hypermethylation in circulating plasma DNA of non-small cell lung cancer patients. Cancer letters. 2002;188:109-14.
98. Valenzuela MT, Galisteo R, Zuluaga A, Villalobos M, Nunez MI, Oliver FJ, et al. Assessing the use of p16(INK4a) promoter gene methylation in serum for detection of bladder cancer. European urology. 2002;42:622-8; discussion 8-30.
99. Lister R, O'Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, et al. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell. 2008;133:523-36.
100. Cokus SJ, Feng S, Zhang X, Chen Z, Merriman B, Haudenschild CD, et al. Shotgun bisulphite sequencing of the Arabidopsis genome reveals DNA methylation patterning. Nature. 2008;452:215-9.
101. Chen PY, Cokus SJ, Pellegrini M. BS Seeker: Precise mapping for bisulfite sequencing. BMC bioinformatics. 2010;11:203.
102. Lim JQ, Tennakoon C, Li G, Wong E, Ruan Y, Wei CL, et al. BatMeth: Improved mapper for bisulfite sequencing reads on DNA methylation. Genome biology. 2012;13:R82.
103. Xi Y, Li W. BSMAP: Whole genome bisulfite sequence mapping program. BMC bioinformatics. 2009;10:232.
104. Krueger F, Andrews SR. Bismark: A flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics. 2011;27:1571-2.
105. Hansen KD, Langmead B, Irizarry RA. BSmooth: From whole genome bisulfite sequencing reads to differentially methylated regions. Genome biology. 2012;13:R83.
106. Jiang P, Sun K, Lun FM, Guo AM, Wang H, Chan KCA, et al. Methy-Pipe: An integrated bioinformatics pipeline for whole genome bisulfite sequencing data analysis. PloS one. 2014;9:e100360.
107. Krueger F, Kreck B, Franke A, Andrews SR. DNA methylome analysis using short bisulfite sequencing data. Nature methods. 2012;9:145-51.
108. Bock C. Analysing and interpreting DNA methylation data. Nature reviews Genetics. 2012;13:705-19.
109. Lun FM, Chiu RWK, Sun K, Leung TY, Jiang P, Chan KCA, et al. Noninvasive prenatal methylomic analysis by genomewide bisulfite sequencing of maternal plasma DNA. Clinical chemistry. 2013;59:1583-94.
110. Palomaki GE, Kloza EM, Lambert-Messerlian GM, Haddow JE, Neveux LM, Ehrich M, et al. DNA sequencing of maternal plasma to detect Down syndrome: An international clinical validation study. Genetics in medicine : official journal of the American College of Medical Genetics. 2011;13:913-20.

FIGURES:

Figure 1. Summary of bioinformatics analysis and key clinical applications of circulating cfDNA.

Figure 2. Schematic illustration of the calculation of fractional fetal DNA concentration in maternal plasma with the SNP-based method. Informative SNPs are identified at loci where the maternal (AA) and paternal (BB) genotypes are both homozygous but for different alleles. The resulting fetal genotype is an obligate heterozygote (AB). In maternal plasma, the majority of the cfDNA molecules mapped to informative loci carry the shared allele (black). The fetal-specific allele (red) exists at a low frequency. The fractional fetal DNA concentration can be directly deduced as shown.

Figure 3. SNP categories for RHDO analysis.
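The SNP-based calculation described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration, assuming the commonly used relation f = 2p / (p + q), where p is the number of reads carrying the fetal-specific allele and q the number carrying the shared allele, summed over informative loci; the function name and the read counts are hypothetical.

```python
def fetal_fraction(fetal_specific_reads: int, shared_reads: int) -> float:
    """Estimate the fractional fetal DNA concentration at informative SNPs.

    Assumes mother AA, father BB, fetus AB: the fetus contributes one B
    (fetal-specific) and one A (shared) allele, so the fetal share of all
    reads is twice the fraction of fetal-specific reads.
    """
    total = fetal_specific_reads + shared_reads
    if total == 0:
        raise ValueError("no reads at informative loci")
    return 2.0 * fetal_specific_reads / total


# Hypothetical counts pooled over all informative loci.
estimate = fetal_fraction(fetal_specific_reads=50, shared_reads=950)
print(f"fractional fetal DNA concentration: {estimate:.1%}")
```

Pooling counts over many informative loci, rather than averaging per-locus ratios, keeps the estimate stable when individual loci are covered by only a few reads.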


Figure 4. Schematic illustration of the RHDO approach to construct maternal inheritance of the fetus noninvasively. Maternal haplotypes (Hap I and Hap II) can be deduced from trio-based genotype information. Allelic counts in plasma of type α (or type β) SNPs are accumulated for each maternal haplotype. The accumulated counts are then subjected to the sequential probability ratio test (SPRT) to determine statistically whether Hap I is over-represented in maternal plasma.
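The accumulate-then-test logic in the Figure 4 caption can be illustrated with a generic Wald SPRT. This is only a sketch under stated assumptions, not the authors' implementation: it assumes type α SNPs, where Hap I is expected at a fraction of 0.5 + f/2 of reads if the fetus inherited Hap I and at 0.5 otherwise (f being the fractional fetal DNA concentration), and it uses illustrative error rates `alpha` and `beta`. The function name and labels are hypothetical.

```python
import math


def sprt_rhdo(counts, fetal_fraction, alpha=1e-3, beta=1e-3):
    """Wald SPRT over per-SNP (Hap I reads, Hap II reads) pairs.

    H0: the haplotypes are balanced (Hap I read fraction 0.5).
    H1: Hap I is over-represented (Hap I read fraction 0.5 + f/2).
    Returns (classification, number of SNPs consumed).
    """
    p0 = 0.5
    p1 = 0.5 + fetal_fraction / 2.0
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this

    log_likelihood_ratio = 0.0
    snps_used = 0
    for hap1_reads, hap2_reads in counts:
        snps_used += 1
        # Binomial log-likelihood ratio contribution of this SNP.
        log_likelihood_ratio += hap1_reads * math.log(p1 / p0)
        log_likelihood_ratio += hap2_reads * math.log((1 - p1) / (1 - p0))
        if log_likelihood_ratio >= upper:
            return "Hap I over-represented", snps_used
        if log_likelihood_ratio <= lower:
            return "balanced", snps_used
    return "unclassified", snps_used


# Hypothetical data: 20 consecutive SNPs, each with 55 Hap I and 45 Hap II reads.
print(sprt_rhdo([(55, 45)] * 20, fetal_fraction=0.10))
```

Because counts accumulate along the haplotype until a boundary is crossed, a classification consumes only as many SNPs as the data require; a block that exhausts its SNPs without crossing either boundary stays unclassified.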


Figure 5. Schematic illustration of the haplotype counting approach to construct maternal inheritance of the fetus. Maternal haplotypes are first determined by applying a direct deterministic phasing approach to maternal blood cells. Heterozygous sites are identified and the count of the alleles specific to each haplotype is determined. The relative representation of the two haplotypes in maternal plasma is subjected to a Poisson-based z-score test to determine the maternal inheritance of the fetus.
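The caption for Figure 5 does not spell out the test statistic, so the following is only one plausible form of a Poisson-based z-score, offered as an assumption: if the two haplotype-specific counts are treated as independent Poisson variables with equal means under the null, their difference has variance approximately equal to their sum. The function names and the threshold of 3 are hypothetical.

```python
import math


def haplotype_z_score(hap1_count: int, hap2_count: int) -> float:
    """z-score for the difference of two Poisson counts under equal means."""
    total = hap1_count + hap2_count
    if total == 0:
        raise ValueError("no haplotype-specific reads")
    return (hap1_count - hap2_count) / math.sqrt(total)


def call_inherited_haplotype(hap1_count: int, hap2_count: int,
                             threshold: float = 3.0) -> str:
    """Call the over-represented maternal haplotype, if any."""
    z = haplotype_z_score(hap1_count, hap2_count)
    if z > threshold:
        return "Hap I"
    if z < -threshold:
        return "Hap II"
    return "undetermined"


# Hypothetical counts accumulated over the heterozygous sites of one block.
print(haplotype_z_score(5500, 4500))         # 1000 / sqrt(10000)
print(call_inherited_haplotype(5500, 4500))
```

The sign of the z-score identifies the over-represented haplotype, and its magnitude grows with the square root of the total count, which is why counts are pooled across a haplotype block.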


Figure 6. Schematic illustration of the hidden Markov model (HMM) approach to construct maternal inheritance of the fetus. Classically, an HMM has three components: latent states, transition probabilities [denoted by P(T)] and emission probabilities [denoted by P(E)]. The three latent states (3 circles) are the first maternal haplotype (Hap I) transmitted to the fetus, the second maternal haplotype (Hap II) transmitted to the fetus, and an unknown state in which the maternal haplotype inherited by the fetus is not known. The transition probabilities (solid black arrows) are held at 10⁻⁵, close to the human recombination rate. Emission probabilities (dashed black arrows) are modeled on a binomial distribution.

Figure 7. Two major methods developed to calculate the fractional tumor DNA concentration: the mutant allele approach and the loss of heterozygosity approach. (A) Mutant allele approach. In cancer patients, cells can be classified into normal cells, which carry only wild-type alleles, and tumor cells, which additionally harbor mutant alleles. The tumor load in the patient’s plasma is reflected by the fraction of mutant alleles, which can be translated into the fractional tumor DNA concentration. (B) Loss of heterozygosity approach. Chromosomal arms or sub-chromosomal regions are frequently found to be deleted in cancer cells. These deletions preferentially involve only one of the two homologous chromosomes, thus resulting in loss of heterozygosity (LOH). In the patient’s plasma, LOH in tumor cells would lead to a decrease in the number of deleted alleles compared to the non-deleted alleles. The allelic imbalance between non-deleted and deleted alleles can be translated into the fractional tumor DNA concentration.
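Both calculations summarized in the Figure 7 caption reduce to simple ratios of read counts. The sketch below is a hedged illustration rather than the authors' code. For approach (A) it assumes a heterozygous mutation in a copy-neutral region, so that the tumor fraction is twice the mutant read fraction. For approach (B) it assumes that tumor cells have lost one allele entirely, which gives F = (N_nondel − N_del) / N_nondel. Function names and counts are hypothetical.

```python
def tumor_fraction_from_mutant_allele(mutant_reads: int,
                                      wildtype_reads: int) -> float:
    """Approach (A): assumes a heterozygous, copy-neutral tumor mutation."""
    total = mutant_reads + wildtype_reads
    if total == 0:
        raise ValueError("no reads covering the mutation site")
    return 2.0 * mutant_reads / total


def tumor_fraction_from_loh(non_deleted_reads: int,
                            deleted_reads: int) -> float:
    """Approach (B): assumes tumor cells retain only the non-deleted allele.

    Normal cells contribute both alleles equally, tumor cells only the
    non-deleted one, so the relative deficit of the deleted allele equals
    the tumor-derived fraction.
    """
    if non_deleted_reads <= 0:
        raise ValueError("no reads carrying the non-deleted allele")
    return (non_deleted_reads - deleted_reads) / non_deleted_reads


# Hypothetical plasma read counts.
print(tumor_fraction_from_mutant_allele(mutant_reads=50, wildtype_reads=950))
print(tumor_fraction_from_loh(non_deleted_reads=1000, deleted_reads=900))
```

Both hypothetical inputs correspond to a tumor-derived fraction of about 10%, which shows how the two approaches can be cross-checked on the same sample when both a point mutation and an LOH region are available.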


TABLES:


Table 1. Biological properties of fetal-derived cfDNA and tumor-derived cfDNA.


Table 2. Feature comparison of different adapter trimming tools.

Table 1. Biological properties of fetal-derived cfDNA and tumor-derived cfDNA

| Property | Fetal-derived cfDNA | Tumor-derived cfDNA |
| --- | --- | --- |
| Concentration | Positively correlates with gestational age [36] | Positively correlates with tumor size [7, 9, 10] |
| Size | Shorter than background maternal-derived DNA [28, 29] | Shorter [30-32] or longer [33-35] in cancer patients |
| Clearance | Rapid phase half-life: 1 hour [38] | Half-life: 2 hours [39] |

Table 2. Feature comparison of different adapter trimming tools.

Yes

Yes Yes

Yes Yes

No

Yes

Yes

No Yes

Yes Yes

Yes Yes

Yes

0.3.7 0.6.1

Yes Yes

Yes Yes

1.5.4

C++

Yes

1.04.636 0.32

C++ Java

Yes Yes

Trim Galore! 42 HTSeq 49 AdapterRemoval 40


No

1.6

FastqMcf Trimmomatic 50

Yes

Python and C Perl Python

Cutadapt 41

48

Language

Able to trim low-quality nucleotides

Directly processing gzip-format file


Version

Able to identify adapter sequences specified by user

Paired-end reads support


Figure 1


Figure 2


Figure 3


Figure 4


Figure 5


Figure 6


Figure 7

