Bioinformatics analysis of circulating cell-free DNA sequencing data
Landon L. Chan, Peiyong Jiang
PII: S0009-9120(15)00167-8
DOI: 10.1016/j.clinbiochem.2015.04.022
Reference: CLB 9019
To appear in: Clinical Biochemistry
Received date: 15 December 2014
Revised date: 30 March 2015
Accepted date: 29 April 2015
Please cite this article as: Chan Landon L., Jiang Peiyong, Bioinformatics analysis of circulating cell-free DNA sequencing data, Clinical Biochemistry (2015), doi: 10.1016/j.clinbiochem.2015.04.022
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT Bioinformatics analysis of circulating cell-free DNA sequencing data
Landon L. Chan1,2, Peiyong Jiang1,2*
1. Li Ka Shing Institute of Health Sciences, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong SAR, China
2. Department of Chemical Pathology, The Chinese University of Hong Kong, Prince of Wales Hospital, Shatin, New Territories, Hong Kong SAR, China
*To whom correspondence should be addressed. E-mail: [email protected]
Abstract
The discovery of cell-free DNA molecules in plasma has opened up numerous opportunities in noninvasive diagnosis. Cell-free DNA molecules have become increasingly recognized as promising biomarkers for the detection and management of many diseases. The advent of next generation sequencing has provided unprecedented opportunities to scrutinize the characteristics of cell-free DNA molecules in plasma in a genome-wide fashion and at single-base resolution. Consequently, clinical applications of circulating cell-free DNA analysis have not only revolutionized noninvasive prenatal diagnosis but also facilitated cancer detection and monitoring toward an era of blood-based personalized medicine. With the rapidly increasing throughput and falling cost of next generation sequencing, bioinformatics analysis is increasingly needed to make sense of the large amount of data generated by these sequencing platforms. In this Review, we highlight the major bioinformatics algorithms involved in the analysis of cell-free DNA sequencing data. Firstly, we briefly describe the biological properties of these molecules and provide an overview of the general bioinformatics approach for the analysis of cell-free DNA. Then, we discuss the specific upstream bioinformatics considerations concerning the analysis of sequencing data of circulating cell-free DNA, followed by a detailed elaboration on each key clinical situation in noninvasive prenatal diagnosis and cancer management where downstream bioinformatics analysis is heavily involved. We also discuss the bioinformatics analysis and clinical applications of the newly developed massively parallel bisulfite sequencing of cell-free DNA. Finally, we offer our perspectives on the future development of bioinformatics in noninvasive diagnosis.
Introduction
The discovery of circulating cell-free DNA (cfDNA) in plasma has opened up an exciting new arena in blood-based diagnosis, obviating the need for tissue biopsy in the settings of prenatal and cancer management [1, 2]. Thus far, the main applications of cfDNA are in prenatal diagnosis [3-6] and cancer monitoring [7-11]. A small proportion of fetal-derived and tumor-derived cfDNA was found in the circulation of pregnant women and of cancer patients, respectively [8, 12]. Because fetal-derived and tumor-derived cfDNA are genetically different from the main background circulating cfDNA, blood samples collected from pregnant women and cancer patients provide a ‘liquid biopsy’ containing the fetal genetic profile and the tumor’s mutational profile. This is the foundation for the development of various noninvasive diagnostic techniques.
Before the maturation of next generation sequencing technologies, analyses of circulating cfDNA were done with PCR assays [12-18]. Conventional PCR assays are useful in qualitative analyses such as fetal sex determination [12, 19] and RhD status determination [13, 20]. However, they are still not practical for screening for aneuploidies, sub-chromosomal changes and copy number aberrations, because these applications require precise quantitative analysis.
Recent advances in next generation sequencing have offered a much more efficient platform for the analysis of cfDNA because millions of DNA molecules can be analyzed in parallel. Characteristics of circulating cfDNA can be unveiled at single-base resolution and on a genome-wide scale. With the exponential growth in sequencing data output, sophisticated bioinformatics algorithms for noninvasive prenatal diagnosis and cancer monitoring are increasingly in demand.
To date, many reviews have discussed the application of cfDNA analyses in clinical management [21-25]. However, discussion of the underlying bioinformatics algorithms is lacking. This Review intends to fill this gap by giving an overview of the different bioinformatics algorithms commonly used in the analysis of cfDNA.
This Review will cover four aspects. Firstly, we introduce the key biological properties of cfDNA underpinning the conceptual basis of many bioinformatics algorithms in noninvasive diagnosis. Secondly, we provide a general overview of the bioinformatics approach toward the analysis of cfDNA molecules. Thirdly, we discuss upstream bioinformatics considerations specific to the analysis of cfDNA. Finally, we give a detailed elaboration on the downstream bioinformatics algorithms that are crucial in noninvasive prenatal diagnosis and cancer assessment.
Biological properties of fetal-specific and tumor-specific cfDNA
The presence of cfDNA in the plasma of healthy individuals is thought to be the result of cellular apoptotic events [26]. In pregnant women, the main source of fetal-derived cfDNA in maternal plasma is the placenta [27]. In cancer patients, both apoptotic and necrotic events have been suggested as the sources of circulating tumor cfDNA [21, 22]. DNA resulting from these events is naturally fragmented and released into the circulation. Fetal-derived circulating cfDNA in plasma has been reported to be shorter than maternal-derived cfDNA [28, 29]. Interestingly, tumor-derived circulating cfDNA in plasma has also been reported to be shorter than the background circulating cfDNA [30-32]. However, some groups have found longer fragments in the blood circulation of patients with certain types of cancer [33-35].
The concentration of fetal-derived and tumor-derived cfDNA in the circulation has been shown to increase with the size of the fetus and tumor [8, 36]. In one study of 22,384 healthy singleton pregnancies, the concentration of fetal-derived cfDNA was found to increase by 0.1% per week between 10 and 21 weeks of gestation, and by 1% per week beyond 21 weeks of gestation [36]. The concentration of tumor-derived cfDNA has been reported to increase with stage and tumor size in patients with a variety of cancers [7, 9, 10]. However, the exact trend is likely to vary among cancer types, influenced by cancer-dependent variables such as location, tumor aggressiveness, genotypic aberrations, and other risk and prognostic factors.
Fetal-derived and tumor-derived cfDNA have been demonstrated to share similar clearance kinetics. Following delivery, fetal-derived cfDNA in maternal plasma decreases rapidly [37, 38] in two phases. The initial rapid phase has a mean half-life of one hour. The subsequent slow phase has a mean half-life of 13 hours. Altogether, fetal-derived cfDNA becomes undetectable at about one to two days postpartum. Similarly, tumor-derived cfDNA exhibits rapid clearance with a half-life of two hours [39] following complete resection. Therefore, the presence of tumor-derived cfDNA in postoperative patients may suggest the presence of residual tumor. These biological properties of fetal-derived cfDNA and tumor-derived cfDNA are summarized in Table 1.
General categorization of bioinformatics approaches in the analysis of cfDNA
The biological properties of fetal- and tumor-specific cfDNA, that is, the natural fragmentation, persistency and rapid clearance kinetics of cfDNA, allow for a new homeostasis to be achieved within a person’s circulation in the presence of a fetus or tumor. In other words, the new homeostasis can be viewed as an imbalance of circulating genetic materials relative to the baseline (i.e. without the fetus or tumor), as a result of the maintenance of fetal- and tumor-specific cfDNA. The general bioinformatics approach, therefore, is to detect and quantify this imbalance that differentiates the new homeostasis from the old homeostasis. There are three major categories:
A. Imbalances detected via allelic count: The normal allelic ratio within the circulation is disturbed due to the presence of genetic materials derived from the fetus or tumor. Examples of this approach include estimation of fractional fetal/tumor DNA concentration and detection of aneuploidy.
B. Imbalances detected via regional genomic representation: The relative proportion of the representation of a genomic region can be increased or decreased as a result of the presence of the fetus or tumor. Examples of this approach include detection of aneuploidy and detection of copy number changes in cancer.
C. Imbalances detected via size distribution: The normal size distribution of cfDNA molecules within the circulation is disturbed due to the presence of genetic materials derived from the fetus or tumor. Examples of this approach include estimation of fractional fetal DNA concentration and detection of aneuploidy.
These three types of analyses, together, form the basis of the many algorithms to be discussed in the following sections. Figure 1 summarizes the key categories and applications of bioinformatics in the analysis of cfDNA.
Upstream bioinformatics analysis of circulating cfDNA with sequencing
The quality of sequencing reads could affect downstream analyses. In the analysis of cfDNA, the quality of sequencing reads is particularly vulnerable to two technical challenges: adapter contamination [40-42] and sequencing bias contributed by the non-uniform PCR amplification of GC-rich/poor DNA fragments [5, 43-45].
Adapter contamination
Adapter contamination affects circulating cfDNA molecules that are shorter than the read length generated by next generation sequencing machines. For example, consider an 80 bp DNA molecule sequenced with a read length of 100 bp. Without adapter trimming, part of the adapter ligated to the 3’ end of the molecule during library preparation is sequenced as well. As a result, 20 bp of adapter sequence contaminates the read from this molecule. Consequently, the read is either unmappable to the human genome using an end-to-end alignment method or has a low alignment score. Either way, the read is discarded from subsequent analysis. Therefore, with adapter contamination, a substantial proportion of short cfDNA molecules could go unanalyzed.
Adapter contamination can therefore lead to inaccurate results in downstream analysis. For example, the classification power of noninvasive prenatal screening tests would be reduced. This is because fetal-derived cfDNA molecules are generally shorter than maternal-derived cfDNA molecules [28, 29] and therefore have a higher chance of being contaminated by adapters. Adapter contamination can thus result in the loss of part of the informative fragments (fetal-derived cfDNA molecules).
A number of bioinformatics packages have been developed to tackle this problem. In general, most of these adapter trimming packages utilize dynamic programming such as the Smith-Waterman [46] or Needleman-Wunsch [47] algorithms, with minor modifications, to identify the adapters. These algorithms are also used in conventional pair-wise alignment. A few examples of adapter trimming packages include Cutadapt [41], TrimGalore [42], FastqMcf [48], HTSeq [49], AdapterRemoval [40] and Trimmomatic [50]. Their functionalities are largely similar but they vary in their abilities to work directly on gzip-format files and paired-end reads. A summary of these packages is given in Table 2.
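The 80 bp molecule and 100 bp read scenario described above can be illustrated with a minimal sketch. It is not how any of the listed packages work internally: it uses exact string matching instead of dynamic-programming alignment, so it tolerates no sequencing errors. The function name `trim_3prime_adapter`, the `min_overlap` parameter and the adapter sequence are assumptions chosen for illustration.

```python
def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter from a read using exact matching only."""
    # Case 1: the complete adapter occurs somewhere inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends part-way through the adapter, so only a
    # prefix of the adapter is present at the 3' end of the read.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    # No adapter detected: return the read unchanged.
    return read


# Illustrative adapter sequence (an assumption for this example).
ADAPTER = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
molecule = "ACGT" * 20                 # an 80 bp cfDNA molecule
read = (molecule + ADAPTER)[:100]      # a 100 bp read runs 20 bp into the adapter
trimmed = trim_3prime_adapter(read, ADAPTER)
print(len(read), len(trimmed))
```

Real trimmers additionally allow mismatches and indels in the adapter match, which is where the dynamic programming mentioned above comes in.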
Sequencing bias
The main sequencing bias affecting the analysis of circulating cfDNA is the GC-content bias attributable to PCR amplification [51, 52]. In the process of library preparation, DNA molecules are amplified with PCR. It has been reported that DNA molecules with low GC-content and shorter length are preferentially amplified [51, 52]. Because only a subset of amplified DNA molecules are sequenced on the flow-cells, regions of low GC-content or shorter DNA fragments will be over-represented, whereas GC-rich or longer DNA fragments will be under-represented. As a result, this phenomenon can hinder the assessment of aneuploidies, particularly for chromosomes with higher GC-content. A number of algorithms have been proposed to correct GC-bias. A detailed discussion of these methods was given previously [51].
Downstream bioinformatics analysis of circulating cfDNA with sequencing
The following sections discuss the downstream bioinformatics algorithms in the analysis of circulating cfDNA. Each section describes a clinical application in which various bioinformatics algorithms are applied: 1) estimation of fractional fetal DNA concentration, 2) noninvasive detection of fetal aneuploidy in singleton pregnancies, 3) noninvasive detection of zygosity and aneuploidies in twin pregnancies, 4) noninvasive haplotype based detection of monogenic disorders, 5) estimation of fractional tumor DNA concentration, 6) noninvasive detection of copy number aberrations in cancer patients, 7) noninvasive detection of genomic rearrangements in cancer patients and 8) noninvasive methylomic analysis of circulating cfDNA.
Estimation of fractional fetal DNA concentration
There are two major approaches to estimate the fractional fetal DNA concentration: the polymorphism dependent approach and the polymorphism independent approach. Methods under the polymorphism dependent approach are further categorized as either parental-genotype dependent or parental-genotype independent. Briefly, the parental-genotype dependent method requires both paternal and maternal genotypes to deduce the fetal genotype and directly estimates the fractional fetal DNA concentration. In contrast, the parental-genotype independent method does not require both paternal and maternal genotypes. It relies on statistical inference based on the maternal plasma allelic distribution to deduce the fractional fetal DNA concentration. The polymorphism independent approach, as the name suggests, does not rely on the parental genotype at all. It takes advantage of the biological differences between maternal cfDNA molecules and fetal cfDNA molecules to estimate the fractional fetal DNA concentration. These approaches are discussed in more detail in the following sections.
Polymorphism dependent approach (parental-genotype dependent method): Direct estimation
The parental-genotype dependent method is a polymorphism dependent approach which involves identification of parental SNP loci at which both mother (AA) and father (BB) are homozygous but for a different allele each [28, 43, 53]. The resulting SNP loci in the fetus are obligately heterozygous (AB). The A-allele is termed the shared allele because it is identical between the fetus and the mother. The B-allele is termed the fetal-specific allele because it is present in the fetus but absent in the mother. By investigating the sequence reads originating from these obligately heterozygous SNP loci, the fractional fetal DNA concentration in plasma can be directly deduced by using the equation $\frac{2p}{p+q}$, where p is the total number of reads aligned to the fetal-specific allele (B), and q is the total number of reads aligned to the shared allele (A) (Figure 2).
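The equation above can be written as a short function. This is a minimal sketch; the function name `fetal_fraction` and the read counts in the usage line are made up for illustration.

```python
def fetal_fraction(fetal_specific_reads: int, shared_reads: int) -> float:
    """Fractional fetal DNA concentration, 2p / (p + q).

    p: reads carrying the fetal-specific allele (B)
    q: reads carrying the shared allele (A)
    Counts are pooled over all obligately heterozygous SNP loci.
    """
    p, q = fetal_specific_reads, shared_reads
    if p + q == 0:
        raise ValueError("no reads cover the informative SNP loci")
    return 2 * p / (p + q)


# Hypothetical counts: 50 fetal-specific reads and 950 shared reads.
example = fetal_fraction(50, 950)
print(example)
```

The factor of two reflects that the fetus contributes one B allele and one A allele at each informative locus, so the B reads represent only half of the fetal molecules.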
Polymorphism dependent approach (parental-genotype independent method): FetalQuant
The parental-genotype dependent method described above determines the fetal contribution in maternal plasma directly. However, its application is limited by the need for paternal genotype information. An epidemiological study on paternal discrepancy (PD) around the world suggested that the prevalence of PD can be as high as 30% [54]. To overcome this limitation, Jiang et al. developed FetalQuant, a bioinformatics algorithm that applies maximum likelihood to estimate the fractional fetal DNA concentration without the need for paternal genotype information [55]. FetalQuant hypothetically categorizes each SNP into one of the following groups: $AA_{AA}$, $AA_{AB}$, $AB_{AA}$ and $AB_{AB}$, where the main symbols represent the maternal genotypes and the subscripts represent the fetal genotypes. A and B, respectively, refer to the most prevalent and second-most prevalent allele at a particular locus. Each of these genotypes has its own distribution for the expected B-allele counts. In particular, the B-allele occurrences at the $AA_{AB}$ and $AB_{AA}$ genotypes follow binomial distributions, which are directly parameterized by the fractional fetal DNA concentration. For example, at 10% fetal DNA concentration, the chance of observing the B-allele at an $AA_{AB}$ locus would be 5% in plasma. Thus, in turn, the fractional fetal DNA concentration can be deduced by searching for the value at which the likelihood of the binomial mixture model reaches its maximum.
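The idea of maximizing a binomial mixture likelihood over the fetal fraction can be sketched as follows. This is not the FetalQuant implementation: it assumes equal mixture weights for the four genotype classes, a fixed sequencing error rate for the class in which mother and fetus are both AA, and a simple grid search. All function names, parameters and the toy data are hypothetical.

```python
import math


def log_binom_pmf(b: int, n: int, p: float) -> float:
    """Log probability of observing b B-allele reads among n reads."""
    return (math.lgamma(n + 1) - math.lgamma(b + 1) - math.lgamma(n - b + 1)
            + b * math.log(p) + (n - b) * math.log(1 - p))


def log_likelihood(loci, f, error_rate=0.001):
    """Mixture log-likelihood of (b_count, depth) pairs at fetal fraction f."""
    # Expected B-allele fraction for each maternal/fetal genotype class:
    # AA/AA (errors only), AA/AB, AB/AA, AB/AB.
    b_fractions = [error_rate, f / 2, (1 - f) / 2, 0.5]
    log_weight = math.log(1 / len(b_fractions))  # equal weights (assumption)
    total = 0.0
    for b, n in loci:
        terms = [log_weight + log_binom_pmf(b, n, p) for p in b_fractions]
        peak = max(terms)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total


def estimate_fetal_fraction(loci):
    """Grid search over f = 0.001 .. 0.499 for the maximum likelihood."""
    candidates = [i / 1000 for i in range(1, 500)]
    return max(candidates, key=lambda f: log_likelihood(loci, f))


# Toy data: (B-allele reads, depth) per SNP, constructed for a 10% fetal fraction.
toy_loci = [(10, 200)] * 20 + [(90, 200)] * 20 + [(0, 200)] * 20
f_hat = estimate_fetal_fraction(toy_loci)
print(f_hat)
```

A fuller model would estimate the mixture weights as well, for instance with an expectation-maximization loop, rather than fixing them.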
Polymorphism independent approach: Size-based analysis
Recently, an alternative method based on the size distribution of the cfDNA molecules in maternal plasma was proposed to estimate the fractional fetal DNA concentration [56]. This method is based on the finding that fetal-derived cfDNA molecules are in general shorter than maternal-derived cfDNA molecules [28, 29]. As the gestational age progresses, the relative contribution of fetal cfDNA in maternal plasma increases. Consequently, the size ratio, that is, the ratio between shorter fragments and longer fragments in maternal plasma, is expected to increase with gestational age. Using a cross-validation model consisting of 73 euploid pregnancies carrying male fetuses, Yu et al. demonstrated that the size ratio has a linear relationship with the fractional fetal DNA concentration, and showed that the size-based method is highly concordant with the fractional fetal DNA concentration as determined by the proportion of chromosome Y sequences in maternal plasma, with a median absolute difference of 2.3%.
In the estimation of fractional fetal DNA concentration, all three methods provide similar accuracy. However, each method has its own specific requirements for implementation. If the paternal genotype is available, then the parental-genotype dependent method provides a direct estimation of the fractional fetal DNA concentration from the blood sample, and thus would be the method of choice. If the paternal genotype is unavailable, then both FetalQuant and the size-based method can be used. The implementation of FetalQuant requires a high sequencing depth through targeted sequencing because the maximum likelihood depends on the mean coverage of the SNPs. So, it could be costly if a large number of samples are analyzed. The size-based method, though it requires less sequencing depth than FetalQuant, needs a set of healthy samples to establish the range of normal size ratios. It is also comparatively more cost-effective because the normal range can be reused once it is established. However, the size of cfDNA would be affected by pathological and physiological conditions such as systemic lupus erythematosus and cancer [57, 58], and thus the method may only be applicable to healthy subjects.
Noninvasive detection of fetal aneuploidy in singleton pregnancies
One of the most important clinical applications of cfDNA is the noninvasive prenatal diagnosis of fetal aneuploidies: trisomy 21 (Down’s syndrome), trisomy 18 (Edwards syndrome) and trisomy 13 (Patau syndrome). The discovery of fetal DNA in maternal plasma in 1997 [1] opened up new possibilities for such applications. However, circulating cfDNA derived from the fetus exists only as a minor fraction, making it technologically challenging to capture and quantify these molecules. In general, fetal cfDNA amounts to approximately 10% of maternal plasma cfDNA in pregnant women [59]. Chromosome 21 is about 1.5% of the total human genome length. In trisomy 21, the genomic representation of chromosome 21 is then expected to increase slightly from 1.5% to 1.57%. That is, the perturbation of the genomic representation of chromosome 21 in the presence of aneuploidy is extremely small. Therefore, a large number of plasma cfDNA molecules need to be analyzed in order to generate precise quantitative results. Massively parallel sequencing offers a practical solution to this problem because millions of cfDNA molecules can be sequenced simultaneously. Two main approaches to detect fetal aneuploidy in singleton pregnancies noninvasively are discussed here: the whole-genome approach and the targeted approach. The targeted approach was developed to increase the sequencing depth at the targeted sites at the expense of a reduction in genome-wide coverage.
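The increase from 1.5% to 1.57% follows from a short calculation. The symbols $f$ (fractional fetal DNA concentration, taken as 0.10) and $r$ (euploid share of chromosome 21, taken as 0.015) are introduced here for illustration only. In a trisomy 21 pregnancy the fetal contribution to chromosome 21 rises by half, so the chromosome 21 dosage scales by $(1 - f) + 1.5f = 1 + f/2$, and the total amount of DNA grows by the same extra material.

```latex
% f: fractional fetal DNA concentration (assumed 0.10)
% r: share of chromosome 21 in a euploid pregnancy (assumed 0.015)
\[
r_{\mathrm{T21}}
  = \frac{r\,(1 + f/2)}{1 + r f/2}
  = \frac{0.015 \times 1.05}{1 + 0.015 \times 0.05}
  \approx 0.0157
\]
```

The relative change is therefore only about 5%, which is why very large read counts are needed to resolve it.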
Whole-genome approach: Tag-counting based analysis
Chromosome representation refers to the proportion of cfDNA represented by each individual chromosome in the plasma sample. In a trisomic pregnancy, the proportion of DNA molecules in maternal plasma that derive from the affected chromosome is increased because of the extra chromosome acquired by the fetus. This fact has provided the theoretical foundation for the detection of fetal aneuploidy using massively parallel sequencing. In the tag-counting method [43], each sequenced read is mapped to the human genome. The chromosome representation is estimated from the proportion of reads originating from a chromosome of interest with respect to the total number of reads. In the detection of trisomy 21, the mean and standard deviation of the chromosome 21 representation are calculated using a reference group of euploid pregnancies. The z-score is then used to compare the proportional representation of chromosome 21 between the sample and the reference group.
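The z-score comparison can be sketched in a few lines. The function name `chromosome_z_score` and the reference values below are invented for illustration and are not taken from any study.

```python
import statistics


def chromosome_z_score(sample_fraction, reference_fractions):
    """Z-score of a sample's chromosome representation against a
    reference group of euploid pregnancies."""
    mean = statistics.mean(reference_fractions)
    sd = statistics.stdev(reference_fractions)  # sample standard deviation
    return (sample_fraction - mean) / sd


# Hypothetical chromosome 21 representations (proportion of all reads).
euploid_reference = [0.0150, 0.0151, 0.0149, 0.0150, 0.0152, 0.0148]
z_test_sample = chromosome_z_score(0.0157, euploid_reference)
z_euploid_sample = chromosome_z_score(0.0150, euploid_reference)
print(round(z_test_sample, 2), round(z_euploid_sample, 2))
```

A sample would then be flagged when its z-score exceeds a chosen cutoff; the cutoff itself is a design decision of the assay.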
Several groups attempted to directly transfer the tag-counting method to the detection of trisomy 13 and trisomy 18, which are the most clinically important autosomal trisomies apart from trisomy 21 [5]. However, it has been reported that the measurements of genomic representation for chromosome 13 and chromosome 18 tend to be less precise than that for chromosome 21 [43, 60]. This is thought to be due to the GC biases related to chromosomes 13 and 18 [43]. Consequently, one group modified the tag-counting method by implementing GC correction prior to downstream analysis. The GC correction is implemented with locally weighted scatterplot smoothing (LOESS) regression [5, 51]. First, the reference genome is divided into sub-regions of equal size using a 50-kb interval, termed bins. LOESS works by fitting a regression between the count and the GC-content of the bins. The count in a bin is the number of fragments with the 5’ end falling within the corresponding bin, and the GC-content in a bin is the percentage of guanines and cytosines within the corresponding bin. A correction value is then calculated for each queried bin from the difference between the count predicted by LOESS and the expected count. After GC correction, the group demonstrated that the sensitivity of trisomy 13 and trisomy 18 detection using the tag-counting method improved significantly, from 36.0% to 100% and from 73.0% to 91.9%, respectively [5].
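The principle of bin-level GC correction can be shown with a deliberately simplified stand-in for LOESS. Instead of a locally weighted regression, the sketch below uses the median count of bins sharing the same rounded GC percentage as the predicted count, and rescales each bin toward the genome-wide median. This swapped-in technique, the multiplicative form of the correction, and all names are assumptions for illustration.

```python
import statistics
from collections import defaultdict


def gc_correct(bins):
    """Rescale per-bin read counts so that GC strata share one median.

    bins: list of (gc_percent, read_count), one entry per genomic bin.
    Returns corrected counts in the same order as the input.
    """
    # Group counts by rounded GC percentage (a crude substitute for LOESS).
    by_gc = defaultdict(list)
    for gc, count in bins:
        by_gc[round(gc)].append(count)
    predicted = {gc: statistics.median(c) for gc, c in by_gc.items()}
    overall = statistics.median(count for _, count in bins)
    corrected = []
    for gc, count in bins:
        expected_here = predicted[round(gc)]
        # Bins in strata with a zero median cannot be rescaled.
        corrected.append(count * overall / expected_here if expected_here else 0.0)
    return corrected


# Toy bins: GC-poor bins are over-represented, GC-rich bins under-represented.
toy_bins = [(40, 100), (40, 100), (40, 100), (60, 50), (60, 50), (60, 50)]
corrected_counts = gc_correct(toy_bins)
print(corrected_counts)
```

With real data the strata would be sparse and noisy, which is the reason a smooth fit such as LOESS is preferred over per-stratum medians.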
Whole-genome approach: Size-based analysis
Apart from its ability to estimate the fractional fetal DNA concentration, the size-based method proposed by Yu et al. is also capable of detecting chromosomal aneuploidies [56]. Early work based on real-time quantitative PCR had demonstrated that fetal cfDNA is in general shorter than maternal cfDNA [61]. The precise difference in the size distribution of these molecules, however, was only revealed later with the use of massively parallel sequencing technologies. The size of each circulating cfDNA fragment is deduced from the outermost coordinates of its corresponding paired-end reads aligned to the human reference genome. Maternal cfDNA has a size profile that shows a predominant peak at 166 bp with a series of small peaks occurring at a periodicity of 10 bp [28, 38]. The size profile of fetal cfDNA at single-base resolution exhibits an apparent reduction of the 166 bp peak and shows a peak at 143 bp with a similar 10 bp periodicity [28], illustrating that fetal cfDNA is shorter than maternal cfDNA. This observation has provided a theoretical basis for the size-based method in the detection of chromosomal aneuploidies. It is expected that the proportion of short molecules (< 150 bp) of a trisomic chromosome relative to a set of reference chromosomes is higher than that in a euploid pregnancy. The group demonstrated that the size-based method could reliably detect aneuploidy with a sensitivity of 95.2% and a specificity of 99%.
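The two quantities used by the size-based method, fragment size from the outermost aligned coordinates and the proportion of short molecules, can be sketched as below. The 1-based inclusive coordinate convention, the function names and the toy sizes are assumptions; the 150 bp cutoff is the one quoted above.

```python
def fragment_size(leftmost: int, rightmost: int) -> int:
    """Fragment size from the outermost aligned coordinates of a read pair.

    Assumes 1-based, inclusive genomic coordinates.
    """
    return rightmost - leftmost + 1


def short_fragment_proportion(sizes, cutoff: int = 150) -> float:
    """Proportion of fragments strictly shorter than the cutoff (in bp)."""
    sizes = list(sizes)
    if not sizes:
        raise ValueError("no fragments supplied")
    return sum(1 for s in sizes if s < cutoff) / len(sizes)


# Toy example: four fragments assigned to one chromosome.
toy_sizes = [143, 166, 166, 120]
proportion = short_fragment_proportion(toy_sizes)
print(fragment_size(1001, 1166), proportion)
```

Comparing this proportion for the chromosome of interest with that of the reference chromosomes gives the size-based signal described above.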
Targeted approach: allelic ratio analysis
The allelic ratio analysis is similar to the polymorphism dependent approach for calculating the fractional fetal DNA concentration. It also relies on informative SNPs where the mother is homozygous and the fetus is heterozygous [62]. A ratio between the total number of reads containing the fetal-specific allele (F) and the total number of reads containing the shared allele (S), termed the F-S ratio (FSR), is calculated for the targeted chromosome (e.g. chromosome 21) and for a reference chromosome (e.g. chromosome 7), respectively. In a euploid pregnancy, these two ratios are expected to be very close. In a pregnancy carrying a paternally derived trisomy 21, because the extra copy of chromosome 21 in the fetus is from the father, the FSR for the targeted chromosome is expected to be two-fold that for the reference chromosome. In a pregnancy carrying a maternally derived trisomy 21, because the extra copy passed to the fetus is from the mother and its genomic makeup is shared between the fetus and the mother, the FSR for the targeted chromosome is expected to be only slightly lower than the FSR for the reference chromosome. Provided that sufficient allelic counts are available, both types of trisomy 21 can be detected, but the latter scenario would require many more sequence reads.
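The expected FSR values in the three scenarios can be derived from allele dosages. The sketch below assumes loci where the mother is AA and the fetus carries one paternal B allele, and, for the paternal trisomy case, that the extra paternal copy also carries B. The function name and scenario labels are hypothetical.

```python
def expected_fsr(f: float, scenario: str = "euploid") -> float:
    """Expected F/S read ratio at loci where the mother is AA.

    f: fractional fetal DNA concentration.
    Dosages are counted in allele copies, weighted by DNA contribution.
    """
    maternal_shared = 2 * (1 - f)          # two A copies from the mother
    if scenario == "euploid":              # fetus AB
        fetal_specific, fetal_shared = f, f
    elif scenario == "paternal_trisomy":   # fetus ABB (assumed)
        fetal_specific, fetal_shared = 2 * f, f
    elif scenario == "maternal_trisomy":   # fetus AAB
        fetal_specific, fetal_shared = f, 2 * f
    else:
        raise ValueError(f"unknown scenario: {scenario}")
    return fetal_specific / (maternal_shared + fetal_shared)


reference = expected_fsr(0.10, "euploid")
paternal_ratio = expected_fsr(0.10, "paternal_trisomy") / reference
maternal_ratio = expected_fsr(0.10, "maternal_trisomy") / reference
print(paternal_ratio, maternal_ratio)
```

At a 10% fetal fraction the paternal case doubles the FSR, whereas the maternal case lowers it by only about 5%, which is consistent with the much larger read requirement noted above.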
Targeted approach: dosage-type analysis
The targeted dosage-type analysis is similar to the whole-genome tag-counting based analysis. It selectively amplifies and sequences genomic regions of interest from maternal plasma and uses a z-test to assess over-representation of the trisomic chromosome [44]. In the method developed by Sparks et al., sequence counts were normalized via median polish, which systematically removes sample and genomic location biases. With the normalized counts, a standard z-test was used to classify aneuploidy. They showed that this method is both highly specific (99.2%) and highly sensitive (100%).
Targeted approach: Bayesian-based maximal likelihood method
The third method is parental support (PS) [63]. PS is an algorithm that utilizes a Bayesian-based maximal likelihood method to determine the presence of aneuploidy. In this method, samples were first processed by highly multiplexed PCR at 11,000 SNPs on chromosomes 13, 18, 21, X and Y before the application of massively parallel sequencing. In the subsequent bioinformatics analysis, the PS algorithm generates billions of possible genotype combinations by varying the number of chromosomal copies, the fractional fetal DNA concentration and the parental genotypes. By comparing the observed allelic distribution generated from the blood sample with the set of genotype combinations provided by PS, it is then possible to deduce the genotype combination that best explains the data. For example, a genotype combination provided by PS under the hypothesis of euploidy would not explain well the set of data derived from a triploidy blood sample. This method showed an accuracy of 99.92% in reporting chromosomal copy number. However, a potential limitation of PS is that it requires parental genotypes. As discussed above, this requirement might be difficult to fulfill in certain circumstances.
Noninvasive detection of twin zygosity and twin aneuploidies
Twin pregnancies make up a small but significant proportion of pregnancies in the population. The average twinning rate across 76 countries was estimated to be 13.1 per 1,000 births [64]. The United States of America had a higher prevalence, at 32 per 1,000 births [65]. More importantly, because most of the twin pregnancies are due to advanced age, twin pregnancies are at a higher risk than singleton pregnancies for aneuploidy [66]. Therefore, prenatal screening for twin pregnancies is imperative.
In light of the success of noninvasive prenatal testing for fetal aneuploidies in singleton pregnancies, multiple groups have studied the feasibility of detecting fetal aneuploidies in twin pregnancies [67-69]. One of the key parameters for interpreting fetal aneuploidies in twin pregnancies is zygosity. Qu et al. developed the calculation of the apparent fractional fetal DNA concentration to determine twin zygosity noninvasively [68]. The apparent fractional fetal DNA concentration is calculated at informative SNP loci where the mother is homozygous but at least one of the fetuses is heterozygous, with the following equation: $\frac{2p}{p+q}$, where q is the total number of reads mapped to the highest-count allele and p is the total number of reads mapped to the second-highest-count allele. In monozygotic twins, because the twins are genetically identical, this fraction should be constant across chromosomes. In dizygotic twins, because the twins are not genetically identical, the fraction is expected to fluctuate across chromosomes at places where the genotypes of the twins are different.
In addition to determining twin zygosity, the same group pursued the question of detecting aneuploidies in twin pregnancies noninvasively using a three-step algorithm [69]. The first step classifies the twin pregnancy as either euploid or having at least one aneuploid fetus. This is done by comparing the sample's chromosomal representation of the targeted chromosome (e.g. chromosome 21 or 18) with the reference mean, which was derived from 11 euploid pregnancies (e.g. chromosome 21: 1.311, SD=0.007; chromosome 18: 2.808, SD=0.004), using the classical z-score approach. For instance, a pregnancy with a z-score > 3 is considered as having at least one aneuploid fetus. If at least one of the fetuses is aneuploid, then the algorithm proceeds to step two to determine twin zygosity. If the twins are monozygotic, then both twins are aneuploid. If the twins are dizygotic, then the algorithm proceeds to step three to determine whether one or both of the twins are affected. In this step, the genomic representation of the chromosome of interest (e.g. chromosome 21 or 18) and the fractional fetal DNA concentration of each fetus are calculated. The values are compared with the genomic representation estimated from a reference group of unaffected singleton pregnancies. The key to this step is that any increment in genomic representation is contributed solely by the trisomic fetus(es). Therefore, if the increment in the observed genomic representation is best explained by the total fractional fetal DNA concentration, both twins are aneuploid. On the other hand, if the increment is best explained by the contribution of only one of the two fetuses, only one twin is aneuploid.
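The three steps can be sketched as follows. This is a simplified illustration and not the published algorithm of [69]: the function names are hypothetical, and step three assumes that a trisomic fetus contributing a fractional concentration f raises the genomic representation of the affected chromosome by a relative amount f/2 (one extra copy on top of two), with the closest expected value taken as the call.

```python
def z_score(value: float, ref_mean: float, ref_sd: float) -> float:
    """Classical z-score of a sample against a euploid reference group."""
    return (value - ref_mean) / ref_sd


def expected_representation(ref_mean: float, trisomic_fraction: float) -> float:
    """Assumed model: trisomic fetal DNA at fraction f adds f/2 relative dosage."""
    return ref_mean * (1 + trisomic_fraction / 2)


def classify_twins(observed, ref_mean, ref_sd, f1, f2, monozygotic, z_cutoff=3.0):
    # Step 1: euploid versus at least one aneuploid fetus.
    if z_score(observed, ref_mean, ref_sd) <= z_cutoff:
        return "euploid"
    # Step 2: monozygotic twins are genetically identical, so both are affected.
    if monozygotic:
        return "both aneuploid"
    # Step 3: pick the scenario whose expected representation is closest to the observation.
    scenarios = {
        "both aneuploid": f1 + f2,
        "twin 1 aneuploid": f1,
        "twin 2 aneuploid": f2,
    }
    return min(
        scenarios,
        key=lambda name: abs(observed - expected_representation(ref_mean, scenarios[name])),
    )


# Chromosome 21 reference values quoted in the text; fetal fractions are hypothetical.
print(classify_twins(1.400, 1.311, 0.007, f1=0.08, f2=0.06, monozygotic=False))
print(classify_twins(1.365, 1.311, 0.007, f1=0.08, f2=0.06, monozygotic=False))
```

When the two fetuses contribute similar fractional concentrations, the single-twin scenarios become hard to separate, so a real implementation would need a statistical criterion rather than a nearest-value rule.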
Noninvasive haplotype based detection of monogenic disorders
With the rapid reduction in cost and the huge amount of data generated by massively parallel sequencing, it became possible to interrogate the entire fetal genome in maternal plasma [28, 70, 71]. The demonstration that the entire fetal genome is present in maternal plasma means that it is theoretically possible to diagnose any monogenic disease noninvasively. The following section reviews the three canonical bioinformatics algorithms that have been implemented to determine the fetal genome from maternal plasma: relative haplotype dosage analysis (RHDO) [28], the haplotype counting approach [70], and the hidden Markov model (HMM) approach [71].
RHDO was first introduced by Lo et al. to determine the maternal inheritance of the fetus [28]. β-thalassemia is a monogenic autosomal recessive blood disorder caused by mutations in the haemoglobin-beta (HBB) gene. In that study, both the mother and the father were heterozygous carriers of β-thalassemia. Paternal and maternal genotypes were obtained by microarray analysis, and the SNPs were grouped into five categories (Figure 3). Category 3 and category 4 were particularly important for deducing the paternal and maternal inheritances of the fetus, respectively. Category 3 SNPs were those for which the mother was homozygous and the father was heterozygous. The paternal inheritance of the fetus could be deduced from this group of SNPs by identifying the presence or absence of the non-maternal allele, i.e. the fetal-specific allele, in maternal plasma. Category 4 SNPs, for which the mother was heterozygous and the father was homozygous, were used to determine the maternal inheritance of the fetus. Deduction of the maternal inheritance of the fetus is much more challenging because there is no fetal-specific allele that is absent from the maternal genotype. Thus, the maternal inheritance of the fetus has to be determined by quantitatively comparing the dosage of the two maternal haplotypes.
The central idea of RHDO is to accumulate imbalances of the maternal alleles between the two maternal haplotypes, followed by statistical deduction of the maternal inheritance of the fetus based on these imbalances. There are two maternal haplotypes, Hap I and Hap II (Figure 4). The category 4 SNPs are further divided into two types in accordance with the maternal haplotypes, namely type α and type β. Type α SNPs are defined as those in which the paternal alleles are the same as those on maternal Hap I. If the fetus inherited Hap I from the mother, then an over-representation of Hap I relative to Hap II would be observed in maternal plasma. If the fetus inherited Hap II, then no over-representation would be seen. Type β SNPs are defined as those in which the paternal alleles are the same as those on maternal Hap II. If the fetus inherited Hap I from the mother, then an equal representation of Hap I and Hap II would be maintained in maternal plasma. If the fetus inherited Hap II, then an over-representation of Hap II, in other words an under-representation of Hap I, would be observed. The proportion of total reads contributed by Hap I relative to Hap II is then subjected to the sequential probability ratio test (SPRT) [3] to determine whether either haplotype is statistically over-represented (Figure 4). An important and unique feature of RHDO is its ability to exploit the consensus between calls from type α and type β SNPs to further enhance accuracy. With RHDO, it was demonstrated that the genome-wide parental inheritance could be deduced from maternal plasma. In that study [28], Lo et al. showed that the fetus inherited the mutant haemoglobin-beta (HBB) gene from the father and the wild-type HBB gene from the mother. The fetus was thus a heterozygous carrier of β-thalassemia, which was concordant with the clinical outcome.
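To make the dosage test concrete, the sketch below applies a generic Wald SPRT to accumulated read counts at type α SNPs. It is an illustration under stated assumptions and not the exact procedure or thresholds of [28]: the function name and the error rates are hypothetical, and the expected Hap I proportions follow from a simple mixture model with fractional fetal DNA concentration f (0.5 if the fetus inherited Hap II, 0.5 + f/2 if it inherited Hap I).

```python
import math


def sprt_type_alpha(hap1_reads: int, hap2_reads: int, fetal_fraction: float,
                    alpha: float = 0.01, beta: float = 0.01) -> str:
    """Wald SPRT on accumulated type-alpha read counts.

    H0: Hap I and Hap II are balanced (fetus inherited Hap II), p0 = 0.5.
    H1: Hap I is over-represented (fetus inherited Hap I), p1 = 0.5 + f/2.
    """
    p0 = 0.5
    p1 = 0.5 + fetal_fraction / 2
    # Binomial log-likelihood ratio of H1 versus H0.
    llr = (hap1_reads * math.log(p1 / p0)
           + hap2_reads * math.log((1 - p1) / (1 - p0)))
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this boundary
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this boundary
    if llr >= upper:
        return "Hap I over-represented"
    if llr <= lower:
        return "balanced"
    return "unclassified"


# Hypothetical accumulated counts at a fractional fetal DNA concentration of 10%.
print(sprt_type_alpha(5500, 4500, 0.10))
print(sprt_type_alpha(5000, 5000, 0.10))
print(sprt_type_alpha(5250, 4750, 0.10))
```

An "unclassified" result means that more SNPs need to be accumulated along the haplotype block before a call is made, which is why the test is described as sequential.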
Haplotype counting is an alternative method proposed by Fan et al. to infer the fetal genome from maternal plasma. This method is based on counting the relative representation of the haplotype pairs of each parent [70]. In a pregnant woman's plasma, there are three haplotypes: the maternal haplotype that is transmitted to the fetus, the maternal haplotype that is not transmitted to the fetus, and the paternal haplotype that is transmitted to the fetus. If the fractional fetal DNA concentration in maternal plasma and the number of genome equivalents are denoted by f and G, respectively, then the relative genome equivalent of the un-transmitted maternal haplotype is G*(1-f), and that of the transmitted maternal haplotype is G. Analogously, the relative genome equivalent of the transmitted paternal haplotype is G*f, and that of the un-transmitted paternal haplotype is 0. In each pair of parental haplotypes, the transmitted one is over-represented relative to the un-transmitted one. Therefore, with sufficiently high sequencing depth and sufficiently long haplotype blocks, the maternally transmitted haplotype can be deduced by counting the number of alleles that are aligned to each maternal haplotype (Figure 5). For the paternal part, inheritance can be traced by identifying markers that are not present in the maternal genome. Thus, the whole fetal genome can be inferred noninvasively.
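The expected dosages stated above can be tabulated directly. The following sketch uses hypothetical function names and a deliberately naive majority rule; the statistical treatment of counting noise in [70] is not reproduced.

```python
def expected_dosage(G: float, f: float) -> dict:
    """Relative genome equivalents of the four parental haplotypes in maternal plasma."""
    return {
        "maternal_transmitted": G,
        "maternal_untransmitted": G * (1 - f),
        "paternal_transmitted": G * f,
        "paternal_untransmitted": 0.0,
    }


def call_transmitted_maternal_haplotype(hap1_count: int, hap2_count: int):
    """Naive call: the maternal haplotype with more allele counts is taken as transmitted."""
    if hap1_count == hap2_count:
        return None  # no call without an imbalance
    return "Hap 1" if hap1_count > hap2_count else "Hap 2"


# Hypothetical example: 1,000 genome equivalents and a 10% fractional fetal DNA concentration.
print(expected_dosage(1000, 0.10))
print(call_transmitted_maternal_haplotype(10480, 9520))
```

The gap between the two maternal haplotypes is only G*f, so at low fractional fetal DNA concentrations many counts must be pooled over a long haplotype block before the imbalance exceeds sampling noise.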
Kitzman et al. reported another method to infer the fetal genome from maternal plasma [71]. The maternal inheritance of the fetus was inferred with an HMM (Figure 6). Classically, an HMM has three components: latent states, emission probabilities and transition probabilities. In Kitzman et al.'s HMM implementation, the maternal inheritance at each SNP is determined by two factors: the maternal inheritance at the previous SNP being interrogated (latent state) and the SNP type (α or β) (emission probability). The model also accounts for natural haplotype switching events such as genetic recombination (transition probability). The Viterbi algorithm, a dynamic programming algorithm that searches for the state sequence with the maximum probability, was used to generate the most probable latent state sequence, from which the maternal inheritance of the fetus could be deduced. The paternal inheritance of the fetus was inferred using a method similar to that of Lo et al.
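A two-state Viterbi decoder conveys how such a model works. The code below is a minimal sketch and not Kitzman et al.'s implementation: the emission model (binomial read counts whose expected Hap I proportion depends on the inherited haplotype, the SNP type and an assumed fractional fetal DNA concentration f), the switching probability and all names are assumptions made for illustration.

```python
import math

STATES = ("Hap I", "Hap II")


def hap1_fraction(state: str, snp_type: str, f: float) -> float:
    """Expected proportion of reads carrying the Hap I allele (assumed mixture model)."""
    if snp_type == "alpha":
        return 0.5 + f / 2 if state == "Hap I" else 0.5
    return 0.5 if state == "Hap I" else 0.5 - f / 2


def log_emission(state: str, snp, f: float) -> float:
    """Binomial log-likelihood of the read counts, up to a state-independent constant."""
    snp_type, hap1_reads, hap2_reads = snp
    p = hap1_fraction(state, snp_type, f)
    return hap1_reads * math.log(p) + hap2_reads * math.log(1 - p)


def viterbi(snps, f: float, switch_prob: float = 1e-3):
    """Most probable sequence of inherited maternal haplotypes along ordered SNPs."""
    log_stay, log_switch = math.log(1 - switch_prob), math.log(switch_prob)

    def step(prev, cur):
        return log_stay if prev == cur else log_switch

    score = {s: math.log(0.5) + log_emission(s, snps[0], f) for s in STATES}
    back = []
    for snp in snps[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda r: score[r] + step(r, s))
            pointers[s] = best_prev
            new_score[s] = score[best_prev] + step(best_prev, s) + log_emission(s, snp, f)
        back.append(pointers)
        score = new_score
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical SNPs: (type, reads on the Hap I allele, reads on the Hap II allele).
snps = [("alpha", 600, 400), ("alpha", 600, 400), ("beta", 400, 600), ("beta", 400, 600)]
print(viterbi(snps, f=0.2))
```

In this toy example the decoded path switches from Hap I to Hap II between the second and third SNPs, which corresponds to a recombination event in the transmitted maternal chromosome.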
In summary, RHDO, the haplotype counting method and the HMM method provide similar accuracy in determining the maternal inheritance of the fetus. The main differences among these algorithms lie in the experimental protocols used beforehand to obtain the maternal haplotypes and in their mathematical complexity.
For RHDO, as a proof-of-concept demonstration, the maternal haplotypes were deduced by microarray genotyping of tissue samples from the family trio (father, mother and fetus). However, it is theoretically possible to obtain the maternal haplotypes noninvasively by collecting genotype information from other family members. If familial genotype information is not readily available, then this experimental protocol cannot be used. The experimental protocols for both the haplotype counting method and the HMM method are capable of deducing the maternal haplotypes without genotype information from other family members, but they require significant technological expertise to obtain the maternal haplotypes from the extracted maternal cells. The experimental protocol for the haplotype counting method relies on a microfluidic device to perform direct deterministic phasing as well as on the isolation of blood cells at metaphase. This is both technologically challenging and extremely labor-intensive. The experimental protocol for the HMM method obtains the maternal haplotypes via clone-pool dilution sequencing. Briefly, the maternal genome was sheared, cloned into fosmids and cultured within Escherichia coli. The clones were subsequently sequenced and the maternal haplotypes were reconstructed. This method is also labor-intensive, and the whole library preparation process takes about a week [72].
In terms of mathematical complexity, RHDO and the haplotype counting method are simpler because both rely on counting alleles and testing for the accumulated imbalance between maternal haplotypes. Additionally, the separation of type α and type β SNPs in RHDO allows for a consistency check to reduce false-positive calls. In contrast, the HMM method is mathematically more challenging. Implementing the algorithm requires an in-depth understanding of probability and Markov chains.
In summary, amongst these three experimental protocols, RHDO should be easier to implement and more cost-effective if familial genotype information is available. If familial genotype information is unavailable, then the experimental protocols for the haplotype counting method and the HMM method could be applied, depending on the availability of technology and professional expertise.
Estimation of fractional tumor DNA concentration
The presence of circulating tumor cfDNA (ctDNA) was first detected in cancer patients in 1996 [2]. Since then, numerous efforts have been made to detect and quantify ctDNA in plasma and to investigate the clinical value of these circulating nucleic acids. Earlier efforts using conventional PCR explored the clinical utility of detecting multiple oncogenes in ctDNA [16-18]. However, the heterogeneity of tumor cells adds an extra layer of complexity to the analysis of ctDNA using conventional PCR assays. For instance, mutations that are present only in subclones of a tumor may be rare, and thus fewer tumor DNA molecules carrying these mutations are released into plasma. As a result, mutations detected in tumors might not be detected in the circulating nucleic acids. Conversely, some mutations existing in subclones may be missed during sampling of tumor tissues even though a considerable amount of DNA carrying these mutations is present in the circulating nucleic acids. Therefore, mutations detected in plasma might not be found in the matched tumor tissues. A notable example is the KRAS mutation, which is frequently detected in tumor tissues of the colon, lung and pancreas, whereas the detection of KRAS mutations in blood samples has not been highly consistent [18, 73, 74].
Maturation of massively parallel sequencing has shed light on ctDNA research. It allows for the detection of mutations present at lower frequencies in the tumor and in the circulation. This enables the precise quantification of important tumor-related information, such as the concentration of ctDNA, which is well reflective of tumor burden in cancer patients. The concentration of ctDNA can be estimated from sequencing data in two ways: the mutant allele approach and the loss of heterozygosity approach (Figure 7).
Mutant allele approach
The mutant allele approach requires either a predetermined set of mutations (such as those commonly found in certain cancers) [7, 10, 11] or prior analysis of a resected tumor [8]. In the latter case, monitoring of ctDNA is useful for the early detection of tumor recurrence. The concentration of ctDNA in plasma can either be expressed as the number of mutant fragments per volume (e.g. 100 fragments per 5 ml) [7, 10] or be calculated with the following equation: $\frac{2m}{m+w}$, where m is the number of reads containing the mutant alleles and w is the number of reads containing the wild-type alleles [8] (Figure 7). The advantage of this method is that the calculation is relatively straightforward. The downside is that it is invasive, because a tumor sample is required for the detection of lower-frequency alleles. On the other hand, with the use of a predetermined set of mutations, there is no need to analyze a tumor biopsy. However, the major drawback of using a predetermined set of mutations is that the calculated concentration of ctDNA might be less precise because individual-specific mutations might be missing from the preset panel.
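As a small worked example of the equation above, the sketch below computes the fractional tumor DNA concentration from read counts. The function name and the counts are hypothetical; the factor of 2 reflects the usual assumption that each mutation is heterozygous in the tumor cells, so that only half of the tumor-derived molecules at the locus carry the mutant allele.

```python
def tumor_fraction_from_mutations(mutant_reads: int, wildtype_reads: int) -> float:
    """Fractional tumor DNA concentration, 2m / (m + w), assuming heterozygous mutations."""
    total = mutant_reads + wildtype_reads
    if total == 0:
        raise ValueError("no reads covering the mutation sites")
    return 2 * mutant_reads / total


# Hypothetical counts pooled over the interrogated mutation sites.
print(tumor_fraction_from_mutations(25, 975))
```

With 25 mutant reads out of 1,000, the estimate is 5%, twice the observed mutant allele frequency of 2.5%.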
Loss of heterozygosity approach
Chan et al. developed an approach, termed genome-wide aggregated allelic loss (GAAL), that utilizes the heterozygous SNPs residing in regions exhibiting loss of heterozygosity (LOH) [8]. For GAAL, LOH-associated heterozygous SNPs were first identified by microarray analysis of the tumor biopsy. Alleles that were deleted in the tumor would exist at a lower frequency in the patient's plasma than the non-deleted alleles. The reads carrying the deleted alleles would therefore mainly reflect the non-tumor fraction of the circulating cfDNA, while the reads harboring the non-deleted alleles would be enriched for tumor-derived DNA. Therefore, the fractional tumor DNA concentration could be deduced with the following equation: $\frac{N_{nondel} - N_{del}}{N_{nondel}}$, where $N_{nondel}$ represents the number of sequenced reads carrying the non-deleted alleles and $N_{del}$ represents the number of sequenced reads carrying the deleted alleles (Figure 7).
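The GAAL equation can be illustrated by aggregating read counts over all LOH-associated SNPs before taking the ratio. This is a minimal sketch with hypothetical names and counts. Under a simple single-copy-loss model with tumor fraction F, the non-deleted allele is contributed by both tumor and non-tumor DNA while the deleted allele is contributed only by non-tumor DNA, so that the deleted-to-non-deleted ratio is 1 - F.

```python
def gaal_tumor_fraction(snp_counts) -> float:
    """Fractional tumor DNA concentration, (N_nondel - N_del) / N_nondel.

    snp_counts: iterable of (reads on non-deleted allele, reads on deleted allele),
    one pair per heterozygous SNP located in an LOH region.
    """
    n_nondel = sum(nondel for nondel, _ in snp_counts)
    n_del = sum(deleted for _, deleted in snp_counts)
    if n_nondel == 0:
        raise ValueError("no reads on non-deleted alleles")
    return (n_nondel - n_del) / n_nondel


# Hypothetical per-SNP read counts in LOH regions.
print(gaal_tumor_fraction([(120, 80), (130, 95), (150, 125)]))
```

Aggregating across many SNPs is what gives the approach its precision, because the counts at any single SNP are too small to measure a modest allelic imbalance.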
The fractional tumor DNA concentration has been demonstrated to correlate well with tumor size [7, 8, 10, 11]. As a result, it can be a useful parameter for comparing tumor load across different types of cancers. Furthermore, multiple studies have shown that the tumor fraction falls to an extremely low concentration after surgery [7, 8, 11]. Therefore, the tumor fraction can be useful in monitoring cancer patients in both the short term and the long term. In cancer patients immediately after surgery, persistence of a small tumor fraction in plasma might suggest residual tumor. In the long term, a raised tumor fraction might indicate recurrence. ctDNA thus offers potential for the early detection of recurrent tumor and for early intervention.
Noninvasive detection of copy number aberrations in cancer patients
In addition to the fractional tumor DNA concentration, which assesses tumor load in patients' plasma, copy number changes are another possible biomarker for the noninvasive detection of tumors. Copy number aberrations have been found in almost all tumors [75-78]. In some cases, amplification of oncogenes was found to be the culprit for tumorigenesis [79, 80]. Therefore, noninvasive detection of copy number aberrations is increasingly recognized as a desirable approach for cancer detection and monitoring [8, 9, 81-83]. However, in contrast to the noninvasive detection of fetal aneuploidy, the characteristics and exact regions of the human genome affected by copy number aberrations are generally not known in advance. For example, the sizes, genomic locations and amplitudes of such aberrations vary among different cancer types. Thus, for a cancer patient, genome-wide scanning of copy number aberrations would be extremely useful.
Chan et al. developed an approach for this purpose by dividing the human genome into equally sized bins of 1 Mb in length [8, 9]. That is, approximately 3,000 bins are analyzed simultaneously across the whole human genome. The genomic representation of each 1-Mb bin is calculated and adjusted according to its GC content to improve the detection power. Copy number gains and losses in the tumor are then reflected by increased and decreased genomic representations in the circulation, respectively. Their method compared the genomic representation of cancer patients with that of healthy subjects using the z-score approach for each 1-Mb bin. One technical issue of this method is that it may introduce false positives due to multiple comparisons. This issue can be minimized by adopting a more stringent threshold for positive calls, for example by classifying a patient as cancer positive only when a certain proportion of bins show aberrations [9], or by applying Bonferroni correction to the p-values [83].
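The per-bin comparison and the Bonferroni adjustment can be sketched as follows. This is an illustrative outline rather than the pipeline of [8, 9] or [83]: the function names and the toy data are hypothetical, GC correction is omitted, and the two-sided significance level of 0.05 is an assumption.

```python
from statistics import NormalDist, mean, stdev


def bonferroni_z_cutoff(n_bins: int, alpha: float = 0.05) -> float:
    """Two-sided z-score cutoff after Bonferroni correction for n_bins tests."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_bins))


def bin_z_scores(patient, controls):
    """z-score of each bin's genomic representation against the healthy controls."""
    scores = []
    for i, value in enumerate(patient):
        ref = [control[i] for control in controls]
        scores.append((value - mean(ref)) / stdev(ref))
    return scores


def call_bins(z_scores, cutoff: float):
    """Label each bin as a copy number gain, loss or normal."""
    return ["gain" if z > cutoff else "loss" if z < -cutoff else "normal" for z in z_scores]


# Toy example with three bins; a real analysis would use about 3,000 1-Mb bins.
controls = [[1.00, 2.00, 3.00], [1.02, 1.98, 3.02], [0.98, 2.02, 2.98]]
patient = [1.00, 2.50, 2.50]
cutoff = bonferroni_z_cutoff(len(patient))
print(call_bins(bin_z_scores(patient, controls), cutoff))
print(bonferroni_z_cutoff(3000))
```

For 3,000 bins the corrected cutoff is considerably more stringent than the conventional value of 3, which illustrates the trade-off between controlling false positives and retaining sensitivity for low-amplitude aberrations.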
Another way to reduce false positives arising from multiple comparisons is to assess copy number aberrations at the chromosomal arm level. Leary et al. introduced the plasma aneuploidy (PA) score to examine chromosomal arm changes in breast cancer and colon cancer patients [83]. This approach combined information from the five chromosomal arms that demonstrated the most aberrant changes to determine both the cancer status and the tumor load of patients.
Comparing the two methods, the bin method is more general and can be applied to different cancer types, whereas the PA score would have a more limited spectrum of application, namely cancer types with aberrations that primarily affect entire chromosomal arms. However, the bin method is likely to be less specific than the PA score because of the effect of multiple comparisons. Nonetheless, both methods demonstrate the clinical utility of copy number changes in noninvasive cancer detection. Furthermore, Chan et al. demonstrated that, with the bin method, copy number changes virtually disappeared from the plasma of postoperative cancer patients [8], suggesting that their method has an additional utility in monitoring tumor clearance and recurrence.
Noninvasive detection of genomic rearrangements in cancer patients
Genomic rearrangement is common in many cancers. A notable example is the formation of the fusion gene BCR-ABL as a result of chromosomal translocation (Philadelphia chromosome) responsible for the development of chronic myeloid leukemia [84]. Other recurrent genomic rearrangements involving the immunoglobulin genes, T-cell receptor genes and retinoic acid receptor alpha gene have been reported to be associated with haematological malignancies [85-87]. Yet, recurrent genomic rearrangements are not common among solid tumors [88]. Solid tumors often harbor genomic rearrangements that are individualized. Therefore, detection of these genomic rearrangements from solid tumors could aid in the understanding of tumor characteristics as well as the development of a more personalized treatment.
Leary et al. recently developed a noninvasive version of personalized analysis of rearranged ends (PARE) [83, 88] to detect genomic rearrangements in the plasma of cancer patients. PARE was originally developed to identify patient-specific genomic rearrangements in solid tumors [88]. These specific genomic rearrangements were useful for monitoring cancer progression and remission and for personalized care. The method utilized mate-pair library sequencing to identify reads that contained genomic rearrangement breakpoints. These genomic rearrangements were then further evaluated by PCR amplification across the rearrangement junctions. For noninvasive PARE, additional bioinformatics filters were applied to remove alignment artifacts and common germline polymorphisms. In their study, they analyzed plasma samples from 10 cancer patients and 10 normal subjects, revealing 14 rearrangements in 9 out of 10 plasma samples from cancer patients that were not found in normal subjects.
Although Leary et al.'s method for detecting genomic rearrangements in cancer patients has high specificity, it is not suitable for screening purposes because only a handful of rearrangements exist in the cancer genome. Moreover, the sequencing depth required for the detection of genomic rearrangements is high. At the current cost of sequencing, it is still not cost-effective to use this method for screening in the general population. However, this method is suitable for follow-up purposes because the rearrangements would presumably have already been identified from the excised tumor samples. Noninvasive targeted sequencing could then be applied and would be comparatively more cost-effective.
Noninvasive methylomic analysis of circulating cfDNA
A recent application of massively parallel sequencing of circulating cfDNA is methylomic analysis. Epigenetic mechanisms play an important role in fetal development, postnatal consequences [89, 90], tumorigenesis and tumor progression [91, 92]. Multiple fetal-specific [93-95] and tumor-specific [96-98] cfDNA methylation biomarkers had been discovered with conventional PCR assays. Yet, these markers provided only a limited picture of the fetal and tumor methylomes. The whole-genome methylome in plasma became accessible recently with the advent of massively parallel bisulfite sequencing. In the following section, we briefly discuss the bioinformatics algorithms commonly used in methylomic analysis of cfDNA and give a more in-depth overview of the clinical applications in which bioinformatics is applied.
Integrated bioinformatics packages for methylomic analysis
The methylome can be assessed by first treating the DNA molecules with bisulfite [99, 100], which converts unmethylated cytosines into uracils while leaving methylated cytosines unchanged. The uracils are then converted into thymines during PCR amplification. As a result, a modified DNA sequence is obtained that reveals the methylation status.
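The conversion chemistry described above can be mimicked in silico to show how methylation calls are read off a converted sequence. The sketch below is a toy model with hypothetical function names: it considers only the forward strand, ignores sequencing errors and incomplete conversion, and does not represent how any of the alignment packages discussed here are implemented.

```python
def bisulfite_convert(sequence: str, methylated_positions=frozenset()) -> str:
    """Sequence observed after bisulfite treatment and PCR.

    Unmethylated C is read as T; C at a methylated position (0-based) stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(sequence.upper())
    )


def call_methylation(reference: str, read: str) -> dict:
    """For each reference C, True if the read retains C (methylated), False if it shows T."""
    calls = {}
    for i, (ref_base, read_base) in enumerate(zip(reference.upper(), read.upper())):
        if ref_base == "C" and read_base in ("C", "T"):
            calls[i] = read_base == "C"
    return calls


reference = "ACGTCCGA"
read = bisulfite_convert(reference, methylated_positions={1})
print(read)
print(call_methylation(reference, read))
```

Because converted reads no longer match the reference at unmethylated cytosines, dedicated bisulfite-aware aligners are required, which motivates the tools discussed next.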
Multiple bisulfite sequencing alignment tools have been developed [101-106], and extensive reviews covering both the alignment methods [107] and the packages [108] are available elsewhere. Notably, while most of these alignment tools can align bisulfite-treated fragments efficiently and accurately, very few of them provide downstream analysis such as the identification of differentially methylated regions (DMRs). To date, BSmooth [105] and Methy-Pipe [106] have been developed to perform an integrative analysis including alignment, methylation level determination, DMR identification and DMR annotation in one package. In terms of alignment speed, Methy-Pipe outperforms Bismark [104], which has been shown to be faster than most other packages. Both BSmooth and Methy-Pipe can identify DMRs, which are central to many methylomic analyses nowadays. The integrative nature of these two bioinformatics packages renders them well suited for comprehensive analysis of bisulfite sequencing data.
Methylomic analysis in plasma of pregnant women
The availability of bisulfite sequencing and the corresponding bioinformatics software has enabled noninvasive prenatal methylomic analysis. Lun et al. explored this possibility using whole-genome bisulfite sequencing of maternal plasma cfDNA [109] and were able to determine the placental methylome noninvasively. One approach was based on the analysis of fetal-specific polymorphic alleles. Another approach relied on knowing the fractional fetal DNA concentration and the methylome of blood cells to deduce the placental methylome in reverse. An example of a clinical application of noninvasive prenatal methylomic analysis would be the detection of trisomy 21. These technologies could help gain greater insight into the underlying mechanisms and locations of placental- or fetal-specific methylation changes at individual CpG residues and may aid in the further identification of potential epigenetic biomarkers for prenatal diagnosis.
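The second approach can be illustrated with a simple linear mixture model. This is an assumption made for illustration and not necessarily the exact formula of [109]: the methylation density observed in plasma at a locus is taken to be the average of the placental and blood cell densities weighted by the fractional fetal DNA concentration f, and the function name and values are hypothetical.

```python
def deduce_placental_density(plasma_md: float, blood_md: float, f: float) -> float:
    """Placental methylation density under plasma = f * placenta + (1 - f) * blood."""
    if not 0 < f <= 1:
        raise ValueError("fractional fetal DNA concentration must be in (0, 1]")
    estimate = (plasma_md - (1 - f) * blood_md) / f
    # Sampling noise can push the estimate outside [0, 1]; clamp it to a valid density.
    return min(1.0, max(0.0, estimate))


# Hypothetical locus: 76% methylated in plasma, 80% in blood cells, f = 10%.
print(deduce_placental_density(0.76, 0.80, 0.10))
```

Because the difference between plasma and blood cell densities is divided by f, small measurement errors are amplified at low fractional fetal DNA concentrations, so such estimates are more reliable when pooled over many CpG sites.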
Methylomic analysis in plasma of cancer patients
Chan et al. assessed the whole-genome methylome in the plasma of cancer patients [9]. Detection was performed by comparing the methylation density of cancer patients in 1-Mb bins with that of a group of healthy individuals using the z-score approach. They showed that, in cancer patients, hypomethylation across the genome could be reliably detected with good sensitivity (68%) and high specificity (94%) even at a relatively low sequencing depth (~10 million reads per case).
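A sketch of this comparison is given below. The names, the toy data and the z-score cutoff of -3 are assumptions made for illustration; the bin-level criteria and the proportion of aberrant bins used to classify a case in [9] are not reproduced here.

```python
from statistics import mean, stdev


def methylation_density(methylated: int, unmethylated: int) -> float:
    """Proportion of methylated cytosine calls among all CpG calls in a bin."""
    return methylated / (methylated + unmethylated)


def hypomethylated_fraction(patient_md, control_md, z_cutoff: float = -3.0) -> float:
    """Fraction of bins whose methylation density z-score falls below the cutoff."""
    flagged = 0
    for i, value in enumerate(patient_md):
        ref = [control[i] for control in control_md]
        if (value - mean(ref)) / stdev(ref) < z_cutoff:
            flagged += 1
    return flagged / len(patient_md)


# Toy example with two 1-Mb bins and three healthy controls.
controls = [[0.80, 0.75], [0.82, 0.77], [0.78, 0.73]]
patient = [methylation_density(600, 400), methylation_density(750, 250)]
print(hypomethylated_fraction(patient, controls))
```

Reporting the fraction of hypomethylated bins, rather than any single bin, is what allows a genome-wide signal to be detected at low sequencing depth.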
As illustrated by the examples above, noninvasive methylomic analysis of cfDNA is a promising tool for personalized medicine. In particular, methylation plays an instrumental role in cancer development. DNA hypomethylation is common in cancer cells and is known to promote tumorigenesis through transcriptional activation of proto-oncogenes. Noninvasive methylomic analysis thus offers the potential for more precise stratification of cancer subtypes, thereby enabling more personalized therapeutic treatment for cancer patients in the future.
Conclusion
In the past decade, we have witnessed rapid advancements in the technologies and bioinformatics algorithms available for the analysis of circulating cfDNA. With the availability of massively parallel sequencing and the development of sophisticated bioinformatics, noninvasive prenatal testing has become increasingly mature and has established itself as a leading field in translational research. Of note, large-scale clinical validation studies have been conducted to demonstrate the high sensitivity of noninvasive prenatal screening for fetal aneuploidies [4, 110]. Noninvasive prenatal testing has already become an integral part of clinical practice in obstetric clinics.
Building on the success of noninvasive prenatal diagnosis, noninvasive cancer assessment has gained much popularity in recent years. Because of the added complexity of cancer genetics, bioinformatics algorithms developed for noninvasive prenatal testing can be transferred conceptually but require significant technical adaptations. Some of these algorithms appear promising for assessing tumor load and remission. Recently, the development of bisulfite sequencing and dedicated bioinformatics software has opened up another area for further research in prenatal diagnosis and cancer management. The feasibility of obtaining the genome-wide methylome noninvasively has enabled the identification of differentially methylated regions that could be particularly useful in cancer screening, monitoring and the development of personalized therapeutic programs.
Even though numerous applications for noninvasive prenatal and cancer diagnosis have been demonstrated to be feasible, the high sequencing cost currently necessary for noninvasive testing is still a bottleneck for routine clinical implementation. Most applications mentioned in this review are still very expensive, except for prenatal aneuploidy screening tests, which have already been widely adopted. For example, the throughput of HiSeq is currently around 3.2 billion reads per flow cell, and one flow cell costs around US $40,000. For aneuploidy testing, it is possible to combine hundreds of samples in one flow cell, so the cost per sample is greatly reduced. However, other applications such as whole fetal genome assembly would require an entire flow cell for one sample, which would be too costly for routine use in clinical settings.
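The arithmetic behind this comparison is shown below using the throughput and price quoted above. The multiplexing level of 320 samples per flow cell is a hypothetical figure chosen for illustration, and the calculation covers sequencing reagents only.

```python
FLOW_CELL_READS = 3.2e9      # reads per flow cell, as quoted in the text
FLOW_CELL_COST_USD = 40_000  # cost per flow cell, as quoted in the text


def cost_per_sample(samples_per_flow_cell: int) -> float:
    """Sequencing cost per sample when a flow cell is shared equally."""
    return FLOW_CELL_COST_USD / samples_per_flow_cell


def reads_per_sample(samples_per_flow_cell: int) -> float:
    """Reads available to each sample when a flow cell is shared equally."""
    return FLOW_CELL_READS / samples_per_flow_cell


# Hypothetical multiplexing of 320 samples versus a single whole-genome sample.
for n in (320, 1):
    print(n, cost_per_sample(n), reads_per_sample(n))
```

At this hypothetical multiplexing level each sample receives 10 million reads at a cost of US $125, compared with US $40,000 when a single sample occupies the whole flow cell.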
As the sequencing costs continue to drop, we expect to see a proliferation of noninvasive testing offered at the clinic. The role of bioinformatics in noninvasive testing to efficiently analyze these data is thus of increasing importance in future research.
ACCEPTED MANUSCRIPT Reference:
AC
CE
PT
ED
MA
NU
SC RI
PT
1. Lo YMD, Corbetta N, Chamberlain PF, Rai V, Sargent IL, Redman CW, et al. Presence of fetal DNA in maternal plasma and serum. Lancet. 1997;350:485-7.
2. Chen XQ, Stroun M, Magnenat JL, Nicod LP, Kurt AM, Lyautey J, et al. Microsatellite alterations in plasma DNA of small cell lung cancer patients. Nature medicine. 1996;2:1033-5.
3. Lo YMD, Lun FM, Chan KCA, Tsui NB, Chong KC, Lau TK, et al. Digital pcr for the molecular detection of fetal chromosomal aneuploidy. Proceedings of the National Academy of Sciences of the United States of America. 2007;104:13116-21.
4. Chiu RWK, Akolekar R, Zheng YW, Leung TY, Sun H, Chan KC, et al. Non-invasive prenatal assessment of trisomy 21 by multiplexed maternal plasma DNA sequencing: Large scale validity study. Bmj. 2011;342:c7401.
5. Chen EZ, Chiu RWK, Sun H, Akolekar R, Chan KC, Leung TY, et al. Noninvasive prenatal diagnosis of fetal trisomy 18 and trisomy 13 by maternal plasma DNA sequencing. PloS one. 2011;6:e21791.
6. Ehrich M, Deciu C, Zwiefelhofer T, Tynan JA, Cagasan L, Tim R, et al. Noninvasive detection of fetal trisomy 21 by sequencing of DNA in maternal blood: A study in a clinical setting. American journal of obstetrics and gynecology. 2011;204:205 e1-11.
7. Newman AM, Bratman SV, To J, Wynne JF, Eclov NC, Modlin LA, et al. An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage. Nature medicine. 2014;20:548-54.
8. Chan KCA, Jiang P, Zheng YW, Liao GJ, Sun H, Wong J, et al. Cancer genome scanning in plasma: Detection of tumor-associated copy number aberrations, single-nucleotide variants, and tumoral heterogeneity by massively parallel sequencing. Clinical chemistry. 2013;59:211-24.
9. Chan KCA, Jiang P, Chan CW, Sun K, Wong J, Hui EP, et al. Noninvasive detection of cancer-associated genome-wide hypomethylation and copy number aberrations by plasma DNA bisulfite sequencing. Proceedings of the National Academy of Sciences of the United States of America. 2013;110:18761-8.
10. Bettegowda C, Sausen M, Leary RJ, Kinde I, Wang Y, Agrawal N, et al. Detection of circulating tumor DNA in early- and late-stage human malignancies. Science translational medicine. 2014;6:224ra24.
11. Forshew T, Murtaza M, Parkinson C, Gale D, Tsui DW, Kaper F, et al. Noninvasive identification and monitoring of cancer mutations by targeted deep sequencing of plasma DNA. Science translational medicine. 2012;4:136ra68.
12. Lo YMD, Tein MS, Lau TK, Haines CJ, Leung TN, Poon PM, et al. Quantitative analysis of fetal DNA in maternal plasma and serum: Implications for noninvasive prenatal diagnosis. American journal of human genetics. 1998;62:768-75.
13. Lo YMD, Hjelm NM, Fidler C, Sargent IL, Murphy MF, Chamberlain PF, et al. Prenatal diagnosis of fetal rhd status by molecular analysis of maternal plasma. The New England journal of medicine. 1998;339:1734-8.
14. Lo YMD, Zhang J, Leung TN, Lau TK, Chang AM, Hjelm NM. Rapid clearance of fetal DNA from maternal plasma. American journal of human genetics. 1999;64:218-24.
15. Lui YY, Chik KW, Chiu RWK, Ho CY, Lam CW, Lo YMD. Predominant hematopoietic origin of cell-free DNA in plasma and serum after sex-mismatched bone marrow transplantation. Clinical chemistry. 2002;48:421-7.
16. Su YH, Wang M, Brenner DE, Norton PA, Block TM. Detection of mutated k-ras DNA in urine, plasma, and serum of patients with colorectal carcinoma or adenomatous polyps. Annals of the New York Academy of Sciences. 2008;1137:197-206.
17. Shinozaki M, O'Day SJ, Kitago M, Amersi F, Kuo C, Kim J, et al. Utility of circulating b-raf DNA mutation in serum for monitoring melanoma patients receiving biochemotherapy. Clinical cancer research : an official journal of the American Association for Cancer Research. 2007;13:2068-74.
18. Wang S, An T, Wang J, Zhao J, Wang Z, Zhuo M, et al. Potential clinical significance of a plasma-based kras mutation analysis in patients with advanced non-small cell lung cancer. Clinical cancer research : an official journal of the American Association for Cancer Research. 2010;16:1324-30.
19. Rijnders RJ, van der Schoot CE, Bossers B, de Vroede MA, Christiaens GC. Fetal sex determination from maternal plasma in pregnancies at risk for congenital adrenal hyperplasia. Obstetrics and gynecology. 2001;98:374-8.
20. Geifman-Holtzman O, Grotegut CA, Gaughan JP. Diagnostic accuracy of noninvasive fetal rh genotyping from maternal blood--a meta-analysis. American journal of obstetrics and gynecology. 2006;195:1163-73.
21. Jung K, Fleischhacker M, Rabien A. Cell-free DNA in the blood as a solid tumor biomarker--a critical appraisal of the literature. Clinica chimica acta; international journal of clinical chemistry. 2010;411:1611-24.
22. Schwarzenbach H, Hoon DS, Pantel K. Cell-free nucleic acids as biomarkers in cancer patients. Nature reviews Cancer. 2011;11:426-37.
23. Lo YMD, Chiu RWK. Genomic analysis of fetal nucleic acids in maternal blood. Annual review of genomics and human genetics. 2012;13:285-306.
24. De Mattos-Arruda L, Cortes J, Santarpia L, Vivancos A, Tabernero J, Reis-Filho JS, et al. Circulating tumour cells and cell-free DNA as tools for managing breast cancer. Nature reviews Clinical oncology. 2013;10:377-89.
25. Bianchi DW. Circulating fetal DNA: Its origin and diagnostic potential-a review. Placenta. 2004;25 Suppl A:S93-S101.
26. Stroun M, Lyautey J, Lederrey C, Olson-Sand A, Anker P. About the possible origin and mechanism of circulating DNA apoptosis and active DNA release. Clinica chimica acta; international journal of clinical chemistry. 2001;313:139-42.
27. Alberry M, Maddocks D, Jones M, Abdel Hadi M, Abdel-Fattah S, Avent N, et al. Free fetal DNA in maternal plasma in anembryonic pregnancies: Confirmation that the origin is the trophoblast. Prenatal diagnosis. 2007;27:415-8.
28. Lo YMD, Chan KCA, Sun H, Chen EZ, Jiang P, Lun FM, et al. Maternal plasma DNA sequencing reveals the genome-wide genetic and mutational profile of the fetus. Science translational medicine. 2010;2:61ra91.
29. Fan HC, Blumenfeld YJ, Chitkara U, Hudgins L, Quake SR. Analysis of the size distributions of fetal and maternal cell-free DNA by paired-end sequencing. Clinical chemistry. 2010;56:1279-86.
30. Mouliere F, Robert B, Arnau Peyrotte E, Del Rio M, Ychou M, Molina F, et al. High fragmentation characterizes tumour-derived circulating DNA. PloS one. 2011;6:e23418.
31. Mouliere F, El Messaoudi S, Gongora C, Guedj AS, Robert B, Del Rio M, et al. Circulating cell-free DNA from colorectal cancer patients may reveal high kras or braf mutation load. Translational oncology. 2013;6:319-28.
32. Mouliere F, El Messaoudi S, Pang D, Dritschilo A, Thierry AR. Multi-marker analysis of circulating cell-free DNA toward personalized medicine for colorectal cancer. Molecular oncology. 2014;8:927-41.
33. Umetani N, Kim J, Hiramatsu S, Reber HA, Hines OJ, Bilchik AJ, et al. Increased integrity of free circulating DNA in sera of patients with colorectal or periampullary cancer: Direct quantitative pcr for alu repeats. Clinical chemistry. 2006;52:1062-9.
34. Umetani N, Giuliano AE, Hiramatsu SH, Amersi F, Nakagawa T, Martino S, et al. Prediction of breast tumor progression by integrity of free circulating DNA in serum. Journal of clinical oncology : official journal of the American Society of Clinical Oncology. 2006;24:4270-6.
35. Gao YJ, He YJ, Yang ZL, Shao HY, Zuo Y, Bai Y, et al. Increased integrity of circulating cell-free DNA in plasma of patients with acute leukemia. Clinical chemistry and laboratory medicine : CCLM / FESCC. 2010;48:1651-6.
36. Wang E, Batey A, Struble C, Musci T, Song K, Oliphant A. Gestational age and maternal weight effects on fetal cell-free DNA in maternal plasma. Prenatal diagnosis. 2013;33:662-6.
37. Smid M, Galbiati S, Vassallo A, Gambini D, Ferrari A, Viora E, et al. No evidence of fetal DNA persistence in maternal plasma after pregnancy. Human genetics. 2003;112:617-8.
38. Yu SC, Lee SW, Jiang P, Leung TY, Chan KCA, Chiu RWK, et al. High-resolution profiling of fetal DNA clearance from maternal plasma by massively parallel sequencing. Clinical chemistry. 2013;59:1228-37.
39. Diehl F, Schmidt K, Choti MA, Romans K, Goodman S, Li M, et al. Circulating mutant DNA to assess tumor dynamics. Nature medicine. 2008;14:985-90.
40. Lindgreen S. Adapterremoval: Easy cleaning of next-generation sequencing reads. BMC research notes. 2012;5:337.
41. Martin M. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal. 2011;17.
42. http://www.bioinformatics.babraham.ac.uk/projects/trim_galore/.
43. Chiu RWK, Chan KCA, Gao Y, Lau VY, Zheng W, Leung TY, et al. Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma. Proceedings of the National Academy of Sciences of the United States of America. 2008;105:20458-63.
44. Sparks AB, Wang ET, Struble CA, Barrett W, Stokowski R, McBride C, et al. Selective analysis of cell-free DNA in maternal blood for evaluation of fetal trisomy. Prenatal diagnosis. 2012;32:3-9.
45. Sehnert AJ, Rhees B, Comstock D, de Feo E, Heilek G, Burke J, et al. Optimal detection of fetal chromosomal abnormalities by massively parallel DNA sequencing of cell-free fetal DNA from maternal blood. Clinical chemistry. 2011;57:1042-9.
46. Smith TF, Waterman MS. Identification of common molecular subsequences. Journal of molecular biology. 1981;147:195-7.
47. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology. 1970;48:443-53.
48. Aronesty E. Ea-utils : Command-line tools for processing biological sequencing data; http://code.google.com/p/ea-utils. 2011.
49. Anders S, Pyl PT, Huber W. Htseq--a python framework to work with high-throughput sequencing data. Bioinformatics. 2014.
50. Bolger AM, Lohse M, Usadel B. Trimmomatic: A flexible trimmer for illumina sequence data. Bioinformatics. 2014;30:2114-20.
51. Benjamini Y, Speed TP. Summarizing and correcting the gc content bias in high-throughput sequencing. Nucleic acids research. 2012;40:e72.
52. Aird D, Ross MG, Chen WS, Danielsson M, Fennell T, Russ C, et al. Analyzing and minimizing pcr amplification bias in illumina sequencing libraries. Genome biology. 2011;12:R18.
53. Chu T, Bunce K, Hogge WA, Peters DG. A novel approach toward the challenge of accurately quantifying fetal DNA in maternal plasma. Prenatal diagnosis. 2010;30:1226-9.
54. Bellis MA, Hughes K, Hughes S, Ashton JR. Measuring paternal discrepancy and its public health consequences. Journal of epidemiology and community health. 2005;59:749-54.
55. Jiang P, Chan KCA, Liao GJ, Zheng YW, Leung TY, Chiu RWK, et al. Fetalquant: Deducing fractional fetal DNA concentration from massively parallel sequencing of DNA in maternal plasma. Bioinformatics. 2012;28:2883-90.
56. Yu SC, Chan KCA, Zheng YW, Jiang P, Liao GJ, Sun H, et al. Size-based molecular diagnostics using plasma DNA for noninvasive prenatal testing. Proceedings of the National Academy of Sciences of the United States of America. 2014;111:8583-8.
57. Chan RW, Jiang P, Peng X, Tam LS, Liao GJ, Li EK, et al. Plasma DNA aberrations in systemic lupus erythematosus revealed by genomic and methylomic sequencing. Proceedings of the National Academy of Sciences of the United States of America. 2014.
58. Jiang P, Chan CW, Chan KC, Cheng SH, Wong J, Wong VW, et al. Lengthening and shortening of plasma DNA in hepatocellular carcinoma patients. Proceedings of the National Academy of Sciences of the United States of America. 2015;112:E1317-25.
59. Norton ME, Brar H, Weiss J, Karimi A, Laurent LC, Caughey AB, et al. Non-invasive chromosomal evaluation (nice) study: Results of a multicenter prospective cohort study for detection of fetal trisomy 21 and trisomy 18. American journal of obstetrics and gynecology. 2012;207:137 e1-8.
60. Chiu RWK, Sun H, Akolekar R, Clouser C, Lee C, McKernan K, et al. Maternal plasma DNA analysis with massively parallel sequencing by ligation for noninvasive prenatal diagnosis of trisomy 21. Clinical chemistry. 2010;56:459-63.
61. Chan KCA, Zhang J, Hui AB, Wong N, Lau TK, Leung TN, et al. Size distributions of maternal and fetal DNA in maternal plasma. Clinical chemistry. 2004;50:88-92.
62. Liao GJ, Chan KCA, Jiang P, Sun H, Leung TY, Chiu RWK, et al. Noninvasive prenatal diagnosis of fetal trisomy 21 by allelic ratio analysis using targeted massively parallel sequencing of maternal plasma DNA. PloS one. 2012;7:e38154.
63. Zimmermann B, Hill M, Gemelos G, Demko Z, Banjevic M, Baner J, et al. Noninvasive prenatal aneuploidy testing of chromosomes 13, 18, 21, x, and y, using targeted sequencing of polymorphic loci. Prenatal diagnosis. 2012;32:1233-41.
64. Smits J, Monden C. Twinning across the developing world. PloS one. 2011;6:e25239.
65. Chauhan SP, Scardo JA, Hayes E, Abuhamad AZ, Berghella V. Twins: Prevalence, problems, and preterm births. American journal of obstetrics and gynecology. 2010;203:305-15.
66. Audibert F, Gagnon A, Genetics Committee of the Society of Obstetricians and Gynaecologists of Canada, Prenatal Diagnosis Committee of the Canadian College of Medical Geneticists. Prenatal screening for and diagnosis of aneuploidy in twin pregnancies. Journal of obstetrics and gynaecology Canada : JOGC = Journal d'obstetrique et gynecologie du Canada : JOGC. 2011;33:754-67.
67. Huang X, Zheng J, Chen M, Zhao Y, Zhang C, Liu L, et al. Noninvasive prenatal testing of trisomies 21 and 18 by massively parallel sequencing of maternal plasma DNA in twin pregnancies. Prenatal diagnosis. 2014;34:335-40.
68. Qu JZ, Leung TY, Jiang P, Liao GJ, Cheng YK, Sun H, et al. Noninvasive prenatal determination of twin zygosity by maternal plasma DNA analysis. Clinical chemistry. 2013;59:427-35.
69. Leung TY, Qu JZ, Liao GJ, Jiang P, Cheng YK, Chan KCA, et al. Noninvasive twin zygosity assessment and aneuploidy detection by maternal plasma DNA sequencing. Prenatal diagnosis. 2013;33:675-81.
70. Fan HC, Gu W, Wang J, Blumenfeld YJ, El-Sayed YY, Quake SR. Non-invasive prenatal measurement of the fetal genome. Nature. 2012;487:320-4.
71. Kitzman JO, Snyder MW, Ventura M, Lewis AP, Qiu R, Simmons LE, et al. Noninvasive whole-genome sequencing of a human fetus. Science translational medicine. 2012;4:137ra76.
72. Kitzman JO, Mackenzie AP, Adey A, Hiatt JB, Patwardhan RP, Sudmant PH, et al. Haplotype-resolved genome sequencing of a gujarati indian individual. Nature biotechnology. 2011;29:59-63.
73. Castells A, Puig P, Mora J, Boadas J, Boix L, Urgell E, et al. K-ras mutations in DNA extracted from the plasma of patients with pancreatic carcinoma: Diagnostic utility and prognostic significance. Journal of clinical oncology : official journal of the American Society of Clinical Oncology. 1999;17:578-84.
74. Ryan BM, Lefort F, McManus R, Daly J, Keeling PW, Weir DG, et al. A prospective study of circulating mutant kras2 in the serum of patients with colorectal neoplasia: Strong prognostic indicator in postoperative follow up. Gut. 2003;52:101-8.
75. Fridlyand J, Snijders AM, Ylstra B, Li H, Olshen A, Segraves R, et al. Breast tumor copy number aberration phenotypes and genomic instability. BMC cancer. 2006;6:96.
76. Albertson DG, Collins C, McCormick F, Gray JW. Chromosome aberrations in solid tumors. Nature genetics. 2003;34:369-76.
77. Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, et al. The landscape of somatic copy-number alteration across human cancers. Nature. 2010;463:899-905.
78. Zack TI, Schumacher SE, Carter SL, Cherniack AD, Saksena G, Tabak B, et al. Pan-cancer patterns of somatic copy number alteration. Nature genetics. 2013;45:1134-40.
79. Alitalo K, Schwab M, Lin CC, Varmus HE, Bishop JM. Homogeneously staining chromosomal regions contain amplified copies of an abundantly expressed cellular oncogene (c-myc) in malignant neuroendocrine cells from a human colon carcinoma. Proceedings of the National Academy of Sciences of the United States of America. 1983;80:1707-11.
80. Hinds PW, Dowdy SF, Eaton EN, Arnold A, Weinberg RA. Function of a human cyclin gene as an oncogene. Proceedings of the National Academy of Sciences of the United States of America. 1994;91:709-13.
81. Heitzer E, Auer M, Hoffmann EM, Pichler M, Gasch C, Ulz P, et al. Establishment of tumor-specific copy number alterations from plasma DNA of patients with cancer. International journal of cancer Journal international du cancer. 2013;133:346-56.
82. Heitzer E, Ulz P, Belic J, Gutschi S, Quehenberger F, Fischereder K, et al. Tumor-associated copy number changes in the circulation of patients with prostate cancer identified through whole-genome sequencing. Genome medicine. 2013;5:30.
83. Leary RJ, Sausen M, Kinde I, Papadopoulos N, Carpten JD, Craig D, et al. Detection of chromosomal alterations in the circulation of cancer patients with whole-genome sequencing. Science translational medicine. 2012;4:162ra54.
84. Sawyers CL. Chronic myeloid leukemia. The New England journal of medicine. 1999;340:1330-40.
85. Warrell RP, Jr. Retinoid resistance in acute promyelocytic leukemia: New mechanisms, strategies, and implications. Blood. 1993;82:1949-53.
86. Szczepanski T, van der Velden VH, Raff T, Jacobs DC, van Wering ER, Bruggemann M, et al. Comparative analysis of t-cell receptor gene rearrangements at diagnosis and relapse of t-cell acute lymphoblastic leukemia (t-all) shows high stability of clonal markers for monitoring of minimal residual disease and reveals the occurrence of second t-all. Leukemia. 2003;17:2149-56.
87. Korsmeyer SJ, Arnold A, Bakhshi A, Ravetch JV, Siebenlist U, Hieter PA, et al. Immunoglobulin gene rearrangement and cell surface antigen expression in acute lymphocytic leukemias of t cell and b cell precursor origins. The Journal of clinical investigation. 1983;71:301-13.
88. Leary RJ, Kinde I, Diehl F, Schmidt K, Clouser C, Duncan C, et al. Development of personalized tumor biomarkers using massively parallel sequencing. Science translational medicine. 2010;2:20ra14.
89. Tomizawa S, Sasaki H. Genomic imprinting and its relevance to congenital disease, infertility, molar pregnancy and induced pluripotent stem cell. Journal of human genetics. 2012;57:84-91.
90. Banister CE, Koestler DC, Maccani MA, Padbury JF, Houseman EA, Marsit CJ. Infant growth restriction is associated with distinct patterns of DNA methylation in human placentas. Epigenetics : official journal of the DNA Methylation Society. 2011;6:920-7.
91. Esteller M, Herman JG. Cancer as an epigenetic disease: DNA methylation and chromatin alterations in human tumours. The Journal of pathology. 2002;196:1-7.
92. Egger G, Liang G, Aparicio A, Jones PA. Epigenetics in human disease and prospects for epigenetic therapy. Nature. 2004;429:457-63.
93. Chim SS, Tong YK, Chiu RWK, Lau TK, Leung TN, Chan LY, et al. Detection of the placental epigenetic signature of the maspin gene in maternal plasma. Proceedings of the National Academy of Sciences of the United States of America. 2005;102:14753-8.
94. Tsui DW, Lam YM, Lee WS, Leung TY, Lau TK, Lau ET, et al. Systematic identification of placental epigenetic signatures for the noninvasive prenatal detection of edwards syndrome. PloS one. 2010;5:e15069.
95. Chim SS, Jin S, Lee TY, Lun FM, Lee WS, Chan LY, et al. Systematic search for placental DNA-methylation markers on chromosome 21: Toward a maternal plasma-based epigenetic test for fetal trisomy 21. Clinical chemistry. 2008;54:500-11.
96. Chan KCA, Lai PB, Mok TS, Chan HL, Ding C, Yeung SW, et al. Quantitative analysis of circulating methylated DNA as a biomarker for hepatocellular carcinoma. Clinical chemistry. 2008;54:1528-36.
97. An Q, Liu Y, Gao Y, Huang J, Fong X, Li L, et al. Detection of p16 hypermethylation in circulating plasma DNA of non-small cell lung cancer patients. Cancer letters. 2002;188:109-14.
98. Valenzuela MT, Galisteo R, Zuluaga A, Villalobos M, Nunez MI, Oliver FJ, et al. Assessing the use of p16(ink4a) promoter gene methylation in serum for detection of bladder cancer. European urology. 2002;42:622-8; discussion 8-30.
99. Lister R, O'Malley RC, Tonti-Filippini J, Gregory BD, Berry CC, Millar AH, et al. Highly integrated single-base resolution maps of the epigenome in arabidopsis. Cell. 2008;133:523-36.
100. Cokus SJ, Feng S, Zhang X, Chen Z, Merriman B, Haudenschild CD, et al. Shotgun bisulphite sequencing of the arabidopsis genome reveals DNA methylation patterning. Nature. 2008;452:215-9.
101. Chen PY, Cokus SJ, Pellegrini M. Bs seeker: Precise mapping for bisulfite sequencing. BMC bioinformatics. 2010;11:203.
102. Lim JQ, Tennakoon C, Li G, Wong E, Ruan Y, Wei CL, et al. Batmeth: Improved mapper for bisulfite sequencing reads on DNA methylation. Genome biology. 2012;13:R82.
103. Xi Y, Li W. Bsmap: Whole genome bisulfite sequence mapping program. BMC bioinformatics. 2009;10:232.
104. Krueger F, Andrews SR. Bismark: A flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics. 2011;27:1571-2.
105. Hansen KD, Langmead B, Irizarry RA. Bsmooth: From whole genome bisulfite sequencing reads to differentially methylated regions. Genome biology. 2012;13:R83.
106. Jiang P, Sun K, Lun FM, Guo AM, Wang H, Chan KCA, et al. Methy-pipe: An integrated bioinformatics pipeline for whole genome bisulfite sequencing data analysis. PloS one. 2014;9:e100360.
107. Krueger F, Kreck B, Franke A, Andrews SR. DNA methylome analysis using short bisulfite sequencing data. Nature methods. 2012;9:145-51.
108. Bock C. Analysing and interpreting DNA methylation data. Nature reviews Genetics. 2012;13:705-19.
109. Lun FM, Chiu RWK, Sun K, Leung TY, Jiang P, Chan KCA, et al. Noninvasive prenatal methylomic analysis by genomewide bisulfite sequencing of maternal plasma DNA. Clinical chemistry. 2013;59:1583-94.
110. Palomaki GE, Kloza EM, Lambert-Messerlian GM, Haddow JE, Neveux LM, Ehrich M, et al. DNA sequencing of maternal plasma to detect down syndrome: An international clinical validation study. Genetics in medicine : official journal of the American College of Medical Genetics. 2011;13:913-20.
FIGURES:
Figure 1. Summary of bioinformatics analysis and key clinical applications of circulating cfDNA.
Figure 2. Schematic illustration of the calculation of fractional fetal DNA concentration in maternal plasma with the SNP-based method. Informative SNPs are identified at loci where the maternal (AA) and paternal (BB) genotypes are both homozygous, but for different alleles. The resulting fetal genotype is an obligate heterozygote (AB). In maternal plasma, the majority of the cfDNA molecules mapped to informative loci carry the shared allele (black). The fetal-specific allele (red) exists at a low frequency. The fractional fetal DNA concentration can be directly deduced as shown.
Figure 3. SNP categories for RHDO analysis.
Figure 4. Schematic illustration of the RHDO approach to deduce the maternal inheritance of the fetus noninvasively. Maternal haplotypes (Hap I and Hap II) can be deduced by the use of trio-based genotype information. Allelic counts in plasma of type α (or type β) are accumulated for each maternal haplotype. The accumulated counts are then subjected to a sequential probability ratio test (SPRT) to statistically determine whether Hap I is over-represented in maternal plasma.
Figure 5. Schematic illustration of the haplotype counting approach to deduce the maternal inheritance of the fetus. Maternal haplotypes are first determined by applying a direct deterministic phasing approach to maternal blood cells. Heterozygous sites are identified and the counts of the alleles specific to each haplotype are determined. The relative representation of the two haplotypes in maternal plasma is subjected to a Poisson-based z-score test to determine the maternal inheritance of the fetus.
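The haplotype counting test in Figure 5 can be illustrated with a minimal Python sketch. The caption does not give the exact statistic, so the form below is an assumption: under a Poisson model the variance of each count equals its mean, which gives z = (n1 - n2) / sqrt(n1 + n2) for the null hypothesis of equal representation. The function names and the cut-off of 3 are hypothetical and chosen only for illustration.

```python
import math


def haplotype_z_score(hap1_count: int, hap2_count: int) -> float:
    """Poisson-style z-score for the difference between two allele counts.

    Assumed form (not taken from the source): under the null hypothesis the
    two haplotypes are equally represented, and the variance of the
    difference of two Poisson counts is estimated by their sum.
    """
    total = hap1_count + hap2_count
    if total == 0:
        raise ValueError("no haplotype-specific reads were counted")
    return (hap1_count - hap2_count) / math.sqrt(total)


def call_inherited_haplotype(hap1_count: int, hap2_count: int,
                             threshold: float = 3.0) -> str:
    """Classify which maternal haplotype is over-represented in plasma.

    The threshold is an illustrative assumption, not a published cut-off.
    """
    z = haplotype_z_score(hap1_count, hap2_count)
    if z > threshold:
        return "Hap I"
    if z < -threshold:
        return "Hap II"
    return "unclassified"


if __name__ == "__main__":
    # Hypothetical counts of reads carrying Hap I- and Hap II-specific alleles.
    print(call_inherited_haplotype(550, 450))
```

Summing counts over many heterozygous sites within a haplotype block is what gives the test its power; a single site rarely carries enough reads to separate the two haplotypes.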
Figure 6. Schematic illustration of the hidden Markov model (HMM) approach to deduce the maternal inheritance of the fetus. Classically, an HMM has three components: latent states, transition probabilities [denoted by P(T)] and emission probabilities [denoted by P(E)]. The three latent states (3 circles) are the first maternal haplotype (Hap I) transmitted to the fetus, the second maternal haplotype (Hap II) transmitted to the fetus, and an unknown state in which the maternal haplotype inherited by the fetus is not known. The transition probabilities (solid black arrows) are held at 10^-5, close to the human recombination rate. Emission probabilities (dashed black arrows) are modeled on a binomial distribution.
Figure 7. Two major methods developed to calculate the fractional tumor DNA concentration: the mutant allele approach and the loss of heterozygosity approach. (A) Mutant allele approach. In cancer patients, cells can be classified into normal cells carrying only wild-type alleles and tumor cells harboring additional mutant alleles. The tumor load in the patient's plasma is reflected by the fraction of mutant alleles, which can be translated into the fractional tumor DNA concentration. (B) Loss of heterozygosity approach. Chromosomal arms or sub-chromosomal regions are frequently found to be deleted in cancer cells. These deletions preferentially involve only one of the two homologous chromosomes, thus resulting in loss of heterozygosity (LOH). In the patient's plasma, LOH in tumor cells would lead to a decrease in the number of deleted alleles compared to the non-deleted alleles. The allelic imbalance between non-deleted and deleted alleles can be translated into the fractional tumor DNA concentration.
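The fractional concentration calculations described for Figures 2 and 7 can be sketched in a few lines of Python. The formulas are not spelled out in the captions, so the forms below are assumptions derived from the stated allele models: a fetal-specific or heterozygous mutant allele makes up half of the minor-source DNA at a copy-neutral locus, and in an LOH region the deleted allele is contributed by non-tumor DNA only. All function and parameter names are hypothetical.

```python
def fetal_fraction(fetal_specific: int, shared: int) -> float:
    """Fractional fetal DNA concentration at informative SNPs (Figure 2).

    Assumes mother AA, fetus AB: the fetal-specific allele accounts for half
    of the fetal DNA, so f = 2p / (p + q), with p fetal-specific reads and
    q shared-allele reads.
    """
    total = fetal_specific + shared
    if total == 0:
        raise ValueError("no reads at informative loci")
    return 2.0 * fetal_specific / total


def tumor_fraction_mutant(mutant: int, wild_type: int) -> float:
    """Fractional tumor DNA concentration from mutant reads (Figure 7A).

    Assumes a heterozygous somatic mutation in a copy-neutral region.
    """
    total = mutant + wild_type
    if total == 0:
        raise ValueError("no reads at the mutation site")
    return 2.0 * mutant / total


def tumor_fraction_loh(non_deleted: int, deleted: int) -> float:
    """Fractional tumor DNA concentration from allelic imbalance (Figure 7B).

    Assumes single-copy loss in all tumor cells: the deleted allele comes
    only from non-tumor DNA, so f = (N_nondel - N_del) / N_nondel.
    """
    if non_deleted == 0:
        raise ValueError("no reads for the non-deleted allele")
    return (non_deleted - deleted) / non_deleted


if __name__ == "__main__":
    # Hypothetical read counts pooled across loci.
    print(fetal_fraction(50, 950))
    print(tumor_fraction_mutant(20, 380))
    print(tumor_fraction_loh(1000, 800))
```

In practice the counts would be pooled across many loci before applying any of these ratios, since per-locus estimates at typical sequencing depths are dominated by sampling noise.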
TABLES:
Table 1. Biological properties of fetal-derived cfDNA and tumor-derived cfDNA.
Table 2. Feature comparison of different adapter trimming tools.
Table 1. Biological properties of fetal-derived cfDNA and tumor-derived cfDNA

| Property | Fetal-derived cfDNA | Tumor-derived cfDNA |
|---|---|---|
| Concentration | Positively correlates with gestational age [36] | Positively correlates with tumor size [7, 9, 10] |
| Size | Shorter than background maternal-derived DNA [28, 29] | Shorter [30-32] or longer [33-35] in cancer patients |
| Clearance | Rapid phase half-life: 1 hour [38] | Half-life: 2 hours [39] |
Table 2. Feature comparison of different adapter trimming tools.

| Tool | Version | Language |
|---|---|---|
| Trim Galore! [42] | 0.3.7 | Perl |
| HTSeq [49] | 0.6.1 | Python |
| AdapterRemoval [40] | 1.5.4 | C++ |
| FastqMcf [48] | 1.04.636 | C++ |
| Trimmomatic [50] | 0.32 | Java |
| Cutadapt [41] | 1.6 | Python and C |

Further columns (Yes/No per tool): Able to trim low-quality nucleotides; Directly processing gzip-format file; Able to identify adapter sequences specified by user; Paired-end reads support. The extracted entries for these four columns comprise 21 "Yes" and 3 "No"; their assignment to individual tools is not recoverable from the extracted layout.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7