Journal of Bioinformatics and Computational Biology Vol. 13, No. 2 (2015) 1550001 (16 pages) # .c Imperial College Press DOI: 10.1142/S0219720015500018

An e®ective di®erential expression analysis of deep-sequencing data based on the Poisson log-normal model

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY LIBRARY on 01/24/15. For personal use only.

Jun Wu*,§, Xiaodong Zhao†,¶, Zongli Lin‡,|| and Zhifeng Shao†,** *Department

of Automation, Shanghai Jiao Tong University and Key Laboratory of System Control and Information Processing of Ministry of Education 800 Dongchuan Road, Shanghai 200240, P. R. China †

School of Biomedical Engineering Shanghai Jiao Tong University 800 Dongchuan Road, Shanghai 200240, P. R. China ‡

Charles L. Brown Department of Electrical and Computer Engineering University of Virginia, Charlottesville, VA 22904-4743, USA § [email protected][email protected] ||[email protected] **[email protected] Received 17 February 2014 Revised 14 August 2014 Accepted 22 September 2014 Published 11 November 2014

Tremendous amount of deep-sequencing data has unprecedentedly improved our understanding in biomedical science by digital sequence reads. To mine useful information from such data, a proper distribution for modeling all range of the count data and accurate parameter estimation are required. In this paper, we propose a method, called \DEPln," for di®erential expression analysis based on the Poisson log-normal (PLN) distribution with an accurate parameter estimation strategy, which aims to overcome the inconvenience in the mathematical analysis of the traditional PLN distribution. The performance of our proposed method is validated by both synthetic and real data. Experimental results indicate that our method outperforms the traditional methods in terms of the discrimination ability and results in a good tradeo® between the recall rate and the precision. Thus, our work provides a new approach for gene expression analysis and has strong potential in deep-sequencing based research. Keywords: Deep-sequencing data; di®erential expression analysis; Poisson log-normal.

1. Introduction With the rapid development of next generation sequencing technologies, an enormous amount of deep sequencing data has been generated in the form of short sequence reads under various experimental conditions. These sequencing data have been used in a wide variety of biomedical ¯elds, such as cancer, neuron science, and 1550001-1

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY LIBRARY on 01/24/15. For personal use only.

J. Wu et al.

stem cell biology.1–3 A main objective in these applications is to identify genes that are di®erentially expressed (DE) between di®erent stages (e.g. cancer tissue versus normal tissue). Identi¯cation of these DE genes facilitates our understanding of how alteration of gene expression pattern contributes to the corresponding biological processes under either physiological or pathological conditions. Usually, gene expression is measured by the number of reads mapped to the reference genome. However, these count data may be disturbed by some random factors, such as technical and biological variations. To reliably detect the genes with signi¯cantly di®erential expressions between two biological conditions, a proper statistical model for the count data is essential. Currently, mainly three discrete statistical distributions, are used to analyze such counts: the Poisson distribution,4–6 the Negative Binomial (NB) distribution,7–12 and the Poisson log-normal (PLN) distribution.13–15 The Poisson distribution has a single parameter , which determines the variance and other properties of the distribution. However, as the count data often exhibits an over-dispersion appearance, the restrictive mean–variance relationship in a Poisson distribution could not adequately accommodate the variability that is present in the deep sequencing data.16 To overcome this problem, a Poisson-based NB distribution has been developed. The NB distribution could address over-dispersion by using a prior distribution to the Poisson parameter . Many di®erential expression analysis methods, including baySeq,7 DESeq,8 DEGSeq11 and edgeR,9 are based on the NB distribution. Moreover, another excellent method, the cu®di® method,12 which is based on the mixing NB distribution (beta NB distribution) has also been used to perform the di®erential expression analysis. However, the NB distribution was reported to generate a good ¯t only for low count data.17 Compared to the NB distribution, the PLN distribution provides a better ¯t to both low and high count data. The PLN distribution can also describe the over-dispersion through introducing a log-normal distribution as the prior information of . The PLN distribution has two parameters, its mean  and its variance  2 . The PLN model has been utilized to characterize the deepCAGE count data,13 and to infer network with RNA sequencing data.15 However, as its probability distribution function cannot be expressed as in analytic form, the computational e±ciency and the accuracy of the parameter estimation of the PLN distribution function cannot be ensured. In this study, we develop an e®ective algorithm for di®erential expression analysis based on the PLN distribution, which we refer to as the DEPln method. First, the TMM method18 is employed for the normalization. Then, a trimming stage is applied to ensure the accuracy and robustness of the estimationn of the variance , and a simple mean-log approach is followed to estimate the mean . At last, the discrimination of DE genes is achieved through the z-test. Four other commonly used methods are used for performance comparison. They are the the edgeR method,9 the DESeq method,8 the DEGSeq method11 and the baySeq method.7 Compared to these four methods, our DEPln method discriminates the DE genes more accurately and results in a better tradeo® between the recall and the precision. 1550001-2

An e®ective di®erential expression analysis

2. Method 2.1. The PLN model

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY LIBRARY on 01/24/15. For personal use only.

Assume that the count n of a particular gene can be modeled by a PLN distribution, n  PLN ð;  2 Þ, with two parameters, the mean  and the variance  2 . The PLN process can be divided into two stages. First, the variable n follows a Poisson distribution with parameter , that is, n  PoisðÞ. For presentational clarity, we introduce a pseudo-value x ¼ lnðÞ, with which the probability distribution of n can be written as follow: pðnjxÞ ¼

ðe x Þ n e x e : n!

ð1Þ

Second, the parameter  follows a log-normal distribution, which also means that x follows a normal distribution, x  N ð;  2 Þ,   1 ðx  Þ 2 pðxj; Þ ¼ pffiffiffiffiffiffiffiffiffiffiffi exp  : ð2Þ 2 2 2 2 To avoid confusion, we will call  and  as the Gaussian mean and the Poisson mean, respectively. Integrating pðnjxÞpðxj; Þ over the range of x, we obtain the probability distribution of n, with the Gaussian mean  and the variance  2 , as follows: Z þ1 pðnjxÞpðxj; Þdx pðnj; Þ ¼ 1   Z þ1 1 ðx  Þ 2 ðe x Þ n e x pffiffiffiffiffiffiffiffiffiffiffi exp  e dx: ð3Þ ¼ 2 2 n! 2 2 1

2.2. Parameter estimation We ¯rst estimate the variance of the distribution,  2 . Suppose there are two replications, S1 and S2, for a condition. Assume that the variance for each gene is the same. For the comparability between S1 and S2, the counts in both replications should be preprocessed through a normalization process such as the TMM (trimmed mean of M-values) method.18 Let a particular gene k be sequenced nk times in sample S1 and mk times in sample S2. Then, Eq. (3) can be approximated as13 n o 2 exp  ðlnðnÞÞ 1 2 2ð þ n Þ pðnj; Þ ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ð4Þ   : n 2  2 þ n1 Assume that both mk and nk are greater than zero. Then the joint conditional probability pðmk ; nk jÞ :¼ pðmk  nk jÞ is given by, ( ) 1 ðlnðm 0k Þ  lnðn 0k ÞÞ 2 pðmk ; nk jÞ / qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi exp  ; ð5Þ 2ð2 2 þ m1k þ n1k Þ 2 2 þ 1 þ 1 mk

nk

1550001-3

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY LIBRARY on 01/24/15. For personal use only.

J. Wu et al.

where m 0k and n 0k are the normalized counts of mk and nk , respectively, and the variance  2 is to be obtained through the maximum likelihood estimation (MLE) method over all genes for which both mk and nk are greater than zero. We note that this joint conditional probability has been obtained through several steps of approximation, which introduce errors. To reduce such errors, we select a subset in which the observed count leads to the least approximation error. To do this, we consider the approximation in each step. First the Poisson distribution function in Eq. (1) was approximated by a Gaussian distribution function. By Taylor series expansion, the Gaussian distribution function can be approximated as a Gaussianlike function, with which the integration in (3) can be obtained in a closed form. The details are shown as follows:

In above approximations, the ¯rst approximation only requires that the expected value e x is large enough. The second and third approximations both require the condition that lnðnÞ ! x. When n is large enough, both the values of the original form and the approximated form of pðnjxÞ will be small for any x. Thus, for a su±ciently large value of n, the di®erence between them will be su±ciently small and can be safely ignored. The absolute expression level A mentioned in Ref. 18 will be used as the measure for the count in both samples. For gene k, the absolute expression level Ak of samples S1 and S2 is de¯ned as, Ak ¼

lnðm 0k Þ þ lnðn 0k Þ : 2

ð6Þ

A gene k will be trimmed o® if Ak is below a certain given percentile, 30% in our study, among all genes. After trimming the count data, we will estimate the parameter  through Eq. (5) by the MLE method. In the situation of several replications, we can estimate the standard variance for every pair of replications and set the average of the resulting estimates as the ¯nal estimate. The accuracy and robustness of our method will be shown in the results section later in the paper. We next estimate the Gaussian mean of the distribution, . Suppose that, for one condition, there are s replications. Let ni , i ¼ 1; 2; . . . ; s, denote the count of a particular gene in replication i. As the Poisson expected value  follows the 1550001-4

An e®ective di®erential expression analysis

log-normal distribution,   ln N ð; Þ, the parameter  can be estimated by the mean-log method as follows19: ^ 1 ^ ¼ lnðE½Þ ^ 2;  2

ð7Þ

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY LIBRARY on 01/24/15. For personal use only.

where  ^ has been estimated above, and E½ is approximated simply by averaging the observation count. Obviously, this approach has much lower complexity than the commonly used expectation maximization (EM) method20 and MLE method.21 The e®ectiveness of this simple approach will also be seen through comparing with the ME and MLE methods in the results section. 2.3. Di®erential analysis based on the PLN model Suppose that we have s1 replications for Condition A and s2 replications for Condition B. Let the count number under Condition A for gene i in replication j be mi;j . Similarly, let the count number under Condition B for gene i in replication j be ni;k . The aim of di®erential expression analysis is to ¯nd the genes which express significantly di®erentially under the two conditions. As the variance has been estimated and can be considered as known, the two-sized z-test will be more suitable than the commonly used t-test in testing the null hypothesis that i ¼  i , where i is the Gaussian mean of gene i under Condition A and  i is the Gaussian mean of the same ^ , we replace the varigene under Condition B. Considering the estimated error of  ance under the two conditions as follows:  ^1 ; ^ 2i   ^ ¼ ^ 22 þ 22 ; ^ i

S1;i ¼  ^ 21 þ S2;i

ð8Þ

where  ^ 21 and  ^ 22 are the estimated Gaussian variances under Conditions A and B, ^ i , ^ i ,  respectively. Then, with the estimated  ^ 21 , and  ^ 22 , the z-statistic can be de¯ned as ^ i  ^ i  z ¼ qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi : S1;i S2;i þ s1 s2

ð9Þ

Finally, the p-value of the z-test is computed for every gene. Those genes with p-value less than a preset threshold will be considered to be DE. The analysis results will be shown in the result section with both the synthetic data and real data sets. 3. Results and Discussions 3.1. Synthetic data test 3.1.1. Simulation details The simulation model is similar to what was proposed in Ref. 18. Suppose that N genes and two conditions (Conditions A and B) are considered for the di®erential 1550001-5

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY LIBRARY on 01/24/15. For personal use only.

J. Wu et al.

analysis. For each condition, there are three replications. The synthetic data are randomly generated by the R generic function rpoilog,22,23 which is a random generator for the PLN distribution with parameters  and . Each value of the parameter  is obtained from a given empirical distribution, such as the power-law distribution,24 the exponential distribution and the distribution of a real data set. The parameter  re°ects the unbias noise level. In the di®erential expression analysis test, we suppose that the values of  in two conditions are the same. Besides these two parameters, there are three other parameters for simulating the synthetic data, which are the percentage of DE gene, PD , the percentage of genes which express higher in Condition A than in Condition B, PA , and the change fold which controls the level of di®erential expression. 3.1.2. Performance of parameter estimation We ¯rst discuss the necessity of the trimming stage and the estimation accuracy of the resulting trimmed method. In order to illustrate the necessity of the trimming stage, we design two types of testing count data, one of which owns a smaller proportion of low count data than the other one. In this paper, the power-law distribution is selected as the empirical distribution of the exponent of each gene mean , pðe  Þ ¼ C0 ðe  Þ  , where the constant C0 is a scaling factor and the parameter  > 0 is used to control the distribution shape. By varying the value of the parameter , we can regulate the proportion of low count data. A smaller value of  means a smaller proportion of the low counts. For example, there will be 54% and 96% of the counts less than 10 when the value of  is set to be 1.25 and 2.25, respectively. The estimated accuracy and e®ectiveness of our proposed trimmed method is validated through comparison with the performance of the untrimmed method in Ref. 13. Eight preset values of  are selected as 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, and 0.8. Figure 1 shows the results in the form of boxplots. The top two plots are the results of the untrimmed method and the trimmed method with  ¼ 1:25. The bottom two plots are the results of the untrimmed method and the trimmed method with  ¼ 2:25. Comparing the four plots, we can see that the low count data may reduce the estimated accuracy. It also re°ects the necessity of the trimming stage. Through these comparisons, the accuracy and robustness of our method is demonstrated. We have also used other distributions, such as exponential distribution and the distribution of a real data set, as the empirical distribution, and our method gives similar good results. Next, we verify the feasibility of the proposed simple Gaussian mean  estimation procedure. We preset the real value of  with the uniform distribution in range ½0; 8. For each selected value of , we generate three counts, which represent the observations in three replications with function rpoilog in R project.22,23 The standard deviation of the log-normal distribution is set as 0.3. We illustrate the performance of our mean-log method by comparing with the commonly used EM and MLE method. The three plots in Fig. 2 show the comparison of the true value with the values estimated by the three methods, respectively. We can see that the 1550001-6

0.8 0.6 0.4

Estimated value

0.6





0.0

0.2

0.4

0.1

0.3

0.5

0.7

0.1

0.3

True value

0.7

(b)

0.4



0.6



0.2

0.2





0.4

● ●

Estimated value

0.6

0.8

0.8

(a)

Estimated value

0.5 True value

0.0



0.0

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY LIBRARY on 01/24/15. For personal use only.

0.0

0.2

Estimated value

0.8

An e®ective di®erential expression analysis

0.1

0.3

0.5

0.7

0.1

True value

0.3

0.5

0.7

True value

(c)

(d)

Fig. 1. The boxplots of the estimated standard deviation with the trimmed and untrimmed methods. Plot A: The boxplots for the untrimmed method with  ¼ 1:25, showing a little di®erence between the estimated value and the true; Plot B: The trimmed method with  ¼ 1:25, showing excellent estimations; Plot C: The untrimmed method with  ¼ 2:25, showing a large estimation error; Plot D: The trimmed method with  ¼ 2:25, showing excellent estimations.

simple mean-log method can provide an estimation as good as or even better than the other two methods when the replication number is small. This means that the mean-log method can be used to estimating the parameter  instead of the EM method or the MLE method. In the ¯gure, we also observe that the estimation error with a small count is larger than the estimation error with a large one. This observation should be considered in the signi¯cance testing step. Additionally, the computing time of our mean-log method is drastically reduced relative to the other two methods. For example, for the simulated data with three samples and 10,000 genes, the computing times of the mean-log, EM, MLE methods are 0.49 s, 30.56 s, and 208.75 s, respectively. 1550001-7

6 4 0

2

Estimated value

4 2 0

2

4

6

8

0

2

4

True value

6

8

True value

(b)

0 −5 −10

Estimated value

5

(a)

−15

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY LIBRARY on 01/24/15. For personal use only.

0

Estimated value

6

8

8

J. Wu et al.

0

2

4

6

8

True value

(c) Fig. 2. The smoothing scatter plots of the estimated values versus the true value. Plot A: The value estimated by the simple mean-log method versus the true value; Plot B: The value estimated by the EM method versus the true value; Plot C: The value estimated by the MLE method versus the true value.

3.1.3. Di®erential expression analysis test The aim of di®erential expression analysis is to detect the genes with signi¯cantly di®erential expression between the conditions. The di®erential expression analysis can be seen as a binary classi¯er system which divides the total genes into two classes. Using the synthetic data, a receiver operating characteristic (ROC), or simply an ROC curve, is used as a measure of the overall discriminative performance of a method. Simply put, a larger value of the area under the ROC means a higher discriminatory power. In this paper, four other methods, the DESeq method, the edgeR method, the DEGSeq method, and the baySeq method, are used for comparison. The parameters of these methods are set to their default values. Three 1550001-8

An e®ective di®erential expression analysis

100

100

80

60

40

20

60 20

edgeR DESeq baySeq DEGseq DEPln

0

edgeR DESeq baySeq DEGseq DEPln 100

40

Sensitivity (%)

80 60 40 20

Sensitivity (%)

AUC: 74.7%

80

AUC: 78.4%

0

0

100

80

Specificity (%)

60

40

20

0

Specificity (%)

(b) 100

(a)

60 40 20

Sensitivity (%)

80

AUC: 63.5%

edgeR DESeq baySeq DEGseq DEPln

0

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY LIBRARY on 01/24/15. For personal use only.

in°uence factors, the noise, the percentage of DE genes, and the change fold, are considered in studying the performance of the four methods. To test the impacts of the noise on the performance of the di®erential expression analysis, we design three simulation frameworks by varying the noise component  as  ¼ 0:1,  ¼ 0:3, and  ¼ 0:5. For each simulation the value of  in two conditions are the same. The other detailed parameters are selected as follows: (1) the total gene number N is 10,000; (2) the power-law distribution with  ¼ 1:25 is selected as the empirical distribution of ; (3) the value of PD is set as 20%; (4) the value of PA is set as 80%; and (5) the value of change fold is set as 2. Intuitively, more DE genes will be hidden by the noise as the value of  increases. Our analysis results in Fig. 3 validate

100

80

60

40

20

0

Specificity (%)

(c) Fig. 3. The in°uence of the noise component  on the performance of the di®erential expression analysis. Plot A: The ROC with  ¼ 0:1; Plot B: The ROC with  ¼ 0:3; Plot C: The ROC with  ¼ 0:5. 1550001-9

J. Wu et al.

80

100 60

40

20

20

edgeR DESeq baySeq DEGseq DEPln

0

edgeR DESeq baySeq DEGseq DEPln 100

60

80

AUC: 73.0%

40

Sensitivity (%)

60 40 20

Sensitivity (%)

80

AUC: 72.5%

0

0

100

80

Specificity (%)

60

40

20

0

Specificity (%)

(b) 100

(a)

60 40 20

Sensitivity (%)

80

AUC: 71.9%

edgeR DESeq baySeq DEGseq DEPln

0

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY LIBRARY on 01/24/15. For personal use only.

100

this intuition. The area under the ROC decreases as the value of  increases. From the results we can also see that the proposed method outperforms the other three methods in detecting DE genes. The in°uence of the percentage of DE genes is validated by changing the value of PD . We select three values of PD , 10%, 20%, and 30%, for comparison. The value of  is set as 0.3, and the other parameters are the same as in the previous test. The results are shown in Fig. 4, in which we can see that the change of PD has no signi¯cant in°uence on the discriminative performance of the four methods.

100

80

60

40

20

0

Specificity (%)

(c) Fig. 4. The in°uence of percentage of DE gene PD on the performance of the di®erential expression analysis methods. Plot A: The ROC with a small PD (PD ¼ 10%); Plot B: The ROC with a middle PD (PD ¼ 20%); Plot C: The ROC with a large PD (PD ¼ 30%). 1550001-10

An e®ective di®erential expression analysis

80

100 60

40

20

20

edgeR DESeq baySeq DEGseq DEPln

0

edgeR DESeq baySeq DEGseq DEPln 100

60

80

AUC: 78.7%

40

Sensitivity (%)

60 40 20

Sensitivity (%)

80

AUC: 72.9%

0

0

100

80

Specificity (%)

60

40

20

0

Specificity (%)

(b) 100

(a)

60 40 20

Sensitivity (%)

80

AUC: 82.8%

edgeR DESeq baySeq DEGseq DEPln

0

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY LIBRARY on 01/24/15. For personal use only.

100

The change fold can also be used to validate the performance of di®erential expression analysis methods. We select three values, 2, 3, and 4, for the change fold. The other parameter are ¯xed as: N ¼ 10; 000,  ¼ 0:3, PD ¼ 0:2, and PA ¼ 0:8. The comparison results are shown in Fig. 5, and in which we can see that the performance improves as the change fold level is increased.

100

80

60

40

20

0

Specificity (%)

(c) Fig. 5. The in°uence of change fold on the performance of the di®erential expression analysis methods. Plot A: The ROC with a change fold level of 2; Plot B: The ROC with a change fold level of 3; Plot C: The ROC with a change fold level of 4.

1550001-11

J. Wu et al.

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY LIBRARY on 01/24/15. For personal use only.

3.2. Real data test In addition to the synthetic data, we also evaluate our method on a real RNA-seq data set, the \Mice Strains" data set,25 which has been used for bench-marking RNAseq methods. This data set was provided by the ReCount repository26 and consists of 21 samples from two di®erent mouse strains (10 B6 strains and 11 D2 strains). Genes with a read count of zero in all 21 samples were removed, so were genes that contain a zero read count in at least one B6 and one D2 sample. Total normalization was performed by the TMM method the parameters set to their default values. Then, we applied the ¯ve analysis methods to detect the genes that express di®erentially and the parameters of the other four methods are set to their default values. DE genes were marked at a signi¯cance level of 0.01 after the Bonferroni correction. The number of DE genes found by the ¯ve methods and the overlay among the results of the methods are shown in Fig. 6. In the bar plot, we can see that the DEGSeq method detects the most genes that contain a large amount of \unique" DE genes, and the baySeq method found the fewest such genes. This means that the DEGSeq method may cause a large false discovery ratio. Our method found 2035 DE genes, of which 1859 (or 91.35%) were con¯rmed by at least one of the four other methods. Suppose that genes detected by at least two of the ¯ve methods are treated as the true DE genes, then there are 2064 such genes, and we can compute the precision, the recall, (or true positive rate) and the accuracy as the measure of the performance of the ¯ve methods. The results are shown in Table 1. In the table, we can see that our proposed DEPln method represents a better tradeo® between the recall and the precision than the other four methods and possesses a good accuracy.

Fig. 6. Analysis of the RNA-seq data set.25 Plot A: The number of genes found to be signi¯cantly DE between the two conditions; Plot B: Overlay among the sets of DE genes found by the ¯ve methods. 1550001-12

An e®ective di®erential expression analysis Table 1. The precision, recall, and accuracy of the ¯ve methods.

Precision (%) Recall (%) Accuracy (%)

edgeR

DESeq

baySeq

DEGSeq

DEPln

100 64.73 95.50

100 42.83 92.71

100 35.37 91.76

30.81 100 71.37

91.35 90.07 97.65

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY LIBRARY on 01/24/15. For personal use only.

To compare the results of our method to those of the original publication,25 we used their read count data that is based on an older version of the Ensembl gene de¯nitions (Ensembl 59). Of these 2035 DE genes, 1613 were also identi¯ed in the original publication results. 4. Conclusions In this study, we presented a PLN model based algorithm for e®ectively performing the di®erential expression analysis. The PLN distribution, rather than the NB distribution was chosen because it gives a better ¯t to the count data. However, the PLN distribution su®ers the inconvenience in the mathematical analysis. Our DEPln method consists of four steps. First, the TMM method is used to normalize the count data. Second, a trimming step is introduced to ensure an accurate approximation of the PLN distribution by the Gaussian form distribution. The simulation results show that the accuracy of the noise component estimation is greatly improved after the trimming step, especially with a data set with a large proportion of low count data. Third, we use an extremely simple mean-log approach to estimate the log form of the real expression level, which can provide the same precision as the complex EM and MLE methods when the sample number is small. We noticed that the estimation error is a little larger when the count is low, and we considered this situation in the downstream analysis. Finally, the previously estimated parameters were provided to test the signi¯cance of DE genes. Through varying the setting of the synthetic data generation, we found that the noise level and the change fold have signi¯cant in°uence on the performance. Intuitively, the true DE genes may be covered up by the noise when their change fold is not large enough. Furthermore, we compared our method with other four methods on both synthetic and real data. The results indicate that our method performs better than the other four methods in terms of discriminate ability and results in a better tradeo® between the recall rate and the precision. Acknowledgments This work was partially funded by the Longhua Medicial Project of the State Clinical Research Center of TCM at the Longhua Hospital (LYTD-21), the State Key Development Program for Basic Research of China (2010CB529205 and 2013CB967402), and the National Natural Science Foundation of China (91019004 and 91229123). The authors would like to thank the reviewers in advance for their comments. 1550001-13

J. Wu et al.

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY LIBRARY on 01/24/15. For personal use only.

References 1. Wang Z, Gerstein M, Snyder M, Rna-seq: A revolutionary tool for transcriptomics, Nat Rev Genetics 10(1):57–63, 2009. 2. Morozova O, Marra MA, Applications of next-generation sequencing technologies in functional genomics, Genomics 92(5):255–264, 2008. 3. Hammer P, Banck MS, Amberg R, Wang C, Petznick G, Luo S, Khrebtukova I, Schroth GP, Beyerlein P, Beutler AS, mrna-seq with agnostic splice site discovery for nervous system transcriptomics tested in chronic pain, Genome Res 20(6):847–860, 2010. 4. Marioni JC, Mason CE, Mane SM, Stephens M, Gilad Y, Rna-seq: An assessment of technical reproducibility and comparison with gene expression arrays, Genome Res 18(9):1509–1517, 2008. 5. Su H, Zhu Y-S, Serial analysis of gene expression with poisson-model based kernel principle component analysis, 2010 2nd Int Conf Information Engineering and Computer Science (ICIECS), pp. 1–5, 2010. 6. Bullard JH, Purdom E, Hansen KD, Dudoit S, Evaluation of statistical methods for normalization and di®erential expression in mRNA-Seq experiments, BMC Bioinform 11:94, 2010. 7. Hardcastle TJ, Kelly KA, baySeq: Empirical Bayesian methods for identifying di®erential expression in sequence count data, BMC Bioinform 11:422, 2010. 8. Anders S, Huber W, Di®erential expression analysis for sequence count data, Genome Biol 11(10):106, 2010. 9. Robinson MD, McCarthy DJ, Smyth GK, edger: A bioconductor package for di®erential expression analysis of digital gene expression data, Bioinformatics, 26(1):139–140, 2010. 10. Robinson MD, Smyth GK, Small-sample estimation of negative binomial dispersion, with applications to sage data, Biostatistics 9(2):321–332, 2008. 11. Wang L, Feng, Z, Wang X, Wang X, Zhang X, DEGseq: An R package for identifying di®erentially expressed genes from RNA-seq data, Bioinformatics 26(1):136–138, 2010. 12. Trapnell C, Roberts A, Go® L, Pertea G, Kim D, Kelley DR, Pimentel H, Steven L, Rinn JL, Pachter L, Di®erential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cu®links, Nat Protocols 7(3):562–578, 2012. 13. Balwierz PJ, Carninci P, Daub CO, Kawai J, Hayashizaki Y, Van Belle W, Beisel C, van Nimwegen E, Methods for analyzing deep sequencing expression data: Constructing the human and mouse promoterome with deepcage data, Genome Biol 10(7):79, 2009. 14. Suzuki H, Forrest ARR, van Nimwegen E, Daub CO, Balwierz PJ, Irvine KM, Lassmann T, Ravasi T, Hasegawa Y, de Hoon MJL, The transcriptional network that controls growth arrest and di®erentiation in a human myeloid leukemia cell line, Nat Genetics 41(5):553–562, 2009. 15. Gallopin M, Rau A, Ja®rezic, F, A hierarchical poisson log-normal model for network inference from RNA sequencing data, PLOS ONE 8(10):77503, 2013. 16. Lund SP, Nettleton D, McCarthy DJ, Smyth GK, Detecting di®erential expression in RNA-sequence data using quasi-likelihood with shrunken dispersion estimates, Stat Appl Genetics Mole Biol 11(5):8, 2012. 17. Zwinderman HH, Thygesen AH, Modeling SAGE data with a truncated gamma-poisson model, BMC Bioinform 7(1):157, 2006. 18. Robinson MD, Oshlack A, A scaling normalization method for di®erential expression analysis of RNA-Seq data, Genome Biol 11(3):R25, 2010. 19. Johnson NL, Kotz S, Balakrishnan N, Continuous Univariate Distributions, Wiley series in probability and mathematical statistics: Applied probability and statistics, Vol. 1–2, Wiley, New York, 1994.

1550001-14

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY LIBRARY on 01/24/15. For personal use only.

An e®ective di®erential expression analysis

20. Karlis D, EM algorithm for mixed poisson and other discrete distributions, Astin Bull 35(1):3–24, 2005. 21. Bulmer M, On ¯tting the poisson lognormal distribution to species-abundance data, Biometrics 30:101–110, 1974. 22. Engen S, Lande R, Walla T, DeVries PJ, Analyzing spatial structure of communities using the two-dimensional poisson lognormal species abundance model, Amer Naturalist 160(1):60–73, 2002. 23. R Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2003. 24. Clauset A, Shalizi CR, Newman ME, Power-law distributions in empirical data, SIAM Rev 51(4):661–703, 2009. 25. Bottomly D, Walter NA, Hunter JE, Darakjian P, Kawane S, Buck KJ, Searles RP, Mooney M, McWeeney SK, Hitzermann R, Evaluating gene expression in C57BL/6J and DBA/2J mouse striatum using RNA-Seq and microarrays, PLOS ONE 6(3):e17820, 2011. 26. Frazee AC, Langmead B, Leek JT, ReCount: A multi-experiment resource of analysisready RNA-seq gene count datasets, BMC Bioinform 12(1):449, 2011.

Jun Wu was born in Anhui Province, China on February 2, 1987. He received his BS and MS degrees from the School of Electronic Information Engineering at Changchun University of Science and Technology, Changchun, China, in 2008 and 2011, respectively. He is currently working toward achieving PhD in the Department of Automation at Shanghai Jiao Tong University. His current research interests include gene expression pro¯le analysis and transcriptional regulation network construction and analysis.

Xiaodong Zhao is a principal investigator of Shanghai Jiao Tong University. He received his PhD from China Centre of Disease Control and Prevention in 2003 and conducted his post-doctoral research at Genome Institute of Singapore from 2003–2008. After that he worked as a research associate fellow at Cambridge Institute of Medical Research of Cambridge University. In 2010, he joined the faculty of Shanghai Jiao Tong University. He has published 17 papers, which have been cited over 1400 times. Since returning to China, his research has being funded by NSFC, 973 Program and SATCM. During the past few years, he has been making e®orts to develop and apply cutting-edge high throughput genomic technologies to explore genome function. Currently his research is focused on the transcriptional regulation network and epigenetic regulation in stem cell.

1550001-15

J. Bioinform. Comput. Biol. Downloaded from www.worldscientific.com by FLINDERS UNIVERSITY LIBRARY on 01/24/15. For personal use only.

J. Wu et al.

Zongli Lin is a professor of Electrical and Computer Engineering at the University of Virginia. He received his BS degree in mathematics and computer science from Xiamen University, Xiamen, China, in 1983, his Master of Engineering degree in automatic control from Chinese Academy of Space Technology, Beijing, China, in 1989, and PhD in electrical and computer engineering from Washington State University, Pullman, Washington, in 1994. His current research interests include nonlinear control, robust control, and control applications. He was an Associate Editor of the IEEE Transactions on Automatic Control (2001–2003), IEEE/ASME Transactions on Mechatronics (2006–2009), and IEEE Control Systems Magazine (2005–2012). He was an elected member of the Board of Governors of the IEEE Control Systems Society (2008–2010) and has served on the operating committees and program committees of several conferences. He currently chairs the IEEE Control Systems Society Technical Committee on Nonlinear Systems and Control and serves on the editorial boards of several journals and book series, including Automatica, Systems and Control Letters, Science China Information Sciences, and Springer/Birkhauser book series Control Engineering. He is a Fellow of the Institute of Electrical and Electronics Engineers (IEEE), the International Federation of Automatic Control (IFAC), and the American Association for the Advancement of Science (AAAS).

Zhifeng Shao received his undergraduate training at Nanjing University in China and PhD from the University of Chicago in 1988. He then became an assistant professor of molecular physiology and biophysics in 1989 and full professor in 1998 at the University of Virginia, Charlottesville. He was an elected Fellow of AAAS in 2000. In 2009, he was appointed a K. C. Wong Chair Professor of biomedical engineering at Shanghai Jiao Tong University and served as the Dean of Systems Biomedicine from 2009– 2013. He has an authored more than 120 research articles and holds six US patents. His research ¯eld includes structural characterization at the nanoscale, functional genomics, and molecular biophysics.

1550001-16

An effective differential expression analysis of deep-sequencing data based on the Poisson log-normal model.

Tremendous amount of deep-sequencing data has unprecedentedly improved our understanding in biomedical science by digital sequence reads. To mine usef...
2MB Sizes 1 Downloads 7 Views