Imputing missing values for genetic interaction data.

Methods xxx (2014) xxx–xxx

Contents lists available at ScienceDirect

Methods journal homepage: www.elsevier.com/locate/ymeth

Imputing missing values for genetic interaction data Yishu Wang a, Lin Wang a, Dejie Yang b, Minghua Deng a,c,d,⇑ a

Center for Quantitative Biology, Peking University, Beijing 100871, China Institute of Computing Technology, Chinese Academy of Science, Beijing 100190, China c School of Mathematical Sciences, Peking University, Beijing 100871, China d Center for Statistical Sciences, Peking University, Beijing 100871, China b

a r t i c l e

i n f o

Article history: Received 2 October 2013 Accepted 27 March 2014 Available online xxxx Keywords: Soft-SVD Imputation EMAP Genetic interaction

a b s t r a c t Background: Epistatic Miniarray Profiles (EMAP) enable the research of genetic interaction as an important method to construct large-scale genetic interaction networks. However, a high proportion of missing values frequently poses problems in EMAP data analysis since such missing values hinder downstream analysis. While some imputation approaches have been available to EMAP data, we adopted an improved SVD modeling procedure to impute the missing values in EMAP data which has resulted in a higher accuracy rate compared with existing methods. Results: The improved SVD imputation method adopts an effective soft-threshold to the SVD approach which has been shown to be the best model to impute genetic interaction data when compared with a number of advanced imputation methods. Imputation methods also improve the clustering results of EMAP datasets. Thus, after applying our imputation method on the EMAP dataset, more meaningful modules, known pathways and protein complexes could be detected. Conclusion: While the phenomenon of missing data unavoidably complicates EMAP data, our results showed that we could complete the original dataset by the Soft-SVD approach to accurately recover genetic interactions. Ó 2014 Elsevier Inc. All rights reserved.

1. Introduction Genetic interactions refer to the phenomenon whereby the mutation phenotype of two genes differs to the superimposition effect of two single mutations [1]. In budding yeast and fission yeast, genetic interactions can be acquired using the high-throughput technology known as the Epistatic Miniarray Profile (EMAP) platform [2]. EMAP can construct double deletion strains systematically by crossing query strains with a library of test strains. Afterwards, the colony size of the double mutant strains is measured to get the S score, which can indicate the genetic interaction, either synthetic sick/lethal or alleviating [3]. In EMAP datasets, each gene in the query, or library, has its genetic interaction spectrum constructed by the genetic interaction S score with other genes in the library, or query. Researchers can exploit biological pathways and reveal protein complexes by clustering the S score matrix and, as a result, find cellular organization and gene functions.

⇑ Corresponding author at: School of Mathematical Sciences, Peking University, Beijing 100871, China. E-mail address: [email protected] (M. Deng).

However, one common characteristic of running EMAP is the significantly high proportion of missing values, even up to 35%, which can reduce the effectiveness of such data analysis techniques as cluster analysis and even prevent the use of some matrix factorization techniques, such as SVD or PCA. This phenomenon of missing entries could be explained by the inability of highthroughput technologies to measure genetic interaction strengths. Also, some genetic interactions could be subsequently filtered as a result of unreliability. The problem of missing values in genetic interaction datasets has been discussed before, but few technologies are used to impute quantitative epistasis values in EMAP datasets [4]. Some previous papers reported improvements in some techniques used in gene expression datasets, subsequently applying them to EMAP data. Four general strategies are considered in EMAP data, including three nearest-neighbor-based: k-Nearest Neighbors imputation (kNN) [5], Local Least Squares imputation (LLS) [6], and Iterated Local Least Square imputation (iLLS) [7]; and one global method known as Bayesian Principal Component Analysis imputation (BPCA) [8]. The term ‘‘local’’ used here represents those algorithms that impute missing values through local information around the missing value. Previously, in order to improve the accuracy of

http://dx.doi.org/10.1016/j.ymeth.2014.03.032 1046-2023/Ó 2014 Elsevier Inc. All rights reserved.

Please cite this article in press as: Y. Wang et al., Methods (2014), http://dx.doi.org/10.1016/j.ymeth.2014.03.032

2

Y. Wang et al. / Methods xxx (2014) xxx–xxx

missing value predictions, these original imputing techniques have incorporated the symmetric character of datasets [4,9]. However, with the recent development of EMAP technology and the need for practical application, most EMAP datasets are asymmetric[10–12]. Therefore, the symmetric characteristic is not appropriate for use in predicting missing entries on an increasing number of new EMAP experimental datasets. We extended the original SVD method by giving it a soft threshold, changing some optimization functions and restricting conditions [13]. This method which can be called Soft-SVD has been used in ‘‘Netflix’’ competition [13], image recovery [13] and eQTL [14], and it has been demonstrated as the most efficient algorithm in these fields. The soft-SVD algorithm is not restricted by symmetry. As such, it can be used in a wide range of EMAP datasets. We introduced this methodology to impute missing values in EMAP datasets, and, hereinafter, we call it Soft-Impute. The Soft-Impute methodology adopts the soft threshold to SVD algorithm and proposes that the nuclear norm results in a convex optimization problem. It takes advantage of the relevance in a given dataset to impute missing entries by finding a low-rank matrix that is close to the original one on the observed data. This algorithm is suitable for datasets in which modules are found with highly correlated entries. In Part 4, we constructed a synthetic dataset with low-rank matrix to test the effect of the Soft-Impute algorithm and compared it with the imputation accuracy of different imputation methods. EMAP datasets store genetic interaction spectra, where genes in the same protein complex or biological pathway tend to have similar genetic interaction spectra. Accordingly, EMAP data matrices contain several highly correlated modules. As a whole, the matrix is low-rank, which satisfies the request of the Soft-Impute algorithm for datasets. In this paper, we systematically described how the Soft-Impute method is applied to EMAP dataset imputation and then applied this method, along with four other imputation methods, to three public EMAP datasets. We then conducted a detailed comparison among these imputation techniques, showing the marked improvement of the Soft-Impute algorithm in the performance of missing entries estimation. Beyond imputation accuracy, we also evaluated these methods in terms of their ability to detect genetic interaction modules in which genes have similar interaction profiles, are involved in the same physical complex or pathway, and are also enriched in GO terms. After imputing missing entries in the EMAP score matrix, we demonstrated the downstream analysis in which hierarchical clustering results were highly improved, and more significant genetic interaction modules, which are enriched in the known discoveries, could be exploited.

2. Materials

2.2. Synthetic data We created a synthetic dataset with low-rank to realize the Soft-Impute algorithm and compared it with other imputation methods. We assumed k modules in the synthetic dataset, in which entries in the same module have higher relevance. In contrast, entries not in the same module have lower relevance. We constructed a dataset of 250 elements representing query genes in 500 dimensions standing for library genes as follows: 1. We first constructed a vector of 500 elements randomly chosen ! from ESP dataset. as Denoted by A1 ¼ fa1 ; a2 ; . . . ; a500 g ! (a) Multiply by Gaussian noise to every element ai of A1 , which ! is randomly chosen from Nð1; jai jÞ. and denoted by A2 . ! ! (b) Repeat (a) for k times which results in k vectors f A 1 ; . . . A k g 2. Generate a matrix with rank k using the above k vectors. (a) Generate k random numbers n1 ; n2 ; . . . ; nk , ranging from 3 to 30, such that their sum is 250. ! (b) Generate a matrix with 250 vectors by repeating vector A1 ! ! for n1 times, vector A2 for n2 times,. . ., vector Ak for nk times. Apparently, such matrix is of rank k. 3. Add a Gaussian noise drawn from Nð0; 0:5Þ to each entry of the above matrix. 4. Now we construct a matrix of 250 vectors with 500 dimensions. 3. Model and algorithm 3.1. Soft-Impute model EMAP data can be represented by a matrix Xmn , where m and n represent the number of query genes and library genes. To represent the existence of missing entries in EMAP dataset matrix, X f1; . . . ; mg f1; . . . ; ng denotes the indices of observed entries. Therefore, X is the original data with observed entries denoted by X and missing values denoted by X? . To impute this matrix, we aim to find a complete matrix Z, which is close to X on the observed entries X and has low rank. Here, the low- rank assumption is based on the consideration that the genetic interaction profile for cofunctional genes is shown to have highly correlated relationships [15,16]. Srebro et al. have studied generalization error bounds for learning the low-rank matrices [18], Their work also showed, theoretically, that the true underlying matrix could be recovered with very high accuracy under certain assumptions based on the entries of the matrix, locations, and proportion of unobserved entries [19–21]. Mazumder et al. formulated the above problem as the following optimization problem [13]:

minimize rankðZÞ X ðX ij Z ij Þ2 6 d subject to

ð1Þ

ði;jÞ2X

2.1. Epistatic Miniarray Profile (EMAP) The genetic interaction datasets used here are from EMAP analysis. Three EMAP datasets are used in our analysis, including the early secretory pathway (ESP) [15], chromosome biology (CHR) [16] for budding yeast, and a genome-wide EMAP profile for fission yeast [17]. The first two datasets are symmetric matrices for which query genes are the same as library genes. There are 424 genes with about 80,000 pairwise measurements in the ESP dataset, containing about 7.5% missing entries, while there are 743 genes with about 200,000 pairwise measurements in the CHR dataset, containing about 34% missing entries. On the contrary, the genome-scale genetic interaction matrix of fission yeast is not symmetric and contains 953 alleles of 876 genes against a mutant library of more than 2000 deletions, resulting in an EMAP profile with 1.6 million genetic interactions and 16% missing entries.

where d P 0 is a regularization parameter to control the error tolerance. However, in the above optimization, the rank constraint makes the problem combinatorially hard for general X [22]. One small modification to 1 is [13]:

minimize kZk X subject to ðX ij Z ij Þ2 6 d

ð2Þ

ði;jÞ2X

P where kZk is the nuclear norm of Z (kZk ¼ ri¼1 ri , where r1 ; . . . ; rr are the singular values of Z and r is the rank of Z). This modification constitutes a problem in convex optimization[23]. We can be reformulated 2 to the Lagrange form [13]:

minimize Z

1 X ðX ij Z ij Þ2 þ kkZk 2 ði;jÞ2X


ð3Þ

3


Here k P 0 is a regularization parameter controlling the nuclear e k of 3. norm of estimated value Z Suppose that we only observe a subset of X, indexed by X, and that the missing entries are indexed by X? . If we define an orthogonal projection operator P, the matrix X can be projected onto the linear space of matrices supported by X [19]:

PX ðXÞði;jÞ ¼

X ij

if ði; jÞ 2 X

0

if ði; jÞ R X

ð4Þ

Now the matrix completion problem in Lagrange form 3 can be written nicely as:

minimize Z

1 kPX ðXÞ PX ðZÞk2F þ kkZk 2

To solve the optimization problem 5, we first present the following lemma (proof can be found in [13]). Lemma If matrix Wmn has rank r, then the optimization problem:

ð6Þ

b ¼ Sk ðWÞ, where has solution Z

Sk ðWÞ ¼ UDk V

Algorithm: Soft-Impute

1. Initialize Zold ¼ 0 2. For ki in the grid of k, compute from kmax (a) Repeat: Ski ðPX ðXÞ þ PX? ðZold ÞÞ. i. Compute Znew ii. Define the energy function: fki ðZÞ ¼ 12 kP X ðXÞ P X ðZÞk2F þ kkZk , jfk ðZnew Þfki ðZold Þj < e exit. if i old fki ðZ

ð5Þ

3.2. Lemma

1 min kW Zk2F þ kkZk Z 2

Now we have the algorithm:

Þ

iii. Zold

Znew . bk (b). Assign Z Znew . i Then kiþ1 ¼ ki 0:9 for kiþ1 . Repeat (a) to (b) until: rankðZÞ P rankmax ðZÞ. bk ; . . . ; Z bk ; Z bk ; . . . ; Z bk . 3. Output the solutions: Z max 1 i iþ1 b . 4. Choose the optimal solution: Z k optimal

4. Results and discussion 4.1. Assessing the accuracy of quantitative imputation

T

with Dk ¼ diag½ðd1 kÞþ ; . . . ; ðdr kÞþ

ð7Þ

UDVT is the Singular Value Decomposition (SVD) of W and here t þ ¼ maxðt; 0Þ. The notation Sk ðWÞ refers to soft-thresholding [24] 3.3. Soft-Impute algorithm Now we begin to introduce the Soft-Impute algorithm. First, we rewrite 5 as follows:

1 kX Zk2F þ kkZk 2 1 ¼ minimize kPX ðXÞ ½Z PX? ðZÞk2F þ kkZk Z 2 1 ¼ minimize k½PX ðXÞ þ P?X ðZÞ Zk2F þ kkZk Z 2

minimize Z

ð8Þ

By Lemma in part 3.2, the optimal solution of optimization problem 8 can be solved by iteratively updating Z using

Z Sk ðPX ðXÞ þ PX? ðZÞÞ

ð9Þ

with an arbitrary initialization. We propose a cross-validation-like strategy to select the optimal parameter tuning. The idea is as follows: X is the index of observed entries of X. First, we randomly introduce 5% more artificial deletions in X, by deleting the original observed ones to get a test dataset. Then we solve 5 on a grid of k values on the test dataset. We start from a large kmax , which equals the second largest singular value of matrix P X ðXÞ. We set the maximum rank of Z, denoted as rankmax ðZÞ, equal to min (m, n). If rankðZÞ < rankmax ðZÞ, we continue to solve 5 and reduce k by a factor g ¼ 0:9 until rankðZÞ P rankmax ðZÞ. Finally, to select the optimal parameter, we evaluate the prediction error between the actual data and the predicted data on a grid of k on the test dataset. Here the Pearson correlation and the normalized root mean squared error (NRMSE) 10 are used as the evaluation criteria. We can choose the parameter k , which minimizes the prediction error (the highest Pearson correlation or lowest NRMSE).

sffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2 mean½ðijanswer ijguess Þ NRMSE ¼ variance½ijanswer

ð10Þ

We applied the Soft-Impute to three genetic interaction datasets and one synthetic dataset. The three genetic interaction datasets investigated here are ESP-EMAP, CHR-EMAP and the S. pombe global genetic interaction map. To assess the effectiveness of imputation techniques for genetic interaction datasets (EMAP), we artificially introduce additional missing values on the basis of an existing incomplete EMAP matrix. We randomly delete the original observed data in EMAP or synthetic dataset to get a new matrix with artificial missing entries. This process is repeated multiple times so that we get a series of test matrices for every original EMAP or synthetic data matrix. After applying different imputation methods on these test matrices, we can evaluate the prediction error between the actual data and the predicted data. Then we get one imputation accuracy distribution for each EMAP dataset for each imputation method. For ESP, CHR and the synthetic datasets, we repeat the artificial deletions twenty times, resulting in twenty test matrices, respectively. The S. pombe set, which contains about 1.6 million pairwise data, is too computationally expensive; therefore, we repeat only 15 times. Fig. 1a and b shows when the 10% rate of artificial missing values is generated 20 times in the synthetic dataset. The Soft-Impute algorithm performs best on all test matrices, irrespective of Pearson correlation or NRMSE evaluated. The BPCA algorithm could not get one stable result. As demonstrated in [8], BPCA does not perform well when genes have dominant local similarity structures. Our test matrices induced from the synthetic dataset have high local correlation with respect to a gene module, but the BPCA algorithm is limited by this characteristic. In order to demonstrate the performance of the Soft-Impute algorithm on datasets with different rates of missing values, an artificial missing rate different from 1% to 37% is used. Fig. 1c and d give the imputation results of this gradient missing rate in 20 synthetic data test matrices. To demonstrate whether such accuracy performance differences, as seen in the synthetic dataset, are statistically significant, we performed the t-test for every two distributions and derived the statistical significance (P-value) (see Fig. 2a and b). The running times, as shown in Table 1, are computed when these imputation methods are applied on 20 synthetic data test matrices with 10% artificial deletion values and gradient artificial missing value rate.



10

15

1.5 1.0

1.5 1.0

Person Correlation

0.5 0.0

5

20

5

0.10

0.15

0.20

0.25

0.30

15

20

0.35

Synthetic datasets with gradient rate of missing values

2.5 0.5

1.0

1.5

2.0

knn ills lls bpca soft

0.0

1.5 1.0 0.5 0.0

Person Correlation

(d) knn ills lls bpca soft

0.05

10

Synthetic datasets with 10% missing values

normalized root mean squared error (NRMSE)

Synthetic datasets with 10% missing values

(c)

knn ills lls bpca soft

0.5

(b) knn ills lls bpca soft

0.0

(a)

normalized root mean squared error (NRMSE)

4

0.05

0.10

0.15

0.20

0.25

0.30

0.35

Synthetic datasets with gradient rate of missing values

Fig. 1. Capability of the imputation methods to reproduce the original measurements in the synthetic datasets. (a and b) The imputation accuracy (Pearson correlation and NRMSE) on synthetic datasets with 10% values deleted. (c and d) The results on the same original synthetic dataset with gradient missing rates. The x axis represents the missing rate.

Imputation accuracies in terms of Pearson correlation and NRMSE for the different imputation methods on 20 ESP-EMAP data test matrices with artificial 10% deletion values are shown in Fig. 3a and b. Within each matrix, the best imputation accuracy is obtained by Soft-Imputation, while the BPCA algorithm is not stable enough to get even one optimal result. The second best performing methods are LLS and ILLS, which are more timeconsuming (see Table 2). To demonstrate the imputation results across different missing rates on ESP-EMAP, we also constructed the gradient artificial missing rate from 1% to 25% by introducing artificial missing entries in the observed subset (see Fig. 3c and d). For each method, the imputation accuracy decreases with the increase of missing value rate; meanwhile, the Soft-Impute method almost achieves a better performance for different missing rates. Because of the high rate of missing values in the original datasets of CHR-EMAP and S. pombe, we did not evaluate the imputation accuracy with gradient missing rates. For the CHR-EMAP and S. pombe datasets, we introduced 5% artificial missing values. The BPCA method depends heavily on the properties of the dataset being imputed, and, therefore, it was difficult to find a stable optimal point. This time, in terms of the S. pombe dataset, BPCA could not converge and, hence, no BPCA result. Running times of the different imputation methods on these three EMAP datasets are presented in Table 2. k Nearest Neighbor imputation is faster than the more advanced imputation methods,

but it has the worst imputation accuracy (Fig. 3). The Soft-Impute algorithm costs a median time, but it achieves the best accuracy among the different methods in all datasets. Its surprising ability to impute the EMAP datasets is supported by the results shown in Fig. 3. We have demonstrated the process of constructing EMAP and synthetic data test matrices with random artificial missing values. Thus, for every EMAP dataset (ESP, CHR, S. pombe) and the synthetic dataset, there are several distributions of imputation accuracy in terms of different imputation methods. To demonstrate that the imputation accuracy differences on EMAP datasets of different imputation methods are statistically significant, we derived statistical significance (P-value) of t-test between the accuracy distribution of Soft-Imputation and that of the other methods for every EMAP dataset. This result is shown in Fig. 2, in which the X axis represents methods, and the Y axis represents the mean of imputation accuracy. For example, in Fig. 2a and b, a set of 20 test matrices with artificial deletions is induced from the original synthetic dataset. After imputing on this set with different imputation methods, two accuracy distributions resulted, including Pearson correlation (Fig. 2a) and NRMSE (Fig. 2b), for each imputation method. In these two figures, the means of accuracy distributions of different imputation methods are plotted in Fig. 2, where the red histogram represents the Soft-Impute method, and blue histograms represent the other


5


Soft.iLLs

Soft.BPCA

0.8 0.6

1.2

0.01

Soft.kNN

Soft.LLS

Soft.iLLs

Soft.BPCA

Soft.kNN

Soft.LLS

Soft.iLLs

Soft.BPCA

0.6 0.4

1.2

Imputing missing yield trial data.

iVAR: a program for imputing missing data in multivariate time series using vector autoregressive models.

Kalman Filtering for Genetic Regulatory Networks with Missing Values.

Improving Disparity Research by Imputing Missing Data in Health Care Records.

Statistical data preparation: management of missing values and outliers.

Missing values in big data research: some basic skills.

Comparison of random forest and parametric imputation models for imputing missing data using MICE: a CALIBER study.

Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values.

An Improved DINEOF Algorithm for Filling Missing Values in Spatio-Temporal Sea Surface Temperature Data.

In-depth method assessments of differentially expressed protein detection for shotgun proteomics data with missing values.

A joint framework for missing values estimation and biclusters detection in gene expression data.

Multivariate t nonlinear mixed-effects models for multi-outcome longitudinal data with missing values.

Application of Multiple Imputation for Missing Values in Three-Way Three-Mode Multi-Environment Trial Data.

mvp - an open-source preprocessor for cleaning duplicate records and missing values in mass spectrometry data.

Ethnicity in Pediatric Electronic Health Records: Reducing Bias with Use of U.S. Census Location and Surname Data.

A nonparametric multiple imputation approach for data with missing covariate values with application to colorectal adenoma data.

Viral Genetic Linkage Analysis in the Presence of Missing Data.

Low-Rank and Sparse Matrix Decomposition for Genetic Interaction Data.

Handling missing values in the MDS-UPDRS.

Drawing inferences from clinical studies with missing values using genetic algorithm.

Influence of missing values substitutes on multivariate analysis of metabolomics data.

Analysis of longitudinal data from animals with missing values using SPSS.

Reporting the reliability of accelerometer data with and without missing values.

Robust smoothing of gridded data in one and higher dimensions with missing values.