Biostatistics (2016), 17, 4, pp. 677–691 doi:10.1093/biostatistics/kxw013 Advance Access publication on April 4, 2016

Hypothesis testing for differentially correlated features

ELISA SHENG
Department of Biostatistics, University of Washington, Seattle, WA, USA

DANIELA WITTEN
Department of Biostatistics and Department of Statistics, University of Washington, Seattle, WA, USA

XIAO-HUA ZHOU∗,†
Department of Biostatistics, University of Washington, Seattle, WA, USA and Veterans Affairs Seattle Medical Center, Seattle, WA, USA
[email protected]

SUMMARY

In a multivariate setting, we consider the task of identifying features whose correlations with the other features differ across conditions. Such correlation shifts may occur independently of mean shifts, or differences in the means of the individual features across conditions. Previous approaches for detecting correlation shifts consider all features simultaneously, by computing a correlation-based test statistic for each feature. However, since correlations involve pairs of features, such approaches do not lend themselves to identifying which feature is the culprit. In this article, we instead take a serial testing approach, comparing columns of the sample correlation matrix across two conditions and removing one feature at a time. Our method provides a novel perspective and favorable empirical results compared with competing approaches.

Keywords: Correlation matrix; Differential correlation; Feature selection; Hypothesis testing; Wald test.

∗To whom correspondence should be addressed.
†Xiao-Hua (Andrew) Zhou, PhD, is presently a Core Investigator and Biostatistics Unit Director at the Seattle HSR&D Center of Innovation for Veteran-Centered and Value-Driven Care, Department of Veterans Affairs Medical Center, Seattle, WA. The views expressed in this article are those of the authors and do not necessarily represent the views of the Department of Veterans Affairs.

1. INTRODUCTION

In many modern research settings, it is of interest to identify features in a multivariate dataset that differ across conditions. For instance, in functional genetics, researchers look for differences in gene expression profiles between normal and diseased tissues. Using functional MRI data, scientists search for voxels that differ before and after a stimulus. To study differences across conditions, one typically considers the mean of each feature in each condition. Suppose X and Y are multivariate random vectors of a single set of p features under two different conditions. Let μ(X) and μ(Y) denote the means of X and Y, respectively, and let μ_j(X) denote the mean

© The Author 2016. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: [email protected].

678 E. SHENG AND OTHERS

of the jth feature of X. To detect mean differences, one can consider null hypotheses of the form H∗_0j : μ_j(X) = μ_j(Y). However, mean differences may not capture the underlying data characteristics that differ across conditions. For example, in the context of gene expression data, co-expression, or expression of two or more genes together, may change across conditions. To address dependence between two or more genes, methods have been proposed that incorporate correlation structure in order to test pairs or small sets of genes for mean differences (Lai and others, 2004; Shedden and Taylor, 2004; Xiao and others, 2004; Dettling and others, 2005; Ho and others, 2007). However, differences in correlations may also occur independently of mean differences. One way to compare correlations across conditions is to test for equality of the covariance or correlation matrices. Suppose R(X) and R(Y) are the population correlation matrices of X and Y, respectively. The null hypothesis of interest is then H_0 : R(X) = R(Y). In the low-dimensional setting (i.e. p ≪ n, m, where n and m are the sample sizes in the two conditions), classical approaches for testing this null hypothesis are based on the likelihood ratio (Cole, 1968) or on Wald-type test statistics (Kullback, 1967; Jennrich, 1970; Layard, 1972; Larntz and Perlman, 1985; Modarres and Jernigan, 1992; Satorra and Neudecker, 1997). Testing equality of covariance matrices is also a well-studied problem (Wilks, 1932; Bartlett, 1937; Muirhead, 1982; Seber, 1984). Methods have also been proposed for the high-dimensional setting (p > n, m) (Ledoit and Wolf, 2002; Schott, 2007; Li and Qin, 2014). In either the low- or high-dimensional setting, the aforementioned references are concerned with testing the global null hypothesis, H_0 : R(X) = R(Y). In other words, the goal is to determine whether the covariance or correlation matrix of all p features differs across conditions.
In the present article, we are instead interested in identifying individual features whose correlations with other features differ across two conditions. Several recent papers have proposed approaches for identifying features whose correlations differ across conditions (e.g. Gill and others, 2010; Amar and others, 2013; Bockmayr and others, 2013). However, these methods do not employ a formal hypothesis testing framework. In the multivariate normal setting, some authors have considered identifying pairs of features whose partial correlations differ across conditions (Guo and others, 2011; Mohan and others, 2012; Danaher and others, 2014); however, these methods focus on pairwise partial correlation differences across conditions, whereas we are interested in comparing relationships of each feature with all other features across conditions. More related to this article are recent proposals by Hu and others (2010) and Cai and others (2013), who consider tests for equality of the columns of R(X) and R(Y) in order to identify correlation differences in individual features. However, neither proposal takes advantage of the particular correlation structure that arises when a small number of features are disrupted across conditions, leading to a situation in which the correlations of those features with many other features differ across conditions. This would be the case in a comparison of gene expression profiles between normal and disease conditions if only one gene were disrupted in some disease, where that disruption affects that gene’s correlation with many other genes. In this article, we are interested in characterizing the notion of a disrupted gene in terms of correlation matrices, and developing a corresponding hypothesis testing framework to identify features of interest. As a motivating example, suppose X and Y represent gene expression profiles under two distinct conditions. 
Let R j (X) and R j (Y) denote, respectively, the jth columns of R(X) and R(Y), that is, the correlations between the jth gene and the p − 1 other genes, in the two conditions. Suppose that the first gene is disrupted in one of the two conditions, but that the two conditions are otherwise identical. This would

[Figure 1 appears here: three heatmaps showing R(X), R(Y), and the difference R(X) − R(Y), on a color scale from 1 to −1.]

Fig. 1. In the motivating example of Section 1, the two correlation matrices differ within a single row/column. Our goal is to identify only the first feature as different between the two conditions.

manifest as R(X) ≠ R(Y), and more specifically, R_1(X) ≠ R_1(Y). Note that the submatrices of R(X) and R(Y) with the first row and column removed are identical; that is, all of the differences between R(X) and R(Y) can be attributed to the first gene. Figure 1 illustrates this motivating example. Our goal is to develop a testing procedure that will determine that, in the set-up of Figure 1, all differences between the correlation matrices across the two conditions can be attributed to the first feature. To describe the problem more clearly, we define the terms 'dysregulated' and 'minimally dysregulated', language borrowed from the study of gene expression data, as follows.

DEFINITION 1 The jth feature is dysregulated across conditions X and Y if R_j(X) ≠ R_j(Y).

Let C ⊆ {1, . . . , p} denote a subset of size k, and let X_{−C} denote the subvector of the random p-vector X corresponding to all but the features in C.

DEFINITION 2 Consider a set of k features, M ⊆ {1, . . . , p}, that satisfies the following conditions: (1) R(X_{−M}) = R(Y_{−M}); (2) for all strictly smaller subsets C ⊂ M, R(X_{−C}) ≠ R(Y_{−C}). We call M the minimally dysregulated set of features.

Once we remove all of the features in M from X and Y, there is no difference in the correlation matrices among the remaining features across conditions; that is, R(X_{−M}) = R(Y_{−M}). Using this terminology, we can now see that our goal is to identify M (e.g. {1} in Figure 1), as opposed to the set of dysregulated features (e.g. {1, . . . , p/2} in Figure 1). In Section 2, we develop Wald-type test statistics for testing null hypotheses of the form

H_0j : R_j(X) = R_j(Y),

(1.1)

for j = 1, . . . , p. We then show that simultaneous tests of (1.1) are not appropriate, as they lead to rejection for more than just the first feature in Figure 1 (i.e. they lead to a poor estimate of M). This motivates the need for a different approach. In Section 3, we develop a method to test the series of null hypotheses of the form

H_0^k : ⋃_{C ∈ C(p,k)} {R(X_{−C}) = R(Y_{−C})}    (1.2)


for k = 0, . . . , p − 2, where C(p, k) denotes the set of all k-combinations of {1, . . . , p}. This series of hypothesis tests leads naturally to an estimator of the set M, which we show to have excellent empirical performance. In Section 4, we apply our proposal to two gene expression data sets. The discussion is in Section 5.

2. A SIMULTANEOUS APPROACH

Let X = (X_1, . . . , X_p)′ and Y = (Y_1, . . . , Y_p)′ denote p-dimensional random vectors corresponding to a single set of p features under two distinct conditions. Let Σ(X) and Σ(Y) denote the population covariance matrices of X and Y, and let R(X) and R(Y) denote their population correlation matrices, respectively. Without loss of generality, assume that the population mean vectors of X and Y equal zero. Let (x_1, . . . , x_n) and (y_1, . . . , y_m) be independent and identically distributed samples of X and Y. Denote the empirical covariance matrices by

Σ̂(X) = (1/(n − 1)) ∑_{i=1}^{n} (x_i − x̄)(x_i − x̄)′,  Σ̂(Y) = (1/(m − 1)) ∑_{i=1}^{m} (y_i − ȳ)(y_i − ȳ)′,

where x̄ = (1/n) ∑_{i=1}^{n} x_i and ȳ = (1/m) ∑_{i=1}^{m} y_i. Denote the empirical correlation matrices by

R̂(X) = diag(Σ̂(X))^{−1/2} Σ̂(X) diag(Σ̂(X))^{−1/2},
R̂(Y) = diag(Σ̂(Y))^{−1/2} Σ̂(Y) diag(Σ̂(Y))^{−1/2}.
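As a concrete illustration, the empirical correlation matrices above can be computed directly from the sample matrices. The following is a minimal numpy sketch (the function name `empirical_correlation` is ours, not the paper's); it forms diag(Σ̂)^{−1/2} Σ̂ diag(Σ̂)^{−1/2} and checks it against numpy's built-in correlation routine.

```python
import numpy as np

def empirical_correlation(samples):
    """Empirical correlation matrix from an (n, p) sample matrix,
    computed as diag(S)^(-1/2) S diag(S)^(-1/2) for the sample covariance S."""
    S = np.cov(samples, rowvar=False)   # (p, p) sample covariance, 1/(n-1) scaling
    d = 1.0 / np.sqrt(np.diag(S))       # diagonal entries of diag(S)^(-1/2)
    return d[:, None] * S * d[None, :]

rng = np.random.default_rng(0)
x = rng.standard_normal((200, 5))       # n = 200 samples of p = 5 features
R_hat = empirical_correlation(x)
assert np.allclose(np.diag(R_hat), 1.0)
assert np.allclose(R_hat, np.corrcoef(x, rowvar=False))
```

The same routine applied to the two samples yields R̂(X) and R̂(Y).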

Let the operator vec(A) denote the "vectorization" of a p × p matrix A, defined by

vec(A) = (a_{11}, . . . , a_{p1}, a_{12}, . . . , a_{p2}, . . . , a_{1p}, . . . , a_{pp})′.

In other words, the vectorization of A stacks the columns of A into a p²-dimensional column vector. Since our goal is to develop test statistics for each of the p features, we are interested in examining columns of the correlation matrices. Since the diagonal of a correlation matrix is always one, we ignore it, and so henceforth let R_j(X) and R_j(Y) denote the jth columns of R(X) and R(Y) with the jth element removed.

2.1 A test of H_0j : R_j(X) = R_j(Y)

In order to develop a test for H_0j : R_j(X) = R_j(Y), we extend a classical Wald-type approach for testing H_0 : R(X) = R(Y), which relies on the following result (Neudecker and Wesselman, 1990).

LEMMA 1 Suppose that X has finite fourth moments, that is, E(X_i X_j X_k X_h) < ∞ for all 1 ≤ i, j, k, h ≤ p. Then

√n (vec(R̂(X)) − vec(R(X))) →_d N_{p²}(0, Ω(X)) as n → ∞,


where Ω(X) = ∇(vec(R(X))) V(X) ∇(vec(R(X)))′, ∇ denotes the gradient with respect to vec(Σ(X)), and V(X) = E(XX′ ⊗ XX′) − vec(Σ(X))vec(Σ(X))′. Furthermore, if X is multivariate normal,

Ω(X)_{ji,kh} ≡ Ω(X)_{(j−1)p+i,(h−1)p+k} = ρ_{jk}ρ_{ih} + ρ_{jh}ρ_{ik} − ρ_{kh}(ρ_{jk}ρ_{ik} + ρ_{jh}ρ_{ih}) − ρ_{ji}(ρ_{jk}ρ_{jh} + ρ_{ik}ρ_{ih}) + (1/2)ρ_{ji}ρ_{kh}(ρ²_{jk} + ρ²_{jh} + ρ²_{ik} + ρ²_{ih})    (2.1)

for 1 ≤ i, j, k, h ≤ p, where ρ_{ik} denotes the (i, k)th element of R(X).

Using Lemma 1, it can be shown that

√n (R̂_j(X) − R_j(X)) →_d N_{p−1}(0, Ω_{j·,·j}(X)) and √m (R̂_j(Y) − R_j(Y)) →_d N_{p−1}(0, Ω_{j·,·j}(Y)),

where Ω_{j·,·j}(X) and Ω_{j·,·j}(Y) are the asymptotic covariance matrices corresponding to R̂_j(X) and R̂_j(Y), respectively. Due to the independence of the samples from each condition, it follows that

√(nm/(n + m)) Ω̂_{j·,·j}(X, Y)^{−1/2} ((R̂_j(X) − R_j(X)) − (R̂_j(Y) − R_j(Y))) →_d N_{p−1}(0, I),

where Ω̂_{j·,·j}(X, Y) = (m/(n + m)) Ω̂_{j·,·j}(X) + (n/(n + m)) Ω̂_{j·,·j}(Y), and Ω̂_{j·,·j}(X) and Ω̂_{j·,·j}(Y) are consistent estimators of Ω_{j·,·j}(X) and Ω_{j·,·j}(Y), respectively. This result motivates our proposed test statistic for H_0j : R_j(X) = R_j(Y),

T_j(X, Y) = (nm/(n + m)) (R̂_j(X) − R̂_j(Y))′ Ω̂_{j·,·j}(X, Y)^{−1} (R̂_j(X) − R̂_j(Y)).    (2.2)
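To make the construction of T_j concrete, here is a minimal numpy/scipy sketch for the multivariate-normal case, in which the asymptotic covariance is estimated by plugging the sample correlations into (2.1). The function names (`omega_normal`, `wald_stat`) are ours, and a careful implementation would also support the general fourth-moment estimator; this is a sketch, not the paper's code.

```python
import numpy as np
from scipy.stats import chi2

def omega_normal(r, j, i, k, h):
    """One entry of the asymptotic covariance of vec(R-hat) under normality,
    i.e. equation (2.1) evaluated at the (estimated) correlation matrix r."""
    return (r[j, k] * r[i, h] + r[j, h] * r[i, k]
            - r[k, h] * (r[j, k] * r[i, k] + r[j, h] * r[i, h])
            - r[j, i] * (r[j, k] * r[j, h] + r[i, k] * r[i, h])
            + 0.5 * r[j, i] * r[k, h]
              * (r[j, k]**2 + r[j, h]**2 + r[i, k]**2 + r[i, h]**2))

def wald_stat(x, y, j):
    """Test statistic T_j(X, Y) of (2.2) for H_0j : R_j(X) = R_j(Y),
    with a normal-theory plug-in covariance estimator."""
    n, p = x.shape
    m, _ = y.shape
    Rx = np.corrcoef(x, rowvar=False)
    Ry = np.corrcoef(y, rowvar=False)
    others = [i for i in range(p) if i != j]
    d = Rx[others, j] - Ry[others, j]          # difference of jth correlation columns
    Ox = np.array([[omega_normal(Rx, j, i, j, h) for h in others] for i in others])
    Oy = np.array([[omega_normal(Ry, j, i, j, h) for h in others] for i in others])
    O = (m / (n + m)) * Ox + (n / (n + m)) * Oy   # pooled covariance estimate
    return (n * m / (n + m)) * d @ np.linalg.solve(O, d)

rng = np.random.default_rng(1)
x = rng.standard_normal((500, 6))
y = rng.standard_normal((500, 6))
T0 = wald_stat(x, y, 0)
# under H_0j, T_j is approximately chi-squared with p - 1 = 5 degrees of freedom
p_value = chi2.sf(T0, df=5)
```

Comparing T0 to the χ²_{p−1}(1 − α) quantile then reproduces the decision rule of the test φ_j(α) below.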

PROPOSITION 1 Suppose that X and Y have finite fourth moments, that X ⊥ Y, that R_j(X) = R_j(Y), and that m/n → λ as n, m → ∞ for some finite constant λ. Then

T_j(X, Y) →_d χ²_{p−1}  as n, m → ∞.

The proof of Proposition 1 is given in the supplementary material available at Biostatistics online. Proposition 1 implies that the type I error rate of the following test of H_0j : R_j(X) = R_j(Y),

φ_j(α) = 1 if T_j(X, Y) > χ²_{p−1}(1 − α), and 0 otherwise,    (2.3)

is controlled at level α asymptotically. In a simulation study in the supplementary materials (available at Biostatistics online), we investigate the finite-sample type I error rate control of φ_j(α) in (2.3). Given (2.3), one can simultaneously test H_01, . . . , H_0p, using a multiplicity correction to address the problem of multiple comparisons. However, as we will see in the next section, this approach does not


achieve the goal of our paper, as described in Section 1: to identify a minimally dysregulated set of features.

2.2 Motivation for a different approach

We now argue that simultaneously testing H_0j : R_j(X) = R_j(Y) for j = 1, . . . , p can be problematic, motivating the need for a different approach. Consider a simplified version of the motivating example in Figure 1, where the p × p matrices R(X) and R(Y) have (j, k)th elements

R(X)_{jk} = 1 if j = k; b if j = 1 and 2 ≤ k ≤ p/2, or k = 1 and 2 ≤ j ≤ p/2; a otherwise,    (2.4)

R(Y)_{jk} = 1 if j = k; c if j = 1 and 2 ≤ k ≤ p/2, or k = 1 and 2 ≤ j ≤ p/2; a otherwise,    (2.5)

where a = 0.3, b = 0.3, and c = 0.1. That is, the correlation between the first feature and each of features 2, . . . , p/2 equals b under the first condition and c under the second, and all other off-diagonal correlations equal a; R(X) and R(Y) are identical except in the first half of the first row/column. We considered p = 20. For n = m = 200 and n = m = 500, we sampled from N_20(0, R(X)) and N_20(0, R(Y)), and computed T_j(X, Y) in (2.2) for j = 1, . . . , 20. This was repeated 10 000 times. The distributions of T_1(X, Y), T_2(X, Y), and T_{11}(X, Y) are shown in Figure 2. (Test statistics T_j(X, Y) for other values of j are not shown, because they are identical in distribution to either T_2(X, Y) or T_{11}(X, Y).) Other values of a, b, and c yielded similar results.

The left-hand panels of Figure 2 indicate that φ_1(α) almost always rejects H_01, as expected. The right-hand panels of Figure 2 indicate that T_{11}(X, Y) has approximately a χ²_{p−1} distribution, as suggested by Proposition 1, and hence that φ_{11}(α) has a well-controlled type I error rate. However, the center panels of Figure 2 indicate that T_2(X, Y) does not have a χ²_{p−1} distribution. This becomes particularly pronounced as the sample size increases. Indeed, H_02 does not hold, since the (1, 2) elements of R(X) and R(Y) are unequal in (2.4) and (2.5). Consequently, in this simulation study, H_02, H_03, . . . , H_{0,p/2} will tend to be rejected when the sample size is sufficiently large, even though the


Fig. 2. The histograms illustrate the distribution of T_j(X, Y) for j = 1, 2, 11 in the simulation study from Section 2.2. The dashed lines represent a χ²_{p−1} distribution, the asymptotic distribution of T_j(X, Y) under H_0j : R_j(X) = R_j(Y) according to Proposition 1. The gray shaded regions indicate the proportion of test statistics that would result in rejecting the null hypothesis H_0j under (2.3). (a) n = m = 200, (b) n = m = 500.

differences between R(X) and R(Y) can be fully attributed to differences in the first feature’s correlations across the two conditions. Our goal was to identify only the first feature as differing across the two conditions. This simulation study reveals that simultaneous tests of H01 , . . . , H0 p are flawed, essentially because each correlation involves a pair of features, and thus any difference in correlation across two conditions necessarily implicates at least two features. To accomplish our goal, we must develop a set of null hypotheses such that, in this example, only a single null hypothesis corresponding to the first feature is violated. This motivates our proposed approach: a serial testing procedure.
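The structured matrices (2.4) and (2.5) are easy to construct programmatically. The sketch below (the helper name `block_correlation` is ours) builds them for p = 20 and confirms both that only the first half of the first row/column differs and that each matrix is a valid (positive definite) correlation matrix.

```python
import numpy as np

def block_correlation(p, a, first_row_val):
    """Correlation matrix with 1 on the diagonal, `first_row_val` linking
    feature 1 to features 2, ..., p/2, and `a` for all other pairs,
    as in (2.4) and (2.5)."""
    R = np.full((p, p), a)
    np.fill_diagonal(R, 1.0)
    half = p // 2
    R[0, 1:half] = first_row_val   # entries (1, k) for k = 2, ..., p/2
    R[1:half, 0] = first_row_val   # and their symmetric counterparts
    return R

p, a, b, c = 20, 0.3, 0.3, 0.1
RX = block_correlation(p, a, b)    # equals the equicorrelation matrix, since b = a
RY = block_correlation(p, a, c)    # differs only in the first half of row/column 1
diff = RX - RY
assert np.count_nonzero(diff) == 2 * (p // 2 - 1)   # only row/column 1 differs
# both are valid correlation matrices (symmetric positive definite)
assert np.all(np.linalg.eigvalsh(RX) > 0) and np.all(np.linalg.eigvalsh(RY) > 0)
```

Samples from N_20(0, RX) and N_20(0, RY) can then be drawn with `rng.multivariate_normal` to reproduce the simulation set-up of Section 2.2.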

3. A SERIAL APPROACH

We just saw that tests of H_0j : R_j(X) = R_j(Y) fail to identify only the first feature as different across conditions in the setting of Figure 1. Therefore, they are not appropriate for the goal laid out in Section 1, which is to identify M, the minimal set of features that account for the differences between R(X) and R(Y). We now consider a new hypothesis testing framework. Instead of simultaneously testing the p null hypotheses H_0j in (1.1) for j = 1, . . . , p, we propose to test the series of null hypotheses H_0^k in (1.2) for k = 0, . . . , p − 2. The null hypotheses H_0^0, . . . , H_0^{p−2} are closely related to the cardinality of M: if H_0^0, . . . , H_0^{k−1} do not hold and H_0^k does hold, then |M| = k. In Section 3.1, we propose a test of H_0^k, which we call φ^k(α). Then in Section 3.2, we apply serial tests of φ^k(α) in order to estimate M. A comparison to competing approaches is in Section 3.3.


3.1 φ^k(α): A test of H_0^k

Suppose that H_0^k in (1.2) holds. Then there exists a set C of size k such that R(X_{−C}) = R(Y_{−C}). Hence the test statistics T_j(X_{−C}, Y_{−C}) for j ∉ C are asymptotically χ²_{p−k−1}. Consider the test

φ^k(α) = 1 if min_{C ∈ C(p,k)} max_{j ∉ C} {T_j(X_{−C}, Y_{−C})} > χ²_{p−k−1}(1 − α/(p − k)), and 0 otherwise.    (3.1)

PROPOSITION 2 Suppose the conditions of Proposition 1 and H_0^k hold. Then

E(φ^k(α)) ≤ α  as n, m → ∞.

The proof of Proposition 2 is given in the supplementary material (available at Biostatistics online). Proposition 2 implies that, under H_0^k, the test φ^k(α) controls type I error at a level ≤ α. A simulation study of the performance of φ^k(α) under H_0^k is presented in the supplementary materials (available at Biostatistics online).

3.2 An estimator of M

Now suppose that we perform serial tests of φ^k(α) for k = 0, . . . , p − 2. Suppose we reject H_0^0, . . . , H_0^{k−1}, but do not reject H_0^k. Then

M̂ = argmin_{C ∈ C(p,k)} max_{j ∉ C} {T_j(X_{−C}, Y_{−C})}    (3.2)

is a natural estimator of M from Definition 2: our failure to reject H_0^k supports condition 1 of Definition 2, and our rejection of H_0^0, . . . , H_0^{k−1} supports condition 2 of Definition 2. We will now show that |M̂| ≤ |M| with high probability.

PROPOSITION 3 Suppose the conditions of Proposition 1 hold. Then

P(|M̂| > |M|) ≤ α  as n, m → ∞.
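The serial procedure and the resulting estimator (3.2) can be sketched by brute force as follows. The function name `estimate_M` and its callable argument `wald_stat` are our naming, not the paper's: `wald_stat(x, y, j)` is assumed to compute the statistic T_j(X, Y) of (2.2) on the supplied columns, and the Bonferroni-adjusted χ² threshold follows (3.1).

```python
import numpy as np
from itertools import combinations
from scipy.stats import chi2

def estimate_M(x, y, wald_stat, alpha=0.05):
    """Serial tests of H_0^k for k = 0, 1, ...: the first k at which
    phi^k(alpha) fails to reject yields M-hat, the minimizing set in (3.2)."""
    p = x.shape[1]
    for k in range(p - 1):                      # k = 0, ..., p - 2
        best_C, best_stat = None, np.inf
        for C in combinations(range(p), k):
            keep = [j for j in range(p) if j not in C]
            # largest column-wise statistic once the features in C are removed
            stat = max(wald_stat(x[:, keep], y[:, keep], j) for j in range(p - k))
            if stat < best_stat:
                best_stat, best_C = stat, C
        # fail to reject H_0^k iff some C leaves only small statistics (3.1)
        if best_stat <= chi2.ppf(1 - alpha / (p - k), df=p - k - 1):
            return set(best_C)
    return set(range(p))  # every H_0^k rejected: estimate is all features
```

Because every C ∈ C(p, k) is enumerated, the cost grows rapidly with k; in practice the inner loop over C would be parallelized, as the paper notes in Section 4.2.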

The proof of Proposition 3 is given in the supplementary material (available at Biostatistics online).

3.3 Comparison to competing methods

We compared the accuracy of M̂ in (3.2) as an estimator of M (Definition 2) to the methods proposed in Cai and others (2013) and Hu and others (2010), both of which use a simultaneous (rather than a serial) approach to identify features whose covariances with other features differ across conditions. Since the methods of Cai and others (2013) and Hu and others (2010) are based on covariances rather than correlations, we employ a covariance version of our proposal (see the supplementary material, available at Biostatistics online, for details). The method of Cai and others (2013) simultaneously tests the null hypotheses H_0j : Σ_j(X) = Σ_j(Y) for j = 1, . . . , p. The test statistic for H_0j is the squared, standardized maximum value of Σ̂_j(X) − Σ̂_j(Y). We conclude that a given feature is in the estimate of M by Cai and others (2013) if its test statistic exceeds some threshold.


Fig. 3. Four scenarios for R(Y) considered in the simulation study in Section 3.3. Each subfigure represents a 20 × 20 matrix. Black entries correspond to values of 0.1, for which R(Y) differs from R(X). White off-diagonal entries correspond to values of 0.3, for which elements of R(Y) are equal to R(X). (a) Scenario A, (b) Scenario B, (c) Scenario C, (d) Scenario D.

The method of Hu and others (2010) tests for equality of the joint distribution of "covariance distances" for a given feature across the two conditions (ignoring a fixed number of covariance distances, specified by the "trim number"). We conclude that a given feature is in the estimate of M by Hu and others (2010) if the resampling-based p-value falls below some threshold. We perform the method of Hu and others (2010) with trim numbers 0 and p/2. We simulated data where M = {1, 2, 3, 4, 5}, p = 20, and n = m = 500. Each sample was drawn independently from a N_20(0, R(X)) or a N_20(0, R(Y)) distribution. We fixed R(X) to be the equicorrelation matrix with unit diagonal and all off-diagonal elements equal to 0.3, and considered one of four possible scenarios for R(Y), illustrated in Figure 3. Our proposed approach, the method of Cai and others (2013), and that of Hu and others (2010) were applied in order to obtain estimates of M. For each method, the cardinality of the estimate of M was varied (for instance, in our method, this corresponds to the level α; in the method of Cai and others (2013), it corresponds to a threshold for the test statistic). The results, averaged over 1000 replications, are shown in Figure 4. The results corroborate Proposition 3, in that our method yields an average value of |M̂| no greater than 5 for all values of α considered. While our proposed approach does at least as well as the competitors in all four scenarios displayed in Figure 4, it does particularly well (relative to competitors) in Scenarios A, C, and D. Recall from Figure 3 that in Scenarios A, C, and D, the features in M have differential correlations with many other features, including features in M^c. Consequently, the serial approach taken by our proposal is key to identifying the correct set of features in M, as described in Section 2.2. In contrast, in Scenario B, the features in M have differential correlations only with other features in M; consequently, in this scenario, there is no need for a serial approach: a simultaneous approach will perform just as well. In fact, we see that in Scenario B, all approaches have comparable performance.

4. APPLICATION TO GENE EXPRESSION DATA

In this section, we examine two gene expression datasets. These datasets are high-dimensional, leading to challenges for our proposal: (1) Propositions 2 and 3 require p ≪ n, m; and (2) for large p, tests

[Figure 4 appears here: four panels (a)–(d), one per scenario. Each panel plots the number of correct features in the estimate of M (y-axis, 0 to 5) against the number of features in the estimate of M (x-axis, 0 to 20), for the proposed approach with α = 0.001, 0.05, 0.2, for Hu 2010 with trim = 0 and trim = p/2, and for Cai 2013.]

Fig. 4. Results from the simulation study in Section 3.3, averaged over 1000 simulated datasets. The x-axis displays the cardinality of the estimate of M, and the y-axis displays the number of true positives, i.e. the number of features in the estimate of M that are truly in M. For reference, the bounding triangle indicates two methods: the NW boundary is an idealized method that always correctly selects features in M, and the SE boundary indicates a method that selects features at random. (a) Scenario A, (b) Scenario B, (c) Scenario C, (d) Scenario D.

of H_0^k (1.2, 3.1) become computationally intractable unless k is quite small. Hence a screening procedure to reduce dimensionality is necessary. We consider three screening approaches in Sections 4.1–4.3, respectively.

4.1 Screening based on scientific knowledge

We first consider a gene expression dataset that consists of 11 861 gene expression measurements from 220 tissue samples taken from patients with one of four subtypes of glioblastoma multiforme (GBM; Verhaak and others, 2010). We compare the two GBM subtypes with the largest sample sizes, n = 53


(proneural) and m = 56 (mesenchymal). In order to reduce dimensionality, we restrict attention to 34 genes involved in TCR signaling, as was done in Mohan and others (2014). At level α = 0.05, we estimate M̂ = {NFKBIA, CHUK, TRA@, UBE2N, RIPK2}. Interestingly, NFKBIA is a tumor suppressor gene. There is known to be enrichment of single-nucleotide polymorphisms and haplotypes of NFKBIA in Hodgkin's lymphoma, colorectal cancer, melanoma, hepatocellular carcinoma, breast cancer, and multiple myeloma. Furthermore, it has been reported that NFKBIA tends to be deleted in glioblastomas (Bredel and others, 2011).

4.2 Unsupervised screening

We now consider a gene expression dataset that consists of 12 600 gene expression measurements from n = 50 normal (X) and m = 52 tumor (Y) prostate gland specimens from patients undergoing radical prostatectomy (Singh and others, 2002). We reduce dimensionality using an unsupervised screening approach: we restrict our analysis to the p = 20, 25, 30, and 35 genes with the highest marginal variance. This screening approach has substantial precedent in the gene expression literature, in which it is often assumed that high-variance genes are more scientifically interesting than low-variance genes. Because the screening does not make use of the class labels (normal versus tumor), the resulting test of H_0^k yields valid statistical inference (Bourgon and others, 2010).

Figure 5 displays R̂(X) − R̂(Y) for the 35 highest-variance genes. Estimates of M̂ for α = 10^{−5} and α = 10^{−10} are reported in Figure 5. We considered very small α in order to avoid estimating M to contain all of the features. It is possible that the cancer and normal tissue tend to have substantially different gene expression, but also likely that early microarray studies such as Singh and others (2002) tend to suffer from batch effects.

As expected, M̂ depends on the set of genes considered. For example, for α = 10^{−10}, the 13th highest-variance gene is in M̂ when p = 20; however, this gene is not in M̂ when we take p = 35. By design, as α decreases, M̂ decreases. As an example of the interpretation of M̂, consider the result when p = 20 and α = 10^{−10}. We rejected H_0^0 and H_0^1 at level α, but failed to reject H_0^2 at level α (a conservative estimate of the p-value corresponding to this hypothesis is 3 × 10^{−10}). Consequently, |M̂| = 2, and M̂ = {13, 15}. This suggests that there are (very conservatively) at least two minimally dysregulated genes among the top 20 highest-variance genes.

Calculating M̂ is computationally intensive. Consider the case of p = 35 and α = 10^{−5}, displayed in the last row of the left-hand table in Figure 5. In this example, |M̂| = 8. Therefore, in order to obtain these results, we performed tests of H_0^k for 0 ≤ k ≤ 8. The test of H_0^k requires computing the test statistic T_j(X_{−C}, Y_{−C}) for j = 1, . . . , p − k and C ∈ C(p, k). Hence, in order to test H_0^k for all k ≤ 8, we computed ∑_{k=0}^{8} |C(20, k)| × (20 − k) = 1 853 460 test statistics. Had we chosen a much larger value of α, then M̂ might have contained many more genes, which would have substantially increased the necessary computations. Fortunately, computation of these test statistics can be easily parallelized.
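The combinatorial cost of the serial procedure is easy to tabulate. The helper below (`n_test_statistics` is our name; the exact count in a real implementation depends on details such as caching) counts the column-wise statistics needed to test H_0^k for all k up to a given bound, illustrating why a small |M| keeps the computation feasible.

```python
from math import comb

def n_test_statistics(p, k_max):
    """Number of Wald statistics needed to test H_0^k for all k <= k_max:
    for each k, every C in C(p, k) requires p - k column-wise statistics."""
    return sum(comb(p, k) * (p - k) for k in range(k_max + 1))

# the count grows combinatorially in k_max
costs = [n_test_statistics(20, k) for k in (2, 4, 8)]
```

For example, `n_test_statistics(20, 0)` is 20 (one statistic per feature at k = 0) and `n_test_statistics(20, 1)` is 400.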

4.3 Supervised screening We once again consider the prostate cancer gene expression dataset of Section 4.2; this time, however, we take a supervised screening approach. In order to implement this approach, we split the observations into a training set and a test set of equal size. We perform screening on the training set in order to obtain a small set of features, and estimate M on the test set using that small feature set. This training/test set approach is needed in order for our estimate of M to retain the asymptotic properties established in Section 3.

Fig. 5. R̂(X) − R̂(Y), and the set of genes in M̂, when we restrict analysis to the genes with the highest variance in the prostate cancer dataset. The rows/columns of the matrix are in order of complete linkage clustering, using Euclidean distance. The gene ranking (highest to lowest variance) is indicated on the right-hand side of the matrix. The two tables list the genes in M̂ for α = 10^{−5} (left) and α = 10^{−10} (right).

To perform supervised screening, we compute ‖R̂_j(X) − R̂_j(Y)‖₂ for all 1 ≤ j ≤ 12 600 in the training set. We then restrict attention to the p = 30 genes for which this quantity is largest. These 30 genes are displayed in Figure 6. Next, we estimate M on the test set, using only these 30 genes. For α = 10^{−10}, we rejected H_0^k for k = 0, . . . , 4, and failed to reject H_0^5; therefore, |M̂| = 5, and M̂ = {2, 14, 15, 25, 29}. This suggests that there are five minimally dysregulated genes among the 30 genes considered. It is not surprising that, for a given value of α and p, the supervised screening approach results in a higher value of |M̂| than the unsupervised screening approach: this makes sense because the supervised screening approach selects genes that are dysregulated in the training set. Note that the actual genes in M̂ cannot be directly compared across the supervised and unsupervised screening approaches, as the sets of genes used in the two approaches are largely non-overlapping.
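The supervised screening statistic can be computed for all genes at once. A short sketch (the function name `screening_scores` is ours) scores each feature by ‖R̂_j(X) − R̂_j(Y)‖₂ on the training set and keeps the top 30.

```python
import numpy as np

def screening_scores(x, y):
    """l2-norm of the difference in sample correlation vectors,
    ||R_j(X)-hat - R_j(Y)-hat||_2, for each feature j (diagonal excluded)."""
    Rx = np.corrcoef(x, rowvar=False)
    Ry = np.corrcoef(y, rowvar=False)
    D = Rx - Ry
    np.fill_diagonal(D, 0.0)            # diagonal difference is always zero anyway
    return np.linalg.norm(D, axis=0)    # one score per column j

rng = np.random.default_rng(2)
x = rng.standard_normal((50, 100))      # stand-in for the training-set X samples
y = rng.standard_normal((52, 100))      # stand-in for the training-set Y samples
scores = screening_scores(x, y)
top = np.argsort(scores)[::-1][:30]     # indices of the p = 30 highest-scoring genes
```

Estimation of M would then be carried out on the held-out test set using only the columns in `top`, preserving the validity of the inference.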

Fig. 6. R̂(X) − R̂(Y) for the 30 genes with the largest sample correlation difference across conditions in the prostate cancer dataset. The rows/columns of the matrix are in order of complete linkage clustering, using Euclidean distance. The gene ranking (highest to lowest ℓ₂-norm of the difference in sample correlation vectors) is indicated on the right-hand side of the matrix.

independently of mean differences, and selection methods based on correlation differences have been proposed in the statistical literature (Hu and others, 2010; Cai and others, 2013). But the previously proposed methods test each feature simultaneously, which can be problematic in a scenario where just a small number of features have differential correlations with many other features across the two conditions. Instead, we proposed a serial testing approach, which overcomes the problems associated with simultaneous testing.  of M. Our proposal builds upon classical Wald-type tests In this article, we propose the estimator M of H0 : R(X) = R(Y). We control the asymptotic type I error rate, and demonstrate desirable performance in a variety of simulation settings. Specifically, when non-zero values of R(X) − R(Y) are concentrated in certain rows/columns, our approach outperforms proposals that are based on simultaneous hypothesis tests (Hu and others, 2010; Cai and others, 2013). We restrict attention to the low-dimensional setting, in which the numbers of observations in each group, n and m, exceed the number of features, p, for two reasons: (1) The asymptotic results on type I error control required p  n, m. In order to move into the highdimensional setting, we would need to consider an alternative to T j (2.2) for use in defining φ k (α)  For example, the proposal of Cai and others (2013) could possibly be extended. and M. (2) For large p, tests of H0k in (1.2) become computationally intractable as k increases. We note that if |M| is small, then typically computations are not a concern, as we must only test H0k for k increasing until we fail to reject some null hypothesis. Furthermore, for a given value of k, the computations involved in testing H0k can be easily parallelized. Future work could involve developing an alternative to considering all C( p, k) combinations in order to test H0k . 
For instance, to test H0k, we could consider the C(p − k + 1, 1) combinations that result from serially removing the feature j that corresponds to the largest T_j.
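This serial idea can be written as a greedy loop: test the global null on the remaining features, and while it is rejected, drop the single feature with the largest column statistic. The following is a hedged sketch, not the paper's exact procedure: `reject` is a placeholder for the Wald-type test decision φ_k(α), and T_j is illustrated here by the ℓ2-norm of column j of the sample correlation difference, a stand-in for statistic (2.2).

```python
import numpy as np

def serial_screen(X, Y, reject):
    """Greedy serial testing sketch. `reject(D)` stands in for the
    Wald-type test decision on the remaining features; T_j is illustrated
    by the l2-norm of column j of the sample correlation difference D,
    not by the paper's statistic (2.2)."""
    keep = list(range(X.shape[1]))
    removed = []
    while len(keep) > 2:
        # Sample correlation difference restricted to remaining features.
        D = (np.corrcoef(X[:, keep], rowvar=False)
             - np.corrcoef(Y[:, keep], rowvar=False))
        if not reject(D):
            break  # fail to reject: stop removing features
        T = np.linalg.norm(D, axis=0)  # column statistics T_j
        j = int(np.argmax(T))          # most differentially correlated
        removed.append(keep.pop(j))
    return removed  # plays the role of the estimate of the culprit set M
```

The returned list plays the role of M̂; a real implementation would replace `reject` with a calibrated test of H0 : R(X) = R(Y) on the remaining features.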


In Section 4, we show how to apply our proposal to high-dimensional data by screening to reduce the number of features. We leave the challenging task of studying the sampling variability of M̂ to future work. In addition, further research could focus on relaxing the conservativeness of φ_k(α) that results from the use of the union inequality in the proof of Proposition 2.

SUPPLEMENTARY MATERIAL

Supplementary material is available at http://biostatistics.oxfordjournals.org.

ACKNOWLEDGEMENTS

We thank Palma London and Su-In Lee for cleaning, annotating, and sharing with us the GBM gene expression dataset studied in Section 4.1.

FUNDING

D.W.'s work was supported in part by NIH DP5OD009145, NSF CAREER DMS-1252624, and a Sloan Research Fellowship. X.-H.Z.'s work was supported in part by the U.S. Department of Veterans Affairs, Veterans Affairs Health Administration Research Career Scientist Award (RCS 05-196).

REFERENCES

AMAR, D., SAFER, H. AND SHAMIR, R. (2013). Dissection of regulatory networks that are altered in disease via differential co-expression. PLoS Computational Biology 9(3), e1002955.
BARTLETT, M. (1937). Properties of sufficiency and statistical tests. Proceedings of the Royal Statistical Society Series A 160, 268–282.
BOCKMAYR, M., KLAUSCHEN, F., GYORFFY, B., DENKERT, C. AND BUDCZIES, J. (2013). New network topology approaches reveal differential correlation patterns in breast cancer. BMC Systems Biology 7, 78.
BOURGON, R., GENTLEMAN, R. AND HUBER, W. (2010). Independent filtering increases detection power for high-throughput experiments. Proceedings of the National Academy of Sciences 107(21), 9546–9551.
BREDEL, M., SCHOLTENS, D. M., YADAV, A. K., ALVAREZ, A. A., RENFROW, J. J., CHANDLER, J. P., YU, I. L. Y., CARRO, M. S., DAI, F., TAGGE, M. J. and others (2011). NFKBIA deletion in glioblastomas. New England Journal of Medicine 364(7), 627–637.
CAI, T., LIU, W. AND XIA, Y. (2013). Two-sample covariance matrix testing and support recovery in high-dimensional and sparse settings. Journal of the American Statistical Association 108(501), 265–277.
COLE, N. (1968). The likelihood ratio test of the equality of correlation matrices. Technical Report No. 65.
DANAHER, P., WANG, P. AND WITTEN, D. (2014). The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society, Series B 76(2), 373–397.
DETTLING, M., GABRIELSON, E. AND PARMIGIANI, G. (2005). Searching for differentially expressed gene combinations. Genome Biology 6, R88.
GILL, R., DATTA, S. AND DATTA, S. (2010). A statistical framework for differential network analysis. BMC Bioinformatics 11, 95.
GUO, J., LEVINA, E., MICHAILIDIS, G. AND ZHU, J. (2011). Joint estimation of multiple graphical models. Biometrika 98(1), 1–15.
HO, Y. Y., COPE, L., DETTLING, M. AND PARMIGIANI, G. (2007). Statistical methods for identifying differentially expressed gene combinations. Methods in Molecular Biology 408, 171–191.


HU, R., QIU, X. AND GLAZKO, G. (2010). A new gene selection procedure based on the covariance distance. Bioinformatics 26(3), 348–354.
JENNRICH, R. (1970). An asymptotic χ2 test for the equality of two correlation matrices. Journal of the American Statistical Association 65, 904–912.
KULLBACK, S. (1967). On testing correlation matrices. Applied Statistics 16, 80–85.
LAI, Y. L., WU, B., CHEN, L. AND ZHAO, H. Y. (2004). Statistical method for identifying differential gene-gene co-expression patterns. Bioinformatics 20(17), 3146–3155.
LARNTZ, K. AND PERLMAN, M. (1985). A simple test for the equality of correlation matrices. Technical Report, Department of Statistics, University of Washington, 141.
LAYARD, M. (1972). Large sample tests for the equality of two covariance matrices. Annals of Mathematical Statistics 43, 149–151.
LEDOIT, O. AND WOLF, M. (2002). Some hypothesis tests for the covariance matrix when the dimension is large compared to the sample size. The Annals of Statistics 30, 1081–1102.
LI, W. AND QIN, Y. (2014). Hypothesis testing for high-dimensional covariance matrices. Journal of Multivariate Analysis 128, 108–119.
MODARRES, R. AND JERNIGAN, R. W. (1992). Testing the equality of correlation matrices. Communications in Statistics - Theory and Methods 21(8), 2107–2125.
MOHAN, K., CHUNG, M., HAN, S., WITTEN, D., LEE, S. I. AND FAZEL, M. (2012). Structured learning of Gaussian graphical models. Advances in Neural Information Processing Systems, 620–628.
MOHAN, K., LONDON, P., FAZEL, M., WITTEN, D. AND LEE, S.-I. (2014). Node-based learning of multiple Gaussian graphical models. The Journal of Machine Learning Research 15(1), 445–488.
MUIRHEAD, R. (1982). Aspects of Multivariate Statistical Theory. New York, NY: Wiley.
NEUDECKER, H. AND WESSELMAN, A. M. (1990). The asymptotic variance matrix of the sample correlation matrix. Linear Algebra and its Applications 127, 589–599.
SATORRA, A. AND NEUDECKER, H. (1997). Compact matrix expressions for generalized Wald tests of equality of moment vectors. Journal of Multivariate Analysis 63, 259–276.
SCHOTT, J. R. (2007). A test for the equality of covariance matrices when the dimension is large relative to the sample size. Computational Statistics and Data Analysis 51, 6535–6542.
SEBER, G. (1984). Multivariate Observations. USA: John Wiley & Sons.
SHEDDEN, K. AND TAYLOR, J. (2004). Differential correlation detects complex association between gene expression and clinical outcomes in lung adenocarcinomas. Methods in Microarray Data Analysis 4, 121–132.
SINGH, D., FEBBO, P. G., ROSS, K., JACKSON, D. G., MANOLA, J., LADD, C., TAMAYO, P., RENSHAW, A. A., D'AMICO, A. V., RICHIE, J. P. and others (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1(2), 203–209.
VERHAAK, R. G. W., HOADLEY, K. A., PURDOM, E., WANG, V., QI, Y., WILKERSON, M. D., MILLER, C. R., DING, L., GOLUB, T., MESIROV, J. P. and others (2010). Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 17(1), 98–110.
WILKS, S. (1932). Certain generalisations in the analysis of variance. Biometrika 24, 471–494.
XIAO, Y. H., FRISINA, R., GORDON, A., KLEBANOV, L. AND YAKOVLEV, A. (2004). Multivariate search for differentially expressed gene combinations. BMC Bioinformatics 5, 164.

[Received November 19, 2014; revised November 28, 2015; accepted for publication January 12, 2016]
