Stat. Appl. Genet. Mol. Biol. 2015; aop

Zhiyi Zhang* and Lukun Zheng

A mutual information estimator with exponentially decaying bias

Abstract: A nonparametric estimator of mutual information is proposed and is shown to have asymptotic normality and efficiency, and a bias decaying exponentially in sample size. The asymptotic normality and the rapidly decaying bias together offer a viable inferential tool for assessing mutual information between two random elements on finite alphabets, where the maximum likelihood estimator of mutual information greatly inflates the probability of type I error. The proposed estimator is illustrated by three examples in which the association between a pair of genes is assessed based on their expression levels. Several results of a simulation study are also provided.

Keywords: asymptotic normality; bias; mle; mutual information; nonparametric estimator.

AMS 2000 Subject Classifications: Primary 62F10; 62F12; 62G05; 62G20.

DOI 10.1515/sagmb-2014-0047

1 Introduction and summary

Mutual information is a measure of association between two random elements under a joint probability distribution. It is an important measure because it captures general associations (as opposed to linear correlations) of random elements on general alphabets (as opposed to the real line where random variables reside). With the rapid expansion of data volume and complexity in the digital age, mutual information has become an increasingly useful quantity to be assessed across many fields of science, for example, word types in text, genotypes in genetics, species in a biological environment, etc. However, nonparametric estimation of mutual information and the statistical inferences based on such estimation are technically quite difficult. Many nonparametric estimators have been proposed in the last several decades, but few of them offer carefully developed distributional characteristics. The main purpose of this paper is to put forth a testing procedure based on a nonparametric mutual information estimator with very fast decaying bias and asymptotic normality.

Let 𝒳 = {x_i; i = 1, ···, K_1} and 𝒴 = {y_j; j = 1, ···, K_2} be two finite alphabets with cardinalities K_1 and K_2, respectively, and let p = {p_{i,j}} be the joint probability distribution of a pair of random elements (X, Y) on 𝒳 × 𝒴, with p_{i,j} ≥ 0 for all 1 ≤ i ≤ K_1 and 1 ≤ j ≤ K_2 and marginal probabilities p_{i,·} = ∑_j p_{i,j} and p_{·,j} = ∑_i p_{i,j}. Let K = ∑_{i,j} 1[p_{i,j} > 0] be the number of positive joint probabilities in {p_{i,j}}. We re-enumerate these K positive probabilities in one sequence and denote it as {p_k; k = 1, ···, K}. Consider two partitions {S_1, ···, S_{K_1}} and {T_1, ···, T_{K_2}} of the set {1, 2, ···, K} such that
1. {p_k; k ∈ S_s} is the collection of all positive probabilities in {p_{i,j}; i = s} for each s, s = 1, ···, K_1; and
2. {p_k; k ∈ T_t} is the collection of all positive probabilities in {p_{i,j}; j = t} for each t, t = 1, ···, K_2.


*Corresponding author: Zhiyi Zhang, Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA, e-mail: [email protected]
Lukun Zheng: Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA


By the construction of the partitions, we have

\[
\sum_{k\in S_i} p_k = p_{i,\cdot} \qquad \text{and} \qquad \sum_{k\in T_j} p_k = p_{\cdot,j}.
\]

Without loss of generality, let us assume that K ∈ S_{K_1} ∩ T_{K_2}. If not, then K ∈ S_{i_0} ∩ T_{j_0} for some i_0 and j_0, and by a re-arrangement of the indices (i, j), K ∈ S_{K_1} ∩ T_{K_2} will be true. Shannon's entropies for X, Y, and X × Y, and the mutual information between X and Y, are defined as

\[
\begin{aligned}
H(X) &= -\sum_i p_{i,\cdot}\ln p_{i,\cdot}, \qquad
H(Y) = -\sum_j p_{\cdot,j}\ln p_{\cdot,j}, \\
H(X,Y) &= -\sum_i\sum_j p_{i,j}\ln p_{i,j} = -\sum_{k=1}^{K} p_k\ln p_k, \\
MI(X,Y) &= H(X) + H(Y) - H(X,Y).
\end{aligned} \tag{1}
\]
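For a computational reading of (1), the entropies and the mutual information can be evaluated directly from a joint probability matrix. The following minimal Python sketch is illustrative only (it is not part of the original development), and the joint distribution used in it is hypothetical.

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum p*ln(p), ignoring zero cells."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(p_joint):
    """MI(X, Y) = H(X) + H(Y) - H(X, Y) for a joint probability matrix."""
    p_x = p_joint.sum(axis=1)          # marginal distribution of X
    p_y = p_joint.sum(axis=0)          # marginal distribution of Y
    return entropy(p_x) + entropy(p_y) - entropy(p_joint.ravel())

# A small hypothetical joint distribution on a 2 x 3 alphabet.
p = np.array([[0.20, 0.15, 0.15],
              [0.10, 0.25, 0.15]])
print(mutual_information(p))           # equals 0 if and only if X and Y are independent
```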



The above indices of Shannon (1948) play an essential role in information theory. An excellent comprehensive discussion of these quantities may be found in Cover and Thomas (2006).

For every pair (i, j), let f_{i,j} be the observed frequency of the random pair (X, Y) taking value (x_i, y_j), where i = 1, ···, K_1 and j = 1, ···, K_2, in an iid sample of size n from 𝒳 × 𝒴 under p; and let p̂_{i,j} = f_{i,j}/n be the corresponding relative frequency. Consequently we write p̂ = {p̂_{i,j}}, p̂_x = {p̂_{i,·}} and p̂_y = {p̂_{·,j}} as the sets of observed joint and marginal relative frequencies. The objective of interest is to estimate the mutual information MI. Perhaps the most intuitive estimator of MI is the plug-in estimator given by

\[
\widehat{MI} = \widehat{MI}(X, Y) = \hat H(X) + \hat H(Y) - \hat H(X, Y), \tag{2}
\]

where Ĥ(X) = −∑_i p̂_{i,·} ln p̂_{i,·}, Ĥ(Y) = −∑_j p̂_{·,j} ln p̂_{·,j}, and Ĥ(X, Y) = −∑_i ∑_j p̂_{i,j} ln p̂_{i,j}.

Since mutual information is a linear combination of three entropies, its estimation is closely related to the estimation of entropy. The literature on entropy estimation is quite extensive; readers may wish to refer to Paninski (2003) for an excellent discussion. For entropy, the plug-in estimator, though asymptotically efficient, is known to have a large bias. For example, for the entropy defined on an alphabet ℒ = {ℓ_k; k = 1, ···, K′} with corresponding probability distribution p = {p_k; k = 1, ···, K′} and observed relative frequencies p̂ = {p̂_k; k = 1, ···, K′} in an iid sample of size n, the bias of Ĥ is, according to Harris (1975),



\[
E(\hat H - H) = -\frac{K'-1}{2n} + \frac{1}{12n^2}\left(1 - \sum_{k=1}^{K'}\frac{1}{p_k}\right) + O(n^{-3}). \tag{3}
\]
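The plug-in estimator (2), and the first-order bias it inherits from (3), are straightforward to compute from an observed joint frequency table. A minimal Python sketch, assuming a hypothetical 3 × 3 table rather than any data set discussed later:

```python
import numpy as np

def entropy_plugin(counts):
    """Plug-in (maximum likelihood) entropy estimate from a vector of counts."""
    n = counts.sum()
    p_hat = counts[counts > 0] / n
    return -np.sum(p_hat * np.log(p_hat))

def mi_plugin(table):
    """Plug-in estimator (2): H^(X) + H^(Y) - H^(X, Y) from a joint frequency table."""
    return (entropy_plugin(table.sum(axis=1))
            + entropy_plugin(table.sum(axis=0))
            - entropy_plugin(table.ravel()))

# Hypothetical 3 x 3 joint frequency table (not from the follicular-lymphoma data).
f = np.array([[12,  3,  0],
              [ 5, 20,  7],
              [ 1,  6, 10]])
n = f.sum()
K, K1, K2 = np.count_nonzero(f), f.shape[0], f.shape[1]
print("plug-in MI:", mi_plugin(f))
# The leading bias of each plug-in entropy is -(K'-1)/(2n) by (3); combining the three
# terms of (2) gives an approximate upward bias of [K - (K1 + K2) + 1]/(2n) for MI.
print("approximate first-order bias:", (K - (K1 + K2) + 1) / (2 * n))
```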

In many practical situations, the sample size n is relatively small compared to the cardinality K′ of the alphabet ℒ, and therefore the bias could be large and decay slowly. There are many proposed estimators in the literature that aim at reducing the bias of Ĥ, e.g., the Miller–Madow estimator of Miller (1955) and the jackknife estimator proposed by Efron and Stein (1981). While these bias-adjusted estimators reduce the bias to some extent, none of them can remove the dominant O(n^{-2}) term, which could be formidably large if p_k > 0 is small for some k. Furthermore, bias adjustments often bring difficulty in deriving distributional characteristics of an estimator, which in turn makes statistical inference more difficult.

The bias of such an entropy estimator could be greatly exacerbated when it is used as a part of an MI estimator. For example, by (2) and (3), the bias of the plug-in estimator of MI is

\[
E(\widehat{MI} - MI) = \frac{K - (K_1 + K_2) + 1}{2n} + O(n^{-2}),
\]

and, when K is of the order of K_1 × K_2, this could lead to a systematic overestimation of MI, and hence an inflated type I error probability when testing the hypothesis H_0: MI = 0 versus H_a: MI > 0. However, this issue is greatly alleviated by an entropy estimator proposed by Zhang (2012), given below in (4).

For the entropy defined on an alphabet ℒ with corresponding probability distribution p and observed relative frequencies p̂ in an iid sample of size n, the estimator of Zhang (2012) is



\[
\hat H_z = \sum_{v=1}^{n-1}\frac{1}{v}\,\frac{n^{v+1}\,[n-(v+1)]!}{n!}\left\{\sum_{k=1}^{K'}\left[\hat p_k \prod_{j=0}^{v-1}\left(1-\hat p_k-\frac{j}{n}\right)\right]\right\}. \tag{4}
\]
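Evaluating (4) directly is numerically delicate because of the factorials, but the weight n^{v+1}[n−(v+1)]!/n! can be updated multiplicatively since it equals ∏_{j=1}^{v} n/(n−j). The Python sketch below uses that identity; it is an illustration, not the authors' implementation.

```python
import numpy as np

def entropy_z(counts):
    """Zhang (2012) entropy estimator (4) computed from a vector of counts."""
    counts = np.asarray(counts, dtype=float).ravel()
    counts = counts[counts > 0]
    n = counts.sum()
    p_hat = counts / n
    h = 0.0
    w = 1.0                                  # running weight n^{v+1}[n-(v+1)]!/n!
    prod = np.ones_like(p_hat)               # running product over j = 0, ..., v-1
    for v in range(1, int(n)):
        w *= n / (n - v)                     # multiply by n/(n-v) to move from v-1 to v
        prod *= (n - counts - (v - 1)) / n   # equals 1 - p_hat - (v-1)/n; hits exactly
                                             # zero once v - 1 reaches n - counts
        h += (w / v) * np.sum(p_hat * prod)
        if not prod.any():                   # all remaining terms vanish
            break
    return h

# Quick check on a hypothetical sample of size 12 over four letters.
print(entropy_z(np.array([6, 3, 2, 1])))
```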

Zhang (2012) showed that Ĥ_z has a bias satisfying

\[
0 < E(H - \hat H_z) = \sum_{v=n}^{\infty}\frac{1}{v}\sum_{k=1}^{K'} p_k (1-p_k)^v < \frac{(1-p_0)^n}{n\,p_0}, \tag{5}
\]



where p_0 = min{p_k > 0; k = 1, ···, K′}, so the bias decays at least exponentially fast in n. In addition, Zhang (2013) established asymptotic normality and efficiency for Ĥ_z when K′ is finite.

Based on the entropy estimator in (4), we introduce the following corresponding estimator of MI:

\[
\begin{aligned}
\widehat{MI}_z = \widehat{MI}_z(X,Y) &= \hat H_z(X) + \hat H_z(Y) - \hat H_z(X,Y) \\
&= \sum_{v=1}^{n-1}\left\{\frac{1}{v}\,\frac{n^{v+1}[n-(v+1)]!}{n!}\sum_{i=1}^{K_1}\left[\hat p_{i,\cdot}\prod_{k=0}^{v-1}\left(1-\hat p_{i,\cdot}-\frac{k}{n}\right)\right]\right\} \\
&\quad+ \sum_{v=1}^{n-1}\left\{\frac{1}{v}\,\frac{n^{v+1}[n-(v+1)]!}{n!}\sum_{j=1}^{K_2}\left[\hat p_{\cdot,j}\prod_{k=0}^{v-1}\left(1-\hat p_{\cdot,j}-\frac{k}{n}\right)\right]\right\} \\
&\quad- \sum_{v=1}^{n-1}\left\{\frac{1}{v}\,\frac{n^{v+1}[n-(v+1)]!}{n!}\sum_{i=1}^{K_1}\sum_{j=1}^{K_2}\left[\hat p_{i,j}\prod_{k=0}^{v-1}\left(1-\hat p_{i,j}-\frac{k}{n}\right)\right]\right\}.
\end{aligned} \tag{6}
\]
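In code, (6) amounts to applying the estimator (4) to the row margins, the column margins, and the joint cells of the observed frequency table. A minimal Python sketch along the lines of the one given after (4); the frequency table used here is again hypothetical.

```python
import numpy as np

def entropy_z(counts):
    """Zhang (2012) entropy estimator (4), with the weights updated iteratively."""
    counts = np.asarray(counts, dtype=float).ravel()
    counts = counts[counts > 0]
    n = counts.sum()
    h, w, prod = 0.0, 1.0, np.ones_like(counts)
    for v in range(1, int(n)):
        w *= n / (n - v)                      # n^{v+1}[n-(v+1)]!/n!
        prod *= (n - counts - (v - 1)) / n    # prod_{k=0}^{v-1}(1 - p_hat - k/n)
        h += (w / v) * np.sum((counts / n) * prod)
        if not prod.any():
            break
    return h

def mi_z(table):
    """MI_z of (6): entropy_z applied to the margins and to the joint cells."""
    table = np.asarray(table)
    return (entropy_z(table.sum(axis=1))
            + entropy_z(table.sum(axis=0))
            - entropy_z(table))

# Hypothetical 3 x 3 joint frequency table.
f = np.array([[12, 3, 0],
              [5, 20, 7],
              [1, 6, 10]])
print("MI_z:", mi_z(f))
```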

By (5), it is easy to see that the bias of M̂I_z in (6) is

\[
E(\widehat{MI}_z - MI) = -\sum_{v=n}^{\infty}\frac{1}{v}\sum_{i=1}^{K_1} p_{i,\cdot}(1-p_{i,\cdot})^v
- \sum_{v=n}^{\infty}\frac{1}{v}\sum_{j=1}^{K_2} p_{\cdot,j}(1-p_{\cdot,j})^v
+ \sum_{v=n}^{\infty}\frac{1}{v}\sum_{i=1}^{K_1}\sum_{j=1}^{K_2} p_{i,j}(1-p_{i,j})^v,
\]

and it decays at least exponentially fast in n, since each of the three additive terms decays at least exponentially fast in n, as argued in (5). In fact, letting p_∨ = max_{i,j}{p_{i,j} > 0}, p_∧ = min_{i,j}{p_{i,j} > 0}, p_{∨,·} = max_i{p_{i,·} > 0}, p_{∧,·} = min_i{p_{i,·} > 0}, p_{·,∨} = max_j{p_{·,j} > 0}, and p_{·,∧} = min_j{p_{·,j} > 0}, it is easily verified that



\[
-\sum_{v=n}^{\infty}\frac{(1-p_{\wedge,\cdot})^v}{v} - \sum_{v=n}^{\infty}\frac{(1-p_{\cdot,\wedge})^v}{v} + \sum_{v=n}^{\infty}\frac{(1-p_{\vee})^v}{v}
\;\le\; E(\widehat{MI}_z - MI) \;\le\;
-\sum_{v=n}^{\infty}\frac{(1-p_{\vee,\cdot})^v}{v} - \sum_{v=n}^{\infty}\frac{(1-p_{\cdot,\vee})^v}{v} + \sum_{v=n}^{\infty}\frac{(1-p_{\wedge})^v}{v}, \tag{7}
\]

and that all the terms on both sides of (7) converge to zero exponentially fast, since all of p_∨, p_∧, p_{∨,·}, p_{∧,·}, p_{·,∨}, and p_{·,∧} are fixed positive constants in (0, 1). The inequalities in (7) are in fact the main justification for the proposed estimator M̂I_z of (6).

In the next section, we will establish the asymptotic normality and efficiency of the proposed estimator M̂I_z. In Section 3, we will illustrate the proposed estimator with three examples of gene–gene association based on their expression levels. The paper ends with several remarks in Section 4.

2 Main results

In this section we will first establish the asymptotic normality of the plug-in estimator M̂I in (2), and then establish the asymptotic normality of the proposed estimator M̂I_z in (6) by showing that M̂I and M̂I_z are sufficiently close as n → ∞. Toward that end, let us consider the (K−1)-dimensional vectors


\[
v = (p_1, \cdots, p_{K-1})^{\tau}, \qquad \hat v = (\hat p_1, \cdots, \hat p_{K-1})^{\tau}.
\]

We have

\[
\sqrt{n}\,(\hat v - v) \xrightarrow{L} MVN(0, \Sigma(v)),
\]

where Σ(v) is the (K−1) × (K−1) covariance matrix given by

\[
\Sigma(v) = \begin{pmatrix}
p_1(1-p_1) & -p_1 p_2 & \cdots & -p_1 p_{K-1} \\
-p_2 p_1 & p_2(1-p_2) & \cdots & -p_2 p_{K-1} \\
\vdots & \vdots & \ddots & \vdots \\
-p_{K-1} p_1 & -p_{K-1} p_2 & \cdots & p_{K-1}(1-p_{K-1})
\end{pmatrix}.
\]

Let

\[
G(v) := H(X) + H(Y) - H(X,Y)
\]

and

\[
g(v) := \nabla G(v) = \left(\frac{\partial}{\partial p_1}G(v), \cdots, \frac{\partial}{\partial p_{K-1}}G(v)\right)^{\tau}.
\]

The following facts can be progressively verified:

\[
\frac{\partial H(X)}{\partial p_k} =
\begin{cases}
\ln p_{K_1,\cdot} - \ln p_{i,\cdot}, & \text{if } k \in S_i \ne S_{K_1}, \\
0, & \text{if } k \in S_{K_1};
\end{cases}
\]
\[
\frac{\partial H(Y)}{\partial p_k} =
\begin{cases}
\ln p_{\cdot,K_2} - \ln p_{\cdot,j}, & \text{if } k \in T_j \ne T_{K_2}, \\
0, & \text{if } k \in T_{K_2};
\end{cases}
\]
\[
\frac{\partial H(X,Y)}{\partial p_k} = \ln p_K - \ln p_k, \quad \text{for } 1 \le k \le K-1;
\]
\[
\frac{\partial}{\partial p_k} G(v) =
\begin{cases}
\ln(p_{K_1,\cdot}\, p_{\cdot,K_2}\, p_k) - \ln(p_{i,\cdot}\, p_{\cdot,j}\, p_K), & \text{if } k \in S_i \ne S_{K_1} \text{ and } k \in T_j \ne T_{K_2}, \\
\ln(p_{\cdot,K_2}\, p_k) - \ln(p_{\cdot,j}\, p_K), & \text{if } k \in S_{K_1} \text{ and } k \in T_j \ne T_{K_2}, \\
\ln(p_{K_1,\cdot}\, p_k) - \ln(p_{i,\cdot}\, p_K), & \text{if } k \in S_i \ne S_{K_1} \text{ and } k \in T_{K_2},
\end{cases}
\]

where p_K = 1 − ∑_{k ≠ K} p_k.

Noting (a) that every p_k > 0 for k = 1, ···, K, and every ∂G(v)/∂p_k is continuous at (p_1, ···, p_K)^τ, and (b) that Σ(v) is positive definite, and by Theorem 5.5.28 of Casella and Berger (2002), the multivariate delta method immediately gives the following proposition for the plug-in estimator M̂I.

Proposition 1. Provided that g^τ(v)Σ(v)g(v) > 0,

\[
\sqrt{n}\,\bigl(\widehat{MI} - MI\bigr)\,\bigl[g^{\tau}(v)\,\Sigma(v)\,g(v)\bigr]^{-1/2} \xrightarrow{L} N(0,1). \tag{8}
\]

By the continuous mapping theorem, Σ(v̂) is a consistent estimator of Σ(v) and g(v̂) is a consistent estimator of g(v). From this and Slutsky's theorem, Corollary 1 follows.

Corollary 1. Provided that g^τ(v)Σ(v)g(v) > 0,

\[
\sqrt{n}\,\bigl(\widehat{MI} - MI\bigr)\,\bigl[g^{\tau}(\hat v)\,\Sigma(\hat v)\,g(\hat v)\bigr]^{-1/2} \xrightarrow{L} N(0,1). \tag{9}
\]
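Corollary 1 suggests estimating the normalizing quantity g^τ(v)Σ(v)g(v) by its plug-in version. The Python sketch below writes that computation out directly from the derivative formulas above; it is illustrative only, and it presupposes that every cell of the table, in particular cell (K_1, K_2), is observed at least once (the augmentation v̂* discussed at the end of this section handles the general case).

```python
import numpy as np

def mi_plugin_variance(table):
    """Plug-in estimate of g^t(v) Sigma(v) g(v), the delta-method variance of
    Corollary 1, from a joint frequency table with all cells positive."""
    f = np.asarray(table, dtype=float)
    n = f.sum()
    p = f / n                       # joint relative frequencies p_hat_{i,j}
    px = p.sum(axis=1)              # p_hat_{i,.}
    py = p.sum(axis=0)              # p_hat_{.,j}
    K1, K2 = p.shape

    # Enumerate the positive cells so that the last one is cell (K1, K2);
    # v consists of the remaining cells and p_K = p[K1-1, K2-1].
    cells = [(i, j) for i in range(K1) for j in range(K2)
             if p[i, j] > 0 and (i, j) != (K1 - 1, K2 - 1)]
    v = np.array([p[i, j] for (i, j) in cells])
    pK = p[K1 - 1, K2 - 1]

    # Gradient g(v), assembled from the three entropy derivatives above.
    g = np.array([
        ((np.log(px[K1 - 1]) - np.log(px[i])) if i != K1 - 1 else 0.0)
        + ((np.log(py[K2 - 1]) - np.log(py[j])) if j != K2 - 1 else 0.0)
        - (np.log(pK) - np.log(p[i, j]))
        for (i, j) in cells
    ])

    sigma = np.diag(v) - np.outer(v, v)     # Sigma(v), (K-1) x (K-1)
    return g @ sigma @ g

# Example: hypothetical 3 x 3 table with all cells positive.
f = np.array([[12, 3, 2],
              [5, 20, 7],
              [1, 6, 10]])
print(mi_plugin_variance(f))
```

Dividing √n times an MI estimate by the square root of this quantity gives the approximate Z statistic of (9).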

Theorem 1. Provided that g^τ(v)Σ(v)g(v) > 0,

\[
\sqrt{n}\,\bigl(\widehat{MI}_z - MI\bigr)\,\bigl[g^{\tau}(v)\,\Sigma(v)\,g(v)\bigr]^{-1/2} \xrightarrow{L} N(0,1). \tag{10}
\]


Toward proving Theorem 1, consider first the problem of Ĥ_z of (4) estimating Shannon's entropy H = −∑_{k=1}^{K′} p_k ln p_k defined with p = {p_k; k = 1, ···, K′} on an alphabet ℒ = {ℓ_k}, where K′ is finite. Let {p̂_k} be the observed relative frequencies of the letters in an iid sample of size n and let Ĥ = −∑_{k=1}^{K′} p̂_k ln p̂_k be the plug-in estimator of H.

Lemma 1. Suppose {p_k} is a non-uniform distribution on ℒ = {ℓ_k}. Then \(\sqrt{n}\,(\hat H_z - \hat H) \xrightarrow{p} 0\).

To prove Lemma 1, we need Lemma 2 below. Toward stating Lemma 2, let us define, for any p ∈ (0, 1),

\[
g_n(p) = \sum_{v=1}^{n-1}\left\{\frac{1}{v}\,\frac{n^{v+1}[n-(v+1)]!}{n!}\left[\prod_{j=0}^{v-1}\left(1-p-\frac{j}{n}\right)\right]\mathbf 1[v \le n(1-p)+1]\right\},
\]

\[
h_n(p) = p\,g_n(p), \quad \text{and} \quad h(p) = -p\ln p.
\]

Lemma 2. Let p̂ = X/n, where X is a binomial random variable with parameters n and p. Then
1. \(\sqrt{n}\,[h_n(p) - h(p)] \to 0\) uniformly in p ∈ (c, 1) for any c, 0 < c < 1;
2. for every ε > 0 and any c, 0 < c < p, \(P\bigl(\sqrt{n}\,|h_n(\hat p) - h(\hat p)| > \varepsilon \mid \hat p \in (c, 1)\bigr) \to 0\); and
3. for any c, 0 < c < p, \(P(\hat p \in [1/n, c]) \to 0\).

Proof of Lemma 1. For an arbitrary ε > 0 and a c ∈ (0, p_1), consider

\[
\Delta_n := P\bigl(\sqrt{n}\,|h_n(\hat p_1) - h(\hat p_1)| > \varepsilon\bigr)
\le P\bigl(\sqrt{n}\,|h_n(\hat p_1) - h(\hat p_1)| > \varepsilon \bigm| \hat p_1 \in (c,1)\bigr)\,P\bigl(\hat p_1 \in (c,1)\bigr) + P\bigl(\hat p_1 \in [1/n, c]\bigr).
\]

By parts 1 and 2 of Lemma 2, \(P\bigl(\sqrt{n}\,|h_n(\hat p_1) - h(\hat p_1)| > \varepsilon \mid \hat p_1 \in (c, 1)\bigr) \to 0\); and by part 3 of Lemma 2, \(P(\hat p_1 \in [1/n, c]) \to 0\). Hence Δ_n → 0. The same argument applies to each p̂_k, and the result of Lemma 1 is immediate from the following expression in a finite sum,

\[
\sqrt{n}\,(\hat H_z - \hat H) = \sum_{k=1}^{K'} \sqrt{n}\,\bigl(h_n(\hat p_k) - h(\hat p_k)\bigr). \qquad \square
\]

Proof of Theorem 1. In light of Proposition 1 and Slutsky's theorem, it suffices to show that \(\sqrt{n}\,(\widehat{MI}_z - \widehat{MI}) \xrightarrow{p} 0\). However, this is trivial since

\[
\sqrt{n}\,\bigl(\widehat{MI}_z - \widehat{MI}\bigr) = \sqrt{n}\,\bigl(\hat H_z(X) - \hat H(X)\bigr) + \sqrt{n}\,\bigl(\hat H_z(Y) - \hat H(Y)\bigr) - \sqrt{n}\,\bigl(\hat H_z(X,Y) - \hat H(X,Y)\bigr).
\]

Applying Lemma 1 respectively to the three additive terms above immediately gives the result in (10). □

From Theorem 1 and Slutsky's theorem, Corollary 2 follows.

Corollary 2. Provided that g^τ(v)Σ(v)g(v) > 0,

\[
\sqrt{n}\,\bigl(\widehat{MI}_z - MI\bigr)\,\bigl[g^{\tau}(\hat v)\,\Sigma(\hat v)\,g(\hat v)\bigr]^{-1/2} \xrightarrow{L} N(0,1). \tag{11}
\]

Corollaries 1 and 2 provide statistical tools for inference about MI based on iid samples of very large size n. In practice, however, n may not be large enough to ensure that p̂_{i,j} > 0 for all (i, j). When this is the case, estimating g^τ(v)Σ(v)g(v) by means of the plug-in g^τ(v̂_n)Σ(v̂_n)g(v̂_n) may fail computationally, since the functional forms of g(v) call for all components of v and p_{K_1,K_2} to be positive. To account for such a situation, one may choose to plug an augmented v̂ into g(v) instead, namely,


\[
\hat v^* = \bigl(\hat p^*_{1,1}, \cdots, \hat p^*_{1,K_2},\; \hat p^*_{2,1}, \cdots, \hat p^*_{2,K_2},\; \cdots,\; \hat p^*_{K_1,1}, \cdots, \hat p^*_{K_1,K_2-1}\bigr)^{\tau},
\]

where, for any pair of (i, j), p̂*_{i,j} = p̂_{i,j} + (1/n) 1[p̂_{i,j} = 0]. Clearly such an augmentation does not impact any of the asymptotic properties established above, but it does resolve the computational issue mentioned above.

The asymptotic efficiency of the proposed estimator M̂I_z in (6) follows from the fact that the asymptotic variance of M̂I_z is the same as that of M̂I and that M̂I is the maximum likelihood estimator (mle) of MI.
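The augmentation itself is a one-line operation on the observed relative frequencies; a minimal Python sketch (illustrative, not the authors' code):

```python
import numpy as np

def augment(table):
    """Augmented relative frequencies: p*_{i,j} = p^_{i,j} + (1/n) 1[p^_{i,j} = 0]."""
    f = np.asarray(table, dtype=float)
    n = f.sum()
    p_hat = f / n
    return p_hat + (p_hat == 0) / n

# All augmented cells are positive, so g(.) and Sigma(.) can be evaluated at v*
# (the vector of all augmented cells except the one for cell (K1, K2)) even when
# some observed cells are empty.
print(augment(np.array([[5, 0], [3, 2]])))
```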

3 Examples

To illustrate the proposed methodology, we present three examples of gene–gene association evaluation using the data reported in Dave et al. (2004). These examples include gene expression data on 191 biopsy specimens obtained from 191 patients with untreated follicular lymphoma. RNA was extracted from fresh-frozen tumor-biopsy specimens from patients who had received a diagnosis between 1974 and 2001. Specimens were examined for gene expression with the use of Affymetrix U133A and U133B microarrays. The data set was obtained from http://llmpp.nih.gov/FL. The study included 44,187 probe expression values, and some genes have multiple (2–7) measurements. We take two pairs of genes and compare them in Examples 1 and 2.

In Examples 1 and 2, we let α = 0.05, K_1 = K_2 = 10 and K = K_1K_2 = 100, i.e., p_{i,j} > 0 for all 1 ≤ i ≤ K_1 and 1 ≤ j ≤ K_2. In Example 3, we let α = 0.05, K_1 = K_2 = 10 and K = 94, with 6 of the 100 p_{i,j}'s being zero. More explicitly, we assume that p_{1,9} = p_{1,10} = p_{2,10} = 0 and p_{9,1} = p_{9,2} = p_{10,1} = 0.

Example 1. TMEM30A and MTCH2. We compare two different genes, the first corresponding to the probe with ID number 1095962 for gene TMEM30A, and the second corresponding to the probe with ID number 1095967 for gene MTCH2. For each of these two probes, expression values of the 191 patients were obtained. The expression values with respect to TMEM30A are between 10.3593 and 12.2961. We divide the interval [10.3593, 12.2961] into 10 subintervals of equal length and assign them the categorical values 1, 2, …, 10 in increasing order. Any expression value which lies in the ith subinterval belongs to category i and is labeled by i. Similarly, the expression values with respect to MTCH2 range from 9.5683 to 11.3247. We divide the interval [9.5683, 11.3247] into 10 subintervals of equal length, and the expression values of MTCH2 are assigned and labeled accordingly. Therefore, the expression values of both genes are summarized into a 10 × 10 frequency matrix {n p̂_{i,j}} in part (A) of Table 1, with i labeling a level of TMEM30A and j labeling that of MTCH2.

From part (A) of Table 1, we have n = 191, M̂I = 0.1459, M̂I_z = 0.0552, and g^τ(v̂*)Σ(v̂*)g(v̂*) = 0.2198. For the hypothesis of H_0: MI = 0 versus H_a: MI > 0, the two approximate Z-test statistics using Corollaries 1 and 2 are respectively

\[
Z_1 = \sqrt{n}\,\bigl(\widehat{MI} - 0\bigr)\,\bigl[g^{\tau}(\hat v^{*})\,\Sigma(\hat v^{*})\,g(\hat v^{*})\bigr]^{-1/2} = 4.3006,
\]
\[
Z_2 = \sqrt{n}\,\bigl(\widehat{MI}_z - 0\bigr)\,\bigl[g^{\tau}(\hat v^{*})\,\Sigma(\hat v^{*})\,g(\hat v^{*})\bigr]^{-1/2} = 1.6267.
\]
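The categorization used in Examples 1 and 2 (10 equal-width subintervals spanning the observed range of each probe, labeled 1 through 10) can be reproduced along the following lines. A minimal Python sketch; the expression vectors below are randomly generated stand-ins for the actual probe measurements, and the treatment of values falling exactly on a subinterval boundary is one reasonable convention, not something specified in the paper.

```python
import numpy as np

def discretize(values, n_bins=10):
    """Assign each expression value to one of n_bins equal-width categories
    spanning [min(values), max(values)], labeled 1, ..., n_bins."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_bins + 1)
    # np.digitize with the interior edges gives labels 1..n_bins (the maximum -> n_bins).
    return np.digitize(values, edges[1:-1]) + 1

def joint_frequency_table(x_labels, y_labels, n_bins=10):
    """n_bins x n_bins joint frequency table of the two categorized probes."""
    table = np.zeros((n_bins, n_bins), dtype=int)
    for i, j in zip(x_labels, y_labels):
        table[i - 1, j - 1] += 1
    return table

# Hypothetical stand-ins for the 191 expression values of the two probes.
rng = np.random.default_rng(0)
x = rng.uniform(10.36, 12.30, size=191)
y = rng.uniform(9.57, 11.32, size=191)
table = joint_frequency_table(discretize(x), discretize(y))
print(table.sum())   # 191
```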

Table 1: Joint frequencies of Examples 1 and 2 (rows indexed by i, columns by j).

(A)
 i\j    1   2   3   4   5   6   7   8   9  10
  1     0   0   0   1   1   0   0   0   0   0
  2     0   0   0   1   0   0   1   1   0   1
  3     0   0   0   2   1   0   1   0   0   1
  4     0   0   0   1   1   2   0   0   0   0
  5     0   0   0   3   6   2   2   0   0   0
  6     2   0   2   5   6   5   1   0   0   0
  7     0   0   4   6  11   5   1   1   0   1
  8     0   0   5  10  21   7   5   1   0   1
  9     0   0   7  11   9   6   3   0   0   1
 10     0   0   4  10   6   5   1   0   0   0

(B)
 i\j    1   2   3   4   5   6   7   8   9  10
  1     0   0   0   2   0   0   0   0   0   0
  2     0   0   0   0   0   1   1   0   0   0
  3     0   0   0   2   1   0   2   1   0   0
  4     1   0   0   2   3   2   1   2   0   0
  5     0   0   1   4   4   7   2   0   0   1
  6     1   0   2   3  13   6   7   3   0   0
  7     1   0   0   6   8  16  16   4   0   3
  8     0   0   1   1   5  11  15   6   0   4
  9     0   0   0   1   2   2   5   1   0   5
 10     0   0   0   0   0   0   2   0   0   1


Interestingly, the two tests would lead to very different p-values: the p-value associated with Z_1 is less than 0.0001 while that associated with Z_2 is slightly above 0.05, so that at α = 0.05 the test based on M̂I rejects H_0 while the test based on M̂I_z does not.

Example 2. The second pair of genes is summarized in part (B) of Table 1. For the hypothesis of H_0: MI = 0 versus H_a: MI > 0, the two approximate Z-test statistics using Corollaries 1 and 2 are respectively

\[
Z_1 = \sqrt{n}\,\bigl(\widehat{MI} - 0\bigr)\,\bigl[g^{\tau}(\hat v^{*})\,\Sigma(\hat v^{*})\,g(\hat v^{*})\bigr]^{-1/2} = 4.2115,
\]
\[
Z_2 = \sqrt{n}\,\bigl(\widehat{MI}_z - 0\bigr)\,\bigl[g^{\tau}(\hat v^{*})\,\Sigma(\hat v^{*})\,g(\hat v^{*})\bigr]^{-1/2} = 2.3659,
\]

which give p-values of less than 0.0001 and approximately 0.009, respectively, so that both tests reject H_0 at α = 0.05.
