Pleiotropy can be effectively estimated without counting phenotypes through the rank of a genotype-phenotype map.

INVESTIGATION

Pleiotropy Can Be Effectively Estimated Without Counting Phenotypes Through the Rank of a Genotype–Phenotype Map Xun Gu Department of Genetics, Developmental and Cell Biology, Iowa State University, Ames, Iowa 50011, and State Key Laboratory of Genetic Engineering and MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences, Fudan University, Shanghai 200433, China

ABSTRACT Although pleiotropy, the capability of a gene to affect multiple phenotypes, has been well known as one of the common gene properties, a quantitative estimation remains a great challenge, simply because of the phenotype complexity. Not surprisingly, it is hard for general readers to understand how, without counting phenotypes, gene pleiotropy can be effectively estimated from the genetics data. In this article we extensively discuss the Gu-2007 method that estimated pleiotropy from the protein sequence analysis. We show that this method is actually to estimate the rank (K) of genotype–phenotype mapping that can be concisely written as K = min(r, Pmin), where Pmin is the minimum pleiotropy among all legitimate measures including the fitness components, and r is the rank of mutational effects of an amino acid site. Together, the effective gene pleiotropy (Ke) estimated by the Gu-2007 method has the following meanings: (i) Ke is an estimate of K = min(r, Pmin), the rank of a genotype–phenotype map; (ii) Ke is an estimate for the minimum pleiotropy Pmin only if Pmin , r; (iii) the Gu-2007 method attempted to estimate the pleiotropy of amino acid sites, a conserved proxy to the true gene pleiotropy; (iv) with a sufficiently large phylogeny such that the rank of mutational effects at an amino acid site is r / 19, one can estimate Pmin between 1 and 19; and (v) Ke is a conserved estimate of K because those slightly affected components in fitness have been effectively removed by the estimation procedure. In addition, we conclude that mutational pleiotropy (number of traits affected by a single mutation) cannot be estimated without knowing the phenotypes.

P

LEIOTROPY, or the capability of a gene to affect multiple phenotypes (Fisher 1930; Wright 1968), has played a central role in genetics, development, and evolution (e.g., Lande 1980; Turelli 1985; Wagner 1989; Barton 1990; Keightley 1994; Hartl and Taubes 1996, 1998; Waxman and Peck 1998; Lynch et al. 1999; Bataillon 2000; Orr 2000; Poon and Otto 2000; Wagner 2000; Elena and Lenski 2003; Welch and Waxman 2003; Wingreen et al. 2003; Zhang and Hill 2003; MacLean et al. 2004; Otto 2004; Eyre-Walker et al. 2006; Pigliucci 2008; Wagner et al. 2008). Although functional genomics has brought high-throughput data to bear on the nature and extent of pleiotropy (Dudley et al. 2005; Ohya et al. 2005; Pal et al. 2006; Cooper et al. 2007), this issue Copyright © 2014 by the Genetics Society of America doi: 10.1534/genetics.114.164673 Manuscript received March 28, 2014; accepted for publication May 28, 2014; published Early Online June 3, 2014. Supporting information is available online at http://www.genetics.org/lookup/suppl/ doi:10.1534/genetics.114.164673/-/DC1. Address for correspondence: Center for Bioinformatics and Biological Statistics, 332 Science II Hall, Iowa State University, Ames, IA 50011. E-mail: [email protected].

remains highly controversial, largely because of the phenotypic complexity; see Wagner and Zhang (2011), Hill and Zhang (2012), and Paaby and Rockman (2013) for recent reviews and comments. Meanwhile, a new research line has emerged in the past decade, aiming at the estimation of gene pleiotropy from genetics or sequence data, rather than the measurement of affected phenotypes (Martin and Lenormand 2006; Gu 2007a, b; Chevin et al. 2010; Su et al. 2010; Zeng and Gu 2010; Razeto-Barry et al. 2012; Chen et al. 2013). For instance, Gu (2007a) developed a statistical method to estimate the “effective pleiotropy” (Ke) of a gene from the multiple-sequence alignment (MSA) of protein sequences. Most genes have Ke in the range between 1 and 20 (Su et al. 2010), with the medium Ke = 6.5 of these estimates that is comparable to some empirical pleiotropy measures (Wagner and Zhang 2011; Paaby and Rockman 2013). However, we have to acknowledge that this approach is not very easy to be understood; that is, How can we know the amount of multifunctionality (pleiotropy) of a gene without biologically knowing each functional component?

Genetics, Vol. 197, 1357–1363 August 2014

1357

In this article we try to answer the following questions: What quantity was actually estimated by the method proposed by Gu (2007a)? And under which condition can this estimate be interpreted as effective gene pleiotropy? To answer these questions, we introduce the concept of minimum pleiotropy (Pmin), which has an intrinsic relationship with the rank of the genotype–phenotype map. Therefore, we propose that the method of Gu (2007a) is to estimate the rank of the genotype– phenotype map that can be regarded as effective estimation of Pmin. Simulations and empirical studies are carried out to test several predictions from the new theory.

Materials and Methods Data sets

Random single-copy gene samples were from three sets of model organisms, respectively. MSAs of vertebrate genes were from the data set of Su et al. (2010), while those of five yeast homologous genes (Saccharomyces cerevisiae, Candida glabrata, Kluyveromyces lactis, Debaryomyces hansenii, and Yarrowia lipolytica) were from the Génolevures database (http://cbi.labri.fr/Genolevures), and those of 12 Drosophila genes were from the Consortium of Drosophila genome projects. Estimation of Ke from protein sequences

We calculated the dN/dS (the ratio of nonsynonymous to synonymous rates) of each gene, based on the orthologous sequences from two closely related species: human and mouse for the vertebrate data set, S. cerevisiae and S. paradoxus for the yeast data set, and D. melanogaster and D. sechellia for the Drosophila data set. After the gene phylogeny was reconstructed by the neighbor-joining method, we estimated H, a normalized index for the rate variation among sites (Gu 2007a). It follows that the effective gene pleiotropy (Ke) was estimated by solving the equation [dN/dS]/(1 2 H) = 22Kj[1 +u(Kj)], where u(Ke) = 0.0208Ke(Ke + 2)/(1 + 0.289Ke). The technical procedures for estimating Ke are described by Gu (2007a) and Su et al. (2010) in detail. Estimation of Ke from nucleotide sequences

Vertebrate genes used in this analysis were from the data set of Su et al. (2010). For each gene, the MSA of nucleotides was transformed exactly from the MSA of amino acids. We first calculated the quantity g = (d/dS)/(1 2 H) for each gene. Here d is the evolutionary distance at the first and second nucleotide positions (d12) or at all three positions (d123), and dS is the synonymous distance. All these distances were estimated between human and mouse, by the Kimura two-parameter method. When the phylogeny is given, the normalization index of rate variation among nucleotide sites (H) can be calculated by the algorithm of Gu and Zhang (1997) with some modifications. In practice, when the method of Gu (2007a) is applied for the nucleotide sequences, we may face a technical problem: Due to the saturated synonymous substitutions at many nucleotide

1358

X. Gu

sites, estimation of the quantity g would be subject to a large sampling variance, resulting in a high number of not-applicable (NA) cases. To avoid this problem, we used the following procedure: First, calculate the mean of g over all genes, as well as the confidence interval (25% and 75% quantiles, respectively); and, second, use the mean of g to estimate the mean of Ke , as well as the 25% and 75% quantiles, respectively. Computer simulation

In the simulation procedure, we first generate the matrix A = Sw21Sm as follows: Given any fixed n (n = 6, 10, 14, and 18), we generated n real random variables between 1 and 10 and assigned these variables randomly as diagonal elements of Sw; hence it is of full rank. However, matrix Sm is partitioned into two components: a full, r-rank matrix S*m that was generated similar to procedure (r is less than n) and then a diagonal matrix with n 2 r small but the same positive values of diagonal elements such as Blow = 0.001. Together, matrix A is given by A = CSw21Sm, where the constant C is chosen such that the mean of first r eigenvalues of A equals a specified value B0. The simulation study was carried out in Matlab. Note that Matlab script (KeEstimation.m.txt) for the computer simulation and the software (GenePleio_ win-v15.rar, codes included) for the estimation of Ke are available in supporting information, File S1. The real data sets used in this study are deposited in DryAd (http://datadryad. org/; doi: 10.5061/dryad.s84fm).

Results and Discussion Many-face problem of pleiotropy under a genotype–phenotype map

Like many biological terminologies, the concept of pleiotropy is easy to understand but difficult to measure. Practically, biologists address this issue by experimentally identifying distinct phenotypes, in vitro or in vivo, caused by the same mutation (mutational pleiotropy) or mutations from the same gene (gene pleiotropy). Although these studies largely focused on the nature of phenotypes, the results did provide a “minimum” estimate about the extent of pleiotropy. Thus, one may compare whether a gene is more pleiotropic than others, as long as the same technological platform is used, e.g., the mouse knockout experimentation. However, different pleiotropy measures based on different technological platforms certainly represent biologically significant differences that may even lead to contradictory interpretations (Wagner and Zhang 2011; Paaby and Rockman 2013). To illustrate this problem, we invoke the basic system formulated by Paaby and Rockman (2013), which has three steps: 1. Molecular pleiotropy (f) refers to the number of protein functions, most likely in vitro. For instance, an enzyme is called pleiotropic if it can catalyze many distinct substrates.

2. Developmental pleiotropy (d) refers to the number of developmental pathway autonomies that underlie diverse phenotypes, including diseases. Developmental pleiotropy provides concrete examples to demonstrate how a single gene or mutation may affect distinct phenotypes in vivo. 3. Selectional pleiotropy (n) refers to the number of molecular phenotypes (Gu 2007a), each of which corresponds to a single nontrivial fitness component. In essence, selectional pleiotropy is the dimension of Fisher’s geometric model, which underlies most statistical studies of the genotype–phenotype map (Lande 1980; Turelli 1985; Wagner 1989; Barton 1990). While the difference between f and d illustrates the many-face problem of pleiotropy, the difference between n and f (or d) represents the conceptual gap between empirical pleiotropy and Fisher’s geometric model. Although the relationship between f, d, and n could be astonishingly complex, we assume that, from the genotype (G) to the phenotype, the map direction can be represented as G / f / d / n. Meanwhile, under Fisher’s model, the map direction is as simple as G / n. While one may view this one-step map in Fisher’s model as merely a theoretical abstract of many cryptic steps, we argue that the dimension of Fisher’s model could be misled by this oversimplification; it equals n only when n , f and n , d; otherwise, it is ,n. Actually, it is the minimum pleiotropy Pmin = min(f, d, n) that determines the dimension of Fisher’s model. In other words, Pmin is the rank of phenotype complexity, providing a link between the real genotype–phenotype map and Fisher’s model. The concept of Pmin can be naturally extended to any general genotype–phenotype map with N steps. Without loss of generality, we suppose that the lowest level is the molecular functions ( f) related to protein activities and the highest level is the molecular phenotypes (n) related to fitness components. At each ith step node, the pleiotropy level can be symbolically denoted by Pi, i = 1, . . . , N, where P1 = f and PN = n. Thus, the minimum pleiotropy can be generally defined as follows: Pmin ¼ minðP1 ; . . . ; PN Þ ¼ minð f ; P2 ; . . . ; PN21 ; nÞ:

(1)

While minimum pleiotropy may provide one solution to the many-face problem, the question that remains is under what condition we are able to estimate Pmin. To this end, we should address the rank of the genotype–phenotype map first. Rank of the genotype–phenotype map

Without loss of generality, the genotype–phenotype map can be schematically represented as G / f / (P2, . . . , PN21) / n, where (P2, . . . , PN21) represents the black box to indicate the phenotype complexity. We then define the rank of the genotype–phenotype map (K) as the canonical number of mapping paths from the genotype (G) to the fitness components (n) through the entire phenotypic space. In other

words, K is the degrees of freedom for a particular genotype–phenotype mapping, which highly depends on the nature of the genotype. In the case of single mutations treated as genotypes, the mutational pleiotropy (a single mutation can affect multiple phenotypes) means that those corresponding mapping paths start from the same mutational event. Indeed, these affected phenotypes are expected to be highly correlated, without many degrees of freedom, simply because the same biochemical–structural property of the mutation underlies most phenotypic consequences. By contrast, a pleiotropic gene may contain many mutations, each of which affects only one specific phenotype. As a result, a gene may contain a bunch of related but distinct mapping paths that have high degrees of freedom. This issue can be well addressed by the rank of mutational effects (r), defined by the canonical number of mutations within a genotype unit that cause distinct phenotypes. Intuitively, r is the number of starting points of map paths; obviously, r = 1 for single mutations. Putting this together, the rank (K) of the genotype– phenotype map is the smaller one of the rank of phenotypic complexity (minimum pleiotropy) and the rank of mutational effects; that is, K ¼ minðr; Pmin Þ:

(2)

As is discussed later, this geometric structure of the genotype–phenotype map can be rigorously derived under the linear transformation model of mapping. Obviously, to what extent K can be interpreted as the minimum pleiotropy (Pmin) depends on the unknown rank (r) of mutational effects; K = Pmin if Pmin , r. In particular, when r = 1, K always = 1, regardless of how large Pmin is. Nevertheless, the up limit of r for given genotype data is controlled by the number of generic states (M). For instance, M = 2 for random nucleotides, M = 4 for nucleotide sites, and M = 20 for amino acid sites. In fact, the maximum rank (rmax) for a given genotype is the number of generic states minus one (the wild type); i.e., rmax = M 2 1. Moreover, the principle of “microergodicity” claims that the mutational process can reach over all generic states for a sufficiently long time; in this case we have r / rmax. In practice, the “real” rank of mutational effects (r) may be ,,rmax if the underlying sampling process has no sufficient time for mutations to reach all states. Rank of the genotype–phenotype map estimated by the Gu (2007a) method

In this section, we show that the statistical method developed by Gu (2007a) (Gu-2007 method) actually was to estimate the rank (K) of the genotype–phenotype map from the protein sequences. Since the genotype–phenotype map can convey the geometric information, the rank K, between the stabilizing selection and the protein sequence evolution, we are able to, in retrospect, estimate K based on the analysis of protein sequence data without the help of observed

Pleiotropy Can Be Effectively Estimated Without Counting Phenotypes

1359

phenotypes. Since K = min(r, Pmin), we thus conclude that the minimum pleiotropy (Pmin) can be estimated without counting the number of phenotypes, if it is in the range of [1, r]. Derivation of K = min(r, Pmin) under a linear mapping model: Consider the case where the genotype–phenotype map is schematically represented as a linear mapping model G = r / P1 / , . . . , / PN = n, which can be characterized by three matrices (Wagner 1989): an r 3 r variance–covariance matrix (V) for the correlated mutational effects of genotypes (G), an r 3 n transformation matrix T that links between r number of mutational effects and n number of fitness components (molecular phenotypes), and an n 3 n stabilizing selection matrix Sw that is full rank of n. The linear mapping model claims that the mutation matrix V and transformation matrix T together determine how mutations ultimately affect n molecular phenotypes, as characterized by an (n 3 n) matrix defined by Sm = T9VT. However, the rank of Sm is not n except for some special cases, because T is a multiplication of those consecutive step-transformation matrices; that is, T = T1 T2, . . . , TN, where T1 is the matrix with r rows and P1 columns, T2 is the one with P1 rows and P2 columns, and so forth. Thus, the linear algebra theory states that the rank of Sm is the minimum of (r, P 1, . . . , P N21, PN ). Noting that the minimum pleiotropy P min = min(P 1, . . . , P N21 , PN ), we have the rank of Sm given by min(r, Pmin ). Under this transformation model, the rank (K) of the genotype–phenotype map turns out to be the rank of matrix A = Sw21Sm. As Sw is a full rank of n, the rank of A is the same as that of Sm. Thus, we have shown K = min (r, Pmin) # min (r, n). The equals sign holds when n = PN is the smallest among P1, . . . , PN or the one-step (N = 1) genotype–phenotype map (Chen et al. 2013). Estimation of K = min(r, Pmin) by the Gu-2007 method: Gu (2007a) formulated the statistical model of protein sequence under n-dimensional stabilizing (purifying) selection. The results showed that, in addition to mutation rate v, the first (mean) and the second moments of the evolutionary rate are dominated by the n eigenvalues (ai, i = 1, . . . , n) of matrix A=Sw21Sm, which are nonnegative, i.e., either positive or zero. Without loss of generality, one may set a1 $ a2 . . . $ an $ 0. For instance, the mean evolutionary rate can be written as " #" # n n Y X cBi l¼ n ð1 þ 2Bi Þ 1 þ ; (3) 1 þ 2Bi i¼1 i¼1 where c = 0.5772, Bi = 4Neai is the ith component of selection intensity, and Ne is the effective population size. A notable property of Equation 3 is that only those nonzero Bi’s affect the evolutionary rate. Since some eigenvalues (ai’s) of matrix A may be zero, the number of eigenvalues (n) in Equation 3 should be replaced by that of positive eigenval-

1360

X. Gu

ues. We formulate this argument as follows. Since the rank of matrix A is K = min(r, Pmin) # min(r, n) # n, matrix A has the first K nonzero positive eigenvalues; that is, ai . 0 for any i # K, and ai = 0 for all i . K. It immediately follows that B1 $ B2 $, . . . , $ BK . 0, whereas BK+1 =, . . . , = Bn = 0, suggesting that Equation 3 is equivalent to " #" # K K Y X cBi l¼n ð1 þ 2Bi Þ 1 þ : (4) 1 þ 2Bi i¼1 i¼1 As similar results can be derived for any kth moment of evolutionary rate (not shown), we conclude that the Gu-2007 method is to estimate the rank of the genotype–phenotype map K = min(r, Pmin), rather than the gene pleiotropy directly. From the rank of the genotype–phenotype map to effective gene pleiotropy

Putting this together, we are able to explain why gene pleiotropy can be effectively estimated without counting the phenotypes. Precisely, let Pmin be the minimum pleiotropy over all legitimate measures. Under the nearly neutral model (Kimura 1983), the Gu-2007 method attempts to estimate K = min(r, Pmin), the rank of the genotype–phenotype map, from a given MSA of protein sequences. Moreover, the effective estimate of K (denoted by Ke) can be interpreted as the effective gene pleiotropy (K = Pmin) when Pmin , r, or K = r as a minimum estimate of Pmin when Pmin . r. We also suggest, to be practically meaningful, that (i) amino acid sequence alignment (rmax = 19) is preferred, rather than nucleotide sequence alignment (rmax = 3), and that (ii) sequence sampling should be from a large phylogeny so that the real rank of mutational effects (r) can be reasonably close to rmax. For amino acid sequence alignment, the principle of microergodicity ensures that, when the phylogeny is sufficiently large, the rank (r) of mutational effects is close to the maximum value rmax = 19, leading to K / min(19, Pmin). In the literature, the term “mutation size” (s) was used as the overall mutational effect on fitness. At the molecular level, it is virtually the same as the coefficient of selection. One criticism (Razeto-Barry et al. 2011, 2012; Razeto-Barry and Maldonado 2011) was that the Gu-2007 method assumed that gene pleiotropy and the overall mutation size (s) were correlated. We try to clarify this issue in the following two arguments. First, the overall mutation size can be written as s = s1 + . . . + sn, where si is the mutation size for the ith fitness component. If si remains a rough constant, we have s s0n, implying a strong correlation between s and n. By contrast, if there is only a very small number (n*) of large si’s, but the others are very small, we expect that s correlates only with n* rather than with n. Indeed, the Gu-2007 method estimates the number of fitness components with strong mutation size. Second, we have shown that the rank of the genotype–phenotype map K = min(r, Pmin), rather than the pleiotropy Pmin, was involved in the statistical model

that underlies the Gu-2007 method. Hence, the question turns out to be whether K and s are correlated. As there seems no correlation between the rank (r) of mutational effects and the mutation size (s), we claim that K is not correlated with s when r , Pmin. When r . Pmin, the estimated K is for the minimum pleiotropy Pmin. In practice, whether the overall mutation size (s) is correlated with Pmin may affect the interpretation of the Gu-2007 method, but not the procedure of estimation. Computer simulation and data analysis

Simulation study for Ke estimation: We carried out computer simulations to illustrate our main conclusion. To be concise, the simulation procedure was designed under the simple one-step mapping model so that K = min(r, n). For each of the simulation data, the Gu-2007 method was implemented to estimate the effective gene pleiotropy (Ke). Table 1 shows the results. In case A, we set r = n/2 and so K = min(r, n) = n/2. It appears that the estimate (Ke) is, on average, less than but close to r = n/2, because the estimation of the Gu-2007 method is conserved. In case B, we set r = 15 but allow K to vary from 2 to 20. Although Ke is on average ,n in all cases, it increases almost linearly with the increasing of n until it reaches the point of n = r. As expected, when n . r, the estimated Ke, on average, becomes quickly saturated toward the line of r = 15, regardless of the increasing of n. Range of Ke estimated from protein sequences: One impressive prediction from our theory is that, based on the protein sequence alignment, the Gu-2007 method can estimate the degree of gene pleiotropy effectively only between 1 and 19, whereas this estimate would converge to 19 when the real gene pleiotropy becomes higher. We have examined three data sets: 321 vertebrate genes (Su et al. 2010), 580 single-copy genes from five yeast genomes, and 437 single-copy genes from 12 Drosophila genomes. Although each data set shows a great range of gene pleiotropy, almost all estimates (99%) are ,19 (Table 2), consistent with the theoretical prediction. When the total evolutionary time of the phylogeny is not large enough to meet the criterion of microergodicity, the real rank (r) of mutational effects could be ,,rmax, resulting in a lower rank (K) of the genotype–phenotype map. We tested this prediction by the vertebrate gene data set (Su et al. 2010). For each gene, we estimated Ke from four species (mammals), five species (mammals and chicken), six species (mammals, chicken, and frog), seven species (mammals,

Table 1 Simulation results showing that estimation of K is affected by the rank (r) of mutational effects n

r

Ke (B0 = 2)

Ke (B0 = 4)

6 10 14 18

3 5 7 9

Case A 2.14 6 0.01 3.39 6 0.03 5.58 6 0.10 7.07 6 0.24

2.47 4.81 6.58 8.37

6 6 6 6

0.01 0.04 0.12 0.30

6 10 14 18

15 15 15 15

Case B 4.38 6 0.05 8.71 6 0.37 12.48 6 0.40 13.08 6 0.44

5.47 9.12 13.29 14.35

6 6 6 6

0.07 0.39 0.41 0.46

B0 is the baseline selection intensity.

chicken, frog, and fugu), and eight species (mammals, chicken, frog, fugu, and zebrafish), respectively. We observed the average of Ke = 3.1, 4.2, 6.0, 6.3, and 6.5 for the data sets of four species to eight species, respectively. Indeed, for the same gene set, the estimation of Ke tends to be small when the protein phylogeny is small. One may wonder the ultimate value that Ke may approach when the phylogeny becomes very large. Tentatively, we used a simple extrapolation model after calculating the total branch length of the phylogeny in each case and estimated and found that, roughly, the average of Ke approaches 7.17. Ke estimated from nucleotide sequences reflecting the rank (r) of mutational effects: While the Gu-2007 method is applied to the nucleotide sequence alignment, we have K = min(r, Pmin) , 3 because of the maximum rank rmax = 3 at nucleotide sites. We used the vertebrate gene set (Su et al. 2010) to test this prediction. Two types of nucleotide sets were analyzed: The first data set includes the first and second codon positions (data set 1st2nd), and the second one includes all sites (data set all). Since many saturated mutations may have occurred at synonymous sites, the estimation procedure in the Gu-2007 method would be statistically unreliable. We propose a modified procedure to avoid this problem; see Materials and Methods. We analyzed 321 vertebrate single-copy genes and found that the average of K12, the effective gene pleiotropy based on the first and second codon positions, is 2.91, with the 25–75% sampling quantile (1.75–5.59). This finding can be reasonably explained as the result of rmax = 3 such that K12 = min(3, Pmin) 3 despite that the real gene pleiotropy Pmin may be much higher. Moreover, when all nucleotide positions are used, the average of effective gene pleiotropy (K123) is as low as 0.48, with

Table 2 Extent of effective gene pleiotropy estimated from MSA of amino acid sequences Data sets

No. species

No. genes

Median of Ke

Ke cutoff (95% quantile)

Ke cutoff (99% quantile)

Vertebrates Drosophila Yeasts

8 5 12

321 437 580

6.5 6.9 6.8

2.1–12.4 2.8–13.1 2.4–12.6

1.3–18.8 1.7–20.2 1.5–19.4


1361

Table 3 The means and ranges of effective gene pleiotropy at the first and second nucleotide positions, all three nucleotide positions, and amino acid sites

Mean of d/dS Mean of H Mean of g 25% of g 75% of g Mean of K 75% of K 25% of K

1+2 position

1+2+3 position

Amino acid sites

0.115 0.724 0.416 0.186 0.590 2.914 5.587 1.753

0.269 0.689 0.866 0.690 0.963 0.477 1.233 0.125

0.113 0.221 0.145 0.089 0.245 6.451 8.042 4.677

the 25–75% sampling quantile (0.00–1.23). Indeed, synonymous substitutions mainly at the third codon position have little functional effect (Table 3). Distribution of coefficient of selection: Although the distribution of s (the coefficient of selection or, broadly, mutation size), f(s), can be approximately modeled as a (negative) gamma distribution, a long-standing controversy was about how to determine the shape parameter b. One class of models determined that b is a universal constant between 0.5 and 1.0, based on the argument that gene pleiotropy and mutation size are not correlated (Orr 2000; Razeto-Barry et al. 2011, 2012; Razeto-Barry and Maldonado 2011). By contrast, another class of models suggested that mutation size is the sum of effects from all (n) fitness components, leading to the relationship b n/2 (Poon and Otto 2000; Welch and Waxman 2003; Gu 2007a,b). We compromise between these two extremes as follows. Rephrasing the derivation of Gu (2007b) with the interpretation K = min(r, Pmin), we have b K/2 = min(r, Pmin)/2, which includes two special cases: (i) b r/2 if r , Pmin, predicting that b could be any value between 0.5 and 2.0, and (ii) b Pmin/2 if r . Pmin. Our model explains most empirical distributions of s with b = 0.5–1 simply because they were from nucleotide data (Keightley 1994). In the following we briefly discuss two examples. Piganeau and Eyre-Walker (2003) integrated the distribution f(s) into a population genetics model. They analyzed polymorphisms of 13 genes encoded in the mouse mitochondrial genome and estimated, on average, b = 0.86. It means that the rank of the genotype–phenotype map is K = 2b = 1.72, consistent with our prediction of K = min(3, Pmin) , 3 for nucleotide sites. Since the shortterm coalescent time of the mouse population allows virtually no multiple (two or more) mutations to appear at the same nucleotide site, the rank (r) of mutational effects is likely to be ,2. In the same manner, the low-pleiotropy estimation by Martin and Lenormand (2006) was not very surprising. Since their study was based on six mutation accumulation (MA) experiments, the rank of mutational effects is virtually r = 1 so that K = 1, regardless of the real mutation pleiotropy. Our explanation is consistent with the follow-up analyses from the same research group (Martin et al. 2007).

1362

X. Gu

Concluding remarks

The exact meaning of the term effective designed for Ke in the original article (Gu 2007a) was somewhat ambiguous, which is clarified as follows. First, Ke estimated by the Gu2007 method is an estimate for the minimum pleiotropy Pmin. Second, Ke is an estimate of K = min(r, Pmin), the rank of the genotype–phenotype map. Third, what the Gu-2007 method attempted to estimate is the pleiotropy of amino acid sites, a conserved proxy to the true gene pleiotropy. Finally, Ke is a conserved estimate of K because those fitness components that are slightly affected by gene mutations have been effectively removed by the estimation procedure. One may wonder whether the Gu-2007 method can be expanded beyond single amino acid sites so that more generic states are allowable. For instance, we can develop an evolutionary model that allows coevolution of amino acid pairs or hyplotypes in the population genetics data. While it is certainly worthwhile to investigate in a future study, we suspect that the problem of microergodicity may become severe. Consequently, a huge number of sequences (or any genotype samples) or a large phylogeny is required. We may acknowledge that our main result K = min(r, Pmin) should be a principal limitation of any method to estimate pleiotropy from genetic data instead of phenotype counting. One impressive claim is that we cannot estimate the mutational pleiotropy directly from any genetic data because of r = 1 virtually. Nevertheless, some important implications of our study may be interesting for the general community of evolutionary geneticists. For instance, under the framework of the genotype–phenotype map, minimum pleiotropy is a measure of a gene-specific canalization, the reduced sensitivity of a phenotype to genetic changes, recognized by observing that most genetic changes leave the phenotypes invariant. Moreover, the rank of the genotype– phenotype map K may be the key to determining under which condition the cost of complexity may occur. Indeed, our work is only a first attempt to understand the evolution of the genotype–phenotype map.

Acknowledgments The author is grateful to David Waxman, Zhixi Su, Yanwu Zeng, and Wenhai Chen for constructive discussions and criticisms in the early version of the manuscript and also grateful to the reviewing editor and two anonymous reviewers for constructive comments that have helped to address the pleiotropy problem clearly.

Literature Cited Barton, N. H., 1990 Pleiotropic models of quantitative variation. Genetics 124: 773–782. Bataillon, T., 2000 Estimation of spontaneous genome-wide mutation rate parameters: Whither beneficial mutations? Heredity 84: 497–501.

Chen, W. H., Z. X. Su, and X. Gu, 2013 A note on gene pleiotropy estimation from phylogenetic analysis of protein sequences. J. Syst. Evol. 51: 365–369. Chevin, L. M., G. Martin, and T. Lenormand, 2010 Fisher’s model and the genomics of adaptation: restricted pleiotropy, heterogenous mutation, and parallel evolution. Evolution 64: 3213– 3231. Cooper, T. F., E. A. Ostrowski, and M. Travisano, 2007 A negative relationship between mutation pleiotropy and fitness effect in yeast. Evolution 61: 1495–1499. Dudley, A. M., D. M. Janse, A. Tanay, R. Shamir, and G. M. Church, 2005 A global view of pleiotropy and phenotypically derived gene function in yeast. Mol. Syst. Biol. 1: 2005.0001. Elena, S. F., and R. E. Lenski, 2003 Evolution experiments with microorganisms: the dynamics and genetic bases of adaptation. Nat. Rev. Genet. 4: 457–469. Eyre-Walker, A., M. Woolfit, and T. Phelps, 2006 The distribution of fitness effects of new deleterious amino acid mutations in humans. Genetics 173: 891–900. Fisher, R. A., 1930 The Genetical Theory of Natural Selection. Oxford University Press, Oxford. Gu, X., 2007a Evolutionary framework for protein sequence evolution and gene pleiotropy. Genetics 175: 1813–1822. Gu, X., 2007b Stabilizing selection of protein function and distribution of selection coefficient among sites. Genetica 130: 93–97. Gu, X., and J. Zhang, 1997 A simple method for estimating the parameter of substitution rate variation among sites. Mol. Biol. Evol. 14: 1106–1113. Hartl, D. L., and C. H. Taubes, 1996 Compensatory nearly neutral mutations: selection without adaptation. J. Theor. Biol. 182: 303–309. Hartl, D. L., and C. H. Taubes, 1998 Towards a theory of evolutionary adaptation. Genetica 102–103: 525–533. Hill, W. G., and X. S. Zhang, 2012 On the pleiotropic structure of the genotype-phenotype map and the evolvability of complex organisms. Genetics 190: 1131–1137. Keightley, P. D., 1994 The distribution of mutation effects on viability in Drosophila melanogaster. Genetics 138: 1315–1322. Kimura, M., 1983 The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge, UK. Lande, R., 1980 The genetic covariance between characters maintained by pleiotropic mutations. Genetics 94: 203–215. Lynch, M., J. Blanchard, D. Houle, T. Kibota, S. Schultz et al., 1999 Perspective: spontaneous deleterious mutation. Evolution 53: 645–663. MacLean, R. C., G. Bell, and P. B. Rainey, 2004 The evolution of a pleiotropic fitness tradeoff in Pseudomonas fluorescens. Proc. Natl. Acad. Sci. USA 101: 8072–8077. Martin, G., and T. Lenormand, 2006 A general multivariate extension of Fisher’s geometrical model and the distribution of mutation fitness effects across species. Evolution 60: 893–907. Martin, G., S. F. Elena, and T. Lenormand, 2007 Distributions of epistasis in microbes fit predictions from a fitness landscape model. Nat. Genet. 39: 555–560. Ohya, Y., J. Sese, M. Yukawa, F. Sano, Y. Nakatani et al., 2005 High-dimensional and large-scale phenotyping of yeast mutants. Proc. Natl. Acad. Sci. USA 102: 19015–19020.

Orr, H. A., 2000 Adaptation and the cost of complexity. Evolution 54: 13–20. Otto, S. P., 2004 Two steps forward, one step back: the pleiotropic effects of favoured alleles. Proc. Biol. Sci. 271: 705–714. Paaby, A. B., and M. V. Rockman, 2013 The many faces of pleiotropy. Trends Genet. 29: 66–73. Pal, C., B. Papp, and M. J. Lercher, 2006 An integrated view of protein evolution. Nat. Rev. Genet. 7: 337–348. Piganeau, G., and A. Eyre-Walker, 2003 Estimating the distribution of fitness effects from DNA sequence data: implications for the molecular clock. Proc. Natl. Acad. Sci. USA 100: 10335– 10340. Pigliucci, M., 2008 Is evolvability evolvable? Nat. Rev. Genet. 9: 75–82. Poon, A., and S. P. Otto, 2000 Compensating for our load of mutations: freezing the meltdown of small populations. Evolution 54: 1467–1479. Razeto-Barry, P., and K. Maldonado, 2011 Adaptive cis-regulatory changes may involve few mutations. Evolution 64: 3332–3335. Razeto-Barry, P., J. Diaz, D. Cotoras, and R. A. Vasquez, 2011 Molecular evolution, mutation size and gene pleiotropy: a geometric reexamination. Genetics 187: 877–885. Razeto-Barry, P., J. Diaz, and R. A. Vasquez, 2012 The nearly neutral and selection theories of molecular evolution under the Fisher geometrical framework: substitution rate, population size, and complexity. Genetics 191: 523–534. Su, Z., Y. Zeng, and X. Gu, 2010 A preliminary analysis of gene pleiotropy estimated from protein sequences. J. Exp. Zool. B Mol. Dev. Evol. 314: 115–122. Turelli, M., 1985 Effects of pleiotropy on predictions concerning mutation-selection balance for polygenic traits. Genetics 111: 165–195. Wagner, A., 2000 The role of population size, pleiotropy and fitness effects of mutations in the evolution of overlapping gene functions. Genetics 154: 1389–1401. Wagner, G. P., 1989 Multivariate mutation-selection balance with constrained pleiotropic effects. Genetics 122: 223–234. Wagner, G. P., and J. Zhang, 2011 The pleiotropic structure of the genotype-phenotype map: the evolvability of complex organisms. Nat. Rev. Genet. 12: 204–213. Wagner, G. P., J. P. Kenney-Hunt, M. Pavlicev, J. R. Peck, D. Waxman et al., 2008 Pleiotropic scaling of gene effects and the ‘cost of complexity’. Nature 452: 470–472. Waxman, D., and J. R. Peck, 1998 Pleiotropy and the preservation of perfection. Science 279: 1210–1213. Welch, J. J., and D. Waxman, 2003 Modularity and the cost of complexity. Evolution 57: 1723–1734. Wingreen, N. S., J. Miller, and E. C. Cox, 2003 Scaling of mutational effects in models for pleiotropy. Genetics 164: 1221–1228. Wright, S., 1968 Evolution and the Genetics of Populations. University of Chicago Press, Chicago. Zeng, Y., and X. Gu, 2010 Genome factor and gene pleiotropy hypotheses in protein evolution. Biol. Direct 5: 37. Zhang, X. S., and W. G. Hill, 2003 Multivariate stabilizing selection and pleiotropy in the maintenance of quantitative genetic variation. Evolution 57: 1761–1775. Communicating editor: I. Hoeschele


1363

GENETICS Supporting Information http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.114.164673/-/DC1

Pleiotropy Can Be Effectively Estimated Without Counting Phenotypes Through the Rank of a Genotype–Phenotype Map Xun Gu

Copyright © 2014 by the Genetics Society of America DOI: 10.1534/genetics.114.164673

File S1 Matlab script (KeEstimation.m.txt) for the computer simulation, and the software (GenePleio_win‐v15.rar, codes included) for the estimation of Ke Available for download as a .zip file at http://www.genetics.org/lookup/suppl/doi:10.1534/genetics.114.164673/‐/DC1

2 SI

X. Gu

How effectively can HIV phylogenies be used to measure heritability?

Sporadic spinal hemangioblastomas can be effectively treated by microsurgery alone.

Low-anxiety rat phenotypes can be further reduced through genetic intervention.

Cytokines. Can there be life without LIF?

Can pharmacological agents be used effectively in the alleviation of jet-lag?

Can a four-session biofeedback regimen be used effectively for treating children with dysfunctional voiding?

Can geriatric hip fractures be managed effectively within a level 1 trauma center?

Transpulmonary pressure and lung elastance can be estimated by a PEEP-step manoeuvre.

Can secular trends in child growth be estimated from a single cross sectional survey?

Evidence Showing That Tumors Can Grow Without Angiogenesis and Can Switch Between Angiogenic and Nonangiogenic Phenotypes.

Can SNOMED CT be squeezed without losing its shape?

Atrial fibrillation: Can electrograms be interpreted without repolarization information?

Post activation potentiation can be induced without impairing tendon stiffness.

Can reading disabilities be diagnosed without using intelligence tests?

Methicillin-resistant Staphylococcal periprosthetic joint infections can be effectively controlled by systemic and local daptomycin.

Can Observation Skills of Citizen Scientists Be Estimated Using Species Accumulation Curves?

Effective synaptic current can be estimated from measurements of neuronal discharge.

Streptococcal pharyngitis in adults: can it be efficiently and effectively managed by remote control?

How should "ambidexterity" be estimated?

Brain fragility can be estimated by its putrefactive signs on postmortem computed tomography.

Can "normal" protein expression ranges be estimated with high-throughput proteomics?

Malignant cancer and invasive placentation: A case for positive pleiotropy between endometrial and malignancy phenotypes.

DISCOVERING PATIENT PHENOTYPES USING GENERALIZED LOW RANK MODELS.

Concept selection for phenotypes and diseases using learn to rank.