Genetic Epidemiology 9:87-107 (1992)

Using Multidimensional Scaling on Data From Pairs of Relatives to Explore the Dimensionality of Categorical Multifactorial Traits

J.M. Meyer, A.C. Heath, and L.J. Eaves

Department of Human Genetics (J.M.M., L.J.E.), Medical College of Virginia, Richmond, Virginia and Departments of Psychiatry, Psychology, and Genetics (A.C.H.), Washington University School of Medicine, St. Louis, Missouri

An accurate specification of the dimensionality and ordering of categorical multifactorial phenotypes (e.g., smoking status, including heavy, moderate, light, and nonsmokers) is an important prerequisite for the genetic analysis of these traits. Typically, phenotypic dimensionality and ordering are determined by comparing the relative fits of alternative parametric threshold models. Here, a method of analysis is described which addresses the same issue of trait dimensionality but does not require parametric assumptions. Specifically, we detail how nonmetric multidimensional scaling (MDS), applied to contingency tables which cross-classify the phenotypes or responses of one relative with another, may be used to explore trait dimensionality. Scaling results from deterministic simulation studies indicate that the latent structure of categorical phenotypes can be recovered with nonmetric MDS. Results from stochastic simulations, however, indicate that the accuracy of recovery, as well as the rejection of models of incorrect dimensionality, are strongly dependent upon sample size and the latent liability correlation between relatives. As an application of the method, the dimensionality of a measure of smoking status in 1,656 pairs of monozygotic twins ascertained through the American Association of Retired Persons is considered. The MDS results indicate that the onset of the smoking habit and the quantity smoked in this aging population represent a unidimensional process. The implication this finding has for subsequent genetic analysis is discussed. © 1992 Wiley-Liss, Inc.
Key words: threshold models, contingency tables, twins, the smoking habit

Received for publication February 18, 1991; revision accepted March 17, 1992. Address reprint requests to Joanne M. Meyer, Department of Human Genetics, Medical College of Virginia, Box 3, MCV Station, Richmond, VA 23298.



INTRODUCTION

In psychiatric, medical, and behavioral genetics, the analysis of qualitative multifactorial traits is typically predicated on the assumption of a normally distributed liability scale underlying the categorical phenotypes [Falconer, 1965; Reich et al., 1972; Heath et al., 1985]. Latent thresholds, superimposed on the liability distribution, are hypothesized to divide response categories or affection states; thus the term “threshold model” is often used to describe this application. Thresholds, along with tetrachoric or polychoric correlations between the latent liability variables of relatives, may be estimated by maximum likelihood approaches [Tallis, 1962; Olsson, 1979], and the correlations between relatives may be further parameterized in terms of genetic and environmental effects in order to test particular models of cultural or biological inheritance [Eaves et al., 1978; Eaves and Eysenck, 1980; Heath et al., 1985].

When fitting threshold models to multiple (>2) response categories, it is necessary to specify both the ordering and dimensionality of the phenotypes. If the variable of interest is merely a polychotomized continuous trait (e.g., height classified as tall, average, or short), these factors are usually not at issue. However, when analyzing variables such as psychiatric diagnostic categories (e.g., major depression, bipolar I, and schizoaffective illness), severity levels, as well as dimensionality, may be disputed [Price et al., 1985]. Similarly, when studying the onset and outcome of addictive behaviors (e.g., smoking, drinking, drug taking) or the continuity of psychiatric symptoms and psychiatric disorders, dimensionality may be at issue. In these cases, it is of interest to determine whether the initiation of the habit or the presence of psychiatric symptoms is influenced by factors independent of, or identical to, those influencing the maintenance of the habit or the onset of a diagnosed disorder.
The selection of appropriate threshold models (specifying both category ordering and dimensionality) is currently guided by existing theories on particular traits or the goodness-of-fit of competing threshold models. In the latter case, a poor fit of a threshold model may reflect an erroneous specification of category ordering or dimensionality, or, alternatively, a violation of the assumption of an underlying normal liability distribution. To discount the possibility that nonnormality contributes to model rejection, it is useful to replicate the latent structure of categories using a nonparametric approach. To this end, we describe how nonmetric multidimensional scaling may be used to explore the ordering and dimensionality of categorical phenotypes. We provide simulation studies of the approach and illustrate its application by considering the dimensionality of the smoking habit in a volunteer sample of twins ascertained through the American Association of Retired Persons (AARP).

METHODS

Multidimensional Scaling: An Overview

Kruskal and Wish [1978], Young [1987], and Schiffman et al. [1987] provide extensive reviews of nonmetric multidimensional scaling (MDS), while Heath et al. [1991a] have illustrated the application of nonmetric MDS to twin data on alcohol consumption. An overview of the method and a more detailed account of its application to data from related pairs of individuals are given here.

MDS is typically applied to judged or measured indices of similarity (or dissimilarity) between n objects and results in an ordering of the objects in r independent dimensions which have been selected a priori by the data analyst. The distances between the objects in the r dimensional space are a function of their similarity, while the dimensions themselves reflect salient features of the objects which the judge wittingly or unwittingly used in determining the objects’ similarity. A simple example of an MDS analysis is the scaling of various types of fruit: given that a subject has been asked to compare each of 12 pieces of fruit to the remaining 11 types of fruit by ranking them in ascending order of similarity, an MDS analysis of the data might reveal 3 independent dimensions reflecting size, shape, and sweetness of fruit.

The first step in an MDS analysis is to arrange indices of object similarity (which have been measured or judged with or without error) in a similarity matrix ($\Delta$). Each element ($\delta_{ij}$) of the matrix reflects the confusability of the ith and jth object (when i = j, $\delta_{ij}$ = 0). With nonmetric MDS, the similarity measures are required only to be ordinal and monotonically related to the final scaled distances between objects. These scaled distances, denoted $d_{ij}$, are computed from the coordinates for the n objects in the r dimensions. Specifically, if $x_{ia}$ (or $x_{ja}$) denotes the coordinate for the ith (or jth) object in the ath dimension, then, under the Euclidean distance model (which is the most commonly used distance model in nonmetric MDS applications):

$$d_{ij} = \left[ \sum_{a=1}^{r} (x_{ia} - x_{ja})^2 \right]^{1/2} \qquad (1)$$

The coordinates for the scaled objects are normalized such that, for the ath dimension

$$\sum_{i=1}^{n} x_{ia} = 0.0 \qquad (2)$$

and over all dimensions

$$\sum_{i=1}^{n} \sum_{a=1}^{r} x_{ia}^{2} = nr \qquad (3)$$
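The distance model and the normalization of the coordinates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `normalize` and `euclidean_distances` are hypothetical, the coordinates are made up, and nothing here reproduces the internals of any MDS program.

```python
import math

def normalize(coords):
    """Center each dimension at zero and rescale so that the sum of squared
    coordinates over all n objects and r dimensions equals n * r."""
    n, r = len(coords), len(coords[0])
    means = [sum(row[a] for row in coords) / n for a in range(r)]
    centered = [[row[a] - means[a] for a in range(r)] for row in coords]
    total_ss = sum(v * v for row in centered for v in row)
    scale = math.sqrt(n * r / total_ss)
    return [[v * scale for v in row] for row in centered]

def euclidean_distances(coords):
    """Matrix of Euclidean distances d_ij between all pairs of objects."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]

# Illustrative (made-up) coordinates for 3 objects in 2 dimensions.
example = normalize([[0.0, 0.0], [3.0, 4.0], [6.0, 0.0]])
distances = euclidean_distances(example)
```

Rescaling all coordinates by a common factor changes every distance by that same factor, so the rank order of the distances, which is all that nonmetric MDS uses, is unaffected by the normalization.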

To derive the scaling solution of coordinates for n objects in r dimensions, least-squares monotonic regression is employed to find a transformation of the judged or measured similarities [$t(\delta_{ij})$] which minimizes a function (F) of the sum of the squared differences between these transformed similarities and the estimated distances between objects ($d_{ij}$). F [Eq. (4)] also involves a scaling factor, which may differ with the MDS program used for analysis (such as KYST, Kruskal et al. [1973]; POLYCON, Young [1973]; or ALSCAL, Young and Lewyckyj [1987]).

$$F = \left[ \frac{\sum_{i}^{n} \sum_{j \neq i}^{n} \left[ t(\delta_{ij}) - d_{ij} \right]^{2}}{\text{scale factor}} \right]^{1/2} \qquad (4)$$


Assessing goodness-of-fit. When the scaling factor $\sum d_{ij}^{2}$ is included in Eq. (4), the resulting expression is termed Stress. Stress is often used as a goodness-of-fit index and is usually provided by MDS software even though it specifically may not have been minimized. An additional index of fit often computed by MDS software is R², or the proportion of the total variance of the optimally scaled data which is accounted for by the model [Eq. (5)].
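A minimal sketch of the two fit indices follows. It assumes that the transformed similarities and the scaled distances are already available as two equal-length lists over the off-diagonal pairs. Stress uses the sum of squared distances as the scale factor, as stated in the text. The R² function is an assumption: it takes "proportion of variance accounted for" to mean the squared Pearson correlation between transformed similarities and distances, which may differ from the exact form of Eq. (5). Both function names are hypothetical.

```python
import math

def stress(transformed, distances):
    """Square root of the summed squared differences between transformed
    similarities and distances, divided by the sum of squared distances."""
    num = sum((t - d) ** 2 for t, d in zip(transformed, distances))
    den = sum(d * d for d in distances)
    return math.sqrt(num / den)

def r_squared(transformed, distances):
    """Assumed form of R^2: squared Pearson correlation between the
    transformed similarities and the scaled distances."""
    n = len(distances)
    mt = sum(transformed) / n
    md = sum(distances) / n
    cov = sum((t - mt) * (d - md) for t, d in zip(transformed, distances))
    var_t = sum((t - mt) ** 2 for t in transformed)
    var_d = sum((d - md) ** 2 for d in distances)
    return cov * cov / (var_t * var_d)

# Illustrative values only: a perfect fit gives Stress 0 and R^2 of 1.
perfect = stress([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

The monotonic regression step that produces the transformed similarities is not shown here; in practice it is carried out by the MDS program.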

Since a perfect fit of n points may always be achieved in n - 1 dimensions, criteria must be available to choose a more parsimonious scaling solution, that is, one with fewer than the maximum number of dimensions. Kruskal [1964], Kruskal and Wish [1978], and Young and Lewyckyj [1987] suggest that Stress, R², and the interpretation of the configuration be used in selecting an appropriate dimensionality. Typically, R² values greater than 0.90 indicate an acceptable scaling solution. The interpretation of Stress values, in contrast, differs according to the symmetry of the similarity matrix being analyzed, the amount of error in the data, and the dimensionality of the configuration in relation to the number of objects being scaled [Kruskal and Wish, 1978]. For example, Kruskal [1964] considers the simplest case of symmetric, error-free data with n > 4r and provides the following guideline for Stress values: > 0.20, a poor fit; 0.10-0.20, a fair fit; 0.05-0.10, a good fit; 0.025-0.05, a very good fit; and < 0.025, an excellent fit. For asymmetric similarity matrices or similarity indices measured with error, acceptable Stress values (i.e., Stress values derived under the true dimensionality model for the data) are generally higher than those proposed by Kruskal. Therefore, when analyzing these types of data, it is helpful to refer to simulation studies which have explored ranges of Stress values under similar conditions [Kruskal and Wish, 1978]. This approach will be used in the present application of nonmetric MDS.

Applying nonmetric MDS to data from pairs of relatives. When applying nonmetric MDS to behavioral or psychiatric data, the purpose is not to scale objects, but rather mutually exclusive response categories. If relatives correlate for phenotypes of interest, then a similarity matrix may be computed from two-way contingency tables which cross-classify the responses (or phenotypes) of one relative with another.
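As a small illustration of how a contingency table of relative pairs can be turned into similarity indices, the sketch below divides each cell frequency by the product of its row and column totals. The function name `similarity_matrix` and the 2 x 2 counts are hypothetical, and the code assumes every row and column total is nonzero.

```python
def similarity_matrix(table):
    """Similarity index for each cell of a square contingency table:
    cell frequency divided by (row total * column total).
    Assumes no row or column total is zero."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    k = len(table)
    return [[table[i][j] / (row_totals[i] * col_totals[j]) for j in range(k)]
            for i in range(k)]

# Made-up counts for 100 pairs cross-classified on a 2-category trait.
sim = similarity_matrix([[40, 10], [10, 40]])
# Concordant cells receive larger indices than discordant cells.
```

Multiplying every index by the total number of pairs would give the ratio of observed to chance-expected frequency; because only the rank order of the indices matters for nonmetric MDS, that constant can be omitted.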
To convert the contingency table into a similarity matrix, each raw cell frequency ($f_{ij}$) is divided by the product of the row and column totals ($f_{i\cdot}$ and $f_{\cdot j}$). The resulting indices are monotonically related to the increase (or decrease) in the probability of an observation occurring in the ijth cell of the contingency table over that expected by chance. It should be noted that since nonmetric MDS only requires ordinal measures of similarity, any monotonic function of $f_{ij}/(f_{i\cdot} \times f_{\cdot j})$ would suffice as a similarity index.

Limitations of the present application. As described here, nonmetric MDS may be applied only to data from a single class of relatives. Thus, the scaling configurations for monozygotic (MZ) or dizygotic (DZ) twins, for example, could be considered separately but not jointly. If genetic and environmental influences on categorical phenotypes differ in their dimensionality, then scaling solutions for MZ and DZ pairs are also expected to differ. This is the case since the off-diagonal elements of a similarity matrix are a function of environmental differences within MZ pairs, but a function of both genetic and environmental differences within DZ pairs. Further work is required to develop analytical methods which not only test the equivalency of scaling solutions across groups, but also allow the derivation of separate genetic and environmental scaling solutions.

A second limitation of the present application of nonmetric MDS is that it is uninformative when a perfect correlation exists between relatives. Under this condition, all of the off-diagonal elements of the contingency table (and resulting similarity matrix) are null and thus provide no distance information. As a result, a scaling solution cannot be determined. At the other extreme, when there is no correlation between relatives, scaling solutions will merely reflect stochastic error in the similarity metrics. Scaling configurations based on these data will be meaningless. We give further consideration to the relationship between scaling solutions and the correlation between relatives in the simulation studies.

Finally, it should be emphasized that this application of nonmetric MDS is essentially an unweighted analysis. The similarity metrics are not weighted by the amount of information on which they are based. Again, additional work is required before weights are introduced in the analysis.

Simulations: An Overview

To investigate the usefulness of nonmetric MDS in exploring the dimensionality of categorical, multifactorial traits, simulation experiments were conducted by applying MDS to data from relative pairs simulated under multifactorial threshold models and summarized in two-way contingency tables. Both deterministic and stochastic (Monte Carlo) simulations were carried out. Because the initial motivation for using MDS was to determine whether the onset of the smoking habit and the quantity smoked (given that an individual had become a smoker) were part of a single liability continuum or two independent liability continua, both types of simulations were conducted under unidimensional and bidimensional threshold models. In these models, there were 5 mutually exclusive categories for each individual of a relative pair: a “nonsmoking” category and 4 “quantity smoked” categories (Fig. 1A and B).

In the case of the unidimensional model, the approach of Eaves et al. [1978] was followed and it was assumed that there was a single normal liability distribution underlying the smoking continuum. Alternatively, for the bidimensional model [Heath et al., 1991b], it was assumed that the quantity smoked was independent of the liability to smoke, given that an individual was a smoker (i.e., the individual’s liability on the initiation dimension exceeded an initiation threshold). Again, normal liability distributions were underlying the onset of the smoking habit and the quantity smoked (conditional on the habit being adopted). Under each dimensionality model, the joint liability distribution or distributions in relative pairs were bivariate normal. Additionally, under each model, latent thresholds were superimposed on the liability distributions to divide response categories. For the unidimensional model, there were a total of 6 thresholds ($t_0, t_1, \ldots, t_5$), with $t_0 = -\infty$ and $t_5 = \infty$, and for the bidimensional model, a total of 8 thresholds, with $t_0 = t_3 = -\infty$ and $t_2 = t_7 = \infty$. The distances between thresholds reflected the expected proportion of individuals in each category.


For both the unidimensional and bidimensional simulations, a single threshold condition was used under which 50% of the population comprised nonsmokers and the remaining 50% was divided equally into 4 quantity categories. To investigate whether the correlation between relatives’ latent liabilities affected the recovery of the coordinate space of the categories, 5 liability correlations (r = 0.2, 0.4, 0.6, 0.8, and 0.95) were used for the deterministic simulations and 6 liability correlations (r = 0.0, 0.2, 0.4, 0.6, 0.8, and 0.95) were used for the stochastic simulations. (Nonmetric MDS cannot be applied to data simulated under a deterministic model with r = 0.0 due to the abundance of tied similarity indices and indeterminacy of the scaling solution.) When simulating data under the bidimensional model, the same liability correlations were used for the onset and quantity dimensions.

Fig. 1. Unidimensional (A) and bidimensional (B) threshold models for the onset of the smoking habit and the quantity smoked. [Figure not reproduced. Panel A labels: never smoked; 1-5, 6-10, 11-20, 21+; axis: quantity (cigarettes per day). Panel B labels: never smoked; smokes; 1-5, 6-10, 11-20, 21+; axis: quantity (cigarettes per day).]

Deterministic simulations. For the deterministic simulations under the unidimensional model, two-way contingency tables were generated by integrating the bivariate liability distribution between thresholds to obtain the probability of an observation falling in the ijth cell of a two-way 5 × 5 contingency table cross-classifying relatives’ 5 smoking categories. This 5 × 5 matrix of probabilities was denoted A, and the probability of an observation falling in the ijth cell ($a_{ij}$) of the matrix was equal to

$$a_{ij} = \int_{t_{i-1}}^{t_{i}} \int_{t_{j-1}}^{t_{j}} \phi(x_1, x_2; r)\, dx_2\, dx_1 \qquad (6)$$

where $\phi(x_1, x_2; r)$ was the bivariate normal probability density function of latent liability values $x_1$ and $x_2$ with correlation r.

For deterministic simulations under the bidimensional model, the 5 × 5 probability matrix cross-classifying relatives’ smoking categories (A) was computed from a 2 × 2 probability matrix B (containing the unconditional probabilities of relative pairs in each smoking/nonsmoking cross-classification) and a 4 × 4 probability matrix C (containing the probabilities of relative pairs in each of the quantity smoked cross-classifications, conditional on the pairs being smokers). Specifically, if both relatives were nonsmokers, then $a_{11} = b_{11}$. If relative 1 was a smoker and relative 2 a nonsmoker, then

$$a_{1j} = b_{12} \times \sum_{m=1}^{4} c_{mn} \qquad (7)$$

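where j = 2, 3, ..., 5 and n = j - 1. Conversely, when relative 1 was a nonsmoker and relative 2 a smoker, then

$$a_{i1} = b_{21} \times \sum_{n=1}^{4} c_{mn} \qquad (8)$$

where i = 2, 3, ..., 5 and m = i - 1, and when both relatives were smokers,

$$a_{ij} = b_{22} \times c_{mn} \qquad (9)$$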
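The cell probabilities of Eqs. (6) to (9) can be approximated with a short standard-library Python sketch. Several things here are assumptions rather than the authors' implementation: Simpson's rule replaces the IMSL integration routines, infinite limits are truncated at ±8 standard deviations, thresholds are derived from the stated category proportions (50% nonsmokers and four equal quantity categories, with quantity categories taken as equal quarters of the smokers in the bidimensional case), and all function names are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def rect_prob(lo1, hi1, lo2, hi2, r, steps=1000):
    """P(lo1 < x1 < hi1, lo2 < x2 < hi2) for a standard bivariate normal
    with correlation r (|r| < 1), using Simpson's rule over x1."""
    a, b = max(lo1, -8.0), min(hi1, 8.0)  # truncate infinite limits
    s = math.sqrt(1.0 - r * r)

    def cdf(z):
        if z == math.inf:
            return 1.0
        if z == -math.inf:
            return 0.0
        return STD.cdf(z)

    def g(x):  # density of x1 times conditional probability for x2
        return STD.pdf(x) * (cdf((hi2 - r * x) / s) - cdf((lo2 - r * x) / s))

    h = (b - a) / steps
    total = g(a) + g(b)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * g(a + k * h)
    return total * h / 3.0

def thresholds(proportions):
    """Thresholds (including -inf and +inf) that cut a standard normal
    liability into categories with the given proportions."""
    cuts, cum = [-math.inf], 0.0
    for p in proportions[:-1]:
        cum += p
        cuts.append(STD.inv_cdf(cum))
    return cuts + [math.inf]

def cell_probs(t, r):
    """Rectangle probabilities between successive thresholds, as in Eq. (6)."""
    k = len(t) - 1
    return [[rect_prob(t[i], t[i + 1], t[j], t[j + 1], r) for j in range(k)]
            for i in range(k)]

def unidimensional(r):
    return cell_probs(thresholds([0.5, 0.125, 0.125, 0.125, 0.125]), r)

def bidimensional(r):
    B = cell_probs(thresholds([0.5, 0.5]), r)   # onset: nonsmoker / smoker
    C = cell_probs(thresholds([0.25] * 4), r)   # quantity, given smoking
    A = [[0.0] * 5 for _ in range(5)]
    A[0][0] = B[0][0]
    for j in range(1, 5):
        A[0][j] = B[0][1] * sum(C[m][j - 1] for m in range(4))  # Eq. (7)
    for i in range(1, 5):
        A[i][0] = B[1][0] * sum(C[i - 1][n] for n in range(4))  # Eq. (8)
        for j in range(1, 5):
            A[i][j] = B[1][1] * C[i - 1][j - 1]                 # Eq. (9)
    return A
```

Calling `unidimensional(0.6)` or `bidimensional(0.6)` returns a 5 × 5 matrix whose entries sum to approximately 1; the first row and column correspond to nonsmokers.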

Elements of B and C were obtained by integrating the “onset” liability dimensions and the “quantity” liability dimensions between the thresholds of the bivariate normal distributions [as in Eq. (6)]. A FORTRAN program was written for both the unidimensional and bidimensional deterministic simulations. It employed subroutines from the IMSL software library [IMSL, Inc., 1987] for the purposes of integration.

Stochastic simulations. For the stochastic simulations under the unidimensional model, a FORTRAN program (with IMSL [1987] subroutines) was written to generate N (100, 300, or 500) random pairs of deviates from a standardized bivariate normal distribution with correlation r. Thresholds were then superimposed on the random deviates in order to assign each pair to a cell of a 5 × 5 contingency table. For the bidimensional model, the thresholds of the “onset” liability distribution were first superimposed on N pairs of random bivariate normal deviates (with correlation r) to generate a 2 × 2 contingency table, B′, cross-classifying the onset status of relative 1 with that of relative 2. Smokers ($b'_{22}$ pairs of relatives, and $b'_{12}$ and $b'_{21}$ individuals) were reclassified into “quantity smoked” categories by generating the corresponding number of random bivariate or univariate normal deviates and placing them in quantity categories by the superimposition of thresholds.
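The Monte Carlo procedure can be sketched with Python's standard library in place of the FORTRAN and IMSL routines. This is an illustration under assumptions: function names are hypothetical, thresholds follow the 50% nonsmoker and equal-quantity-category condition described above, and for discordant pairs a bivariate quantity pair is drawn and only the smoker's component is used, which has the same marginal distribution as a univariate standard normal deviate.

```python
import math
import random
from bisect import bisect_right
from statistics import NormalDist

STD = NormalDist()
UNI_CUTS = [STD.inv_cdf(p) for p in (0.5, 0.625, 0.75, 0.875)]
ONSET_CUTS = [STD.inv_cdf(0.5)]
QUANTITY_CUTS = [STD.inv_cdf(p) for p in (0.25, 0.5, 0.75)]

def bivariate_pair(r, rng):
    """One pair of standard normal deviates with correlation r."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return z1, r * z1 + math.sqrt(1.0 - r * r) * z2

def category(x, cuts):
    """Index of the category containing liability x, given interior thresholds."""
    return bisect_right(cuts, x)

def simulate_unidimensional(n_pairs, r, rng):
    table = [[0] * 5 for _ in range(5)]
    for _ in range(n_pairs):
        x1, x2 = bivariate_pair(r, rng)
        table[category(x1, UNI_CUTS)][category(x2, UNI_CUTS)] += 1
    return table

def simulate_bidimensional(n_pairs, r, rng):
    table = [[0] * 5 for _ in range(5)]
    for _ in range(n_pairs):
        o1, o2 = bivariate_pair(r, rng)   # onset liabilities
        smokes1 = category(o1, ONSET_CUTS) == 1
        smokes2 = category(o2, ONSET_CUTS) == 1
        q1, q2 = bivariate_pair(r, rng)   # quantity liabilities
        i = 1 + category(q1, QUANTITY_CUTS) if smokes1 else 0
        j = 1 + category(q2, QUANTITY_CUTS) if smokes2 else 0
        table[i][j] += 1
    return table

rng = random.Random(1992)  # fixed seed, chosen arbitrarily for repeatability
uni_table = simulate_unidimensional(300, 0.6, rng)
bi_table = simulate_bidimensional(300, 0.6, rng)
```

Each call returns a 5 × 5 table of counts with category 0 for nonsmokers; repeating the call 50 times per condition would mirror the replication scheme described in the text.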


Multidimensional scaling and goodness-of-fit. The program ALSCAL [Young et al., 1978; Young and Lewyckyj, 1987], available as an SPSS-X [SPSS Inc., 1983] routine, was used for the MDS analyses of the similarity matrices derived from the simulated contingency tables. The component of ALSCAL which utilizes classical MDS techniques [CMDS; Schiffman et al., 1987] was used for all analyses. CMDS methods assume a EUCLIDEAN distance model and are applied to single square (two-way) matrices of similarities. Data from the deterministic simulations were specified as being ORDINAL, SIMILAR, and SYMMETRIC (due to the equality of $\delta_{ij}$ and $\delta_{ji}$), while those from the stochastic simulations were ORDINAL, SIMILAR, and ASYMMETRIC. When employing CMDS techniques, ALSCAL handles asymmetries by using $\delta_{ij}$ and $\delta_{ji}$ as repeated measures of the same similarity.

To summarize the deterministic simulation results, values of Stress and R² (both provided by ALSCAL) were recorded under the true and false (bidimensional or unidimensional) models. For the stochastic simulations, mean Stress and R² values under the true and false models were calculated from 50 sets of data simulated under each correlation/sample size/dimensionality condition. Additionally, for each stochastic simulation, the Pearson product-moment correlation of the recovered categorical distances and the true distances was computed. The true distance was obtained from the corresponding deterministic simulation scaling solution. The average value of the Pearson product-moment correlation for the 50 trials under each correlation/sample size/dimensionality condition was calculated by first transforming each correlation to Fisher’s z, averaging the z-transforms, and then transforming this average to a correlation coefficient.
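The averaging of correlations through Fisher's z can be written compactly. The sketch below assumes all input correlations lie strictly between -1 and 1 (the transform is undefined at ±1); the function name and the example values are hypothetical.

```python
import math

def mean_correlation(correlations):
    """Average correlations by transforming each to Fisher's z (atanh),
    averaging the z values, and transforming the mean back (tanh).
    Inputs must lie strictly between -1 and 1."""
    zs = [math.atanh(c) for c in correlations]
    return math.tanh(sum(zs) / len(zs))

# Illustrative values only, not results from the simulations.
average = mean_correlation([0.9, 0.5])
```

Because the z-transform stretches values near ±1, the back-transformed mean of unequal correlations lies somewhat further from zero than their arithmetic mean.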
Using the values of Stress and R², as well as the correlation between true and recovered distances, we identified conditions under which nonmetric MDS rendered informative scaling solutions and led to the discrimination of competing dimensionality models.

Application

Nonmetric MDS was applied to data on the smoking habits of monozygotic (MZ) twins who were ascertained through advertisements in publications of the American Association of Retired Persons (AARP). The smoking data were collected as part of a Health and Lifestyle questionnaire completed by 3,740 twin pairs between 1985 and 1986. To minimize the effect of cohort differences in these analyses, we limited our sample to the 3,499 complete twin pairs born before 1955. These twins had a mean age of 63.2 ± 10.9 years. Zygosity was diagnosed through questionnaire items regarding physical similarity and confusion in recognition by others. When compared to blood typing, this method of zygosity determination has been found to be 95% accurate in other twin populations [Nichols and Bilbro, 1966; Kasriel and Eaves, 1976]. Zygosity diagnoses indicated that there were 1,238 pairs of female and 418 pairs of male MZ twins in the AARP sample.

On the Health and Lifestyle questionnaire, there were two items regarding the twins’ smoking history. In the first item, the twins were asked if they had (1) never smoked, (2) used to smoke but had given it up, (3) smoked on and off, or (4) smoked most of their life. Those who endorsed (1) were classified as “nonsmokers.” In the second item, twins who had ever smoked or currently smoked were asked to note the best estimate of their daily cigarette consumption (or equivalent in pipefuls or cigars).


The responses for this item included (1) 1-5 per day, (2) 6-10 per day, (3) 11-20 per day, (4) 21-40 per day, and (5) more than 40 per day. Due to low frequencies in the final quantity category (3% in males and 0.8% in females), it was combined with category (4) to yield a “smokes 21 or more cigarettes per day” category.

Nonmetric MDS was applied to the similarity matrices of the 5 smoking categories for MZ male and female twins separately. The ALSCAL procedure in SPSS-X was used for the analyses. Stress and R² values, the interpretation of the scaling configuration, and results from the simulation studies were used to select an appropriate scaling solution.

Multidimensional scaling results were also compared to multifactorial threshold model-fitting results. Parametric unidimensional and bidimensional models were fit by maximum likelihood methods outlined by Eaves and Eysenck [1980] and Heath et al. [1991b]. These methods obtain estimates of liability thresholds and correlations by maximizing the log-likelihood of observing a contingency table under the unidimensional or bidimensional threshold model. This log-likelihood is equal to

$$\ln L = c + \sum_{i} \sum_{j} f_{ij} \ln a_{ij} \qquad (10)$$

where c is a constant; $f_{ij}$, the observed frequency of twin pairs in the ijth cell of the contingency table; and $a_{ij}$, the ijth cell probability. Elements of A were calculated from Eq. (6) for the unidimensional model and Eqs. (7-9) for the bidimensional model.

The standard errors of liability correlations and thresholds were obtained from the sampling covariance matrix of the parameter estimates. This matrix is derived as the inverse of the Fisher Information Matrix (I), with its m,nth element given as [Tallis, 1962; Olsson, 1979]

where $\theta_m$ (or $\theta_n$) denotes the mth (or nth) element of the parameter vector of liability correlation(s) and thresholds. The goodness-of-fit of the unidimensional or bidimensional model was evaluated with a chi-squared test statistic, C, where

and $e_{ij}$ is the expected i,jth cell frequency. The statistic C has (n² - 1 - k) degrees of freedom, with n being the number of categories in the contingency table and k the number of parameters estimated. For the unidimensional model, 4 thresholds and 1 correlation were estimated; for the bidimensional model, 4 thresholds and 2 correlations were estimated.

Although the unidimensional model has one fewer parameter than the bidimensional model, the former is not a submodel of the latter and thus the two cannot be compared directly through a likelihood ratio chi-square test. Instead, we compare the fit of the two models using Akaike’s Information Criterion (AIC) [Akaike, 1970] and the average absolute deviation between observed and fitted cell counts, as a proportion of the expected cell count. AIC is calculated as chi-square minus twice the model’s degrees of freedom. The model with the lowest AIC is accepted as the best model.

RESULTS

Deterministic Simulations

Stress and R² values for (true and false) models fit to the deterministic data are shown in Table I. Stress values are all low (≤ 0.080) and R² values all high (≥ 0.981) when the correct model is fit to the simulated data. Further, when the false bidimensional model is fit to data generated under a unidimensional process, the values of Stress and R² improve only slightly, indicating that the bidimensional model does not provide a much better fit to the data than the unidimensional model. In contrast, when the false unidimensional model is fit to data generated under a bidimensional process, Stress values increase, and R² values decrease markedly. These changes in Stress and R² are notably less when r = 0.95. This finding reflects the decrease in distance information when the correlation between relatives is high and there are several null cells in the similarity matrix. Still, if an R² value of 0.90 is required for an acceptable scaling solution, then none of the unidimensional models fit to data simulated under a bidimensional process is acceptable.

TABLE I. Stress and R² Values Obtained by Applying Nonmetric MDS to Data Simulated Under Unidimensional and Bidimensional Models: The Deterministic Case*

                  Liability     Unidimensional fit    Bidimensional fit
True model        correlation   Stress     R²         Stress     R²
Unidimensional    0.95          0.009      1.000      0.003      1.000
                  0.80          0.013      0.999      0.006      1.000
                  0.60          0.001      1.000      0.001      1.000
                  0.40          0.013      0.999      0.000      1.000
                  0.20          0.080      0.981      0.005      1.000
Bidimensional     0.95          0.182      0.840      0.004      1.000
                  0.80          0.372      0.605      0.003      1.000
                  0.60          0.406      0.484      0.001      1.000
                  0.40          0.402      0.487      0.001      1.000
                  0.20          0.400      0.489      0.003      1.000

*Results under the true dimensionality models (the unidimensional fit for the first five rows and the bidimensional fit for the last five rows) are shown in bold type in the original.

To illustrate the recovery of categorical distances from the simulated data, we show the scaling solutions for data generated under unidimensional (Fig. 2A) and bidimensional models (Fig. 2B). The latent liability correlations were 0.6 in both cases. In the bidimensional solution, the nonsmokers define a smoking vs. nonsmoking dimension, but lie in the middle of the quantity smoked dimension. This result highlights the independence of the adoption of the habit, and the quantity smoked once the habit has been adopted, under the bidimensional model.

Fig. 2. Scaling results obtained by applying nonmetric MDS to deterministic data simulated under multifactorial threshold models. Results are shown for a unidimensional model (A) and a bidimensional model (B). A, nonsmokers; B, 1-5 cigarettes/day; C, 6-10 cigarettes/day; D, 11-20 cigarettes/day; E, 21+ cigarettes/day.

Stochastic Simulations The bidimensional model. The mean values of R2 obtained when true (bidimensional) and false (unidimensional) models are fit to data simulatedunder the bidimensional model are shown in Figure 3a. R2 values are given for a range of liability correlation values and sample sizes. As the figure reveals, when the true model is fit to the data, the proportion of variance of the similarity indices accounted for by the scaled distances increases as the stochastic error decreases (i.e., as N becomes large) and as the correlation between the relatives' liability values increases up to 0.80. The increase in R2 becomes progressively smaller with increasing values of Y, and, as Y becomes quite large (0.95), R2 decreases. This decrease reflects the loss of information in the offdiagonal elements of the similarity matrix as a correlation of unity is approached. If R2 values of 0.90 or greater are required for accepting a scaling solution, then acceptable scaling solutions under the bidimensional model will result if liability correlations are greater than or equal to 0.4 and sample sizes are as large or larger than 300 pairs of relatives. For sample sizes of 100relative pairs, liability correlations must

Meyer et al.

98

1.oo

0.90 0.80 N

CT

0.70

0.60

0.50

1-00

r=O2 r=O4 r=O6

r.08

r=O95

WOO

r.02

100

r-06

1-04

7-00

r=O95

r-00 r-02 r-04

r.0 6 r-08

fa95

500

300

Number of relative pairs A

I-

0.10

l l i i l l i i r=OO

r=O2

r=O4

r=O6

100

B

r=O 8

kO95

(-00

I r=O2

k04

r=O6 r=O8

r=O95

I r-00

1 r-02

300

1 r-04

1 1-06

1

r-08

1

r=O95

500

Number of relative pairs

Fig. 3. Mean values of R2 (A) and Stress (B) for 50 sets of similarity matrices simulated (by Monte Car10 methods) under bidirnensional threshold models. RZ and Stress values are given for a range of sample sizes and liability correlations (r). In both (A) and (B),shaded bars indicate mean values under a (false) unidimensional model, while solid white bars indicate mean values under the (true) bidimensional model. and Stress (C0.20;for symmetric similarity matrices Dashed lines indicate acceptable values of R2(30.90) measured without error). Standard errors are indicated on each bar.

MultidimensionalScaling

99

be high ( 3 0.60 and < 0.95) for R2 to exceed 0.90. When the false unidimensional model is fit to the simulated data, none of the scaling solutions is acceptable using the R2 criterion. In Figure 3b, mean Stress values for all sample sizeicorrelation conditions are shown under the true bidimensional models and false unidimensional models. Under the true model, Stress decreases with increasing sample size and increasing liability correlations up to and including 0.80. However, when r = 0.95, Stress values increase and exceed those obtained under the r = 0.80 condition. This finding parallels the decrease in R2 as a liability correlation of unity is approached. Using Kruskal’s [ 19641 goodness-of-fit criteria, the mean Stress values for the majority of the bidimensional solutions fall in the fair to poor range, with the exception of the values for data generated under the r = 0.0 and 0.20 conditions, which fall in the unacceptable range. The finding that none of the scaling solutions provides a good or excellent fit to the data according to Kruskal’s criteria for symmetric, error free data (Stress < 0.05), indicates that his criteria are too stringent for this application. Instead, we suggest that Stress values ranging from 0.10 to 0.15 indicate an acceptable scaling solution for these asymmetric similarity matices with measurement error. This range of Stress values corresponds well to R2 values greater than 0.90. Using this guideline, all of the unidimensional solutions are clearly rejected. Stress values ranging from 0.20-0.25 should be viewed with extreme caution, since these can result when bidimensional models are fit to data with no distance information (r = 0.0). An additional way of evaluating the scaling solution obtained from simulated similarity matrices is to compute the correlation between the true category distances and 0.9

[Figure 4: line plot of the correlation between true and scaled distances (vertical axis, 0.0 to 0.9) against the liability correlation between relatives (horizontal axis, 0.0 to 1.0), with separate curves for N = 100, N = 300, and N = 500.]

Fig. 4. Average correlations (from 50 trials under each liability condition) between true and scaled categorical distances. Scaled distances are derived from applying nonmetric MDS to similarity matrices simulated under bidimensional threshold models by Monte Carlo methods. True distances are derived from applying nonmetric MDS to deterministic similarity matrices.


the recovered distances. The average correlations under the true bidimensional model (for each sample size/liability condition) are shown in Figure 4. The figure illustrates that the correlation between the true and scaled distances is at a maximum (> 0.90) when the liability correlation ranges from 0.6 to 0.8 and the sample size is at least 300 pairs of relatives. These conditions correspond well to those with “acceptable” R2 and Stress values. For small samples (N = 100), the correlation between true category distances and scaled distances never exceeds 0.80. For this reason, we would consider the solutions based on such samples to be unacceptable. When the sample size is increased by 200 pairs (N = 300), the corresponding increase in correlation values is marked; however, a further increase in sample size (N = 500) improves the distance recovery only slightly.

The unidimensional model. In Figures 5A and 5B, bar graphs of mean R2 and Stress values are shown for data generated under a unidimensional model. Although the general trends in the graphs are similar to those in Figure 3, it can be seen that, under the true model, the values of R2 are smaller and the values of Stress are greater than the corresponding values for the bidimensional case. As a result, if the selection of a scaling solution were based on Stress values less than 0.15 and R2 values greater than 0.90, then only the unidimensional solutions for large sample sizes (N = 500 relative pairs) and moderate to large correlations (0.60 to 0.80), or a moderate sample size (N = 300) and a high correlation (r = 0.80), would be acceptable. The Stress values under the (false) bidimensional model (also shown in Fig. 5B) are all less than 0.20 (with the exception of the r = 0.0 condition) and most similar to the (true) unidimensional values when 0.60 ≤ r ≤ 0.80 and N ≥ 300. The difference between the Stress values for the two models becomes increasingly greater as r and N decrease or as r approaches unity. Consequently, a bidimensional model is more likely to be erroneously favored over a unidimensional model when the sample sizes or the correlations are small, or when the correlation is quite large (0.95).

In Figure 6, the mean correlations between the true and recovered category distances (under the unidimensional model) are shown. The graph is comparable to that for the bidimensional case (shown in Fig. 4): the correlations are at a maximum (approximately 0.95) when the liability correlation used for the simulation ranges from 0.6 to 0.8 and N ≥ 300 relative pairs. The similarity between Figures 4 and 6 highlights the finding that, under the true model, the accuracy of distance recovery is not as dependent on dimensionality as are the Stress and R2 values.

Application to the Smoking Habit
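The Stress and R2 indices used to judge the scaling solutions can be illustrated with a minimal sketch. It assumes Kruskal's Stress formula 1 and takes R2 to be the squared Pearson correlation between the fitted distances and the disparities (the monotonically transformed similarities); the text does not spell out either definition, so both are assumptions, as are the function names and the toy numbers.

```python
import math


def stress1(distances, disparities):
    """Kruskal's Stress formula 1 (assumed definition).

    distances:   fitted interpoint distances from the MDS configuration
    disparities: monotonically transformed similarities (same order)
    """
    num = sum((d - dh) ** 2 for d, dh in zip(distances, disparities))
    den = sum(d ** 2 for d in distances)
    return math.sqrt(num / den)


def r_squared(distances, disparities):
    """Squared Pearson correlation between distances and disparities
    (assumed definition of R2)."""
    n = len(distances)
    mean_d = sum(distances) / n
    mean_h = sum(disparities) / n
    cov = sum((d - mean_d) * (h - mean_h)
              for d, h in zip(distances, disparities))
    var_d = sum((d - mean_d) ** 2 for d in distances)
    var_h = sum((h - mean_h) ** 2 for h in disparities)
    return cov * cov / (var_d * var_h)


# Hypothetical toy values, not taken from the twin data.
toy_distances = [1.0, 2.0, 3.0, 4.0]
toy_disparities = [1.0, 2.5, 2.5, 4.0]

print(round(stress1(toy_distances, toy_disparities), 3))
print(round(r_squared(toy_distances, toy_disparities), 3))
```

In this toy case the two middle disparities are tied, which is what a monotone regression produces when the rank order of the data and of the fitted distances disagree; the tie lowers R2 and raises Stress.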

The contingency tables of smoking responses for monozygotic twins in the AARP Twin Registry are provided in the Appendix. A bidimensional nonmetric MDS solution for the male and female data yielded Stress values of 0.032 and 0.023, respectively, while a one-dimensional scaling solution yielded corresponding Stress values of 0.033 and 0.100. Thus all values fell within or below the acceptable Stress range defined by our simulation studies. R2 values were similarly indicative of good fit: for the bidimensional solution, male and female R2 values were 0.998 and 0.999, while for the unidimensional solution, they were 0.997 and 0.976. The unidimensional and bidimensional scaling solutions for male and female twins are shown in Figure 7. In both cases, the ordering of the coordinates for the unidimensional solution is as expected under the single liability dimension model. In


[Figure 5: bar charts. Panel A: mean R2 (vertical axis, 0.50 to 1.00). Panel B: mean Stress (vertical axis, 0.10 to 0.40). Horizontal axis in both panels: number of relative pairs (100, 300, 500), with bars for liability correlations r = 0.0, 0.2, 0.4, 0.6, 0.8, and 0.95 within each sample size.]

Fig. 5. Mean values of R2 (A) and Stress (B) for 50 sets of similarity matrices simulated (by Monte Carlo methods) under unidimensional threshold models. R2 and Stress values are given for a range of sample sizes and liability correlations (r). In both (A) and (B), shaded bars indicate mean values under a (true) unidimensional model, while solid white bars indicate mean values under the (false) bidimensional model. Dashed lines indicate acceptable values of R2 (≥0.90) and Stress (≤0.20; for symmetric similarity matrices measured without error). Standard errors are indicated on each bar.


[Figure 6: line plot of the correlation between true and scaled distances (vertical axis) against the liability correlation between relatives (horizontal axis, 0.0 to 1.0), with separate curves for N = 100, N = 300, and N = 500.]

Fig. 6. Average correlations (from 50 trials under each liability condition) between true and scaled categorical distances. Scaled distances are derived from applying nonmetric MDS to similarity matrices simulated under unidimensional threshold models by Monte Carlo methods. True distances are derived from applying nonmetric MDS to deterministic similarity matrices.

contrast, neither bidimensional solution is interpretable in terms of the model outlined previously; that is, there are no distinct quantity and onset dimensions. For both male and female MZ twin pairs, the differences in Stress and R2 values between the unidimensional and bidimensional solutions are very small in comparison to the differences suggested by the simulation studies under a bidimensional model. Instead, the differences are consonant with simulation results for large sample sizes under the unidimensional model. This, along with the finding that the bidimensional solutions are difficult to interpret in terms of the hypothesized model, leads to the inference that the onset of smoking and the quantity smoked in the AARP twin sample represent a unidimensional process.

It is of interest to compare the nonmetric MDS results to those obtained from parametric model-fitting. In Table II, goodness-of-fit statistics and estimates of polychoric and tetrachoric correlations under unidimensional and bidimensional threshold models are shown. (In the Appendix, expected contingency table cell frequencies are given for both models.) Chi-square goodness-of-fit indices indicate that all models fail to account for the observed data. The bidimensional models do, however, offer improvement over the unidimensional models, as indexed by Akaike’s Information Criterion (AIC) and the average absolute deviation (AAD) between observed and fitted cell counts. From the parametric model-fitting results alone, one would either reject both dimensionality models on the grounds of their poor fit, or favor the bidimensional model because of its improvement over the unidimensional model. In the latter case, a genetic analysis (with the inclusion of dizygotic twin pairs) would subsequently be performed on two smoking dimensions. The magnitude of the MZ correlations (and


[Figure 7: scaling configurations for the five smoking categories (A to E). Unidimensional scaling solutions: one panel each for MZ males and MZ females, with categories placed on a single axis running from -2.0 to 2.0. Bidimensional scaling solutions: one panel each for MZ males and MZ females, with both axes running from -2.0 to 2.0.]

Fig. 7. Multidimensional scaling solutions for the AARP smoking data. A, nonsmokers; B, 1-5 cigarettes/day; C, 6-10 cigarettes/day; D, 11-20 cigarettes/day; E, 21+ cigarettes/day.

their standard errors) leads us to suggest that the results of a bidimensional genetic analysis would be different from those of a unidimensional analysis. Specifically, in a bidimensional genetic analysis, the familial aggregation (due to genetic and shared environmental effects) for the onset of the smoking habit would likely appear to be high (since rMZ males = 0.83, rMZ females = 0.76) and significantly larger than that for the quantity smoked (since rMZ males = 0.40, rMZ females = 0.59). In the unidimensional analysis, this distinction would not be made and the familial aggregation for the unitary dimension would be moderately high (rMZ males = 0.69, rMZ females = 0.70).
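The comparison of scaling solutions reported in this section can be retraced with a short sketch. It takes the Stress and R2 values quoted above for the AARP twins, applies the guideline from the simulations (Stress below 0.15 and R2 above 0.90), and reports how much the second dimension improves fit. The dictionary layout, constant names, and function names are illustrative assumptions, and the sketch performs no statistical test.

```python
# Reported fit values for the AARP MZ twins (from the text above).
# Keys 1 and 2 are the number of dimensions in the scaling solution.
FITS = {
    "males": {
        1: {"stress": 0.033, "r2": 0.997},
        2: {"stress": 0.032, "r2": 0.998},
    },
    "females": {
        1: {"stress": 0.100, "r2": 0.976},
        2: {"stress": 0.023, "r2": 0.999},
    },
}

# Guideline suggested by the simulation studies.
STRESS_MAX = 0.15
R2_MIN = 0.90


def acceptable(fit):
    """True if a solution meets both the Stress and the R2 guideline."""
    return fit["stress"] < STRESS_MAX and fit["r2"] > R2_MIN


def summarize(fits):
    """For each sex, flag acceptability and the gain from a second dimension."""
    summary = {}
    for sex, by_dim in fits.items():
        uni, bi = by_dim[1], by_dim[2]
        summary[sex] = {
            "uni_acceptable": acceptable(uni),
            "bi_acceptable": acceptable(bi),
            "stress_drop": round(uni["stress"] - bi["stress"], 3),
            "r2_gain": round(bi["r2"] - uni["r2"], 3),
        }
    return summary


for sex, row in summarize(FITS).items():
    print(sex, row)
```

All four solutions meet the guideline, so the choice between one and two dimensions rests on the size of the improvement and on interpretability, as argued in the text.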

DISCUSSION

The present study indicates that, under certain conditions, nonmetric multidimensional scaling applied to similarity data from relative pairs is a useful method of exploring


TABLE II. Parametric Threshold Model-Fitting Results for the AARP Twin Smoking Data*

Number of dimensions: 1
Monozygotic males (N = 418): X2, df, P, AIC, AAD, r1, r2
Monozygotic females (N = 1,238): X2, df, P, AIC, AAD, r1, r2

52.54 19
