This article was downloaded by: [Central Michigan University] On: 31 December 2014, At: 07:40 Publisher: Routledge Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Multivariate Behavioral Research Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/hmbr20

The Behavior Of Number-Of-Factors Rules With Simulated Data A. Ralph Hakstian, W. Todd Rogers & Raymond B. Cattell Published online: 10 Jun 2010.

To cite this article: A. Ralph Hakstian, W. Todd Rogers & Raymond B. Cattell (1982) The Behavior Of Number-Of-Factors Rules With Simulated Data, Multivariate Behavioral Research, 17:2, 193-219, DOI: 10.1207/s15327906mbr1702_3 To link to this article: http://dx.doi.org/10.1207/s15327906mbr1702_3


Downloaded by [Central Michigan University] at 07:40 31 December 2014


Multivariate Behavioral Research, 1982, 17, 193-219

THE BEHAVIOR OF NUMBER-OF-FACTORS RULES WITH SIMULATED DATA
A. RALPH HAKSTIAN and W. TODD ROGERS
University of British Columbia
RAYMOND B. CATTELL
University of Hawaii


ABSTRACT

Issues related to the decision of the number of factors to retain in factor analysis are identified, and three widely-used decision rules (the Kaiser-Guttman, scree, and likelihood ratio tests) are isolated for empirical study. Using two differing structural models and incorporating a number of relevant independent variables (such as number of variables, ratio of number of factors to number of variables, variable communality levels, and factorial complexity), the authors simulated 144 population data sets and, then, from these, 288 sample data sets, each with a precisely known (or incorporated) number of factors. The Kaiser-Guttman and scree rules were applied to the population data in Part I of the study, and all three rules were applied to the sample data sets in Part II. Overall trends and interactive results, in terms of the independent variables examined, are discussed in detail, and methods are presented for assessing the quality of the number-of-factors indicated by a particular rule.

In an earlier study, Hakstian and Muller (1973) presented the underlying views, models, and bases of inference in factor analysis. Four major rationales on which to base the number-of-factors decision were delineated: (1) algebraic, (2) psychometric, (3) statistical, and (4) importance or interpretability. These rationales and the number-of-factors rules emanating therefrom were discussed at length and, therefore, are not discussed further here. In addition, a large number of earlier articles concerned with the number-of-factors problem, from Hoel's (1937) statistical procedure for component analysis and Ledermann's (1937) early work on algebraic bounds to Kaiser's (1970) more recent insights, were noted and in some cases discussed. The reader is referred, therefore, to this earlier paper for more perspective on the present study. In a more recent paper (Crawford, 1975), the importance or interpretability rationale has been extended, and, finally, Cattell and Vogelmann (1977) investigated two methods in common use for deciding on the number of factors.

Several of the number-of-factors rules included in the comparative study of Hakstian and Muller (1973) are not widely used, and one, connected with Kaiser's (1970) "Second Generation Little Jiffy," has since been rejected by its developer (Kaiser & Rice, Note 1). It appears true that three rules for deciding on the number of factors to retain are used almost universally today: (1) the Kaiser-Guttman rule, i.e., the number of eigenvalues of R greater than unity (hereafter referred to as the K-G rule), (2) Cattell's (1966) scree test, and (3) the statistical likelihood ratio criterion employed with maximum likelihood factor analysis. Of these three rules, the K-G is used in the vast majority of empirical factor analytic studies reported in the literature. The scree test has also been used in a large number of studies, but it has proved to be the most equivocal of the three methods, being applied somewhat differently by different investigators. Some researchers interpret the rule to mean simply finding the first large break in the plot of eigenvalues of R, proceeding from the largest downward, whereas others (most notably Cattell) proceed by seeking areas in the plot in which the eigenvalues describe a steady, relatively flat descent, generally beginning from the low end of the plot and working upwards. This latter procedure, which Cattell (1966) originally proposed, constitutes the operationalization of the scree test in the present paper. The third rule noted above, the likelihood ratio test associated with maximum likelihood (or generalized least squares) factor analysis, has been used in a small number of reported substantive studies. These studies can be characterized as demonstrating a generally higher level of precision and care throughout their execution than most factorial studies. Although grounded in precise statistical principles, this latter criterion has been found to possess certain defects for the empirical researcher, most notably a dependence upon the sample size employed (see Hakstian & Muller, 1973).

The purpose of the present study was to investigate the three rules listed above in terms of their performance with simulated data having a known number of factors. Part I of the investigation concerned the performance of these rules when applied to population R matrices and thus excluded the likelihood ratio test because of the inferential basis of the latter. In Part II, results based on samples randomly selected from the same simulated population matrices were examined for all three decision rules.

The research reported in this paper was supported by a grant (No. A9088) from the Natural Sciences and Engineering Research Council of Canada.

Consider a data matrix, Z, of order p variables by n subjects. In the present study, two different structural models for Z were employed: (1) the traditional common-factor model, as depicted by McDonald and Burr (1967), and (2) a model initially conceptualized by Tucker, Koopman, and Linn (1969). These models, termed, respectively, the Formal and Middle Models by Tucker et al., will be described separately.


The Formal Model

The n column vectors, z_i, i = 1, ..., n, of Z correspond to n independent observations distributed as the random vector z. Each element of z has origin and scale such that R = E(zz') is a correlation matrix. In structural terms,

$$z = A x_0 + U e , \qquad [1]$$

where A, p × q and of rank q < p, is an orthogonal pattern matrix; U, p × p, is a diagonal matrix of unique factor standard deviations; x_0, q × 1, is a random vector of orthogonal factor scores; and e, p × 1, is a random vector. Given the orthogonality conditions, namely,

$$E(x_0 x_0') = I_q , \quad E(e e') = I_p , \quad E(x_0 e') = 0 , \qquad [2]$$

then

$$R = E(zz') = AA' + U^2 , \qquad [3]$$

the well-known Thurstonian representation of R.
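As a small numerical check on [3], the following Python sketch (using NumPy) builds R from a pattern matrix A and sets each unique variance to the complement of the corresponding communality, so that R has a unit diagonal. The function name `formal_model_R` and the 4 × 2 loading values are hypothetical, chosen only for illustration; they are not taken from the study.

```python
import numpy as np

def formal_model_R(A):
    """Return (R, U2) with R = AA' + U^2 and U^2 chosen so that diag(R) = 1."""
    A = np.asarray(A, dtype=float)
    communalities = np.sum(A**2, axis=1)       # h_i^2: row sums of squares of A
    if np.any(communalities >= 1.0):
        raise ValueError("each communality must be below 1")
    U2 = np.diag(1.0 - communalities)          # unique variances (complements)
    return A @ A.T + U2, U2

# Hypothetical loadings, for illustration only (not from the study).
A = np.array([[0.8, 0.0],
              [0.7, 0.1],
              [0.0, 0.6],
              [0.1, 0.8]])
R, U2 = formal_model_R(A)
reduced_rank = np.linalg.matrix_rank(R - U2)   # rank of AA', i.e. q
```

Under these assumptions, R is a proper correlation matrix, and removing U² leaves a matrix of rank q = 2, which is the algebraic sense in which the Formal Model has exactly q common factors.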

The Middle Model

Here, in place of [1], the structural form of z is:

$$z = A x_0 + B x_1 + U e , \qquad [4]$$

where U and e are defined as before; A, p × q, is a matrix of major factor pattern coefficients; x_0, q × 1, is a random vector of major orthogonal factor scores; B, p × m (m > p), is a matrix of minor factor pattern coefficients; and x_1, m × 1, is a random vector of minor orthogonal factor scores. In place of [2], we have

$$E(x_0 x_0') = I_q , \quad E(x_1 x_1') = I_m , \quad E(e e') = I_p , \quad E(x_0 x_1') = 0 , \quad E(x_0 e') = 0 , \quad E(x_1 e') = 0 .$$

Consequently, for the Middle Model,

$$R = E(zz') = AA' + BB' + U^2 . \qquad [5]$$

Some additional explanation regarding the Middle Model may be helpful. Use of this model (originating in the work of Tucker et al., 1969) would seem consistent with the view that, in reality, the idealized traditional common-factor model may be a poor representation for many sets of variables. In addition to the major simple-structure factors structuring a domain of interest (and which the investigator wishes to discover), the existence would be postulated of a vast number of real influences which, although contributing somewhat to the covariation among a set of p variables (as in [5]), are of little consequence and of a random, unstructured form. The view that such minor factors may exist in most real data is not at issue in the present study. Rather, given that these are factors an investigator wishes not to include within the set of major factors (which would distort interpretations of the major factors), the question addressed here is the applicability of the number-of-factors rules for determining the number of major factors in the presence of minor factors. Clearly, the number-of-factors problem with such a model involves discovering the q major factors for the variables, without proceeding beyond these to the minor factors, a task obviously more difficult than discovering the q common factors in data represented by the simpler Formal Model in [1] through [3]. For convenience, the various features of these two structural models are outlined in Table 1.

Table 1
Structural Factorial Models Used in the Study

| | Formal (Traditional) Model | Middle Model |
|---|---|---|
| (i) Structural Equation | z = Ax_0 + Ue | z = Ax_0 + Bx_1 + Ue |
| (ii) Constituent Structures | A, p × q, of rank q ≤ p, orthogonal factor pattern matrix | A, p × q, of rank q ≤ p, orthogonal factor pattern matrix |
| | x_0, q × 1, random vector of orthogonal factor scores | x_0, q × 1, random vector of major orthogonal factor scores |
| | | B, p × m (m > p), factor pattern matrix; magnitude of column sums of squares geometrically decreasing |
| | | x_1, m × 1, random vector of minor orthogonal factor scores |
| | U, p × p, diagonal matrix of unique factor standard deviations | U, p × p, diagonal matrix of unique factor standard deviations |
| | e, p × 1, a random vector | e, p × 1, a random vector |
| (iii) Orthogonality Conditions | E(x_0 x_0') = I, E(ee') = I, E(x_0 e') = 0 | E(x_0 x_0') = I, E(x_1 x_1') = I, E(ee') = I, E(x_0 x_1') = 0, E(x_0 e') = 0, E(x_1 e') = 0 |
| (iv) Representation of R | R = AA' + U² | R = AA' + BB' + U² |

Implementation of the Models

The construction of the population R matrices according to the above models proceeded in a manner similar to that of Tucker et al. (1969). The one difference between the present procedures and those of Tucker et al. was that the conceptual major factor loading matrices were constructed in a systematic rather than random manner, in order that several characteristics of factorial data could be more precisely examined. The steps to construct each data set are outlined as follows:

1. First a set of matrices, A*, p × q, of conceptual major factor loadings was constructed to reflect what appeared to be reasonable simple structure representations of data. For each combination of number of variables (p) and factors (q), (a) three factorially simple matrices were constructed, in which the nonsalient entries ranged from zero in the simplest form to values in the neighborhood of ±.10 to ±.15 in the less clear forms, and (b) three factorially complex matrices were constructed, ranging from mild complexity to extreme factorial complexity in approximately one-half of the variables, i.e., non-salient entries ranged from ±.15 to ±.50 for up to one-half of the variables. Salient entries were arbitrarily given the value one in all matrices. Representative examples of the conceptual major factor loading matrices used are provided in Table 2.

Table 2
Four Representative Conceptual Major Factor Loading Matrices from the 12-Variable, 3-Factor Data Sets (a)

[Table body not recoverable from the source. Column groups: Factorially Simple (Most Simple, Least Simple) and Factorially Complex (Least Complex, Most Complex). Rows: Variables 1 through 12.]

(a) These conceptual major factor loading matrices are those that existed prior to row normalization and then row rescaling by the square roots of the variable communalities. They are referred to as the A* matrices in the text.

Note: Three conceptual major factor loading matrices were created for each combination of p, q, and level of complexity, and were numbered, within each level of complexity, from most to least simple and least to most complex. For the Part II Sample Results, the set numbers corresponding to the conceptual matrices used are 2 (1 and 3 are shown above left), 8, 14, 20, 26, and 32 for factorially simple data, and 5 (4 and 6 are shown above right), 11, 17, 23, 29, and 35 for factorially complex data.

2. Each A* matrix was next row-normalized, and then premultiplied by a diagonal p × p matrix containing the square roots of the predetermined variable communalities to produce the matrix A (see [3]).

3. For the Formal Model population R matrices, [3] was employed:

$$R = AA' + U^2 ,$$

where the u_i² (i = 1, ..., p) values in the diagonal matrices U² were simply the complements of the communalities established for the A matrices.

4. For the Middle Model population R matrices, each matrix B, p × m, in [4] and [5] was generated as follows:
(a) A p × m matrix of random normal (0, 1) deviates, Q, was generated;
(b) An m × m diagonal matrix, C, was constructed in which c_jj = (1 − ε)^(j−1), j = 1, ..., m, and ε = .02 (as in the study by Tucker et al. (1969));
(c) Next the product Q* = QC was obtained, so that the magnitudes of the minor factors formed a decreasing geometric series;
(d) Q* was then row-normalized;
(e) To reduce the negativity of the factor loadings in the row-normalized Q* matrix, the skewing function developed by Tucker et al. was employed (see this latter paper for the details of this function);
(f) The matrix resulting from Step (e) was row-normalized, yielding B*;
(g) The complements of the communalities established for the A matrix, (1 − h_i²), i = 1, ..., p, were isolated, and each row, i, of B* was multiplied by [(1 − h_i²)/2]^(1/2), yielding B (for which the sum of squares for each row i, i = 1, ..., p, therefore, was equal to (1 − h_i²)/2);
(h) Finally, each value of U² was set to (1 − h_i²)/2, and the R matrix was constructed by [5]:

$$R = AA' + BB' + U^2 .$$

It should be noted that in the above procedures, the magnitude of the column sums of squares of B decreased geometrically, so that by about m = 100 columns, the effect of the minor factors on the off-diagonal elements of R became negligible. Thus, for all analyses, m was arbitrarily set to 100.
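Steps 2 through 4 can be sketched in Python with NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' program: the function names, the 6 × 2 A* matrix, the communalities, and the random seed are hypothetical; NumPy's generator stands in for whatever generator was actually used; and the skewing function of Step (e) is omitted because its details are given only in Tucker et al. (1969).

```python
import numpy as np

def row_normalize(M):
    """Scale each row of M to unit sum of squares."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

def build_A(A_star, h2):
    """Step 2: row-normalize A*, then rescale rows by sqrt(communality)."""
    return np.sqrt(h2)[:, None] * row_normalize(A_star)

def build_B(h2, m=100, eps=0.02, rng=None):
    """Steps 4(a)-(g); ASSUMPTION: the skewing function of step (e) is skipped."""
    rng = np.random.default_rng() if rng is None else rng
    Q = rng.standard_normal((len(h2), m))      # (a) random normal (0, 1) deviates
    c = (1.0 - eps) ** np.arange(m)            # (b) c_jj = (1 - eps)^(j - 1)
    Q_star = Q * c                             # (c) Q* = QC (column j scaled by c_jj)
    B_star = row_normalize(Q_star)             # (d); (f) coincides with (d) without (e)
    return np.sqrt((1.0 - h2) / 2.0)[:, None] * B_star   # (g)

# Hypothetical conceptual loadings and communalities (illustration only).
A_star = np.array([[1.0, 0.0], [1.0, 0.1], [1.0, -0.1],
                   [0.0, 1.0], [0.1, 1.0], [-0.1, 1.0]])
h2 = np.array([0.6, 0.7, 0.8, 0.6, 0.7, 0.8])

A = build_A(A_star, h2)
B = build_B(h2, rng=np.random.default_rng(0))
R_formal = A @ A.T + np.diag(1.0 - h2)                        # step 3, equation [3]
R_middle = A @ A.T + B @ B.T + np.diag((1.0 - h2) / 2.0)      # step 4(h), equation [5]
```

In this sketch the non-major variance of each variable, 1 − h_i², is split equally between the minor factors and the unique factor, so both R matrices have unit diagonals by construction.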

Independent Variables Studied

For population R matrices constructed according to both the Formal and Middle Models, the following independent variables were incorporated.

Number of variables, p. Three values were used for p: 12, 30, and 50. The 12-variable sets were constructed to be representative of small factor studies, whereas the 50-variable sets were seen as representative of large-scale studies.

Number of major factors, q, and q/p ratio. For each value of p, two values of q were used, one representing a small q/p ratio and the other, a large ratio. For the 12-variable data sets, q was set to 3 and 6, yielding ratios of .25 and .50. For the 30-variable sets, values for q of 6 and 12 were used, yielding ratios of .20 and .40. For the 50-variable sets, q was set to 8 and 20, producing q/p ratios of .16 and .40.

Factorial complexity. As noted earlier in the discussion of construction of the conceptual major factor loading matrices and as shown in Table 2, two levels of factorial complexity were employed: low (factorial simplicity) and high. It was mainly to be able to vary this characteristic, as well as to construct major factor loading matrices that approximated those found with real-world data, that the present method of constructing the conceptual major factor loading matrices was used in preference to that used by Tucker et al. (1969).

Communality. Two ranges of communality of the variables, with respect to the major common factors, were employed: a high range, in which each variable was arbitrarily assigned a communality of .6, .7, or .8 with respect to the major factors, and a low range, in which each variable was assigned a communality of .2, .3, or .4. Clearly, with the Middle Model, the minor common factors were relatively more influential under the lower range of communality than under the high range.
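The q/p ratios quoted above follow directly from the chosen p and q values, as the short Python sketch below shows. The variable names are illustrative only.

```python
# (p, q) combinations described in the text: two q values for each p.
design = {12: (3, 6), 30: (6, 12), 50: (8, 20)}

# q/p ratio for each combination.
ratios = {(p, q): q / p for p, qs in design.items() for q in qs}

# Unweighted mean of the six ratios.
mean_ratio = sum(ratios.values()) / len(ratios)
```

The mean of the six ratios is 1.91/6, or about .318, which is the value the paper later gives as the correct overall average q/p ratio.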

Summary of the Study Design

In total, 36 conceptual major factor loading matrices were constructed, three for each combination of p, q, and factorial complexity. For the factorially simple matrices, the three A* matrices ranged from those having hyperplanar entries of exactly zero, to those with hyperplanar values in the ±.10 to ±.15 range. In the factorially complex A* matrices, the degree of complexity was increased, as well as the cleanness of the hyperplane values. This variation in conceptual factor loading matrices can be seen from Table 2, in which the two extremes of the three A* matrices are presented for two combinations of p, q, and level of factorial complexity. Each of the three A* matrices for a given combination of p, q, and level of factorial complexity was incorporated with the two levels of communality discussed above. And finally, for each combination of p, q, factorial complexity, and communality, two population R matrices were constructed: one according to the Formal Model, and one according to the Middle Model. Thus, a total of 144 population R matrices were constructed for the study.

Stated in analysis of variance terms, a 3(2) × 2 × 2 × 2 hierarchical factorial design was employed, with q nested within p and three replications per cell. It should be noted that although three simulated R matrices were developed for each cell of the factorial design, these replications were not independent from cell to cell. The R matrices for all cells having the same values for p, q, and level of factorial complexity were developed from the same three A* matrices. The communality and model variables were incorporated with the common set of 36 conceptual major factor loading matrices, A*. These constraints were adopted to permit a more precise examination of the effects of communality and model, i.e., to prevent the effects of these factors being confounded with the particular A* matrices used.
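The counts in this design can be verified by enumerating the cells. The following Python sketch does so; the level labels ("simple", "high", and so on) are illustrative names, not identifiers used in the study.

```python
from itertools import product

# q is nested within p, so the (p, q) pairs are listed explicitly.
pq_pairs = [(12, 3), (12, 6), (30, 6), (30, 12), (50, 8), (50, 20)]
complexity = ["simple", "complex"]
communality = ["high", "low"]
model = ["formal", "middle"]
replicates = [1, 2, 3]

# One A* matrix per (p, q) pair, complexity level, and replicate.
a_star_matrices = list(product(pq_pairs, complexity, replicates))

# Cells of the 3(2) x 2 x 2 x 2 design, and one population R per cell replicate.
cells = list(product(pq_pairs, complexity, communality, model))
population_matrices = list(product(cells, replicates))
```

Enumerated this way, there are 36 A* matrices, 48 cells, and 144 population R matrices, matching the totals stated in the text.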

Identification of the Number of Factors

Once the 144 population R matrices had been constructed, they were canonically decomposed and their eigenvalues were isolated for the present study. The number of eigenvalues greater than one was recorded as the indication of the correct number of factors by the K-G rule, and a computer-generated plot of the root spectrum was obtained. A scree test on each of these plots was performed by the third author.
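The K-G count described here is simple to compute once the eigenvalues are available. The Python sketch below (NumPy) applies it to a small hypothetical Formal Model matrix with six variables, two factors, and communalities of .7; the function name and the example matrix are assumptions made for illustration. The scree test is a visual judgment on the plotted root spectrum and is not automated here.

```python
import numpy as np

def kaiser_guttman(R):
    """Return (number of eigenvalues of R greater than one, descending eigenvalues)."""
    eigenvalues = np.linalg.eigvalsh(R)[::-1]   # eigvalsh is ascending; reverse it
    return int(np.sum(eigenvalues > 1.0)), eigenvalues

# Hypothetical population R: R = AA' + U^2 with all communalities equal to .7.
A = np.zeros((6, 2))
A[:3, 0] = np.sqrt(0.7)
A[3:, 1] = np.sqrt(0.7)
R = A @ A.T + 0.3 * np.eye(6)

n_factors, eigenvalues = kaiser_guttman(R)
```

For this matrix the root spectrum is 2.4, 2.4, .3, .3, .3, .3, so the K-G rule returns the correct two factors, and a plot of `eigenvalues` would show the clear break after the second root that a scree test looks for.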

The second population R matrix of each cell, corresponding to the intermediate conceptual factor matrix at each level of factorial complexity, was selected from each of the 48 cells described above. Representative examples of the conceptual factor loadings used to produce the population R matrices were provided earlier in Table 2. Each of the 48 R matrices was then canonically decomposed as R = QM²Q', and a factor matrix, F = QM, was established for each population. Next, matrices, X, of order p variables by n subjects, were formed. Independent uniformly distributed random numbers on the interval (0, 1) were generated and then transformed to normally distributed (0, 1) random numbers by Marsaglia's Rectangular Wedge-Tail method (Knuth, 1968). Strings of length pn of such numbers were produced and then partitioned into p row vectors. With the string partitioned this way, each independent 1 × n variate vector (j = 1, ..., p) represented a sample of observations randomly selected from a unit normal population. It can easily be demonstrated that the joint distribution thus arising is MVN(0, I). (See, for example, Anderson, 1958, pp. 19-27.) The product Y = FX was then obtained, constituting the sample raw data matrices. The sample correlation matrices, R_s, produced from these raw data matrices were, therefore, within random sampling error of the input population R matrices, R_p, since, given the independence of the entries in X,

$$D_\sigma E(YY') D_\sigma = E[(QMX)(QMX)'] = E(QMXX'MQ') = QM\,E(XX')\,MQ' = QM^2Q' = R_p ,$$

and

$$D_\sigma = \{\mathrm{diag}[E(YY')]\}^{-1/2} = I .$$
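The sampling procedure can be sketched in Python with NumPy as below. This is a sketch under stated assumptions rather than the original program: NumPy's built-in normal generator replaces the uniform-to-normal transformation described in the text, and the function names, the six-variable population matrix, and the seed are hypothetical.

```python
import numpy as np

def factor_matrix(R_p):
    """Decompose R_p = Q M^2 Q' and return F = QM, so that FF' = R_p."""
    m2, Q = np.linalg.eigh(R_p)
    return Q * np.sqrt(np.clip(m2, 0.0, None))   # scales column j of Q by M_jj

def sample_correlation(F, n, rng):
    """Form Y = FX from a p x n matrix X of N(0, 1) deviates; return the sample R."""
    X = rng.standard_normal((F.shape[0], n))     # ASSUMPTION: NumPy's generator
    Y = F @ X                                    # sample raw data matrix
    return np.corrcoef(Y)                        # rows of Y are the variables

# Hypothetical population matrix: two factors, communalities of .7.
A = np.zeros((6, 2))
A[:3, 0] = np.sqrt(0.7)
A[3:, 1] = np.sqrt(0.7)
R_p = A @ A.T + 0.3 * np.eye(6)

F = factor_matrix(R_p)
R_s = sample_correlation(F, n=400, rng=np.random.default_rng(1982))
max_abs_error = np.max(np.abs(R_s - R_p))
```

Repeating the call with n = 150 and with fresh random draws would give the small-sample replicates; `max_abs_error` gives a rough sense of how far one sample matrix departs from its population matrix.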


For each of the multivariate normal population sets, two sample sizes were employed: n = 150 (small) and n = 400 (large). For each sample size and population, three random samples were obtained, a process virtually identical to repeated sampling without replacement from an infinitely large population. Thus a total of 288 sample correlation matrices were constructed. Summarized in analysis of variance terms, a 3(2) × 2 × 2 × 2 × 2 (variables (factors nested within variables) by complexity by communality by model by sample size) factorial design was employed, with three randomly different replications per cell.

Identification of Number of Factors

Once the sample correlation matrices had been constructed, they were canonically decomposed and their eigenvalues isolated. The number of eigenvalues greater than one was recorded as the indication of the correct number of factors by the K-G rule, and a computer-generated plot of the root spectrum (in descending order) was subjected to scree testing by the third author. The likelihood ratio test, with α set to .05, was performed by means of Jöreskog and van Thillo's (1971) iterative maximum likelihood procedure. All computer analyses were performed on the University of British Columbia IBM 330/116 computer, using the Alberta General Factor Analysis Program (Hakstian & Bay, 1973).

RESULTS AND DISCUSSION

Clearly, space precludes presenting the entire vector of eigenvalues for each of the 144 population and 288 sample data sets. Therefore, the results are presented in a greatly condensed form. For the population matrices, the data are summarized in Table 3 in terms of the absolute cell mean deviations from the correct number of factors for each of the 48 cells and the two rules considered, K-G and scree. Because of the greater variability in performance across the independent variables with the sample data, the number of factors identified using the K-G, scree, and L-R rules are presented in Tables 4, 5, and 6 for each of the 288 sample data sets examined. These results are then summarized, in terms of the absolute cell mean deviations from the correct number of factors for the 96 sample cells, in Table 7.

The "A" attached to certain absolute cell mean deviations reported in Tables 3 and 7 signifies that the number of factors was overestimated for at least one of the three replicates in those cells. Correspondingly, a "B" indicates the number of factors was underestimated; a "C" indicates that there was both overestimation and underestimation. For example, in Table 7 the entry 4.67 B for the K-G rule under the Formal Model and with p = 50, q = 20, high communality, factorially complex, and n = 150 signifies that across the three replicates in this cell, the mean number of factors identified by the K-G rule was 4.67 less than the correct number of factors (see also Table 6, Data set 35). A synthesis of the results contained in these tables, organized by model and number-of-factor rules, is presented separately for the population and sample data in what follows.
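The cell summaries and their letter flags can be computed as in the Python sketch below. The function name is hypothetical, and the summary is read here as the absolute value of the signed cell mean deviation, which is consistent with the 4.67 B example; the mean of the absolute deviations is another possible reading. The replicate values (15, 15, 16) are invented so as to reproduce that example and are not taken from Table 6.

```python
def cell_summary(identified, correct):
    """Return (absolute cell mean deviation, flag) for the replicates of one cell."""
    deviations = [k - correct for k in identified]
    mean_deviation = sum(deviations) / len(deviations)
    over = any(d > 0 for d in deviations)     # at least one overestimate
    under = any(d < 0 for d in deviations)    # at least one underestimate
    if over and under:
        flag = "C"
    elif over:
        flag = "A"
    elif under:
        flag = "B"
    else:
        flag = ""
    return round(abs(mean_deviation), 2), flag

# Hypothetical replicate results for a cell whose correct number of factors is 20.
example = cell_summary([15, 15, 16], correct=20)
```

With these invented replicates the function returns (4.67, "B"), the form of entry described for Table 7.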

The Formal Model

The Kaiser-Guttman rule. As shown in Table 3, for the Formal Model, the K-G rule identified the correct number of factors in 16 of the 24 cells of the design. In each of the remaining cells, the number of factors determined was less than the correct number. An interactive effect among the independent variables explains these observed differences most clearly: the rule can be seen to perform well for all data sets except those possessing all three of (1) a large number of variables, (2) a large q/p ratio, and (3) substantial factorial complexity. Of these three characteristics, the q/p ratio appears to be the most important.

The scree test. As is evident from Table 3, this rule indicated the correct number of common factors in every case. It is important to remember, however, in connection with this fact and with what follows, that the R matrices factored were population matrices, and it is to be expected that precisely q eigenvalues should appear well-separated from the remaining (p − q) when R is of the simple and precisely error- and minor factor-free form in [3].

Comparison between K-G rule and scree test. It is clear from the preceding results that the scree test outperformed the K-G rule for population R matrices constructed under the Formal Model. For each combination of the independent variables, the scree test correctly identified the actual number of factors. In contrast, the K-G rule performed well for all data sets except those in which there was a large number of variables, a large ratio of factors to variables, and considerable factorial complexity. For these sets, there was a tendency to underestimate the actual number of factors in the population data. Since the larger q/p ratios used in the present study (.50 for the 12-variable data; .40 for the 30- and 50-variable data) are larger than one would expect in good factor analytic data, the present findings for the K-G rule may be somewhat overly pessimistic.


The Middle Model

The Kaiser-Guttman rule. Under the Middle Model, the K-G rule correctly identified the number of major factors in half of the cells of the design (12 of 24). Unlike the case for the Formal Model, there was a tendency to indicate too many factors in some of the remaining cells. Again, a clearer picture is gained from examining the interaction among the independent variables; pervasive trends are noted where appropriate. The tendency to overidentify the number of major factors occurred in those cells with large p (especially p = 50), small q/p ratio, and low communality. In contrast, there was some underestimation of the number of major factors in cells with a high q/p ratio. These effects were slightly more pronounced with factorially complex data than with simple data. As with the Formal Model, the K-G rule indicated the correct number of factors in all cases in which the variable communalities, in terms of the major factors, were high and the q/p ratio low.

The scree test. As is evident from Table 3, the scree test performed substantially more poorly with the Middle Model than with the Formal Model. The existence of minor common factors appears to have prevented, in many cases, the detection of a clear break in the plot of eigenvalues at the point corresponding to the correct number of major factors. Overall, the scree test tended to indicate too many major factors under the Middle Model; in 43 of the 72 data sets examined, the number of major factors was overestimated. For data sets involving large p, small q/p ratio, and low communality, the number indicated was far in excess of the correct number of major factors (e.g., see data sets with p = 30, q = 6, and p = 50, q = 8, with low communality).

Comparison between the K-G rule and scree test. It is evident that the K-G rule was considerably less influenced by the presence of minor factors than was the scree test. Of the 24 comparisons with the Middle Model and involving the same combination of independent variables, the absolute differences between the number of identified and of actual major factors were generally less with the K-G rule than with the scree test. Over all the Middle Model data sets, the mean absolute difference using the K-G rule was 2.25; with the scree test, the corresponding mean difference was 7.08.


It might be argued that expecting a technique, such as the scree test, to perform accurately when applied to data arising under a different process and structural model than that for which the technique was designed is unreasonable. On the other hand, the hypothesis that real-world data are more likely to be constructed according to the Middle Model than the Formal does seem attractive. It does seem naive to assume than any set of p variables could be exactly described by q ( < < p ) common factors. Whether the slippage between structural theory and data is best explained, however, by means of the minor common factors postulated in the Middle Model, and [4] and [5], is debatable, and it is likely true that we should not overemphasize the Middle Model results. A widespread belief is that the K-G rule tends to indicate too few factors with a small number of variables and too many with a large number. The results of the present study do not generally support this view. With the possible exception of the Middle Model data with low communalities in terms of the major factors and low factor to variable ratios, the general direction of error with the K-G rule seems to be toward indicating too few factors (particularly Formal Model results). The direction of error with the scree test, however, is uniformly toward indicating too many factors (Middle Model results only). It may be of interest to note that, over all 72 Middle Model data sets constructed, the K-G rule indicated an average q / p ratio of .313. The scree test, overall, yielded an average q / p ratio of 401. The correct overall averageresulting from the p and q values used-was .318. Summary

It must be noted that the generalizability of the present results rests on (1) the closeness of the conceptual major factor loading matrices to those which actually underlie real-world data and (2) the validity of the Middle Model as an accurate factorial representation of such data. It should be added, in connection with the latter point, that Tucker et al. (1969) introduced a third model, which they termed the "Simulation Model," in which the U and U2 terms in [4] and [5], respectively, vanished. The present authors rejected this model as unlikely to represent actual factorial data realistically. The Middle Model does, however, appear reasonable, in that behavioral variables may well tend to be related not only to clearly understood major higher-order constructs, but also to a myriad of minor influences.
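The correct overall average q/p ratio of .318 quoted in the discussion can be checked by simple arithmetic. The sketch below assumes a balanced design, that is, equally many data sets for each (p, q) pair, and takes the p and q values from the table headings of this study; the names `DESIGN`, `qp_ratios`, and `mean_qp` are illustrative only.

```python
# Design values used in the study: each number of variables p is paired
# with a low and a high number of major factors q.
DESIGN = {12: (3, 6), 30: (6, 12), 50: (8, 20)}


def qp_ratios(design):
    """Return the q/p ratio for every (p, q) pair in the design."""
    return [q / p for p, qs in design.items() for q in qs]


def mean_qp(design):
    """Unweighted mean of the q/p ratios (assumes a balanced design)."""
    ratios = qp_ratios(design)
    return sum(ratios) / len(ratios)


print([round(r, 2) for r in qp_ratios(DESIGN)])
print(round(mean_qp(DESIGN), 3))
```

Under the balanced-design assumption the six ratios are .25, .50, .20, .40, .16, and .40, whose mean rounds to .318. The K-G average of .313 therefore lies slightly below the correct value and the scree average of .401 well above it.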


With the Formal Model, the scree test indicated the correct number of factors for each combination of the independent variables; the Kaiser-Guttman rule performed equally well except for data sets involving a large number of variables, high factor to variable ratio, and low communality. For these data sets, there was a slight tendency to underestimate the actual number of factors. As seen from the Middle Model results, both methods appear to be influenced somewhat by the presence of minor factors, but in generally opposite directions. Again the Kaiser-Guttman rule usually tended to underestimate the correct number of factors; the scree test, on the other hand, tended to indicate, to a greater extent, too many factors. Finally, it must be noted that the real-world phenomenon of sampling error has not been considered in Part I, and thus no firm recommendations can be made on the basis of the results reported so far.
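For readers who want to see the Kaiser-Guttman rule in operation on population data, the following is a minimal sketch, not the authors' program. It assumes the usual form of the rule (count the eigenvalues of the correlation matrix that exceed 1) and a Formal Model population matrix built as loadings times their transpose with a unit diagonal. The loading matrix is a hypothetical simple-structure example, and the cyclic Jacobi routine is a swapped-in standard-library eigenvalue method chosen only to keep the example self-contained.

```python
import math


def jacobi_eigenvalues(matrix, tol=1e-12, max_sweeps=50):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Small-angle rotation that zeroes the (p, q) element.
                tau = (a[q][q] - a[p][p]) / (2.0 * a[p][q])
                t = math.copysign(1.0, tau) / (abs(tau) + math.hypot(1.0, tau))
                c = 1.0 / math.hypot(1.0, t)
                s = t * c
                for k in range(n):  # update columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # update rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def formal_model_correlations(loadings):
    """Population correlations: loadings times transpose, unit diagonal."""
    p = len(loadings)
    r = [[sum(x * y for x, y in zip(loadings[i], loadings[j]))
          for j in range(p)] for i in range(p)]
    for i in range(p):
        r[i][i] = 1.0
    return r


def kaiser_guttman(r):
    """Number of eigenvalues of the correlation matrix greater than 1."""
    return sum(1 for ev in jacobi_eigenvalues(r) if ev > 1.0)


# Hypothetical example: 6 variables, 2 factors, every loading equal to .8.
LOADINGS = [[0.8, 0.0]] * 3 + [[0.0, 0.8]] * 3
R = formal_model_correlations(LOADINGS)
print(kaiser_guttman(R))
```

In this illustrative matrix each block of three variables contributes one eigenvalue of 1 + 2(.64) = 2.28 and two of .36, so the rule indicates two factors, the correct number for this constructed case.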

The results obtained on sample data randomly selected from the simulated matrices included in Part I are presented next for all three decision rules. Unlike the population data just considered, these data are influenced by the real-world presence of sampling error. As such, the results are, as shown by a comparison between Tables 3 and 7, somewhat more variable. Consequently, the data and corresponding discussion are presented in somewhat greater detail.

The Formal Model

The Kaiser-Guttman rule. For the data sets constructed according to the Formal Model, the K-G indication corresponded to the correct number of factors in 19 of the 48 cells of the design. Of the remaining cells, the number of factors was underestimated in 18, overestimated in 10, and both over- and underestimated in one. This behavior is most clearly explained by the interaction of the independent variables examined; consistent trends are noted where appropriate. The K-G rule indicated the correct number of factors in all cases in which the q/p ratio was low and the variable communalities high. The tendency to indicate too few factors was found in those cells containing factorially complex data with a high q/p

Table 4
Correct Number of Factors and Number Indicated by the Kaiser-Guttman (K-G), Scree, and Likelihood Ratio (L-R) Rules for the 12-Variable Data Sets, Derived from Both the Formal and Middle Models

Sample Results. Each row is one cell of the design; the three entries under each rule are the numbers of factors indicated for the three replicates in that cell.

1. Formal Model

Major Factor Set | Sample Size | Factorial Complexity | Correct No. of Factors | High Communality: K-G | Scree | L-R | Low Communality: K-G | Scree | L-R
2 | 150 | Simple | 3 | 3 3 3 | 3 3 3 | 3 3 3 | 4 3 3 | 7 4 5 | 3 3 3
2 | 400 | Simple | 3 | 3 3 3 | 3 3 3 | 3 3 3 | 3 3 3 | 3 5 4 | 4 3 3
5 | 150 | Complex | 3 | 3 3 3 | 3 3 3 | 3 3 3 | 3 3 3 | 6 7 6 | 3 3 2
5 | 400 | Complex | 3 | 3 3 3 | 3 3 3 | 3 3 3 | 3 3 3 | 3 4 3 | 3 3 3
8 | 150 | Simple | 6 | 6 6 6 | 6 6 6 | 6 6 6 | 5 5 6 | 7 5 6 | 4 2 4
8 | 400 | Simple | 6 | 6 6 6 | 6 6 6 | 6 6 6 | 6 6 6 | 8 6 6 | 5 5 6
11 | 150 | Complex | 6 | 3 5 4 | 6 6 6 | 6 6 6 | 4 4 4 | 4 7 7 | 1 3 1
11 | 400 | Complex | 6 | 4 5 5 | 6 6 6 | 6 6 6 | 5 5 4 | 10 12 10 | 4 4 3

2. Middle Model

Major Factor Set | Sample Size | Factorial Complexity | Correct No. of Factors | High Communality: K-G | Scree | L-R | Low Communality: K-G | Scree | L-R
2 | 150 | Simple | 3 | 3 3 3 | 4 6 3 | 4 4 3 | 3 4 4 | 6 8 5 | 3 3 5
2 | 400 | Simple | 3 | 3 3 3 | 3 3 3 | 6 7 5 | 4 4 3 | 7 5 5 | 7 7 5
5 | 150 | Complex | 3 | 3 3 3 | 3 3 7 | 3 4 3 | 4 3 4 | 11 4 6 | 4 3 3
5 | 400 | Complex | 3 | 3 3 3 | 4 4 5 | 4 4 5 | 3 4 3 | 6 8 5 | 5 5 4
8 | 150 | Simple | 6 | 6 6 6 | 8 6 7 | 6 6 6 | 6 6 5 | 6 5 5 | 5 4 4
8 | 400 | Simple | 6 | 6 6 6 | 6 6 6 | 7 7 7 | 6 6 5 | 6 7 6 | 6 6 7
11 | 150 | Complex | 6 | 5 3 4 | 7 6 7 | 7 6 6 | 5 4 5 | 6 5 11 | 4 5 4
11 | 400 | Complex | 6 | 4 4 4 | 6 7 7 | 7 7 7 | 4 4 5 | 6 7 8 | 5 7 6

Table 5
Correct Number of Factors and Number Indicated by the Kaiser-Guttman (K-G), Scree, and Likelihood Ratio (L-R) Rules for the 30-Variable Sets, Derived from Both the Formal and Middle Models

Sample Results. Each row is one cell of the design; the three entries under each rule are the numbers of factors indicated for the three replicates in that cell.

1. Formal Model

Major Factor Set | Sample Size | Factorial Complexity | Correct No. of Factors | High Communality: K-G | Scree | L-R | Low Communality: K-G | Scree | L-R
14 | 150 | Simple | 6 | 6 6 6 | 6 6 6 | 6 6 6 | 9 10 9 | 6 11 8 | 5 5 5
14 | 400 | Simple | 6 | 6 6 6 | 6 6 6 | 6 6 7 | 8 8 7 | 24 19 15 | 6 6 6
17 | 150 | Complex | 6 | 6 6 6 | 6 6 6 | 6 6 6 | 9 9 10 | 6 7 7 | 5 6 3
17 | 400 | Complex | 6 | 6 6 6 | 6 6 6 | 6 6 6 | 8 7 7 | 6 11 12 | 6 6 6
20 | 150 | Simple | 12 | 11 10 12 | 12 12 12 | 12 12 12 | 11 11 12 | 28 25 26 | 5 6 5
20 | 400 | Simple | 12 | 11 12 12 | 12 12 12 | 12 12 12 | 13 13 12 | 13 18 18 | 9 9 10
23 | 150 | Complex | 12 | 11 10 11 | 12 12 12 | 12 11 12 | 12 12 12 | 16 18 15 | 7 8 5
23 | 400 | Complex | 12 | 11 11 10 | 12 12 12 | 12 12 12 | 11 12 11 | 16 11 13 | 9 9 8

2. Middle Model

Major Factor Set | Sample Size | Factorial Complexity | Correct No. of Factors | High Communality: K-G | Scree | L-R | Low Communality: K-G | Scree | L-R
14 | 150 | Simple | 6 | 6 6 6 | 6 6 6 | 11 12 12 | 10 11 10 | 12 13 13 | 9 9 9
14 | 400 | Simple | 6 | 6 6 6 | 6 6 7 | 14 14 14 | 8 9 9 | 13 15 13 | 12 12 14
17 | 150 | Complex | 6 | 6 6 6 | 9 7 8 | 10 10 11 | 10 10 9 | 14 17 15 | 8 10 10
17 | 400 | Complex | 6 | 6 6 6 | 6 6 6 | 14 15 15 | 9 9 9 | 14 15 10 | 14 12 13
20 | 150 | Simple | 12 | 11 11 11 | 12 12 12 | 13 12 14 | 12 12 11 | 19 17 11 | 11 10 10
20 | 400 | Simple | 12 | 12 12 12 | 12 12 12 | 16 17 17 | 12 12 12 | [illegible] | 17 18 16
23 | 150 | Complex | 12 | 9 10 10 | 12 13 14 | 14 14 16 | 10 11 10 | 17 16 15 | 11 11 9
23 | 400 | Complex | 12 | 11 11 11 | 13 14 13 | 17 18 20 | 12 11 11 | 14 13 15 | 15 13 16

Table 6
Correct Number of Factors and Number Indicated by the Kaiser-Guttman (K-G), Scree, and Likelihood Ratio (L-R) Rules for the 50-Variable Sets, Derived from Both the Formal and Middle Models

Sample Results
1. Formal Model
2. Middle Model

Columns: Major Factor Set | Sample Size | Factorial Complexity | Correct No. of Factors | High Communality: K-G | Scree | L-R | Low Communality: K-G | Scree | L-R

[Table entries not recoverable from the source.]

Footnotes: No solution since input matrix was singular, attributable to random chance. No discernible break in the plot of eigenvalues was found.

Table 7
Mean Absolute Difference Between Indicated and Correct Number of Factors for Each Cell in the Design (Means are over Three Replicates per Cell)

Sample Results

1. Formal Model, High Communality

p | q | Factorial Complexity | K-G, n = 150 | K-G, n = 400 | Scree, n = 150 | Scree, n = 400 | L-R, n = 150 | L-R, n = 400
12 | 3 | Simple | 0 | 0 | 0 | 0 | 0 | 0
12 | 3 | Complex | 0 | 0 | 0 | 0 | 0 | 0
12 | 6 | Simple | 0 | 0 | 0 | 0 | 0 | 0
12 | 6 | Complex | 2.00B | 1.33B | 0 | 0 | 0 | 0
30 | 6 | Simple | 0 | 0 | 0 | 0 | 0 | .33A
30 | 6 | Complex | 0 | 0 | 0 | 0 | 0 | 0
30 | 12 | Simple | 1.00B | .33B | 0 | 0 | 0 | 0
30 | 12 | Complex | 1.33B | 1.33B | 0 | 0 | .33B | 0
50 | 8 | Simple | 0 | 0 | 0 | 0 | NS(a) | 0
50 | 8 | Complex | 0 | 0 | 0 | 0 | NS(a) | 0
50 | 20 | Simple | 2.67B | 1.00B | 0 | 0 | 1.00B | 0
50 | 20 | Complex | 4.67B | 4.33B | .33? | 1.33A | 1.33B | .33A
Row Mean Absolute Difference | | | .97 | .69 | .03 | .11 | .27 | .06

Formal Model, Low Communality

p | q | Factorial Complexity | K-G, n = 150 | K-G, n = 400 | Scree, n = 150 | Scree, n = 400 | L-R, n = 150 | L-R, n = 400
12 | 3 | Simple | .33A | 0 | 2.33A | 1.00A | 0 | .33A
12 | 3 | Complex | 0 | 0 | 3.33A | .33A | .33A | 0
12 | 6 | Simple | .67B | 0 | .67C | .67A | 2.67B | .67B
12 | 6 | Complex | 2.00B | 1.33B | 1.33C | 4.67A | 4.33B | 2.33B
30 | 6 | Simple | 3.33A | 1.67A | 2.33A | 13.33A | 1.00B | 0
30 | 6 | Complex | 3.33A | 1.33A | .67A | 3.67A | 1.33B | 0
30 | 12 | Simple | .67B | .67A | 14.33A | 4.33A | 6.67B | 2.67B
30 | 12 | Complex | 0 | .67B | 4.33A | 2.00A | 5.33B | 3.33B
50 | 8 | Simple | 8.00A | 5.00A | 6.00A | 3.33A | .33B | 1.00C
50 | 8 | Complex | 9.00A | 5.33A | 8.67A | 3.67A | 1.33B | 0
50 | 20 | Simple | .67B | .67C | 11.67A | 10.00A | 12.67B | 5.33B
50 | 20 | Complex | 1.00B | 2.00B | 4.33A | 2.33C | 13.33B | 8.00B
Row Mean Absolute Difference | | | 2.42 | 1.56 | 5.00 | 4.11 | 4.11 | 1.97

2. Middle Model, High Communality

p | q | Factorial Complexity | K-G, n = 150 | K-G, n = 400 | Scree, n = 150 | Scree, n = 400 | L-R, n = 150 | L-R, n = 400
12 | 3 | Simple | 0 | 0 | 1.33A | 0 | .67A | 3.00A
12 | 3 | Complex | 0 | 0 | 1.33A | 1.33A | .33A | 1.33A
12 | 6 | Simple | 0 | 0 | 1.00A | 0 | 0 | 1.00A
12 | 6 | Complex | 2.00B | 2.00B | .67A | .67A | .33A | 1.00A
30 | 6 | Simple | 0 | 0 | 0 | .33A | 5.67A | 8.00A
30 | 6 | Complex | 0 | 0 | 2.00A | 0 | 4.33A | 8.67A
30 | 12 | Simple | 1.00B | 0 | 0 | 0 | 1.00A | 4.67A
30 | 12 | Complex | 2.33B | 1.00B | 1.00A | 1.33A | 2.67A | 6.33A
50 | 8 | Simple | 0 | 0 | 2.00A | 0 | NS(a) | NS(a)
50 | 8 | Complex | 0 | 0 | 7.00A | 5.33A | NS(a) | NS(a)
50 | 20 | Simple | 2.33B | 1.00B | 3.00A | 1.00A | 3.00A | 9.00A
50 | 20 | Complex | 4.33B | 4.00B | 4.33A | 6.67? | 3.67A | 9.33A
Row Mean Absolute Difference | | | 1.00 | .67 | 1.97 | 1.39 | 2.17 | 5.23

Middle Model, Low Communality

p | q | Factorial Complexity | K-G, n = 150 | K-G, n = 400 | Scree, n = 150 | Scree, n = 400 | L-R, n = 150 | L-R, n = 400
12 | 3 | Simple | .67A | .67A | 3.33A | 2.67A | .67A | 3.33A
12 | 3 | Complex | .67A | .33A | 4.00A | 3.33A | .33A | 1.67A
12 | 6 | Simple | .33B | .33B | .67B | .33A | 1.67B | .33A
12 | 6 | Complex | 1.33B | 1.67B | 2.00C | 1.00A | 1.67B | .67C
30 | 6 | Simple | 4.33A | 2.67A | 6.67A | 7.67A | 3.00A | 6.67A
30 | 6 | Complex | 3.67A | 3.00A | 9.33A | 7.00A | 3.33A | 7.00A
30 | 12 | Simple | .67B | 0 | 4.33C | 2.00A | 1.67B | 5.00A
30 | 12 | Complex | 1.67B | .67B | 4.00A | 2.00A | 1.67B | 2.67A
50 | 8 | Simple | 8.00A | 7.33A | 20.67A | 16.33A | 6.67A | 16.33A
50 | 8 | Complex | 8.00A | 7.67A | 16.33A | 16.33A | 9.00A | 17.00A
50 | 20 | Simple | 1.33B | 1.00B | 4.33A | 5.33? | 3.00B | 7.33A
50 | 20 | Complex | 1.67B | 1.67B | 9.67? | 5.33C | 4.67? | 4.67?
Row Mean Absolute Difference | | | 2.70 | 2.25 | 7.11 | 5.78 | 3.11 | 6.06

Note: An "A" following an entry signifies that the number of factors was overestimated for at least one of the three cell replicates; a "B" signifies that the number of factors was underestimated for at least one of the three replicates; and a "C" signifies that the number of factors was both overestimated for at least one replicate and underestimated for at least one other replicate. A "?" marks a letter that is illegible in the source.
(a) No solution since input matrix was singular, attributable to random chance.
(b) Based on two replicates, since for one of the three replicates, no discernible break in the plot of eigenvalues was found.

ratio. Further, this effect was slightly more pronounced for data with high communality than with low communality. In contrast, too many factors were indicated in those cells containing factorially simple or complex data with large p, low q/p ratio, and low communality. Generally, the degree of overestimation exceeded the degree of underestimation noted above. Finally, the deviations from the correct number of factors were less pronounced for the larger samples than for the smaller samples, especially in the case of overestimation.

The scree test. Clearly, the scree test performed much more accurately with the high communality data sets than with the low communality sets. As shown in Table 7, in all but two of the 24 cells with high communality data, the number of factors was correctly identified. In contrast, in 21 of the 24 cells with low communality data, the identified number of factors exceeded, in some cases substantially, the correct number of factors. The existence of variables with low communality appears to have prevented the detection of a clear break at the correct point in the plot of eigenvalues. A summary of the behavior of the scree test in terms of the remaining independent variables for the low communality data sets follows.

1. Number of variables, p. The overestimation of the number of factors was most pronounced with a large number of variables (p = 30 and 50).
2. Ratio of q to p. There was no discernible effect on the performance of the scree test from the q/p ratio. The test appeared to perform the same with a high q/p ratio as with a low ratio.
3. Factorial complexity. The scree test tended to indicate too many factors to a greater extent with factorially simple data than with complex.
4. Sample size, n. The overestimation of the number of factors appears to a lesser extent with large samples than with small samples. This phenomenon, though, is somewhat inconsistent, as is evident from the 30-variable data sets.
With low communality data structured according to the Formal Model, the number of variables and factorial complexity appear to be the most important influences upon performance of the scree test. For data sets involving low communality and a large number of factorially simple variables, the number of factors indicated was far in excess of the correct number (e.g., note the results in


Tables 4-7 for the factorially simple data sets with low communality, p = 30 and p = 50).

The likelihood ratio test. For the Formal Model, the L-R rule indicated the correct number of factors in 22 cells of the design, with a tendency toward underestimation in the remaining cells. Again, a clearer picture is provided by an examination of the interactions among the independent variables. The L-R rule performed more accurately with data possessing high, rather than low, communality. The tendency toward underidentification occurred, for the most part, in those cells with low communality, increased with larger p, and was much more pronounced with the larger q/p ratios than with the smaller. Factorial complexity appeared to have little effect. Generally, observed differences between the identified and correct number of factors were, as would be expected, markedly less with larger samples than with smaller samples.

Comparisons among the three tests. It is evident from the preceding results that all three rules performed more accurately with data possessing high, rather than low, communality. With such data, the overall performance of the scree and L-R tests was quite good. In contrast, the K-G rule performed quite well except for those data sets manifesting large p, large q/p ratio, and considerable factorial complexity. For these sets, there was a tendency towards underidentification. Sample size appeared to have little effect, with the high communality data, upon the results of all three tests. Over all the high communality data sets, the mean absolute difference between the identified and correct number of factors was .83 for the K-G rule, .07 for the scree test, and .17 for the L-R test. Each of the rules performed less well with variables possessing low communalities, but in different ways. The K-G rule tended to indicate too few factors with high q/p ratios, and, to a much greater extent, too many with low q/p ratios and, particularly, large p.
The scree test generally overidentified the number of factors, and the effect was most pronounced in those cells with a large number of factorially simple variables. In contrast, the L-R rule tended to indicate too few factors. Further, this tendency was most pronounced with high q/p ratios. As expected, the L-R test performed better with larger samples than with smaller ones. Similarly, the performance of the K-G rule improved with increasing sample size, but to a lesser extent, and, although there was an


overall slight improvement in the performance of the scree test with increasing n, this improvement was not as consistent as for the K-G and L-R rules. Across all the low communality data sets, the mean absolute differences between the identified and correct number of factors were, for the K-G, scree, and L-R tests, respectively, 1.99, 4.56, and 3.04. It is again appropriate to note that the high range q/p ratios used in the present study (.50 for p = 12; .40 for p = 30 and 50) are larger than one would expect to find in good factor analytic data. Indeed, the major use of factor analysis is to find a limited number of factors which will contain the maximum amount of information. Thus, the findings for the K-G rule with high communality data and for the L-R test with low communality data may be somewhat overly pessimistic; correspondingly, the findings for the K-G rule with low communality data may be somewhat optimistic.
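The cell entries that underlie these mean absolute differences can be reproduced from the replicate counts. The sketch below shows one way to compute a Table 7 entry: the mean absolute difference over a cell's three replicates, plus the A/B/C direction letter defined in the note to Table 7. The function name `cell_summary` is hypothetical, and the replicate counts in the usage lines are the scree results for two Formal Model, low communality, 12-variable cells as read from Table 4.

```python
def cell_summary(indicated, correct):
    """Mean absolute difference and direction letter for one design cell.

    indicated: number of factors indicated for each replicate in the cell
    correct:   the correct number of factors for that cell
    """
    diffs = [k - correct for k in indicated]
    mean_abs = sum(abs(d) for d in diffs) / len(diffs)
    over = any(d > 0 for d in diffs)    # at least one overestimate
    under = any(d < 0 for d in diffs)   # at least one underestimate
    if over and under:
        flag = "C"
    elif over:
        flag = "A"
    elif under:
        flag = "B"
    else:
        flag = ""
    return round(mean_abs, 2), flag


# Scree test, low communality, simple data, q = 3, n = 150.
print(cell_summary([7, 4, 5], 3))
# Scree test, low communality, complex data, q = 6, n = 150.
print(cell_summary([4, 7, 7], 6))
```

The two cells give 2.33 with letter A and 1.33 with letter C, which agree with the corresponding Table 7 entries.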

The Middle Model

The Kaiser-Guttman rule. The performance of the K-G rule under the Middle Model was similar to its performance under the Formal Model. As seen from Table 7, the performance of the K-G rule in the cells with low communality data was slightly less accurate under the Middle Model than under the Formal Model, whereas the results for the data sets with high communality were nearly identical. Thus, no further discussion of the performance of the K-G rule under the Middle Model is made here; the reader may wish to review the material presented for the Formal Model.

The scree test. As is evident in Tables 4 through 7, the scree test performed more poorly under the Middle Model than under the Formal Model. Again, there was a tendency to overidentify the number of factors, but to a greater extent. Further, the number of high communality data sets for which the correct number of major factors was indicated was sharply reduced (from 22 to seven; see Table 7). As was true for the population data, the existence of minor common factors appears to have prevented detection of a clear break in the plot of eigenvalues at the point corresponding to the number of major factors. Again, the break in the plot of the eigenvalues was further obscured by the presence of variables with low communality; the identification of the correct number of major factors was less certain with variables of low communality (with respect to the


major factors) than with variables of high communality. As with the Formal Model, the identified number of factors exceeded the correct number for the majority of cases with low communality. In the cells with low q/p ratios, this overidentification was rather substantial (e.g., see data sets with low communality and p = 12, q = 3; p = 30, q = 6; and p = 50, q = 8). For both high and low communality, the scree test tended to indicate too many major factors to a slightly greater extent with factorially complex data than with simple data. Over all data sets, the extent of overestimation was less with large n than with small n.

The likelihood ratio test. The L-R test performed much less accurately under the Middle Model than under the Formal Model. As might be expected from the presence of the minor common factors, the L-R rule indicated, with few exceptions, too many major factors. Overall, the degree of overestimation exceeded the degree of underestimation found under the Formal Model. Further, unlike with the Formal Model, the overestimation was much more pronounced in the large samples than in the small ones, and in the data having low q/p ratios than in the data having large q/p ratios. As with the Formal Model, the L-R rule performed less well with variables having low communality; there appeared to be no consistent effect from factorial complexity.

Comparisons among the three tests. It is clear that the K-G rule was considerably less influenced by the presence of the minor factors than either the scree or the L-R test. As noted, the performance of the K-G test under the Middle and Formal Models was roughly comparable. With the scree and L-R tests, not only were the absolute mean differences between the identified and correct number of factors generally larger under the Middle Model than under the Formal Model, but they also typically exceeded the corresponding differences associated with the K-G rule under the Middle Model.
Over all the Middle Model data sets with high communality, the mean absolute difference using the K-G rule was .84, whereas for the scree and L-R tests, the differences were, respectively, 1.68 and 3.70. Over the low communality sets, the corresponding values, in the same order, were 2.48, 6.45, and 4.59. With the high communality data, the K-G rule performed quite well, except for those data sets with a large number of variables, high q/p ratio, and factorial complexity, in which case underidentification of the number of major factors occurred. In


contrast, the scree and L-R tests more frequently overidentified the number of factors. This tendency was generally more pronounced for the L-R than for the scree test, particularly for large sample size. As noted above, all of the rules were less accurate, but to varying degrees, when applied to data sets consisting of low communality variables. The K-G rule tended to indicate too few factors with a high q/p ratio, and, to a somewhat greater extent, too many with a low q/p ratio. The scree test generally indicated too many factors, particularly in cells with low q/p ratio. The performance of the L-R test appeared to be more dependent upon sample size. For small n, its performance was somewhat comparable to that observed with the K-G rule; for large n, the pattern was more similar to that observed with the scree test.

Summary

As with the population results presented in Part I, the generalizability of the present sample results depends on (1) the closeness of the conceptual pattern matrices to those underlying real-world data and (2) the validity of the Formal and Middle common-factor models as accurate representations of such data. The discussion of these issues presented earlier is relevant here also. Given these conditions, then, we have found that with the Formal Model, the K-G, scree, and L-R tests all performed quite well with variables having high communality. The correct number of factors was indicated for nearly all the combinations of the independent variables. Each of the rules performed less well with variables having low communality. For these data sets, the K-G rule tended to underidentify slightly the number of factors with a high q/p ratio, and overidentify with a low ratio. The scree test generally, to a greater extent, indicated too many factors; overidentification was particularly pronounced with a large number of factors and for a high q/p ratio. Predictably, the performance of all three rules improved somewhat with increasing sample size. With the Middle Model, the K-G rule appeared to be less influenced by the presence of the minor factors than either the scree or L-R tests. As noted earlier, the performance of the K-G rule under the Middle Model was very similar to its performance under the Formal Model. On the other hand, the scree test tended to indicate, to a greater extent, too many major factors. Except for data sets involving a small sample size and low communality, the


L-R test also overidentified, sometimes excessively, the number of major factors. With small samples and low communality, the pattern of differences for the L-R test across the remaining independent variables was similar to the pattern for the K-G rule.

As previously noted, it could be argued that the K-G, scree, and L-R rules, when applied to data arising under a different process and structural model than that for which they were designed, can be expected to lead to ambiguous results. The present results with the Middle Model data tend to support this to some extent, particularly with the scree and L-R tests. When applied with real-world data, the K-G, scree, and L-R rules often do not indicate the same number of factors, suggesting that the Formal, traditional common-factor model may often be an inaccurate description of the underlying structures and processes of data. Whether the slippage encountered, however, is best explained by means of the minor common factors postulated in the Middle Model is, at present, debatable, and it is likely true that we should not stress the Middle Model results.

Comparison of the findings of Parts I and II of this study for the K-G and scree rules reveals that, as expected, the real-world phenomenon of sampling error did influence the performance of these two rules. The patterns of performance are quite comparable; the absolute deviations are predictably larger for the sample data, but to different degrees. Of the two rules, the scree appears to be more influenced by the presence of sampling error than the K-G rule. It appears true that the influence in the population data of the minor factors of the Middle Model on the resulting covariance structures is similar to that of sampling error on data fitting the Formal Model.

Of the several independent variables examined in this study, three are known to an investigator: n, p, and (after convergence for a common-factor solution) a fairly good estimate of the variable communalities and the overall mean communality (MC).
It is possible, from the foregoing results, to identify an optimal range for accurate assessment of q, in terms of two of these directly, n and MC, and a function of the third, p, that is in line with earlier findings. It is clear from the results presented in Table 7 that given, for a set of data, (1) a large n (say 250-300 or larger),


and (2) a mean calculated communality (MC) of, say, .60 or larger, either the K-G or scree rules (and possibly the L-R test as well) will yield an indication close to correct. Such an identification will be just that much more credible if, with either rule, the indicated q/p value is less than about .30. Clearly, the investigator can have greatest confidence if, say, for a set of data in which the n is large and the MC high, the indications by both the K-G and scree rules provide a q/p ratio less than .30 and are the same. With lower communality (say an MC of .30) or a q/p ratio above .30, however, one should expect the K-G rule to be less accurate than before and the scree test to be systematically and, in some cases, substantially, biased towards overidentification. These findings suggest parallel analysis, with convergence of the number of factors identified indicative of good factor analytic data. In Hakstian and Muller's (1973) study, among the 17 factor analytic data sets found in the literature and reanalyzed in that study, the range of MCs ran from a low of .39 to a high of .75, with a mean MC, over all 17, of .54. Two of the 17 data sets had an MC falling in what has been operationalized in the present study as the low range (.20-.40). Four of the 17 fell in the present high range (.60-.80), and the remaining 11 had MCs that fell between .40 and .60. Not surprisingly, the studies with the higher MCs displayed lesser discrepancies among the number-of-factors rules than did those with lower MCs. Thus, in conclusion, the investigator may now have some more finely calibrated methods for determining the quality of the data for answering the number-of-factors question, as well as greater insight into the direction and magnitude of error if error exists in the decision.
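The guidelines above can be written down as a small decision helper. This is only a sketch of one reading of the text, not a procedure given by the authors: the cutoffs (n of at least 250, MC of at least .60, q/p below .30) are the approximate values mentioned above, with 250 chosen from the stated 250-300 range, and the function name and the four return labels are hypothetical.

```python
def assess_indication(n, mean_communality, p, kg_q, scree_q):
    """Rough credibility label for a number-of-factors indication.

    n:                sample size
    mean_communality: mean calculated communality (MC)
    p:                number of variables
    kg_q, scree_q:    number of factors indicated by the K-G and scree rules
    """
    # Conditions (1) and (2): large sample and high mean communality.
    favorable = n >= 250 and mean_communality >= 0.60
    # Both rules indicate a q/p ratio below about .30.
    low_ratio = kg_q / p < 0.30 and scree_q / p < 0.30
    if not favorable:
        return "expect error: K-G less accurate, scree may overidentify"
    if not low_ratio:
        return "probably close, but q/p is at or above .30"
    if kg_q != scree_q:
        return "credible, but the rules disagree"
    return "greatest confidence"


print(assess_indication(400, 0.65, 30, 6, 6))
print(assess_indication(150, 0.30, 30, 9, 11))
```

The first call describes a large sample with high MC in which both rules indicate 6 factors from 30 variables; the second describes a small sample with low MC, the condition under which the scree test was found to overidentify.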

1. Kaiser, H. F., & Rice, J. Little jiffy, Mark IV. Unpublished manuscript, University of California, Berkeley, California, 1973.

Anderson, T. W. An introduction to multivariate statistical analysis. New York: Wiley, 1958.
Cattell, R. B. The scree test for the number of factors. Multivariate Behavioral Research, 1966, 1, 245-276.
Cattell, R. B., & Vogelmann, S. A comprehensive trial of the scree and KG criteria for determining the number of factors. Multivariate Behavioral Research, 1977, 12, 289-325.


Crawford, C. B. Determining the number of interpretable factors. Psychological Bulletin, 1975, 82, 226-237.
Hakstian, A. R., & Bay, K. S. User's manual to accompany the Alberta General Factor Analysis Program. Edmonton, Alberta: Division of Educational Research Services, University of Alberta, 1973.
Hakstian, A. R., & Muller, V. J. Some notes on the number of factors problem. Multivariate Behavioral Research, 1973, 8, 461-475.
Hoel, P. G. A significance test for component analysis. Annals of Mathematical Statistics, 1937, 8, 149-158.
Jöreskog, K. G., & van Thillo, M. New rapid algorithms for factor analysis by unweighted least squares, generalized least squares and maximum likelihood. Research Memorandum 71-5. Princeton, N. J.: Educational Testing Service, 1971.
Kaiser, H. F. A second generation Little Jiffy. Psychometrika, 1970, 35, 401-415.
Knuth, D. E. The art of computer programming (Vol. 2): Semi-numerical algorithms. Reading, Mass.: Addison-Wesley, 1968.
Lederman, W. On the rank of the reduced correlation matrix in multiple-factor analysis. Psychometrika, 1937, 2, 85-93.
McDonald, R. P., & Burr, E. J. A comparison of four methods of constructing factor scores. Psychometrika, 1967, 32, 381-401.
Tucker, L. R., Koopman, R. F., & Linn, R. L. Evaluation of factor analytic research procedures by means of simulated correlation matrices. Psychometrika, 1969, 34, 421-459.
