Proc. Natl. Acad. Sci. USA Vol. 89, pp. 7669-7673, August 1992 Population Biology

Origins of the Indo-Europeans: Genetic evidence (gene frequencies/Europe)

ROBERT R. SOKAL*†, NEAL L. ODEN‡, AND BARBARA A. THOMSON*

*Department of Ecology and Evolution, State University of New York, Stony Brook, NY 11794-5245; and ‡Department of Preventive Medicine, Division of Epidemiology, Health Sciences Center, State University of New York, Stony Brook, NY 11794-8036

Contributed by Robert R. Sokal, May 22, 1992

ABSTRACT Two theories of the origins of the Indo-Europeans currently compete. M. Gimbutas believes that early Indo-Europeans entered southeastern Europe from the Pontic Steppes starting ca. 4500 B.C. and spread from there. C. Renfrew equates early Indo-Europeans with early farmers who entered southeastern Europe from Asia Minor ca. 7000 B.C. and spread through the continent. We tested genetic distance matrices for each of 25 systems in numerous Indo-European-speaking samples from Europe. To match each of these matrices, we created other distance matrices representing geography, language, time since origin of agriculture, Gimbutas' model, and Renfrew's model. The correlation between genetics and language is significant. Geography, when held constant, produces a markedly lower, yet still highly significant partial correlation between genetics and language, showing that more remains to be explained. However, none of the remaining three distances (time since origin of agriculture, Gimbutas' model, or Renfrew's model) reduces the partial correlation further. Thus, neither of the two theories appears able to explain the origin of the Indo-Europeans as gauged by the genetics-language correlation.

MATERIALS AND METHODS We studied 25 genetic systems (erythrocyte antigens, plasma proteins, enzymes, histocompatibility alleles, immunoglobulins; Table 1) from 2111 IE-speaking samples in Europe. Details are specified elsewhere (10-14). We computed Prevosti's genetic distances (15, 16) (GEN) separately for the 479 to 27 localities (mean = 84) of each genetic system. Linguistic distances (LAN) were subjective estimates furnished by M. Ruhlen, based on his current classification of IE languages (17). A dendrogram (Fig. 1) resulting from UPGMA clustering (18) of the linguistic distance matrix shows the relations between the IE languages in that matrix. We computed great-circle geographic distances (GEO) between pairs of localities. The origin-of-agriculture distances (OOA) between any pair of points were described earlier (8). They sum distances from their respective starting times of agriculture back to their putative common agricultural origins. The Renfrew hypothesis distance (REN) matrix was based on ref. 4 and discussions with Professor Renfrew. In his view, most of the introduction and subsequent diversification of the IE language families in Europe was concurrent with the spread of agriculture in the continent. Nevertheless, Renfrew explains the final branching into the major language families by a series of 10 so-called transitions illustrated in ref. 4 (figure 7.7). These transitions are associated with specific archaeological assemblages whose starting dates were entered on a map we smoothed by interpolation. Superimposed on this map (Fig. 2A) is a directed graph summarizing the directions and branching patterns of Renfrew's transitions. The REN between any pair of localities is their distance in time along the directed graph. If two localities are located in regions connected by different branches of the graph, the REN is computed by summing the time-distances along each branch to the point of their common origin. 
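The REN construction described above can be sketched in code under stated assumptions. The graph, node names, dates, and function names below are hypothetical illustrations and are not read from Fig. 2A. The sketch assumes that each locality carries an interpolated date (yr BP) and belongs to one region of the directed graph, that two localities on the same lineage are separated simply by the difference of their dates, and that localities on different branches are separated by the sum of their time-distances back to the node where the branches diverge.

```python
from typing import Dict, List, Optional

# Hypothetical directed graph: child -> parent, with a date (yr BP) per node.
PARENT: Dict[str, Optional[str]] = {"root": None, "A": "root", "B": "root", "A1": "A"}
DATE_BP: Dict[str, float] = {"root": 9000, "A": 7750, "B": 6500, "A1": 5600}


def lineage(node: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the node followed by all of its ancestors up to the root."""
    chain: List[str] = []
    current: Optional[str] = node
    while current is not None:
        chain.append(current)
        current = parent[current]
    return chain


def ren_distance(node_i: str, t_i: float, node_j: str, t_j: float,
                 parent: Dict[str, Optional[str]],
                 date_bp: Dict[str, float]) -> float:
    """Time-distance (years) between two localities along the graph.

    node_i, node_j: regions containing the localities (assumed inputs).
    t_i, t_j: interpolated dates of the localities in yr BP (assumed inputs).
    """
    ancestors_j = set(lineage(node_j, parent))
    # First node on i's lineage that also lies on j's lineage.
    common = next(n for n in lineage(node_i, parent) if n in ancestors_j)
    if common in (node_i, node_j):
        # Same lineage: distance in time along a single path (assumption).
        return abs(t_i - t_j)
    # Different branches: sum the time-distances back to the common origin.
    return (date_bp[common] - t_i) + (date_bp[common] - t_j)
```

For example, `ren_distance("A1", 5000, "B", 6000, PARENT, DATE_BP)` sums the two paths back to the hypothetical root at 9000 yr BP.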
Suggestions by Professor Renfrew (personal communication) that some of these transitions might be wholly or partly acculturation rather than demic diffusion were tested by a sensitivity test aiming to maximize average GEN,LAN correlations. No genetic evidence for acculturation was found and the REN values were retained as described above. The Gimbutas distances (GIM) are based on a map (Fig. 2B) redrawn from one provided by Professor Gimbutas. It shows the regions reached by Kurganization waves 1 and/or 2 and 3. Distances between any pair of localities ij are

Almost all Europeans speak Indo-European (IE) languages, the only exceptions being Finns, Estonians, Hungarians, Turks, Basques, and Maltese. Where did IEs come from and how did they spread to most areas of Europe? Two theories of IE origins, derived from archaeological and linguistic evidence, currently predominate. The majority view is that of Marija Gimbutas (1-3) of the University of California, Los Angeles. She believes that early IEs entered southeastern Europe in three Kurgan culture waves from the Pontic Steppes starting ca. 4500 B.C. and spread from there. This view was challenged in 1987 by Colin Renfrew of Cambridge University (4). He equates early IEs with early farmers who entered southeastern Europe from Asia Minor ca. 7000 B.C. and spread through the continent by demic diffusion as proposed by Ammerman and Cavalli-Sforza (5). Genetic evidence from modern populations supports this model (6-8), justifying a subsequent test of Renfrew's theory. However, because Renfrew links his hypothesis with the origin of agriculture by demic diffusion, it becomes difficult to test the two hypotheses separately. Here we examine whether the genetic evidence available from modern European populations favors one of the two hypotheses on IE origins. Our approach is to examine correlations between genetic and linguistic distances in Europe and to estimate the effects of various factors (geography, origin of agriculture) and hypothesized movements (Gimbutas' and Renfrew's models) on the magnitude of these correlations.

computed as

$$D_{ij} = q^{k_i} + q^{k_j} - q^{(k_i + k_j)}.$$

In this formula $q$ is the proportion of nonreplacement of resident genes by Kurgan genes, and $k_i$ and $k_j$ are the numbers of Kurganization waves received by localities $i$ and $j$, respectively. Two localities are assigned $D_{ij} = 1$ ($k_i = k_j = 0$) in the un-Kurganized region (N in Fig. 2B), and $k_i = k_j = 100$ in the
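As a minimal sketch of the Gimbutas distance, the function below reads the formula as D_ij = q^k_i + q^k_j - q^(k_i + k_j). The function name and the value q = 0.5 in the usage note are illustrative assumptions. The expression equals 1 - (1 - q^k_i)(1 - q^k_j), so it is 1 when neither locality received a wave and approaches 0 when both received many.

```python
def gimbutas_distance(k_i: int, k_j: int, q: float) -> float:
    """Distance between two localities that received k_i and k_j Kurgan waves.

    q is the proportion of resident genes NOT replaced per wave; its value
    must be supplied by the caller (none is assumed here).
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be a proportion between 0 and 1")
    if k_i < 0 or k_j < 0:
        raise ValueError("wave counts must be non-negative")
    return q ** k_i + q ** k_j - q ** (k_i + k_j)
```

With the illustrative value q = 0.5, `gimbutas_distance(0, 0, 0.5)` reproduces the distance of 1 for two un-Kurganized localities, and `gimbutas_distance(100, 100, 0.5)` is effectively zero.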

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.

Abbreviation: IE, Indo-European.
†To whom reprint requests should be addressed.


Population Biology: Sokal et al.

Proc. Natl. Acad. Sci. USA 89 (1992)

Table 1. Matrix correlations between genetic and linguistic distances, and partial matrix correlations involving these distances and geographic, origin-of-agriculture, Renfrew, and Gimbutas distances

System          N    GEN,LAN    .GEO       .GEO,OOA   .GEO,OOA,REN  .GEO,OOA,GIM
1-2 ABO         133   0.144***   0.001      0.067      0.067         0.027
2-5 MN          179   0.024      0.039      0.040      0.057*        0.106*
2-7 MN          51   -0.002     -0.142     -0.065     -0.050         0.016
3-1 P           79   -0.042     -0.077     -0.045     -0.046        -0.021
4-1 RHESUS      479   0.054***   0.006     -0.001      0.004         0.003
4-13 RHESUS     74    0.114*     0.057      0.086*     0.087         0.065
4-19 RHESUS     69    0.179***   0.164**    0.173**    0.178***      0.192***
5-1 LUTH        27   -0.029     -0.015      0.010      0.012         0.019
6-1 KELL        103   0.093      0.076      0.048      0.032         0.075
6-3 KELL        30    0.027     -0.054      0.025      0.027         0.061
7-1 ABHSE       49    0.017     -0.190     -0.174     -0.180        -0.217
8-1 DUFFY       81    0.109*     0.073      0.084      0.080         0.110**
36-1 HP         147   0.133***   0.055*     0.065*     0.061*        0.038
37-1 TF         33    0.117*     0.072      0.063      0.060         0.077
38-1 GC         85    0.003      0.007     -0.005     -0.009         0.057*
50-1-1 AP       61    0.317***   0.249***   0.207*     0.195*        0.177*
52 PGD          39    0.122      0.043      0.045      0.049         0.005
53 PGM1         63    0.236***   0.077      0.000     -0.013         0.048
56 AK           58    0.005     -0.119     -0.062     -0.073         0.025
63 ADA          41    0.203***   0.059      0.058      0.031        -0.010
65 TASTER       52    0.359***   0.273***   0.291***   0.294***      0.219***
100 HLA-A       60    0.408***   0.238**    0.181***   0.177*        0.184***
101/2 HLA-B     60    0.455***   0.280***   0.216***   0.211***      0.231***
200 GM          30    0.231***   0.080      0.052      0.075        -0.011
201 KM          28    0.246*     0.213*     0.215*     0.184*        0.19*
Average               0.141      0.059      0.063      0.060         0.067

Numbers preceding the system symbols, up to 65, are those assigned by Mourant et al. (9); those from 100 and above were assigned in our laboratory. N, number of localities sampled; GEN, LAN, GEO, OOA, REN, and GIM stand for genetic, linguistic, geographic, origin-of-agriculture, Renfrew, and Gimbutas distances, respectively. The pairwise correlation is indicated as GEN,LAN; in the interest of brevity, partial correlations, all of which are of GEN against LAN with various other distances held constant, are indicated by a period followed by the variables held constant. Thus, .GEO,OOA stands for $r_{\mathrm{GEN,LAN.GEO,OOA}}$. The correlations are followed by significance symbols based on 249 permutations of rows and columns of one of the two distance or residual matrices. Significances are indicated as follows: *, 0.01 < P ≤ 0.05; **, 0.004 < P ≤ 0.01; ***, P = 0.004. The last probability is conservative, since it is the lowest we can demonstrate with 249 permutations. Had we carried out more permutations, we probably could have shown that many of the correlations marked by three asterisks are significant at lower probability levels.
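The permutation scheme in the table note can be illustrated with a short sketch of a (partial) matrix correlation test. It is not the authors' program. It assumes that the partial correlation is obtained by regressing GEN and LAN on the control matrices and correlating the residuals, that rows and columns of one residual matrix are permuted jointly, and that the test is one-tailed. The function names and the use of NumPy are choices made for this illustration.

```python
import numpy as np


def upper(m: np.ndarray) -> np.ndarray:
    """Entries above the diagonal of a square matrix, as a vector."""
    i, j = np.triu_indices_from(m, k=1)
    return m[i, j]


def matrix_r(a: np.ndarray, b: np.ndarray) -> float:
    """Pearson correlation between the off-diagonal entries of two matrices."""
    return float(np.corrcoef(upper(a), upper(b))[0, 1])


def residual_matrix(y: np.ndarray, controls: list) -> np.ndarray:
    """Symmetric matrix of least-squares residuals of y on the control matrices."""
    n = y.shape[0]
    i, j = np.triu_indices(n, k=1)
    design = np.column_stack([np.ones(i.size)] + [upper(c) for c in controls])
    yv = upper(y)
    beta, *_ = np.linalg.lstsq(design, yv, rcond=None)
    res = np.zeros((n, n), dtype=float)
    res[i, j] = yv - design @ beta
    return res + res.T


def partial_mantel(gen, lan, controls, n_perm=249, seed=0):
    """Return (correlation, one-tailed permutation probability)."""
    rg = residual_matrix(gen, controls) if controls else np.asarray(gen, dtype=float)
    rl = residual_matrix(lan, controls) if controls else np.asarray(lan, dtype=float)
    observed = matrix_r(rg, rl)
    rng = np.random.default_rng(seed)
    n = rg.shape[0]
    hits = 1  # the observed arrangement counts as one outcome
    for _ in range(n_perm):
        p = rng.permutation(n)
        # Permute rows and columns of one matrix together.
        if matrix_r(rg[np.ix_(p, p)], rl) >= observed:
            hits += 1
    return observed, hits / (n_perm + 1)
```

With 249 permutations the smallest probability this sketch can return is 1/250 = 0.004, which is the floor noted for the entries marked by three asterisks.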

[Fig. 2A key: contour time intervals at 4500, 4950, 5600, 5900, 6500, 7250, 7750, 8500, and 9000 yr BP.]

that the high correlations subsumed in the low averages remain high as partial correlations also. Our results imply that, while neither of the currently contending hypotheses of IE origins can be supported by the genetic evidence, there is significant residual correlation for some genetic systems that requires explanation. We shall not propose an alternative hypothesis of IE origins. However, the observed correlations invite exploratory data analysis of the population samples supporting the relation between language and genetics. We examined the residuals from the regressions of GEN on GEO and LAN on GEO. We examined the highest 2% of the products of these residuals and mapped the pairs of localities represented by them for each of the five systems (50.1.1 AP, 65 T, 101 HLA-A, 101/102 HLA-B, and 201 KM) showing the

FIG. 2. Maps used for computing distances corresponding to the two theories of IE origins. (A) Map for computing Renfrew (REN) distances. The contours represent time intervals [years before present (yr BP)], as identified in the key. They were obtained from a map in which the starting dates of archaeological assemblages, which characterize Renfrew's 10 transitions (4), were interpolated to smooth the surface. The area has been subdivided into 10 numbered regions corresponding to the identically numbered transitions. A directed graph following the outline furnished by Renfrew has been superimposed on the map. (B) Map for computing Gimbutas (GIM) distances. The map shows outlines of regions in Europe that received none, one, or more of the Kurgan waves described by Gimbutas in refs. 1-3. Each region is labeled by the wave number it received. Regions that received more than one wave are marked by more than one numeral. Thus, area 123 received waves 1, 2, and 3. The region labeled 0 is the original home of the Kurgan people (the Sredni Stog and Yamna cultures); regions labeled N received none of the Kurgan waves. The map is based on hand-drawn originals by M. Gimbutas.

highest correlation between genetics and language. Paired positive (negative) deviations indicate areas more (less) distant genetically and linguistically than their geographic distances would predict. In maps for these five systems, paired positive deviations frequently involve Sardinia, which is quite distant genetically, and also linguistically, from nearby Mediterranean populations. Paired negative deviations heavily involve Iceland. Icelanders, being "displaced Scandinavians," are far less distant genetically and linguistically from Scandinavians and other Germanic speakers than their geographic distances indicate. While the map patterns do not, unfortunately, suggest an alternative hypothesis of IE origins, since the relations they do indicate are far more recent (such as the settlement of Iceland), they suggest that the overall pattern of partial


[Fig. 3 diagram: the average GEN,LAN correlation and the successive average partial correlations (.GEO; .GEO,OOA; .GEO,OOA,REN; .GEO,OOA,GIM; .GEO,OOA,REN,GIM) joined by large arrows. Small-arrow key: Renfrew correct; Gimbutas correct; neither correct.]

FIG. 3. Summary of results. The large arrows indicate successive steps in computing zero- to fourth-order partial correlations between genetic (GEN) and linguistic (LAN) distances. Other distances successively held constant are geography (GEO), origin of agriculture (OOA), Gimbutas (GIM), and Renfrew (REN). The numerical values at both ends of the large arrows are the average correlations from the bottom line of Table 1. They are all highly significant (P 0.05; ***, P < 0.005]. The three small arrows beneath each large arrow furnish predictions made by each theory concerning the behavior of the partial correlations. From the top down the arrows represent Renfrew's theory, Gimbutas' theory, and the assumption that neither theory is correct. A horizontal small arrow predicts no effect, a downward sloping small arrow predicts a reduction in the magnitude of the partial correlations, and a downward vertical small arrow predicts a reduction of the partial correlation to nonsignificance. The small arrows illustrate that the predictions of the Renfrew and Gimbutas theories are not borne out and that the outcomes are compatible with the prediction that neither theory is correct.

correlations might help us decide among competing hypotheses. If the IEs originated in situ by local differentiation only, there should be no significant partial correlation, since geography should fully explain the observed genetic and linguistic distances. This was not the case. If the genetics-language correlation were entirely due to the spread of populations accompanying the origin of agriculture, then the origin-of-agriculture model should suffice, or at least there should be some effect due to origin of agriculture. But we saw that origin-of-agriculture distances (OOA) cannot reduce the partial correlations remaining after geography has been held constant. If the IEs originated by a branching process outside or inside of Europe and the populations ancestral to the modern IE language families branched off at different times, and moved into different regions in Europe where they differentiated subsequently, they would yield a pattern such as was found by us. A phylogenetic tree structure would add similarities and distances to the data, above and beyond those engendered by local differentiation. These conclusions agree with earlier findings in our laboratory (13, 14, 27) that intrusion of populations differentiated elsewhere has contributed an important element to the association between genetics and language in Europe.

We thank Prof. Marija Gimbutas, Lord Renfrew, and Dr. Merritt Ruhlen for their collegial cooperation in this work. We are indebted to D. DiGiovanni, M.-J. Fortin, and C. Wilson for technical assistance. Part of the computation was carried out on the Cornell National Supercomputer Facility. This research was supported by National Science Foundation Grant BNS8918751 and National Institutes of Health Grant GM28262.

1. Gimbutas, M. (1973) J. Indo-Eur. Studies 1, 1-20.
2. Gimbutas, M. (1979) Arch. Suisses Anthropol. Gén. 43, 113-137.
3. Gimbutas, M. (1986) in Ethnogenese europäischer Völker, eds. Bernhard, W. & Kandler-Pálsson, A. (Fischer, Stuttgart, F.R.G.), pp. 5-20.
4. Renfrew, C. (1987) Archaeology and Language: The Puzzle of Indo-European Origins (Jonathan Cape, London).
5. Ammerman, A. J. & Cavalli-Sforza, L. L. (1984) The Neolithic Transition and the Genetics of Populations in Europe (Princeton Univ. Press, Princeton, NJ).
6. Menozzi, P., Piazza, A. & Cavalli-Sforza, L. L. (1978) Science 201, 786-792.
7. Sokal, R. R. & Menozzi, P. (1982) Am. Nat. 119, 1-17.
8. Sokal, R. R., Oden, N. L. & Wilson, C. (1991) Nature (London) 351, 143-145.
9. Mourant, A. E., Kopeć, A. C. & Domaniewska-Sobczak, K. (1976) The Distribution of the Human Blood Groups (Oxford Univ. Press, London).
10. Derish, P. A. & Sokal, R. R. (1988) Hum. Biol. 60, 801-824.
11. Sokal, R. R. (1988) Proc. Natl. Acad. Sci. USA 85, 1722-1726.
12. Sokal, R. R., Oden, N. L. & Thomson, B. A. (1988) Am. J. Phys. Anthropol. 76, 337-361.
13. Sokal, R. R., Oden, N. L., Legendre, P., Fortin, M.-J., Kim, J. & Vaudor, A. (1989) Am. J. Phys. Anthropol. 79, 489-502.
14. Sokal, R. R., Harding, R. M. & Oden, N. L. (1989) Am. J. Phys. Anthropol. 80, 267-294.
15. Prevosti, A., Ocana, J. & Alonso, G. (1975) Theor. Appl. Genet. 45, 231-241.
16. Wright, S. (1978) Evolution and the Genetics of Populations, Vol 4: Variability Within and Among Populations (Univ. of Chicago Press, Chicago).
17. Ruhlen, M. (1991) A Guide to the World's Languages, Vol 1: Classification; With a Postscript on Recent Developments (Stanford Univ. Press, Stanford, CA).
18. Sneath, P. H. A. & Sokal, R. R. (1973) Numerical Taxonomy (Freeman, San Francisco).
19. Mantel, N. (1967) Cancer Res. 27, 209-220.
20. Sokal, R. R. (1979) Syst. Zool. 28, 227-231.
21. Smouse, P. E., Long, J. C. & Sokal, R. R. (1986) Syst. Zool. 35, 627-632.
22. Sokal, R. R. & Rohlf, F. J. (1981) Biometry (Freeman, San Francisco), 2nd Ed.
23. Ammerman, A. J. & Cavalli-Sforza, L. L. (1973) in The Explanation of Culture Change, ed. Renfrew, C. (Duckworth, London), pp. 343-357.
24. Ammerman, A. J. & Cavalli-Sforza, L. L. (1979) in Transformations: Mathematical Approaches to Culture Change, eds. Renfrew, C. & Cooke, K. L. (Academic, New York), pp. 275-294.
25. Sokal, R. R., Oden, N. L., Legendre, P., Fortin, M.-J., Kim, J., Thomson, B. A., Vaudor, A., Harding, R. M. & Barbujani, G. (1990) Am. Nat. 135, 157-175.
26. Roychoudhury, A. K. & Nei, M. (1988) Human Polymorphic Genes: World Distribution (Oxford Univ. Press, New York).
27. Sokal, R. R. (1991) Annu. Rev. Anthropol. 20, 119-140.
