bammds: a tool for assessing the ancestry of low-depth whole-genome data using multidimensional scaling (MDS).

BIOINFORMATICS APPLICATIONS NOTE Genome analysis

Vol. 30 no. 20 2014, pages 2962–2964 doi:10.1093/bioinformatics/btu410

Advance Access publication June 28, 2014

bammds: a tool for assessing the ancestry of low-depth whole-genome data using multidimensional scaling (MDS) Anna-Sapfo Malaspinas1,*, Ole Tange1,*, Jose Vıctor Moreno-Mayar1, Morten Rasmussen1,2, Michael DeGiorgio3, Yong Wang4,5, Cristina E. Valdiosera1,6, Gustavo Politis7,8, Eske Willerslev1 and Rasmus Nielsen1,4 1

Associate Editor: Inanc Birol

ABSTRACT Summary: We present bammds, a practical tool that allows visualization of samples sequenced by second-generation sequencing when compared with a reference panel of individuals (usually genotypes) using a multidimensional scaling algorithm. Our tool is aimed at determining the ancestry of unknown samples—typical of ancient DNA data—particularly when only low amounts of data are available for those samples. Availability and implementation: The software package is available under GNU General Public License v3 and is freely available together with test datasets https://savannah.nongnu.org/projects/bammds/. It is using R (http://www.r-project.org/), parallel (http://www.gnu.org/software/parallel/), samtools (https://github.com/samtools/samtools). Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online. Received on April 10, 2014; revised on June 16, 2014; accepted on June 23, 2014

1

INTRODUCTION

Population structure plays an important role in determining the evolutionary history of a group. A great deal has been learned from single nucleotide polymorphism (SNP) array technology providing unmatched information of the population structure of several species [for humans, see (Novembre and Ramachandran, 2011)]. The advent of new sequencing platforms, which can deliver millions to billions of sequencing reads within days, has shifted the focus from SNP array data to whole-genome shotgun (WGS) data. While the cost has steadily decreased (Sboner et al., 2011), obtaining many high-depth genomes remains prohibitive for many laboratories, in particular when working with ancient DNA (aDNA) samples where it is *To whom correspondence should be addressed.

often desirable to screen many samples of potential interest while keeping the cost at a minimum. Methods based on non-parametric multidimensional statistics (more specifically principal components analysis, PCA) were first applied to genetic data more than 30 years ago (Menozzi et al., 1978). PCA has since become a standard tool in population genetics (Patterson et al., 2006; Wang et al., 2014) owing in particular to (i) the low computational demand of such analyses, (ii) the appealing graphical result and (iii) its ease of use. Here, we describe a tool that allows to assign an ancestry to low-depth mapped WGS data when compared with an existing reference panel of genotype data using multidimensional scaling (MDS) based on genetic distances, a related method that provides results similar to those of PCA (Cox and Cox, 2000).

2

METHODS

In what follows, we assume that WGS data have been mapped to a reference genome and that files in BAM format are available (Li et al., 2009). Calling genotypes for low-depth data is a challenging task (Nielsen et al., 2011), particularly for aDNA, as ancient damage (Briggs et al., 2007) and contamination are not incorporated into sequence data error models. To avoid calling genotypes, we sample a read at every position for the WGS data, similar in spirit to previous aDNA approaches (Green et al., 2010). Specifically, for the reference set of individuals, we randomly sample one of the alleles from each individual, and for the WGS data, we choose an allele from a randomly selected read covering that site. If no read covers that site or if the sampled allele is not the minor or the major allele in the reference panel, we then assume that the data for this site are missing for that sample. In other words, the data in both the reference panel and the WGS samples become either one allele (A, C, G or T) or missing data. For site k, let dkij = 1 if individuals i and j have a different randomly chosen allele and 0 if that allele is the same or if one of the individuals has missing data. Assume that the number of sites in the reference panel is K. Denote K~ ij as the number of sites where neither of individual i and j have missing data. Then, the allele-sharing distance between individuals i and j

ß The Author 2014. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/ by-nc/3.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

Downloaded from http://bioinformatics.oxfordjournals.org/ at East Tennessee State University on May 27, 2015

Centre for GeoGenetics, Natural History Museum of Denmark, University of Copenhagen, 1350 Copenhagen K, Denmark, 2Department of Genetics, Stanford University School of Medicine, Stanford, CA 94305, 3Department of Biology, Pennsylvania State University, Wartik Laboratory, University Park, PA 16802, 4Centre for Theoretical Evolutionary Genomics, Departments of Integrative Biology and Statistics, University of California, Berkeley, CA 94720-3140, 5 Ancestry.com DNA LLC, San Francisco, CA 94107, 6Department of Archaeology, Environment and Community Planning Faculty of Humanities and Social Sciences, La Trobe University, Melbourne, VIC 3086, Australia, 7INCUAPACONICET, Universidad del Centro de la Provincia de Buenos Aires, 7600 Olavarrıa, Argentina and 8Facultad de Ciencias Naturales y Museo de La Plata, 1900 La Plata, Argentina

bammds

dij =

K X

1 dkij K~ ij k=1

A matrix D=ðdij Þ of allele-sharing distances between all pairs of individuals is computed. We then apply classical MDS to this matrix [e.g. (Cox and Cox, 2000)]. Our implementation has three major features: it is user friendly and is intended to be used by biologists with limited familiarity with a UNIX system, it is flexible in terms of formats of the reference panel and in terms of the visual output, it runs in 20 min on a machine with four 2.2 GHz cores with a reference panel including 4600 000 SNPs and 950 individuals, making it practical to screen samples of an ongoing experiment progressively as additional data are produced. We first tested bammds through simulations using publicly available modern and ancient human data. For the WGS data, we used 10 modern human genomes from HGDP cell lines, published in Meyer et al. (2012), an Australian aboriginal genome (Rasmussen et al., 2011) and the Anzick-1 genome (Rasmussen et al., 2014). We mapped and processed the data identically for all genomes (see Supplementary data). We used a

M M M M M M M B M B BB B BB B B B BB B B SSS B B B B Y NY Y Y B Y Y Y Y B B B M Y Y Y M M M M M M M M B B M M M BB M M M B MB

A

B BB

M F P S H Y K A N D O A R

B F I S T M B D P B B B H

Mbuti French Papuan Sardinian Han Yoruba Karitiana San Mandenka Dai Orcadian Adygei Russian

Basque French Italian Sardinian Tuscan Mozabite Bedouin Druze Palestinian Balochi Brahui Burusho Hazara

K M P S U M P C K S M P B

Kalash Makrani Pathan Sindhi Uygur Melanesian Papuan Colombian Karitiana Surui Maya Pima BantuSAfrica

S

B M Y B M S H H D D H L M

BantuKenya Mandenka Yoruba BiakaPygmy MbutiPygmy San Han Han−NChina Dai Daur Hezhen Lahu Miao

P P P P P P P

M

M M M P M M M M

P M B M M MM S M MM B M M B M M M M S M M M M MM M B M P M M M BB B B BM S S B B B BBB H B B M SPP BBB B B BB M PP BBB B SPSB M M B SSSS S S BMB B BS S S B M S P M P S P MB BPP PP PB BB B P P M BB P PB B P MM P B BB BPP PK S BB P K M B B B S B K MM K K B K B P BP B B K B B M K K PP PB P M BBPK B B BB B P B MM B B M K BBBB P P B BP P P P B B P P P B P P B P B P BB B P B P P B B P D P B BP AR P B B B P AR R R A D D RR RR R R D D DD AAAAA AAA R D RR D D D R R D D R D D R A D D D DD DDD D A AR R A F O S SDTTITFFITFFTFFFFOO O F FF O O FF O IB F F S IF B IB IT IB FO F F B SSB B S S S S FB B B B S SS B SS S S S BO B S B

O S T T X Y M N C J Y

Oroqen She Tujia Tu Xibo Yi Mongola Naxi Cambodian Japanese Yakut

H S JH H S T S H T JT S SS M M M S H Y D H JJM T D JT H S M JH L N Y T H H T D D Y M H D H H JH S D JH L H JH N Y JM T N D Y L N JJH C Y N JJH LLLM H H H H N H H JJY L H L H N O Y C M D H H H Y X H M Y D D H D X TX C TT D C D O X D M M H M YO O D O O C TTT XM D M C D C T CC C T XM Y Y Y Y Y Y Y M Y Y Y YYM K C S PC S K K S K CK C K M S YYY M M C K S PP Y P P KK C PSK X YM MP PP P M PM M M P M M P M M M M YM M M

UY Y M M M UH U H H H HH H H U H UHUH H H H UH U U HU HH

Fig. 1. First two dimensions of an MDS plot including the ten 0.1X modern human genomes and the HGDP SNP data

2963


public reference panel that we make available in the Supplementary data, i.e. HGDP (Li et al., 2008), which includes 4600 000 SNPs and 950 individuals subdivided into 53 populations and 7 geographic regions (Africa, Eastern-, Western-, Central- and South Asia, Europe, Oceania and Native America). For each genome, we sampled 3 104 , 3 105 , 3 106 , 1:5 107 , 3 107 and 1:5 108 reads (which corresponds to a depth of coverage around 0.001, 0.01, 0.1, 0.5, 1 and 5, assuming 100 bp sequence reads). For each sub-sampled genome, we ran bammds with the HGDP reference panel. We summarized the simulation results using dimension 1 and 2 only of the MDS output, as we expect this to be the common usage. For each population in HGDP, we defined its centroid (or center of gravity) based on the coordinates of its members for those two dimensions. We then evaluated the results using two criteria: (i) by assessing which population was the closest when comparing the position of the WGS sample with the population centroid, and (ii) by determining if the position of the genome is within a two-dimensional 99% confidence region. We built the confidence region by assuming that the points follow a bivariate normal distribution centred around the centroid of the population to which it belongs (‘population ellipse’). We present a practical example on how to use the tool to determine whether a library is heavily contaminated by processing a newly sequenced 10 000 year BP old phalange (‘Gus’) from Argentina that clusters with the Europeans (Supplementary data).

is as follows:

A.-S.Malaspinas et al.

Table 1. Summary of the simulation results for the ten modern genomes. For more details, see Supplementary data Min. approx. depth of coverage to . . .

. . . recover geographic region as closest centroid

3

0.001 0.001 0.001 0.1 0.001 0.001 0.01 0.001 0.001 0.001

0.001 0.01 0.001 0.01 0.1 0.001 0.01 0.001 0.1 0.5

0.1 0.1 0.5 0.5 0.01 0.1 0.1 1 0.1 0.5

RESULTS

The graphical result with all 10 modern individuals at a depth of 0.1 can be seen on Figure 1. We find in the simulations that for all but two cases, we recover the geographic region as the first hit for as few as 30 000 reads (0.001, Table 1). In the remaining two cases, the Sardinian and the Karitiana individual, a depth of 0.1 and 0.01, respectively, is enough. The true nearest population was also identified in most cases within the three closest centroids for a depth above 0.01 (7/10 cases). For the second criteria, we find that in 9/10 of the cases, the WGS sample was within the population ellipse at 0.5 and above. Only in one case (San individual) was a depth of 1 necessary to be placed within the population ellipse. For the ancient data, we get similar results for the Aborigine, which is assigned to the correct geographic region (Oceania) as a first hit with a depth of 0.001 and above. At a depth higher than 0.01, we also recover the expected population as the closest population. For the Anzick-1 individual, presumably because of increased damage, a depth of 1 is needed to recover the geographic region as the first hit. On the other hand, a Native American population is among the three closest populations from a depth of 0.1 and above. The results for Gus are given in Supplementary data.

4

CONCLUSION

The tool we present in this article is based on classical MDS, a technique that originated in the 1930s and is commonly used in other fields [see, e.g. (Borg and Groenen, 1997) and citations therein]. We present a tool that was designed to be practical to

2964

ACKNOWLEDGMENTS The authors thank Marıa C. Avila-Arcos, Amhed Missael Vargas Velazquez, Morten E. Allentoft, Hannes Schroeder, Kerttu Majander, Maanasa Raghavan and Johannes Krause for helpful discussions and testing, and the National high-throughput DNA Sequencing Center for assistance with the sequencing. Funding: A.-S.M. was supported by a Swiss NSF, J.V.M.-M. by the ‘Consejo Nacional de Ciencia y Tecnologıa’ (Mexico) and M.D. by the US NSF (DBI-1103639). GeoGenetics members were supported by the Lundbeck Foundation and the Danish National Research Foundation (DNRF94). Conflict of Interest: none declared.

REFERENCES Borg,I. and Groenen,P.J.F. (1997) Modern Multidimensional Scaling: Theory and Applications. Springer, New York. Briggs,A.W. et al. (2007) Patterns of damage in genomic DNA sequences from a Neandertal. Proc. Natl Acad. Sci. USA, 104, 14616–14621. Cox,T.F. and Cox,M.A.A. (2000) Multidimensional Scaling. 2nd edn. Chapman and Hall/CRC, Florida. Green,R.E. et al. (2010) A draft sequence of the Neandertal genome. Science, 328, 710–722. Li,H. et al. (2009) The Sequence Alignment/Map format and SAMtools. Bioinformatics, 25, 2078–2079. Li,J.Z. et al. (2008) Worldwide human relationships inferred from genome-wide patterns of variation. Science, 319, 1100–1104. Menozzi,P. et al. (1978) Synthetic maps of human gene frequencies in Europeans. Science, 201, 786–792. Meyer,M. et al. (2012) A high-coverage genome sequence from an archaic Denisovan Individual. Science, 338, 222–226. Nielsen,R. et al. (2011) Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet., 12, 443–451. Novembre,J. and Ramachandran,S. (2011) Perspectives on human population structure at the cusp of the sequencing era. Annu. Rev. Genomics Hum. Genet., 12, 245–274. Patterson,N. et al. (2006) Population structure and eigenanalysis. PLoS Genet., 2, e190. Rasmussen,M. et al. (2011) An Aboriginal Australian genome reveals separate human dispersals into Asia. Science, 334, 94–98. Rasmussen,M. et al. (2014) The genome of a Late Pleistocene human from a Clovis burial site in western Montana. Nature, 506, 225–229. Sboner,A. et al. (2011) The real cost of sequencing: higher than you think! Genome Biol., 12, 125. Wang,C. et al. (2014) Ancestry estimation and control of population stratification for sequence-based association studies. Nat. Genet., 46, 409–415.


Mbuti (Africa) French (Europe) Papuan (Oceania) Sardinian (Europe) Han (Eastern Asia) Yoruba (Africa) Karitiana (America) San (Africa) Mandenka (Africa) Dai (Eastern Asia)

. . . recover true . . . be placed population within population within three ellipse closest centroids

assess the ancestry of mapped WGS data for samples sequenced at low depth, assuming that a relevant reference panel in terms of ancestry is provided. We show through simulations that useful ancestry information can be recovered for as few as 30 000 reads—corresponding to a fraction (1/60 in early 2014) of a HiSeq 2000 lane (www.illumina.com) for a sample with 1% endogenous content (or 1/4800 of a lane for a typical modern sample).

Graphical Representation of Proximity Measures for Multidimensional Data: Classical and Metric Multidimensional Scaling.

MM-MDS: a multidimensional scaling database with similarity ratings for 240 object categories from the Massive Memory picture database.

Multidimensional scaling of pictorial informativeness.

Analysis of world economic variables using multidimensional scaling.

Multidimensional scaling analysis of the dynamics of a country economy.

Using multidimensional scaling on data from pairs of relatives to explore the dimensionality of categorical multifactorial traits.

A multidimensional dynamic quantification tool for the mitral valve.

Multidimensional perceptual scaling of musical timbres.

Ycasd - a tool for capturing and scaling data from graphical representations.

MDS data should be used for research.

Multidimensional Unfolding by Nonmetric Multidimensional Scaling of Spearman Distances in the Extended Permutation Polytope.

Empirical modeling of an alcohol expectancy memory network using multidimensional scaling.

A new tool called DISSECT for analysing large genomic data sets using a Big Data approach.

Noncombination of pitch and loudness in multidimensional scaling.

Predicting and interpreting identification errors in military vehicle training using multidimensional scaling.

Using multidimensional scaling to quantify similarity in visual search and beyond.

The Predictions Table: a tool for assessing students' knowledge.

Using population data for assessing next-generation sequencing performance.

Confirmatory Factor Analysis and Profile Analysis via Multidimensional Scaling.

A structure-based distance metric for high-dimensional space exploration with multidimensional scaling.

On the development of a multidimensional Dutch pain assessment tool for children.

Assessing microbial competition in a hydrogen-based membrane biofilm reactor (MBfR) using multidimensional modeling.

Assessing Lymphatic Filariasis Data Quality in Endemic Communities in Ghana, Using the Neglected Tropical Diseases Data Quality Assessment Tool for Preventive Chemotherapy.

SearchSmallRNA: a graphical interface tool for the assemblage of viral genomes using small RNA libraries data.