HHS Public Access
Author manuscript

Forensic Sci Int Genet Suppl Ser. Author manuscript; available in PMC 2017 August 07. Published in final edited form as:

Forensic Sci Int Genet Suppl Ser. 2015 December ; 5: e267–e268. doi:10.1016/j.fsigss.2015.09.106.

SNPs and SNVs in forensic science

Bruce S. Weir* and Xiuwen Zheng

Department of Biostatistics, University of Washington, Box 359461, Seattle, WA 98195-9461, United States

Abstract

The utility of short tandem repeat (STR) genetic markers for forensic science is beyond question, and there are over 50 million STR profiles in current national databases. The magnitude and value of those data, however, are likely to be dwarfed by what is emerging from large-scale SNP and DNA sequence assays. Phenotypic characterization may well accompany future statements about identity. In this very brief review we focus on the use of rare variants to describe relatedness and population structure.

Keywords: Population genetics; Sequence variants; Population structure; Relatedness


1. Introduction


The amount of genetic data collected by forensic scientists worldwide is now substantial, with short tandem repeat (STR) frequencies having been reported in this and other journals from close to one million people. National offender databases now have a total of about 50 million profiles, although the possibility of conducting numerical experiments on those data [1] in order to address issues such as the dependencies of matching probabilities among loci [2] is presently limited to very few countries. Forensic science is turning its attention to large-scale genetic data, such as those provided for single nucleotide polymorphisms (SNPs) from chip-array technology [3] or single nucleotide variants (SNVs) from next-generation sequencing (NGS) [4]. The 1000 Genomes project, www.1000genomes.org, has already published whole-genome sequence data that include over 78 million SNPs. The data will accumulate rapidly from several projects: in the US the National Human Genome Research Institute plans to sequence 200,000 people and the National Heart Lung and Blood Institute anticipates sequencing another 100,000 people. Similar projects are underway in other countries. The implications for forensic science are likely to be substantial, and will include the characterization of biogeographical ancestry (BGA) and externally visible characters (EVC) as well as enhanced deconvolution of mixtures.

* Corresponding author. [email protected] (B.S. Weir).

Conflict of interest statement: The authors declare no conflict of interest.


In this brief discussion we will explore the use of SNPs or SNVs to characterize relatedness on the evolutionary time scale and the immediate family time scale.

2. Methods

To set the stage for a discussion of emerging SNP and SNV data, we refer to a forthcoming review of published STR allele frequencies [5]. Data for 446 populations were gathered from 250 publications, and analyses were conducted on 24 STR loci. An aim of the study was to provide estimates for the population structure parameter θ used in the Balding–Nichols [6] match probability equations. The probability that two alleles in population i are the same type, i.e. they match, is

$$M_i = \theta_i + (1 - \theta_i) \sum_u p_u^2$$

where θi is the probability those two alleles are identical by descent and pu are the probabilities of alleles of type u. If the allele frequencies are estimated with sample values p̃u from a database representing a set of populations rather than just population i, then the estimated matching proportion is

$$\hat{M}_i = \beta_i + (1 - \beta_i) \sum_u \tilde{p}_u^2$$

where βi = (θi − θB)/(1 − θB) and θB is the average over all pairs of populations in the set of between-population-pair analogs of θi. If θij is the probability an allele from population i is identical by descent to an allele from population j, then θB is the average of the θij's. The estimated value of βi is (M̃i − M̃B)/(1 − M̃B), where M̃i is the observed allelic matching proportion in population i and M̃B is the observed allelic matching proportion between all pairs of populations in some sampled set of populations.
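The estimate (M̃i − M̃B)/(1 − M̃B) can be illustrated with a minimal Python sketch for a single locus. The function name `population_betas` and the input layout are hypothetical, introduced only for illustration. The sketch also assumes particular forms for the observed matching proportions that the text does not spell out: within a population, the proportion of pairs of distinct sampled alleles that match, and between two populations, the sum over allele types of the product of the two sample frequencies.

```python
import numpy as np

def population_betas(counts):
    """Population-specific beta estimates at one locus (illustrative sketch).

    counts: (r populations) x (k allele types) array of allele counts.
    Returns (betas, m_within, m_b).
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)                  # alleles sampled per population (needs n >= 2)
    freqs = counts / n[:, None]             # sample allele frequencies

    # Assumed within-population matching: share of distinct allele pairs that match.
    m_within = (counts * (counts - 1)).sum(axis=1) / (n * (n - 1))

    # Assumed between-population matching for each pair (i, j): sum_u p_iu * p_ju.
    m_between = freqs @ freqs.T

    # Average between-population matching over all pairs of distinct populations.
    r = counts.shape[0]
    upper = np.triu_indices(r, k=1)
    m_b = m_between[upper].mean()

    # beta_i = (M_i - M_B) / (1 - M_B); undefined if every population is
    # fixed for the same allele, because then M_B = 1.
    betas = (m_within - m_b) / (1.0 - m_b)
    return betas, m_within, m_b


if __name__ == "__main__":
    # Made-up allele counts for three populations and four allele types.
    example = [[40, 30, 20, 10],
               [10, 20, 30, 40],
               [25, 25, 25, 25]]
    betas, m_within, m_b = population_betas(example)
    print("within-population matching:", m_within)
    print("between-population matching:", m_b)
    print("beta estimates:", betas)
```

Changing which populations are passed in changes M̃B and therefore every βi, which is the dependence on the reference set that the estimate carries.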


The key observation is that β, i.e. FST, depends on the set of sampled populations. If nothing is known about the population for which matching probabilities are to be estimated, then the set may be a world-wide one such as the 446 populations described in [5]. If the continental ancestry of the target population is known, then the set of populations may be those with similar ancestry. We find that the world-wide reference set generally leads to higher estimates of β, and therefore larger estimated matching proportions, than an ancestry-specific reference set. The opposite can hold for populations of African ancestry, in which matching can be less than it is between pairs of populations elsewhere in the world. Lower within- than between-population matching will be more common for SNP and SNV data in which rare variants are often found, with an extreme situation being when some variants are private, meaning they exist in only one population. Suppose a variant is found to have small sample frequency x in one population but is not seen in r − 1 other populations. The estimated value of β for the population with the variant is (1 + xr) − r, and this can be very large and negative: at the nucleotide site under study the matching proportion within the population with the variant is less than it is between any other two populations.
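The value (1 + xr) − r can be recovered with a short derivation, sketched here under two assumptions that are not stated in the text: the site has only two alleles, and matching proportions are computed directly from sample frequencies without a small-sample correction. Population 1 carries the variant at frequency x; j and j′ denote any of the other r − 1 populations.

```latex
\begin{align*}
\tilde{M}_1 &= x^2 + (1-x)^2 = 1 - 2x(1-x) \\
\tilde{M}_{1j} &= 1 - x, \qquad \tilde{M}_{jj'} = 1 \quad (j, j' \neq 1) \\
\tilde{M}_B &= \frac{(r-1)(1-x) + \tfrac{1}{2}(r-1)(r-2)}{\tfrac{1}{2}\,r(r-1)}
             = 1 - \frac{2x}{r} \\
\hat{\beta}_1 &= \frac{\tilde{M}_1 - \tilde{M}_B}{1 - \tilde{M}_B}
             = \frac{2x/r - 2x(1-x)}{2x/r}
             = 1 - r(1-x) = (1 + xr) - r
\end{align*}
```

For instance, x = 0.01 and r = 10 give 1 − 10 × 0.99 = −8.9, so the estimate approaches 1 − r as the variant becomes rarer.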


Population structure parameters describe relationships brought about by shared ancestry. The increased amount of information in SNPs as compared to STRs also allows family-based relatedness to be estimated. Instead of trying to distinguish among major relationship categories such as parent–child, full siblings or unrelated in mass disaster victim identifications, it becomes feasible to estimate the actual relationship. The usual estimate [7] for the coancestry (kinship) coefficient between individuals i and j from L-SNP profiles is

$$\hat{\theta}_{ij} = \frac{1}{L} \sum_{l=1}^{L} \frac{(x_{il} - 2p_l)(x_{jl} - 2p_l)}{4 p_l (1 - p_l)}$$

where xil and xjl are the numbers (0, 1, 2) of reference alleles at SNP l for i and j respectively and pl is the reference allele frequency for SNP l. The problem with this estimate is that it is evaluated with sample allele frequencies p̃l from a set of sampled individuals to which i and j belong. The estimates are affected by the relationships between all pairs of individuals in the sample, and they do not allow for different allele frequencies being relevant for each individual. Estimates that do not depend on allele frequencies are β̂ij = [Σl (M̃ijl − M̃Bl)]/[Σl (1 − M̃Bl)], where M̃ijl = [1 + (1 − xil)(1 − xjl)]/2 and M̃Bl is the average of M̃ijl over all pairs of individuals in the sample. These estimates follow from setting population sample sizes to one in the population structure work, and the use of the same notation is deliberate. The expected value of β̂ij is (θij − θB)/(1 − θB), where θB is the average of all the θij values for pairs of individuals in the sample. Relatedness can be measured only with respect to some reference set of individuals. Better relationship estimates, which distinguish between distant degrees of cousins for example, use inferred haplotypes rather than single SNPs [8].
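The allele-frequency-free estimates β̂ij can be sketched in a few lines of Python. The function name `beta_kinship` and the genotype-matrix layout are hypothetical, and the sketch assumes complete genotype data (no missing calls) and takes "all pairs of individuals" to mean pairs of distinct individuals.

```python
import numpy as np

def beta_kinship(genotypes):
    """Pairwise beta estimates from SNP genotypes (illustrative sketch).

    genotypes: (n individuals) x (L SNPs) array of reference-allele
    counts coded 0, 1 or 2, with no missing values.
    Returns an n x n matrix; off-diagonal entries are the beta_ij estimates.
    """
    x = np.asarray(genotypes, dtype=float)
    n, num_snps = x.shape
    d = 1.0 - x                                   # 1 - x_il

    # Sum over SNPs of M_ijl = [1 + (1 - x_il)(1 - x_jl)] / 2 for every pair.
    m_sum = 0.5 * (num_snps + d @ d.T)

    # Sum over SNPs of M_Bl, the average of M_ijl over pairs of distinct individuals.
    upper = np.triu_indices(n, k=1)
    m_b_sum = m_sum[upper].mean()

    # beta_ij = sum_l (M_ijl - M_Bl) / sum_l (1 - M_Bl).
    return (m_sum - m_b_sum) / (num_snps - m_b_sum)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Simulated genotypes for 6 individuals at 500 SNPs, for illustration only.
    g = rng.integers(0, 3, size=(6, 500))
    print(np.round(beta_kinship(g), 3))
```

By construction the off-diagonal values average to zero over the sample, which reflects the point that relatedness is measured relative to the reference set of individuals. The diagonal entries compare each individual with itself and are not kinship estimates between distinct people.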


3. Discussion


Given the many potential advantages of being able to obtain large-scale SNP or sequence data, it might be appropriate to ask why STRs continue to be the genetic markers in use by forensic scientists. The obvious issue of cost may lessen as sequencing costs, especially on a per-sample basis, decrease. The amount of DNA needed for the new analyses is an issue that may be difficult to overcome. The need to preserve the value of legacy STR databases seems to be addressed by recent work (e.g. [9]) in recovering STR profiles from sequence data. The promise of this approach is suggested by the data posted on the 1000 Genomes web site (www.1000genomes.org), where 79 Y-STR genotypes were recovered from sequence data, although the methods used resulted in a median number of STRs per individual of less than 30. Mixture deconvolution seems to be possible with dense SNP data [10].

4. Conclusion

DNA sequencing is emerging as a disruptive technology for forensic science. The huge amount of information contained in a whole-genome sequence has substantial potential forensic benefit, although this is likely to require big-data types of analyses in which the aggregate data, rather than individual variants, are of importance. Progress is likely to be rapid and substantial.

Acknowledgments

This note contains ideas developed with John Buckleton, ESR New Zealand and Jérôme Goudet, UNIL Switzerland. The work was supported in part by NIJ 2014-DN-BX-K028.


References

1. Weir BS. Matching and partially-matching DNA profiles. J Forensic Sci. 2004; 49:1009–1014. [PubMed: 15461102]
2. Laurie C, Weir BS. Dependency effects in multi-locus match probabilities. Theor Popul Biol. 2003; 63:207–219. [PubMed: 12689792]
3. Danturk KM, Emrre R, Kimoglu K, Baspinar B, Sahin F, Ozen M. Current status of the use of single-nucleotide polymorphisms in forensic practices. Genetic Test Mol Biomark. 2014; 18:455–460.
4. Børsting C, Morling N. Next generation sequencing and its applications in forensic genetics. Forensic Sci Int Genet. 2015; 18:78–89. [PubMed: 25704953]
5. Buckleton JS, Curran J, Goudet J, Thiery A, Weir BS. Population-specific FST values: a worldwide survey. 2015 (submitted for publication).
6. Balding DJ, Nichols RA. DNA profile match probability calculation: how to allow for population stratification, relatedness, database selection and single bands. Forensic Sci Int. 1994; 64:125–140. [PubMed: 8175083]
7. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, et al. Common SNPs explain a large proportion of the heritability for human height. Nat Genet. 2010; 42:565–569. [PubMed: 20562875]
8. Browning SR, Browning BL. Identity by descent between distant relatives: detection and applications. Annu Rev Genet. 2012; 46:617–633. [PubMed: 22994355]
9. Warshauer DH, King JL, Budowle B. STRait Razor v2.0: the improved STR allele identification tool – Razor. Forensic Sci Int Genet. 2015; 14:182–186. [PubMed: 25450790]
10. Voskoboinik L, Ayers SB, LeFebvre AK, Darvasi A. SNP-microarrays can accurately identify the presence of an individual in complex forensic DNA mixtures. Forensic Sci Int Genet. 2015; 16:208–215. [PubMed: 25682311]
