Forensic Science International: Genetics 22 (2016) 128–138


Research paper

Familial searching on DNA mixtures with dropout

K. Slooten a,b,*

a Netherlands Forensic Institute, P.O. Box 24044, 2490 AA The Hague, The Netherlands
b VU University Amsterdam, De Boelelaan 1081, 1081 HV Amsterdam, The Netherlands

ARTICLE INFO

Article history: Received 16 September 2015; received in revised form 11 December 2015; accepted 1 February 2016; available online 22 February 2016

Keywords: Familial searching; DNA mixtures; Dropout; DNA databases; Likelihood ratio distributions

ABSTRACT

Familial searching, the act of searching a database for a relative of an unknown individual whose DNA profile has been obtained, is usually restricted to cases where the DNA profile of that person has been unambiguously determined. Therefore, it is normally applied only with a good quality single source profile as starting point. In this article we investigate the performance of the method if applied to mixtures with and without allelic dropout, when likelihood ratios are computed with a semi-continuous (binary) model. We show that mixtures with dropout do not necessarily perform worse than mixtures without, especially if some separation between the donors is possible due to their different dropout probabilities. The familial searching true and false positive rates of mixed profiles on 15 loci are in some cases better than those of single source profiles on 10 loci. Thus, the information loss due to the fact that the person of interest’s DNA has been mixed with that of others, and is affected by dropout, can be less than the loss of information corresponding to having 5 fewer loci available for a single source trace. Profiles typed on 10 autosomal loci are often involved in familial searching casework since many databases, including the Dutch one, in part consist of such profiles. Therefore, from this point of view, there seems to be no objection to extending familial searching to mixed or degraded profiles.

© 2016 Elsevier Ireland Ltd. All rights reserved.

* Correspondence to: VU University Amsterdam, Netherlands Forensic Institute (NFI), The Netherlands. E-mail address: k.slooten@nfi.minvenj.nl
http://dx.doi.org/10.1016/j.fsigen.2016.02.002
1872-4973/© 2016 Elsevier Ireland Ltd. All rights reserved.

1. Introduction

Familial searching is the technique of retrieving a relative of an unknown trace donor out of a database of known individuals, with the aim of identifying the unknown trace donor. This technique is usually applied when a single source trace, attributed to an unknown offender, has not given any direct matches in the criminal justice database, as a (genetic) last resort to obtain the offender’s identity. Familial searching is obviously much more difficult than direct searching: in most cases it is impossible to extract relatives with certainty if they are present in the database, since there are too many profiles in the database to further inspect all of them and the genetic information stored in the database is insufficient for such a purpose. Several strategies to find as many relatives as possible at minimal cost have been discussed in the literature; in [1] it was shown that a likelihood ratio (LR) threshold is the most efficient strategy from a false rates point of view.

In this paper, we investigate to what extent mixtures are amenable to familial searching. In some crime cases, a single source trace of the offender may not be at hand, while a mixed profile containing DNA of several people including one or more persons of interest may have been obtained. Extending familial searching to such traces may therefore lead to the resolution of cases which are currently not solvable with this technique.

A first expansion of the technique is to consider mixed traces showing the profiles of a known person (say, a victim) and the person of interest (say, the offender), abbreviated PoI. In that case, if the profile is complete (i.e., there is no allelic dropout or drop-in) it was shown in [2,3] that familial searches for the unknown offender are feasible. They showed this by studying the rank that the unknown offender’s relative obtains, when hidden in a database of a certain size.

In practice, there need not be a known individual and the profile may be of a lesser quality, leading to allelic dropout. In a previous publication [4], we studied how well one can discriminate between donors of these mixtures and their relatives. One of the findings was that it is advantageous if the persons other than the person of interest have a non-zero probability of dropout for their alleles. This is very logical, since we may indeed expect that we are less hampered by the presence of other people’s DNA in the mixture if there is less of it. It also means that we can expect that mixtures with dropout are no less tractable for a familial search than the mixtures without dropout studied in [3].

K. Slooten / Forensic Science International: Genetics 22 (2016) 128–138

In this article we investigate to what extent this is true. Since one of our aims is to study the feasibility of such searches in the Dutch DNA database, we focus on sets of loci that are contained in that database: the SGMPlus and the NGM loci. One may wonder when a familial search can be considered feasible; one benchmark would be results obtained with single source traces on the SGMPlus loci, since these searches are carried out in practice (and we are unaware of familial searches based on fewer loci carried out routinely). Therefore, we will compare the results obtained with mixtures to those obtained with single source traces on the SGMPlus loci, and on the NGM loci as well since these form the (growing) majority of traces in the Dutch database. Obviously the fact that we no longer exactly know the profile of the PoI reduces the effectiveness of a familial search, but a mixed profile compared on the NGM loci may still be better than an unmixed profile compared on the SGMPlus loci.

It is of course impossible to make a comprehensive study of all types of mixtures. We have analyzed a few examples, and considered searches for relatives of the major as well as for the minor donor, both with and without the knowledge of the other donor(s). The computer script MixKin with which these simulations have been carried out is available from the author, so that the interested reader may perform his or her own simulations involving other mixtures or types of relatives than those considered here. It is written in the programming language Mathematica and needs a licence of that software to run (see www.wolfram.com); some experience with this language is most helpful.

2. Method

Throughout we abbreviate likelihood ratio as LR. To calculate LR’s, we use a model sometimes referred to as a semi-continuous model. This model evaluates as evidence the set of recorded alleles, but allows for different dropout rates per contributor.
Hence, by letting these dropout rates depend on the peak heights of the profile, this model is able to take peak height information into account to some extent. The model itself is fairly simple and is described in [5, Online Supplement]. We review it here for convenience and to establish some notation that we will need later on.

2.1. Likelihood computations

Suppose that a mixture has $n$ contributors. Let their dropout rates be $d_1, \ldots, d_n$ for a heterozygous allele and $D_1, \ldots, D_n$ for a homozygous allele. That is, $d_i$ is the probability that an allele of a heterozygous pair is not detected for donor $i$, if donor $i$ were the only contributor with this allele and drop-in were impossible. Analogously, $D_i$ is the dropout probability for homozygous alleles. Obviously, one has $D_i \le d_i^2$, but in all our calculations we will simply assume $D_i = d_i^2$. While this is partly an opportunistic assumption (it facilitates the computations), it is not without support. Typically the difference between $D_i$ and $d_i^2$ is mostly relevant for high dropout probabilities, but in that case we are dealing with low template profiles for which the assumption is not unreasonable. We refer to [6] for a discussion on this assumption for models such as the one in this paper. Furthermore, [7,8] describe a model for dropout and its fit to experimental data. Finally, we need a drop-in parameter $c$ that allows for alleles that are not present in any of the contributors to be detected in the mixture profile.

For simplicity, we work on only one locus here: we suppose that all considered loci are independent, so that likelihood ratios are obtained as a product over loci, given the genotypes of the donors, the $d_i$ and $c$. Since different replicates potentially lead to different mixture profiles, we regard the observed mixture as a random variable $\mathcal{M}$. Still supposing the genotypes of the donors are known and writing $\vec{g} = (g_1, \ldots, g_n)$, where $g_i$ is the genotype of donor $i$ on the considered locus, the probability that allele $a$ is detected in the mixture is given by

$$P_{\vec{d},c}(a \in \mathcal{M} \mid \vec{g}) = 1 - (1 - c\,p_a) \prod_{i=1}^{n} d_i^{\,n_{i,a}}, \tag{2.1}$$

where $p_a$ denotes the population frequency of allele $a$ and $n_{i,a} \in \{0, 1, 2\}$ is the number of alleles $a$ present in $g_i$, the genotype of contributor $i$ (by definition, $0^0 = 1$). Note that, when $c = 0$, we see from this formula that an allele is recorded unless it drops out for all the contributors that have that allele. To compute the probability that the set of alleles observed from the mixture $\mathcal{M}$ is equal to $M$, one simply uses (2.1) to obtain

$$P_{\vec{d},c}(\mathcal{M} = M \mid \vec{g}) = \prod_{x \in M} P_{\vec{d},c}(x \in \mathcal{M} \mid \vec{g}) \prod_{x \notin M} P_{\vec{d},c}(x \notin \mathcal{M} \mid \vec{g}). \tag{2.2}$$
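As an illustration, Eqs. (2.1) and (2.2) can be sketched in Python for a single locus and a single replicate with known donor genotypes. The function names, the allele labels and the frequencies below are hypothetical and chosen for this example only; this is a minimal sketch of the two formulas and not the MixKin implementation.

```python
from math import prod

def detect_prob(allele, genotypes, d, c, freqs):
    """Eq. (2.1): probability that `allele` is detected in the mixture."""
    # g.count(allele) is n_{i,a}, the number of copies carried by donor i.
    # Python evaluates 0.0 ** 0 as 1.0, matching the convention 0^0 = 1.
    all_drop = prod(d_i ** g.count(allele) for d_i, g in zip(d, genotypes))
    return 1.0 - (1.0 - c * freqs[allele]) * all_drop

def mixture_likelihood(observed, genotypes, d, c, freqs):
    """Eq. (2.2): probability that exactly the allele set `observed` is recorded."""
    lik = 1.0
    for allele in freqs:
        p = detect_prob(allele, genotypes, d, c, freqs)
        lik *= p if allele in observed else 1.0 - p
    return lik

# Hypothetical locus with four alleles (illustrative frequencies only).
freqs = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4}
genotypes = [("A", "B"), ("B", "B")]  # donor 1 heterozygous, donor 2 homozygous
d = [0.1, 0.5]                        # dropout probabilities d_1, d_2

# Likelihood of observing alleles A and B only, with drop-in parameter c = 0.05.
lik_ab = mixture_likelihood({"A", "B"}, genotypes, d, 0.05, freqs)
```

Summing `mixture_likelihood` over all subsets of the allelic ladder gives one, as it should for a probability distribution on the possible observed allele sets.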

We also note that contaminant alleles may coincide with alleles of donors, and thus do not necessarily lead to an allele in the mixture that is not present in any of the contributors’ genotypes. In particular they may undo the effect of allelic dropout.

The probability to observe the mixture $M$ when some of the donors have unknown genotypes is obtained by summing (2.2) over the set of possible genotypes for these donors, weighted by their prior probability to be the donor’s genotypes. In standard mixture calculations one would assume that the unknown donors are unrelated to the known ones, and to each other, in which case one could assign the relevant population frequency as probability for an unknown donor to have a certain genotype. However, it is of course also possible to assume relatedness, and this is what is done for familial searching. In that case, we are going to investigate whether an unknown donor of the mixture is a certain relative of a known person.

If the mixture has $n$ donors of which $m$ are known, we will always number the contributors such that the known contributors are the first ones, and the person of interest is the first of the unknown contributors. We need to calculate the probability to observe a mixture, given that it is a mixture of $n$ donors with dropout probabilities $d_1, \ldots, d_n$ and drop-in parameter $c$, that the first $m$ donors are the known individuals $K_1, \ldots, K_m$, and that donor $m + 1$ (the first of the remaining, unknown, donors) is related to some person $R$ with known genotype, according to IBD-coefficients $k = (k_0, k_1, k_2)$, whereas the remaining unknown donors are not related to any known individual or each other. Usually the parameters $n$, $\vec{d} = (d_1, \ldots, d_n)$, $c$ and the known individuals will be clear from the context. Omitting these from the notation, we denote the just described likelihood by

$$L_{k,R}(M). \tag{2.3}$$
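The genotype prior that enters this sum for the related donor can be sketched as follows. The sketch assumes Hardy-Weinberg proportions and the standard decomposition over the number of alleles shared identical by descent with R; the function names and frequencies are hypothetical, and the text above does not spell out this formula, so it should be read as one plausible way to implement the weighting.

```python
from itertools import combinations_with_replacement

def hardy_weinberg(g, freqs):
    """Population frequency of the unordered genotype g = (a, b)."""
    a, b = g
    return freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]

def one_ibd(g, r, freqs):
    """P(genotype g | exactly one allele is identical by descent with r)."""
    a, b = g
    total = 0.0
    for shared in r:  # each allele of R is the shared one with probability 1/2
        if a == b:
            total += 0.5 * (freqs[a] if shared == a else 0.0)
        else:
            total += 0.5 * ((freqs[b] if shared == a else 0.0)
                            + (freqs[a] if shared == b else 0.0))
    return total

def relative_prior(g, r, k, freqs):
    """Prior for g given a relative with genotype r and IBD-coefficients k."""
    k0, k1, k2 = k
    same = 1.0 if sorted(g) == sorted(r) else 0.0
    return k0 * hardy_weinberg(g, freqs) + k1 * one_ibd(g, r, freqs) + k2 * same

# Hypothetical three-allele locus; R has genotype (A, B); sibling coefficients.
freqs = {"A": 0.1, "B": 0.2, "C": 0.7}
all_genotypes = list(combinations_with_replacement(sorted(freqs), 2))
sibling_prior = {g: relative_prior(g, ("A", "B"), (0.25, 0.5, 0.25), freqs)
                 for g in all_genotypes}
```

With k = (1, 0, 0) the prior reduces to the population frequency, which corresponds to the unrelated case.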

Then the likelihood ratios that we use for familial searching are of the form

$$LR_{k,R}(M) = \frac{L_{k,R}(M)}{L(M)}, \tag{2.4}$$

where $L(M) = L_{(1,0,0),R}(M)$ is the likelihood of the mixture if $R$ is unrelated to donor $m + 1$. Indeed in our model, where we consider genotypes of unrelated persons as independent, the likelihood $L_{(1,0,0),R}(M)$ does not depend on $R$ anymore. For example, if $k = (0.25, 0.5, 0.25)$ then $LR_{k,R}(M)$ is the likelihood ratio for hypotheses that informally can be summarized as

$H_1$: person of known genotype $R$ is a sibling of the first unknown contributor of the mixture $M$.
$H_2$: person of known genotype $R$ is unrelated to all of the unknown contributors,

where all other relevant parameters are the same for both hypotheses.

Setting $n = 1$, $d_1 = c = m = 0$ brings us back to the classical familial search based on a single source trace and a complete profile, whereas the case $n = 2$, $d_1 = d_2 = c = 0$ and $m = 1$ corresponds to a complete two-person mixture, of which one contributor is known. In that case the algorithm here coincides with the one described in [3] and we therefore expect similar results. Since we do not use exactly the same allele frequencies, some differences can be expected. We present a comparison with the results of [3] below.

Note that not all possible mixtures are covered by these hypotheses. For example, in case the mixture contains DNA of several individuals that are unknown but related to each other (for example, a known victim and two unknowns who are brothers), then these likelihoods do not perfectly apply. On the other hand, it can probably be expected that a relative is easier to identify in a database if there are multiple donors in the mixture that he or she is related to, than if there is only one.

2.2. Likelihood ratio distributions

If $c = 0$ the likelihood of a mixture is a function that depends only on the allele frequencies of the alleles that have been observed in the mixture. Informally, this is because if a hypothesized contributor has an allele that has not been observed in the mixture, then it does not matter which allele that was for the mixture’s likelihood. More formally, we note that for $c = 0$ we get from (2.1)

$$P_{\vec{d},c}(a \notin \mathcal{M} \mid \vec{g}) = \prod_{i=1}^{n} d_i^{\,n_{i,a}}, \tag{2.5}$$

so in (2.2) the factor corresponding to the unobserved alleles is equal to

$$\prod_{x \notin M} P_{\vec{d},c}(x \notin \mathcal{M} \mid \vec{g}) = \prod_{x \notin M} \prod_{i=1}^{n} d_i^{\,n_{i,x}} = \prod_{i=1}^{n} d_i^{\,n_{i,a_0}},$$

where $a_0$ is the virtual allele with frequency $1 - \sum_{x \in M} p_x$, obtained from identifying all the $x \notin M$ with each other. The equality $D_i = d_i^2$ is crucial here, since then the probability for both alleles of the same contributor to drop out is the same whether or not they are equal. Therefore, the calculation can, at least for $c = 0$, assume that the alleles on the ladder are the observed alleles and one auxiliary allele $a_0$ whose frequency is one minus the sum of the frequencies of the observed alleles in the mixture.

If $c > 0$ then reducing the ladder makes a difference. Indeed, let $a_1, \ldots, a_k$ be the alleles that have not been observed in a mixture and let their frequencies be $p_1, \ldots, p_k$. Then the mixture likelihood will contain a term, corresponding to the fact that these alleles have not been observed, equal to $\prod_{i=1}^{k} (1 - c\,p_i)$ when these alleles are all taken into account individually, as opposed to $1 - c \sum_{i=1}^{k} p_i$ when an auxiliary allele with frequency $p_1 + \cdots + p_k$ is used. For small $c$ these terms are approximately equal. We therefore always carry out calculations that redefine the allelic ladder by only selecting the alleles visible in the mixture, and then adding one auxiliary allele as the replacement of all the other, unobserved ones.

When we do so, there remain $(N + 1)(N + 2)/2$ different relevant genotypes for $R$ on a locus where $N$ alleles have been recorded. For each such genotype, we can compute the likelihood ratio (2.4) assuming $R$ has this genotype. If the second hypothesis is true (which states that $R$ is unrelated to all unknown contributors) then we know in addition the probability distribution on the genotypes of $R$: it is the population frequency of this genotype, conditional on the genotypes of the known contributors. We suppose genotypes of unrelated individuals to be independent (i.e., we do not use a $\theta$-correction) so this amounts to the population frequency itself for our purposes.
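The ladder reduction can be illustrated with a short Python sketch. The allele labels, the frequencies and the helper names are hypothetical; the sketch only shows how the reduced ladder yields (N + 1)(N + 2)/2 genotypes and how close the two drop-in terms are for a small value of c.

```python
from itertools import combinations_with_replacement
from math import prod

def reduced_ladder(observed, freqs, rest="a0"):
    """Keep the observed alleles and pool all other alleles into one auxiliary allele."""
    ladder = {a: freqs[a] for a in observed}
    ladder[rest] = 1.0 - sum(ladder.values())
    return ladder

def genotypes(ladder):
    """All unordered genotypes on the ladder."""
    return list(combinations_with_replacement(sorted(ladder), 2))

def dropin_terms(unobserved_freqs, c):
    """Non-detection term with individual alleles versus with the pooled allele."""
    exact = prod(1.0 - c * p for p in unobserved_freqs)
    pooled = 1.0 - c * sum(unobserved_freqs)
    return exact, pooled

# Hypothetical locus with seven alleles, three of which are observed (N = 3).
freqs = {"11": 0.05, "12": 0.2, "13": 0.1, "14": 0.3,
         "15": 0.1, "16": 0.15, "17": 0.1}
observed = ["12", "14", "15"]

ladder = reduced_ladder(observed, freqs)
n_genotypes = len(genotypes(ladder))  # (N + 1)(N + 2) / 2 with N = 3

unobserved = [p for a, p in freqs.items() if a not in observed]
exact, pooled = dropin_terms(unobserved, c=0.05)
```

For these illustrative numbers the two terms differ by less than one part in a thousand, in line with the remark that they are approximately equal for small c.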
In any case, we obtain the likelihood ratio distribution for LR’s obtained when the second hypothesis is true. This immediately also gives us the likelihood ratio distribution

when the first hypothesis is true, since for the likelihood ratios (2.4) we have (cf. [9])

$$P(LR = x \mid H_1) = x \, P(LR = x \mid H_2). \tag{2.6}$$
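Relation (2.6) translates directly into a reweighting of a discrete LR distribution. In the sketch below, the H2 distribution is a toy example worked out by hand for this illustration (the single-locus parent/child index for a heterozygous single source trace whose two alleles have frequencies 0.1 and 0.2); the function name is hypothetical.

```python
def tilt_to_h1(h2_dist, tol=1e-9):
    """Apply Eq. (2.6): P(LR = x | H1) = x * P(LR = x | H2)."""
    h1_dist = {lr: lr * p for lr, p in h2_dist.items()}
    # The tilted weights sum to E[LR | H2], which equals 1 for a valid LR.
    total = sum(h1_dist.values())
    if abs(total - 1.0) > tol:
        raise ValueError("E[LR | H2] should equal 1 for a likelihood ratio")
    return h1_dist

# Toy H2 distribution: LR value -> probability for an unrelated individual.
h2_dist = {0.0: 0.49, 1.25: 0.28, 2.5: 0.18, 3.75: 0.04, 5.0: 0.01}
h1_dist = tilt_to_h1(h2_dist)
```

The tilted distribution puts no mass on LR = 0: in this toy example a true parent or child cannot obtain a likelihood ratio of zero, since mutations are not modelled.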

We use these distributions of the likelihood ratio for several applications:

2.2.1. Database searching

If we want to simulate a familial search against a database of N persons all of whom are unrelated to the mixture donors, what we really need are not these individuals, but the likelihood ratios obtained by them. By calculating the distribution that these LR’s follow, we can directly generate a random sample of N likelihood ratios from the distribution under H2. This is computationally more efficient than sampling DNA profiles and calculating likelihood ratios, since LR’s need only be calculated once. Likewise, if one wishes to instead consider likelihood ratios that would be obtained by actual relatives, then the distribution of the LR under H1 (i.e., assuming a relative to be a trace donor) is obtained by first calculating the LR distribution under H2 and then using (2.6), after which a random sample is taken from this distribution.

We can describe this procedure informally as follows: the mixture evaluation model (i.e., the chosen number of contributors, their probabilities of dropout and the drop-in parameter) induces a probabilistic deconvolution of the mixture (as also described in [10]), hence it also gives us probability distributions on the genotypes of relatives of the donors. Thus, we can predict the likelihood ratios that a familial search on the mixture will yield, both for unrelated individuals and for truly related individuals, from the mixture data and the chosen mixture evaluation model.

2.2.2. Exceedance probabilities

By taking a random sample from H1, we can estimate how many likelihood ratios are at least equal to x, for some x of our choice.
This gives us the probabilities $P(LR \ge x \mid H_1)$ and $P(LR < x \mid H_1)$ that a LR generated by H1 does, or does not, reach threshold $x$. Similarly, we can estimate the probability $P(LR \ge x \mid H_2)$ for threshold $x$. For small $x$ it may be sufficient to directly sample from H2, but for larger $x$ the probability $P(LR \ge x \mid H_2)$ becomes so small that very large samples are needed. In that case, one can use dedicated algorithms for efficient computation of these probabilities (cf. [11,12]) or one may use the technique of importance sampling (cf. [10,12]) and estimate $P(LR \ge x \mid H_2)$ by

$$\frac{1}{k} \sum_{i=1}^{k} \frac{1}{r_i} \, 1_{r_i \ge x},$$

where $r_1, \ldots, r_k$ is a sample of likelihood ratios obtained from the probability distribution under H1, and $1_{r_i \ge x}$ is the indicator function taking the value one if $r_i \ge x$ and zero otherwise. We can use this technique to estimate the number of false leads to be expected in a familial database search of N unrelated individuals: if everyone in the database whose LR exceeds $x$ is selected, the number of false leads expected in a database of size N without relatives is $N \cdot P(LR \ge x \mid H_2)$. Note that this means that both for estimating the probability that a relative is selected, $P(LR \ge x \mid H_1)$, and the probability that an unrelated individual is selected, $P(LR \ge x \mid H_2)$, the same sample of LR’s obtained from H1 can be used.

2.3. Terminology

In the sections that follow, we will compare results obtained from familial searches with mixtures to those obtained from single source traces. The efficiency of a familial search can be summarized in various ways, e.g., by considering the probability that a relative is among the top-k in a database of a certain size, or by considering the probabilities $P(LR \ge x \mid H_i)$ associated with various likelihood ratio thresholds $x$. Since using a fixed LR-threshold is the most efficient strategy for familial searching (cf. [1,13]) we will mostly use this approach, resorting to top-k probabilities when this is needed for comparison with other results in the literature.

Suppose that we use a LR-threshold $x$; this means that we extract everyone from the database whose LR is at least equal to $x$. Then $P(LR \ge x \mid H_2)$ is the probability that an individual who is in reality unrelated is selected from the database. In a large database composed of unrelated individuals only, this is the proportion of the database that we expect to extract. Therefore, we call $P(LR \ge x \mid H_2)$ the false positive rate FPR(x) for threshold $x$. Similarly, $P(LR \ge x \mid H_1)$ is the probability that an individual who is in reality related is selected from the database. This is the proportion of the related individuals that we expect to extract. Therefore we call $P(LR \ge x \mid H_1)$ the true positive rate TPR(x) for threshold $x$. We could have also used the phrase probability of detection instead of TPR (as is done, for example, in [9]) but the TPR and FPR terminology is widely adopted elsewhere (e.g. [12–14]).

We note however that $E[LR \mid H_2] = 1$, so that with neutral evidence, LR = 1, as threshold, the corresponding $P(LR > 1 \mid H_2)$ will inevitably be a positive probability (barring pathological cases, such as an empty mixture profile). It would be better to speak of probabilities rather than rates, and also at least an equally good choice to talk about the probability of misleading evidence of strength $x$, instead of the false positive probability FPR(x), but we will follow the conventional terminology.
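Both rates can be estimated from a single sample of likelihood ratios drawn under H1, as described in Section 2.2.2. The following Python sketch uses a hypothetical discrete H1 distribution and hypothetical function names; it is meant to show the estimators, not to reproduce any result of this article.

```python
import random

def rates_from_h1_sample(sample, x):
    """Estimate TPR(x) and FPR(x) from likelihood ratios sampled under H1."""
    k = len(sample)
    hits = [r for r in sample if r >= x and r > 0]
    tpr = len(hits) / k                    # fraction of relatives with LR >= x
    fpr = sum(1.0 / r for r in hits) / k   # importance sampling estimate under H2
    return tpr, fpr

# Hypothetical H1 distribution: LR values and their probabilities for a relative.
h1_values = [5.0, 3.75, 2.5, 1.25]
h1_probs = [0.05, 0.15, 0.45, 0.35]

rng = random.Random(1)
sample = rng.choices(h1_values, weights=h1_probs, k=200_000)
tpr, fpr = rates_from_h1_sample(sample, x=2.5)

# Exact values for this toy distribution, for comparison with the estimates:
# TPR(2.5) = 0.05 + 0.15 + 0.45 = 0.65
# FPR(2.5) = 0.05/5 + 0.15/3.75 + 0.45/2.5 = 0.23
expected_false_leads = 30_000 * fpr  # database of 30,000 unrelated individuals
```

The advantage of the importance sampling estimate shows for large thresholds, where direct sampling under H2 would rarely produce a likelihood ratio above x.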
Similarly the false negatives are not false as in reflecting an imperfection in the classification method using the LR-threshold, but they correspond to profiles that by genetic chance do not look sufficiently strongly like what they really are: related.

3. Benchmark results

We have applied our likelihood ratio calculations to two- and three-person mixtures, with varying dropout probabilities for the contributors and with or without known contributors. In the main body of this article, we restrict familial searches with mixtures to comparisons on the NGM loci, and consider familial searches on single source traces on the NGM as well as on the SGMPlus loci. In the Appendix we have placed the results for mixtures evaluated on the SGMPlus loci.

3.1. Complete single source profile

We first briefly review the classical situation in which familial searching is carried out with a full DNA profile at our disposal.


Clearly, a familial search based on a mixture cannot have better performance than we have in this situation, so these results can be used to see how much efficiency we lose when the DNA of the person of interest is mixed with that of other persons (known, or unknown) and possibly susceptible to allelic dropout. Both for familial searches for siblings and for parents and children, we have taken a sample of 100,000 likelihood ratios (for a single source trace, these are simply sibling indices SI, resp. paternity indices PI) to obtain the true positive rates corresponding to LR-thresholds, as well as the false positive rates by viewing these LR’s as an importance sample. Since the DNA database of the NFI contains SGMPlus profiles and NGM profiles, we have carried out the simulations for both.

We plot the results in the form of ROC curves in Fig. 3.1. In such a curve we plot the graph {(Log10(FPR(x)), TPR(x))}, for a range of LR-thresholds x. For thresholds x such that 2 Log10(x) is an integer, we have plotted the corresponding points on the ROC curve, labelled by Log10(x). For example, we see in Fig. 3.1a that, with a Log10(SI)-threshold equal to 3, we have a false positive rate of approximately $10^{-4}$ on the SGMPlus loci, meaning that the probability that an unrelated pair of individuals has a SI of at least $10^{3}$ on these loci is estimated to be $10^{-4}$ from our sample data. Similarly, for the same SI-threshold the true positive rate is about 50%, meaning that the probability that a pair of siblings has a SI of at least $10^{3}$ on these loci is about 0.5. When we consider the same threshold on the NGM loci, we see that the FPR is about the same (equal to $10^{-4.2}$) but the TPR is much higher, namely 0.76.

We note that, for LR-thresholds that are small enough (Log10(PI) < 3 for NGM or 1.5 for SGMPlus), the corresponding points in the ROC graph for PI almost coincide with each other. We have omitted the label of these points, since they overlap if we attempt to plot them. The fact that these points are so close to each other is due to the fact that (barring mutations, which we have not modelled) the PI is virtually always at least of this magnitude if it is non-zero. In other words, small non-zero PI almost do not occur, and applying a threshold anywhere in this range amounts to the same performance.

Fig. 3.1. Average ROC curves for familial searches on complete single source traces on the SGMPlus (dashed) and the NGM loci (full). Dots are labelled by Log10 of their corresponding SI- or PI-threshold.

3.2. Two-person mixtures without dropout

We now turn to two-person mixtures without dropout or drop-in. The results for these may be used to compare with the results for mixtures with dropout, and also to compare our results to those of [3].

3.2.1. Ranking

In [3], the probability of finding a relative in the top-k in a database of 30,000 Identifiler profiles was obtained by simulation, provided that the profile of one of the contributors is known (assumed to come from a victim, say) and that we are interested in finding a relative of the remaining unknown donor of the mixture. We repeat and extend the experiment of [3] here, and report the results in Table 1. We distinguish between the just described situation V + U, where the donors are (known) victim V and unknown U; and U + U, meaning that both donors are unknown. Although the allele frequencies here and in [3] are different, we see from Table 1 that the results correspond fairly well. From Table 2 we see that if we do the same searches on the NGM loci, we obtain increased probabilities that actual relatives are ranked highly, compared to searches with the Identifiler loci.

Table 1
Probabilities that a familial search identifies a relative in the top-k in a database of 30,000 Identifiler profiles.

        Identification probability in top-k profiles
        SI                              PI
k       V+U     V+U [3]   U+U           V+U     V+U [3]   U+U
1       0.476   0.429     0.154         0.568   0.496     0.208
5       0.649   0.646     0.298         0.811   0.816     0.382
10      0.718   0.748     0.367         0.881   0.909     0.466
20      0.772   0.814     0.446         0.941   0.957     0.588
50      0.837   0.890     0.573         0.988   0.982     0.757
100     0.873   0.930     0.675         0.995   0.999     0.842

Table 2
Probabilities that a familial search identifies a relative in the top-k in a database of 30,000 NGM profiles.

        Identification probability in top-k profiles
        SI                  PI
k       V+U     U+U         V+U     U+U
1       0.566   0.263       0.735   0.278
5       0.722   0.424       0.921   0.515
10      0.771   0.514       0.966   0.610
20      0.823   0.583       0.986   0.728
50      0.883   0.665       0.998   0.873
100     0.910   0.743       1.000   0.941

3.2.2. False rates

In Fig. 3.2 we plot the ROC curves for two-person mixtures on the NGM loci where no donors are known, or one donor is known, analogous to Fig. 3.1.

We see from Fig. 3.2 that both for SI and PI applied to NGM profiles, the amenability of two-person mixtures where one donor is known lies between that for single source complete profiles on the ten SGMPlus loci and the fifteen NGM loci. That is, the loss of information that we have due to the unknown person of interest being mixed with a known individual is less than the loss of information that we have if only the SGMPlus instead of the NGM loci can be compared. The latter is done in practice, so there seems to be, from this point of view, no objection to carrying out familial searches which have as starting point a two-person mixture where one contributor is known.

If both of the donors are unknown, then we see from Fig. 3.2 that both the TPR and the FPR are worse than they are for SGMPlus complete profiles. However, even such a search may still be a good option. For example, if there are no known donors then (cf. Fig. 3.2b) a familial search for parents and children carried out with a PI-threshold equal to 1000 corresponds to a probability of about 50% to detect real parents and children, at the cost of a false positive rate of about one in ten thousand. For siblings (cf. Fig. 3.2a), a SI-threshold of 100 corresponds to a probability of about 60% to detect real siblings, at the cost of a false positive rate of about one in a thousand. Whether or not such a search is worthwhile can be decided based on the information from Fig. 3.2, the database size, the a priori probabilities that the database contains relatives, the possibilities for additional genetic testing on potential relatives, etc.

4. Two-person mixtures with dropout

Now we turn to mixtures with dropout. We restrict ourselves to two choices for the dropout probabilities: first, the case where both donors have the same probability of dropout, which we set at 0.3, and second, a situation where they have unequal probabilities of dropout (0.1, 0.5). We are going to assume for all cases that we have three replicate analyses of the mixture, and we study familial searches for siblings and parents/children of each donor, where we may or may not know the other donor.
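To make the role of the three replicates concrete, the following Python sketch draws replicate allele sets for one locus by applying the detection probability (2.1) independently to each allele on the ladder. The genotypes, frequencies and function names are hypothetical and the sketch is not the simulation code used for the results in this article. It also gives the probability, ignoring drop-in, that an allele carried once by a single donor is seen in at least one replicate.

```python
import random

def simulate_replicate(genotypes, d, c, freqs, rng):
    """Draw one replicate: each allele is detected with the probability in Eq. (2.1)."""
    observed = set()
    for allele, p_a in freqs.items():
        all_drop = 1.0
        for d_i, g in zip(d, genotypes):
            all_drop *= d_i ** g.count(allele)  # 0.0 ** 0 is 1.0 in Python
        if rng.random() < 1.0 - (1.0 - c * p_a) * all_drop:
            observed.add(allele)
    return observed

def simulate_replicates(genotypes, d, c, freqs, rng, n_rep=3):
    """Draw n_rep independent replicates of the same mixture."""
    return [simulate_replicate(genotypes, d, c, freqs, rng) for _ in range(n_rep)]

def seen_at_least_once(d_i, n_rep=3):
    """P(allele of a single heterozygous donor is seen in >= 1 replicate), no drop-in."""
    return 1.0 - d_i ** n_rep

# Hypothetical locus and two donors with dropout probabilities (0.1, 0.5).
freqs = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4}
genotypes = [("A", "B"), ("C", "D")]
rng = random.Random(0)
replicates = simulate_replicates(genotypes, [0.1, 0.5], 0.05, freqs, rng)

p_equal = seen_at_least_once(0.3)  # 1 - 0.3**3 = 0.973
p_major = seen_at_least_once(0.1)  # 1 - 0.1**3 = 0.999
p_minor = seen_at_least_once(0.5)  # 1 - 0.5**3 = 0.875
```

With unequal dropout probabilities the alleles of the major donor tend to recur in all replicates while those of the minor donor do not, and this difference in reproducibility is the information the likelihood calculations can exploit.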
For all cases, we have simulated 100 mixtures with the specified dropout probabilities, where we fixed the drop-in parameter c = 0.05. For each mixture, we have simulated 10,000 likelihood ratios obtained by siblings and 10,000 obtained by parents/children from the distribution of likelihood ratios obtained by actual relatives, as described in Section 2.2.1. This gives us a collection of a million likelihood ratios, which we use to estimate true positive rates from, and also false negative rates (by importance sampling, as explained in Section 2.2.2). 4.1. Two-person mixtures with dropout (0.3, 0.3) The obtained TPR and FPR for siblings are presented in Fig. 4.1a and those for parents and children in Fig. 4.1a. Notice that these are very similar to those for mixtures without dropout, as one can see

Fig. 3.2. Average ROC curves for familial searches on two-person mixtures without dropout, on the NGM loci. Dots are labelled by Log10 of their corresponding SI- or PIthreshold. Dotted lines are for reference and correspond to the single source traces in Fig. 3.1 (open circles: NGM; open squares: SGMPlus). The full lines correspond to mixtures where one donor is known (left graph, marked with filled circles) and to mixtures where both donors are unknown (right graph, marked with filled squares).

[(Fig._41)TD$IG]

K. Slooten / Forensic Science International: Genetics 22 (2016) 128–138

133

Fig. 4.1. Average ROC curves for familial searches on two-person mixtures with dropout (0.3, 0.3), on the NGM loci. Dots are labelled by Log10 of their corresponding SI- or PIthreshold. Dotted lines correspond to the single source traces in Fig. 3.1 (open circles: NGM; open squares: SGMPlus). The full lines correspond to mixtures where one donor is known (left graph, marked with filled circles) and to mixtures where both donors are unknown (right graph, marked with filled squares)

from a comparison between Figs. 3.2a and 4.1a and between Figs. 3.2b and 4.1b. Indeed, since we have three replicates, most of the alleles of the donors will have been observed at least once, given their dropout probability of 0.3. Since most alleles will tend to have been recorded, the probability of drop-in is small (c = 0.05) and, as in the case without dropout, the reproducibility of the alleles is identical for both donors and cannot assist in the deconvolution of the profile, this is a logical finding. 4.2. Two-person mixtures with dropout (0.1, 0.5) Now we turn to mixtures where the donors have, on average, the same dropout probability as in the previous example but they are now different from each other: we have a major donor with a probability of dropout equal to d1 = 0.1 and a minor donor for whom this probability is d2 = 0.5. We still consider three replicates and maintain the drop-in parameter at the same value c = 0.05 as for the previous mixtures. Since the donors are no longer equally represented, it is easier to deconvolute the mixture. Certainly familial searches for relatives of the major donor can be expected to be easier than for the previously considered mixtures, and we will see to which extent relatives of the minor donor can be found as well. 4.2.1. Relatives of the major donor First we consider that the major donor is our person of interest. We plot the obtained true and false positive rates for SI and PI in Fig. 4.2 below. We notice that these mixtures perform much better


than the previously considered mixtures. Indeed, since the two donors in this case have quite different dropout probabilities, and we are using this information in our LR calculations, the different reproducibility of the alleles in the mixtures can be used, as was also described in [4]. Even when neither donor is known, the performance of a familial search is better than it is for single source SGMPlus profiles, which was not the case for mixtures without dropout (cf. Fig. 3.2), nor for mixtures where both donors had the same dropout probability equal to 0.3 (cf. Fig. 4.1). If the minor donor happens to be known, then the performance of a familial search for the major donor comes close to what we have for a single source trace. For example, from Fig. 4.2b we see that setting the PI-threshold at 10^4 leads to an FPR of around 10^-5 in all cases, but corresponds to a TPR of 0.32 for single source SGMPlus traces, 0.56 for mixtures where both donors are unknown, 0.84 if the minor donor is known, and 0.93 for single source NGM traces.

This observation supports the ones made in [4] and is a very logical one: if the profile of our person of interest is mixed with that of other persons, then it is to our advantage if the other persons are less represented than if they are fully represented, and it is also to our advantage if the dropout probabilities are different, since then replicate analyses will show alleles with different reproducibility, allowing for a better probabilistic deconvolution of the mixture than if all donors are exchangeable.

4.2.2. Relatives of the minor donor

We now turn to a familial search for the minor donor. We plot the obtained true and false positive rates for SI and PI in Fig. 4.3

Fig. 4.2. Average ROC curves for familial searches on two-person mixtures with dropout (0.1, 0.5), on the NGM loci, with the PoI being the major donor. Dots are labelled by Log10 of their corresponding SI- or PI-threshold. Dotted lines correspond to the single source traces in Fig. 3.1 (open circles: NGM; open squares: SGMPlus). The full lines correspond to mixtures where the minor with dropout 0.5 is known (left graph, marked with filled circles) and to mixtures where both donors are unknown (right graph, marked with filled squares).


Fig. 4.3. Average ROC curves for familial searches on two-person mixtures with dropout (0.1, 0.5), on the NGM loci, with the PoI being the minor donor. Dots are labelled by Log10 of their corresponding SI- or PI-threshold. Dotted lines correspond to the single source traces in Fig. 3.1 (open circles: NGM; open squares: SGMPlus). The full lines correspond to mixtures where the major with dropout 0.1 is known (left graph, marked with filled circles) and to mixtures where both donors are unknown (right graph, marked with filled squares).

below. Of course, such a search is harder than the one for the major donor. However, a search is not hopeless at all: the ROC curve when the major donor is known coincides almost exactly with that of single source profiles compared on the SGMPlus loci for all SI-thresholds, and also for larger PI-thresholds, at least equal to about 10^4. For smaller PI-thresholds the difference is much greater than it is for SI. Indeed, if dropout and drop-in are possible, it is impossible to obtain PI = 0, unlike for single source traces. If neither donor is known, then the TPR obviously decreases for the same threshold, but the ROC curve does not differ much from the one for mixtures where the dropout probabilities are 0.3 for both donors (cf. Fig. 4.1).

We can think of this as follows. The effect of having a higher probability of dropout for the required donor, and a smaller one for the other donor, is on the one hand that we see fewer alleles of our PoI, but on the other hand that we have a better deconvolution of the mixture than if the dropout probabilities are equal. With equal dropout probabilities, there is no way to associate observed alleles to one or the other donor, whereas in the case of different dropout probabilities the alleles that are less consistently seen across the replicates are more likely to belong to the minor donor. Thus, we will have fewer alleles to work with, but a better separation between the donors.

5. Discussion and conclusion

Familial searches are, as far as we know, typically restricted to traces with a single unknown contributor, from which a complete profile has been determined. There can be various reasons to restrict searches to this situation, including the required belief that the trace’s DNA is from the offender that we are looking for.
However, from the casework encountered since the introduction of familial searching in the Netherlands it has become clear that there are also situations in which the unknown offender’s profile is only available as a partial profile or as one of the profiles in a mixed profile. The other donors may or may not be known, and the traces may or may not be affected by allelic dropout. This has been a main motivation for this study, in which we have examined the feasibility of familial searches if a mixture is used as starting point. We have seen that the performance of familial searches based on a mixture on the NGM loci is, in many cases, better than the performance of familial searches of single source profiles on the SGMPlus loci. Therefore, if familial searches with mixtures are performed on the NGM part of the database, we can expect similar or better true positive and false positive rates for a given likelihood ratio threshold as one has for comparisons with single source

traces on the SGMPlus part of the database. Thus, from this point of view, there is no objection to carrying out familial searches based on a mixture.

We note also that for this study we have performed all calculations using the alleles detected in the simulated mixture profiles. We have not considered peak heights. In actual casework, it may be possible, on some loci, to determine the profile of the PoI. This can only improve the performance of a familial search. One could use this information by calculating the LR using different parameters on the loci where this is possible. For example, one could treat the trace as a single source unambiguous profile on the loci where this is possible and as a mixture on the other loci.

Throughout this manuscript we have considered mixtures with dropout, where the dropout probability has been set constant over all loci. Sometimes degraded profiles are encountered, meaning that the dropout probability increases strongly with increasing fragment size. We have not treated such mixtures explicitly; however, it is possible to do so. Indeed, conditionally on the model parameters (i.e., $n$, $\vec{d}$, $c$) the mixture likelihood factorizes over the loci. Thus, on every locus these parameters can be chosen separately; this can be used to incorporate degradation by taking locus-specific probabilities of dropout.

If candidate relatives are found, then it is desirable to carry out additional genetic testing to further investigate the possibility of relatedness. For a single source trace of male origin, an adequate method is to do so by Y-chromosomal testing. In mixtures, the efficiency of this method will of course depend on the number of male contributors. If there are several, then recent developments such as Y-chromosomal mixture deconvolution based on haplotype frequencies may be helpful, cf. [15], also to assign a weight of evidence if a match is obtained.
In any case, in view of the discriminatory power of single Y-STR haplotypes, it seems reasonable to expect that a mixed Y-STR profile is still very useful for exclusion purposes. In case the PoI and the tentative relative are not both males, Y-STRs can obviously not assist and then one can carry out further autosomal testing. For example, if one determines the profile on 15 additional autosomal STRs, then the amount of information, originally based on 15 loci, roughly doubles. For a scenario where the original mixture’s performance (based on 15 loci) was similar to that of a single source trace on 10 loci, this means that we can expect the same characteristics after additional testing as if we had a single source profile with twenty autosomal loci. The latter is, given the size of databases, in practice typically sufficient to investigate possible leads. Therefore, if enough genetic material is available to be able


to carry out additional DNA testing, the exclusion of obtained false positives from the database search should in many cases be feasible, and we conclude that extension of the familial searching method to mixtures, with or without allelic dropout, can certainly be considered, if there is enough certainty concerning the person of interest’s connection to the case under investigation.

In casework, it is of course not possible to determine the dropout probabilities precisely, even disregarding the fact that the binary model does not perfectly reflect reality. On the other hand, familial searching primarily being an investigative technique, this need not be of great concern. If one has a specific mixture at hand, one may evaluate the potential for familial searching using one or more choices of plausible parameters such as the number of contributors and their dropout probabilities. Each will give estimated true and false positive rates. The false positive rates predict the number of unrelated individuals whose likelihood ratio exceeds a certain threshold. This rate is determined by the chosen probability model, and since the genotype distribution of unrelated individuals is not governed by the mixture, this is also the rate that will be realized in practice if the chosen model is used for a familial search in the database. The predicted true positive rate associated with a certain LR-threshold will depend on the chosen model, since the model induces a deconvolution of the mixture and hence a probability distribution on relatives of the contributors as well. A less accurate choice of parameters will lead to a less accurate deconvolution, hence to a less accurate estimate of the true positive rate. Similarly to how an estimate of the LR in favour of contribution to a mixture may be obtained from a sensitivity analysis on the model parameters, an estimate of the true positive rate may be obtained in the same manner.
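The link between a false positive rate and the number of unrelated database members expected above a threshold can be made concrete with a small calculation. The sketch below is illustrative only: the database size and the FPR value are assumed numbers rather than results of this study, the function names are hypothetical, and the unrelated database members are treated as independent.

```python
def expected_false_positives(fpr: float, n_unrelated: int) -> float:
    """Expected number of unrelated database members whose LR exceeds the threshold."""
    return fpr * n_unrelated


def prob_at_least_one_false_positive(fpr: float, n_unrelated: int) -> float:
    """Probability that at least one unrelated member exceeds the threshold,
    assuming (as a simplification) that the members are independent."""
    return 1.0 - (1.0 - fpr) ** n_unrelated


if __name__ == "__main__":
    database_size = 200_000   # hypothetical number of unrelated profiles
    fpr = 1e-5                # illustrative false positive rate at some threshold
    print("expected false positives:",
          round(expected_false_positives(fpr, database_size), 3))
    print("P(at least one false positive):",
          round(prob_at_least_one_false_positive(fpr, database_size), 3))
```

Repeating the calculation for the FPR values obtained under several plausible parameter choices gives a simple view of how much follow-up testing a search at a given threshold may require.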
Finally, we remark that although we have mentioned only mixtures involving more than one donor, the same considerations also apply to single source traces affected by allelic dropout. By choosing an appropriate probability of dropout and drop-in, it is then possible to include all the obtained replicate analyses in a familial search without having to resort to potentially less informative consensus profiles obtained from these replicates.
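A rough calculation illustrates why a consensus profile may be less informative than the replicates themselves. The sketch below assumes that an allele of a heterozygous donor drops out independently in each replicate with the same probability d, ignores drop-in, and uses a hypothetical consensus rule that retains an allele only if it is seen in at least two of three replicates. The rule and the function name are assumptions made for illustration; only the values of d and the number of replicates are taken from the examples above.

```python
from math import comb


def prob_seen_at_least(k: int, n_replicates: int, dropout: float) -> float:
    """P(an allele is detected in at least k of n replicates), with
    independent dropout of probability `dropout` in every replicate."""
    detect = 1.0 - dropout
    return sum(
        comb(n_replicates, j) * detect**j * dropout ** (n_replicates - j)
        for j in range(k, n_replicates + 1)
    )


if __name__ == "__main__":
    for d in (0.1, 0.3, 0.5):
        seen_once = prob_seen_at_least(1, 3, d)      # allele available to the model
        in_consensus = prob_seen_at_least(2, 3, d)   # allele kept by the assumed rule
        print(f"d = {d}: seen at least once {seen_once:.3f}, "
              f"in 2-of-3 consensus {in_consensus:.3f}")
```

Under these assumptions, the gap between the two probabilities widens as d increases, which is the situation where using all replicates directly in the likelihood can be expected to help most.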

Appendix A. Results on the SGMPlus loci

In this appendix, we include the ROC curves analogous to the ones in the main body of the paper, but now with familial searches for mixtures based on the SGMPlus loci instead of on the NGM loci.


A.1. Two-person mixtures without dropout

For mixtures without dropout, the ROC curves are plotted in Fig. A.1, analogous to Fig. 3.2.

A.2. Two-person mixtures with dropout (0.3, 0.3)

For mixtures with dropout (0.3, 0.3), the ROC curves are plotted in Fig. A.2, analogous to Fig. 4.1. We still consider that we have three replicate analyses available.

A.3. Two-person mixtures with dropout (0.1, 0.5)

For mixtures with dropout (0.1, 0.5), the ROC curves are plotted in Fig. A.3 (familial search for the major donor), analogous to Fig. 4.2, and in Fig. A.4 (familial search for the minor donor), analogous to Fig. 4.3. We still consider that we have three replicate analyses available. Notice that, in a search for the major donor where the minor donor is known, the ROC curves for PI and SI are close to those for single source SGMPlus traces, similar to what was observed for the NGM loci.

Appendix B. Three-person mixtures

Finally, we briefly consider mixtures of three individuals. We distinguish between the case where there is no dropout, and a case where the three donors have markedly different dropout probabilities.

B.1. No dropout

We plot the ROC curves for three-person mixtures without dropout, looking for a sibling or a parent/child of one of the donors, in Fig. B.1. For both searches, if the mixture is compared to reference profiles on the NGM loci, then the false rates are comparable to those obtained with a single source SGMPlus profile if the two other donors are known; of course if one or none of the donors are known, the performance drops below that benchmark. If the comparison is carried out on the SGMPlus loci only, then a familial search becomes much less feasible, as can be seen from Fig. B.2.


Fig. A.1. Average ROC curves for familial searches on two-person mixtures without dropout, considered on the SGMPlus loci. Dots are labelled by Log10 of their corresponding SI- or PI-threshold. Dotted lines are for reference and correspond to the single source traces in Fig. 3.1 (open circles: NGM; open squares: SGMPlus). The full lines correspond to mixtures where one donor is known (left graph, marked with filled circles) and to mixtures where both donors are unknown (right graph, marked with filled squares).


Fig. A.2. Average ROC curves for familial searches on two-person mixtures with dropout (0.3, 0.3), considered on the SGMPlus loci. Dots are labelled by Log10 of their corresponding SI- or PI-threshold. Dotted lines are for reference and correspond to the single source traces in Fig. 3.1 (open circles: NGM; open squares: SGMPlus). The full lines correspond to mixtures where one donor is known (left graph, marked with filled circles) and to mixtures where both donors are unknown (right graph, marked with filled squares).


Fig. A.3. Average ROC curves for familial searches on two-person mixtures with dropout (0.1, 0.5), on the SGMPlus loci, with the PoI being the major donor. Dots are labelled by Log10 of their corresponding SI- or PI-threshold. Dotted lines correspond to the single source traces in Fig. 3.1 (open circles: NGM; open squares: SGMPlus). The full lines correspond to mixtures where the minor with dropout 0.5 is known (left graph, marked with filled circles) and to mixtures where both donors are unknown (right graph, marked with filled squares).


Fig. A.4. Average ROC curves for familial searches on two-person mixtures with dropout (0.1, 0.5), on the SGMPlus loci, with the PoI being the minor donor. Dots are labelled by Log10 of their corresponding SI- or PI-threshold. Dotted lines correspond to the single source traces in Fig. 3.1 (open circles: NGM; open squares: SGMPlus). The full lines correspond to mixtures where the major with dropout 0.1 is known (left graph, marked with filled circles) and to mixtures where both donors are unknown (right graph, marked with filled squares).

B.2. With dropout

As a final example, we investigate mixtures where the three donors have dropout probabilities equal to (d1, d2, d3) = (0.1, 0.4, 0.7). As before, we suppose that we have three replicate analyses of

the mixture and work with c = 0.05 throughout. We only consider searches for the first donor (with d1 = 0.1) and distinguish between the cases where none of the donors are known, where only the second donor (with d2 = 0.4) is known, and where both the second and third donor (with d2 = 0.4, d3 = 0.7) are known. The resulting


Fig. B.1. Average ROC curves for familial searches on three-person mixtures without dropout, considered on the NGM loci. Dots are labelled by Log10 of their corresponding SI- or PI-threshold. Dotted lines are printed for reference and correspond to the single source traces in Fig. 3.1. The full lines correspond to mixtures where two donors are known (leftmost graph, marked with downward pointing triangles), resp. mixtures where one donor is known (middle graph, marked with upward pointing triangles), resp. mixtures where no donors are known (rightmost graph, marked with diamonds).


Fig. B.2. Average ROC curves for familial searches on three-person mixtures without dropout, considered on the SGMPlus loci. Dots are labelled by Log10 of their corresponding SI- or PI-threshold. Dotted lines are printed for reference and correspond to the single source traces in Fig. 3.1. The full lines correspond to mixtures where two donors are known (leftmost graph, marked with downward pointing triangles), resp. mixtures where one donor is known (middle graph, marked with upward pointing triangles) resp. to mixtures where no donors are known (rightmost graph, marked with diamonds).


Fig. B.3. Average ROC curves for familial searches on three-person mixtures with dropout (0.1, 0.4, 0.7), considered on the NGM loci, with the PoI being the donor with d1 = 0.1. Dots are labelled by Log10 of their corresponding SI- or PI-threshold. Dotted lines are printed for reference and correspond to the single source traces in Fig. 3.1. The full lines correspond to mixtures where the donors with d2 = 0.4, d3 = 0.7 are known (leftmost graph, marked with downward pointing triangles), resp. mixtures where the donor with d2 = 0.4 is known (middle graph, marked with upward pointing triangles), resp. mixtures where no donors are known (rightmost graph, marked with diamonds).

ROC curves are displayed in Fig. B.3 for comparisons on the NGM loci and Fig. B.4 for comparisons on the SGMPlus loci. The performance of familial searches on these mixtures is notably better than for three-person mixtures without dropout, as was the

case for two-person mixtures. Indeed, evaluated on the NGM loci, for siblings the TPR, as a function of the FPR, lies between those for single source NGM and SGMPlus comparisons, where obviously a better TPR is achieved if more donors are known. For parents and


Fig. B.4. Average ROC curves for familial searches on three-person mixtures with dropout (0.1, 0.4, 0.7), considered on the SGMPlus loci, with the PoI being the donor with d1 = 0.1. Dots are labelled by Log10 of their corresponding SI- or PI-threshold. Dotted lines are printed for reference and correspond to the single source traces in Fig. 3.1. The full lines correspond to mixtures where the donors with d2 = 0.4, d3 = 0.7 are known (leftmost graph, marked with downward pointing triangles), resp. mixtures where the donor with d2 = 0.4 is known (middle graph, marked with upward pointing triangles), resp. mixtures where no donors are known (rightmost graph, marked with diamonds).

children, the behaviour is also qualitatively comparable to what we saw for two-person mixtures.

References

[1] M. Kruijver, R. Meester, K. Slooten, Optimal strategies for familial searching, Forensic Sci. Int.: Genet. 13 (2014) 90–103.
[2] Y.-K. Chung, Y.-Q. Hu, W. Fung, Evaluation of DNA mixtures from Database Search, Biometrics 66 (2010) 233–238.
[3] Y.-K. Chung, Y.-Q. Hu, W. Fung, Familial database search on two-person mixture, Comput. Stat. Data Anal. 54 (2010) 2046–2051.
[4] K. Slooten, Distinguishing between donors and their relatives in complex DNA mixtures with binary models, Forensic Sci. Int.: Genet. 21 (2016) 95–109.
[5] H. Haned, K. Slooten, P. Gill, Exploratory data analysis for the interpretation of low template DNA mixtures, Forensic Sci. Int.: Genet. 6 (6) (2012) 762–774.
[6] D. Balding, J. Buckleton, Interpreting low template DNA profiles, Forensic Sci. Int.: Genet. (4) (2009) 1–10.
[7] T. Tvedebrink, P.S. Eriksen, H.S. Mogensen, N. Morling, Statistical model for degraded DNA samples and adjusted probabilities for allelic drop-out, Forensic Sci. Int.: Genet. 6 (2012) 97–101.
[8] T. Tvedebrink, P.S. Eriksen, M. Asplund, H.S. Mogensen, N. Morling, Allelic dropout probabilities estimated by logistic regression-further considerations and practical implementation, Forensic Sci. Int.: Genet. 6 (2012) 263–267.
[9] K. Slooten, R. Meester, Probabilistic strategies for familial DNA searching, J. R. Stat. Soc.: Ser. C (Appl. Stat.) 63 (3) (2014) 361–384.
[10] K. Slooten, T. Egeland, Exclusion probabilities and likelihood ratios with applications to mixtures, Int. J. Legal Med. 130 (1) (2015) 39–57.
[11] G. Dørum, O. Bleka, P. Gill, H. Haned, T. Egeland, Exact computation of the distribution of likelihood ratios with forensic applications, Forensic Sci. Int.: Genet. 9 (2014) 93–101.
[12] M. Kruijver, Efficient computations with the likelihood ratio distribution, Forensic Sci. Int.: Genet. 14 (2015) 116–124.
[13] D.J. Balding, M. Krawczak, J.S. Buckleton, J.M. Curran, Decision-making in familial database searching: KI alone or not alone? Forensic Sci. Int.: Genet. 7 (1) (2013) 52–54, http://dx.doi.org/10.1016/j.fsigen.2012.06.001.
[14] J. Ge, R. Chakraborty, A. Eisenberg, et al., Comparisons of familial DNA database searching strategies, J. Forensic Sci. 56 (2011) 1448–1456.
[15] M.M. Andersen, P.S. Eriksen, H.S. Mogensen, N. Morling, Identifying the most likely contributors to a Y-STR mixture using the discrete Laplace method, Forensic Sci. Int.: Genet. 15 (2015) 76–83, http://dx.doi.org/10.1016/j.fsigen.2014.09.011.
