Accepted Manuscript

Title: Bayesian classification criterion for forensic multivariate data
Authors: S. Bozza, J. Broséus, P. Esseiva, F. Taroni

PII: S0379-0738(14)00399-5
DOI: http://dx.doi.org/10.1016/j.forsciint.2014.09.017
Reference: FSI 7749
To appear in: Forensic Science International

Received date: 26-3-2014
Revised date: 11-8-2014
Accepted date: 16-9-2014

Please cite this article as: S. Bozza, J. Broséus, P. Esseiva, F. Taroni, Bayesian classification criterion for forensic multivariate data, Forensic Science International (2014), http://dx.doi.org/10.1016/j.forsciint.2014.09.017


Bayesian classification criterion for forensic multivariate data

S. Bozza∗, J. Broséus, P. Esseiva, F. Taroni

a Ca' Foscari University of Venice, Department of Economics, Venice, Italy
b University of Lausanne, School of Criminal Justice, Lausanne, Switzerland

Abstract

This study presents a classification criterion for two-class Cannabis seedlings. As the cultivation of drug type Cannabis is forbidden in Switzerland, law enforcement authorities regularly ask laboratories to determine the chemotype of Cannabis plants from seized material in order to ascertain whether a plantation is legal or not. In this study, the classification analysis is based on data obtained from the relative proportions of three major leaf compounds measured by gas chromatography interfaced with mass spectrometry (GC-MS). The aim is to discriminate between drug type (illegal) and fibre type (legal) Cannabis at an early stage of growth. A Bayesian procedure is proposed: a Bayes factor is computed and classification is performed on the basis of the decision maker's specifications (i.e. prior probability distributions on the Cannabis type and consequences of classification measured by losses). Classification rates are computed with two statistical models and the results are compared. A sensitivity analysis is then performed to analyse the robustness of the classification criterion.


Keywords: Bayes’ factor, classification, decision theory, loss function, drugs

∗ Corresponding author


Acknowledgments

This research was supported by the Swiss National Science Foundation, grant no. 100012-144227.


Highlights (Bayesian classification criterion for forensic multivariate data)

• A classification criterion for two-class Cannabis seedlings based on multivariate data.
• Discrimination between drug type (illegal) and fibre type (legal) Cannabis at an early stage of growth.
• A Bayesian decision-theoretic approach for classification.
• Analysis of the robustness of the classification criterion.



Bayesian classification criterion for forensic multivariate data

Abstract


This study presents a classification criterion for two-class Cannabis seedlings. As the cultivation of drug type Cannabis is forbidden in Switzerland, law enforcement authorities regularly ask laboratories to determine the chemotype of Cannabis plants from seized material in order to ascertain whether a plantation is legal or not. In this study, the classification analysis is based on data obtained from the relative proportions of three major leaf compounds measured by gas chromatography interfaced with mass spectrometry (GC-MS). The aim is to discriminate between drug type (illegal) and fibre type (legal) Cannabis at an early stage of growth. A Bayesian procedure is proposed: a Bayes factor is computed and classification is performed on the basis of the decision maker's specifications (i.e. prior probability distributions on the Cannabis type and consequences of classification measured by losses). Classification rates are computed with two statistical models and the results are compared. A sensitivity analysis is then performed to analyse the robustness of the classification criterion.


Keywords: Bayes’ factor, classification, decision theory, loss function, drugs

1. Introduction


Scientists are routinely faced with the problem of classifying items or individuals into one of two or more populations on the basis of the available measurements of several attributes, in other words in the presence of multivariate data. A recovered item of unknown origin, say a skeletal remain, a fragment of glass recovered at a crime scene, a drug sample, or a handwritten or printed character in a questioned anonymous document, can be described by more than one variable. The paper focuses on a recurrent forensic problem, that of ascertaining whether a Cannabis plantation is legal or not. The data used to perform the present work come from previous studies focused on the discrimination of Cannabis seedlings using their chemical profiles and chemometric tools. The implemented methodology is described in detail in [1] and [2]; a GC-MS analysis was used to establish the chemical profile of every Cannabis seedling. Fifteen target compounds were selected taking into account their presence in drug type (illegal) and fibre type (legal) Cannabis. These target compounds are specific to Cannabis and showed high discrimination capability for the differentiation between drug- and fibre type Cannabis. For each target compound, the area values for the respective target ions were extracted. From this set of compounds, the main cannabinoids D9-Tetrahydrocannabinol (THC), Cannabidiol (CBD) and Cannabinol (CBN), the latter being a degradation product of THC, were selected. After the extraction of the respective GC-MS areas for CBD, THC and CBN, data processing was performed. Peak areas were normalized to the internal standard and the square root was taken in order to reduce the influence of larger peaks and thus to have the variables on a comparable scale. Then, the data were scaled to zero mean and unit variance. A statistical model for the evaluation of evidence through the computation of the Bayes factor (BF) for multivariate data has been proposed, among others, by [3] in the context of the elemental composition of glass and by [4] in the context of handwritten questioned documents. The proposed models allow one to compare a recovered item of unknown origin with a control item of known origin through the computation of a Bayes factor (in this context often referred to as a likelihood ratio), a rigorous method that offers a balanced measure of the degree to which the evidence is capable of discriminating between opposing propositions. Generally, it can be said that the Bayes factor can be used for two main purposes. A first purpose consists of assigning a value to a given item of evidence. This refers to the evaluative level at which forensic scientists, for example, operate. Evaluating a piece of evidence means that the scientist provides an expression of the value of the evidence in support of a hypothesis of interest involving a given person (e.g. the recovered and the control items come from the same source or from different sources). A second purpose is that of providing information to the investigators. Here, the scientist acts at an investigative level. At this stage, the scientist tries to answer questions such as 'what happened?' The forensic scientist is said to be 'crime focused' and observes evidence which forms the basis to generate hypotheses and suggestions for explanations, in order to give guidance to the investigators (e.g. does the recovered item come from a given population or from an alternative one?).
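As a rough illustration of the data preprocessing described above (normalisation to the internal standard, square-root transform, autoscaling), the following Python sketch uses made-up peak areas and an assumed internal-standard column; the array shapes and variable names are ours, not taken from [1] or [2].

```python
import numpy as np

def preprocess(peak_areas, internal_standard):
    """Sketch of the preprocessing described above: normalise the CBD, THC and
    CBN peak areas to the internal standard, take square roots to damp the
    influence of larger peaks, then scale each variable to zero mean and unit
    variance (autoscaling)."""
    x = np.sqrt(peak_areas / internal_standard[:, None])     # normalise, then square root
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)      # zero mean, unit variance

# toy numbers, for illustration only: rows are samples, columns are CBD, THC, CBN areas
areas = np.array([[1.2e6, 3.4e5, 2.1e4],
                  [9.8e5, 3.1e5, 1.9e4]])
istd = np.array([5.0e5, 4.8e5])                              # internal-standard area per sample
print(preprocess(areas, istd))
```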


In this paper, the second purpose is developed and a Bayesian classification criterion is considered. The paper is organized as follows. Section 2 presents the available databases and the statistical model. Standards of classification and results (in terms of the models' performances and their robustness) are presented in Sections 3 and 4, respectively. Section 5 concludes the paper. A case example is presented in the Appendix.

2. Population database and models


The background data consist of measurements of $k$ variables ($k = 3$) expressing amounts in a sample of $m_1 = 132$ plants of illegal drug type (population 1, $p = 1$) and a sample of $m_2 = 158$ plants of legal fibre type (population 2, $p = 2$), with $n = 3$ replicate measurements on each sample. Denote the background data as $x_{pij} = (x_{pij1}, \ldots, x_{pijk})'$, $p = 1, 2$, $i = 1, \ldots, m_p$, $j = 1, \ldots, n$. The available data suggest a statistical model with two levels of variation: that between replicate measurements on the same sample (also called the within-source variation), and that between measurements on different samples from the same population (also called the between-source variation). The distribution of the observations in each population is taken to be Normal, with $X_{pij} \sim N(\theta_{pi}, W_p)$, $p = 1, 2$, $i = 1, \ldots, m_p$, $j = 1, \ldots, n$, where $\theta_{pi}$ denotes the mean vector within sample $i$ in population $p$, and $W_p$ denotes the matrix of variances and covariances in population $p$. Once a new observation is available, the problem becomes classifying it as coming from one of the two multivariate Normal populations.
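The two-level structure can be made concrete with a small simulation sketch; the parameter values below are purely illustrative placeholders, not estimates from the Cannabis data.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 3, 3                        # variables per measurement, replicates per plant
m = {1: 132, 2: 158}               # plants per population (drug type, fibre type)

# Illustrative population parameters: mu_p overall mean, B_p between-source and
# W_p within-source covariance matrices (assumed values, for demonstration only).
mu = {1: np.array([-0.5, 1.0, 0.9]), 2: np.array([0.4, -0.8, -0.8])}
B = {p: 0.5 * np.eye(k) for p in (1, 2)}
W = {p: 0.02 * np.eye(k) for p in (1, 2)}

def simulate(p):
    """Draw data x[p][i][j] from the two-level model of Section 2:
    theta_pi ~ N(mu_p, B_p) and x_pij | theta_pi ~ N(theta_pi, W_p)."""
    thetas = rng.multivariate_normal(mu[p], B[p], size=m[p])
    return np.stack([rng.multivariate_normal(t, W[p], size=n) for t in thetas])

x1, x2 = simulate(1), simulate(2)  # arrays of shape (m_p, n, k)
```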


Denote the available measurements on the recovered item to be classified by $y = (y_1, \ldots, y_n)$, where $y_j = (y_{j1}, \ldots, y_{jk})$, $j = 1, \ldots, n$. Two propositions are considered:

H1: the seized plant is of drug type (population 1);

H2: the seized plant is of fibre type (population 2).


For each proposition, the probability distribution of the available measurements $y$ is denoted by $f(y \mid H_p, \phi_p)$, where $\phi_p = \{\theta_p, W_p\}$ is the vector of parameters under proposition $H_p$, $p = 1, 2$. Observations will be considered as realisations from a given suitable probability model $f(\cdot)$ (e.g., in this paper observations will be treated as realisations from a Normal distribution). Clearly, for each population the peculiar type of Cannabis seedling will be described by specific population parameters (such as the population mean vector $\theta_p$ and the population matrix of within-source variances and covariances $W_p$). Therefore, conditional on proposition $H_p$, the distribution of the sample mean $\bar y = \frac{1}{n}\sum_{j=1}^{n} y_j$ is taken to be Normal, that is $(\bar y \mid H_p) \sim N(\theta_p, n^{-1}W_p)$. In a Bayesian perspective, let $\pi_1$ denote the prior probability of proposition $H_1$, $\pi_1 = \Pr(H_1)$, and let $\pi_2$ denote the prior probability of proposition $H_2$, $\pi_2 = \Pr(H_2)$. The prior probability serves as a measure of uncertainty for a given proposition, that is, how likely it is that the seized material is of 'drug' (or 'fibre') type. When law enforcement authorities ask laboratories to determine the chemotype of a Cannabis plant from seized material, it is not known whether a given proposition is true or not (say, whether the plantation is legal or not). The prior probability expresses the degree to which the proposition of interest is taken to be true, given the circumstantial information the investigators may have at their disposal. The ratio $\pi_1/\pi_2$ of the prior probabilities of propositions $H_1$ and $H_2$ is called the prior odds of $H_1$ to $H_2$. The prior odds indicates whether a priori proposition $H_1$ is more or less likely than proposition $H_2$ (prior odds larger or smaller than 1), or whether the two propositions are almost equally likely (prior odds close to 1). The posterior probability of proposition $H_1$, $\Pr(H_1 \mid y)$, is denoted $\alpha_1$ and can easily be computed according to Bayes' theorem,

$$\alpha_1 = \Pr(H_1 \mid y) = \frac{f(y \mid H_1, \phi_1)\Pr(H_1)}{f(y \mid H_1, \phi_1)\Pr(H_1) + f(y \mid H_2, \phi_2)\Pr(H_2)}, \qquad (1)$$

where $\phi_1 = \{\theta_1, W_1\}$ and $\phi_2 = \{\theta_2, W_2\}$ are the parameter vectors characterizing the two populations. In the same way, the posterior probability $\Pr(H_2 \mid y)$ of proposition $H_2$, denoted $\alpha_2$, is equal to

$$\alpha_2 = \Pr(H_2 \mid y) = \frac{f(y \mid H_2, \phi_2)\Pr(H_2)}{f(y \mid H_1, \phi_1)\Pr(H_1) + f(y \mid H_2, \phi_2)\Pr(H_2)}. \qquad (2)$$

Posterior probabilities of the two propositions incorporate data and prior opinions. In this way, following a Bayesian perspective, the task of deciding whether to classify the observation in population 1 or 2 is greatly simplified; one needs to calculate the posterior probabilities $\alpha_1$ and $\alpha_2$ of the two propositions and decide accordingly.


Table 1: The '0 – $l_p$' loss function. Decisions $d_1$ and $d_2$ refer to classifying an observation from an item of unknown origin in population 1 and 2, respectively.

        H1    H2
d1      0     l1
d2      l2    0

The ratio $\alpha_1/\alpha_2$ of the posterior probabilities in (1) and (2) of propositions $H_1$ and $H_2$ is called the posterior odds of $H_1$ to $H_2$ and is equal to

$$\frac{\alpha_1}{\alpha_2} = \frac{\Pr(H_1 \mid y)}{\Pr(H_2 \mid y)} = \underbrace{\frac{f(y \mid H_1, \phi_1)}{f(y \mid H_2, \phi_2)}}_{LR} \times \frac{\Pr(H_1)}{\Pr(H_2)} = LR \times \frac{\pi_1}{\pi_2}. \qquad (3)$$

The posterior odds indicates whether a posteriori, that is once the data become available, proposition $H_1$ is more or less likely than proposition $H_2$, as was underlined for the prior odds. The Bayes factor (BF), defined as the ratio of the posterior odds, $\alpha_1/\alpha_2$, to the prior odds, $\pi_1/\pi_2$, measures the change produced by the data in the odds when going from the prior distribution to the posterior distribution. It is worth noting that, whenever the population parameters $\phi_p$ are known, the posterior odds is simply the product of the likelihood ratio and the prior odds as in (3), and therefore the BF reduces to the well-known likelihood ratio, that is

$$BF = \frac{\alpha_1}{\alpha_2} \Big/ \frac{\pi_1}{\pi_2} = \frac{f(y \mid H_1, \phi_1)}{f(y \mid H_2, \phi_2)} = LR.$$


Nevertheless, population parameters are generally not known and, according to the Bayesian approach, a prior distribution may be introduced to model the available prior knowledge. In this way, as will be shown in Sections 3.1 and 3.2, the BF takes the form of a ratio of weighted likelihoods and no longer simplifies to a likelihood ratio. A Bayesian decision-theoretic approach is adopted. Let $D = \{d_1, d_2\}$ denote the decision space, where $d_1$ ($d_2$) represents the decision of classifying the plant originating the available observation in population 1 (2). Decision $d_1$ ($d_2$) is correct if proposition $H_1$ ($H_2$) is true; conversely, decision $d_1$ ($d_2$) is incorrect if proposition $H_1$ ($H_2$) is not true. A loss function suitable to describe a two-action decision problem such as the one at hand is the '0 – $l_p$' loss function in Table 1, where $l_1 = L(d_1, H_2)$ represents the loss of classifying an item of population 2 (proposition $H_2$ is true) as a member of population 1 (decision $d_1$ is taken), and $l_2 = L(d_2, H_1)$ represents the loss of classifying an item of population 1 (proposition $H_1$ is true) as a member of population 2 (decision $d_2$ is taken). The loss is zero whenever a correct decision is taken, that is $L(d_1, H_1) = L(d_2, H_2) = 0$. Conversely, whenever an incorrect decision is taken, one incurs a positive loss. The losses $l_1$ and $l_2$ may be equal, whenever the incorrect decisions are considered equally undesirable; otherwise, the magnitude of each loss will represent the undesirability of each specific occurrence. For each decision, one can easily compute the expected loss $EL(\cdot)$ as:

$$EL(d_1) = \underbrace{L(d_1, H_1)}_{0}\,\Pr(H_1 \mid y) + \underbrace{L(d_1, H_2)}_{l_1}\,\Pr(H_2 \mid y) = l_1\,\alpha_2, \qquad (4)$$

$$EL(d_2) = \underbrace{L(d_2, H_2)}_{0}\,\Pr(H_2 \mid y) + \underbrace{L(d_2, H_1)}_{l_2}\,\Pr(H_1 \mid y) = l_2\,\alpha_1. \qquad (5)$$

A coherent classification procedure is the Bayes decision procedure, since it minimizes the probability of misclassification (see for example [5]). The best decision after having observed $y$ is the decision that minimises the expected loss calculated using the posterior probabilities. According to this, the plant of unknown origin is classified in population 1 (drug type) if the expected loss of decision $d_2$ in (5) is greater than the expected loss of decision $d_1$ in (4), that is if

$$EL(d_2) = l_2\,\alpha_1 > l_1\,\alpha_2 = EL(d_1). \qquad (6)$$

Otherwise, if the expected loss of decision $d_1$ is larger than the expected loss of decision $d_2$, the questioned plant is classified in population 2 (fibre type). Rearranging the terms in Equation (6) as $\alpha_1/\alpha_2 > l_1/l_2$ and dividing both sides by the prior odds, a threshold $c$ for the interpretation of the Bayes factor is obtained, that is

$$BF = \frac{\alpha_1}{\alpha_2} \Big/ \frac{\pi_1}{\pi_2} > \frac{l_1}{l_2} \Big/ \frac{\pi_1}{\pi_2} = c. \qquad (7)$$


The optimal decision criterion is to classify the observation in population 1 (2) whenever the Bayes factor is greater (lower) than $c$. Note that, whenever a symmetric loss function can reasonably be adopted and the two populations are a priori equally likely, the Bayesian criterion suggests classifying the available observation in population 1 (2) whenever the Bayes factor is greater (less) than 1. This assumption will be relaxed in Section 4.3, where the classification rate will be computed for asymmetric losses and different prior probabilities for the two populations of interest.
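The resulting decision rule is compact enough to state in a few lines of code; the following sketch simply encodes the threshold of equation (7) (function and variable names are ours, and the numerical BF used below is the value reported in the Appendix, taken only as an illustration).

```python
def classify(bf, prior_odds=1.0, loss_ratio=1.0):
    """Bayes decision rule of equation (7): classify as drug type (population 1)
    when BF exceeds c = (l1/l2) / (pi1/pi2), otherwise as fibre type.
    loss_ratio is l1/l2 and prior_odds is pi1/pi2."""
    c = loss_ratio / prior_odds
    return "drug type (population 1)" if bf > c else "fibre type (population 2)"

# With symmetric losses and equal priors the threshold is c = 1:
print(classify(5.89))                                   # -> drug type
print(classify(5.89, prior_odds=0.25, loss_ratio=10))   # c = 40 -> fibre type
```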

3.1. Bayes factor using a Normal distribution for the between-source variability

The source mean vectors $\theta_1$ and $\theta_2$ for each population are not equal and are unknown. Assume for the between-source variability a Normal distribution, that is $\pi(\theta_p \mid \mu_p, B_p) = N(\mu_p, B_p)$ for $p = 1, 2$, where $\mu_p$ represents the mean vector between sources in population $p$, and $B_p$ represents the matrix of between-source variances and covariances in population $p$. To compute the BF, the marginal probability density function under each proposition needs to be computed. This is denoted by $f_p(y \mid H_p, \mu_p, B_p, W_p)$ and is obtained by integrating out the unknown parameter $\theta_p$, that is

$$f_p(y \mid H_p, \mu_p, B_p, W_p) = \int_{\theta_p} f(\bar y \mid \theta_p, W_p)\,\pi(\theta_p \mid \mu_p, B_p)\,d\theta_p$$
$$= \int_{\theta_p} \underbrace{|2\pi|^{-nk/2}\,|W_p|^{-n/2}\exp\left[-\tfrac{n}{2}\left(\bar y - \theta_p\right)' W_p^{-1}\left(\bar y - \theta_p\right)\right]}_{N(\theta_p,\,n^{-1}W_p)}\;\underbrace{|2\pi|^{-k/2}\,|B_p|^{-1/2}\exp\left[-\tfrac{1}{2}\left(\theta_p - \mu_p\right)' B_p^{-1}\left(\theta_p - \mu_p\right)\right]}_{N(\mu_p,\,B_p)}\,d\theta_p. \qquad (8)$$

The integral in (8) can be computed analytically and can be shown (see for example [3]) to be equal to

$$f_p(y \mid H_p, \mu_p, B_p, W_p) = |2\pi W_p|^{-\frac{n}{2}}\,|2\pi B_p|^{-\frac{1}{2}}\,\left|2\pi\left(n W_p^{-1} + B_p^{-1}\right)^{-1}\right|^{\frac{1}{2}}\exp\left\{-\frac{1}{2}\left[\mathrm{tr}\left(S\,W_p^{-1}\right) + \left(\bar y - \mu_p\right)'\left(n^{-1}W_p + B_p\right)^{-1}\left(\bar y - \mu_p\right)\right]\right\}, \qquad (9)$$

for $p = 1, 2$, where $S = \sum_{j=1}^{n}(y_j - \bar y)(y_j - \bar y)'$ and $\bar y = \frac{1}{n}\sum_{j=1}^{n} y_j$. Note that the marginal probability density function under proposition $H_p$, denoted $f_p(\cdot)$, also depends on the prior parameters $\mu_p$ and $B_p$, and it does not belong to the Normal parametric family anymore (to underline this, a subscript $p$ has been added to the notation for the marginal probability density). The Bayes factor is given by the ratio of the marginal probability densities for the two populations obtained in (9) for $p = 1, 2$ and is equal to

$$BF = \frac{f_1(y \mid H_1, \mu_1, W_1, B_1)}{f_2(y \mid H_2, \mu_2, W_2, B_2)} = \left(\frac{|W_1|}{|W_2|}\right)^{-\frac{n}{2}}\left(\frac{|B_1|}{|B_2|}\right)^{-\frac{1}{2}}\left(\frac{\left|\left(n W_1^{-1} + B_1^{-1}\right)^{-1}\right|}{\left|\left(n W_2^{-1} + B_2^{-1}\right)^{-1}\right|}\right)^{\frac{1}{2}}\exp\left\{-\frac{1}{2}\sum_{i=1}^{2}(-1)^{i-1}\left[\mathrm{tr}\left(S\,W_i^{-1}\right) + \left(\bar y - \mu_i\right)'\left(n^{-1}W_i + B_i\right)^{-1}\left(\bar y - \mu_i\right)\right]\right\}. \qquad (10)$$

The Bayes factor is now the ratio of the two weighted likelihoods in (9). For this reason the BF can no longer be viewed as a measure of the relative support for the competing propositions provided by the data alone, because it also depends on the prior specification of the parameters. This Bayes factor is compared with the threshold $c$ specified in (7) and the questioned item is classified accordingly. This classification criterion will be termed method 3.1.
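For readers who wish to experiment, a minimal Python sketch of equations (9) and (10) could look as follows; it works on the log scale for numerical stability, and the function names and NumPy-based implementation are ours, not part of the original paper.

```python
import numpy as np

def log_marginal_normal(ybar, S, n, mu, W, B):
    """Log of the marginal density in equation (9) for one population."""
    W_inv = np.linalg.inv(W)
    B_inv = np.linalg.inv(B)
    C = W / n + B                                   # n^{-1} W + B
    _, logdet_W = np.linalg.slogdet(2 * np.pi * W)
    _, logdet_B = np.linalg.slogdet(2 * np.pi * B)
    _, logdet_M = np.linalg.slogdet(2 * np.pi * np.linalg.inv(n * W_inv + B_inv))
    quad = (ybar - mu) @ np.linalg.solve(C, ybar - mu)
    return (-n / 2) * logdet_W - 0.5 * logdet_B + 0.5 * logdet_M \
        - 0.5 * (np.trace(S @ W_inv) + quad)

def bayes_factor_normal(y, mu1, W1, B1, mu2, W2, B2):
    """Equation (10): BF for the Normal between-source model (method 3.1).
    y is an (n, k) array of replicate measurements on the questioned item."""
    n = y.shape[0]
    ybar = y.mean(axis=0)
    S = (y - ybar).T @ (y - ybar)
    return np.exp(log_marginal_normal(ybar, S, n, mu1, W1, B1)
                  - log_marginal_normal(ybar, S, n, mu2, W2, B2))
```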

3.2. Bayes factor using a Kernel distribution for the between-source variability

In the previous section a multivariate Normal distribution was introduced to model the between-source variability. However, inspection of the scatter diagrams of the group means $\bar x_{pi}$ (i.e., $\bar x_{pi} = \frac{1}{n}\sum_{j=1}^{n} x_{pij}$) showed rather clearly that the normality assumption may not be appropriate. The assumption of normality can be removed by considering a kernel density estimate for the between-group distribution, as proposed by [3] in the context of glass fragments. The kernel density function is given by a multivariate Normal distribution with mean equal to the within-group mean $\bar x_{pi}$ and covariance matrix $h_p^{2} B_p$,


where $h_p$ is the smoothing parameter; this kernel is denoted by $K(\theta_p \mid \bar x_{pi}, B_p, h_p)$. Therefore, the kernel density estimate for the between-group distribution, denoted $\hat\pi(\theta_p \mid \bar x_{p1}, \ldots, \bar x_{pm_p}, B_p, h_p)$, is given by

$$\hat\pi(\theta_p \mid \bar x_{p1}, \ldots, \bar x_{pm_p}, B_p, h_p) = \frac{1}{m_p}\sum_{i=1}^{m_p} K(\theta_p \mid \bar x_{pi}, B_p, h_p) = \frac{1}{m_p}\sum_{i=1}^{m_p}(2\pi)^{-k/2}\left|h_p^{2}B_p\right|^{-1/2}\exp\left\{-\frac{1}{2}h_p^{-2}\left(\theta_p - \bar x_{pi}\right)' B_p^{-1}\left(\theta_p - \bar x_{pi}\right)\right\}.$$

The numerator (denominator) of the Bayes factor, for which proposition $H_1$ ($H_2$) is assumed true, can be shown ([3]) to be equal to

$$f_p(y \mid H_p, W_p, B_p, h_p) = \int_{\theta_p} f(\bar y \mid \theta_p, W_p)\,\hat\pi(\theta_p \mid \bar x_{p1}, \ldots, \bar x_{pm_p}, B_p, h_p)\,d\theta_p$$
$$= (2\pi)^{-k/2}\,|B_p|^{-1/2}\left(m_p h_p^{k}\right)^{-1}|D_p|^{-1/2}\left|D_p^{-1} + \left(h_p^{2}B_p\right)^{-1}\right|^{-1/2}\sum_{i=1}^{m_p}\exp\left\{-\frac{1}{2}\left(\bar y - \bar x_{pi}\right)'\left(D_p + h_p^{2}B_p\right)^{-1}\left(\bar y - \bar x_{pi}\right)\right\}, \qquad (11)$$

where $D_p = n^{-1}W_p$. The Bayes factor is then given by the ratio of the marginal probability densities computed in (11) for $p = 1, 2$, that is

$$BF = \frac{f_1(y \mid H_1, W_1, B_1, h_1)}{f_2(y \mid H_2, W_2, B_2, h_2)}. \qquad (12)$$


Again, this will be compared with the threshold $c$ specified in (7) and the observation will be classified accordingly. This classification criterion will be termed method 3.2.
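A corresponding sketch of the kernel-based marginal (11) and the resulting Bayes factor (12) is given below; the implementation details are again ours, and the group means, covariance matrices and smoothing parameters are assumed to be supplied from the background data of each population.

```python
import numpy as np
from scipy.special import logsumexp

def log_marginal_kernel(ybar, n, group_means, W, B, h):
    """Sketch of equation (11): log marginal density of the sample mean ybar when
    the between-source distribution is the kernel estimate centred at the
    background group means (rows of group_means), with covariance h^2 * B."""
    D = W / n                              # D_p = n^{-1} W_p
    C = D + h ** 2 * B                     # covariance of each mixture component
    _, logdet = np.linalg.slogdet(2 * np.pi * C)
    diffs = group_means - ybar
    quads = np.einsum('ij,jk,ik->i', diffs, np.linalg.inv(C), diffs)
    return -0.5 * logdet - np.log(len(group_means)) + logsumexp(-0.5 * quads)

def bayes_factor_kernel(y, means1, W1, B1, h1, means2, W2, B2, h2):
    """Equation (12): ratio of the kernel marginals for the two populations."""
    n, ybar = y.shape[0], y.mean(axis=0)
    return np.exp(log_marginal_kernel(ybar, n, means1, W1, B1, h1)
                  - log_marginal_kernel(ybar, n, means2, W2, B2, h2))
```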

4. Analysis

The objective of this section is to compare the performances of the classification criteria presented in Section 3.

4.1. Summary statistics


The overall mean $\mu_p$ and the within- and between-source variance-covariance matrices are estimated from the available background data. The overall mean $\mu_p$ is estimated by

$$\hat\mu_p = \bar x_p = \frac{1}{n m_p}\sum_{i=1}^{m_p}\sum_{j=1}^{n} x_{pij}.$$

The within-source covariance matrix is estimated by

$$\hat W_p = \frac{1}{m_p(n-1)}\sum_{i=1}^{m_p}\sum_{j=1}^{n}\left(x_{pij} - \bar x_{pi}\right)\left(x_{pij} - \bar x_{pi}\right)', \qquad (13)$$

where $\bar x_{pi} = \frac{1}{n}\sum_{j=1}^{n} x_{pij}$. The between-source covariance matrix is estimated by

$$\hat B_p = \frac{1}{m_p - 1}\sum_{i=1}^{m_p}\left(\bar x_{pi} - \bar x_p\right)\left(\bar x_{pi} - \bar x_p\right)' - \frac{1}{n m_p(n-1)}\sum_{i=1}^{m_p}\sum_{j=1}^{n}\left(x_{pij} - \bar x_{pi}\right)\left(x_{pij} - \bar x_{pi}\right)'. \qquad (14)$$

The smoothing parameter $h_p$ is estimated as in [6] and is given by

$$\hat h_p = \left(\frac{4}{2k+1}\right)^{1/(k+4)} m_p^{-1/(k+4)}. \qquad (15)$$
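Estimators (13)-(15) translate directly into code; the following sketch assumes the background data for one population are stored as an array of shape (m_p, n, k), as in the simulation sketch of Section 2.

```python
import numpy as np

def estimate_hyperparameters(x):
    """Estimators (13)-(15) from background data x of shape (m_p, n, k):
    overall mean, within-source covariance W_hat, between-source covariance
    B_hat and smoothing parameter h_hat."""
    m, n, k = x.shape
    group_means = x.mean(axis=1)                     # xbar_pi, shape (m, k)
    overall_mean = group_means.mean(axis=0)          # xbar_p

    dev_w = x - group_means[:, None, :]              # x_pij - xbar_pi
    W_hat = np.einsum('ijk,ijl->kl', dev_w, dev_w) / (m * (n - 1))   # eq. (13)

    dev_b = group_means - overall_mean               # xbar_pi - xbar_p
    B_hat = dev_b.T @ dev_b / (m - 1) - W_hat / n    # eq. (14)

    h_hat = (4 / (2 * k + 1)) ** (1 / (k + 4)) * m ** (-1 / (k + 4))  # eq. (15)
    return overall_mean, W_hat, B_hat, h_hat
```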


Table 2: Distribution of Bayes factor values obtained using methods 3.1 and 3.2 for population 1 (drug type).

BF range             Method (3.1)    Method (3.2)
< 10^-10                   1               0
10^-10 to 10^-9            0               0
10^-9 to 10^-8             1               0
10^-8 to 10^-7             0               0
10^-7 to 10^-6             0               0
10^-6 to 10^-5             0               0
10^-5 to 10^-4             0               1
10^-4 to 10^-3             0               0
10^-3 to 10^-2             0               1
10^-2 to 10^-1             0               2
10^-1 to 1                 2               3
1 to 10                    8              14
10 to 10^2                10              20
10^2 to 10^3              12              10
10^3 to 10^4              14              11
10^4 to 10^5              13               9
10^5 to 10^6               7               9
10^6 to 10^7               6              12
10^7 to 10^8               9               5
10^8 to 10^9               7               3
10^9 to 10^10              2               4
> 10^10                   40              28
BF > 1               128 (97%)       125 (95%)
BF < 1                 4 (3%)          7 (5%)


4.2. Comparative results between methods

To assess the performance of the proposed approach, for each plant from each population the available replicate measurements were considered and the plant was classified according to the classification criteria described in Section 3. The Bayes factor was therefore computed for each plant (either using a Normal distribution for the between-source variability as in Section 3.1, or a kernel distribution as in Section 3.2), while the measurements on the remaining plants in the database were used to estimate the hyperparameters $\mu_p$, $W_p$ and $B_p$, $p = 1, 2$. Tables 2 and 3 show the distribution of the Bayes factor values obtained for the population of drug type (Table 2) and the population of fibre type (Table 3). The last two lines report the classification rates for each model and each population (i.e., assuming populations a priori equally likely, $\pi_1 = \pi_2$, and symmetric losses, $l_1 = l_2$, as discussed in Section 3). Table 3 refers to the population of fibre type Cannabis and shows a better performance of the kernel distribution for the between-source variability (method 3.2): the correct classification rate is 94% with method 3.1 and rises to 99% with method 3.2; moreover, a smaller percentage of extremely small values is observed. As far as the false positives are concerned (i.e., plants of fibre type yielding a BF greater than 1 that are erroneously classified as drug type), one may observe that they are distributed over more moderate values than those obtained with method 3.1 (e.g., method 3.2 does not produce false positives with a BF larger than 10).

4.3. Sensitivity analysis

The classification rates in Tables 2 and 3 have been computed assuming symmetric loss functions and equal prior probabilities for the two populations. Therefore, with the prior odds and the loss ratio equal to 1, the optimal classification criterion was to classify the observed item as drug (fibre) type whenever the Bayes factor was greater (smaller) than 1. It might be argued that a symmetric loss function does not necessarily represent a coherent choice in this context. In fact, one could agree that falsely classifying an observation as drug type should be regarded more severely than the opposite (that is, falsely classifying an observation as fibre type). The quantification of the loss function is beyond the scope of this work and will not be addressed in this paper; however, a sensitivity analysis is conducted to study the performance of classification method 3.2 in the presence of asymmetric losses and prior odds not necessarily equal to 1, that is, when the populations are not a priori equally likely.
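The sensitivity analysis amounts to recomputing the classification rates over a grid of loss ratios and prior odds; a sketch of that loop is given below, with placeholder Bayes factor values standing in for the ones actually obtained with method 3.2 on the background data.

```python
import numpy as np

def correct_rates(bf_drug, bf_fibre, loss_ratio, prior_odds):
    """Correct classification rates for both populations at the threshold
    c = loss_ratio / prior_odds of equation (7); bf_drug and bf_fibre are the
    Bayes factors obtained for the background plants of each population."""
    c = loss_ratio / prior_odds
    rate_drug = np.mean(bf_drug > c)      # drug-type plants correctly above c
    rate_fibre = np.mean(bf_fibre <= c)   # fibre-type plants correctly below c
    return rate_drug, rate_fibre

# placeholder BF values, for illustration only
bf_drug = np.array([5.9, 120.0, 0.4, 3.0e3])
bf_fibre = np.array([0.01, 0.2, 1.5, 0.003])
for prior_odds in (4.0, 1.0, 0.25):
    for loss_ratio in (1, 2, 5, 10):
        print(prior_odds, loss_ratio, correct_rates(bf_drug, bf_fibre, loss_ratio, prior_odds))
```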


Table 3: Distribution of Bayes factor values obtained using methods 3.1 and 3.2 for population 2 (fibre type).

BF range             Method (3.1)    Method (3.2)
< 10^-10                   8               0
10^-10 to 10^-9            0               0
10^-9 to 10^-8             0               0
10^-8 to 10^-7             1               0
10^-7 to 10^-6             3               0
10^-6 to 10^-5             9               0
10^-5 to 10^-4            18               0
10^-4 to 10^-3            47               0
10^-3 to 10^-2            50             111
10^-2 to 10^-1             9              40
10^-1 to 1                 3               5
1 to 10                    2               2
10 to 10^2                 3               0
10^2 to 10^3               1               0
10^3 to 10^4               0               0
10^4 to 10^5               0               0
10^5 to 10^6               2               0
10^6 to 10^7               1               0
10^7 to 10^8               0               0
10^8 to 10^9               0               0
10^9 to 10^10              0               0
> 10^10                    1               0
BF > 1                10 (6%)         2 (1%)
BF < 1               148 (94%)      156 (99%)


Figure 1 shows the correct classification rate computed for an increasing loss ratio (from a minimum of 1, where $l_1 = l_2$, to a maximum of 10, where $l_1 = 10\,l_2$), and prior odds supporting proposition $H_1$ ($\pi_1/\pi_2 = 4$), supporting proposition $H_2$ ($\pi_1/\pi_2 = 0.25$), or neutral ($\pi_1/\pi_2 = 1$), respectively. For the population of illegal plants, the correct classification rate clearly decreases as the loss ratio takes larger values (Figure 1a): the more a misclassification into the population of drugs is penalized, the larger the number of false negatives observed (i.e., items of drug type classified as fibre type). However, it can be observed that even with prior odds unfavorable to $H_1$ ($\pi_1/\pi_2 = 0.25$), and an asymmetric loss function such that erroneously classifying a legal Cannabis plant is considered ten times worse than erroneously classifying an illegal plant (i.e., $l_1 = 10\,l_2$), the correct classification rate is nearly 80%. On the other hand, for the population of legal plants, the correct classification rate clearly grows as the loss ratio takes larger values (Figure 1b): the more a misclassification into the population of drugs is penalised, the fewer the false positives observed (items of fibre type classified as drug type).

5. Discussion and conclusion

Medical physicians, paleontologists and forensic scientists are routinely faced with the problem of classifying an observation of unknown origin into one of several populations on the basis of the available measurements of some attributes. Imagine, for the sake of illustration, the diagnostic process in medicine, where the task is the assignment of an individual to one of two categories (diseased or not diseased) on the basis of the available information. In a forensic context, as seen previously, an investigator would like to classify a measurement from an unknown item into a drug- or fibre-type Cannabis population. The statistical models specify that observations belonging to different populations yield different measurements, and this variability is expressed in probabilistic terms. Therefore, the scientist can treat the observation as a random draw from one of these populations, the distribution of which depends on the actual population. The (decision) problem is to classify the available observation in the correct population. A Bayesian decision procedure is proposed that can be summarized in the following way. Given an observation of unknown origin, the Bayes factor is computed and the exceeding (or not) of a given threshold allows the scientist to classify the observation in population 1 or in population 2, respectively.


[Figure 1 omitted: two panels (a and b) plotting the correct classification rate (%) against the loss ratio, with separate curves for prior odds of 4, 1 and 0.25.]

Figure 1: Correct classification rate computed with method 3.2 (kernel) for an increasing loss ratio $l_1/l_2$ and different prior odds $\pi_1/\pi_2$, for the population of drug type (a) and the population of fibre type (b).



This offers a very simple and intuitive classification criterion, and we believe this is one of the most remarkable advantages of the proposed approach in the context of interest. The necessity to choose a prior probability for the two propositions may be felt as a troublesome issue, since there is no ready-made recipe for this purpose (the probability is not a state of nature). Probabilities are personal since they depend on one's extent of knowledge; they may change as the information changes and may vary amongst individuals, depending on the available information and on the subjective assessment criteria. That is, a given proposition may be felt as almost certain by one individual, but far less likely by another, and there is no problem in principle with different individuals specifying different probabilities for the same event ([7]). The only strict requirement is the coherence to which the assessment criteria must obey: coherence has the normative role of forcing people to be honest and to make the best assessment of their own knowledge base. The same argument can be extended to the specification of the loss function, which may indeed be a difficult task. Again, there does not exist a 'correct' loss function, since each individual will have a personal system of preferences, and the undesirability of the consequences originating from incorrect decisions may be felt as more or less severe depending on the background context. The term 'personal' does not imply a connotation of arbitrariness for the proposed approach. Alternative devices can be found for measuring the value of consequences originating from decisions: what really matters, in a situation in which a decision maker is asked to make a choice among alternative courses of action having uncertain consequences, is that a rational behavior must be undertaken. This includes a coherent specification of the loss function, reflecting personal preferences among consequences (in terms of desirability/undesirability). We are aware that the elicitation of prior probabilities of propositions and the measurement of the value of consequences may be a demanding task, but we do not believe the underlined difficulties represent a severe drawback of the proposed approach. The decision maker is provided with a classification criterion that i) is simple to implement; ii) allows rational decision making under uncertainty; iii) makes a clear distinction between the evaluation of evidence (represented by the BF), which is the domain of the forensic laboratory, and the specification of the threshold with which the BF is compared (i.e., the ratio between the loss ratio and the prior odds), which is the domain of the investigative authorities. As far as the BF is concerned, it has been underlined in Section 3.1 that it is not a measure of the relative support for the alternative propositions provided solely by the data. This is not in contradiction with the previous statement iii), since the BF is influenced by the elicitation of the subjective prior densities for the model parameters under propositions H1 and H2, and this represents background knowledge that forensic laboratories may dispose of. Prior elicitation of population parameters must not be confused with prior probabilities of the alternative propositions. Values of the Bayes factor can also be used to assess what can be called 'misleading evidence'. Imagine a given situation where measurements on an item of unknown origin are available and a BF of, say, 5 is obtained.
The scientist may thus report that this evidence supports the hypothesis that the plant is of drug type by a factor of 5. Such a result may give rise to questions (in relation to its robustness) of the following kind: (a) how often may a scientist get such a BF for samples that actually come from the drug type population of Cannabis? and (b) how often may a scientist get such a BF for samples that do not come from that population but from an alternative one? Referring to Tables 2 and 3 (e.g., method 3.2), a Bayes factor in the range 1-10 is obtained in nearly 10% of cases for measurements from the drug type population (i.e., in 14 cases over 125 illegal plants analyzed), while it is obtained only in approximately 1% of cases for measurements from the fibre population (i.e., in 2 cases over 158 legal plants analyzed). The ratio between the percentage of BF values for the fibre type population and the total percentage of BF values in the range 1-10 gives a simple measure of the misleading evidence: it shows how often a given BF is misleading (i.e., supports the wrong hypothesis). In conclusion, it can be said that the Bayes factor does not simply allow the scientist to classify an observation into a population, but also to measure the robustness of the applied methodology through the quantification of false classification rates in both populations and, consequently, to report what can be called 'misleading evidence' in a specific situation of interest. The Bayesian decision-theoretic approach allows the decision maker to introduce into the classification procedure more contextual information, such as a reasonable loss function for the case at hand and prior probabilities for the two populations of interest.

Appendix

Consider, for the sake of illustration, a plant of illegal drug type (i.e., the recovered item); it is analyzed and its chemical profile is extracted. Three replicate measurements are taken and three variables are selected: Cannabidiol (CBD), D9-Tetrahydrocannabinol (THC) and Cannabinol (CBN). The available standardized measurements y on the recovered item to be classified (the rows represent the repeated measurements, the columns the selected variables) are


  −1.3040 0.2310  y =  −1.2918 0.2400  −1.0719 0.3176

0.6874 0.7350 0.9113

   


The Bayesian classification criterion described in Section 3 requires the computation of the Bayes factor (BF) and of the threshold $c$ with which it is compared, as in equation (7). The plant (whose origin is in general unknown) is correctly classified in population 1 whenever the BF exceeds the threshold $c$. To compute the Bayes factor, the unknown parameters can be estimated from the available background data, which consist of 132 plants of illegal drug type and 158 plants of legal fibre type. The overall means $\mu_1$ and $\mu_2$ are estimated by the overall sample means

$$\bar x_1 = \begin{bmatrix} -0.4827 & 0.9663 & 0.9225 \end{bmatrix}, \qquad \bar x_2 = \begin{bmatrix} 0.4075 & -0.7870 & -0.7610 \end{bmatrix}.$$


The within-source covariance matrices $W_1$ and $W_2$ are estimated by $\hat W_1$ and $\hat W_2$, from equation (13), as

$$\hat W_1 = \begin{bmatrix} 0.0191 & 0.0152 & 0.0104 \\ 0.0152 & 0.0152 & 0.0053 \\ 0.0104 & 0.0053 & 0.0940 \end{bmatrix}, \qquad \hat W_2 = \begin{bmatrix} 0.179\times 10^{-1} & 1.883\times 10^{-3} & -3.664\times 10^{-5} \\ 1.883\times 10^{-3} & 5.632\times 10^{-4} & 7.854\times 10^{-5} \\ -3.664\times 10^{-5} & 7.854\times 10^{-5} & 1.8609\times 10^{-2} \end{bmatrix}.$$

The between-source covariance matrices $B_1$ and $B_2$ are estimated by $\hat B_1$ and $\hat B_2$, from equation (14), as

$$\hat B_1 = \begin{bmatrix} 1.1775 & 0.6008 & 0.4046 \\ 0.6008 & 1.3090 & 1.0695 \\ 0.4046 & 1.0695 & 1.2496 \end{bmatrix}, \qquad \hat B_2 = \begin{bmatrix} 3.3069 & 0.1709 & 0.0550 \\ 0.1709 & 0.1987 & 0.1615 \\ 0.0550 & 0.1615 & 0.3356 \end{bmatrix}.$$

   

The smoothing parameters $h_1$ and $h_2$ are estimated as in equation (15) and are $\hat h_1 = 0.4595$ and $\hat h_2 = 0.4479$.


Moreover, given the available measurements $y$, one can compute the sample mean

$$\bar y = (-1.222492,\; 0.2629209,\; 0.777929),$$

and the matrix $S = \sum_{j=1}^{n}(y_j - \bar y)(y_j - \bar y)'$, that is

$$S = \begin{bmatrix} 0.0341 & 0.0124 & 0.0304 \\ 0.0124 & 0.0045 & 0.0111 \\ 0.0304 & 0.0111 & 0.0278 \end{bmatrix}.$$

Ac ce

Given the available measurements, the Bayes factor is computed to classify the recovered item in population 1 (drug type) or in population 2 (fibre type). According to method 3.1, the estimated parameters and the available measurements y are substituted in equation (10), and the BF is computed as

BF =

f1 (y | H1 , µ1 , W1 , B1 ) 5027.29 = = 14.72 f2 (y | H2 , µ2 , W2 , B2 ) 341.404

According to method 3.2, the estimated parameters and the available measurements y are substituted in equation (12), and the BF is computed as BF =

f1 (y | H1 , W1 , B1 , h1 ) 4021.16 = = 5.89 f2 (y | H2 , W2 , B2 , h2 ) 682.62

Assuming populations are a priori equally likely, and a symmetric loss function, a threshold equal to 1 is obtained and both methods allow to correctly classify the recovered item in population 1 (drug type) since the Bayes factor is greater than 1.

10

Page 13 of 14

References

Ac ce

pt

ed

M

an

us

cr

ip t

[1] J. Broséus, F. Anglada, P. Esseiva, The differentiation of fibre- and drug type Cannabis seedlings by gas chromatography/mass spectrometry and chemometric tools, Forensic Science International 200 (2010) 87-92.
[2] J. Broséus, J. Vallat, P. Esseiva, Multi-class differentiation of cannabis seedlings in a forensic context, Chemometrics and Intelligent Laboratory Systems 107 (2) (2011) 343-350.
[3] C. G. G. Aitken, D. Lucy, Evaluation of trace evidence in the form of multivariate data, Applied Statistics 53 (2004) 109-122.
[4] S. Bozza, F. Taroni, R. Marquis, M. Schmittbuhl, Probabilistic evaluation of handwriting evidence: likelihood ratio for authorship, Applied Statistics 57 (2008) 329-341.
[5] T. W. Anderson, An Introduction to Multivariate Statistical Analysis, 3rd Edition, John Wiley and Sons, Hoboken, 2003.
[6] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman & Hall, London, 1986.
[7] D. V. Lindley, The philosophy of statistics, The Statistician 49 (2000) 293-337.



