Biometrical Journal 00 (2014) 00, 1–19

DOI: 10.1002/bimj.201300150


On measures of dissimilarity between point patterns: Classification based on prototypes and multidimensional scaling

Jorge Mateu∗,1, Frederic P. Schoenberg2, David M. Diez3, Jonatan A. González1, and Weipeng Lu2

1 Department of Mathematics, University Jaume I, E-12071, Castellon, Spain
2 Department of Statistics, UCLA, 8125 Math-Sciences, Los Angeles, CA 90095, USA
3 Department of Biostatistics, Harvard SPH, MA 02115, Boston, USA

Received 29 July 2013; revised 11 June 2014; accepted 11 June 2014

This paper presents a collection of dissimilarity measures to describe and then classify spatial point patterns when multiple replicates of different types are available for analysis. In particular, we consider a range of distances including the spike-time distance and its variants, as well as cluster-based distances and dissimilarity measures based on classical statistical summaries of point patterns. We review and explore, in the form of a tutorial, their uses and their pros and cons. These distances are then used to summarize and describe collections of repeated realizations of point patterns via prototypes and multidimensional scaling. We also present a simulation study to evaluate the performance of multidimensional scaling with two types of selected distances. Finally, a multivariate spatial point pattern of a natural plant community is analyzed through several of these measures of dissimilarity.

Keywords: Classification; K-function; Multidimensional scaling; Point patterns; Prototypes; Spike-time distance.



Additional supporting information may be found in the online version of this article at the publisher’s web-site

∗ Corresponding author: e-mail: [email protected], Phone: +34-964-728-391, Fax: +34-964-728-429

© 2014 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim

1 Introduction

Point processes, which are random collections of points falling in some measurable space, have found use in describing an increasing number of naturally arising phenomena in a wide variety of applications, including epidemiology, ecology, forestry, mining, hydrology, astronomy, and meteorology (Cox and Isham, 1980; Ripley, 1981; Daley and Vere-Jones, 2003; Møller and Waagepetersen, 2004; Schoenberg and Tranbarger, 2008; Tranbarger and Schoenberg, 2010; Schoenberg, 2011). In particular, we refer the more applied reader to a recent book by Illian et al. (2008). In the mid-20th century, interest extended to spatial point processes, where each point represented the location of some object or event, such as a tree or sighting of a species (Ripley, 1981; Cressie, 1993; Diggle, 2003).

By the mid-1990s, a wealth of large multiple neuronal spike train datasets began to become available to neuroscientists, and the analysis of this type of data required some new methods (Brown et al., 2004; Kulkarni and Paninski, 2007). Unlike data from seismology or epidemiology, for instance, where the events were naturally seen as one realization of a point process, the neuronal data typically consisted of many repeated observations of a point pattern. For instance, one might observe, for several subjects, the times of firings of neurons in a particular part of the brain in the instant immediately following a stimulus. In order to classify neuronal spike trains into clusters or to differentiate between diseased and healthy patients based on their neuronal firing patterns, one needs methods that rest on the definition of a distance between two point patterns. The seminal work of Victor and Purpura (1997) proposed several distance metrics, including the spike-time distance, which the authors used for describing neuronal spike trains. The list of distances in Victor and Purpura (1997) is far from exhaustive, however, and certain alternative types of distances and nonmetric measures of dissimilarity are more useful for clustered or inhomogeneous point patterns or for point patterns in high-dimensional spaces.

Literature on spatial point patterns has mostly focused on modeling the (spatial) distribution of the locations, where only distances among such locations played a role. Little attention has been paid to measuring distances between point patterns (understood as samples or realizations of a stochastic point process) to solve, for example, clustering or classification problems or prototype determination.

The essential focus of this article is the study of dissimilarity measures for classification of point patterns when multiple replicates of patterns of different types are available. In particular, we review in a tutorial form several types of distances (and nonmetric measures of dissimilarity) between two point patterns, X and Y, each observed on the same metric space. We comment on their uses, and on their pros and cons. These distances are then used to summarize, describe, and finally classify collections of repeated realizations of a point pattern via prototypes and multidimensional scaling. We run some simulations to gain insight into the performance of multidimensional scaling when using two of the proposed distance metrics, and we apply some of the most relevant techniques to the dataset of a plant community in Western Australia. The paper is outlined as follows.
In Section 2, we discuss the spike-time distance and its variants which involve matching individual points of X to corresponding points in Y. Dissimilarity measures based on functional or scalar descriptors of point processes, including estimators of first and second moments of the processes, or classical test statistics based on these moments, are described in Section 3. Applications to summaries of collections of independent realizations of a point process via prototypes or multidimensional scaling are given in Section 4. This section also includes a simulation study. Section 5 considers the analysis of a plant community dataset. The paper ends with a section containing some discussion and further remarks.

1.1 A Western Australian plant community

The planar point patterns formed by plant communities are a distinctive area of application in the field of spatial point processes. This paper aims to present a new approach for analyzing spatial patterns from a plant community with high biodiversity. The data considered here come from Cataby, in the Mediterranean-type shrub- and heathland of the southwestern area of Western Australia (Beard, 1984). The locations of 6378 plants from 67 species of seeders and resprouters on a 22 × 22 m plot were recorded (see Armstrong, 1991; Illian et al., 2008; Illian et al., 2009; and references therein). This is a clear example of a sampled area with a large number of species. The geology of the area implies that plants can interact mutually, inhibiting each other through competition for resources. There may also be attraction if some species modify the resource conditions of microhabitats in favor of others. Some regions are particularly sensitive to a high rate of annual fires, and plants there have adapted using different forms of regeneration. For example, resprouting species can survive fires by regrowing from the subsoil root, while seeding species die in the fire although their seeds are released (see Armstrong, 1991; Illian et al., 2009). Figure 1 shows the planar point patterns of the 20 most abundant species. The patterns display different intensities (some appear inhomogeneous) and clustering behaviors. A table with information on the correspondence between our chosen species and the species in Illian et al. (2008) is provided in the Supporting Information. The sampled area, located in the southwest of Australia, is notable for its large number of species, its large number of rare flora, and its high biodiversity. Lying within a mineral sand mining area, the study area was mined shortly after data collection in 1990 and will have to be renaturated as soon as mining has ceased.
Current efforts toward renaturation in neighboring areas, however, have resulted in a very low survival rate for some species. The characteristic sandy soil in the area is extremely low in nutrients and water. Hence, individual plants may interact in various ways by inhibiting each other’s growth while competing for scarce resources. Attraction may also occur if conditions (shade, nutrients, water availability) in the microhabitat are made more suitable for a plant by the presence of another plant. The vegetation in the area is Banksia low woodland, which is susceptible to regular natural fires occurring approximately every ten years. Plants have adapted to this through the development of different regeneration strategies, so the specific regeneration strategy will have an impact on the pattern formed by the individual plants and also on the structure of the interaction between species (see Illian et al., 2009, and references therein).

Figure 1 Observed planar point patterns of the 20 most abundant and influential species (five Seeders and 15 Resprouters) in the Western Australia plant community.

Figure 2 Spike-time distance setting $p_a = p_d = 1$ and $p_m = 2.5$ as penalties (left); and nearest-point distance for moving X to Y (right).

2 Pointwise distances between point patterns through transformations

Following some standard treatments such as Cressie (1993), Daley and Vere-Jones (2003), Møller and Waagepetersen (2004) and Illian et al. (2008), we refer to a locally finite collection of points falling in some space as a point pattern or point configuration, and a random point pattern, that is, a mapping from a probability space to the set of point patterns, is called a point process.

One family of distances between point patterns is characterized by a transformation of one point pattern X into another point pattern Y by actions on individual points. Individual points in X are moved so that the resulting pattern resembles Y. The spike-time distance developed by Victor and Purpura (1997) is one popular distance metric that has been applied to neuronal, earthquake, and wildfire data (Victor and Purpura, 1997; Schoenberg and Tranbarger, 2008; Tranbarger and Schoenberg, 2010; Nichols et al., 2011; Diez et al., 2012). Techniques for computing this distance metric were only recently extended to multiple dimensions (Diez et al., 2010; Diez et al., 2012). The distance is defined as the minimum cost associated with the transformation of one point pattern X into a pattern Y by deleting, adding, and moving points. The left panel of Fig. 2 represents a transformation of X into Y via actions on individual points of X. Those points in X not moved are deleted from X, and those points in Y to which no point in X corresponds are added to X. If we describe such a transformation as T, then a possible cost associated with T is

$$C(T \mid X, Y) = p_d\,|X_{\mathrm{delete}}| + p_a\,|Y_{\mathrm{add}}| + \sum_{x \in X_{\mathrm{move}}} p_m\, d_x \qquad (1)$$

where $p_d$, $p_a$, and $p_m$ are penalty parameters; $X_{\mathrm{delete}}$, $Y_{\mathrm{add}}$, and $X_{\mathrm{move}}$ are the subsets of X and Y that are deleted, added, and moved, respectively; $d_x$ is the distance a point x in $X_{\mathrm{move}}$ is moved; and |S| represents the number of points in a set S. The spike-time distance, $d_{st}(X, Y)$, is defined as the infimum of the cost C(T | X, Y) over all such possible transformations T. The transformation shown in the left panel of Fig. 2 is the transformation minimizing (1) for a particular set of parameters ($p_a = p_d = 1$ and $p_m = 2.5$). Victor and Purpura (1997) showed that spike-time distance is a well-defined distance metric, and they also considered other closely related distances. Spike-time distance in single or multiple dimensions may be computed using the free R statistical software via the stDist function in the ppMeasures package (R Development Core Team, 2010; Diez et al., 2010). Further information can be found in the Supporting Information.
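As a rough illustration of how the infimum of the cost in (1) can be found, the following Python sketch handles only the one-dimensional (temporal) case, using the edit-distance style dynamic program associated with Victor and Purpura (1997). It is not the stDist routine of ppMeasures; the function name and argument names are assumptions made for this example, and the default penalties simply reuse the values quoted for Fig. 2.

```python
def spike_time_distance_1d(x, y, p_add=1.0, p_delete=1.0, p_move=2.5):
    """Spike-time distance between two 1-D point patterns (illustrative sketch).

    Hypothetical helper: names and defaults are assumptions, not the
    ppMeasures API. Only valid for points on the real line.
    """
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    # cost[i][j] = minimal cost of transforming the first i points of xs
    # into the first j points of ys.
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * p_delete          # delete every point of X
    for j in range(1, m + 1):
        cost[0][j] = j * p_add             # add every point of Y
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + p_delete,                         # delete xs[i-1]
                cost[i][j - 1] + p_add,                            # add ys[j-1]
                cost[i - 1][j - 1]
                + p_move * abs(xs[i - 1] - ys[j - 1]),             # move xs[i-1] onto ys[j-1]
            )
    return cost[n][m]


if __name__ == "__main__":
    # The point 1.0 is moved to 1.2 (cost 2.5 * 0.2); moving 4.0 to 9.0 would
    # cost 12.5, so it is cheaper to delete 4.0 and add 9.0 (cost 2).
    print(spike_time_distance_1d([1.0, 4.0], [1.2, 9.0]))
```

With these penalties a point is moved only when its displacement is below $(p_a + p_d)/p_m = 0.8$; beyond that, deleting and re-adding is cheaper. In more than one dimension the matching is no longer order-preserving, so a different algorithm is needed.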


Figure 3 The first two steps in the cluster distance are shown in the left and center panels (Clusters 1, 2, and 3). The rightmost panel represents the process of removing the clustering characteristics prior to applying spike-time distance.

Another useful distance function that is also defined in terms of such transformations of X into Y is the nearest-point distance, where each point x in X is moved to its nearest neighbor in Y; call this nearest point $y_x$ (see the right panel of Fig. 2). Unlike spike-time distance, nearest-point distances are computed without allowing the addition or deletion of points. The nearest-point distance is defined simply as

$$d_n(X, Y) = \sum_{x \in X} \|x - y_x\|$$

where $\|\cdot\|$ may represent the Euclidean distance for point patterns in $\mathbb{R}^k$, or some other distance metric for point patterns in a more general metric space. The measure $d_n$ is useful in facility placement. For instance, if points in X represent consumers and points in Y represent facilities, then $d_n$ characterizes the total cost for all consumers to visit their closest facility. Note that for two distinct points $x, x'$ in X, we may have $y_x = y_{x'}$, and that in general $d_n(X, Y) \neq d_n(Y, X)$, so that $d_n$ is not formally a distance metric. However, the sum of these two distances forms a symmetric nearest-point distance metric $d_N(X, Y) = d_n(X, Y) + d_n(Y, X)$.

A representative characterization of a collection of point patterns is provided by the point pattern prototype, which was defined by Schoenberg and Tranbarger (2008) as the point pattern with minimal total distance to the point patterns in the observed collection (see details in Section 4). When a point pattern is highly clustered, metrics such as spike-time distance tend to yield prototypes that do not reflect the typical behavior of the point patterns in the collection. Clustered processes thus require a separate family of distances where movements of collections of points are permitted. For example, consider the following metric, called cluster distance. Let T represent a transformation of X into Y that involves sequentially moving collections of points in X. Thus, T is itself a sequence of transformations, $t_i$, where each $t_i$ moves a subset $X_i$ of points by a vector $z_i$ (see the first two panels of Fig. 3). Then the cost associated with T is defined as

$$p_d\,|X_{\mathrm{delete}}| + p_a\,|Y_{\mathrm{add}}| + \sum_i p_m\, |X_i|^q\, \|z_i\| \qquad (2)$$

where q is a parameter in [0, 1). The cluster distance, $d_c(X, Y)$, is the infimum cost over all such transformations, T. The further each cluster must be moved (the larger $\|z_i\|$ is), and the more points that must be moved in order to match X up with Y (the larger $|X_i|$ is), the larger the cluster distance between X and Y.

A similar type of distance for clustered point processes is defined by first aligning clusters in X with clusters in Y and subsequently applying simple spike-time distance. For instance, let $\{R_j\}$ represent a set of disjoint and convex regions of the space that contain all the points of X. One may translate all of the points in a given region by some fixed vector and repeat for each region, assigning a cost of $p_c$ per unit distance to each translated region, and then the spike-time distance may be applied after these regions are moved. This declustered spike-time distance is defined as the minimum cost over all such transformations and choices of $\{R_j\}$. An illustration is given in the right panel of Fig. 3.

A drawback of both the cluster distance and the declustered spike-time distance is that methods for their efficient computation appear to be unavailable presently, and searching over all possible translations of all possible subsets of points in X is extremely computationally burdensome. An advantage of the cluster distance is that the clusters $X_i$ and their associated translation vectors, $z_i$, may have physical meaning and be of direct interest. For instance, when the points are observed in time, a subset of point pattern X may be nearly equivalent to a subset of point pattern Y up to a temporal delay in response to stimuli; in the purely spatial case, the points in X or Y may be translated due to physical deformation or may require spatial translation due to miscalibration of the recording instruments.
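The nearest-point distance $d_n$ and its symmetrized form $d_N$ introduced earlier in this section are simple enough to compute by brute force. The sketch below assumes Euclidean distance and points given as coordinate tuples; the function names are hypothetical and chosen only for this illustration.

```python
import math


def nearest_point_distance(x, y):
    """d_n(X, Y): sum over points of X of the distance to the nearest point of Y.

    Brute-force sketch assuming Euclidean distance; y must be non-empty.
    """
    if not y:
        raise ValueError("Y must contain at least one point")
    return sum(min(math.dist(p, q) for q in y) for p in x)


def symmetric_nearest_point_distance(x, y):
    """d_N(X, Y) = d_n(X, Y) + d_n(Y, X)."""
    return nearest_point_distance(x, y) + nearest_point_distance(y, x)


if __name__ == "__main__":
    consumers = [(0.0, 0.0), (1.0, 0.0)]   # pattern X
    facilities = [(0.0, 0.0)]              # pattern Y
    print(nearest_point_distance(consumers, facilities))    # 1.0
    print(nearest_point_distance(facilities, consumers))    # 0.0
    print(symmetric_nearest_point_distance(consumers, facilities))  # 1.0
```

The small example makes the asymmetry concrete: both consumers map to the same facility, so $d_n(X, Y) = 1$ while $d_n(Y, X) = 0$.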

3 Dissimilarity measures based on numerical and functional summary statistics

In this section, we describe further techniques for quantifying the difference between point patterns. The considered metrics, such as integrals of the squared differences between K-functions or empty-space functions, are useful for comparison of important pattern characteristics.

3.1 Model-based distances

A temporal point process model is typically specified not in terms of its likelihood but rather in terms of its conditional intensity, that is, the rate at which points occur at the time in question, given information on all previous points. This turns out to be an intuitively appealing way to formulate point process models, as well as being necessary for many models. In the planar case, the lack of a natural ordering in a two-dimensional space implies that there is no natural generalization of the conditional intensity of a temporal process given the past or history up to time t. Instead, the appropriate counterpart for a spatial point process is the Papangelou conditional intensity (Daley and Vere-Jones, 2003; Illian et al., 2008), say λ(u, X), which conditions on the outcome of the process at all spatial locations other than u. Informally, λ(u, X)du gives the conditional probability of finding a point of the process inside an infinitesimal neighborhood of the location u, given complete information on the point pattern at all other locations. The Papangelou conditional intensity of a finite point process uniquely determines its probability density and vice versa.

Given specific models for the point processes giving rise to the point patterns X and Y, one may define the distance between X and Y in terms of the differences between characteristics of these models. For instance, if X and Y are characterized by their conditional intensities $\lambda_X(\eta, X)$ and $\lambda_Y(\eta, Y)$, which are random quantities, then one metric summarizing the difference between these point process models is the integral of the squared difference of the two expected values of the conditional intensities (thus the unconditional intensities), denoted by $\lambda_X(\eta)$ and $\lambda_Y(\eta)$, respectively, over the observation region W

$$d_\lambda(X, Y) = \int_W \left(\lambda_X(\eta) - \lambda_Y(\eta)\right)^2 \, d\eta. \qquad (3)$$
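A numerical version of Eq. (3) can be sketched by estimating each intensity with a kernel smoother and approximating the integral over W by a midpoint Riemann sum. The Python sketch below makes several assumptions that are not part of the text: a Gaussian kernel without edge correction, a fixed bandwidth, a rectangular window, and hypothetical function and parameter names.

```python
import math


def kernel_intensity(points, u, bandwidth):
    """Gaussian kernel estimate of the intensity at location u (no edge correction)."""
    norm = 2.0 * math.pi * bandwidth ** 2
    return sum(
        math.exp(-((u[0] - p[0]) ** 2 + (u[1] - p[1]) ** 2) / (2.0 * bandwidth ** 2)) / norm
        for p in points
    )


def intensity_distance(x, y, bandwidth=0.1, grid=40, window=(0.0, 1.0, 0.0, 1.0)):
    """Approximate the integrated squared intensity difference of Eq. (3).

    window = (xmin, xmax, ymin, ymax) is an assumed rectangular region W;
    the integral is a midpoint Riemann sum on a grid x grid lattice.
    """
    x0, x1, y0, y1 = window
    dx, dy = (x1 - x0) / grid, (y1 - y0) / grid
    total = 0.0
    for i in range(grid):
        for j in range(grid):
            u = (x0 + (i + 0.5) * dx, y0 + (j + 0.5) * dy)   # cell midpoint
            diff = kernel_intensity(x, u, bandwidth) - kernel_intensity(y, u, bandwidth)
            total += diff * diff * dx * dy
    return total


if __name__ == "__main__":
    pattern_x = [(0.2, 0.2), (0.25, 0.3), (0.3, 0.2)]
    pattern_y = [(0.7, 0.8), (0.8, 0.7), (0.75, 0.75)]
    print(intensity_distance(pattern_x, pattern_y))
    print(intensity_distance(pattern_x, pattern_x))   # 0.0
```

The result depends strongly on the bandwidth: a very small bandwidth makes almost any two patterns look far apart, while a very large one washes out their differences, so in practice the bandwidth would be chosen with the same care as in any kernel intensity estimate.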

The distance in Eq. (3) is illustrated in the left panel of Fig. 4, where the intensity functions (corresponding to expected values of the conditional intensities) have been estimated based on the data shown at the top of the panel. These methods readily extend to a variety of other summaries, such as the second moment measure or higher moments, or cumulants of the processes (see Illian et al., 2008).

Figure 4 Left: A comparison of two intensity models using Eq. (3). In this illustration, the intensities $\lambda_X$ and $\lambda_Y$ are estimated by kernel smoothing the points in X and Y, respectively. Right: An example using Eq. (4) to characterize the difference in clustering behavior of two patterns using Ripley’s K-function.

3.2 Distances based on functional statistical summaries

Differences between point patterns can also be characterized by comparing summaries of the first or second moments for each pattern. For example, given two point patterns X and Y, one could estimate the overall intensities of X and Y, for example, using a nonparametric kernel estimator. Alternatively, if the patterns are one-dimensional, then one could look at the empirical cumulative distribution function of a particular selected distance for each point pattern as a statistical summary of the realization. A useful summary statistic is the nonparametric estimator of Ripley’s K-function (Ripley, 1977), which is essentially a renormalized empirical distribution of the pairwise distances between observed points. More concretely, for a stationary point process with intensity λ (the mean number of points per unit area), λK(r) is the expected number of other points of the process within a distance r of a typical point of the process. Thus, a further possibility for defining distances between point patterns would be to use the estimated K-function or its derivative, the reduced second moment measure, for each pattern, and examine the quadratic difference. These functional summary statistics can also be adapted and used in nonstationary contexts, where the intensity of the point pattern is no longer a constant λ but a function of the spatial locations and/or covariates. For example, a general estimator for the K-function (see Møller and Waagepetersen, 2004; Illian et al., 2008) is given by

Kˆ (r) =

=  1[0
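The quadratic comparison of estimated K-functions described in this subsection can be sketched in a few lines of Python. The sketch assumes the simplest stationary estimator, $\hat{K}(r) = |W| \, \{n(n-1)\}^{-1} \sum_{i \neq j} 1[\|x_i - x_j\| \le r]$, with no edge correction and no inhomogeneous weighting, and it assumes the dissimilarity is the integrated squared difference over $[0, r_{\max}]$. Function names, the window area, and $r_{\max}$ are all hypothetical choices for illustration.

```python
import math


def k_estimate(points, r, area=1.0):
    """Naive stationary estimate of Ripley's K at distance r (no edge correction)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    # Count ordered pairs (i, j), i != j, lying within distance r of each other.
    close_pairs = sum(
        1
        for i, p in enumerate(points)
        for j, q in enumerate(points)
        if i != j and math.dist(p, q) <= r
    )
    return area * close_pairs / (n * (n - 1))


def k_function_distance(x, y, r_max=0.25, steps=50, area=1.0):
    """Integrated squared difference of the two estimated K-functions on [0, r_max]."""
    dr = r_max / steps
    total = 0.0
    for k in range(steps):
        r = (k + 0.5) * dr                       # midpoint of the k-th subinterval
        diff = k_estimate(x, r, area) - k_estimate(y, r, area)
        total += diff * diff * dr
    return total


if __name__ == "__main__":
    clustered = [(0.10, 0.10), (0.12, 0.10), (0.11, 0.12)]
    spread = [(0.1, 0.1), (0.5, 0.5), (0.9, 0.9)]
    print(k_function_distance(clustered, spread))
    print(k_function_distance(clustered, clustered))   # 0.0
```

Because the K-function is invariant to translation of a pattern, a dissimilarity of this kind compares clustering behavior rather than location: two patterns with the same internal structure in different parts of the window have distance zero.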
