Biometrics 72, 649–658 June 2016

DOI: 10.1111/biom.12431

Capitalizing on Opportunistic Data for Monitoring Relative Abundances of Species

Christophe Giraud,1,2,* Clément Calenge,3,** Camille Coron,1,*** and Romain Julliard4,****

1 Laboratoire de Mathématiques d’Orsay, UMR 8628, Université Paris-Sud, 91405 Orsay Cedex, France
2 CMAP, UMR 7641, Ecole Polytechnique, route de Saclay, 91128 Palaiseau Cedex, France
3 Office national de la chasse et de la faune sauvage, Saint Benoist, BP 20, 78612 Le Perray en Yvelines, France
4 CESCO, UMR 7204, MNHN-CNRS-UPMC, CP51, 55 rue Buffon, 75005 Paris, France
* email: [email protected]
** email: [email protected]
*** email: [email protected]
**** email: [email protected]

Summary. With the internet, a massive amount of information on species abundance can be collected by citizen science programs. However, these data are often difficult to use directly in statistical inference, as their collection is generally opportunistic, and the distribution of the sampling effort is often not known. In this article, we develop a general statistical framework to combine such “opportunistic data” with data collected using schemes characterized by a known sampling effort. Under some structural assumptions regarding the sampling effort and detectability, our approach makes it possible to estimate the relative abundance of several species in different sites. It can be implemented through a simple generalized linear model. We illustrate the framework with typical bird datasets from the Aquitaine region in south-western France. We show that, under some assumptions, our approach provides estimates that are more precise than the ones obtained from the dataset with a known sampling effort alone. When the opportunistic data are abundant, the gain in precision may be considerable, especially for rare species. We also show that estimates can be obtained even for species recorded only in the opportunistic scheme. Opportunistic data combined with a relatively small amount of data collected with a known effort may thus provide access to accurate and precise estimates of quantitative changes in relative abundance over space and/or time.

Key words: Detection probability; Opportunistic data; Sampling effort; Species distribution map.

© 2015, The International Biometric Society

1. Introduction

How species abundance varies in space and time is a major issue both for basic (biogeography, macroecology) and applied (production of biodiversity state indicators) ecology. Professionals working on biodiversity thus spend considerable resources collecting data that are suitable for estimating this variation (Yoccoz et al., 2001). Most of the scientific literature recommends the implementation of both a statistically valid sampling design and a standardized protocol for collecting such data (e.g., see Williams et al., 2002, for a review). Many methods have been developed to estimate species abundance in a defined location, e.g., using mark-recapture methods (Seber, 1982) or distance sampling approaches (Buckland et al., 1993). However, these approaches require an intense sampling effort and are not always practical.

Many authors have noted that most frequently interest will not be in abundance itself but in either the rate of population change, i.e., the ratio of abundance at the same location at two different time points, or in the relative abundance, i.e., the ratio of abundance at the same time point at two separate locations (MacKenzie and Kendall, 2002). Relative abundance is frequently monitored with the help of simpler schemes. For instance, a set of sites is randomly sampled in the area of interest, and counts of organisms are obtained from these sites using a given protocol. At a given location, the resulting count can be used as an index of the true abundance. Indeed, assuming constant detectability over space and time, the average number of animals counted per sampled site is proportional to the true abundance of the species in the area. Log-linear models can be used to represent this average number of animals detected per site as a function of space and/or time (and possibly other factors such as habitat; see, e.g., van Strien and Pannekoek, 2001) and thereby to infer population trends. Such programs have been implemented in many countries to monitor the changes in the abundance of several groups of species, such as birds (e.g., for the French Breeding Bird Survey, see Julliard et al., 2004) and butterflies (e.g., for the European Butterfly Monitoring Scheme, see van Swaay et al., 2008). Estimates of relative abundance have also been commonly used for mapping the spatial distribution of several species (e.g., Gibbons et al., 2007).

In addition to data characterized by a known sampling effort, a large amount of data can also be collected by non-standardized means, with no sampling design and no standardized protocol. In particular, the distribution of the observers and of their sampling effort is often unknown (Dickinson et al., 2010). These so-called “opportunistic data” have always existed, and with the recent development of citizen science programs, we observe a massive increase in the collection of these data on a growing number of species (e.g., Dickinson et al., 2010; Hochachka et al., 2012; Dickinson et al., 2012). Additionally, as the use of online databases facilitates the exchange and storage of data, such opportunistic data may now include millions of new observations per year that are collected in areas covering hundreds of thousands of square kilometers (e.g., the Global Biodiversity Information Facility, including more than 500 million records at the time of writing; see Yesson et al., 2007).

The temporal and spatial distribution of the observations in such data reflects unknown distributions of both observational efforts and biodiversity. Thus, a report of a high number of individuals of a given species at a given location compared to other locations could be because the focus species is abundant at this location or because numerous observers were present at this location. Using such opportunistic data to estimate changes in species abundance over space and time is therefore complex, because any modeling approach should include a submodel of the observation process (Kéry et al., 2009; Hochachka et al., 2012) or an attempt to manipulate the data to remove the bias caused by unequal effort (see a discussion in Phillips et al., 2009).

As noted by MacKenzie et al. (2005), “In some situations, it may be appropriate to share or borrow information about population parameters for rare species from multiple data sources. The general concept is that by combining the data, where appropriate, more accurate estimates of the parameters may be obtained.” In this article, we propose a general framework that makes it possible to combine data with known and unknown sampling efforts. We focus on multi-species and multi-site data that correspond to the data typically collected in this context.
Our purpose is to estimate the relative abundance of the species at the different sites. The term “sites” can either refer to different spatial sites, to different times, or to different combinations of sites and times. We base this estimation on two datasets recording the number of animals detected by observers for each species of a pool of species of interest and each spatial unit of a study area of interest: (i) One “standardized” dataset is collected under a program characterized by a known sampling effort, possibly varying among spatial units, and (ii) one “opportunistic” dataset is characterized by a completely unknown sampling effort. We take into account the variation in detectability among species; as a first step, we do so by assuming that the observational bias toward some species is the same across the different sites. We show that, under this assumption, the information concerning both the distribution of the observational effort and the biodiversity can be efficiently retrieved from “opportunistic” data by combining them with standardized data. Moreover, we prove that such a combination returns more accurate estimates than when using the standardized data alone. Our statistical framework allowing this win-win combination can open numerous avenues for application. We use data on French birds, which are typical of existing data, to illustrate the numerous qualities of this framework. Note, however, that the work presented in this article is a first step and that further work will be required to fully account for observational bias potentially varying with habitat type.

During the reviewing process of this article, we became aware of independent and simultaneous work by Fithian et al. (2015), who develop similar ideas for combining multi-species and multi-site data, using thinned Poisson models. We refer to the web Appendix A.3 for a discussion of the similarities and differences between the two approaches.

2. Combining Standardized and Opportunistic Data

2.1. Statistical Modeling

We want to estimate the relative abundance of I species at J sites. We suppose that we have access to two datasets, indexed by k, which consist of counts for each species i at each site j. We have in mind the case where the dataset k = 0 has been collected using a standardized protocol, while the dataset k = 1 is of an opportunistic nature. For illustration, let us briefly describe the two datasets discussed in Section 4. The “standardized” dataset (k = 0) gives the observed abundance of several bird species at several randomly sampled sites, all obtained using the same observation protocol (10 minutes, within 4 hours after sunrise, in appropriate weather conditions). The opportunistic dataset (k = 1) was collected from the largest French bird-watcher NGO, for which any citizen can record online (www.fauneaquitaine.org) his/her own observations (no observation protocol, no complete check-list).

Let $X_{ijk}$ denote the count gathering all the records of individuals of species i from all visits to site j for dataset k. We assume that an individual is only counted once during a single visit; however, it can be counted several times due to the possible multiple visits to a site j for a dataset k. In particular, we may have $X_{ijk}$ larger than the number $N_{ij}$ of individuals of species i at site j. As detailed in web Appendix A.2.1, under mild assumptions, the counts $X_{ijk}$ can be modeled by

$$X_{ijk} \sim \mathrm{Poisson}(N_{ij} O_{ijk}), \quad \text{for } i = 1, \ldots, I,\ j = 1, \ldots, J \text{ and } k = 0, 1, \tag{1}$$

where $N_{ij}$ is the number of individuals (animals, plants, etc.) of a species i at site j, and $O_{ijk}$ is a nuisance term due to the observational process. The term $O_{ijk}$ can be larger than 1 when the number of visits to site j for dataset k is large, because an individual can be counted several times during multiple visits (see web Appendix A.2.1 for details).
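To see why the nuisance term can exceed 1, consider a minimal sketch under a simplifying assumption that is ours and not taken from the web Appendix: each of the N individuals is detected independently, with the same probability, at each visit. The function and parameter names are hypothetical. Under this assumption the expected total count is N × (number of visits) × (detection probability), so the product of the last two factors plays the role of the nuisance term O.

```python
import random

def expected_count(n_individuals, n_visits, p_detect):
    """Expected total count under the illustrative per-visit detection assumption."""
    return n_individuals * n_visits * p_detect

def simulate_count(n_individuals, n_visits, p_detect, rng):
    """Sum over visits of the individuals detected (each at most once per visit)."""
    total = 0
    for _ in range(n_visits):
        total += sum(1 for _ in range(n_individuals) if rng.random() < p_detect)
    return total

rng = random.Random(0)           # fixed seed so the draw is reproducible
n, visits, p = 20, 8, 0.25       # made-up values: O = visits * p = 2 > 1
mean_count = expected_count(n, visits, p)
one_draw = simulate_count(n, visits, p, rng)
print(mean_count, one_draw)      # the mean (40.0) exceeds the 20 individuals present
```

With 8 visits and a per-visit detection probability of 0.25, the expected count is twice the number of individuals present, which is the situation described above where the count exceeds the true abundance.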
Without additional assumptions, the model (1) cannot be fitted because there are more parameters than observations and the model is not identifiable. Our main hypothesis in this article is that the observational parameter $O_{ijk}$ can be written as

$$O_{ijk} = P_{ik} E_{jk}, \tag{2}$$

where $P_{ik}$ and $E_{jk}$ are two parameters accounting for the bias induced by the observational processes. As explained in web Appendix A.2.2 (see also Section 5.2), this essentially amounts to assuming that the habitat types are known or any observational bias towards some habitat types is the same across sites.
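Hypothesis (2) can be illustrated with a small sketch. The numerical values and variable names below are made up for illustration. For a fixed dataset k, the nuisance terms form the outer product of a species vector and a site vector, so I + J numbers replace I × J, and the ratio of nuisance terms between two species is the same at every site.

```python
import math

# Hypothetical values for one dataset k: 3 species, 4 sites.
P = [0.50, 0.10, 0.02]        # P_ik: detection/reporting of species i
E = [1.0, 3.0, 0.5, 12.0]     # E_jk: observational intensity at site j

# O_ijk = P_ik * E_jk for every species i and site j.
O = [[p * e for e in E] for p in P]

# Ratio between the first two species, computed site by site.
ratios = [O[0][j] / O[1][j] for j in range(len(E))]
same_bias_everywhere = all(math.isclose(r, P[0] / P[1]) for r in ratios)

n_free_unstructured = len(P) * len(E)   # one O per (species, site) pair
n_free_factorized = len(P) + len(E)     # one P per species, one E per site
print(same_bias_everywhere, n_free_unstructured, n_free_factorized)
```

The constant ratio across sites is the numerical counterpart of assuming that the observational bias toward some species does not depend on the site.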

Hence, our model for the counts $X_{ijk}$ can be rewritten as

$$X_{ijk} \sim \mathrm{Poisson}(N_{ij} P_{ik} E_{jk}), \quad \text{for } i = 1, \ldots, I,\ j = 1, \ldots, J \text{ and } k = 0, 1. \tag{3}$$

The parameter $P_{ik}$ reflects both the detectability of the species i (some species are more conspicuous than others, some are more easily trapped, etc.) and the detection/reporting rate of this species in dataset k (the attention of the observers may systematically vary among species). It can be interpreted as the detection/reporting probability of a typical individual of species i for dataset k; see equation (A7) in Appendix A.2.2. The parameter $E_{jk}$, which we refer to as the observational intensity at site j in dataset k, reflects the impact of the variable observational effort (including number and duration of visits, number of traps, etc.) and the variable observational conditions met during the counting sessions. It is thus a complex function of multiple features of the visits to site j for dataset k, and it can be (much) larger than 1 when the number of visits to site j for dataset k is very large. Note that $E_{jk}$ can be very large even when $O_{ijk}$ is less than 1 if $P_{ik}$ is very small.

In the following, we assume that the observational intensities in the standardized dataset, $E_{j0}$, are known up to a constant. For illustration, in the example in Section 4, the observation protocol in the monitoring dataset from the French National Hunting and Wildlife Agency is the same at each site j, so $E_{j0}$ is the same for all j. In contrast, the opportunistic dataset is characterized by unknown observational intensities $E_{j1}$.

2.2. Identifiability

We have 2IJ observations and IJ + 2(I + J) parameters $N_{ij}$, $P_{ik}$, and $E_{jk}$. When IJ > 2(I + J) (i.e., when the harmonic mean of I and J is greater than 4), which typically holds for large J and I ≥ 3, we have more observations than parameters. However, without further constraints, we cannot identify all of the parameters.
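The counting argument can be checked with a short sketch (the function names are ours). It compares the 2IJ observations with the IJ + 2(I + J) parameters and confirms, on a few made-up values of I and J, that the condition IJ > 2(I + J) coincides with a harmonic mean of I and J above 4.

```python
def n_observations(I, J):
    """One count per species, site and dataset (two datasets)."""
    return 2 * I * J

def n_parameters(I, J):
    """IJ abundances, plus I detection terms and J intensities per dataset."""
    return I * J + 2 * (I + J)

def harmonic_mean(I, J):
    return 2 * I * J / (I + J)

for I, J in [(3, 6), (3, 7), (10, 50)]:
    more_obs = n_observations(I, J) > n_parameters(I, J)
    # The two formulations of the condition must agree.
    assert more_obs == (harmonic_mean(I, J) > 4)
    print(I, J, n_observations(I, J), n_parameters(I, J), more_obs)
```

With I = 3 species, the observations outnumber the parameters only from J = 7 sites onward. A favorable count does not by itself guarantee identifiability.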
For example, multiplying all the $N_{ij}$ by 2 and dividing all the $E_{jk}$ by 2 leaves the product $N_{ij} P_{ik} E_{jk}$ unchanged. As explained in the web Appendix A, we need to impose J + I + 1 identifiability constraints. We therefore reparametrize the model in a manner that enables us to easily express these constraints (see web Appendix A.2.3 for the rationale behind this parametrization). We define:

$$\tilde N_{ij} = N_{ij}\, \frac{P_{i1} E_{10} P_{10}}{P_{11}}, \qquad \tilde E_{jk} = \frac{E_{jk}}{E_{10}} \times \frac{P_{1k}}{P_{10}}, \qquad \tilde P_{ik} = \frac{P_{ik}}{P_{i1}} \times \frac{P_{11}}{P_{1k}}. \tag{4}$$

Since $\tilde N_{ij} \tilde E_{jk} \tilde P_{ik} = N_{ij} P_{ik} E_{jk}$, the model (3) becomes

$$X_{ijk} \sim \mathrm{Poisson}(\tilde N_{ij} \tilde E_{jk} \tilde P_{ik}), \tag{5}$$

with the J + I + 1 constraints $\tilde E_{j0} = E_{j0}/E_{10}$ for all j, $\tilde P_{i1} = 1$ for all i and $\tilde P_{10} = 1$. In addition, in the special case where species i is not monitored in dataset 0 but is recorded in dataset 1, we clearly have the constraint $P_{i0} = 0$ (and hence, $\tilde P_{i0} = 0$).

Let us interpret the new parameters in (5). We are interested in estimating the relative abundances $N_{ij}/N_{i1}$ for species i in each site j. Since $\tilde N_{ij}$ is equal to $N_{ij}$ times a constant depending only on i, we have $N_{ij}/N_{i1} = \tilde N_{ij}/\tilde N_{i1}$. Hence, the relative abundances are given by $\tilde N_{ij}/\tilde N_{i1}$. The parameters $\tilde E_{j1}$ are equal, up to a constant, to the observational intensity $E_{j1}$; therefore, they provide the relative observational intensities $E_{j1}/E_{11}$ at each site j in dataset 1. Finally, $\tilde P_{i0}$ is proportional to the ratio $P_{i0}/P_{i1}$ by an unknown factor $P_{11}/P_{10}$, so we can compare the ratios $P_{i0}/P_{i1}$ across the different species. The ratio $P_{i0}/P_{i1}$ reflects the systematic difference in attention towards some species i among observers from the two schemes.

2.3. Estimation Using a Poisson Model

We can obtain the maximum likelihood estimators $(\hat N_{ij}, \hat E_{jk}, \hat P_{ik})$ of the parameters $\tilde N_{ij}$, $\tilde E_{jk}$ and $\tilde P_{ik}$ using a Poisson model. Thus, if we write $n_{ij} = \log(\tilde N_{ij})$, $e_{jk} = \log(\tilde E_{jk})$, and $p_{ik} = \log(\tilde P_{ik})$, Model (5) can be recast as a Poisson model with a log link, i.e.

$$X_{ijk} \sim \mathrm{Poisson}(\lambda_{ijk}), \quad \text{with } \log(\lambda_{ijk}) = n_{ij} + e_{jk} + p_{ik}, \tag{6}$$

where $e_{j0} = \log \tilde E_{j0}$ is a known offset, $p_{i1} = 0$ for all i, and $p_{10} = 0$. This model can therefore be fitted using standard statistical software. For example, using the glm command in R (R Core Team, 2013) leads to mod
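The reparametrization of Section 2.2 and the log-linear form (6) can be checked numerically. The sketch below is in Python rather than R, uses made-up parameter values and hypothetical variable names, and does not fit the model. Index 0 in the Python lists stands for species 1 and site 1 of the text. It verifies that the rescaled parameters leave every Poisson mean unchanged, that they satisfy the J + I + 1 constraints, and that they preserve the relative abundances.

```python
import math

# Made-up values: 3 species, 2 sites, datasets k = 0 (standardized), 1 (opportunistic).
N = [[40.0, 10.0], [5.0, 25.0], [2.0, 1.0]]      # N[i][j]
P = [[0.30, 0.60], [0.20, 0.05], [0.10, 0.40]]   # P[i][k]
E = [[1.0, 7.0], [1.0, 0.5]]                     # E[j][k]
I, J, K = len(N), len(N[0]), 2

# List index 0 plays the role of "species 1" and "site 1" in the text.
P10, P11, E10 = P[0][0], P[0][1], E[0][0]

# Rescaled parameters (the tilde quantities of the text).
N_t = [[N[i][j] * P[i][1] * E10 * P10 / P11 for j in range(J)] for i in range(I)]
E_t = [[E[j][k] / E10 * P[0][k] / P10 for k in range(K)] for j in range(J)]
P_t = [[P[i][k] / P[i][1] * P11 / P[0][k] for k in range(K)] for i in range(I)]

cells = [(i, j, k) for i in range(I) for j in range(J) for k in range(K)]

# 1. Poisson means are unchanged and equal exp(n_ij + e_jk + p_ik).
means_ok = all(
    math.isclose(N[i][j] * P[i][k] * E[j][k], N_t[i][j] * E_t[j][k] * P_t[i][k])
    and math.isclose(
        N[i][j] * P[i][k] * E[j][k],
        math.exp(math.log(N_t[i][j]) + math.log(E_t[j][k]) + math.log(P_t[i][k])),
    )
    for i, j, k in cells
)

# 2. The J + I + 1 identifiability constraints hold.
constraints_ok = (
    all(math.isclose(E_t[j][0], E[j][0] / E10) for j in range(J))
    and all(math.isclose(P_t[i][1], 1.0) for i in range(I))
    and math.isclose(P_t[0][0], 1.0)
)

# 3. Relative abundances N_ij / N_i1 are preserved.
ratios_ok = all(
    math.isclose(N[i][j] / N[i][0], N_t[i][j] / N_t[i][0])
    for i in range(I) for j in range(J)
)
print(means_ok, constraints_ok, ratios_ok)
```

Because the rescaling factor of each abundance depends only on the species, any quantity expressed as a ratio across sites for a given species is unaffected, which is why the relative abundances remain estimable.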
