Mathematical Biosciences 255 (2014) 43–51

Contents lists available at ScienceDirect

Mathematical Biosciences journal homepage: www.elsevier.com/locate/mbs

Semivarying coefficient models for capture–recapture data: Colony size estimation for the little penguin Eudyptula minor Jakub Stoklosa a,⇑, Peter Dann b, Richard Huggins c a

School of Mathematics and Statistics, The University of New South Wales, New South Wales 2052, Australia Research Department, Phillip Island Nature Parks, P.O. Box 97, Cowes, Phillip Island, Victoria 3922, Australia c Department of Mathematics and Statistics, The University of Melbourne, Victoria 3010, Australia b

a r t i c l e

i n f o

Article history: Received 29 April 2013 Received in revised form 12 December 2013 Accepted 26 June 2014 Available online 3 July 2014 Keywords: Functional data analysis Generalized cross validation Jolly–Seber model Open population P-splines Varying coefficient models

a b s t r a c t To accommodate seasonal effects that change from year to year into models for the size of an open population we consider a time-varying coefficient model. We fit this model to a capture–recapture data set collected on the little penguin Eudyptula minor in south-eastern Australia over a 25 year period using Jolly–Seber type estimators and nonparametric P-spline techniques. The time-varying coefficient model identified strong changes in the seasonal pattern across the years which we further examined using functional data analysis techniques. To evaluate the methodology we also conducted several simulation studies that incorporate seasonal variation. Ó 2014 Elsevier Inc. All rights reserved.

1. Introduction Understanding the changes in abundance for wildlife populations across time is critical to developing appropriate conservation strategies. These changes may be due to environmental or biotic variables. The size of a population of a species may vary accordingly throughout the year due to factors such as: individual recruitment, individuals moving in and out of the population, and individual mortality (which may occur naturally or be caused by humans). These factors may also be driven by seasonality [1]. Assumptions concerning closed populations are seldom met in nature [2] and immigration/emigration are obvious components of the life histories of many birds [3]. The little penguins Eudyptula minor spend more time ashore during spring and summer due to breeding and moulting and, consequently, the recapture probabilities are susceptible to seasonal effects of a population model. It is known for the little penguin that the timing of the breeding season varies from year to year [4] so that models should allow the seasonal effects to vary according to the year. In classical parametric models, the model coefficients are usually considered fixed. To allow these coefficient to vary, say over time, Cleveland et al. [5] proposed the varying coefficient model

⇑ Corresponding author. Tel.: +61 (2) 9385 7029. E-mail address: [email protected] (J. Stoklosa). http://dx.doi.org/10.1016/j.mbs.2014.06.014 0025-5564/Ó 2014 Elsevier Inc. All rights reserved.

(VCM). These models were further extended by Hastie and Tibshirani [6], and have now become increasingly popular in various applications. Specifically, they are very flexible, easy to construct and are readily interpretable. Indeed, VCMs are natural extensions of classical parametric models, however as Fan and Zhang [7] mention ‘‘the varying coefficient models are not stimulated by the desire of purely mathematical extension, rather they come from the need in practice’’. Consequently, an extensive literature on these models have been successfully applied in many areas. For example, Cheng et al. [8] investigated infant mortality in China by considering generalized linear models and allowed the coefficients for covariates, such as mother’s age at the birth of the child and urban–rural residence, to depend on time; and Huang et al. [9] analysed longitudinal data collected on homosexual men who were infected by HIV by considering longitudinal models and allowing coefficients for some covariates to be dependent on years after infection. A special case of VCMs are the semivarying coefficient models (SVCM) arising in Fan and Zhang [10] and Zhang [10,11] that have some varying and some nonvarying coefficients. Xia et al. [12] applied SVCM to hospital admissions where the coefficients for covariates, such as birth rates, were allowed to vary but seasonal effects were kept fixed; and Huggins et al. [13] have applied SVCM to model numbers of suicides in Hong Kong. Estimation is commonly conducted using kernel smoothing [14] or smoothing splines [6,15]. Here, we focus on the latter, extending the VCM approach of Marx [15], also called generalized linear additive

44

J. Stoklosa et al. / Mathematical Biosciences 255 (2014) 43–51

smooth structures [16], to SVCM and apply it to population models. We use varying coefficients to model the change in seasonal effects from year to year. For example, breeding and moulting which are seasonal, may vary from year to year. Importantly, the SVCM reduces the effective number of parameters in the model because we model the seasonal effects for each month as functions of time, rather than estimate a separate effect for each year. To make inferences on the size of a population size over time, a capture–recapture (CR) experiment conducted on individuals from the population may be undertaken. In these experiments, individuals are uniquely marked on first capture and records are noted if the individuals are seen or not seen again over a fixed number of visits or capture occasions. There are many methods for estimating the population size from CR data, one of the most common being the Jolly–Seber model [JSM, 17,18] which gives closed form estimators for the number of marked individuals and the population size within the study area. These models consider an open population, which permit the processes of immigration, emigration, birth and death to occur over time. Several key assumptions are also made: all individuals have equal catchabilities, all individuals have equal survival rates, all emigration from the population is permanent, marked individuals do not lose their marks/tags and sampling periods are short [2]. A number of modifications and variants within the JSM framework have now been developed, these include: Pollock’s Robust design [19,20], martingale estimators [21,22], unequal catchabilities [23] and a more refined approach to handle birth process [24]. More recent extensions of the JSM include nonparametric and semiparametric models, which explicitly model the population size as functions of time and monthly effects [25,26], and functions of covariates – such as sea-surface temperature, in addition to time [27]. We are directly motivated by observations on a colony of little penguins located at Summerland Beach on the western end of Phillip Island, Victoria, Australia. There are approximately 32,000 breeding penguins living on the island [28]. Little penguins are inshore feeders [29,30], spending much of their time at sea searching for food and coming ashore for relatively short periods throughout the year except during moulting when they are ashore for up to three weeks [31]. When ashore they are vulnerable to introduced mammalian predators [32], and their numbers ashore are believed to be highest during breeding and moulting [33]. Pilchards Sardinops sagax were a major source of food until an extensive mortality of pilchards across southern Australia [34,35] resulted in a switch in diet to other prey [36] with increased anchovy Engraulis australis presence in some years [35]. We consider a CR data set collected on adult little penguins, consisting of monthly captures from years 1985 to 2009. The first record was taken at 09/01/1985 and the last at 29/12/2009, yielding a total of 9118 uniquely marked individuals on 300 capture occasions. Using these data, we plot the JSM estimates [37] for: (a) the number of marked; and (b) the colony size with 95% confidence bands across each occasion in Fig. 1. Some seasonal peaks are evident here, however a clear interpretation of the seasonality cannot be made using these models. The drop in colony size around 1996 is likely to be due to higher adult mortality following the mass mortality of pilchards in 1995 [34], and the sudden drop at the end of the study is likely to be an artefact of not picking up some of the birds that were missed in 2009 but may have picked up in 2010 or 2011 if the study had continued. The start of the breeding season (i.e., egg-laying) for little penguins is usually in spring [33,38] and moulting occurs in late summer to mid autumn [31] but it has been observed that the timing of breeding and moulting seasons can be quite variable between years [4,31]. Johannesen et al. [39] recorded seasonal variation in survival, weights, and population counts of little penguins in

Otago, New Zealand. A consequence of this variability is that modelling seasonal effects on the population/colony size over a long time span becomes difficult; the breeding and moulting seasons may not be strictly related to calendar time. To model the seasonality and obtain meaningful estimates using open population models we extend the semiparametric models of Huggins and Stoklosa [27] to SVCMs, as described above. Inference is conducted using flexible P-splines techniques [40] as in Marx [15]. For further details on P-splines see Eilers and Marx [41]. This SVCM framework has several advantages over the recent works of Huggins and Stoklosa [27]:  firstly, a flexible modelling approach is considered for the seasonal effects – i.e., we now allow seasonal effects to vary from year to year;  secondly, the varying seasonality effects can be easily interpreted using functional data analysis (FDA) techniques [42,43]; and  thirdly, the proposed approach helps clarify the information given by the regression plots (which are a common means of presenting the nonparametric estimates) – i.e., we can (visually) identify key structures in the data. In order to present our proposed methods, we first review the semiparametric JSM of Huggins and Stoklosa [27] in Section 2, and discuss a method of accounting for temporary emigration in Section 2.2.2. A preliminary data analysis on the little penguin data is conducted in Section 2.3.2. We then extend the semiparametric JSM to the proposed SVCM in Section 3. Simulations are conducted in Section 4 and we apply the SVCM models to the little penguin data in Section 5. Some discussion is given in Section 6. We used the R-software [44] to conduct our simulations and data analysis. 2. Semiparametric Jolly–Seber models 2.1. Notation As our motivating data were collected at monthly intervals we consider t ¼ 1; . . . ; s monthly capture occasions. Let i ¼ 1; . . . ; I denote the years and k ¼ 1; . . . ; 12 denote the months, such that for each t there is a corresponding i and k. Although the little penguin data was collected from a single colony, the models given below can be used to estimate population sizes. Thus we will use population when describing our models in a general framework, and colony when considering the little penguin data. Let pt be the probability of capture on occasion t. We assume all individuals in the population have the same capture probability and survival rates. Denote by M t the unknown number of marked individuals in the population and N t the unknown population size at time t. Let nt be the number of individuals captured on occasion t and mt the number of previously captured individuals that have been recaptured on occasion t. Also, let r t denote the number of the nt individuals captured and released on occasion t that were recaptured at least once on occasions t þ 1; . . . ; s. Finally, denote by zt the number of individuals captured at least once on occasions 1; . . . ; t  1 that were not captured on occasion t but were recaptured at least once on occasions t þ 1; . . . ; s. Note that, rt is defined for t ¼ 1; . . . ; s  1 and zt for t ¼ 2; . . . ; s  1. For computational purposes we set z1 ¼ zs ¼ 0 and rs ¼ 0. In conducting inference for both the JSM and the proposed models below, note that given N t ; Mt ; nt , the observed mt has a hypergeometric distribution [22], and M t and N t are the parameters of this conditional distribution. Moreover, M t is a nuisance parameter that is estimated separately from equating r t =nt to zt =ðM t  mt Þ, i.e., the conditional distributions of r t given nt , and zt given M t and mt . Noting that only at time t does N t become a

J. Stoklosa et al. / Mathematical Biosciences 255 (2014) 43–51

45

Fig. 1. Jolly–Seber estimates for (a) marked and (b) colony size with 95% confidence bands across each occasion fitted to the little penguin data. The number of capture occasions is s ¼ 300. Note the decline in colony size after 1995 and the irregular large peak during 2001.

parameter, Jolly [17, p. 228] refers to these as conditional parameters and notes they are unlike ordinary parameters as they are not fixed characteristics of the population at the start of the experiment. Further, note that if we make the assumption that at most one individual can enter or leave the population in any instant then N t and M t will be approximately continuous as functions of t 2 R. 2.2. Estimation of the number of marked individuals 2.2.1. Models Following Huggins and Stoklosa [27] we consider a nonparametric model using P-splines for the number of marked individuals in the population, that for completeness we describe the P-spline formulation here. The M t ; t ¼ 1; . . . ; s are nuisance parameters and we model them as a nonparametric function of time only. The JSM will estimate each of these conditional parameters separately, whereas a nonparametric model reduces the effective number of parameters according to the smoothness of M t ; t ¼ 1; . . . ; s. Suppose that M t ¼ X M ðtÞw for a B-spline basis X M with tth row X M ðtÞ for t ¼ 1; . . . ; s and corresponding parameter vector w ¼ ðw1 ; . . . ; ws ÞT 2 Rs . When considering P-splines, the number of knots is usually set to be quite large and to be evenly spaced across t [41,27] – we do the same here. In this data driven model, the smoothness of the function is determined by a smoothing parameter kM (see Web Appendix A1 of the supplementary data), which is estimated from the data. Note that for feasible estimation the model requires many capture occasions, hence many surveys are required. Let n ¼ ðn1 ; . . . ; ns ÞT , m ¼ ðm1 ; . . . ; ms ÞT , N ¼ diagðn1 ; . . . ; ns Þ, R ¼ diagðr 1 ; . . . ; rs Þ, Z ¼ diagðz1 ; . . . ; zs Þ and M ¼ diagðm1 ; . . . ; ms Þ. Then, as in Huggins and Stoklosa [27], we consider a penalized weighted least squares estimator, which yields the following:

b¼ w

n

 o1 ðX TM RÞðZn þ RmÞ X TM R ðRX M Þ þ kM DTd Dd

where Dd is the difference penalty matrix [40] of order d with a nonnegative smoothing parameter kM.T An estimate for Mt is then  b 1; . . . ; M b s ¼ XM w b t ¼ X M ðtÞ w. b Let c b and D b ¼ diagð c M¼ M MÞ. M 2.2.2. Temporary emigration In estimating the number of marked individuals, the JSM method (that we have modified above) assumes that emigration

is permanent. Under this assumption, individuals that are observed before and after a time t must be in the population at time t. Thus, zt defined above overestimates the number of individuals that are observed before and after t that are not observed at t but are in the e t ¼ mt þ zt nt =r t , population at t. The simple JSM estimator is M which will overestimate the number of marked individuals in the e t ¼ nt M e t =mt , so that population. The JSM estimator of N t is then N if we overestimate Mt , the estimator of the population size will be similarly biased. This carries through to our nonparametric and semiparametric estimators. The assumption of permanent emigration for the JSM is violated for the little penguin data since individuals move in and out of the colony. For example, they may be away from the colony for several months after fledging [45] and during winter [30]. This issue was addressed on a related data set in Huggins and Yip [25], Huggins [26] and Huggins and Stoklosa [27] who modified the estimation of the number of marked individuals at time t by only considering captures on occasions t  K; . . . ; t þ K. This approach supposes that emigration is permanent over shorter time periods. One may fix K to some specified value, however an optimal value for K via some adaptive data driven approach is also feasible. For example, Huggins [26] used a Pearson statistic and Huggins and Stoklosa [27] used generalized cross validation [GCV, 46]. For the little penguin data described in Section 1, we fix K to 12, which corresponds to a yearly cycle or a two year width which assumes permanent emigration. The same value was also used in Huggins and Yip [25] and Huggins [26] using a CR experiment on the same species at the same location but conducted over a shorter period (July 1990–May 1995). 2.3. Models for the population size 2.3.1. Models In our initial analysis we considered the following models of Huggins and Stoklosa [27], that for completeness we again describe the estimators and model structures here. Since we now consider smoothing over the years i rather than monthly capture occasions, we change the notation for population size by indexing N ðÞ with i and k.  Nonparametric Model (NPM):

N ik ¼ X N ðiÞc

46

J. Stoklosa et al. / Mathematical Biosciences 255 (2014) 43–51

for a B-spline basis X N with ith row X N ðiÞ corresponding to the year of capture i for each t and the corresponding parameter vector c ¼ ðc1 ; . . . ; cp ÞT 2 Rp . Let W ¼ X N and h ¼ c.  Semiparametric Model (SPM):

Nik ¼ X N ðiÞc þ bk where bk are monthly/seasonal effects corresponding to the month of capture k for each t. For these models the first component can be viewed as the long-term trend (i.e., an overall change or trend across the entire experiment) and second component as the seasonality effects. Note that we set b1 ¼ 0 for identifiability. Let W ¼ ½X N ; Z  where Z is an indicator function for t and h ¼ T ðcT ; bT2 ; . . . ; bT12 Þ . We then solve the equation:

  b W T M MWh  Dn þ kN Dd DTd c ¼ 0 n o1 b where the penalty yielding b ðW T MÞ Dn h ¼ ðW T MÞðMWÞ þ P has the form P ¼ blockdiagðkN DTd Dd ; 0Þ for nonnegative kN and 0 is a vector of zeroes of length 11. Let N ¼ ðN11 ; . . . ; N I;12 ÞT , hence an b ¼ Wb h. Note that estimates for N are funcestimate for N is then N tions of c M. Smoothing parameters kM and kN are chosen by minimizing a GCV statistic (Eq. (S1) in Web Appendix A1), and we consider an Akaike information criterion (AIC) for model selection (Eq. (S2) in Web Appendix A1). In Web Appendix A2 we give b. approximate standard errors for c M and N Note that we are only estimating N ik at discrete number of points and the P-spline formulation allows us to reduce the effective number of parameters to less than s. The degree of reduction is determined by the smoothness of M. In comparison to our estimates of M, the seasonal effects relating to the population size are of interest here, hence no seasonal effects are considered in the model structure of M. Moreover, since the population size is now assumed to be a smooth function across years (i.e., the longterm trend) rather than months, a fewer number of knots will be used for N in contrast to M.

2.3.2. Initial analysis of the little penguin data To illustrate these models we fit them to the little penguin data as described in Section 1. The number of knots was set to 300 for the marked basis and 20 for the long-term trend. Difference penalty orders were set to d ¼ 4 for the marked and d ¼ 2 for the long-term trend. In Fig. 2 we give the population size estimates with nominal 95% confidence bands for the NPM/SPM along with the JSM estimates corrected for temporary emigration where K ¼ 12 (see Section 2.2.2). In Table 1 we report their log-likelihood values, AIC values and the effective degrees of freedom (EDF). To check the goodness-of-fit for each fitted model, we consider Dunn–Smyth residuals [47] based on n and plot these in Web Fig. 1 of the supplementary data. We also separately plot the marked and colony size estimates with nominal 95% confidence bands across each occasion for each model in Web Figs. 2 and 3. The colony size estimates for the NPM and SPM are now much smoother and there are no irregular jumps compared with the JSM estimates, see Fig. 2. Furthermore, the goodness-of-fit appears adequate for all nonparametric models, whereas the JSM estimates reveal a small trend. Note that these JSM estimates have significantly decreased once corrected for temporary emigration. The increase in the colony size before 1996 was expected as habitat restoration and predator control work was conducted to protect the little penguin. As noted previously the drop in colony size around 1996 was due to pilchard mortality. The lowest AIC value was reported for the NPM, however almost no seasonality is estimated here (see Fig. 2). In Fig. 3 we give estimates for (a) the long-term trend with nominal 95% confidence bands across each occasion; and (b) monthly effects with error bars for the SPM. Here, the monthly effect estimates start high for warmer months (February–April), which is also around the moulting period, they gradually drop for cooler months (June–September) and begin increasing as the breeding season commences. However, these estimated seasonal effects are much smaller than those estimated by Huggins [26] and similarly in Huggins and Stoklosa [27], where seasonal effects were estimated as functions of the sea-surface temperature. Sea surface temperature data were not available for the full time period considered here so we do not consider these models. A possible explanation for the comparatively small seasonal effects estimated in the semiparametric

Fig. 2. Colony size estimates for the JSM, NPM, SPM and the SVCM (see Section 3) fitted to the little penguin data. Note that these models were corrected for temporary emigration where K ¼ 12.

J. Stoklosa et al. / Mathematical Biosciences 255 (2014) 43–51 Table 1 JSM, NPM, SPM and the SVCM fitted to the little penguin data. We report the effective degrees of freedom (EDF), the log-likelihood values, Akaike information criterion (AIC) and the difference in AIC compared with the NPM (DAIC). Notice the large number of parameters (EDF) for the JSM. EDF

log-likelihood

AIC

DAIC

298.000

2:704  105

2:710  105

295.590

NPM

17.064

5

2:706  10

2:707  105

0

SPM

28.033

2:718  105

2:719  105

1164.492

SVCM

65.154

2:708  105

2:710  105

253.738

JSM

models is that they vary from year to year. In the absence of covariate data to model these changes, a SVCM is suggested, which we now develop. 3. Semivarying coefficient Jolly–Seber models In the SVCM we extend the SPM and take:

Nik ¼ X N ðiÞc þ ak ðiÞ

47

b procedure for the SVCM and approximate variance formulae for N under the SVCM are given in Web Appendix A3. 4. Simulations To examine the performance of the SVCM estimators and the standard error formulae (as given in Web Appendix A2), we conducted two simulation studies similar to Huggins [26] and Huggins and Stoklosa [27]. We considered a 10 year experiment with monthly capture occasions so that t ¼ 1; . . . ; s ¼ 120. Throughout the entire experiment we set the birth/immigration rate equal to deaths/permanent emigration rate i.e., for each individual removed from the population, a new individual enters. To include temporary emigration, we considered the same set of individuals moving in and out of the population, and model these through seasonal effects. These ‘‘temporary emigrants’’ were also susceptible to deaths. The capture probabilities were set to pt ¼ 0:3 for t ¼ 1; . . . ; s for both simulation studies. 4.1. Simulation study 1

where ak ðiÞ being functions of the years i ¼ 1; . . . ; I and we set a1 ðiÞ ¼ 0 for identifiability. That is, ak ðiÞ is the seasonal effect corresponding to month k in year i which is the varying coefficient component of the model. Similar to the models above, we model ak ðiÞ as T

a B-spline. For k ¼ 2; . . . ; 12, write dk ¼ ðdT1 ; . . . ; dTq Þ and take ak ðiÞ ¼ YðiÞdk where Y is a B-spline basis for i ¼ 1; . . . ; I with ith row YðiÞ. We now write:  Semivarying Coefficient Model (SVCM):

Nik ¼ X N ðiÞc þ YðiÞdk ¼ RðtÞh T

T

where RðtÞ has tth row ðX N ðiÞ; YðiÞÞ; h ¼ ðcT ; dT Þ and d ¼ ðdT2 ;. . . ;dT12 Þ . As in Section 2.3, we may solve for h by computing a penalized weighted least squares estimator with an additional smoothing parameter kVC , which may be similarly obtained using the GCV statistic described in Web Appendix A1. Finally, details for the fitting

In the first simulation study we let the monthly effects be constant across the years. Let the long-term trend be represented by f ðiðtÞÞ ¼ 600 þ 400 sinðiðtÞ=pÞ þ  where   Nð0; 52 Þ, and let b ¼ ð0; 32; 162; 512; 1250; 2592; 2592; 1250; 512; 162; 32; 32ÞT represent the monthly effects (true coefficient values) of temporary emigrants entering and leaving the population. Note that some additional variation  was also included about the true smooth curve in the long-term trend. We set the number of knots to 120 for the marked basis and 8 for both the long-term trend and the varying coefficient basis components. The penalty orders were set to d ¼ 4 for the marked and d ¼ 2 for both the long-term trend and the varying coefficient component. As these seasonal effects are quite large, we would expect many temporary emigrants, therefore we fix K ¼ 2 for each simulation to remove the emigration bias as discussed in Section 2.2.2. To select the optimal smoothing parameters, we consider a grid search on powers 10K , where K ¼ 1; . . . ; 4 for kN and kVC , and K ¼ 2; . . . ; 3 for kM . We

Fig. 3. SPM plots of: (a) long-term trend estimates (smoothed across the years) with nominal 95% confidence bands across each occasion and (b) monthly effects estimates with error bars fitted to the little penguin data. Note that we set b1 ¼ 0 (January) for identifiability.

48

J. Stoklosa et al. / Mathematical Biosciences 255 (2014) 43–51

conducted 500 simulations and plot the results in Web Figs. 11–14 of the supplementary data. Estimates for the number of marked and long-term trend showed little bias (Web Fig. 11(a) and (c)), however their standard error (S.E.) estimates have a slight downward bias (Web Fig. 11(b) and (d)). The population size estimates and their standard errors produced reasonable results with some bias at most peaks and a slight downward bias in the population size standard errors (Web Fig. 12(a) and (b)). Performance of varying coefficient estimates and their standard errors were quite mixed (Web Figs. 13– 14). For example, a6 ðtÞ was upwardly biased but little bias was apparent in its standard errors. This suggests that there may be some bias arising from the use of fixed smoothing parameters if some parametric effects are dramatically larger than others. 4.2. Simulation study 2 In the second simulation study we let the monthly effects vary across the years. We considered common smooth varying functions to the coefficients but change their respective magnitudes. Let b ¼ ð12; 27; 48; 75; 108; 108; 75; 48; 27; 12; 12ÞT and

ak ðiÞ ¼



bk  f6 sinði=0:55pÞ þ 7:2g if k is even; bk  f6 cosði=0:55pÞ þ 7:2g if k is odd:

We used the same number of knots, penalty orders, grid search, temporary emigration constant and long-term function as above. There was some overestimation for the number of marked and long-term trend (Web Fig. 15(a) and (c)), whereas their standard errors showed downward bias. The population size estimates showed little bias (Web Fig. 16) and the varying coefficient estimates picked up the correct nonlinear shape, though again there was some bias apparent (Web Fig. 17). Their standard error estimates (Web Fig. 18) also appeared reasonable. Note that only straightforward functional structures on each coefficient were considered here, however overall we found these estimates were in good agreement with the true values and the standard error formulae provide good approximations for nominal 95% confidence intervals. 5. Case study: little penguin data set We return to the analysis in Section 2.3.2 by now fitting the SVCM to the little penguin data set. We used the same number of knots and difference penalties as in Section 2.3.2 for marked and long-term trends, with the number of knots set to 20 and a difference penalty order set to d ¼ 4 for the varying coefficient component. The AIC values, reported in Table 1, indicated that the SVCM gave a better fit to the data compared with the SPM but a slightly poorer fit than the NPM. From Fig. 2, the colony size estimates were slightly larger in comparison to SPM, however within each year there is now more variability corresponding to the seasonal changes, for example between years 1993 and 1997, moreover some of the seasonal peaks appear to be shifted. In Web Fig. 4 we plotted the marked and colony size estimates with nominal 95% confidence bands against the capture occasions. In Fig. 4 we plotted the varying coefficient (smoothed) estimates with nominal 95% confidence bands across years. The estimated monthly effects in Fig. 4 indicated changes in the pattern around 1996 and 1997. After the pilchard mortality, little penguins switched their main food source which most likely reflected differences in the availability of these fish species. In Web Fig. 5 we plotted the colony size estimates across each month and year simultaneously and view these on the xyz-plane. Based on these figures, there appeared to be substantial variation across time for each coefficient. In Web Fig. 6 we plotted varying coefficient

^ k ðiÞ) for each year across months, which we then components (a also viewed on the xyz-plane in Web Fig. 7. The estimated monthly effects in Web Fig. 6 suggest the monthly effects have indeed changed with time. Furthermore, for both figures there appears to be peaks in the early parts of the year (February–May) and dips in the latter parts of the year (July–Oct). In order to interpret these varying seasonal effects we first applied FDA techniques of Chapter 7 of [43] to the plots of Web Fig. 7. As our coefficients are functions of time, we consider functional principal component analysis, which is analogous to principal component analysis (PCA) but the eigenvectors are replaced with eigenfunctions (or harmonics). These eigenfunctions are associated with each eigenvalue. Once positive eigenvalue/eigenfunction pairs are obtained, we may then calculate the principal component scores from the leading eigenfunctions. The principal component scores can help interpret the nature of the variation identified by the functional PCA. For example, to visualize the results, Ramsay et al. [43] recommend plotting scores for the two major principal component against each other. This allows us to see how curves cluster and distribute themselves. Note that more meaningful components of variation in the functional data may be obtained by applying rotations to the eigenfunctions. For our analysis, wed use the fda [43] R-package. In Web Fig. 8(a) we plotted the yearly means for the varying coefficient estimates across each month (see Web Fig. 6), and in Web Fig. 8(b) we plotted the smoothed estimates for each of these. We then plotted the first two rotated principal component functions (harmonics) in Web Fig. 9(a) and (b) where the thick line represents the mean and the ‘‘+’’/‘‘’’ represent the effect of adding/ subtracting a small amount of each principal component. These accounted for 95.5% of the variation around the smoothed curves from Web Fig. 8(b). In Web Fig. 9(a) we see a contrast between colony sizes at moulting and breeding times of the year. To further enhance the inference and identify any possible clustering, we considered a bivariate normal mixture model [48] on the principal component scores for the pairs of rotated harmonics. The bivariate normal mixture models which we consider here assume there exists a finite number classes (or clusters in our case) – to find the optimal number of classes, a Bayesian information criterion (BIC) was used. We plotted the BIC values against the number of clusters in Web Fig. 10(a) where the optimal number of clusters identified was 3. In Fig. 5 we plotted the principal component scores for the pairs of rotated harmonics with an overlay of the identified clusters from the bivariate normal mixture model. We also added contours around the centres of the clusters – which were also obtained from the mclust R-package. These contours (or ellipsoids) are defined by the covariances of the bivariate normal mixture model, and were added to see how dispersed the component scores are around the centre of the clusters. The clusters from Fig. 5 correspond to key structures throughout the 25 years, i.e., notice how some component scores will cluster around the centres, whereas others diverge from the centres. We explain these patterns in more detail here. Prior to 1990 the seasonal effects were clustered together. From 1990 to 1994 there was a change and then clustering from 1994 to 1999. There was then a change until 2004 and the seasonal effects again clustered. We interpret the effects prior to 1995 as indicative of the colony trend when pilchards were a major food source and many anthropogenic causes of adult mortality were being reduced or eliminated. The change from after 1995 to 1999 is likely to be in response to changes in pilchard numbers. The changes after 1999 may reflect changes in penguin productivity and survival after anchovies became an important prey in the absence of pilchards giving the stable effects after 2004. Note that the computational times may vary. These will depend on how many capture occasions/number of knots will be

49

J. Stoklosa et al. / Mathematical Biosciences 255 (2014) 43–51

Fig. 4. Smooth estimates for each varying coefficient (i.e., February to December) across years using the SVCM fitted to the little penguin data.

300

200

harmonic 2

100 factor (Cluster) 1 2 3

0

-100

-200

-300 −600

−400

−200

0

200

400

harmonic 1 Fig. 5. Principal component scores for the pairs of rotated harmonics from Web Fig. 9. An overlay of the identified clusters from fitting a bivariate normal mixture model with contours is also given. Notice the three prominent clusters which correspond to significant colony sizes changes across the 25 year period.

considered in the analysis. In the above case study the SVCM took approximately 7 min. (3.50 GBo RAM, 2.67 GHz CPU) to complete with the specified ranges for smoothing parameters and fixed K.

6. Discussion We focused on extending the approach of Huggins and Stoklosa [27] by considering a SVCM which gave the regressors a modified effect. Our approach differs from previous studies that also model seasonality; each having their advantages and disadvantages. Yang et al. [33] used a nonparametric time series decomposition model which allows for greater flexibility when considering a full time series structure, for example, lag effects can be incorporated in

their model, etc.; and Johannesen et al. [39] used closed CR models in conjunction with Pollock’s Robust design [19] which will better account for temporary emigration compared with the proposed model. By conducting simulations we found that the estimators performed well giving some bias, particularly when peaks were present. The variance formulae also performed reasonably well though more bias was present here, this could be due to several factors: such as not accounting for the variation in the smoothing parameters, or some correlations being ignored in the estimators for the number of marked. As discussed in Huggins [26] and Huggins and Stoklosa [27], the correlation in the martingale estimating equations is unknown and may not be a vector of martingale differences, however reasonable approximations can still be made.

50

J. Stoklosa et al. / Mathematical Biosciences 255 (2014) 43–51

We noticed that the bias and standard errors of individual parameters were sensitive to the choice of kM ; kN and kVC . Since these parameters were selected to minimize a global GCV statistic based on the residual sum of squares rather than the individual variances, some imbalance may occur. Importantly, the structure of our proposed model assumes that the actual population size lies on some smooth curve across the capture occasions. This assumption may not always hold in practice, as some wildlife populations may fluctuate rapidly. In our example data, some unusually large peaks were indeed evident when considering the JSM (see Fig. 1). These estimated peaks occurred due to the structure and assumptions of JSM estimators, as it is unrealistic for the little penguin colony size to inflate and then rapidly decline to such an extreme magnitude within a one month period. The proposed models try to overcome this by assuming smoothness over time, which yields more stable population size estimates whilst simultaneously accounting for seasonal changes. As noted in Section 5, Web Fig. 6 has similar features from year to year suggesting a common functional form, however these are not consistent for each year, suggesting the timing of moulting and breeding may not be synchronized for a calendar year, hence registration techniques [43, Chapter 7] may also be appropriate. From a practical point of view, the SVCM gives information on the change in the little penguin colony size in relation to environmental changes, in this case: the colony increased in area and colony size from 1985 to 1995 through the declining negative effects of a suite of anthropogenic mortality factors [32] including foxes, dogs and cars; and decreased in colony size due to a loss of a major food source after 1995 [36] until some stability was reached after about 1998 (Fig. 3). Finally, a further extension on inference on detecting population changes may be considered. For example, we may consider techniques such as those given by Fewster et al. [49] where significance testing is conducted to test for population changes. Acknowledgements We would like thank the handling editor and two anonymous referees for their helpful comments in an early version of this article. We would also like to thank the Penguin Study Group for collecting the data and Leanne Renwick, Marg Healey and Roz Jessop for maintaining the database. The penguins were handled under permits from the Department of Sustainability and Environment in Victoria. The Australian Bird and Bat Banding Schemes provided the banding permit and the flipper bands. The first authors research was supported by the David Hay Writing-Up award. Appendix A to Appendix C. Supplementary material Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.mbs.2014.06.014. References [1] P.A. Parsons, The Evolutionary Biology of Colonizing Species, Cambridge University Press, Cambridge, 1983. [2] S.C. Amstrup, T.L. McDonald, B.F.J. Manly, Handbook of Capture–Recapture Analysis, Princeton University Press, New Jersey, 2005. [3] J.B. Hestbeck, J.D. Nichols, R.A. Malecki, Estimates of movement and site fidelity using mark-resight data of wintering Canada geese, Ecology 72 (1991) 523. [4] P.N. Reilly, J.M. Cullen, The Little Penguin Eudyptula minor in Victoria, II: Breeding, Emu 81 (1981) 1. [5] W.S. Cleveland, E. Grosse, W.M. Shyu, Local regression models, in: J.M. Chambers, T.J. Hastie (Eds.), Statistical Models in S, Wadsworth and Brooks, Pacific Grove, 1991, p. 309. [6] T. Hastie, R. Tibshirani, Varying-coefficient models, J. Roy. Stat. Soc., Ser. B (Methodological) 55 (1986) 757.

[7] J. Fan, W. Zhang, Statistical methods with varying coefficient models, Stat. Interface 1 (2008) 179. [8] M.-Y. Cheng, W. Zhang, L.-H. Chen, Statistical estimation in generalized multiparameter likelihood models, J. Am. Stat. Assoc. 104 (2009) 1179. [9] J.Z. Huang, C.O. Wu, L. Zhou, Polynomial spline estimation and inference for varying coefficient models with longitudinal data, Stat. Sinica 14 (2004) 763. [10] J. Fan, W. Zhang, Simultaneous confidence bands and hypothesis testing in varying-coefficient models, Scand. J. Stat. 27 (2000) 715. [11] W. Zhang, Local polynomial fitting in semivarying coefficient model, J. Multivariate Anal. 82 (2002) 166. [12] Y. Xia, W. Zhang, H. Tong, Efficient estimation for semivarying-coefficient models, Biometrika 91 (2004) 661. [13] R.M. Huggins, P.G. Hall, P.S.F. Yip, Q. Bui, Application of additive semivarying coefficient models: monthly suicide data from Hong Kong, Biometrics 63 (2007) 708. [14] J. Fan, W. Zhang, Statistical estimation in varying coefficient models, Ann. Stat. 27 (1999) 1491. [15] B.D. Marx, P-spline varying coefficient models for complex data, in: T. Kneib, G. Tutz (Eds.), Statistical Modelling and Regression Structures: Festschrift in Honour of Ludwig, Physica-Verlag, 2010. [16] P.H.C. Eilers, B.D. Marx, Generalized linear additive smooth structures, J. Comput. Graph. Stat. 11 (2002) 758. [17] G.M. Jolly, Explicit estimates from capture–recapture data with both death and immigration-stochastic model, Biometrika 52 (1965) 225. [18] G.A.F. Seber, The Estimation of Animal Abundance, Griffin, London, 1982. [19] K.H. Pollock, A capture–recapture design robust to unequal probability of capture, J. Wildlife Manage. 46 (1982) 757. [20] W.L. Kendall, K.H. Pollock, C. Brownie, A likelihood-based approach to capture–recapture estimation of demographic parameters under the robust design, Biometrics 52 (1995) 293. [21] P.S.F. Yip, A martingale estimating equation for a capture–recapture experiment in discrete time, Biometrics 47 (1991) 1081. [22] P.S.F. Yip, Statistical inference procedure for a hypergeometric model for capture–recapture experiment, Appl. Math. Comput.tion 58 (1993) 225. [23] W.D. Hwang, A. Chao, Quantifying the effects of unequal catchabilities on Jolly–Seber estimators via sample coverage, Biometrics 51 (1995) 128. [24] C.J. Schwarz, A.N. Arnason, A general methodology for the analysis of capture– recapture experiments in open populations, Biometrics 52 (1996) 860. [25] R.M. Huggins, P.S.F. Yip, Estimation of the size of an open population from capture–recapture data using weighted martingale methods, Biometric 55 (1999) 387. [26] R.M. Huggins, Semiparametric estimation of animal abundance using capture– recapture data from open populations, Biometrics 62 (2006) 684. [27] R.M. Huggins, J. Stoklosa, Semiparametric inference for open populations using the Jolly–Seber model: a penalized spline approach, J. Stat. Comput. Simul. 83 (2013) 1741. [28] D.R. Sutherland, P. Dann, Improving the accuracy of population size estimates for burrowing seabirds, Ibis 154 (2012) 488. [29] M. Collins, J.M. Cullen, P. Dann, Seasonal and annual foraging movements of little penguins from Phillip Island, Victoria, Wildlife Res. 26 (1999) 705. [30] C. McCutcheon, P. Dann, M. Salton, L. Renwick, A. Gormley, J. Arnould, Foraging range of Little Penguins during winter, Emu 111 (2011) 321. [31] P.N. Reilly, J.M. Cullen, The Little Penguin Eudyptula minor in Victoria, IV: Moult, Emu 83 (1983) 94. [32] P. Dann, Distribution, population trends and factors influencing the population size of Little Penguins Eudyptula minor on Phillip Island, Victoria, Emu 91 (1992) 263. [33] H. Yang, L. Chambers, R.M. Huggins, A seasonal decomposition of the estimated size of a penguin population at Phillip Island, Australia, J. Zool. 53 (2005) 111. [34] P. Dann, F.I. Norman, J.M. Cullen, F.J. Neira, A. Chiaradia, Mortality and breeding failure of little penguins, Eudyptula minor, in Victoria, 1995–1996, following a widespread mortality of pilchard, Sardinops sagax, Marine and Fresh-water, Research 51 (2000) 355. [35] A. Chiaradia, M.G. Forero, K.A. Hobson, J.M. Cullen, Changes in diet and trophic position of a top predator ten years after a mass mortality of a key prey, ICES J. Mar. Sci. 67 (2010) 1710. [36] A. Chiaradia, A. Costalunga, K. Kerry, The diet of Little Penguins (Eudyptula minor) at Phillip Island, Victoria, in the absence of a major prey Pilchard (Sardinops sagax), Emu 103 (2003) 43. [37] K.H. Pollock, J.D. Nichols, C. Brownie, J.E. Hines, Statistical inference for capture recapture experiments, Wildlife Monogr. 107 (1990) 97. [38] J.M. Cullen, L.E. Chambers, P.C. Coutin, P. Dann, Predicting onset and success of breeding in little penguins Eudyptula minor from ocean temperature, Mar. Ecol. Prog. Ser. 378 (2009) 69. [39] E. Johannesen, H. Steen, L. Perriman, Seasonal variation in survival, weights, and population counts of blue penguins (Eudyptula minor) in Otago, New Zealand, New Zeal. J. Zool. 29 (2002) 213. [40] P.H.C. Eilers, B.D. Marx, Flexible smoothing with B-splines and penalties, Stat. Sci. 11 (1996) 89. [41] P.H.C. Eilers, B.D. Marx, Splines, knots, and penalties, Wiley Interdiscipl. Rev.: Comput. Stat. 2 (2010) 637. [42] J.O. Ramsay, B.W. Silverman, Functional Data Analysis, second ed., Springer, New York, 2005. [43] J.O. Ramsay, G. Hooker, S. Graves, Functional Data Analysis with R and MATLAB, User R, Springer, London, 2009.

J. Stoklosa et al. / Mathematical Biosciences 255 (2014) 43–51 [44] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, 2013. .ISBN 3-900051-07-0. [45] P. Dann, J.M. Cullen, R. Thoday, R. Jessop, The movements and patterns of mortality at sea of Little Penguins Eudyptula minor from Phillip Island, Victoria, Emu 91 (1992) 278. [46] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, New York, 2001.

51

[47] P. Dunn, G. Smyth, Randomized quantile residuals, J. Comput. Graph. Stat. 5 (1996) 236. [48] C. Fraley, A.E. Raftery, Model-based clustering, discriminant analysis, and density estimation, J. Am. Stat. Assoc. 18 (2002) 611. [49] R.M. Fewster, S.T. Buckland, G.M. Siriwardena, S.R. Baillie, J.M. Wilson, Analysis of population trends for farmland birds using generalized additive models, Ecology 81 (2000) 1970.

Semivarying coefficient models for capture-recapture data: colony size estimation for the little penguin Eudyptula minor.

To accommodate seasonal effects that change from year to year into models for the size of an open population we consider a time-varying coefficient mo...
848KB Sizes 0 Downloads 3 Views