Accident Analysis and Prevention 79 (2015) 133–144


Accident Analysis and Prevention journal homepage: www.elsevier.com/locate/aap

Prioritizing Highway Safety Manual’s crash prediction variables using boosted regression trees

Dibakar Saha *, Priyanka Alluri 1, Albert Gan 2

Department of Civil and Environmental Engineering, Florida International University, 10555 West Flagler Street, EC 3680, Miami, FL 33174, United States

ARTICLE INFO

Article history:
Received 15 September 2014
Received in revised form 14 February 2015
Accepted 10 March 2015
Available online xxx

ABSTRACT

The Highway Safety Manual (HSM) recommends using the empirical Bayes (EB) method with locally derived calibration factors to predict an agency’s safety performance. However, the data needs for deriving these local calibration factors are significant, requiring very detailed roadway characteristics information. Many of the data variables identified in the HSM are currently unavailable in the states’ databases. Moreover, the process of collecting and maintaining all the HSM data variables is cost-prohibitive. Prioritization of the variables based on their impact on crash predictions would, therefore, help to identify influential variables for which data could be collected and maintained for continued updates. This study aims to determine the impact of each independent variable identified in the HSM on crash predictions. A relatively recent data mining approach called boosted regression trees (BRT) is used to investigate the association between the variables and crash predictions. The BRT method can effectively handle different types of predictor variables, identify very complex and non-linear associations among variables, and compute variable importance. Five years of crash data from 2008 to 2012 on two urban and suburban facility types, two-lane undivided arterials and four-lane divided arterials, were analyzed to estimate the influence of variables on crash predictions. Variables were found to exhibit non-linear and sometimes complex relationships with predicted crash counts. In addition, only a few variables were found to explain most of the variation in the crash data. Published by Elsevier Ltd.

Keywords: Highway Safety Manual; Data mining; Boosted regression trees; Variable importance; Crash predictions; Calibration factor

* Corresponding author. Tel.: +1 786 488 7744.
E-mail addresses: dsaha003@fiu.edu (D. Saha), palluri@fiu.edu (P. Alluri), gana@fiu.edu (A. Gan).
1 Tel.: +1 305 348 1896.
2 Tel.: +1 305 348 3116.
http://dx.doi.org/10.1016/j.aap.2015.03.011
0001-4575/ Published by Elsevier Ltd.

1. Introduction

The Highway Safety Manual (HSM), published by the American Association of State Highway and Transportation Officials (AASHTO) in 2010, is designed to “assist agencies in their effort to integrate safety into their decision-making processes” (AASHTO, 2010). Part C of the HSM presents predictive models to estimate predicted average crash frequency at individual sites on different roadway facilities including rural two-lane two-way roads, rural multilane highways, and urban and suburban arterials. The general form of the predictive models in the HSM can be expressed as follows:

$$N_{predicted,i} = N_{spf,i} \times (CMF_{1,i} \times CMF_{2,i} \times \cdots \times CMF_{n,i}) \times C_i \tag{1}$$

where $N_{predicted,i}$ is the predicted average crash frequency for a specific year for site type $i$; $N_{spf,i}$ is the predicted average crash frequency for a specific year for site type $i$ for base conditions; $CMF_{1,i}, \ldots, CMF_{n,i}$ are crash modification factors for $n$ geometric conditions or traffic control features for site type $i$; and $C_i$ is the calibration factor to adjust the SPF for local conditions for site type $i$.

As shown in Eq. (1), there are three components of the predictive models: base safety performance functions (SPFs), crash modification factors (CMFs), and calibration factors. Base SPFs are statistical models that are used to estimate predicted average crash frequency for a facility type with specified base conditions. CMFs are used to account for the effects of non-base conditions on predicted crashes. Calibration factors are required “to account for differences between the jurisdiction and time period for which the predictive models were developed and the jurisdiction and time period to which they are applied by HSM users” (AASHTO, 2010). The calibration factor is estimated as the ratio of the total number of observed crashes to the total number of predicted crashes calculated using the SPFs and CMFs provided in the HSM.

The predictive models are most effective when calibrated to local conditions (Findley et al., 2012; Lu, 2013; Sun et al., 2006; Young and Park, 2013). Very detailed roadway geometry, traffic, and crash characteristics data are needed to derive local calibration factors. Several of


the variables are often unavailable in the states’ databases. Collecting and maintaining all the data variables on the entire road network for the purpose of implementing the HSM is not cost-feasible. Therefore, a process to streamline the data requirements that minimizes the potential impacts to the quality of analysis is desirable.

The objective of this study is to investigate the impact of the variables identified in the HSM on crash predictions. The study used five years of crash data from 2008 to 2012 on urban and suburban two-lane undivided arterials and urban and suburban four-lane divided arterials in Florida. Boosted regression tree (BRT), a data mining approach, is applied to evaluate variables’ importance and analyze their marginal effects on crash predictions.

2. Literature review

Traditionally, statistical regression models are developed in highway safety studies to associate crash frequency with the most significant variables (for example, Hadi et al., 1995; Abdel-Aty and Radwan, 2000; Sawalha and Sayed, 2001; Hauer et al., 2004; Caliendo et al., 2007; Cafiso et al., 2010). The models, however, were limited in their scope to evaluate the influence of predictor variables on crash outcome. Few studies have identified and ranked the influence of predictor variables on crash predictions using sensitivity analysis (Alluri and Ogle, 2012; Findley et al., 2012; Jalayer and Zhou, 2013). The typical approach used in sensitivity analysis is to set the value of one predictor variable to its maximum, minimum, and/or average, and estimate the change in output relative to the output generated from using the actual values of the variable. The variables that produced substantial changes in predicted crash frequencies were identified as influential variables.
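The one-at-a-time sensitivity approach described above can be illustrated with a minimal sketch built on the form of Eq. (1). The function names, the CMF names, and every numeric value below are hypothetical and are not taken from the HSM or from the cited studies; for simplicity the sketch perturbs CMF values directly rather than the underlying roadway variables.

```python
from math import prod


def predicted_crashes(n_spf, cmfs, calibration=1.0):
    """Form of Eq. (1): N_predicted = N_spf * (CMF_1 * ... * CMF_n) * C."""
    return n_spf * prod(cmfs.values()) * calibration


def calibration_factor(total_observed, total_predicted):
    """Ratio of total observed crashes to total predicted crashes."""
    return total_observed / total_predicted


def one_at_a_time_sensitivity(n_spf, actual_cmfs, cmf_extremes, calibration=1.0):
    """Rank CMFs by the largest relative change in the prediction obtained
    when one CMF is moved to an extreme value and all others keep their
    actual values."""
    base = predicted_crashes(n_spf, actual_cmfs, calibration)
    influence = {}
    for name, extremes in cmf_extremes.items():
        changes = []
        for value in extremes:
            altered = dict(actual_cmfs)
            altered[name] = value  # alter a single variable only
            new = predicted_crashes(n_spf, altered, calibration)
            changes.append(abs(new / base - 1.0))
        influence[name] = max(changes)
    return sorted(influence.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inputs, for illustration only.
actual = {"lighting": 0.95, "on_street_parking": 1.10}
extremes = {"lighting": (0.90, 1.00), "on_street_parking": (1.00, 1.50)}
ranking = one_at_a_time_sensitivity(2.0, actual, extremes, calibration=1.2)
```

Because each variable is perturbed while the others are held fixed, a ranking of this kind cannot reflect joint effects between variables.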
The main limitation of this approach is that only a single variable is evaluated at one time and, therefore, the possible association between the variables is ignored while measuring the effect of each variable on crash predictions. Data mining procedures are increasingly being applied in transportation safety studies to capture the complex and nonlinear relation between data variables and crash characteristics. Tree-based method, typically known as classification and regression tree (CART) (Breiman et al., 1984) or decision tree, is a suitable data mining approach in this regard (Williams, 2011). The CART method provides several benefits over traditional regression

models. It does not require a pre-specified function and variable transformation for developing models. It can intrinsically identify non-linear association among predictor variables. Also, the method provides interpretable results by demonstrating the relative influence of each variable on model prediction.

Karlaftis and Golias (2002) developed a CART model, termed a hierarchical tree-based regression (HTBR) model, to estimate the relative contribution of variables on crash frequency for rural two-lane and multilane roads in Indiana. Yan and Radwan (2006) analyzed crashes involving two vehicles at signalized intersections by developing two classification tree models; one was built to obtain the causal features associated with rear-end crashes compared to non-rear-end crashes, and the other was built to identify the factors attributed to at-fault drivers/vehicles against not-at-fault drivers/vehicles for rear-end crashes. Chang and Wang (2006) and Kashani and Mohaymany (2011) applied the CART method to quantify the effects of vehicle, driver, and crash attributes on injury severity. Elmitiny et al. (2010) fitted a CART model to analyze the behavioral patterns of drivers’ stop-and-go situations and red-light running violations at an intersection.

Although CART models have several advantages, they can sometimes be unstable and produce output with high variance (Zhang et al., 2005). The ensemble approach that combines outputs from a collection of trees reduces prediction error and is usually more stable (Das et al., 2009; De’ath, 2007; Elith et al., 2008). There are two ensemble approaches based on decision trees: random forests and boosted regression trees (BRT). In random forests, predictions are computed by fitting a number of trees (typically 50–1000) using a bootstrap sample of data and a subset of predictors. Harb et al.
(2009) used random forests technique to estimate the relative importance of variables on the binary outcome of drivers’ crash avoidance maneuvers. Abdel-Aty and Haleem (2011) applied random forests method to determine the importance of the explanatory variables in predicting angle crash frequency at unsignalized intersections. Alluri et al. (2014) prioritized the variables in the HSM for several segment and intersection subtypes using random forests algorithm. Unlike random forests, the BRT method produces the assembly of trees with a slow learning rate and in a sequential manner to extract more variability in data. One major advantage of the BRT approach over other tree-based models is that it can also rigorously deal with different types of response variable such as binomial,

Table 1
Variables identified in the HSM for urban and suburban arterials.

Variable                                   Type
Average annual daily traffic (AADT)        Continuous
Segment length                             Continuous
Median width (a)                           Categorical (10 levels: 10 ft, 20 ft, 30 ft, 40 ft, 50 ft, 60 ft, 70 ft, 80 ft, 90 ft, 100 ft)
Number of major commercial driveways       Continuous
Number of major residential driveways      Continuous
Number of major industrial driveways       Continuous
Number of minor commercial driveways       Continuous
Number of minor residential driveways      Continuous
Number of minor industrial driveways       Continuous
Number of other driveways                  Continuous
Number of roadside objects                 Continuous
Speed limit                                Categorical (Two levels: 30 mph, >30 mph)
Presence of on-street parking              Categorical (Two levels: absent, present)
Presence of lighting                       Categorical (Two levels: absent, present)
Presence of automated speed enforcement    Categorical (Two levels: absent, present)
Type of on-street parking                  Categorical (Two levels: angle parking, parallel parking)
Parking by land use type                   Categorical (Two levels: residential/other, commercial or industrial/institutional)
Curb length with on-street parking         Continuous
Offset to roadside objects                 Categorical (Seven levels: 2 ft, 5 ft, 10 ft, 15 ft, 20 ft, 25 ft, 30 ft)

(a) Median width is applicable only for four-lane divided arterials.


count, normal, etc. with the use of an appropriate and robust loss function (Elith et al., 2008).

Chung (2013) explored the potential of the BRT approach in safety analysis. Single-vehicle motorcycle crashes in Taiwan were used to quantify the impact of contributing factors on injury severity (i.e., fatal vs. non-fatal crashes). Several BRT models were developed to investigate the relative contribution of the variables, evaluate their marginal effects, and demonstrate the effect of pair-wise interactions on crash predictions. The study concluded that since crash analysis usually involves a number of explanatory variables and crash events are distinctive, the slow and sequential learning process of the BRT method would be able to give more emphasis to hard-to-fit observations. Ahmed and Abdel-Aty (2013) applied the BRT method to develop a framework for real-time risk assessment on freeways by fusing detector, weather, and roadway geometry data.

Although the BRT method is applied in various fields including ecology (De’ath, 2007; Elith et al., 2008; Esther et al., 2014; Hale et al., 2014), epidemiology (Cheong et al., 2014; Ellis et al., 2013; Neumann et al., 2004), soil mapping (Jafari et al., 2014; Lemercier et al., 2012), agriculture (Etter et al., 2006; Gellrich et al., 2008; Müller et al., 2013), and fisheries (Froeschke and Froeschke, 2011), its application in transportation engineering, particularly in highway safety, is underexplored.

3. Data collection and preparation

Table 1 provides the list of variables identified in the HSM for urban and suburban roadway facilities. The roadway characteristics inventory (RCI) database maintained by the Florida Department of Transportation (FDOT) is the primary source of information for the data variables on Florida roadways. Data were extracted from the RCI for urban and suburban arterials that are part of the state highway system in Florida.
However, data were available only for three variables, AADT, median width, and speed limit. AADT data were extracted for five years (i.e., 2008–2012) from the corresponding year’s RCI database, while data on median width and speed limit were retrieved for the most recent year (i.e., 2012). Data for the remaining variables were collected using aerial images assuming that roadway geometry characteristics did not change during the analysis period. An in-house web-based application was developed to facilitate the data collection process using Google Maps. The application works as follows. It first reads a linear-referenced roadway segment for which the data are to be


collected, converts its coordinates to the Google Maps projection on the fly, and then displays the segment on Google Maps, in either the street view or the aerial view, depending on the user selection. The application allows the user to scan from the beginning to the end milepost, similar to a typical video log system. An important component of the application is that it includes an input pane that allows the user to quickly specify and record the observed data pertaining to each segment. Categorical variables were provided with the specific options to select from the dropdown list and data for continuous variables were entered directly in the box.

The collected roadway geometry and traffic data were thoroughly checked for possible outliers and inconsistencies. For example, sites with extremely high or extremely low AADT values were considered as outliers and, therefore, were excluded from analysis. Individual sites with a huge difference in AADT between consecutive years were also discarded. Furthermore, the sites identified with any type of construction work were not included in the analysis. Finally, a total of 1791 urban and suburban two-lane undivided arterial segments and 4969 urban and suburban four-lane divided arterial segments were included in the analysis.

In addition, crash data were obtained for five years (i.e., 2008–2012) from FDOT’s Unified Basemap Repository (UBR) system. Crashes were assigned to segments based on segment ID, crash location ID and the milepost at which the crash occurred. Crashes that occurred on the point between two roadway segments were consistently assigned to the beginning segment. Overall, during the five-year analysis period, 5415 crashes were observed on the 616.7 miles of urban and suburban two-lane undivided arterials, and 28,314 crashes were observed on the 1400.7 miles of urban and suburban four-lane divided arterials.

4. Methodology

This section discusses the methodology of the BRT technique. It includes the underlying principle of the BRT method, the algorithm for fitting a BRT model, and a synopsis of regularization parameters for optimizing BRT models.

4.1. Basic principle of BRT

The BRT method is based on two powerful procedures: regression trees and boosting (Elith et al., 2008). Regression trees are decision-tree based models formed by dividing the predictor

Fig. 1. Illustration of a decision tree (Kashani and Mohaymany, 2011).


space into a number of mutually exclusive regions (Hastie et al., 2009). Each region contains a group of observations based on a series of decision rules. The process can be demonstrated using an inverted tree structure, with root at the top representing the entire data set and leaves at the bottom representing the regions (see Fig. 1). Trees are typically grown by binary recursive partitioning through a series of splits or nodes. First, the root or parent node is “partitioned” into two child nodes based on the value of an independent variable (splitter) that generates maximum homogeneity in the child nodes. The process is “binary” because it splits a node into two nodes and “recursive” because the binary split successively occurs in each of the child nodes until there is a prespecified criterion to stop the process. The nodes where trees cease to grow are called terminal nodes, or leaves. The mean values provided by the terminal nodes are the predictions (Kashani and Mohaymany, 2011).

The term ‘boosting’ refers to an ensemble approach that fits a number of trees in a sequential process. The basic idea is to give more emphasis to poorly fitted observations (i.e., the observations that highly deviate from the mean) based on the results from the previous tree alone rather than from all the other previously fitted trees (Bühlmann and Hothorn, 2007). The boosting procedure thus combines predictions from many weak models so as to produce a strong prediction and improve model accuracy (De’ath, 2007; Elith et al., 2008; Hastie et al., 2009).

4.2. Algorithm

In BRT, once the first tree is fitted to the training data, the residuals are calculated. The observations with high residual values indicate poor fit and, therefore, are assigned more weight to fit by the next tree. In the next and subsequent steps, trees are fitted to the residuals of the previous tree, and so on.
The BRT procedure is thus a forward and stagewise procedure where “the existing trees are left unchanged; only the residual for each observation is reestimated to reflect the contribution of the newly added tree” (Elith et al., 2008).

The algorithm for the BRT model can be described as follows (De’ath, 2007; Hastie et al., 2009). Let $x$ be a set of predictor variables and a function $f(x)$ be an approximation of the response variable $y$. The BRT model estimates the function as an additive expansion of basis functions, $b(x; \gamma_m)$, as follows:

$$f(x) = \sum_m f_m(x) = \sum_m \beta_m \, b(x; \gamma_m) \tag{2}$$

where $\beta_m$ ($m = 1, 2, \ldots, M$) are the expansion coefficients and $b(x; \gamma_m)$ are single regression trees with the parameter $\gamma_m$ representing the split variables, their values at the splitting nodes, and the predicted values at the terminal nodes. The coefficients $\beta_m$ represent weights given to the nodes of each tree and determine how predictions from each of the trees are combined (De’ath, 2007). The parameters $\beta_m$ and $\gamma_m$ are estimated by minimizing a specified loss function, $L(y, f(x))$, which indicates a measure of prediction performance (e.g., deviance).

Friedman (2001) formulated a numerical optimization technique called ‘functional gradient descent’ that approximates the solution of loss function minimization by the method of steepest descent for a forward stagewise BRT model. The procedure can be summarized in the following steps (De’ath, 2007):

1. Initialize $f_0(x)$.
2. For $m = 1$ to $M$ (number of trees):
   a) For $i = 1$ to $N$ (number of observations), calculate the residuals
      $$r_{im} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}$$
   b) Fit a regression tree to $r_{im}$ to estimate $\gamma_m$ of $b(x; \gamma_m)$.
   c) Obtain the estimate $\beta_m$ by minimizing $L(y_i, f_{m-1}(x_i) + \beta \, b(x_i; \gamma_m))$.
   d) Update $f_m(x) = f_{m-1}(x) + \beta_m \, b(x; \gamma_m)$.
3. Calculate $f(x) = \sum_m f_m(x)$.
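The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the gbm implementation used in the study: it handles a single predictor, uses two-leaf stumps as the regression trees, models the log of the mean with a Poisson loss $L(y, f) = e^{f} - y f$, replaces the minimization in step (c) with a bounded ternary search, and includes a learning rate of the kind discussed in Section 4.3. Stochastic subsampling is omitted. All function names, the search bound, and the toy data are hypothetical.

```python
import math


def fit_stump(x, r):
    """Least-squares stump on one predictor: (threshold, left_mean, right_mean)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    best = None
    for k in range(1, len(order)):
        lo, hi = x[order[k - 1]], x[order[k]]
        if lo == hi:
            continue  # cannot split between identical values
        left = [r[i] for i in order[:k]]
        right = [r[i] for i in order[k:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, (lo + hi) / 2.0, lm, rm)
    if best is None:  # all predictor values identical: constant stump
        mean = sum(r) / len(r)
        return (x[0], mean, mean)
    return best[1:]


def stump_predict(stump, xi):
    threshold, lm, rm = stump
    return lm if xi <= threshold else rm


def line_search(y, f, h, upper=5.0, iters=60):
    """Step (c): ternary search for beta in [0, upper] (an assumed bound)
    minimizing the convex Poisson loss sum(exp(f + beta*h) - y*(f + beta*h))."""
    def loss(beta):
        return sum(math.exp(fi + beta * hi) - yi * (fi + beta * hi)
                   for fi, hi, yi in zip(f, h, y))
    low, high = 0.0, upper
    for _ in range(iters):
        a = low + (high - low) / 3.0
        b = high - (high - low) / 3.0
        if loss(a) < loss(b):
            high = b
        else:
            low = a
    return (low + high) / 2.0


def fit_brt(x, y, n_trees=100, learning_rate=0.1):
    f0 = math.log(sum(y) / len(y))                  # step 1: initialize f_0
    f = [f0] * len(y)
    trees = []
    for _ in range(n_trees):                        # step 2
        r = [yi - math.exp(fi) for yi, fi in zip(y, f)]   # (a) negative gradient
        stump = fit_stump(x, r)                           # (b) fit tree to residuals
        h = [stump_predict(stump, xi) for xi in x]
        beta = line_search(y, f, h)                       # (c) expansion coefficient
        weight = learning_rate * beta
        f = [fi + weight * hi for fi, hi in zip(f, h)]    # (d) update
        trees.append((weight, stump))
    return f0, trees


def predict(model, xi):
    """Step 3: sum the tree contributions and map back to the count scale."""
    f0, trees = model
    return math.exp(f0 + sum(w * stump_predict(s, xi) for w, s in trees))


def poisson_deviance(y, mu):
    total = 0.0
    for yi, mi in zip(y, mu):
        log_term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += log_term - (yi - mi)
    return 2.0 * total
```

Because each shrunken step lies between zero and the line-search minimizer of a convex loss, the training loss does not increase from one tree to the next in this sketch.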

Friedman (2002) also proposed introducing some randomness into the gradient boosting procedure in order to improve prediction performance and to reduce overfitting and computation time. This is attained by extracting a portion of the training data (usually 50–75%) without replacement at each iteration.

4.3. Regularization parameters

The sequential tree building process continues to add trees until all the observations are perfectly fit. This can lead to overfitting of the training data. In order to fit a balanced model that reduces overfitting and improves prediction accuracy, regularization of the BRT models is important. The regularization process usually involves optimizing three parameters, shrinkage, tree complexity, and number of trees, to obtain a balance between bias and variance.

Shrinkage, also called learning rate, is used to reduce the contribution of each tree in the model. It is implemented by applying a small number, usually from 0.1 to as low as 0.0001. A smaller shrinkage value minimizes the loss function better; however, it requires more trees to be added to the model. A 10-fold reduction in shrinkage requires approximately 10 times more trees to be fitted (De’ath, 2007).

Tree complexity represents the depth of a tree, implying interaction among predictor variables. A tree complexity of 1 corresponds to the main effect of predictor variables, and generates each tree with only two terminal nodes, which are called single decision stumps (Hastie et al., 2009). A tree complexity of 2 develops models with up to two-way interactions between variables (i.e., a maximum of two nodes in each branch), and so on (Hastie et al., 2009). In order to reflect the complexity between variables and utilize the strength of the BRT model, trees need to be grown with higher levels of tree complexity. The success of a BRT model thus depends on the optimal settings of these regularization parameters.

5. Analysis setup

A series of BRT models were generated with combinations of shrinkage (0.05, 0.01, 0.005, 0.001, and 0.0005) and tree complexity (1, 5, 10, and 15) values by fitting a total of 20,000 trees for both urban and suburban two-lane undivided and four-lane divided arterial segments. The analysis was carried out using the gbm package of the statistical software R (R Core Team, 2014). Since crashes are random, non-negative, and discrete events, the models were built using the Poisson distribution, where the dependent variable was the total number of observed crashes in five years. The logarithm of the product of segment length and number of years used in the analysis (i.e., five) was included as an offset factor in the model formulation to obtain the output (i.e., crash predictions) in crashes per mile per year.

The variables listed in Table 1 were used as predictor variables for developing BRT models. However, several considerations were made. For example, type of on-street parking, parking by land use type, and curb length with on-street parking are only applicable for locations with an on-street parking facility. These variables will clearly be highly correlated with the presence of on-street parking variable. Similarly, offset to roadside objects is only applicable for locations with roadside objects. To avoid the known effect of correlation, these variables were not included in the analysis. Additionally, driveways were included in the model as driveway density, i.e., driveways per mile. Tables 2 and 3 provide


the descriptive statistics of the variables for urban and suburban two-lane undivided arterials and urban and suburban four-lane divided arterials, respectively. Cross-validation (CV) procedure was used for developing the BRT models. The CV procedure divides the full data set into n mutually exclusive subsets, and fits the data n times. In each run, a different subset is chosen as test set while the remaining (n-1) subsets are used to train the model. Since all the data are being used at some stage for training as well as for validation, the model thus can learn better from data through this procedure. A 10-fold CV was used to determine which combination of shrinkage and tree complexity better optimizes the BRT model. To reduce overfitting, models (or trees) were fitted stochastically by a random extraction of 50% of the training data without replacement at each iteration. The stopping criterion for a node from splitting further was set such that terminal nodes must have at least 10 observations.
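The setup described in this section can be outlined with a short sketch that enumerates the shrinkage and tree complexity combinations, computes the exposure offset, and builds the 10-fold cross-validation splits and the 50% subsample. It is an illustration only, assuming simple random fold assignment; it is not the gbm code used in the study, and the function names and the seeds are hypothetical.

```python
import itertools
import math
import random

SHRINKAGE = [0.05, 0.01, 0.005, 0.001, 0.0005]
TREE_COMPLEXITY = [1, 5, 10, 15]
MIN_OBS_IN_NODE = 10   # terminal nodes must hold at least 10 observations
MAX_TREES = 20000      # trees fitted for every parameter combination


def parameter_grid():
    """All shrinkage x tree-complexity combinations that were compared."""
    return list(itertools.product(SHRINKAGE, TREE_COMPLEXITY))


def exposure_offset(segment_length_miles, years=5):
    """log(length x years): with this offset in a log-link Poisson model,
    the remaining term is expressed in crashes per mile per year."""
    return math.log(segment_length_miles * years)


def k_fold_indices(n_obs, k=10, seed=0):
    """Yield (train, test) index lists for k mutually exclusive folds."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for j in range(k):
        test = folds[j]
        train = [i for m, fold in enumerate(folds) if m != j for i in fold]
        yield train, test


def subsample(train, fraction=0.5, seed=0):
    """Random extraction of a fraction of the training data without replacement."""
    return random.Random(seed).sample(train, int(len(train) * fraction))
```

For example, `list(k_fold_indices(1791))` gives ten train/test splits for the two-lane segments, and each of the 20 combinations returned by `parameter_grid()` would be evaluated on every split.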


6. Analysis and results

This section presents the study results. The BRT model outputs are first presented to show model performance and parameter optimization. Based on the optimal parameter values, variable importance and the marginal effect of variables on crash prediction are evaluated.

6.1. Optimization of parameters

Fig. 2 shows the performance of the BRT models for urban and suburban two-lane undivided arterials. The curves in the figure depict how Poisson deviance is reduced over the ensemble of 20,000 trees with different sets of shrinkage factors and tree complexities. It is obvious that models with larger shrinkage factors (0.05, 0.01, and 0.005) would fit fewer trees, whereas models with smaller shrinkage factors (0.001 and 0.0005) would fit several thousands of trees to gradually converge to the minimum loss in Poisson deviance. For example, at tree complexity levels higher than 1, shrinkage factors between 0.05 and 0.005 fitted less than 4000 trees and shrinkage factors of 0.001 and 0.0005 fitted more than 5000 trees to achieve minimum deviance. Note that shrinkage factor of 0.0005 did not achieve minimum deviance at tree complexities of 1 and 5 by fitting 20,000 trees. It is also observed that relatively fewer trees were required to fit models

Table 2
Descriptive statistics for urban and suburban two-lane undivided arterials.

Continuous variable                                        Minimum   Maximum   Average   Standard deviation
AADT (veh/day), 2008                                       800       29,000    11,410    5180
AADT (veh/day), 2009                                       850       30,500    10,709    5058
AADT (veh/day), 2010                                       800       27,500    10,766    4968
AADT (veh/day), 2011                                       550       27,500    10,589    4799
AADT (veh/day), 2012                                       550       28,500    10,255    4699
Segment length (miles)                                     0.04      2         0.34      0.35
Number of major commercial driveways                       0         6         0.09      0.46
Number of major residential driveways                      0         26        0.12      1.07
Number of major industrial/institutional driveways         0         9         0.11      0.53
Number of minor commercial driveways                       0         43        0.71      2.74
Number of minor residential driveways                      0         129       4.46      10.98
Number of minor industrial/institutional driveways         0         27        0.74      2.52
Number of other driveways                                  0         23        0.29      1.52
Roadside fixed object density (objects per mile)           0         560.98    52.25     46.50

Categorical variable                       Category   Number of segments
Speed category                             30 mph     116
                                           >30 mph    1675
Presence of on-street parking              Yes        85
                                           No         1706
Presence of lighting                       Yes        746
                                           No         1045
Presence of automated speed enforcement    Yes        4
                                           No         1787

Table 3
Descriptive statistics for urban and suburban four-lane divided arterials.

Continuous variable                                        Minimum   Maximum   Average   Standard deviation
AADT (veh/day), 2008                                       1900      81,500    27,723    10,778
AADT (veh/day), 2009                                       2500      79,500    26,490    10,541
AADT (veh/day), 2010                                       1200      88,500    25,973    10,190
AADT (veh/day), 2011                                       1650      89,000    26,099    10,339
AADT (veh/day), 2012                                       950       86,000    25,613    10,078
Segment length (miles)                                     0.04      2.03      0.28      0.34
Number of major commercial driveways                       0         2         0.09      1.20
Number of major residential driveways                      0         3         0.05      0.60
Number of major industrial/institutional driveways         0         0         0         0.22
Number of minor commercial driveways                       0         19        1.86      3.78
Number of minor residential driveways                      0         67        1.43      2.72
Number of minor industrial/institutional driveways         0         0         0         0.46
Number of other driveways                                  0         0         0         0.56
Roadside fixed object density (objects per mile)           0         586.21    144.25    59.34

Categorical variable                       Category   Number of segments
Speed category                             30 mph     97
                                           >30 mph    4872
Median width                               10 ft      205
                                           20 ft      1941
                                           30 ft      786
                                           40 ft      1525
                                           50 ft      260
                                           60 ft      125
                                           70 ft      51
                                           80 ft      14
                                           90 ft      14
                                           100 ft     46
Presence of on-street parking              Yes        74
                                           No         4895
Presence of lighting                       Yes        3699
                                           No         1270
Presence of automated speed enforcement    Yes        80
                                           No         4889


with increasing tree complexities for the same shrinkage value. Note that Elith et al. (2008) recommended selecting models that have fitted at least 1000 trees.

Models fitted with tree complexity level 1 had the highest deviance compared to models fitted with higher tree complexity levels 5, 10, and 15. This means that decision trees with only two terminal nodes (tree complexity of 1) could not explain the variability of the data well, even at the price of fitting more trees; rather, models fitted with interaction among variables would be more effective in capturing the complexity of crash frequency data.

Table 4 presents minimum values of Poisson deviance of BRT models for urban and suburban two-lane undivided arterials. The numbers in parentheses show the number of trees at which the minimum deviance is achieved. Table 4 shows that models with tree complexity level 5 had a significant reduction in deviance from models with tree complexity level 1. However, the reduction in deviance is not substantial between models with higher tree complexity levels (i.e., 5, 10, and 15). Again, models with shrinkage factor of 0.05 produced larger reduction in Poisson deviance with increasing tree complexities, while models with shrinkage factors of 0.01 and 0.005 had better reduction in Poisson deviance at tree complexity level 5 than those at tree complexity levels 10 and 15. The overall minimum loss in Poisson deviance is obtained at shrinkage factor of 0.001 and tree complexity level 10 with an ensemble of 7983 trees. Because the lowest deviance is achieved by fitting more than 1000 trees, shrinkage factor of 0.001 and tree complexity level 10 were considered to have optimized the model to determine variable importance and evaluate the marginal effects of variables for urban and suburban two-lane undivided arterials.

Fig. 3 shows the performance of BRT models for urban and suburban four-lane divided arterials. It is observed that the BRT models for this facility type fitted more trees to reach a minimum value of Poisson deviance compared to those for two-lane facilities for the same set of regularization parameters. Similar to the curves in Fig. 2, the curves in Fig. 3 show gradual descending

Fig. 2. BRT model performance for urban and suburban two-lane undivided arterials.


Table 4
Minimum deviance of BRT models for urban and suburban two-lane undivided arterials.

Minimum Poisson deviance (optimum number of trees)

Shrinkage   Tree complexity: 1    Tree complexity: 5    Tree complexity: 10   Tree complexity: 15
0.05        2.8352 (515)          2.9065 (233)          2.9103 (176)          2.9125 (97)
0.01        2.8377 (3955)         2.9241 (1711)         2.9222 (1010)         2.9187 (543)
0.005       2.8366 (7946)         2.9262 (3317)         2.9239 (1734)         2.9194 (884)
0.001       2.8295 (19,939)       2.9246 (17,135)       2.9287 (7983)         2.9223 (5184)
0.0005      2.8094 (20,000)(a)    2.9185 (19,998)(a)    2.9274 (13,947)       2.9224 (9575)

(a) Poisson deviance did not reach its minimum value.

of the Poisson deviance at shrinkage values of 0.001 and 0.0005. Also, models with increasing tree complexities fitted fewer trees to converge to the minimum Poisson deviance.

Table 5 presents the corresponding minimum values of Poisson deviance of BRT models for urban and suburban four-lane divided roads. Models with tree complexity of 15 did not perform better in reducing the deviance than models with lower tree complexity levels (5 and 10). The lowest reduction in Poisson deviance occurred at tree complexity of 10 and shrinkage of 0.005 by fitting 1877 trees. Therefore, the BRT model with shrinkage factor of 0.005, tree complexity level 10, and ensemble of 1877 trees was used to determine the importance of variables and show their marginal effects on crash occurrence on urban and suburban four-lane divided arterials.

Fig. 3. BRT model performance for urban and suburban four-lane divided arterials.

Table 5 Minimum deviance of BRT models for urban and suburban four-lane divided arterials.

Minimum Poisson deviance (optimum number of trees)

Shrinkage   Tree complexity: 1   Tree complexity: 5   Tree complexity: 10   Tree complexity: 15
0.05        12.6679 (1336)       12.6798 (181)        12.6842 (112)         12.6786 (112)
0.01        12.6661 (4723)       12.6936 (1915)       12.6906 (786)         12.6836 (650)
0.005       12.6662 (7496)       12.6911 (3061)       12.6961 (1877)        12.6918 (1238)
0.001       12.6588 (19,988)a    12.6940 (14,118)     12.6889 (7596)        12.6896 (5870)
0.0005      12.6317 (20,000)a    12.6885 (19,996)a    12.6904 (15,307)      12.6906 (13,627)

a Poisson deviance did not reach its minimum value.

Table 6 Relative influence of predictor variables for urban and suburban two-lane undivided arterials.

Variable                                   Relative influence (%)   Cumulative influence (%)
AADT                                       44.62                    44.62
Roadside object density                    22.10                    66.72
Minor residential driveway density         11.93                    78.65
Minor commercial driveway density          5.63                     84.28
Major commercial driveway density          5.07                     89.35
Minor industrial driveway density          4.76                     94.11
Other driveway density                     1.64                     95.75
Major industrial driveway density          1.15                     96.90
Presence of lighting                       1.00                     97.90
Speed limit                                0.89                     98.79
Presence of on-street parking              0.86                     99.65
Major residential driveway density         0.35                     100.00
Presence of automated speed enforcement    0.00                     100.00
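The relative influence column of Table 6 comes from the tally described in Section 6.2: improvements credited to each variable's splits are summed within each tree, averaged over the ensemble, and rescaled so that all variables sum to 100%. The Python sketch below illustrates that bookkeeping on made-up split records. The function name `relative_influence`, the input layout, and the toy numbers are assumptions for illustration, and any further transformation before rescaling (such as a square root) is left out.

```python
from collections import defaultdict


def relative_influence(trees):
    """Relative influence (%) per variable.

    trees: list of trees, each a list of (variable, squared_improvement)
    pairs with one pair per split. This layout is hypothetical.
    """
    totals = defaultdict(float)
    for splits in trees:
        for variable, improvement in splits:
            totals[variable] += improvement      # sum over splits and trees
    averaged = {v: s / len(trees) for v, s in totals.items()}  # mean per tree
    scale = sum(averaged.values())
    return {v: 100.0 * s / scale for v, s in averaged.items()}


# Two toy trees; the improvement values are invented for the example.
toy_trees = [
    [("AADT", 4.0), ("speed_limit", 1.0)],
    [("AADT", 2.0), ("AADT", 2.0), ("lighting", 1.0)],
]
influence = relative_influence(toy_trees)
for name, share in sorted(influence.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {share:.1f}%")
```

A variable that is never chosen for a split receives no entry at all, which corresponds to the 0.00% reported for automated speed enforcement in Table 6.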

6.2. Variable importance

The influence of a predictor variable in a single tree is estimated from the number of times the variable is used to split nodes and the squared improvement to the tree attributable to those splits. The influence of the variable is summed over the ensemble of trees, and the average of this sum is regarded as the measure of variable importance in a BRT model (Friedman and Meulman, 2003). The higher the variable importance score, the greater the contribution of the variable to crash predictions.

Table 6 gives the relative influence of predictor variables for urban and suburban two-lane undivided arterials. As expected, AADT is the most influential variable, with a relative contribution of 44.62% to the model. The density of roadside objects, with a contribution of 22.10%, is the second most influential variable. Minor residential driveway density (11.93%) is the only other variable that contributed more than 10% to the BRT model. Cumulatively, these three variables (i.e., AADT, roadside object density, and minor residential driveway density) account for nearly 79% (78.65%) of the total influence. Minor and major commercial driveway density and minor industrial driveway density are the next three most influential variables, each contributing approximately 5–6% to the BRT model for crash predictions. Among the other variables, other driveway density, major industrial driveway density, and lighting each contributed between 1% and 2%, while speed limit, on-street parking, and major residential driveway density each contributed less than 1% to model development. Automated enforcement was hardly observed along the sections of two-lane undivided roads and, therefore, had no influence on crash predictions.

Table 7 shows the relative influence of variables on predicted crash frequency for urban and suburban four-lane divided arterials. AADT, with a relative contribution of nearly 50% (48.71%), is the most influential variable. Major commercial driveway density, roadside object density, and minor commercial driveway density are the next three most influential variables, each contributing more than 10%. Median width is the next most influential variable, contributing approximately 7.5% to the model. Among the other variables, only minor and major residential driveway density contributed more than 1% to the BRT model. The remaining seven variables

Table 7 Relative influence of predictor variables for urban and suburban four-lane divided arterials.

Variable                                   Relative influence (%)   Cumulative influence (%)
AADT                                       48.71                    48.71
Major commercial driveway density          13.57                    62.28
Roadside object density                    12.07                    74.35
Minor commercial driveway density          10.29                    84.64
Median width                               7.46                     92.10
Minor residential driveway density         3.52                     95.62
Major residential driveway density         1.79                     97.41
Major industrial driveway density          0.78                     98.19
Minor industrial driveway density          0.69                     98.88
Presence of lighting                       0.61                     99.49
Presence of automated speed enforcement    0.22                     99.71
Presence of on-street parking              0.16                     99.87
Other driveway density                     0.09                     99.96
Speed limit                                0.04                     100.00

Fig. 4. Partial dependence plots showing marginal effects of predictor variables on crash predictions for urban and suburban two-lane undivided roads.

including major and minor industrial driveway density, lighting, automated speed enforcement, on-street parking, other driveway density, and speed limit had very small influence (i.e., less than 1% each) on predicted crashes.

6.3. Marginal effects of predictor variables

The marginal effect of each variable on crash prediction can be demonstrated using partial dependence plots. The plots illustrate the association of a predictor variable with the response variable while all other variables have an average effect in the model (Elith et al., 2008). Figs. 4 and 5 provide the partial dependence plots of the six most influential variables, each with at least 3% influence on the models, for urban and suburban two-lane undivided arterials and four-lane divided arterials, respectively.

Fig. 4(a) shows the marginal effect of AADT on crash predictions for urban two-lane undivided arterials. The plot demonstrates a non-linear relationship between AADT and predicted crash frequency: crash predictions would likely increase with AADT at an exponentially increasing rate. Figs. 4(b) and (c) show the impacts of roadside object density and minor residential driveway density, respectively, on predicted crashes. Both plots have several peaks and valleys and show a complex pattern of variation in predicted crash frequencies. The plots in Figs. 4(d) and (e) demonstrate a similar pattern of impact on crash predictions by minor and major commercial driveway density, respectively.
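The idea behind the partial dependence plots can be sketched in a few lines of Python: the variable of interest is fixed at each grid value in turn for every observation, and the model predictions are averaged, so the remaining variables enter only through their observed values. This is a minimal brute-force sketch rather than the routine the authors used; `partial_dependence`, `toy_predict`, and the sample data are hypothetical.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction with column `feature` forced to each grid value.

    predict: callable taking one row (a list of predictor values).
    rows: observed data, one list per roadway segment.
    """
    curve = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)       # copy so the data stay untouched
            modified[feature] = value  # override the variable of interest
            total += predict(modified)
        curve.append(total / len(rows))
    return curve


def toy_predict(row):
    """Stand-in for a fitted model; the coefficients are invented."""
    aadt, driveway_density = row
    return 0.0001 * aadt + 0.05 * driveway_density


# Hypothetical segments: [AADT (veh/day), driveways per mile].
segments = [[8000, 10], [15000, 25], [22000, 40]]
aadt_grid = [5000, 10000, 20000, 30000]
for aadt, crashes in zip(aadt_grid, partial_dependence(toy_predict, segments, 0, aadt_grid)):
    print(aadt, round(crashes, 3))
```

With a fitted BRT model in place of `toy_predict`, plotting the returned curve against the grid would give the kind of plot shown in Figs. 4 and 5.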

Crashes would likely increase at varying rates with major and minor commercial driveway density. Also, these two types of driveways did not induce more crashes beyond a particular density. Fig. 4(f) shows that crashes would likely increase with minor industrial driveway density from 0 to 10 and from 20 to 30 driveways per mile, while a decreasing rate of impact is observed for densities between 10 and 20 and between 30 and 40 driveways per mile. The effect on predicted crashes, however, plateaus beyond 40 driveways per mile.

Fig. 5(a) shows the marginal effect of AADT on crash predictions for urban four-lane divided arterials. The plot indicates that the rate of impact of AADT on crashes is non-linear. The likelihood of crash occurrence increases with higher values of AADT at an exponential rate; however, the impact is lower beyond an AADT of approximately 50,000 veh/day. Fig. 5(b) illustrates that a non-linear relationship exists between major commercial driveway density and predicted crashes. Although the plot shows a slight downward trend in crash frequency between 22 and 30 driveways per mile, the overall plot indicates a logarithmic relation between major commercial driveway density and predicted crashes. The plot in Fig. 5(c) indicates that the impact of roadside object density on predicted crash frequencies is quite erratic. The plot also shows that the effect on predicted crashes plateaus at roadside object densities greater than 250 objects per mile, which implies that higher roadside object density does not have any increased impact on predicted crashes. As can be observed from Fig. 5(d), the partial dependence plot of minor commercial driveway density shows an increasing rate of effect on predicted crashes. The plot has two

Fig. 5. Partial dependence plots showing marginal effects of predictor variables on crash predictions for urban and suburban four-lane divided arterials.

dipping points and plateaus at about 60 driveways per mile. This implies a higher degree of non-linearity between minor commercial driveway density and predicted crash counts. Fig. 5(e) shows that a median width of 50 ft would significantly reduce crashes. However, median widths of 30 ft and 100 ft would likely result in more crashes. Although a median width of up to 30 ft may increase median-related crashes (Stamatiadis et al., 2009), the crash likelihood associated with 100 ft wide medians is counter-intuitive. Stamatiadis et al. (2009) referred to the study by Hauer (2000), which concluded that “the effect of median width on total crashes is questionable”. Fig. 5(f) illustrates the relation between minor residential driveway density and crash predictions. The plot shows varying rates of impact on predicted crashes for different ranges of minor residential driveway density. The impact on crashes decreases gradually up to a density of 30 minor residential driveways per mile and then increases sharply for densities between 30 and 50 driveways per mile. Again, beyond 50 driveways per mile, the impact of minor residential driveway density on crash predictions shows little variation.

In summary, the variables including AADT, roadside object density, density of all minor driveway types, and major commercial driveway density have a non-linear relationship with crash predictions. The relations are quite random and complex in nature.

7. Summary and conclusions

Calibration factors are required to adjust crash frequencies predicted using the HSM default safety performance functions (SPFs) to local site conditions. The HSM requires very detailed roadway geometry, traffic, and crash characteristics data to derive local calibration factors, and unfortunately, several of the variables are often not available in the states’ databases.
Agencies are required to collect the missing data to generate calibration factors to be able to implement the HSM. As such, this study aims to prioritize the HSM variables for calibration purposes by determining their influence on crash predictions. The boosted regression trees (BRT) approach was applied to determine the HSM variables’ influence on crash predictions for urban and suburban two-lane undivided arterials and four-lane divided arterials using five years of crash data (from 2008 to 2012) in Florida. In general, the analysis results revealed that higher-order BRT models (i.e., BRT models with higher tree complexity levels) resulted in a better fit than the first-order models (i.e., BRT models with tree complexity level 1), which consider only the main effects of variables. This indicates that models developed with lower tree complexity levels cannot accurately explain complex crash data. Results from the BRT models showed that only a few variables explain most of the variation in crash data. For both urban two-lane undivided and four-lane divided segment types, AADT was found to have the greatest impact on crash predictions. Furthermore, the three most influential variables contributed approximately 75% of the total influence. The variables that accounted for 95% of cumulative influence were identified as high priority variables. Seven out of 13 variables for urban and suburban two-lane undivided roads and six out of 14 variables for urban and suburban four-lane divided arterials collectively accounted for 95% of the total effect of the models. For urban and suburban two-lane undivided facilities, the list of high priority variables includes AADT, roadside object density, minor residential driveway density, minor commercial driveway density, major commercial driveway density, minor industrial driveway density, and other driveway density.
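The two-lane list above can be reproduced from the relative influence values in Table 6 by ranking the variables and accumulating their shares until the 95% threshold is reached. The Python sketch below is one way to express that rule; the names `TABLE6` and `high_priority` are hypothetical, and stopping at the first variable that brings the running total to 95% or more is an assumption about how the cut-off was applied.

```python
# Relative influence (%) for two-lane undivided arterials, from Table 6.
TABLE6 = [
    ("AADT", 44.62),
    ("Roadside object density", 22.10),
    ("Minor residential driveway density", 11.93),
    ("Minor commercial driveway density", 5.63),
    ("Major commercial driveway density", 5.07),
    ("Minor industrial driveway density", 4.76),
    ("Other driveway density", 1.64),
    ("Major industrial driveway density", 1.15),
    ("Presence of lighting", 1.00),
    ("Speed limit", 0.89),
    ("Presence of on-street parking", 0.86),
    ("Major residential driveway density", 0.35),
    ("Presence of automated speed enforcement", 0.00),
]


def high_priority(influences, cumulative_target=95.0):
    """Return (selected variable names, cumulative share reached)."""
    ranked = sorted(influences, key=lambda item: item[1], reverse=True)
    selected, running = [], 0.0
    for name, share in ranked:
        if running >= cumulative_target:
            break                      # target already met; stop adding
        selected.append(name)
        running += share
    return selected, running


selected, reached = high_priority(TABLE6)
print(len(selected), "variables reach", round(reached, 2), "%")
for name in selected:
    print(" ", name)
```

For the Table 6 values, this selects seven variables with a cumulative share of 95.75%, and each selected variable individually contributes more than 1%.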
Similarly, for urban and suburban four-lane divided facilities, the list of high priority variables includes AADT, major commercial driveway density, roadside object density, minor commercial driveway density, median width, and minor

residential driveway density. All the high priority variables that cumulatively accounted for 95% of total contributions had at least 1% of individual contributions in crash frequency predictions. The results suggest that only those variables that showed at least 1% of individual contributions and cumulatively 95% of total contributions to crash predictions need be considered for generating calibration factors. Note that when the HSM methodology is to be used for site-specific analysis, it is desirable to consider all the HSM variables, depending on data availability. The partial dependence plots demonstrate the marginal effect of variables on crash predictions. They provide insight into how crashes are impacted by the contributing variables. For both facility types analyzed in the study, the plots revealed a high degree of non-linearity between the variables discussed in the HSM and predicted crash frequency. The plots also showed that the impacts of AADT, roadside object density, and driveway density are neutralized beyond a certain point. In summary, the BRT method employed in this study provides several advantages over traditional regression methods, including: a pre-specified functional form and variable transformation are not required for developing models; interactions among predictor variables are intrinsically detected and modeled; different types of explanatory variables are efficiently handled; and model output can be interpreted by means of variable importance and partial dependence plots. Furthermore, because a large number of trees are fitted by randomly extracting a fraction of observations for each tree, BRT models are more stochastic than single decision tree models. One advantage of the BRT method over single decision tree methods and other tree-ensemble methods (e.g., random forests) is that a specific distribution type can be specified for BRT models.
Moreover, the slow learning rate and the process of fitting trees by giving more weight to poorly fitted observations allow BRT to learn effectively from the data. Despite these benefits, the BRT method has two potential drawbacks. First, BRT’s relative influence measure provides neither confidence intervals nor a test of the significance of differences between relative contributions (Chung, 2013). Second, the method requires optimization of several parameters (such as shrinkage, tree complexity, number of trees, and bagging fraction) to determine the best model with improved predictive performance. This involves extensive calculations that increase computation time. Although BRT models usually take longer to optimize, their ability to better capture complex and non-linear relationships between crashes and their explanatory variables is advantageous.
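To make the roles of shrinkage, number of trees, and bagging fraction concrete, the following Python sketch implements a heavily simplified stochastic gradient boosting loop for count data. It is a toy under stated assumptions, not the authors' implementation: each tree is a single-split stump on one predictor (tree complexity 1), each stump is fitted by least squares to the current residuals y − exp(f), which are the negative gradient of the Poisson loss and the gradient-boosting counterpart of emphasizing poorly fitted observations, and leaf values are plain residual means. All names and the toy data are hypothetical.

```python
import math
import random


def poisson_deviance(y, mu):
    """Mean Poisson deviance; the y*log(y/mu) term is taken as 0 when y == 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * math.log(yi / mi) if yi > 0 else 0.0
        total += 2.0 * (term - (yi - mi))
    return total / len(y)


def fit_stump(x, residual, idx):
    """Least-squares single split on predictor x using the sampled rows idx."""
    best = None
    for threshold in sorted({x[i] for i in idx})[:-1]:
        left = [residual[i] for i in idx if x[i] <= threshold]
        right = [residual[i] for i in idx if x[i] > threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:  # no split possible: fall back to a constant stump
        mean = sum(residual[i] for i in idx) / len(idx)
        return float("inf"), mean, mean
    return best[1], best[2], best[3]


def boost(x, y, n_trees=200, shrinkage=0.05, bag_fraction=0.5, seed=0):
    """Return the fitted log-scale predictions and the deviance after each tree."""
    rng = random.Random(seed)
    n = len(y)
    f = [math.log(sum(y) / n)] * n  # start from the log of the mean count
    history = []
    for _ in range(n_trees):
        idx = rng.sample(range(n), max(2, int(bag_fraction * n)))  # random subsample
        residual = [yi - math.exp(fi) for yi, fi in zip(y, f)]
        threshold, left, right = fit_stump(x, residual, idx)
        # Shrinkage scales down each tree's contribution (slow learning).
        f = [fi + shrinkage * (left if xi <= threshold else right)
             for fi, xi in zip(f, x)]
        history.append(poisson_deviance(y, [math.exp(fi) for fi in f]))
    return f, history


# Invented toy data: AADT in thousands and crash counts per segment.
toy_x = [5, 10, 15, 20, 25, 30, 35, 40]
toy_y = [0, 1, 1, 2, 3, 5, 6, 9]
fitted, history = boost(toy_x, toy_y)
print("deviance after first and last tree:", history[0], history[-1])
```

Lowering `shrinkage` in this sketch means more trees are needed before the deviance levels off, which mirrors the larger tree counts reported for the smaller shrinkage factors in Tables 4 and 5.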

References

Abdel-Aty, M., Haleem, K., 2011. Analyzing angle crashes at unsignalized intersections using machine learning techniques. Accid. Anal. Prev. 43 (1), 461–470. http://dx.doi.org/10.1016/j.aap.2010.10.002.
Abdel-Aty, M., Radwan, E., 2000. Modeling traffic accident occurrence and involvement. Accid. Anal. Prev. 32 (5), 633–642.
Ahmed, M., Abdel-Aty, M., 2013. A data fusion framework for real-time risk assessment on freeways. Transp. Res. C: Emerg. Technol. 26, 203–213. http://dx.doi.org/10.1016/j.trc.2012.09.002.
Alluri, P., Ogle, J., 2012. Effects of state-specific SPFs, AADT estimations, and overdispersion parameters on crash predictions using SafetyAnalyst. Proceedings of the 91st Annual Meeting of the Transportation Research Board, Washington, D.C.
Alluri, P., Saha, D., Liu, K., Gan, A., 2014. Improved Processes for Meeting the Data Requirements for Implementing the Highway Safety Manual (HSM) and SafetyAnalyst in Florida. Final Report BDK80-977-37. Florida Department of Transportation, Tallahassee, FL.
American Association of State Highways and Transportation Officials (AASHTO), 2010. Highway Safety Manual, first ed. American Association of State Highway and Transportation Officials, Washington, D.C.
Breiman, L., Friedman, J.H., Olsen, R.A., Stone, J., 1984. Classification and Regression Trees. Chapman & Hall, NY.
Bühlmann, P., Hothorn, T., 2007. Boosting algorithms: regularization, prediction, and model fitting. Stat. Sci. 22 (4), 477–505. http://dx.doi.org/10.1214/07STS242.
Cafiso, S., Di Graziano, A., Di Silvestro, G., La Cava, G., Persaud, B., 2010. Development of comprehensive accident models for two-lane rural highways using exposure, geometry, consistency and context variables. Accid. Anal. Prev. 42 (4), 1072–1079. http://dx.doi.org/10.1016/j.aap.2009.12.015.
Caliendo, C., Guida, M., Parisi, A., 2007. A crash-prediction model for multilane roads. Accid. Anal. Prev. 39 (4), 657–670. http://dx.doi.org/10.1016/j.aap.2006.10.012.
Chang, L.-Y., Wang, H.-W., 2006. Analysis of traffic injury severity: an application of non-parametric classification tree techniques. Accid. Anal. Prev. 38 (5), 1019–1027. http://dx.doi.org/10.1016/j.aap.2006.04.009.
Cheong, Y.L., Leitão, P.J., Lakes, T., 2014. Assessment of land use factors associated with dengue cases in Malaysia using boosted regression trees. Spat. Spatiotemporal Epidemiol. 10, 75–84. http://dx.doi.org/10.1016/j.sste.2014.05.002.
Chung, Y.-S., 2013. Factor complexity of crash occurrence: an empirical demonstration using boosted regression trees. Accid. Anal. Prev. 61, 107–118. http://dx.doi.org/10.1016/j.aap.2012.08.015.
Das, A., Abdel-Aty, M., Pande, A., 2009. Using conditional inference forests to identify the factors affecting crash severity on arterial corridors. J. Saf. Res. 40 (4), 317–327. http://dx.doi.org/10.1016/j.jsr.2009.05.003.
De’ath, G., 2007. Boosted trees for ecological modeling and prediction. Ecology 88 (1), 243–251. http://dx.doi.org/10.1890/0012-9658(2007)88[243:BTFEMA]2.0.CO;2.
Elith, J., Leathwick, J.R., Hastie, T., 2008. A working guide to boosted regression trees. J. Anim. Ecol. 77 (4), 802–813. http://dx.doi.org/10.1111/j.13652656.2008.01390.x.
Ellis, A.R., Dusetzina, S.B., Hansen, R.A., Gaynes, B.N., Farley, J.F., Stürmer, T., 2013. Confounding control in a nonexperimental study of STAR*D data: logistic regression balanced covariates better than boosted CART. Ann. Epidemiol. 23 (4), 204–209. http://dx.doi.org/10.1016/j.annepidem.2013.01.004.
Elmitiny, N., Yan, X., Radwan, E., Russo, C., Nashar, D., 2010. Classification analysis of driver’s stop/go decision and red-light running violation. Accid. Anal. Prev. 42 (1), 101–111. http://dx.doi.org/10.1016/j.aap.2009.07.007.
Esther, A., Imholt, C., Perner, J., Schumacher, J., Jacob, J., 2014. Correlations between weather conditions and common vole (Microtus arvalis) densities identified by regression tree analysis. Basic Appl. Ecol. 15 (1), 75–84. http://dx.doi.org/10.1016/j.baae.2013.11.003.
Etter, A., McAlpine, C., Wilson, K., Phinn, S., Possingham, H., 2006. Regional patterns of agricultural land use and deforestation in Colombia. Agric. Ecosyst. Environ. 114 (2–4), 369–386. http://dx.doi.org/10.1016/j.agee.2005.11.013.
Findley, D., Zegeer, C., Sundstrom, C., Hummer, J., Rasdorf, W., 2012. Applying the Highway Safety Manual to two-lane road curves. J. Transp. Res. Forum 51 (3), 25–38.
Friedman, J.H., 2001. Greedy function approximation: a gradient boosting machine. Ann. Stat. 29 (5), 1189–1232.
Friedman, J.H., 2002. Stochastic gradient boosting. Comput. Stat. Data Anal. 38 (4), 367–378.
Friedman, J.H., Meulman, J.J., 2003. Multiple additive regression trees with application in epidemiology. Stat. Med. 22 (9), 1365–1381. http://dx.doi.org/10.1002/sim.1501.
Froeschke, J.T., Froeschke, B.F., 2011. Spatio-temporal predictive model based on environmental factors for juvenile spotted seatrout in Texas estuaries using boosted regression trees. Fish. Res. 111 (3), 131–138. http://dx.doi.org/10.1016/j.fishres.2011.07.008.
Gellrich, M., Baur, P., Robinson, B.H., Bebi, P., 2008. Combining classification tree analyses with interviews to study why sub-alpine grasslands sometimes revert to forest: a case study from the Swiss Alps. Agric. Syst. 96 (1–3), 124–138. http://dx.doi.org/10.1016/j.agsy.2007.07.002.
Hadi, M.A., Aruldhas, J., Chow, L.-F., Wattleworth, J.A., 1995. Estimating safety effects of cross-section design for various highway types using negative binomial regression. Transp. Res. Rec. J. Transp. Res. Board 1500, 169–177.
Hale, R., Marshall, S., Jeppe, K., Pettigrove, V., 2014. Separating the effects of water physicochemistry and sediment contamination on Chironomus tepperi (Skuse) survival, growth and development: a boosted regression tree approach. Aquat. Toxicol. 152, 66–73. http://dx.doi.org/10.1016/j.aquatox.2014.03.014.
Harb, R., Yan, X., Radwan, E., Su, X., 2009. Exploring precrash maneuvers using classification trees and random forests. Accid. Anal. Prev. 41 (1), 98–107. http://dx.doi.org/10.1016/j.aap.2008.09.009.
Hauer, E., 2000. The Median and Safety. www.trafficsafetyresearch.com.
Hauer, E., Council, F.M., Mohammedshah, Y., 2004. Safety models for urban four-lane undivided road segments. Transp. Res. Rec. J. Transp. Res. Board 1897, 96–105.
Hastie, T., Tibshirani, R.J., Friedman, J.H., 2009. The Elements of Statistical Learning, second ed. Springer-Verlag, NY.
Jafari, A., Khademi, H., Finke, P.A., Van de Wauw, J., Ayoubi, S., 2014. Spatial prediction of soil great groups by boosted regression trees using a limited point dataset in an arid region, southeastern Iran. Geoderma 232–234, 148–163. http://dx.doi.org/10.1016/j.geoderma.2014.04.029.
Jalayer, M., Zhou, H., 2013. A sensitivity analysis of crash prediction models input in the Highway Safety Manual. Paper Presented at the 2013 ITE Midwestern District Meeting, Milwaukee, WI.
Karlaftis, M.G., Golias, I., 2002. Effects of road geometry and traffic volumes on rural roadway accident rates. Accid. Anal. Prev. 34 (3), 357–365. http://dx.doi.org/10.1016/S0001-4575(01) 33-1.
Kashani, A.T., Mohaymany, A.S., 2011. Analysis of the traffic injury severity on two-lane, two-way rural roads based on classification tree models. Saf. Sci. 49 (10), 1314–1320. http://dx.doi.org/10.1016/j.ssci.2011.04.019.
Lemercier, B., Lacoste, M., Loum, M., Walter, C., 2012. Extrapolation at regional scale of local soil knowledge using boosted classification trees: a two-step approach. Geoderma 171–172, 75–84. http://dx.doi.org/10.1016/j.geoderma.2011.03.010.
Lu, J., 2013. Development of safety performance functions for SafetyAnalyst applications in Florida. Ph.D. Dissertation. Department of Civil and Environmental Engineering, Florida International University, Miami, FL.
Müller, D., Leitão, P.J., Sikor, T., 2013. Comparing the determinants of cropland abandonment in Albania and Romania using boosted regression trees. Agric. Syst. 117, 66–77. http://dx.doi.org/10.1016/j.agsy.2012.12.010.
Neumann, A., Holstein, J., Le Gall, J.-R., Lepage, E., 2004. Measuring performance in health care: case-mix adjustment by boosted decision trees. Artif. Intell. Med. 32 (2), 97–113. http://dx.doi.org/10.1016/j.artmed.2004.06.001.
R Core Team, 2014. R: A Language and Environment for Statistical Computing. Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/.
Sawalha, Z., Sayed, T., 2001. Evaluating safety of urban arterial roadways. J. Transp. Eng. 127 (2), 151–158.
Stamatiadis, N., Pigman, J., Sacksteder, J., Ruff, W., Lord, D., 2009. Impact of Shoulder Width and Median Width on Safety. Transportation Research Board of the National Academics, Washington, D.C. http://onlinepubs.trb.org/onlinepubs/nchrp/nchrp_rpt_633.pdf.
Sun, X., Li, Y., Magri, D., Shirazi, H.H., 2006. Application of the Highway Safety Manual draft chapter: Louisiana experience. Transp. Res. Rec. J. Transp. Res. Board 1950, 55–64.
Williams, G., 2011. Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery. Springer, NY.
Yan, X., Radwan, E., 2006. Analyses of rear-end crashes based on classification tree models. Traffic Inj. Prev. 7 (3), 276–282. http://dx.doi.org/10.1080/15389580600660062.
Young, J., Park, P.Y., 2013. Benefits of small municipalities using jurisdiction-specific safety performance functions rather than the Highway Safety Manual’s calibrated or uncalibrated safety performance functions. Can. J. Civ. Eng. 40 (6), 517–527. http://dx.doi.org/10.1139/cjce-2012-0501.
Zhang, M.H., Xu, Q.S., Daeyaert, F., Lewi, P.J., Massart, D.L., 2005. Application of boosting to classification problems in chemometrics. Anal. Chim. Acta 544 (1–2), 167–176. http://dx.doi.org/10.1016/j.aca.2005.01.075.
