Applied Ergonomics 51 (2015) 152–162


Development of safety incident coding systems through improving coding reliability

Nikki S. Olsen*, Ann M. Williamson
School of Aviation, The University of New South Wales, Kensington, Sydney, NSW 2052, Australia

Article history: Received 3 September 2014; Accepted 27 April 2015; Available online 21 May 2015.

Abstract

This paper reviews classification theory sources to develop five research questions concerning factors associated with incident coding system development and use and how these factors affect coding reliability. First, a method was developed to enable the comparison of reliability results obtained using different methods. Second, a statistical and qualitative review of reliability studies was conducted to investigate the influence of the identified factors on the reliability of incident coding systems. As a result, several factors were found to have a statistically significant effect on reliability. Four recommendations for system development and use are provided to assist researchers in improving the reliability of incident coding systems in high hazard industries. © 2015 Elsevier Ltd and The Ergonomics Society. All rights reserved.

Keywords: Incident classification; Reliability; Safety management system

1. Introduction

1.1. Incident coding systems and their testing

Incident and accident investigation is now an integral part of safety management systems in safety-critical industries. Identifying how incidents and accidents occur can reveal factors and causes contributing to past incidents that will help target preventive action to reduce the likelihood of future accidents. An increasing number of accident and incident classification and coding systems have been developed for this purpose in recent years. These coding systems aim to provide a standard, systematic framework for identifying the factors that contribute to accident and incident occurrence. They are thus tools for managing and understanding the often large amount of information gathered about accidents and incidents and for analysing common causal factors across different incidents and over time. The systems vary considerably in the nature of coding used, with some based on simple lists of possible contributing factors (e.g. af Wåhlberg, 2002), others on formal models of accident causation (e.g. TRACEr, Shorrock and Kirwan, 2002), and some including more in-depth coding of aspects such as different taxonomies of error involvement (e.g. HFACS, Wiegmann and Shappell, 2003). Systems also vary in the type and number of people who do the coding (‘coders’).

* Corresponding author. Tel.: +61 3 51467916. E-mail address: [email protected] (N.S. Olsen).
http://dx.doi.org/10.1016/j.apergo.2015.04.015
0003-6870/© 2015 Elsevier Ltd and The Ergonomics Society. All rights reserved.

Reliability is an essential attribute of all standard and systematic coding systems (Kirwan, 1996; Ross et al., 2004; Stanton and Stevenage, 1998). It is essential that the coding system produces consistent coding of the data regardless of when or by whom the coding is conducted. This includes intercoder reliability, or how well two or more coders agree with one another, and intracoder reliability, or how well a single coder agrees with themselves when coding the same incident on different occasions. Clearly, if coders cannot agree, there will be variability in the range of codes applied, making it unclear how the incident actually occurred. Without acceptable levels of reliability there is little chance that the classification and coding of incidents will reflect accurately how they occurred. Classification and coding systems with poor reliability will contribute little to the understanding of safety failures in the settings in which they are applied, making them unlikely to contribute to improving safety. A range of different methods have been used to estimate the reliability of classification and coding systems (Olsen, 2013). High levels of agreement are needed for a classification and coding system to be judged reliable, but many of these systems, even some in common use, have poor or unknown reliabilities (e.g. HFACS-ADF, TAPS, HFACS-ME). Better and more reliable tools are needed to code and classify safety incidents, but the best way of going about this is not clear. A few studies have discussed some of the factors that could play a role in limiting the reliability of classifying and coding safety information (Baker and Krokos, 2007; Olsen and Shorrock, 2010; Li and Harris, 2005), but there has been no systematic analysis of which factors play a role or any


empirical study of their influence. The objective of this research was to identify the factors that might influence reliability and examine the nature of their influence, in order to provide guidance to incident coding system developers, users and researchers in safety-critical industries.

1.2. Reliability of incident classification and coding systems

Existing studies of reliability and coding system development provide evidence to support hypotheses concerning the factors that might influence the reliability of incident classification and coding systems. First, the scope of the system may influence the reliability of the system for various users. Taib et al. (2012) found that a domain-specific medication error taxonomy (such as one designed only to classify medication errors) was more reliable than a generic medication error taxonomy (one that can be used for errors across all medical departments) when used by medical personnel including pharmacists and nurses. These authors suggested that domain-specific systems, where the scope and wording are very similar to those used in the reports generated by individual medical departments, may produce higher reliability regardless of the profession of the coder, because they may require less interpretation in matching report terminology to the available system codes. On the other hand, translating detailed medical terminology from individual departments into more generic medical terminology might result in different interpretations by different coders and lead to lower levels of reliability.

Second, the profession or expertise of the coder is also a potential factor that may influence the reliability of incident coding systems. For example, in comparing the reliability of coded reports of medication errors, Taib et al. (2012) found that pharmacists were more reliable than nurses regardless of whether they were using domain-specific or generic systems, although they did not venture to explain why this might be so. Also supporting the potential influence of coder expertise, Krippendorf (2013) argued that reliability will be poor where coders without knowledge of the subject matter are asked to interpret texts with unfamiliar terms; coders should therefore have an appropriate level of expertise in the field in order to produce high reliability. This is supported by Baker and Krokos (2007), O'Connor and Walker (2011) and Olsen and Shorrock (2010), who raised concerns about the reliability and validity of coding of human error concepts by coders without human factors expertise, who may not be able to distinguish between contextual and cognitive concepts.

Third, coder experience with the specific coding system may also influence reliability. Baysari et al. (2011) argued that a coder's experience with a particular incident coding system may increase reliability as coders will be familiar with how to use it and the codes available. However, this view has been challenged by Olsen and Shorrock (2010), who suggested that experience may only serve to establish users' preferred codes, which would look like good reliability but really only reflect a truncation of the range of codes used. Peter and Lauf (2002) similarly argued that experienced coders develop routines over their career that are inconsistent with the original coding methodology or change their understanding of category meanings. If these effects varied from person to person it might be expected that reliability would be lower with experience.
Clearly the effects of coder experience need further evaluation.

Fourth, the size of the coding system, in terms of the range of coding choices to be made, may influence system reliability. A number of researchers have expressed concerns over the appropriate size of incident coding systems (af Wåhlberg, 2002; Baysari et al., 2011; Jacobs et al., 2007; O'Connor, 2008). Size, here, refers to either the number of coding choices available at each level of the system (when depicted on paper this would be the ‘width’ of the


system) or the number of levels within the system (when depicted on paper this would be the ‘height’ of the system), or both. Systems with more than two levels of analysis tend to involve more specific codes for describing incident phenomena. Some researchers (Makeham et al., 2008; O'Connor, 2008; Olsen and Shorrock, 2010) suggest that as the levels of analysis increase, reliability decreases. Isaac et al. (2003a) relate the reason for decreasing reliability to the degree of choice offered to the coder: as the coder progresses through each level of the system, more options are available and more choices are made by coders, so reducing reliability. Similarly, when considering the ‘width’ of the system, Krippendorf (2013) more explicitly stated that there should be no more than seven categories at each point in the system because coders find it difficult to consider large numbers of categories. Such large numbers of categories are also thought to encourage coders to form coding habits and preferences, which has been suggested to artificially inflate reliability, as described above. There is little consistency in the literature about what researchers and developers describe as a ‘large’ or ‘small’ system. However, O'Connor (2008) and Olsen and Shorrock (2010) suggest that DoD-HFACS and HFACS-ADF respectively were too large for reliable coding, the total number of codes for these systems being 147 and 155 respectively, compared with the original version of HFACS, which consists of 19 codes. This comparison gives some appreciation of what the current literature describes as ‘large’ and ‘small’ incident coding systems.

Fifth, the terms and concepts used in the system may also influence its reliability. A number of researchers (Chang et al., 2005; Runciman et al., 2009; Shorrock and Kirwan, 2002) argue that terms within systems need to be unambiguous and to incorporate both theoretical concepts and generally accepted vocabulary so that researchers can understand each other's work, thus facilitating the systematic collection, aggregation and analysis of relevant information. Ercan et al. (2007) point out that certain terms may invite bias on the part of users. They identify two types of bias in the terminology of incident coding systems that may affect reliability. The first, loaded terms, refers to terms that evoke a meaning or feeling in the person coding that may affect whether the term is selected or not. For example, the code ‘failed to prioritise attention’ is more loaded, due to the ‘failure’ term, than the code ‘distraction’, which is more neutral. The second form of bias refers to category names that describe illegal attitudes, behaviours or private subjects and that coders may be unwilling to select because of potential repercussions. This has been highlighted in the development of the Canadian version of the Human Factors Analysis and Classification System (HFACS). Developers found that air traffic controllers tended to baulk at the term ‘violations’, so they replaced the word with the term ‘contraventions’, which they found to be more palatable to the users and which resulted in the code being used more often (Wiegmann and Shappell, 2003). The use of terms that may cause bias in coders may result in low levels of reliability.
Li and Harris (2005) also suggested that coders were less inclined to use categories where the terms represented abstract concepts and were more inclined to use categories for which more tangible evidence was available from the accident report narratives. For example, they found that categories showing the lowest levels of reliability were those inferring a lack of ‘situational awareness’ or identifying ‘organisational climate’ as a contributing factor. It may be that physical concepts such as ‘equipment failure’ are coded more reliably than more abstract concepts such as ‘decision making’ because equipment failure may be readily seen whereas the making of a decision is an internal cognitive event. As many incident coding techniques are based on conceptualisations of cognitive failures that are inferred rather than seen it is possible that the



coding of human error may be particularly prone to low levels of reliability. More specifically, it is expected that terms representing abstract concepts will result in lower levels of reliability than terms representing physical concepts.

1.3. Objectives of this paper

The aim of this paper is to understand more about the influences on reliability in the incident coding domain. First, a method will be developed to enable the comparison of reliability results obtained using different analysis indices. Second, a statistical and qualitative review of reliability studies will be made to investigate the influence of the factors identified in the introduction on the reliability of incident coding systems. The results of this review and analysis will provide guidance on how system developers can improve the reliability of their classification and coding systems and improve the usefulness of these systems.

2. Comparing reliability results across indices

To determine the relationship between reliability and system design it is necessary to measure the agreement between coders for comparison. Agreement refers to whether or not coders have selected the same code to assign to each discrete event (Ross et al., 2004). In order to compare reliabilities obtained across studies of safety incident coding methods, it is essential to ensure that the reliability estimates are comparable. Although a number of different methods are used to analyse agreement in the reviewed studies, percentage agreement, the Index of Concordance (Maxwell, 1977) and Kappa (Cohen, 1960) are the most popular. Percentage agreement refers to the percentage of coders agreeing on the most common code selected (see equation (1)). The Index of Concordance, on the other hand, takes a comparison of each pair of participants for each event coded, noting whether they agreed or disagreed. The index is then calculated by dividing the number of agreements by the total possible number of agreements (agreements plus disagreements) and expressing this as a percentage (equation (2)).

Percentage agreement = (number of coders agreeing on the modal category / total number of coders) × 100%    (1)

Index of Concordance = (number of agreements / (number of agreements + number of disagreements)) × 100%    (2)
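As a minimal illustration of equations (1) and (2), the following Python sketch computes both indices for a single coded event; the function names and data are ours rather than the paper's, and the output reproduces the four-coder example discussed below.

```python
from collections import Counter
from itertools import combinations

def percentage_agreement(codes):
    """Equation (1): share of coders selecting the modal code, as a percentage."""
    modal_count = max(Counter(codes).values())
    return 100.0 * modal_count / len(codes)

def index_of_concordance(codes):
    """Equation (2): agreeing coder pairs divided by all coder pairs, as a percentage."""
    pairs = list(combinations(codes, 2))
    agreements = sum(1 for a, b in pairs if a == b)
    return 100.0 * agreements / len(pairs)

# Four coders assign codes A, A, A and B to the same event.
event_codes = ["A", "A", "A", "B"]
print(percentage_agreement(event_codes))   # 75.0 (3 of 4 coders on the modal code)
print(index_of_concordance(event_codes))   # 50.0 (3 agreeing pairs out of 6 possible)
```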

To illustrate the difference between percentage agreement and the Index of Concordance, consider the following example. If four coders coding the same event use codes A, A, A and B, the percentage agreement is calculated by determining the number of coders selecting the modal code (A) and dividing this by the total number of coders (3 divided by 4 = 75% agreement). Using the Index of Concordance, however, there are three agreements (coders 1 and 2, 2 and 3, 1 and 3) and three disagreements (coders 1 and 4, 2 and 4, 3 and 4). The Index of Concordance result is therefore 3 agreements divided by 6 possible agreements = 50%, a considerably more conservative estimate of agreement.

The third popular method, Kappa, is calculated by correcting the figure for percentage agreement for the agreement that could be expected by chance alone (κ = (p_o − p_e)/(1 − p_e), where p_o is the observed and p_e the chance-expected proportion of agreement). There is some question over the suitability of Kappa in safety incident coding studies. Ross et al. (2004) argue that the use of Kappa assumes that codes are mutually exclusive and exhaustive, which cannot be assumed before the system has actually been tested for this. Second, they note that Kappa assumes independence of the coders, which is unlikely: coders do not start from a position of complete ignorance, but begin with subject matter expertise and at least a basic understanding of the system. Last, they argue that it is inappropriate to make corrections for chance agreement, since coders do not make random selections but rather make informed selections based on their experience, expertise and the design and terminology of the codes presented in the system. Most relevant to the current study, Kappa results also cannot be compared across studies because of the way the chance value is calculated (Cicchetti and Feinstein, 1990; Feinstein and Cicchetti, 1990). Reliability results using Kappa will therefore not be used in this review.

Although studies using percentage agreement can be compared with other percentage agreement studies, and likewise for Index of Concordance studies, the differences in how each of these indices calculates agreement make comparisons between them difficult. RSSB (2005) found that when analysing the same data set using percentage agreement and the Index of Concordance, the difference between the results ranged from 14% to 28%, with the Index of Concordance being the more conservative of the two because it takes disagreements into account.

To develop a method for comparing reliability studies using different analysis indices, an investigation into how the methods are related to each other was necessary. Thus a comparison of reliability using percentage agreement and the Index of Concordance for the same data set was conducted to determine the relationship between the two indices and to develop a method for combining studies using the two methods. A number of hypothetical datasets were established in which three, five, ten and fifteen coders applied codes to the same number of events in each case. In each case, the hypothetical dataset included coding decisions in which, for the first decision, there was no modal code, with all coders selecting a different code; for the second, two coders selected the same code but all others selected different codes; for the third, three coders selected the same code but all others selected different codes; and so on until all coders had selected the same code. Reliability was calculated for each decision using both methods. The results of this hypothetical approach are shown for each dataset in Table 1. From the table it is evident that the pattern for percentage agreement is linear whereas the pattern for the Index of Concordance presents a curve of exponential growth.

Table 1
Comparison of percentage agreement and Index of Concordance using hypothetical data involving different numbers of coders and decision events, with increasing numbers of coders agreeing.

3 coders; 3 events:    PA (%):  0, 67, 100
                       IOC (%): 0, 33, 100
5 coders; 5 events:    PA (%):  0, 40, 60, 80, 100
                       IOC (%): 0, 10, 30, 60, 100
10 coders; 10 events:  PA (%):  0, 20, 30, 40, 50, 60, 70, 80, 90, 100
                       IOC (%): 0, 2, 7, 13, 22, 33, 47, 62, 80, 100
15 coders; 15 events:  PA (%):  0, 13, 20, 27, 33, 40, 47, 53, 60, 67, 73, 80, 87, 93, 100
                       IOC (%): 0, 1, 3, 6, 10, 14, 20, 27, 34, 43, 52, 63, 74, 87, 100

PA = Percentage Agreement; IOC = Index of Concordance.
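The pattern in Table 1 can be reproduced with a short sketch. This is our reconstruction, assuming that in each hypothetical decision the coders outside the agreeing group all select distinct codes and that a decision with no modal code is scored as 0% agreement, as in the table; the 100-coder lookup mirrors the conversion approach described in the next paragraph, and the function names are ours.

```python
def pa_for_k_agreeing(k, n):
    """Percentage agreement when k of n coders pick the modal code; the
    'no modal code' case (k = 1, all coders differing) is scored as 0%."""
    return 0.0 if k <= 1 else 100.0 * k / n

def ioc_for_k_agreeing(k, n):
    """Index of Concordance when k of n coders agree and every other coder
    selects a distinct code: agreeing pairs C(k,2) over all pairs C(n,2)."""
    return 100.0 * (k * (k - 1)) / (n * (n - 1))

# Reproduces the 10-coder column of Table 1: IOC values 0, 2, 7, 13, 22, 33, 47, 62, 80, 100.
for k in range(1, 11):
    print(round(pa_for_k_agreeing(k, 10)), round(ioc_for_k_agreeing(k, 10)))

# A 100-coder table, analogous to the one described below, gives an approximate
# conversion between the two indices by nearest-neighbour lookup.
lookup = [(pa_for_k_agreeing(k, 100), ioc_for_k_agreeing(k, 100)) for k in range(1, 101)]

def pa_to_ioc(pa):
    """Approximate the Index of Concordance corresponding to a percentage agreement result."""
    return min(lookup, key=lambda row: abs(row[0] - pa))[1]
```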

The Index of Concordance results are always lower than percentage agreement, except at the extremes, and they tend to be most conservative when agreement between coders is lower. Because of these differences between the two reliability measures, it is not possible to compare studies using the two different measures directly. To compare such studies it is therefore necessary to recalculate one measure into the other. To do this, an additional set of hypothetical data was created for 100 coders coding 100 events for both percentage agreement and Index of Concordance, and percentage agreement results were then approximated to Index of Concordance results using the resulting table. Although not precise, reliability results in published papers had often been rounded to the nearest percentage and the intention was to continue this practice for the data comparisons in this paper, so an approximation method more precise than the nearest percentage was not necessary. With 100 data points for converting percentage agreement to Index of Concordance, the method was judged suitably accurate for the purposes of this paper.

3. Analysis of factors affecting reliability

3.1. Introduction

The review of literature identified a number of factors for which there is evidence that they are likely to affect the reliability of coding systems. These include the nature of the coding (whether it is domain-specific or generic, and whether it contains sensitive coding terms), the role of coder expertise, the size of the coding system and coder experience with the coding system. The aim of this section of the research was to explore further the influence of these five factors on the reliability of safety system coding. Specifically, this involved the following five research questions:

RQ1: Does reliability vary between domain-specific and generic systems?
RQ2: Does reliability vary with coder expertise?
RQ3: Does reliability vary with coder experience in the system?
RQ4: Does reliability vary for the number of coding choices within the systems?
RQ5: Does reliability vary for the terminology used in the systems?


Table 2
Search terms: taxonomy; system; model; reliability; consensus; intercoder agreement; interrater agreement; intercoder consistency; interrater consistency; intracoder agreement; intracoder consistency; intrarater agreement; intrarater consistency; accident; incident; event; adverse event; hazard.

3.2. Method

This research involved two steps. The first involved a literature search to identify safety system coding studies that included estimates of reliability of coding. The second step involved analysis of the set of studies identified in the literature search specifically for each of the five research questions.

3.2.1. Literature search
Three electronic databases were searched for reliability studies of incident, accident and adverse event coding systems: MEDLINE, Ergonomics Abstracts and Health and Safety Science Abstracts. The search terms are presented in Table 2. The titles and abstracts of returned entries were checked for relevance to the review. In addition, the reference lists of papers selected for review were manually searched by title and, where further clarification of relevance was required, by searching abstracts. The search was limited to English language papers published prior to June 2013.

3.2.2. Research question analysis

3.2.2.1. RQ1: Does reliability vary between domain-specific and generic incident coding systems?
Each of the studies identified in the literature review was classified as using domain-specific or generic error codes using the method employed by Taib et al. (2012). This defines domain-specific systems as those that include codes specific to their domain, such as those designed for pathology, surgery or paediatrics, and generic systems as those that integrate information from various domains, such as medical error taxonomies. Each study was classified by two classifiers (the first author and an aviation safety investigator) as either domain-specific or generic. To ensure the validity of the classification, studies where the two classifiers disagreed on the classification were not included in the analysis. The average reliability was calculated, Index of Concordance results were converted to percentage agreement using the method developed in Part 2 of this research, and results were then compared across studies using domain-specific or generic systems.

3.2.2.2. RQ2: Does reliability vary with coder expertise?
The literature review revealed that most coders fit into one of four professions: university students, accident and incident investigators, human factors specialists (including psychologists) and Subject Matter Experts (SMEs; line workers such as air traffic controllers, pilots, nurses, rail drivers and the associated front line managers within the industry). The retrieved studies were read and coders were classified into one of these four groups according to the description provided by the authors of the studies. Where coders could not be classified into one group, or there was insufficient information to classify them into any of the groups, the study was excluded from the



analysis. The studies for each of the professions were grouped and reliabilities averaged within each group for comparison. Results were again converted from Index of Concordance to percentage agreement.

3.2.2.3. RQ3: Does reliability vary with coder experience in the incident coding system?
The studies were reviewed for coder experience. As most provided little specific information about experience arising from formal training or from using the system on the job, studies were only classified where details of experience or training with the coding system were provided in the paper. Studies where the coders' experience prior to coding was not clear were excluded from the analysis. The average reliability for two groups of participants, those who had prior experience or training in the system and those who did not, was compared. Results were converted from Index of Concordance to percentage agreement where applicable.

3.2.2.4. RQ4: Does reliability vary for the number of coding choices within the incident coding systems?
The studies identified were examined for the number of coding choices available in the incident coding system used, and the total number of codes was counted. To reduce over-representation of any particular incident coding system, the studies for each system were averaged to produce a single reliability result per system. This reliability and the system's corresponding total number of codes were then compared. Again, Index of Concordance reliability results were converted to percentage agreement results.

3.2.2.5. RQ5: Does reliability vary for the terminology used in the incident coding systems?
The studies identified were examined for the nature of the terminology used. Two aspects of the terminology were classified: the level of subjectivity and the level of bias of the terms. For subjectivity, terms were classified as either abstract, meaning not observable and/or representing mental or psychological concepts, or physical, defined as observable or in physical existence. The level of bias was defined by whether or not the classification system coded violation behaviour, on the basis that judgments about violation behaviour are often sensitive as they signify illegal or defiant behaviour and may be avoided by coders. For the subjectivity analysis, the same two classifiers as for research question one reviewed and classified each code within the incident coding systems according to the type of terminology (abstract or physical). For both types of terminology, studies were grouped and reliabilities averaged within each group. Where the classifiers disagreed, the code was not included in the analysis. For the bias analysis, codes that included the term ‘violation’ were compared to all other codes. Again, Index of Concordance results were converted to percentage agreement results.

3.3. Results

3.3.1. Literature search
After excluding papers that did not use percentage agreement or Index of Concordance, a total of 21 papers relevant to the review were found, containing 42 studies and 19 incident coding systems from five safety-critical industries. Each paper contained from one to five individual reliability studies, with an average of two studies per paper. The characteristics of the reliability studies reviewed are described in Table 3.
For all research questions except RQ5, the analysis began with 19 papers and 40 separate studies, as two further papers were excluded because they presented their percentage agreement results as a reliability range rather than an overall result.

3.3.2. Research question analysis

3.3.2.1. RQ1: Does reliability vary between domain-specific and generic incident coding systems?
Of the 40 studies analysed, 20 were classified as using a domain-specific system and 19 as using a generic system. The classifiers disagreed on the classification of system type for two systems (System for the Retrospective and Predictive Analysis of Cognitive Error – Rail Australian Version 2 (TRACEr-RAV2)), which were excluded from the analysis. The total data set to answer this question therefore comprised 18 papers, 39 studies and 17 incident coding systems. The average reliability for the domain-specific and generic system groups is graphed in Fig. 1. Analysis by Mann-Whitney U test revealed that the mean reliability was significantly higher for domain-specific than for generic systems (U = 123, critical value of U at p ≤ 0.05 = 130), confirming the hypothesis that domain-specific coding systems would have higher reliability.

3.3.2.2. RQ2: Does reliability vary with coder expertise?
In the analysis of the effect of coder profession on reliability, five studies were excluded from the 40 studies identified because the authors did not specify the profession of the coders, and three studies were excluded because they contained coders from multiple professions. The total data set available to analyse this question was therefore 15 papers containing 32 studies and 14 incident coding systems. Fourteen studies used subject matter experts, seven used incident investigators, six used human factors specialists and five used university students. The average reliabilities for each of the groups are graphed in Fig. 2. Comparison of the average reliability for each profession by Kruskal-Wallis one-way analysis of variance by ranks showed significant differences between the coder professions (χ² (df = 3, p ≤ 0.05) = 7.81), indicating that reliability is lower for SMEs than for coders from other professions.

3.3.2.3. RQ3: Does reliability vary with coder experience in the incident coding system?
For the analysis of the effect of coder training and experience on reliability, 18 of the 40 studies identified did not report clearly, or at all, the experience or training levels of coders and so were excluded from the analysis. The total data set for this analysis therefore consisted of 13 papers, 22 studies and 10 incident coding systems. The distribution of training and experience was very skewed. Most studies used coders without training or other experience (nine studies), six used coders with approximately one half day of experience, four used coders with one day of experience, and single studies used coders with 1.5, 5 and 61 days respectively. A simplified comparison was therefore done separating studies into those with coders with no experience and those with some experience of the coding system. Mean reliability was around 70% for both groups and analysis by Mann-Whitney U test confirmed no significant difference between the two samples (z = 0.03, p = 0.98 at p ≤ 0.05).
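For readers wishing to run comparable analyses on their own coding data, the non-parametric tests reported above could be computed as in the following sketch; the reliability values shown are placeholders rather than the study data, and scipy is assumed to be available.

```python
from scipy import stats

# Placeholder per-study reliability values (percentage agreement); the paper's
# study-level values are not reproduced here.
domain_specific = [82, 76, 71, 88, 79]
generic = [64, 70, 58, 66]

# RQ1-style two-group comparison (Mann-Whitney U test).
u_stat, p_rq1 = stats.mannwhitneyu(domain_specific, generic, alternative="two-sided")

# RQ2-style four-group comparison (Kruskal-Wallis one-way analysis of variance by ranks).
smes = [55, 60, 58, 62]
investigators = [72, 68, 75]
hf_specialists = [74, 80, 77]
students = [70, 65, 69]
h_stat, p_rq2 = stats.kruskal(smes, investigators, hf_specialists, students)

print(u_stat, p_rq1)
print(h_stat, p_rq2)
```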
3.3.2.4. RQ4: Does reliability vary for the number of coding choices within the incident coding systems?
Analysis of the size of the incident coding system required the exclusion of six studies from the identified set of 40 because the overall number of codes could not be determined, as full diagrams of the coding systems could not be sourced (System for the Retrospective and Predictive Analysis of Cognitive Error (TRACEr) and System for the Retrospective and Predictive Analysis of Cognitive Error – Rail (TRACEr-Rail)). The analysis therefore comprised 18 papers, 34 studies and 15 incident coding systems. The reliability results from multiple studies using each of the incident coding systems were averaged and then plotted for each of the fifteen systems against the total number of codes available in each system (Fig. 3). Analysis by Spearman's rho correlation showed a weak, non-significant negative correlation (rs = −0.302), indicating no convincing relationship between the final number of codes and reliability.
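A sketch of this kind of analysis, averaging multiple studies per system before correlating size with reliability, might look as follows. The code counts for HFACS, DoD-HFACS and HFACS-ADF are those cited in the introduction; the reliability values and the fourth system are entirely hypothetical.

```python
from collections import defaultdict
from scipy import stats

# (system name, total number of codes, reliability %) per study; reliability
# values are placeholders and 'SystemX' is fictitious.
studies = [
    ("HFACS", 19, 78), ("HFACS", 19, 72), ("HFACS", 19, 74),
    ("DoD-HFACS", 147, 62), ("DoD-HFACS", 147, 58),
    ("HFACS-ADF", 155, 55),
    ("SystemX", 60, 70), ("SystemX", 60, 66),
]

# Average the studies of each system so heavily studied systems are not
# over-represented, then correlate system size with mean reliability.
by_system = defaultdict(list)
for name, n_codes, reliability in studies:
    by_system[(name, n_codes)].append(reliability)

sizes, mean_rels = zip(*[(n, sum(r) / len(r)) for (_, n), r in by_system.items()])
rho, p_value = stats.spearmanr(sizes, mean_rels)
print(rho, p_value)
```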

Table 3
Reliability papers and studies included in the literature review: 21 papers reporting 42 reliability studies of 19 incident coding systems across five safety-critical industries (aviation, rail, automotive, medicine and nuclear). For each paper the table lists the first author (year), system name, system source paper, field, aim of the reliability study and number of studies. The systems covered include af Wåhlberg's unnamed system, ACCERS, HFACS, DoD-HFACS, HFACS-ADF, HERA, TRACEr, TRACEr-lite, TRACEr-Rail, TRACEr-Rail-lite, TRACEr-RAV2, TAPS, SHERPA, NCC-MERP, CALAX, SECAS, the Observed Cause and Root Cause Analysis System, DREAM, the Paediatric Patient Safety Event Taxonomy and the Taxonomy for Anatomic Pathology Errors.



Fig. 1. Comparison of reliability for domain specific and generic incident coding systems.

Fig. 2. Comparison of average reliability for coders classified into one of four professional groups.

As can be seen in Fig. 3, however, the distribution of coding system size and reliability falls into two distinct groups: systems with very large numbers of codes (>350) and systems with smaller or moderate numbers (
