
Complacency and Automation Bias in the Use of Imperfect Automation

Christopher D. Wickens, Alion Science and Technology, Boulder, Colorado; Benjamin A. Clegg and Alex Z. Vieane, Colorado State University, Fort Collins; and Angelia L. Sebok, Alion Science and Technology, Boulder, Colorado

Objective: We examine the effects of two different kinds of decision-aiding automation errors on human–automation interaction (HAI), occurring at the first failure following repeated exposure to correctly functioning automation. The two errors are incorrect advice, triggering the automation bias, and missing advice, reflecting complacency.

Background: Contrasts between analogous automation errors in alerting systems, rather than decision aiding, have revealed that alerting false alarms are more problematic to HAI than alerting misses are. Prior research in decision aiding, although contrasting the two aiding errors (incorrect vs. missing), has confounded error expectancy.

Method: Participants performed an environmental process control simulation with and without decision aiding. For those with the aid, automation dependence was created through several trials of perfect aiding performance, and an unexpected automation error was then imposed in which automation was either gone (one group) or wrong (a second group). A control group received no automation support.

Results: The correct aid supported faster and more accurate diagnosis and lower workload. The aid failure degraded all three variables, but "automation wrong" had a much greater effect on accuracy, reflecting the automation bias, than did "automation gone," reflecting the impact of complacency. Some complacency was manifested for automation gone, by a longer latency and more modest reduction in accuracy.

Conclusions: Automation wrong, creating the automation bias, appears to be a more problematic form of automation error than automation gone, reflecting complacency.

Implications: Decision-aiding automation should indicate its lower degree of confidence in uncertain environments to avoid the automation bias.

Keywords: automation, complacency, automation bias, process control, first failure

Address correspondence to Christopher D. Wickens, Alion Science and Technology, 4949 Pearl East Circle, Boulder, CO 80301, USA; e-mail: [email protected]. DOI: 10.1177/0018720815581940. Copyright © 2015, Human Factors and Ergonomics Society.

Introduction

Imperfect automation in alerting systems can result in distinct types of errors. For example, alerting systems may either miss actual dangerous events (an error of omission) or issue false alerts (an error of commission). These two automation errors have very different implications for human–automation interaction (HAI) performance (Dixon & Wickens, 2006; Dixon, Wickens, & McCarley, 2007; Maltz & Shinar, 2003; Meyer, 2001, 2004; Meyer & Lee, 2013). In particular, each type of error will trigger distinct HAI behavior when the operator’s dependence on automation is high and the operator’s expectancy of automation failure is low. An automation miss is problematic when the human operator trusts the automation to identify all events. This problem, overreliance, is associated with an increased likelihood of the human’s failing to identify and respond to a true system issue (an HAI error of omission). In contrast, a false alarm is problematic in that the operator can incorrectly believe that the automation has correctly identified a fault. This situation is known as overcompliance and typically includes incorrect, unnecessary actions performed on the system (an HAI error of commission; Dixon & Wickens, 2006). These errors are most pronounced the first time an automation failure occurs (Wickens, Hooey, Gore, Sebok, & Koenicke, 2009), and automation false alarms have been observed to be generally more serious than automation misses, often leading to a subsequent disregard of future alarms that may be true (Dixon et al., 2007). The contrast between overreliance (omission errors) and overcompliance (commission errors) in alerting systems can be extended to the automation of decision aids (Mosier & Skitka, 1996; Parasuraman & Manzey, 2010), in which overdependence and first failures (FFs) have been


associated generally with HAI problems of complacency and the automation bias. These two problems have been integrated by Parasuraman and Manzey (2010) into a common attention-based framework. Because of the fuzziness of definition of both HAI concepts, the corresponding parallel with the better-defined reliance–compliance distinction in alerting systems is not perfect but still useful. Complacency, for example, has typically been associated simply with the failure to be vigilant in supervising automation prior to the automation failure. Hence it does not directly imply that the failure will be a failure to provide advice at all (omission error). However, generally, complacency has been associated with an assumption that "all is well" when in fact a dangerous condition exists (Parasuraman & Manzey, 2010, p. 382), that is, the omission error. Furthermore, the term complacency is most typically invoked in the context of alerting systems rather than decision aids (Parasuraman, Molloy, & Singh, 1993; Parasuraman, Mouloua, & Molloy, 1996), although it has been invoked in the latter context as well (Bahner, Huper, & Manzey, 2008).

In parallel, the automation bias is formally defined as

the tendency to use automated cues as a heuristic replacement for vigilant information seeking and processing [which] results in errors when decision makers fail to notice problems because an automated aid fails to detect them (an omission error) or when people inappropriately follow an automated decision aid directive or announcement (a commission error). (Mosier & Skitka, 1996, p. 205)

The term could be applied to alerting systems (following the advice that "all is well" when the alert stays silent—an omission error). However, automation bias more often refers to the context of a decision aid that provides incorrect advice, which is then followed inappropriately by the human user (Mosier & Fischer, 2010; Mosier & Skitka, 1996; Mosier, Skitka, Heers, & Burdick, 1998). Following these two general traditions, in the research that follows, we refer to complacency

as the state of monitoring prior to an automation decision aid failure that expresses itself as a delayed or “guessing” response when the decision aid fails to function at all (automation error of omission). In contrast, we refer to an automation bias as the expression of such poor monitoring, when the decision aid provides incorrect advice (an error of commission), by following that advice. We note, in both cases, that the precedent of these two different error expressions, poor prior monitoring, can be associated with the loss of situation awareness (Endsley, 1995) at both Level 1 (fail to attend to what automation is doing) and Level 2 (fail to understand attended information) in the period prior to the automation failure. The critical variable contrasted is the human–system consequence of such nonvigilant monitoring prior to failure on the response to the two different kinds of automation failures. Distinctions between the two automation error types are critical in the design of decision aids because of designer choices. In an uncertain world, decision and diagnostic aids may, of necessity, make errors (Lees & Lee, 2007; Madhavan, Lacson, & Wiegmann, 2006). However, designers may choose to make the automation aids either (a) more “aggressive,” offering firm recommended actions or diagnoses based on uncertain information (and hence producing errors), or (b) less aggressive, simply “backing off” under higher levels of uncertainty to offer no recommendation whatsoever. Both approaches can produce an automation error. In the first case, the errant automation is “wrong” (commission), whereas in the second, the automation is “gone,” perhaps unexpectedly (omission). High user trust and dependence in the first case will lead to the automation bias of inappropriate compliance. High user trust and dependence in the second case will reflect complacency associated with prior automation monitoring. Each may have different negative consequences to full human–system performance when the respective automation error occurs. In two experiments, Bahner et al. (2008) and Manzey, Reichenbach, and Onnasch (2012) examined the constructs of automation bias and complacency as dependent variables for decision aiding within the same environmental process


control microworld, AutoCAMS (Manzey et al., 2008). AutoCAMS simulates the life-support control system for a space capsule and requires the user to manage and repair routine process control failures, such as gas leaks and valve blockages. To examine the automation bias, an automated decision aid called the Automated Fault Identification and Repair Agent (AFIRA) unexpectedly provided an incorrect diagnosis of a process control failure. To examine complacency, investigators identified the proportion of correct diagnostic steps that participants took in order to firmly confirm the diagnosis. Shortchanging those steps represented complacency. Similarly, the incorrect AFIRA-recommended actions that participants followed provided a measure of automation bias. In the two studies, the authors noted that between 25% and 50% of the participants demonstrated the automation bias, and those who did manifested greater complacency than those who did not. Bahner et al. (2008) varied the kind of automation error, whether AFIRA was gone or wrong. However, in both studies, the authors varied automation error type within participants between blocks, with the wrong advice from the aid (AFIRA wrong) occurring in the block prior to the absent advice (AFIRA gone). Hence whereas the first (AFIRA wrong) failure was clearly unexpected, the second (AFIRA gone) was not, probably accounting for the minimal disruption observed for AFIRA gone. HAI literature provides compelling evidence to indicate greater cost to human performance of FFs that are unexpected “black swans” compared to subsequent “gray swan” failures (Molloy & Parasuraman, 1996; Wickens et al., 2009; Yeh, Merlo, Wickens, & Brandenburg, 2003). To address the objective of evaluating FFs of decision aiding across automation failure types, in the current experiment in a between-subjects design, three groups of participants managed AutoCAMS 2.0, the microworld process control simulation employed by Bahner et al. (2008), Manzey et al. (2012), and Clegg, Vieane, Wickens, and Sebok (2014). Operators managed a series of trials in AutoCAMS during which they experienced a series of “routine system malfunctions” (e.g., stuck valves that could be diagnosed and fixed using procedures). For two


automation-supported groups, diagnosis and management were supported by the AFIRA decision aid. A third (control) group did not have the decision aid available. After considerable exposure, potentially inducing aid dependence and possible learning, the aid itself unexpectedly failed: For one group (AFIRA gone), the AFIRA guidance was unavailable. This condition was designed to assess complacency (in monitoring variables) that may have manifested itself on prior trials with AFIRA support. For the other group (AFIRA wrong), AFIRA provided an incorrect diagnosis, designed to assess the automation bias. In addition, a final trial also contained the AFIRA-gone failure (i.e., second failure [2F]) for both automation groups, allowing us to assess what, if anything, had been learned (regarding manual diagnosis and fault management performance) from the two types of errors in managing the previous FF. The control group participated in the same number of sessions but never received any kind of AFIRA support. We summarize our discussion in Table 1, where we present the fundamental contrast between the two types of automation failures and the corresponding implications and terminology associated with each.

Table 1: Automation Failure Types and Their Implications

Automation Failure (Error)                  Omission: Failure to Support the Human   Commission: Incorrect Support for the Human
Typically examined in                       Alerting systems                         Alerting systems and decision aids
Errant human performance described as       Overreliance                             Overcompliance
HAI shortcoming often referred to as        Complacency prior to failure             Automation bias at time of failure
Examined in current experiment as           Decision aid gone                        Decision aid wrong
Consequences of error for operator
  learning to diagnose and manage failure   Substantial                              Minimal

Note. HAI = human–automation interaction.

With this design, we formulated three major hypotheses.

Hypothesis 1: During training with "routine failures," the two automation groups will benefit from the AFIRA decision aid (relative to the control group), increasing accuracy, shortening diagnosis time, and reducing workload.

Hypothesis 2: Participants in the automation-wrong group will perform worse than participants in the automation-gone group, following the FF of the AFIRA aid. Performance (speed and accuracy of diagnosis) will suffer, as both groups, dependent on the aid, fail to learn the correct diagnosis and fault-management steps during previous "routine failure" trials. If we extrapolate the analogous automation false alarm–versus–miss contrast examined by Dixon et al. (2007) to the current contrast, we might predict that automation wrong would have more serious consequences; that is, the automation bias, much like a response to a false alarm, is more problematic than the complacency consequence of automation gone leading to a miss or delayed response. Such a prediction is also consistent with the finding that later-stage automation errors are more serious than earlier-stage automation errors (Onnasch, Wickens, Manzey, & Li, 2014; Parasuraman, Sheridan, & Wickens, 2000, 2008). In addition, the automation-gone condition is highly salient: A red status alert indicates a problem, but the AFIRA aid shows a blank screen. The participant quickly knows that something is amiss. In the automation-wrong condition, the participant sees the alert and sees what initially appears to be correct, plausible guidance.

Hypothesis 3: The salient experience of aid withdrawal will lead to an increased allocation of attention toward the diagnostic process, now imposing higher workload and subsequently learning (acquiring) an improved ability to deal with future failures. In contrast, to the extent that automation aiding is wrong (rather than simply gone), the participant shows an automation bias, the automation failure does not lead to such an investment of effort in learning the process (lower workload), and thus transfer to the second failure will be reduced.

Method

Participants

One hundred and nineteen participants were recruited from undergraduate psychology classes

for optional, partial course credit. Forty were assigned to the control condition, 40 were assigned to the automation-wrong condition, and 39 were assigned to the automation-gone condition.

Apparatus and Design

A supervisory process control task was used for this experiment (AutoCAMS 2.0; Manzey et al., 2008, 2012). This task simulated a cabin life-support system in which the operator is responsible for ensuring that oxygen and pressure subsystems stay within a normal range. When a component failure is injected into the system, causing one or both of the critical subsystems to go out of normal range, operators must detect, diagnose, repair, and manage failures with the automated failure diagnostic assistant (AFIRA) present or with it removed.

A 3 (automation group: control, automation gone [or auto gone], automation wrong [or auto wrong]) × 3 (automation support: baseline "routine" fault with correct AFIRA support, FF of automated support, 2F) mixed design was used, with automation group as a between-subjects variable and automation support as a within-subjects variable. The differences in AFIRA do not apply to the control group. Participants were randomly assigned to one of the three groups.

Participants were provided with study aids containing information pertinent to diagnosing each of nine failures in AutoCAMS, which included an annotated copy of the interface and a table of diagnoses (e.g., oxygen valve leak) and symptoms (e.g., oxygen decreasing). Participants


could use these study aids during training and practice, and they could refer to them during the experimental blocks. Prior to participation in the experimental study conditions, all participants completed a self-paced multimedia training presentation on how to operate AutoCAMS. Most participants completed the training in 30 to 40 min. The multimedia training presentation covered the components of AutoCAMS and demonstrated how to detect, diagnose, manage, and repair the system. In addition to learning how to handle routine failures within the system, participants were instructed on two secondary tasks: logging the carbon dioxide level at 1-min intervals and clicking on a connection icon as it appeared (only data for the former were sensitive to experimental manipulations). Participant actions and the time and accuracy of secondary tasks were recorded. After training, participants had a 5-min practice block in AutoCAMS, featuring automated assistance from AFIRA that correctly identified the failure and provided the steps for managing and repairing the failure. One routine failure occurred during the practice block (either an oxygen or nitrogen failure). Participants were told that AFIRA might not always be available and could be incorrect, suggesting that they should cross-check its advice. After the practice block, participants were encouraged to ask questions, and answers were provided to ensure understanding. Following practice, participants were given a set of four experimental blocks. The first two 15-min blocks provided correct automated assistance only for those in the AFIRA-gone and AFIRA-wrong groups, in a manner consistent with that received in the practice block. The AFIRA system identified the specific failure in need of repair and then provided steps to follow regarding managing and repairing the failure. Failures were scheduled at specific times within a block (Block 1: 4 and 11 min; Block 2: 2, 6, and 12 min). For all of the blocks in the experiment, there was a 60-s repair time for each repair initiated. After the participant initiated the repair actions, 60 s elapsed before feedback indicated if the actions had the desired effect. At the time of a routine failure, all participants experienced a visual master alarm that


changed color from green to red. For the automation groups, AFIRA also provided a diagnosis (e.g., "Nitrogen Valve Block") and steps to manage the failure (e.g., "Set Nitrogen Flow to High," "Send Repair Task"). The control group received the same master alarm indicating the onset of the fault as the automation groups but was never provided any automated assistance for diagnosis, repair, or system management.

In Block 3, the automated assistance was unexpectedly removed for the automation-gone group, whereas automated assistance was unexpectedly incorrect for the automation-wrong group. The control group continued to experience failures without automated assistance. Block 3 lasted a total of 15 min and had the FF scheduled to occur 10 min into the block. Those in the automation-wrong condition were provided with the red alarm and AFIRA guidance to alert them of a failure. However, AFIRA's diagnosis of the failure was incorrect. Those in the automation-gone condition were solely provided with the alarm (but AFIRA was blank) to indicate that a failure was present and were now additionally required to diagnose the failures (unexpectedly in the FF), manage the system (i.e., manually control oxygen or pressure), and repair the fault on their own.

In Block 4, the automated assistance was unexpectedly unavailable to all of the groups. Block 4 lasted a total of 5 min and had the failure (2F) scheduled to occur 1 min into the block. All participants were provided with an alarm to alert them to a failure in the system, but AFIRA was blank.

Participants received unique failures throughout practice and experimental blocks (eight failures used). The order of failures was partially counterbalanced. The NASA Task Load Index subjective workload rating (NASA-TLX; Hart & Staveland, 1988) was administered to report workload after the first and second blocks. A combined report was made for the third and fourth experimental blocks.

Results

Latency to First Repair Attempt

Analysis of the time from the onset of the fault to the first repair effort initiated reflects the


time invested in the initial diagnostic effort (i.e., before receiving feedback that the action based on that diagnosis was or was not effective and further diagnosis was needed). A MANOVA was conducted across the three critical failure types: (a) the baseline routine failure at the start of Block 2, (b) the automation-gone versus automation-wrong versus manual control failure in which automation support fails unexpectedly for the first time (first automation fault, FF), and (c) the failure whereby all groups received no automation support (second automation failure, 2F). (The use of only the single routine failure for the baseline data was because this trial provided maximum practice, closely equivalent to the two AFIRA failure trials.)

[Figure 1: "Initial Time Spent Diagnosing." Latency to first repair attempt (s) by failure type (Baseline, FF, 2F) for the Control, Auto Gone, and Auto Wrong groups.]

Figure 1. Diagnosis time prior to the first repair attempt as a function of failure type and group. In the baseline trials, the two automation groups had the same Automated Fault Identification and Repair Agent (AFIRA) support. For first failure, only automation wrong had AFIRA, but it was incorrect. For second failure, none of the groups had AFIRA.

As reflected in Figure 1, the analysis showed a main effect of failure type, Wilks' Λ = .70, F(2, 97) = 20.75, p < .01; a main effect of condition, F(2, 98) = 4.53, p < .05; and the critical interaction of Failure Type × Condition, Wilks' Λ = .63, F(4, 194) = 12.68, p < .01. A planned linear contrast showed that AFIRA support provided substantial aid to the two automation groups (M = 28 s) in comparison to the manual control group (M = 50 s) on the baseline routine failure, t(109) = 5.14, p < .01, d = 1.03. Planned comparisons on baseline failure versus FF showed that when automation was

unexpectedly gone, a large increase in diagnosis time occurred compared to the baseline routine failure when automation support had been present, t(29)  = 4.33, p  < .005, d  = 0.79. In contrast, the automation-wrong group showed no increase in diagnostic latency on the FF. Operators issued a repair command as quickly as they did when AFIRA offered correct advice, t(35)  = 0.63, p  > .20, d  = 0.10, and were significantly faster than the control group on the FF trial: linear contrast, t(101)  = 4.54, p  < .01, d  = 1.04. However, as will be described later, this increased speed came at the cost of accuracy. Participants in the auto-wrong condition were simply (and incorrectly) following AFIRA guidance. When automation was unavailable in the next fault (2F), no differences in latency were found between the two AFIRA groups: linear contrast, t(113)  = 0.63, p  > .10. The auto-gone group, who had guidance unavailable during the FF trial, maintained an equivalent slow response, whereas the auto-wrong group response latency slowed to match the speed of the auto-gone group. Thus a removal of support for both groups appears to have had the same effect. However, on the 2F, both groups remained slower than the control group: linear contrast, t(113)  = 2.00, p  < .05.
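For illustration, one of the planned comparisons just described could be scripted along the following lines. This is a minimal sketch, not the authors' analysis code: the column names and data values are hypothetical, and the Cohen's d convention shown is only one common variant.

# Minimal sketch (not the authors' analysis code) of a planned comparison:
# baseline vs. first failure (FF) latency within the auto-gone group,
# with a Cohen's d effect size. Column names and values are hypothetical.
import numpy as np
import pandas as pd
from scipy import stats

# Long-format data: one row per participant x failure type (toy values).
df = pd.DataFrame({
    "subject":      [1, 1, 2, 2, 3, 3, 4, 4],
    "failure_type": ["baseline", "FF"] * 4,
    "latency_s":    [27, 62, 30, 55, 25, 70, 33, 58],
})

wide = df.pivot(index="subject", columns="failure_type", values="latency_s")
res = stats.ttest_rel(wide["FF"], wide["baseline"])  # within-subjects comparison

# One common Cohen's d variant for a paired design: mean difference divided by
# the pooled SD of the two conditions (conventions vary; the paper does not specify).
pooled_sd = np.sqrt((wide["FF"].var(ddof=1) + wide["baseline"].var(ddof=1)) / 2)
d = (wide["FF"].mean() - wide["baseline"].mean()) / pooled_sd
print(f"t({len(wide) - 1}) = {res.statistic:.2f}, p = {res.pvalue:.3f}, d = {d:.2f}")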


Diagnosis Step Percentages

Echoing the approach of Bahner et al. (2008), we assessed complacency as a dependent variable by measuring the percentage of diagnostic steps taken, relative to those that would be necessary to fully make (or confirm) the diagnosis. That is, 100% would represent a perfect fail-safe diagnostic process. As shown in Table 2, these data followed a pattern identical to the diagnosis time in Figure 1, indicating both the dependence of response time (RT) on the number of these steps and, on the FF, the tendency of the auto-wrong group to shortchange this confirmation (percentage far less than 100%).

Table 2: Percentages of Necessary Diagnostic Steps Taken for Complete Manual Diagnosis

Group        Baseline   First Failure   Second Failure
Control         50            47              47
Auto gone        5            58              44
Auto wrong       3             5              46

Accuracy of Diagnosis

A MANOVA was conducted on the accuracy of initial diagnosis across the three critical failures. As reflected in Figure 2, there was a main effect of failure type, Wilks' Λ = .70, F(2, 99) = 43.26, p < .01; no main effect of condition, F(2, 100) = 1.22, p > .05; and an interaction of Failure Type × Condition, Wilks' Λ = .65, F(4, 198) = 11.97, p < .01. As with diagnosis time, a planned linear contrast showed the hypothesized large accuracy benefit from AFIRA diagnostic support for the two automation groups on the baseline routine failure, t(111) = 4.21, p < .01, d = 0.97. Planned comparisons on baseline versus FF showed that both automation failure groups declined in their accuracy when that support was unavailable—automation gone, t(30) = 4.21, p < .01, d = 0.75; automation wrong, t(36) = 9.61, p < .01, d = 1.58—but there was a much greater loss in diagnostic accuracy for the automation-wrong group than for the automation-gone group on the first-failure trial: linear contrast, t(101) = 4.05, p < .01. On the final trial (2F), there was no evidence for any difference in accuracy between the three groups.

Management During System Faults

In addition to the diagnosis and repair of the faults, operators were required to manage systems during the faults to maintain required parameters, a sort of secondary task. Performance on this aspect of the task was indexed in terms of time (in seconds) in which the faulty system was allowed to remain outside of the necessary range. A MANOVA was conducted on performance across the three critical failures. As reflected in Figure 3, there was a main effect of failure type, Wilks' Λ = .92, F(2, 100) = 4.49, p < .05; no main effect of condition, F(2, 101) < 1; and an interaction of Failure Type × Condition, Wilks' Λ = .92, F(4, 200) = 2.37, p = .05. A planned linear contrast showed the hypothesized system management benefit from AFIRA support for the two automation groups on the baseline routine failure, t(111) = 2.28, p < .05, d = 0.43. Planned comparisons on baseline versus FF showed that both automation failure groups declined in fault management when that support was unavailable: auto gone, t(30) = 2.49, p < .05, d = 0.45; auto wrong, t(37) = 2.78, p < .01, d = 0.45. Although the decrement appears greater for the auto-wrong group (echoing the pattern for diagnosis accuracy), the difference between these groups failed to reach statistical significance on the FF trial: linear contrast, t(102) = 1.66, p = .10. On the final trial (2F), there were no differences in management accuracy between the groups.

Secondary Task

The secondary CO2 logging task accuracy depicted in Figure 4, assessing operator workload, showed no main effect of failure type, Wilks' Λ = .97, F(2, 100) = 1.49, p > .05; a significant main effect of condition, F(2, 101) = 3.59, p < .05; and a significant interaction of Failure Type × Condition, Wilks' Λ = .88, F(4, 200) = 3.23, p < .05.
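The omnibus tests of this 3 (group) × 3 (failure type) mixed design could be approximated as sketched below. This is an illustrative sketch only: the data file and column names are assumptions, and the paper reports the multivariate (Wilks' Λ) tests, whereas the call shown returns univariate F ratios.

# Illustrative sketch of a 3 (group) x 3 (failure type) mixed-design omnibus test.
# Not the authors' analysis: the paper reports multivariate (Wilks' lambda) tests,
# whereas pingouin.mixed_anova returns univariate F ratios. The file name and
# column names below are assumptions for illustration.
import pandas as pd
import pingouin as pg

# Expected long format: one row per participant x failure type, with columns
# "subject", "group" (control / auto_gone / auto_wrong),
# "failure_type" (baseline / FF / 2F), and a dependent variable such as "accuracy".
df_long = pd.read_csv("autocams_results_long.csv")  # hypothetical file

aov = pg.mixed_anova(
    data=df_long,
    dv="accuracy",
    within="failure_type",
    subject="subject",
    between="group",
)
print(aov.round(3))  # rows: group, failure_type, and their interaction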


[Figure 2: "Accuracy of Diagnosis." Proportion correct initial diagnosis by failure type (Baseline, FF, 2F) for the Control, Auto Gone, and Auto Wrong groups.]

Figure 2. Accuracy of the initial diagnosis (measured in terms of the nature of the first repair attempt initiated) by failure type and group.

[Figure 3: "Fault Management." Time the damaged system was out of range (s) by failure type (Baseline, FF, 2F) for the Control, Auto Gone, and Auto Wrong groups.]

Figure 3. Effectiveness of managing the failed system (measured in time with the damaged system allowed to remain out of range) by failure type and group.

A planned linear contrast showed the expected benefit from AFIRA diagnostic support for the two automation groups on the baseline routine failure, t(111) = 2.22, p < .05, d = 0.44, a benefit eliminated on later trials. Further analysis revealed that the source of the significant interaction was the loss in secondary task performance (increase in workload) for auto wrong, from the FF to the 2F.
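The dependent measures reported in the preceding sections (initial diagnosis latency, the percentage of necessary diagnostic steps, and secondary-task logging accuracy) could be scored from per-trial logs along the following lines. This is a minimal sketch under assumed field names; it is not the AutoCAMS/AFIRA logging format.

# Sketch of how the dependent measures reported above could be scored from a
# per-trial record. These field names are placeholders chosen for illustration;
# they are not the AutoCAMS/AFIRA logging format.
from dataclasses import dataclass
from typing import List


@dataclass
class TrialRecord:
    fault_onset_s: float
    first_repair_s: float        # time the first repair command was sent
    diagnosis_correct: bool      # was that first repair aimed at the true fault?
    steps_taken: List[str]       # diagnostic checks the operator actually performed
    steps_required: List[str]    # checks needed to fully confirm the diagnosis
    seconds_out_of_range: float  # fault management: time the subsystem was out of range
    co2_logs_correct: int        # secondary task: correct CO2 log entries
    co2_logs_expected: int


def latency_to_first_repair(t: TrialRecord) -> float:
    """Diagnosis time prior to the first repair attempt (cf. Figure 1)."""
    return t.first_repair_s - t.fault_onset_s


def diagnostic_step_percentage(t: TrialRecord) -> float:
    """Complacency index (cf. Table 2): percentage of necessary diagnostic steps
    taken; 100% would be a fully fail-safe confirmation of the diagnosis."""
    required = set(t.steps_required)
    return 100.0 * len(required & set(t.steps_taken)) / len(required)


def co2_logging_accuracy(t: TrialRecord) -> float:
    """Secondary-task workload index (cf. Figure 4): proportion of correct CO2 logs."""
    return t.co2_logs_correct / t.co2_logs_expected


# Hypothetical example trial:
trial = TrialRecord(600.0, 640.0, False, ["check_o2_flow"],
                    ["check_o2_flow", "check_n2_tank", "check_mixer"], 85.0, 4, 5)
print(latency_to_first_repair(trial), diagnostic_step_percentage(trial),
      co2_logging_accuracy(trial))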

Subjective Workload

Workload was also indexed using retrospective NASA-TLX self-reports. Unweighted measures were derived from the baseline Block 2 that contained up to three routine failures and a combined report following Block 4 for events within Blocks 3 and 4, which included both AFIRA failures (FF and 2F). Data for 11 participants were unavailable because of recording failures.
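The unweighted ("raw") TLX score is conventionally the mean of the six subscale ratings, omitting the pairwise-comparison weighting of the full NASA-TLX procedure; a minimal computation is sketched below. The 0–100 scale and the variable names are assumptions for illustration, not a description of this study's materials.

# Minimal sketch of an unweighted ("raw") NASA-TLX score: the mean of the six
# subscale ratings, with no pairwise-comparison weights. The 0-100 scale and the
# dictionary layout are illustrative assumptions.
SUBSCALES = ("mental_demand", "physical_demand", "temporal_demand",
             "performance", "effort", "frustration")


def raw_tlx(ratings: dict) -> float:
    """ratings maps each subscale name to a rating on a 0-100 scale."""
    missing = [s for s in SUBSCALES if s not in ratings]
    if missing:
        raise ValueError(f"missing subscales: {missing}")
    return sum(ratings[s] for s in SUBSCALES) / len(SUBSCALES)


print(raw_tlx({"mental_demand": 70, "physical_demand": 10, "temporal_demand": 55,
               "performance": 40, "effort": 65, "frustration": 35}))  # 45.83...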


[Figure 4: "CO2 Logging Task." Proportion correct by failure type (Baseline, FF, 2F) for the Control, Auto Gone, and Auto Wrong groups.]

Figure 4. Secondary task (CO2 value logging) proportion correct by failure type and group.

The NASA-TLX workload data revealed a main effect of block, Wilks' Λ = .94, F(1, 85) = 5.57, p < .05; no main effect of condition, F(2, 85) = 1.37, p > .05; and an interaction of Block × Condition, Wilks' Λ = .81, F(2, 85) = 10.19, p < .01. Figure 5 clearly indicates the source of this interaction: Over blocks, workload for the manual group declined nonsignificantly, t(28) = 1.58, p > .05, consistent with a very modest practice/learning effect increasing reserve capacity. Workload for the auto-wrong group showed a nonsignificant increase, t(28) = −0.93, p > .05, with the higher workload on the less practiced unsupported failure management on 2F offset by the low workload for the largely neglected FF. For the AFIRA-gone group, there was a large, significant 44% increase, t(29) = −5.18, p < .05, reflecting the high effort of unpracticed manual diagnosis on both failure trials. These findings suggest, consistent with the variables described previously, that the auto-gone participants suddenly had to marshal their resources to diagnose the problem in the FF condition, and the auto-wrong participants did so on the 2F but treated the FF just like all previous AFIRA trials.

[Figure 5: Unweighted TLX score for the baseline block versus the failure blocks, by group (Control, Auto Gone, Auto Wrong).]

Figure 5. Unweighted NASA Task Load Index workload self-reports by group.

Discussion

The goal of the experiment was to compare performance decrements caused by two qualitatively different types of unexpected automation errors: an omission of decision aiding advice (AFIRA gone), traditionally associated with revealing the effects of prior complacency, and the presentation of incorrect advice (AFIRA wrong), traditionally associated with the automation bias. Our goal was to examine both the differential response to the two different automation errors to the initial "black swan" failure and the response to a subsequent failure that might reflect what was learned about the system during the previous AFIRA failures.

Hypothesis 1, regarding the overall benefit of correct decision-aiding automation, was confirmed, indicating that those supported by correct automation in the baseline trial improved accuracy by 100%, reduced latency (by half), and reduced workload as indicated by the improved secondary-task performance, better fault management, and lower subjective ratings. Hence the AFIRA aid was both heavily used and effective, as found in previous studies (Bahner et al., 2008; Manzey et al., 2012).

Hypothesis 2 was also confirmed. For the FF, there was a pronounced decrease in accuracy for both failure types. But a closer scrutiny reveals a very different pattern between the two. The AFIRA-gone group showed the expected pattern of somewhat lower accuracy and longer time, replicating the complacency effect observed by Clegg et al. (2014). Because the alert was salient, this

effect was not in the failure to notice but rather in the inadequacy of monitoring the state of variables prior to the alert (complacency). However, for the auto-wrong (automation bias–inducing) error, accuracy dropped much more (to near zero), but RT was unchanged, a pattern of the speed–accuracy trade-off, suggesting that participants did not even notice that the advice was wrong, at least until they began managing the incorrect failure. This pattern was confirmed by the minimal diagnostic tests performed by this group (Table 2), with a clear inadequacy of engaging in independent diagnostic actions. This study identified a strong automation bias.

Thus, from the point of view of overall system performance, these data suggest that the automation bias is far more problematic than the response to an unexpectedly missing support system on the FF, given that the large loss of accuracy shown by the auto-wrong group should be viewed as more serious in safety-critical systems than a delay in diagnosis shown by the auto-gone group. As such, these effects are consistent with the overall pattern of automation failure effects observed in the meta-analysis by Onnasch et al. (2014). Wrong advice of what action to take (which is a later-stage automation failure) is more problematic than the absence of advice (akin to an earlier-stage automation information failure).

Hypothesis 3 addressed the recovery and longer-term learning revealed on the 2F. Such

recovery is clearly shown by the auto-wrong group, which demonstrated obvious improvement to its diagnosis accuracy (Figure 2) compared with the FF, whereby the group blindly followed the incorrect AFIRA advice. Indeed, the diagnosis accuracy of the auto-wrong group on the 2F was nearly equivalent to that of the other two groups of participants. In addition, on the 2F, the auto-wrong group showed a speed of diagnosis that was only slightly slower than that of the manual group and equivalent to the AFIRA-gone group. Thus although the AFIRA-wrong automation error was more devastating than the auto-gone error, its long-term consequences to performance were no greater when followed by a more salient auto-gone failure (the AFIRA screen was blank). Indeed, the long-term consequences of both error types were relatively minor, with only a residual, but significant, 20% cost to diagnosis latency on the 2F trial (Figure 1).

Beyond addressing the three hypotheses, three important observations from the data deserve further comment. First, in contrast to previous studies (Manzey et al., 2012), the AFIRA-supported groups on baseline trials failed to achieve 100% diagnostic accuracy, despite the perfection of AFIRA, indicating some degree of mistrust in AFIRA, with participants occasionally countering its recommendations. Second, it is clear that the task is challenging. Without AFIRA support,


even the well-trained manual group hovered at around 50% initial diagnostic accuracy, far below the 80% to 90% level obtained in prior research (Bahner et al., 2008; Manzey et al., 2012). This finding probably reflects in part the difference in participant demographics—paid engineering students in the prior studies and unpaid participants from undergraduate psychology courses in the present study—and also the longer training given in the prior studies. Third, both automation groups reflected complacency as a dependent variable in the manner assessed by Bahner et al. (2008) by severely shortchanging the number of necessary diagnostic steps, a trend that continued for the auto-wrong group on the first-failure trial.

Conclusions and Limitations

In conclusion, this comparative analysis has revealed different consequences of the two different types of decision-aiding automation failures on the FF. It suggests that caution be exercised by decision-aiding designers in imposing aggressive and confident diagnostic or predictive advice in the face of uncertainty (i.e., when automation may be incorrect). It suggests that such decision aiding should be offered in the spirit of likelihood alerts (Sorkin, Kantowitz, & Kantowitz, 1988; Wiczorek & Manzey, 2014), in which automation can express its doubts about uncertain diagnoses, hence mitigating the automation bias.

The major limitation of the current study is its use of relatively untrained undergraduate participants. However, it is important to consider that the finding of the automation bias within this paradigm closely replicated that reported by Manzey et al. (2012) and also that such a bias in following incorrect diagnostic advice has been observed to be just as prevalent with experts as with novices in other domains (Skitka, Mosier, & Burdick, 2000).

Acknowledgments

This research was sponsored by NASA Grant NNX12AE69G. The authors wish to thank Brian Gore, Jessica Marquez, Sandra Whitmire, Kristine Ohnesorge, Laura Bollweg, Heather Paul, and Lauren Leveton for their guidance and support.

Key Points

•• Diagnostic decision-aiding automation can fail in two qualitatively different ways: by providing wrong advice and by providing no advice.
•• When humans strongly depend on such automation, they may manifest the automation bias with wrong advice, triggering lost accuracy, and manifest complacency (in prior monitoring) with gone advice, reflected in a delay in manual failure management.
•• The automated decision-aiding tool for process management used here increased accuracy and decreased both latency and workload when it was correct.
•• When automation was unexpectedly wrong, a pronounced automation bias was manifested in poor diagnostic accuracy; when automation was unexpectedly gone, more modest evidence of a complacency-based delay in diagnosis and reduction in accuracy was shown.
•• The greater disruption of the automation bias, when automation was wrong, suggests automation diagnostic aids should express their own confidence (see the sketch below).
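The final key point, and the likelihood-alert suggestion in the conclusions, can be made concrete with a small sketch of an aid that grades its advice by confidence rather than always issuing a firm diagnosis, in the spirit of Sorkin et al. (1988) and Wiczorek and Manzey (2014). This is a hypothetical illustration, not AFIRA's logic; the thresholds and message wording are assumptions.

# Hypothetical sketch (not AFIRA's logic) of a decision aid that grades its advice
# by confidence instead of always issuing a firm diagnosis. Thresholds and wording
# are illustrative assumptions.
def advise(diagnosis: str, confidence: float,
           high: float = 0.90, low: float = 0.60) -> str:
    """Return graded advice for a candidate diagnosis with confidence in [0, 1]."""
    if confidence >= high:
        # Firm recommendation: analogous to the "aggressive" aid described above.
        return f"Diagnosis: {diagnosis}. Recommended repair steps follow."
    if confidence >= low:
        # Likelihood-alert style: advice is offered, but doubt is made explicit
        # so the operator is cued to cross-check rather than simply comply.
        return (f"Possible fault: {diagnosis} (confidence {confidence:.0%}). "
                "Please verify the symptom pattern before repairing.")
    # Below the low threshold the aid backs off entirely ("automation gone"),
    # rather than risk a confident wrong answer ("automation wrong").
    return "No reliable diagnosis available. Manual diagnosis required."


print(advise("oxygen valve leak", 0.95))
print(advise("oxygen valve leak", 0.72))
print(advise("oxygen valve leak", 0.40))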

References

Bahner, E., Huper, A. D., & Manzey, D. (2008). Misuse of automated decision aids: Complacency, automation bias and the impact of training experience. International Journal of Human–Computer Studies, 66, 688–699.
Clegg, B. A., Vieane, A. Z., Wickens, C. D., Gutzwiller, R. S., & Sebok, A. L. (2014). The effects of automation-induced complacency on fault diagnosis and management performance in process control. In Proceedings of the Human Factors and Ergonomics Society 58th Annual Meeting (pp. 844–848). Santa Monica, CA: Human Factors and Ergonomics Society.
Dixon, S. R., & Wickens, C. D. (2006). Automation reliability in unmanned aerial vehicle flight control: A reliance compliance model of automation dependence in high workload. Human Factors, 48, 474–486.
Dixon, S. R., Wickens, C. D., & McCarley, J. S. (2007). On the independence of compliance and reliance: Are automation false alarms worse than misses? Human Factors, 49, 564–572.
Endsley, M. (1995). Toward a theory of situation awareness in dynamic systems. Human Factors, 37, 32–64.
Hart, S. G., & Staveland, L. E. (1988). Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. In P. A. Hancock & N. Meshkati (Eds.), Human mental workload (pp. 139–184). Amsterdam, Netherlands: North Holland Press.
Lees, M. N., & Lee, J. D. (2007). The influence of distraction and driving context on driver response to imperfect collision warning systems. Ergonomics, 50, 1264–1286.
Madhavan, P., Lacson, F., & Wiegmann, D. (2006). Automation failures on tasks easily performed by operators undermine trust in automated aids. Human Factors, 48, 241–256.


Maltz, M., & Shinar, D. (2003). New alternative methods in analyzing human behavior in cued target acquisition. Human Factors, 45, 281–295.
Manzey, D., Bleil, M., Bahner-Heyne, J. E., Klostermann, A., Onnasch, L., Reichenbach, J., & Röttger, S. (2008). AutoCAMS 2.0 manual. Retrieved from http://www.aio.tu-berlin.de/?id=30492
Manzey, D., Reichenbach, J., & Onnasch, L. (2012). Human performance consequences of automated decision aids: The impact of degree of automation and system experience. Journal of Cognitive Engineering and Decision Making, 6, 1–31.
Meyer, J. (2001). Effects of warning validity and proximity on responses to warnings. Human Factors, 43, 563–572.
Meyer, J. (2004). Conceptual issues in the study of dynamic hazard warnings. Human Factors, 46, 196–204.
Meyer, J., & Lee, J. (2013). Trust, reliance, and compliance. In J. D. Lee & A. Kirlik (Eds.), The Oxford handbook of cognitive engineering (pp. 109–124). New York, NY: Oxford University Press.
Molloy, R., & Parasuraman, R. (1996). Monitoring and automated systems for a single failure: Vigilance and task complexity effects. Human Factors, 38, 311–322.
Mosier, K. L., & Fischer, U. (2010). Judgment and decision making by individuals and teams: Issues, models, and applications. Reviews of Human Factors and Ergonomics, 6, 198–255.
Mosier, K., & Skitka, L. (1996). Human decision makers and automated decision aids. In R. Parasuraman & M. Mouloua (Eds.), Automation and human performance: Theory and applications (pp. 201–220). Mahwah, NJ: Lawrence Erlbaum.
Mosier, K. L., Skitka, L. J., Heers, S., & Burdick, M. (1998). Automation bias: Decision-making and performance in high-tech cockpits. International Journal of Aviation Psychology, 8, 47–63.
Onnasch, L., Wickens, C. D., Manzey, D., & Li, H. (2014). Human performance consequences of stages and levels of automation: An integrated meta-analysis. Human Factors, 56, 476–488.
Parasuraman, R., & Manzey, D. (2010). Complacency and bias in human use of automation: An attentional integration. Human Factors, 52, 381–410.
Parasuraman, R., Molloy, R., & Singh, I. L. (1993). Performance consequences of automation-induced "complacency." International Journal of Aviation Psychology, 3, 1–23.
Parasuraman, R., Mouloua, M., & Molloy, R. (1996). Effects of adaptive task allocation on monitoring of automated systems. Human Factors, 38, 665–679.
Parasuraman, R., Sheridan, T. B., & Wickens, C. D. (2000). A model for types and levels of human interaction with automation. IEEE Transactions on Systems, Man, and Cybernetics. Part A: Systems and Humans, 30, 286–297.
Parasuraman, R., Sheridan, T. B., & Wickens, C. D. (2008). Situation awareness, mental workload, and trust in automation: Viable, empirically supported cognitive engineering constructs. Journal of Cognitive Engineering and Decision Making, 2, 141–161.
Skitka, L. J., Mosier, K. L., & Burdick, M. (2000). Accountability and automation bias. International Journal of Human–Computer Studies, 52, 701–717.
Sorkin, R. D., Kantowitz, B. H., & Kantowitz, S. C. (1988). Likelihood alarm displays. Human Factors, 30, 445–460.
Wickens, C. D., Hooey, B. L., Gore, B. F., Sebok, A., & Koenicke, C. S. (2009). Identifying black swans in NextGen: Predicting human performance in off-nominal conditions. Human Factors, 51, 638–651.
Wiczorek, R., & Manzey, D. (2014). Supporting attention allocation in multitask environments: Effects of likelihood alarm systems on trust, behavior, and performance. Human Factors, 56, 1209–1221.
Yeh, M., Merlo, J. L., Wickens, C. D., & Brandenburg, D. L. (2003). Head up versus head down: The costs of imprecision, unreliability, and visual clutter on cue effectiveness for display signaling. Human Factors, 45, 390–407.

Christopher D. Wickens is a professor emeritus of aviation and psychology at the University of Illinois and is currently a senior scientist at Alion Science and Technology, Boulder, Colorado.

Benjamin A. Clegg is a professor of cognitive psychology at Colorado State University. He received his PhD in psychology in 1998 from the University of Oregon.

Alex Z. Vieane is a graduate student at Colorado State University. She received her BA in psychology in 2012 from California State University, Long Beach.

Angelia L. Sebok is a principal human factors engineer and program manager at Alion Science and Technology. She earned her MS degree in industrial and systems engineering from Virginia Tech in 1991.

Date received: October 13, 2014
Date accepted: March 3, 2015
