Part 1: "Reliance" and "distraction" effects in PTC automation

By T. B. Sheridan (MIT),
F. C. Gamst (Univ. of Mass., Boston),
and R. A. Harvey, BLE

White Paper, 11/28/99

EXECUTIVE SUMMARY

This document was requested of the PTC Human Factors Team by T. Raslear of the Federal Railroad Administration (FRA) on 3/3/99, in conjunction with ongoing discussions of PTC standards. The charge was to investigate the "reliance effect" and the "distraction effect," where definition and focus were left to the authors.

With regard to future automation of railway systems, and in particular with regard to the implementation of Positive Train Control (PTC), questions have been raised about the possible propensity for a locomotive engineer (LE) or conductor (C) to become over-reliant on automation and/or to become distracted by the additional monitoring burdens the automation requires, and for these effects to compromise the performance of their duties and safe and efficient train operation.

This white paper is organized by section as follows:

(1) First, details on the charge given to the authors by the FRA.

(2) Next, working definitions of terms "reliance effect" and "distraction effect" and the issues surrounding them.

(3) Review of the general human factors literature regarding humans and automation, and specifically the reliance and distraction phenomena, for example in piloting aircraft, driving highway vehicles, operating nuclear power plants, and performing routine machine-operation tasks. For each effect, its relevance to PTC automation is discussed.

(4) Details of the relation of reliance and distraction to operations under PTC, along with implied recommendations. This section, the longest, reviews the "open system" nature of the rail transportation system, proposes a "human-centered" design philosophy for PTC, comments on the relevance of the UK's Great Western accident of 1997, discusses which kinds of distraction are particularly threatening, analyzes the potential levels of automation for PTC design, and recommends which level seems best for safety.

(5) Classroom and simulator training for PTC.

(6) Conclusions.

The conclusions are:

(1) Over-reliance on automation (or not knowing how much to rely on it), and the added distraction of having to monitor the automation (or of poor ability to do so), are well known problems in the human factors literature, but there are few easy remedies.

(2) Maintenance of the locomotive engineer's perceptual, decision-making and control skills is considered mandatory.

(3) A PTC system should provide an auditory warning of appropriate hazards and graphical information about stopping profiles from the given speed. Otherwise it should allow for manual operation, unless certain limits are exceeded, at which point automatic braking enforcement should go into effect.

(4) Failures of a PTC system should be announced by a clearly discernible auditory alarm, and the type and time of failure recorded on the locomotive event recorder.

(5) Special classroom and simulator training for PTC operation, including failure scenarios, should be given to train crews.

1. Charge from the FRA

The original charge to the RSAC "Human Factors Team" dated 3/30/99 was as follows.

(1) "Investigate the 'Reliance Effect' on the non-fail safe systems. Will the operator become reliant upon the overlay system and become less attentive? If so, is it possible to estimate the effect on the safety of railroad operations? Are there countermeasures or redesign alternatives that warrant exploration?"

(2) "Investigate the 'Distraction Effect' associated with frequent or complex requirements to interact with the system. Is this a legitimate concern? To what extent? If it is a significant problem, is it possible to describe tolerable limits for these interactions and redesign alternatives that warrant exploration?"

The 9/8/99 Report of the Railroad Safety Advisory Committee to the Federal Railroad Administrator (page xiii, item 5.c) reads: "Develop human factors analysis methodology to project the response of crews and dispatchers to changes brought about by 'overlay' type PTC technology, including possible 'reliance' or 'complacency' and 'distraction' effects (initiated 2nd quarter 1999). Apply methodology to candidate projects."

2. The Concepts of Reliance and Distraction

2.1 Purpose of PTC and PTS

PTC has been defined to have the following core features in the Railroad Safety Advisory Committee's report to the Federal Railroad Administrator, "Implementation of Positive Train Control Systems" (RSAC, 1999: vii, 16-17):

(1) Prevent train-to-train collisions (positive train separation).

(2) Enforce speed restrictions, including civil engineering restrictions (curves, bridges, etc.) and temporary slow orders.

(3) Provide protection for roadway workers and their equipment operating under specific authorities.

It should be noted that Positive Train Separation (PTS) is included in the core-feature definition of PTC. Consequently, PTS need not be mentioned separately in discussions of PTC unless there is a particular reason to do so.

2.2. Working definitions of "Reliance Effect" and "Designed Reliance" in PTC Automation

The "reliance effect" is taken to refer to the tendency of the LE, C or train dispatcher to over-rely (rely more than the system designers or managers intend) on automation such as PTC in performing work tasks, particularly to the degree that the automation is deemed not to be fail-safe by itself. Concepts closely related to "reliance" are "complacency" and ''over-trust.

Insofar as the system is intentionally designed, or the level of automation is such, that the human operator is compelled or encouraged to defer to the automation, we call that "designed reliance." In Section 4.5 below we make specific recommendations in that regard. There may be a thin line between intentional, designed-in reliance and unintentional over-reliance, especially as understood by the human operator.

2.3. Definition of Distraction Effect in PTC Automation

The "distraction effect" is assumed to refer to the tendency of the LE to be distracted from other duties by frequent or complex cognitive interactions with the automation to plan and program its operation, monitor its performance, detect and diagnose and stay aware of any abnormalities, and rectify any abnormalities and ensure control. (Of course there are other distractions from radio conversation or wayside events.) Associated with "distraction" are the concepts of "mental workload," "attention deficit," and decrement in "situation awareness."

2.4. Levels of Automation

Insofar as reliance implies reliance on automation by design, it is sometimes useful to consider levels of automation, from none to full computer automation. The following scale (Sheridan, 1987) has been used in a variety of contexts:

1. The computer offers no assistance: the human must do it all.

2. The computer suggests alternative ways to do the task.

3. The computer selects one way to do the task, and

4. executes that suggestion if the human approves, or

5. allows the human a restricted time to veto before automatic execution, or

6. executes automatically, then necessarily informs the human, or

7. executes automatically, then informs the human only if asked.

8. The computer selects, executes, and ignores the human.

Movement further along this scale has been a continuing trend in recent years, and is most evident in the evolution of commercial aircraft. It began with autopilot systems; then came navigation aids, then diagnostic aids, collision and stall and ground proximity warnings, and finally the integration of all these into the Flight Management System, a multi-purpose computer system which oversees all functions and through which the pilot flies the aircraft. Pilots now call themselves "flight managers." A similar evolution is beginning to happen in highway vehicles, ships, factories, chemical plants, power stations, and hospitals as well as trains. It is commonly called "supervisory control" (see Sheridan, 1987, 1992).
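
Returning to the scale itself: to make it concrete for later reference (Section 3.4 recommends levels 2-3 for normal operation with level 6 enforcement in the background), the following sketch encodes it as a simple Python enumeration. The level names are our paraphrases for illustration, not official terminology.

    from enum import IntEnum

    class AutomationLevel(IntEnum):
        NO_ASSISTANCE = 1           # the human must do it all
        SUGGESTS_ALTERNATIVES = 2   # computer suggests alternative ways to do the task
        SELECTS_ONE_WAY = 3         # computer selects one way to do the task
        EXECUTES_IF_APPROVED = 4    # executes that way if the human approves
        EXECUTES_UNLESS_VETOED = 5  # allows a restricted time to veto before executing
        EXECUTES_THEN_INFORMS = 6   # executes automatically, then necessarily informs the human
        INFORMS_ONLY_IF_ASKED = 7   # executes automatically, informs the human only if asked
        IGNORES_THE_HUMAN = 8       # selects, executes, and ignores the human

    # Example: the enforcement back-stop recommended later in this paper.
    print(AutomationLevel.EXECUTES_THEN_INFORMS.value)   # prints 6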

3. Review of Reliance and Distraction Effects in the General Literature, and Their Relevance to PTC

In considering the experimental literature as well as practical experience with automation in piloting aircraft, driving highway vehicles, operating nuclear power plants and performing routine manufacturing tasks, one cannot discuss reliance without discussing complacency and trust.

3.1. Reliance Effect in the General Literature

When machines or people demonstrate their reliability it is only natural to depend on, indeed trust, them. Most of the technology around us works well, and even though our life may depend upon it, we simply do not think about it. Do we rely on the roofs over our heads or the buildings we are in not to fall down? Do we trust our brakes to slow and stop our cars from high speeds? Obviously we do - unless there are environmental circumstances (e.g., earthquakes, very steep hills) which cause us to make closer observations, or unless we receive unexpected signals (ominous noises, leaking oil, etc.). To some degree reliance on trustworthy systems is proper behavior, since we do not have time or attentional capacity to attend to and worry about everything around us. Clearly, however, one can become reliant on automation, trusting and complacent (insofar as the third term implies the first two) to a degree greater than is justified by the small risks which may be involved (where risk means probability of serious consequences times magnitude of those consequences.) There have been numerous studies of human reliance on automation recently (see, e.g., Riley, 1994; Sheridan, 1992; Parasuraman and Moula, 1994; Moula and Koonce, 1997).

Safety engineers have long worried about whether, if actions are taken to make systems safer, operators will simply take advantage of that safety margin to take correspondingly more risks, to the point where the level of safety remains constant. The technical term for this is "risk homeostasis." Evidence in automotive vehicles clearly shows that as brakes, tires, handling qualities, and highways have improved, drivers drive faster. Are they driving so fast that the safety improvements are nullified? Apparently not, for mortality and morbidity rates per passenger mile have declined significantly over the last 50 years (see the National Highway Traffic Safety Administration database). At the same time it can be said they are not as safe as they would be if they continued to drive at the same speeds as they did 50 years ago. So clearly in this context risk homeostasis, in the sense of behaving so as to maintain constant risk, is a false premise. But, surely, drivers are taking advantage of the technology to achieve greater performance while maintaining acceptable risk, where what is acceptable is now significantly safer than it was earlier. "Acceptable" is an important term in understanding human behavior relative to risk. It is also a relative term regarding danger to humans and property. What might be acceptable to persons removed from a danger might not be to persons directly affected by it.

The story with respect to risk homeostasis appears to be similar in other aspects of driving and in other transportation contexts. Currently there is worry that radar-based intelligent cruise control systems will lead drivers to follow the lead car more closely, and that GPS-based air traffic displays in the cockpit, heretofore not available to pilots (only the ground controllers saw radar returns), will lead pilots to second-guess ground controllers and take more chances.

"Trust" is a term which is relatively new in the human factors literature but which is drawing much attention. The term can have different subtle meanings, but usually it relates to the subjective expectation of future performance. Muir and Moray (1996) showed that as automation errors manifest themselves trust declines and monitoring behavior increases. Lee and Moray (1992) showed that subjective trust is a significant determiner of whether an operator will use an automatic controller or, given the choice, or will opt for manual control. They modeled subjective trust as a function of both overall automation performance, the seriousness of faults, and the recency of faults. They also discuss the mounting evidence that a system is less trusted if there are no clear indications about what it is doing or about to do. Aircraft pilots, for example, frequently complain that they cannot tell what the automation is thinking or will do next (Woods and Roth, 1988).

Should we worry that human supervisors of automation may become complacent? This raises the further question of what is the optimum level of sampling the displays and/or adjusting the control settings. Given the relative costs of attending to the automation (less time available to attend to other things) and of not attending, plus some assumptions about the statistics of how soon after a sample the automation is likely to become abnormal, one can specify an optimal sampling rate (Sheridan, 1970). If the operator samples at the optimal rate, that of course does NOT mean that critical signals will never be missed - they still occasionally will. Moray (1999) argues that if the optimal rate is not specified one can never assert that there is complacency (assuming complacency means sampling at less than the optimal rate). A recent qualitative model by Moray, Inagaki and Itoh (1999) suggests that in the absence of faults or disagreements with the decisions of the automation, subjective trust asymptotes to a level just below the objective reliability, which does not suggest complacency.
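
As a minimal sketch of this kind of reasoning (not Sheridan's 1970 model itself), assume faults arrive at a known average rate, each glance at the display carries a fixed cost, and an undetected fault costs something for every unit of time it goes unnoticed; the sampling interval that minimizes expected cost per unit time can then be found numerically. All numbers below are illustrative assumptions.

    import math

    def expected_cost_per_unit_time(interval, fault_rate=0.01, cost_per_sample=1.0,
                                    cost_per_unit_undetected_time=50.0):
        """Expected cost per unit time of sampling the display every `interval` time units.

        A fault occurring at a random moment within an interval waits, on average,
        interval/2 before the next sample detects it.
        """
        sampling_cost = cost_per_sample / interval
        undetected_cost = fault_rate * (interval / 2.0) * cost_per_unit_undetected_time
        return sampling_cost + undetected_cost

    # Crude numerical search over candidate intervals.
    candidates = [i / 10.0 for i in range(1, 500)]
    best = min(candidates, key=expected_cost_per_unit_time)
    print(f"approximately optimal sampling interval: {best:.1f} time units")

    # Analytic optimum for this simple cost model: sqrt(2 * c_sample / (rate * c_undetected)).
    print(f"analytic optimum: {math.sqrt(2 * 1.0 / (0.01 * 50.0)):.1f} time units")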

A concern with automated warning systems is that a very small percentage of warnings truly indicate the condition to be avoided. This occurs because the designer has set the sensitivity threshold such that false alarms occur much more often than misses (the misses carrying a much more serious consequence), which is rational based on the objective tradeoff between the risks associated with each.

Signal detection theory, the analytic technique that design engineers developed during World War II to decide how to make the optimal trade-off between false alarms and misses, has by now been widely applied to measuring how humans should, or actually do, make the trade-off (Swets and Pickett, 1982; Parasuraman et al., 1998). It requires knowledge of the probability densities for true positives (hits) and false positives (false alarms) as functions of input signals or symptoms, or the equivalent relative operating characteristic (ROC) curve - the cross-plot of probability of hit vs. probability of false alarm. It has been shown that the human operator does not respond mechanically and indifferently to these events. Indeed, the fact that the warning system may "cry wolf" so often may lead the operator to lose confidence in the automated warning system and come to respond slowly or even ignore it (Getty et al., 1995).
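
As a minimal sketch of the textbook signal detection setup (assuming unit-variance Gaussian distributions of the observation under "noise" and "signal plus noise," with illustrative numbers), the hit and false-alarm probabilities for a given decision criterion can be computed as follows; sweeping the criterion traces out the ROC curve.

    from math import erf, sqrt

    def normal_cdf(x, mean=0.0, std=1.0):
        return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))

    def hit_and_false_alarm(criterion, d_prime=2.0):
        """P(hit) and P(false alarm) for an observer who calls "signal" above the criterion."""
        p_false_alarm = 1.0 - normal_cdf(criterion, mean=0.0)   # noise-only distribution
        p_hit = 1.0 - normal_cdf(criterion, mean=d_prime)       # signal-plus-noise distribution
        return p_hit, p_false_alarm

    # A lenient criterion gives more hits but also more false alarms ("crying wolf").
    for criterion in (0.5, 1.0, 1.5):
        hit, fa = hit_and_false_alarm(criterion)
        print(f"criterion {criterion:.1f}: P(hit) = {hit:.2f}, P(false alarm) = {fa:.2f}")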

Classical expected-value decision theory, from which signal detection theory is derived, can also be used to make optimal decisions as to whether one or another form of automatic fault detection system is better, or whether the human is better (Sheridan and Parasuraman, 1999).
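
As a hedged illustration of such an expected-value comparison (the probabilities, payoffs, and the two detector characterizations below are invented for the example, not drawn from the cited work), one can weigh each detector's hit and false-alarm rates against the payoffs of the four possible outcomes:

    def expected_value(p_fault, p_hit, p_false_alarm,
                       v_hit, v_miss, v_false_alarm, v_correct_reject):
        """Expected value per opportunity over the four detection outcomes."""
        p_no_fault = 1.0 - p_fault
        return (p_fault * (p_hit * v_hit + (1.0 - p_hit) * v_miss)
                + p_no_fault * (p_false_alarm * v_false_alarm
                                + (1.0 - p_false_alarm) * v_correct_reject))

    payoffs = dict(v_hit=100.0, v_miss=-100000.0, v_false_alarm=-50.0, v_correct_reject=0.0)

    # Illustrative characterizations: a sensitive automatic detector that cries wolf
    # more often, versus a human who misses more faults but raises fewer false alarms.
    automatic = expected_value(p_fault=0.001, p_hit=0.99, p_false_alarm=0.10, **payoffs)
    human = expected_value(p_fault=0.001, p_hit=0.90, p_false_alarm=0.02, **payoffs)
    print(f"automatic: {automatic:.2f} per opportunity, human: {human:.2f} per opportunity")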

3.2. Operating Crew Reliance, Trust and Complacency with PTC

With regard to "risk homeostasis" there is some question as to whether a LE or C would ever be motivated to "take advantage" of the safety margin in a PTC system. This is because of an ever-present electronic monitoring of their acts. The event recorder on locomotives should be an interacting subsystem of PTC. Event recording should be of failures in PTC and other automation as well as errors in human performance. The overall PTC system will serve as a kind of event recorder, just as does the present centralized train control (CTC) system. Thus any infraction of the operating rules by the LE will meet with the normal disciplinary procedures and penalties-all the more so with the teeth in the rules of FRA certification, and decertification.

At present many computer workstations in ordinary business offices monitor and record the nature of an employee's work tasks and the speed, accuracy, and rules-compliance of employee performance. The ability of PTC, similarly, to monitor operator compliance with the rules electronically is comprehensive. The on-locomotive computers are all the more effective in this monitoring because they interface with other machine systems, which usually have electronic and often computer characteristics. Railroads have traditionally conducted, and are required by FRA regulations to conduct, in-field efficiency tests for operating employees. PTC has the capability of continuously testing operating personnel.

It is generally true that in automated warning systems only a very small percentage of warnings truly indicate the condition to be avoided; most are false alarms. Nevertheless, in railroading danger signals are ordinarily observed. We distinguish between false alarms that are not safety critical and those that constitute railroading's "danger (stop)" signals. And we recognize the great operating frequency of such rail danger signals. A non-safety in-cab warning such as "hot engine" or "dynamic brake overload" might go immediately unheeded, but not so a danger signal. First, the danger signal (such as a red stop-and-proceed signal) is common in railroading. Repeating these signals on a display in the cab does not necessarily make them any different in their operating effect on personnel. Second, railroaders do not lose confidence in a danger signal: it might be for real; it might be an efficiency test; or it might be a false-alarm "wolf cry." But all tend to be heeded, regardless.

We would have to hypothesize PTC-generated wolf cries of danger signals frequent enough to overcome railroading's particular safety culture, which treats even possible wolf cries as danger signals. For example, when two torpedoes unexpectedly explode on the rail head and, from experiential knowledge, the LE immediately reduces to and observes restricted speed, it does not matter whether a MOW flagman forgot to pick them up at the end of the workday or left them for a good, unanticipated reason. This is not an argument against a need for PTC. The LE or C could be incapacitated or distracted when first confronted with a danger signal.

A warning device for a danger signal, such as an in-cab alarm, that reportedly overreacts with false alarms might not be heeded as much as one giving no false signals. But railroad rules ordinarily call for taking such failed components out of service and, consequently, for operating under more restrictive rules than before.

3.3. Distraction Effect in the General Literature

The long-accepted Yerkes-Dodson "law" in experimental psychology refers to the notion that with very low attentional demand humans get bored and drowsy and are not vigilant, while with very high attentional demand people cannot take in all appropriate information. Performance is best in a broad middle range of attentional demand.

During World War II there was interest in the low end of this curve because watchstanders on ships and monitors of sonar in submarines and of radar in aircraft ground control stations found themselves scanning electronic displays over long periods for signals which seldom occurred. The associated research was identified with the term "vigilance," and the net result was a variety of studies which showed that after about 30 minutes people's monitoring performance declines significantly (Mackworth and Taylor, 1963). Associated studies of operators performing visual inspection tasks on assembly lines produced a similar result. It has been asserted, perhaps apocryphally, that in one test of a cola bottle-washing inspection operation, a higher percentage of clean bottles resulted when cockroaches were randomly added to bottles at the start of the line.

Interest in the high-demand end of the curve peaked in the mid-1970s, when many new attentional demands were being placed on fighter aircraft pilots and military laboratories started research on "mental workload." At that same time, in conjunction with the certification of the MD-80, pressures from aircraft manufacturers and airlines to automate, and thereby allegedly justify reducing the crew from three to two, set off a dispute with the pilots. The regulatory agency, the Federal Aviation Administration, turned to the human factors community to observe commercial pilots and try to define mental workload. After a flurry of research, four methods evolved to define and measure mental workload: physiological indices, secondary task measures, subjective scaling, and task analysis (Moray, 1988). It should be noted that physical workload is nowadays relatively easily measured by the percent of CO2 increase between inhaled and exhaled respiratory gas, but physical workload has no correlation with what is called mental workload.

The various physiological indices tested over the years include: heart rate variability, particularly in the power spectrum at 0.1 Hz; galvanic skin response (as in a lie-detector test); pupil diameter; the 300-msec characteristics of the transient evoked response potential; and formant (spectral) changes in the voice (frequencies rise under stress). Unfortunately, none of these measures has proven satisfactory for most requirements, because the measures have to be calibrated to the individual being measured and because they usually require relatively long time samples - often longer than the period over which one seeks to measure changes in mental workload.

The second measure of mental workload is the secondary task. It assumes that a human monitor has a fixed workload capacity, and that by giving the test subject some easily measurable additional task (such as performing mental arithmetic or simple tasks of motor skill), along with specific instructions to perform the secondary task only when time is NOT required to perform the primary task, "spare capacity" can be measured. The assumption is that the worse the performance on the secondary task, the greater the primary-task mental workload. This technique has been used successfully in laboratory tests, but is usually impractical in real-world tasks such as landing an aircraft, since operators refuse to cooperate because of the possible compromise of safety.

A third method, subjective scaling, is not the design engineer's ideal, simply because it is subjective rather than objective. Yet it is the method most often used, and indeed is the method most frequently used to validate the other methods. NASA has developed a subjective scale called TLX and the U.S. Air Force a scale called SWAT (Williges and Wierwille, 1979). Multi-dimensional subjective scales have been suggested, including, for example, fraction of time busy (spare capacity), emotional stress, and problem complexity, the idea being that these are orthogonal attributes of a situation (Sheridan and Simpson, 1979).

The fourth method, task analysis, simply considers the number of items to be attended to, the number of actions to be performed, etc., without regard to the operator's actual performance or subjective sense of workload. This method has been criticized as not really being about mental workload because it neglects level of training or experience. A well trained or experienced operator, after all, may have an easy time performing a task, i.e., with insignificant mental workload, where a novice might be heavily loaded. However, such task analysis is amenable to objectivity, for example through use of the Shannon (1949) information measure H = average of log[1/p(x)], i.e., the sum over x of p(x) log[1/p(x)], where p(x) is the probability of each different stimulus element x which must be attended to (or of each different response element which must be executed). This provides an index of "difficulty" or entropy (the degree of uncertainty to be resolved). The problem lies in the somewhat arbitrary classification of stimulus and response elements.
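
A brief sketch of this measure, computed in bits (base-2 logarithm, our choice) for an invented distribution of stimulus elements:

    from math import log2

    def shannon_entropy(probabilities):
        """H = sum of p * log2(1/p), in bits, over the nonzero probabilities."""
        return sum(p * log2(1.0 / p) for p in probabilities if p > 0.0)

    # Four stimulus elements, one far more frequent than the others.
    stimulus_probabilities = [0.7, 0.1, 0.1, 0.1]
    print(f"H = {shannon_entropy(stimulus_probabilities):.2f} bits")

    # For comparison, equally likely elements maximize the uncertainty.
    print(f"H (uniform) = {shannon_entropy([0.25] * 4):.2f} bits")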

For simple tasks, the greater the mental workload and/or information difficulty (entropy) H, the greater the operator's response time, in almost direct proportion to H (Hick, 1952; Fitts, 1954). For complex tasks there may be great variability in response time. It is well established that human response times follow a log-normal probability density, meaning that no response takes zero time, and the 95th percentile may be one or two orders of magnitude greater than the median. Experiments with experienced nuclear plant operators responding to simulated emergencies showed an almost perfect fit to a log-normal function (Sheridan, 1992). The long responses often result from confusion about what problem is presented to the person and what is the expected criterion for a satisfactory response.
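
These two points can be illustrated with a short sketch; the Hick-law coefficients and the log-normal spread below are illustrative assumptions, not fitted values from the cited studies.

    from math import exp, log

    def hick_response_time(H_bits, a=0.2, b=0.15):
        """Simple-choice response time (seconds) rising in direct proportion to H."""
        return a + b * H_bits

    for H in (1.0, 2.0, 3.0):
        print(f"H = {H:.0f} bits -> response time ~ {hick_response_time(H):.2f} s")

    # Log-normal response times: the median is exp(mu); with a wide spread the
    # 95th percentile sits roughly an order of magnitude above the median.
    mu, sigma = log(1.0), 1.4      # median response of 1.0 s, wide dispersion
    z_95 = 1.645                   # 95th percentile of the standard normal
    print(f"median = {exp(mu):.1f} s, 95th percentile = {exp(mu + z_95 * sigma):.1f} s")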

There have been numerous studies to determine whether operators are better monitors or failure detectors when they are controlling a task manually or when they are monitoring automation. Mostly these studies have shown that performance capability (in terms of failure detection and response recovery) declines when operators are monitors of automation and the automation fails (Wiener and Curry, 1980; Desmond et al., 1998; Wickens, 1992). However, at the extreme where the operator is so heavily loaded performing manual operations that there is no attentional capacity remaining for failure detection, automation may provide relief and improved capability to detect failures.

One problem with automation is that there may be very little to do for long periods of monitoring, but suddenly and without warning the automation may fail and/or unexpected circumstances may arise, and the operator is expected to get back into the control loop instantly to set matters straight. Such workload transients are deemed to be more troublesome in many cases than sustained periods of high workload, for the operator is unlikely to be able to "wake up," figure out what is happening, and quickly make the correct decision.

A currently popular term in aviation is "situation awareness." The ideal is to have a maximum level of situation awareness. A means to test situation awareness in a simulator experiment is to stop the simulation abruptly and unexpectedly and ask the subject to recall certain stimuli or response events (Endsley, 1995; Endsley and Kiris, 1995). Improvements in graphic displays and decision aids have been suggested to enhance situation awareness. Automation which is opaque to the user may well impede situation awareness. However, it has been pointed out that the more mental effort an operator expends on situation awareness, the less spare mental capacity (if we can accept that notion) remains for decision and response execution (Sheridan, 1999).

3.4. Maintaining Performance in a Broad Middle-Range of Attentional Demand

Given the Yerkes-Dodson "law," that with very low attentional demand humans get bored and drowsy and are not vigilant, and with very high attentional demand people cannot take in all appropriate information, safety is clearly best in a broad middle range of attentional demand. But how do we assure this in PTC operations for the C and LE? The most effective way to assure operation in the mid-range is skills maintenance through retention of most pre-PTC motor and cognitive work tasks, despite the "designed-in reliance" effect of PTC. A primarily manual operation of trains by the LE and C, with a fully automated safety-compliance backup, is therefore necessary. This primarily manual operation should be at level 2 of the automation scale (the PTC suggests alternative ways to do the task) or, perhaps, level 3 (the PTC selects one way to do the task). That is, the system provides an audible warning in advance of a civil speed restriction (CSR), a signal (in-cab or otherwise) change to a more restrictive indication, or some other restriction of train movement. And the system also meets the requirement of PTC in that the restrictions will be enforced by a sub-system on board the locomotive at level 6 (the PTC executes automatically, then necessarily informs the human and the event recorder). In all, automation level 2 or 3 is what we strive for as normal PTC operation, but level 6 must always be operable in the background as the safeguard, as sketched below.
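
A minimal sketch of this advisory-plus-enforcement split, assuming a crude constant-deceleration braking model and invented thresholds (an illustration of the logic, not an actual PTC braking-enforcement specification):

    def braking_distance(speed_mps, deceleration_mps2=0.5):
        """Rough stopping distance from the given speed: v^2 / (2a)."""
        return speed_mps ** 2 / (2.0 * deceleration_mps2)

    def ptc_supervise(speed_mps, target_speed_mps, distance_to_restriction_m,
                      warning_margin_m=500.0):
        """Return the actions the onboard subsystem takes at this instant."""
        actions = []
        # Distance needed to slow from current speed to the restriction's target speed.
        distance_needed = braking_distance(speed_mps) - braking_distance(target_speed_mps)
        if speed_mps > target_speed_mps and distance_to_restriction_m <= distance_needed + warning_margin_m:
            actions.append("audible_warning")            # level 2/3: advise, crew still controls
        if speed_mps > target_speed_mps and distance_to_restriction_m <= distance_needed:
            actions.append("enforce_penalty_brake")      # level 6: automatic enforcement
            actions.append("log_to_event_recorder")      # record type and time of enforcement
        return actions

    # 35 m/s approaching a 15 m/s civil speed restriction: first a warning, then enforcement.
    print(ptc_supervise(35.0, 15.0, distance_to_restriction_m=1300.0))  # ['audible_warning']
    print(ptc_supervise(35.0, 15.0, distance_to_restriction_m=900.0))   # warning + enforcement + log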


Part 2 of the PTC White Paper will be published in the February 2000 issue of the Locomotive Engineer Newsletter.

A complete copy of the 23-page report can be found on the BLE webpage, at http://www.ble.org/pr/news/ptcposition.pdf.


© 2000 Brotherhood of Locomotive Engineers