What’s Wrong with the Numbers?
A Questioning Look at Probabilistic Risk Assessment
Jack Crawford
Copyright is retained by the author (March 2001).
Copies of all or part of this document may be made for personal study and
research purposes, provided that the copies are not used for commercial
advantage and provided that credit is given to the source.
Introduction
1. Probabilistic
Risk Assessment (PRA), or Probabilistic Safety Assessment as they prefer to call
it in the nuclear power industry, has been developed over the last 30 years as a
discipline heavily influenced by the mathematical theory of probability. Its mathematical methods are endlessly
extended and refined in the literature.
But how confident can we be that the output numbers mean what they claim
to mean, i.e. probabilities of future events? I believe that the time has come for a
pause to think about that basic issue.
2. This
article explains what led me to initiate a study of the foundations of PRA,
defines key questions which need to be asked about its credibility, and arrives
at some provisional answers.
Why Should a Study be Needed?
3. The
factors which triggered the study are as follows:
a. The incredible magnitude of many of the probability numbers.
b. The sometimes over-optimistic assumption that an assessment encompasses all credible failures.
c. Observation of some gross discrepancies between predictions and outcomes.
d. Difficulty in finding examples of accidents caused by genuinely random component failures.
e. PRA seems to be too narrowly focused on measurable events, especially failure rates. It too easily ignores accidents which are not caused by failures.
4. I will give some examples to illustrate those points. During 17 years’ involvement in risk and
safety assessment in the weapon systems field, in the UK and in Australia, I
have been bombarded with numerical probabilities. Many of them have seemed incredible, or
at best to venture into the unknowable. Some of the powers of ten ascend into the high teens
and even the twenties. The record in my
experience was a probability of premature functioning of a mine fuzing system
predicted to be 1 in 10^44.
5. In another example, the design authority (DA) for a weapon system decided to
include in it an electro-mechanical device which had an excellent record in
another application. After pages of
calculations to assess the effects of stresses in its new application, they
predicted that its probability of mechanical failure would be 9.116 in
10^9 operating hours. The
operating cycle time of the device was only 40 seconds at a likely rate of fewer
than 10 cycles per battlefield day, so the predicted failure rate should have
seen us through many times more use than the system would ever get in
service. But in a system test,
which included four of the devices, we had two mechanical failures before they
had accumulated one hour of operation.
The failures happened in two different modes, neither of which had been
considered in the analysis. This
example illustrates three of the trigger factors mentioned
above:
a. The magnitude and precision of the number, by which the DA claimed to be able to predict so accurately the number of failures in a billion operating hours.
b. The gross discrepancy between the prediction and the outcome.
c. When something went wrong, it happened for reasons which had not been quantified in the analysis.
On the relatively few occasions when we get a chance to compare safety predictions
and outcomes, those are quite common features in my experience.
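The scale of the discrepancy in this example can be put into figures with a little arithmetic. The sketch below is illustrative only. It assumes that the predicted figure is treated as a constant failure rate, that failures are independent (a Poisson model), and that the four devices together accumulated no more than four device-hours in the test, which is generous to the prediction. The function and variable names are invented for the illustration.

```python
import math

PREDICTED_FAILURES = 9.116   # predicted failures ...
PREDICTED_HOURS = 1e9        # ... per this many operating hours
CYCLE_SECONDS = 40           # operating cycle time quoted in the text
CYCLES_PER_DAY = 10          # upper limit on daily use quoted in the text


def prob_at_least_two(expected):
    """P(N >= 2) for a Poisson count with the given mean.

    Written with expm1 to avoid cancellation when the mean is tiny.
    """
    return -math.expm1(-expected) - expected * math.exp(-expected)


rate_per_hour = PREDICTED_FAILURES / PREDICTED_HOURS
hours_per_day = CYCLE_SECONDS * CYCLES_PER_DAY / 3600
days_per_expected_failure = 1 / (rate_per_hour * hours_per_day)

# Assumption: four devices, each run for no more than one hour.
test_device_hours = 4 * 1.0
expected_in_test = rate_per_hour * test_device_hours
p_two_or_more = prob_at_least_two(expected_in_test)

print(f"Predicted rate: {rate_per_hour:.3e} failures per hour")
print(f"Use per battlefield day: {hours_per_day:.3f} hours")
print(f"Days of use per expected failure: {days_per_expected_failure:.3e}")
print(f"Expected failures in the test: {expected_in_test:.3e}")
print(f"P(two or more failures in the test): {p_two_or_more:.3e}")
```

Under those assumptions the chance of seeing two or more failures in the test comes out at less than 1 in 10^15. That suggests the observed failures were not an unlucky draw from the predicted distribution but belonged to failure modes the model never described.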
6. On
the other hand, I have found it difficult to come by examples of accidents
caused by what the textbooks and safety standards describe as “random”
failures. Three years ago a dozen
of us attended a meeting in the UK Ministry of Defence at which the contribution
of random hardware failures to accidents was questioned. Between us we could think of only one
example of an accident caused by a combination of genuinely random events. Six years ago the UK Health & Safety
Executive (HSE) published a booklet called “Out of Control” [Ref. 1] containing
34 examples of control system failures.
In the summary of causes at the end of the booklet not one system failure
is attributed to random hardware failure.
If that kind of failure were indeed a major cause of accidents, we would
surely expect it to turn up somewhere in 34 examples.
7. Readers
may remember the disastrous first flight of the European Space Agency’s Ariane 5
rocket in June 1996, when it broke up and exploded 40 seconds after launch. According to Aviation Week [Ref. 2] the
pre-launch estimate of the probability of a successful mission was 98.5%. The reality, as the report of the Board
of Inquiry [Ref. 3] showed, was that the design ensured that the rocket would
crash after 40 seconds. The real
probability of success was zero, so the estimated probability was optimistic by
a factor of infinity. To compare
that example with the trigger factors:
a. It illustrates a gross discrepancy between prediction and outcome.
b. There was nothing random about any of the causes.
c. The accident was not caused by component failures. The inquiry did not report that any component of the rocket system failed to behave as it was designed to behave throughout the short flight.
d. The real causes of the accident, which in this case came down to errors of management, were not considered in the analysis.
Initiation of the Study
8. After
observing those and other examples, it seemed reasonable to look into the
methodology of PRA. In the course
of a few quick checks, my pocket calculator failed to find anything wrong with
the mathematics of any of the assessments that were readily to hand, so the next
step had to be to investigate the basis on which the mathematical structures
were built.
9. For
several years I have been searching for a test of the theory that we can draw
probabilistic data on failure rates from past experience, and then synthesise a
selection of the data in order to predict the failure rate of a new system. The safety and reliability literature
does not help much because it generally goes no deeper than the mathematics that
are built on the theory.
10. My
search has involved talking to many people in the UK, including the Civil
Aviation Authority, the Health & Safety Executive, and several leading
engineering companies and academic and engineering institutions. The only people to come up with anything
that attempted to test the theory were AEA Technology plc. They kindly provided me with a study
[Ref. 4] which compared predicted and observed reliability figures for equipment
used in nuclear power plants. It
concluded that the correlation was reasonably good. That was useful as far as it went, but
the study seemed to me to have two shortcomings. One was that it looked at failure rates
at the reliability level, rather than at the safety level where (in the military
field at least) they are much harder to predict. The other was that it had been done as
an afterthought, so it was not the properly designed and controlled experiment I
had been looking for.
11. By
now I find myself being driven towards a conclusion that the scientific method
may never have been applied to this particular theory. I still hope to be shown that I am
wrong, but meanwhile the apparent lack of science in this field threatens to
become the most disappointing finding of the study.
The Main Questions
12. Having observed that PRA might be questionable, I needed to decide what the
questions should be. It seemed to
me that there are four key questions, one practical, one theoretical, one
philosophical and one contingency question which depends on the answers to the
other three. This section lists the
questions and provides some answers.
Question 1: To what extent does PRA encompass the main causes of accidents?
13. This
is the key practical question.
First, it is inevitable that any potential causes, modes and effects of
failure which have not been foreseen will escape the attention of PRA. One of the effects of the
ever-increasing complexity of systems is that we must expect that there will
usually be some failure modes which we have failed to anticipate. We can and should do more thinking to
reduce the number of missed tricks.
But, when we have done our best, we still have no way of knowing whether
we have thought of everything, as the example of the electro-mechanical device
illustrated.
14. Second,
PRA tends to lead us into a mindset which assumes that systems fail only if
their critical components fail. It
does not lead us to think enough about that class of accidents in which
everything functions as designed.
Here are some examples:
a. Turner [Ref. 5] describes a collision on an unmanned railway level crossing. The drivers of the train and the road vehicle did nothing wrong, and there was no equipment failure.
b. Kletz, quoted by Leveson [Ref. 6], describes an accidental release from a computer-controlled chemical reactor. No human operator was involved. The automatic control system, in triggering the release, functioned as designed.
c. From my own experience, an anti-tank mine design was proposed which in certain conditions would have killed soldiers laying the mines according to the correct drill.
15. A
third gap in the coverage of PRA is caused by invalid, or invalidated,
assumptions. The assumptions made
in a safety assessment are not always made explicit and may later be
forgotten. When an important
assumption is invalidated by changed circumstances, and nobody any longer knows
that it was made or that anything depended on it, an accident will be waiting to
happen as soon as certain conditions prevail. One of the findings of the subsequent
inquiry is likely to be that in those conditions the probability of the accident
was 1.
16. A
major source of uncertainty is the way people respond to their perceptions of
risk. For example, Adams [Ref. 7]
produces evidence that the compulsory use of seat belts has not improved road
safety. He shows how the reduced
risk to people in vehicles has been balanced, through small changes in drivers’
behaviour, by increased risk to those who are not in vehicles. He also provides an example of such
“risk compensation” being enshrined in the law: in Germany coaches fitted with
seat belts are allowed to travel faster than those without. In the civil aviation field there has
been concern about the frequency of near misses between aircraft queuing to land
at busy airports. Yet the UK
National Air Traffic Services, observing that aircraft have become better at
station-keeping, have decided to reduce the vertical interval between aircraft
“stacked” while awaiting clearance to land. Even NATO is not immune. The announcement of a forthcoming
workshop on insensitive munitions [Ref. 8] specified objectives which included
both “reduction in collateral damage in the event of an accidental initiation”
and “reduction in safety zone for storage and transportation”. The organisers seemed unaware that the
latter benefit can be gained only at the expense of the former. In these ways potentially effective
measures to improve safety, for which quantified claims are commonly made, may
in practice be consumed in return for some other benefit such as improved
performance.
17. In
many fields the fact that an accident had not happened for a long time would be
seen as indicating a low, and probably diminishing, risk. As the time since the last accident
increases, that view will be reinforced by conventional statistical methods
indicating that the probability of an accident is reducing because the mean time
between failures is increasing. The
reality may be quite different.
Many of us will have come across examples of accident-free periods
leading to complacency and greatly increased risk. In the civil engineering field, Petroski
[Ref. 9] identifies the “design climate” as a critical factor in catastrophic
failures of bridges. His argument,
based on examples, is that a period of successful use of a novel design can lead
a designer to become over-confident and consequently to under-design a new
structure in the interest of economy or beauty. The bridge is then liable to fail if it
is subjected to extreme conditions.
In situations such as these, where risks change inversely as people’s
perceptions of risk change, our attempts to pin down numerical probabilities of
accidents are likely to be about as successful as trying to capture a
will-o’-the-wisp.
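The statistical effect described in this paragraph can be illustrated with the standard zero-failure bound. As a minimal sketch, assuming a constant failure rate and independent (Poisson) events, which is exactly the assumption in question here, the one-sided upper confidence bound on the rate after an accident-free period T is -ln(1 - c)/T for confidence level c. The function name and the sample periods are hypothetical.

```python
import math


def zero_failure_upper_bound(exposure_years, confidence=0.95):
    """Upper confidence bound on a constant failure rate, given no failures.

    Solves exp(-rate * exposure) = 1 - confidence for rate.
    """
    if exposure_years <= 0:
        raise ValueError("exposure must be positive")
    if not 0 < confidence < 1:
        raise ValueError("confidence must lie strictly between 0 and 1")
    return -math.log(1 - confidence) / exposure_years


# Hypothetical accident-free periods, in years.
for years in (1, 5, 10, 20, 40):
    bound = zero_failure_upper_bound(years)
    print(f"{years:>2} accident-free years: rate bound {bound:.3f} per year")
```

The bound falls in proportion to 1/T whatever is happening to the real risk. The calculation has no input through which complacency or a changed design climate could enter.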
18. Of
all the sources of risk which PRA overlooks, management must be the most
prolific. Many apparently technical
failures have their roots in management weaknesses. Leveson [Ref. 10] points out that
"unmeasurable factors (such as …. management errors) are ignored even though
they may have greater influence on safety than those that are measurable". As she was writing those words, the
European Space Agency was committing the management errors which led to the
Ariane Flight 501 debacle, while using measurable data to predict a high
probability of success.
19. An
important aspect of risk management is the quality of the culture in an
organisation. For example, the
Piper Alpha inquiry found that “Senior management …. adopted a superficial
response when issues of safety were raised”, and the judge in the Herald of Free
Enterprise case criticised the “disease of sloppiness” which had spread down
from the top of the Townsend Thoresen company. In each case the company’s safety
culture had contributed much to the disaster.
20. All
of those sources of risk are “soft” or unmeasurable factors. They affect the frequency and scale of
accidents, but PRA does not encompass them. It focuses, rather, on the measurable
causes, modes and effects of failure.
With so limited a view of the scene PRA must be expected to deliver
optimistic results, contrary to what we normally aim to do in risk
assessments. In terms of the “As
Low As Reasonably Practicable (ALARP)” principle, the consequence is that PRA
can neither demonstrate that a risk is as low as reasonably practicable, nor
that it is tolerable.
Question 2: Can statistical inference take us forward from the past to the future?
21. This
question addresses the theoretical basis of PRA, for which the apparent absence
of any proper justification or test was noted above. The clearest argument I have found is
one developed by Deming [Ref. 11] in which he explores the limits of statistical
inference. He argues that the
historical results which provide input data for predictions depend on the sets
of conditions in which they were produced, and that those exact conditions are
unrepeatable. Furthermore, as
Feynman [Ref. 12] reminds us, we cannot assume that all of the conditions which
contributed to a result were recorded or even noticed. In other words the historical record is
not a reliable guide to the future.
Worse still, it can be hard to tell whether it is even a reliable guide
to the past.
22. In
statistical terms, Deming concludes that there is no mathematical method by
which to extrapolate past results to future conditions, and consequently no
objective way of assigning a numerical probability that a prediction will be
right or wrong. Prediction
therefore means applying judgement and knowledge of the subject to the available
data, rather than just manipulating numbers.
23. A
further problem is that most statistical methods assume that component failures
will be independent. In reality,
dependent failures contribute to many accidents. The “fudge factors” sometimes introduced
to allow for dependencies, such as cut-offs and beta factors, do at least move
the numbers in the right direction.
On the other hand they are arbitrary and are no substitute for an
understanding of the dependencies within a system and their potential
consequences.
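To show what a beta factor does to the numbers, here is a minimal sketch of the simple beta-factor model as it is commonly described for a redundant pair of identical channels: a fraction beta of each channel's failure probability is attributed to a common cause that defeats both channels together. The per-channel probability and the beta values are invented for illustration and come from no database.

```python
def pair_failure_independent(q):
    """Probability that both channels fail, assuming independence."""
    return q * q


def pair_failure_beta(q, beta):
    """Probability that both channels fail under a simple beta-factor model.

    A fraction beta of q is treated as common cause (fails both channels);
    the remainder (1 - beta) * q is treated as independent per channel.
    This is the usual first-order approximation.
    """
    if not 0 <= beta <= 1:
        raise ValueError("beta must lie between 0 and 1")
    independent_part = ((1 - beta) * q) ** 2
    common_cause_part = beta * q
    return independent_part + common_cause_part


q = 1e-4  # hypothetical per-channel probability of failure on demand
print(f"independent: {pair_failure_independent(q):.2e}")
for beta in (0.01, 0.05, 0.10):  # hypothetical beta factors
    print(f"beta = {beta:.2f}: {pair_failure_beta(q, beta):.2e}")
```

With these figures the common-cause term dominates, so the answer is set almost entirely by the choice of beta. The correction moves the number in the right direction, but its size rests on a value chosen by judgement rather than derived from the system.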
24. As
an aid to predicting the behaviour of systems, Deming [Ref. 13] advocates the
concept of stability developed by Shewhart. "Stability" in this context means that
the functions of the system display a stable range of variation. He argues that stability is a
prerequisite for predictable behaviour, and that in a man-made system it is not
a natural state - it has to be achieved and maintained. Systems are constantly threatened by
destabilising influences, so their stability must be monitored and, whenever
necessary, restored. Hence a system
will remain stable and predictable only by virtue of people's vigilance,
knowledge and effort. It is not a
question of probability.
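One common way of checking for stability in Shewhart's sense is an individuals (XmR) control chart. The sketch below is a minimal illustration, not a reconstruction of Deming's or Shewhart's own procedure. Limits are set at the mean plus or minus 2.66 times the average moving range, the conventional constant for this type of chart, and later readings outside them are flagged as signs of instability. The data are invented.

```python
from statistics import mean


def xmr_limits(values):
    """Return (lower, centre, upper) natural process limits for an XmR chart."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    centre = mean(values)
    spread = 2.66 * mean(moving_ranges)  # conventional XmR constant
    return centre - spread, centre, centre + spread


def out_of_limits(values, lower, upper):
    """Indices of observations falling outside the given limits."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Invented data: a baseline period, then later readings to be checked.
baseline = [10.2, 9.8, 10.1, 10.0, 9.9, 10.3, 9.7, 10.0, 10.1, 9.9]
later = [10.0, 10.2, 9.8, 12.5, 10.1]

lower, centre, upper = xmr_limits(baseline)
signals = out_of_limits(later, lower, upper)
print(f"limits: {lower:.2f} to {upper:.2f} (centre {centre:.2f})")
print(f"later readings outside the limits: {signals}")
```

A chart of this kind says nothing about probability. It reports only whether the system is still behaving as it did when the limits were set, which is the condition that has to hold before past data can support a prediction.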
25. Without
stability there is no basis for prediction, but I have yet to find a safety or
reliability database which assures us that its estimates of component failure
rates were derived from stable systems by stable methods of measurement. Some may have been so derived but even
then, when we take those types of components and build them into a new system,
we leave the stability behind because we have changed the operating
environment. A new state of
stability will have to be achieved and maintained, and new data generated for
monitoring and predicting behaviour.
26. Collectively,
those arguments seem to me to falsify the theory that we can rely on historical
frequency data to take us across the boundary between the past and the
future. To that conclusion many
would reply that our contracts and our regulators nevertheless insist that we
deliver predictions in the form of numerical probabilities. What then should we do? Many years ago Tukey [Ref. 14]
offered some relevant advice: “It is far easier to put out a figure than to
accompany it with a wise and reasoned account of its liability to systematic and
fluctuating errors. Yet if the
figure is to serve as the basis of an important decision, the accompanying
account may be more important than the figure itself”. That seems to indicate a reasonable way
to go.
Question 3: How much force does the mathematical theory of probability add to a probability statement?
27. This
is the key philosophical question.
In looking for an answer, I have used ideas put forward by Toulmin [Ref.
15]. When we make a prediction,
especially a safety prediction, we want as much precision as we can manage. Toulmin distinguishes between precision
in the sense of definiteness and precision in the sense of exactness. So for example if we judge that an event
is extremely unlikely to happen, we are relying on definiteness. But if we estimate a probability that
the event will happen twice in a thousand rocket launches, we are relying on
exactness. This leads to further
questions such as how much do we gain when we are able to add exactness to
definiteness? And what should we do
if we find that we have one but not the other? Those sorts of question may seem
ethereal to some people, but the study is telling me that they actually matter
when it comes to taking decisions such as whether a system is safe enough to be
accepted for service.
28. PRA
uses mathematical probability in an attempt to deliver precise predictions. But Toulmin, from a logician's
standpoint, argues that "Little is altered by the introduction of mathematics
into the discussion of the probability of future events" and that "The
development of the mathematical theory of probability accordingly leaves the
force of our probability-statements unchanged; its value is that it
greatly refines the standards to be appealed to".
29. If
we accept the arguments of Deming and Shewhart, the refinement is spurious in
the context of PRA. (Deming [Ref.
11] points to areas in which numerical probability does provide a valid guide to
action, but they do not relate to PRA.)
The spurious refinement of the numbers is starkly illustrated by the two
examples given earlier in each of which, when the definiteness of the prediction
proved to be a delusion, its exactness was exposed as
ridiculous.
30. A relevant, if irreverent, statement of philosophy comes from Feynman [Ref. 16],
who preferred engineering judgement to what he regarded as meaningless numerical
probabilities: "If a guy tells me the probability of failure is 1 in
10^5, I know he's full of crap".
Question 4: If the numbers generated by PRA do not represent probabilities of future events, are they still useful? If so, for what?
31. Question
4 is the contingency question and it clearly needs to be answered. My view is that the numbers are still
useful. For one thing, factors that
are measurable do contribute to risk and PRA has been successful in helping us
to see how to reduce risks from those causes (it may even have contributed to
the scarcity of accidents from “random” causes). For another, its inherent optimism tells
us, when it indicates a risk which is too high, that improvements are definitely
needed. Thirdly I have found, when
working as a safety regulator in the weapon systems field, that I can learn much
from the numbers by digging for answers to the questions they
raise.
Conclusions
32. The
study remains incomplete, partly because of the difficulty of finding a
justification for PRA. If anyone
can find or construct one, it would be very welcome. Meanwhile the provisional conclusions to
be drawn seem to me to be as follows:
a. The numbers delivered by PRA do not represent the probabilities of future events because:
(1) The PRA methodology, by focusing on measurable factors, ignores some of the most significant sources of risk.
(2) The theory that it is justifiable to extrapolate historical data, in order to assign a numerical probability to a future event, is false.
b. If PRA is used on its own to support an ALARP or any other safety case, it is likely to be misleading. To be complete and credible, the case should provide:
(1) Qualitative data and argument on the issues not covered by PRA.
(2) A reasoned account of the liability to error of each quantified prediction.
c. Quantitative probability statements have no more force than qualitative probability statements. At best they may be more refined, but only if the numbers can be shown to be credible.
d. Our quest for reliable predictions would be better served by paying more attention to the stability of the systems from which we draw data, and to the stability of those whose behaviour we need to predict.
33. So
should PRA be scrapped? My answer
is “no”, for the reasons given in the answer to Question 4. It remains an invaluable tool for
focusing our minds on issues related to measurable factors. We do not need to believe that the
numbers are probabilities in order to use them for purposes such as comparison
of design options, sensitivity checks and the improvement of designs. It is only the “P” of PRA that ought to
be abandoned if nobody can justify it.
34. By
now it is clear that there is a Question 5 to be answered: "What would be a
better way and what place should (P)RA have in it?" The investigation continues.
References:
[1] Health & Safety Executive. Out of Control. HSE Books, Sudbury, Suffolk, UK, 1995.
[2] Aviation Week & Space Technology, 29 July 1996. (Page 33.)
[3] Ariane 5 Flight 501 Failure. Report by the Inquiry Board. Paris, 19 July 1996.
[4] E R Snaith. The Correlation between the Predicted and the Observed Reliabilities of Components, Equipment and Systems. UK Atomic Energy Authority National Centre of Systems Reliability, Culcheth, UK, 1981.
[5] Barry A Turner. Man-Made Disasters. Wykeham Publications, London, 1978.
[6] Nancy G Leveson. Safeware. Addison-Wesley Publishing Company, Reading, Massachusetts, 1995. (Page 165.)
[7] John Adams. Risk. UCL Press, London, 1995. (Chapter 7.)
[8] NIMIC Newsletter 1st Quarter 2000. NATO Insensitive Munitions Information Center, Brussels.
[9] Henry Petroski. Design Paradigms - Case Histories of Error and Judgment in Engineering. Cambridge University Press, 1994.
[10] Nancy G Leveson. Op. cit. (Page 59.)
[11] W Edwards Deming. On Probability as a Basis for Action. The American Statistician, Vol. 29 No. 4, 1975. (Pages 146 to 152.)
[12] Richard P Feynman. The Meaning of it All. Addison-Wesley Longman Inc., 1998.
[13] W Edwards Deming. The New Economics for Industry, Government, Education. Massachusetts Institute of Technology, 1993.
[14] John W Tukey in The American Statistician, Vol. 3, 1949. (Page 9.)
[15] S E Toulmin. The Uses of Argument. Paperback edition, Cambridge University Press, 1993. (Chapter 2.)
[16] Richard P Feynman. What do You Care What Other People Think? Paperback edition, HarperCollins, London, 1993. (Page 216.)
Acknowledgements
The author acknowledges with thanks the constructive comments provided by Professors
David Kerridge and Henry Neave and by Felix Redmill, Editor of “Safety Systems”
in which an earlier version of this paper was published.
20 March 2001