
PII: S0003-6870(96)00045-2
Applied Ergonomics Vol 28, No. 1, pp. 17-25, 1997
Copyright © 1996 Elsevier Science Ltd
Printed in Great Britain. All rights reserved
0003-6870/97 $10.00 + 0.00

The validation of three Human Reliability Quantification techniques - THERP, HEART and JHEDI: Part II - Results of validation exercise
Barry Kirwan, Richard Kennedy, Sally Taylor-Adams and Barry
Lambert
Industrial Ergonomics Group, School of Manufacturing & Mechanical Engineering, University of
Birmingham, Edgbaston B15 2TT, UK

(Received 15 June 1995)

This is the second of three papers dealing with the validation of three Human Reliability
Assessment (HRA) techniques. The first paper introduced the need for validation, the
techniques themselves and pertinent validation issues. This second paper details the results of
the validation study carried out on the Human Reliability Quantification techniques THERP,
HEART and JHEDI. The validation study used 30 real Human Error Probabilities (HEPs) and
30 active Human Reliability Assessment (HRA) assessors, 10 per technique. The results were
that 23 of the 30 assessors showed a significant correlation between their estimates and the real
HEPs, supporting the predictive accuracy of the techniques. Overall precision showed 72% (60-
87%) of all HEPs to be within a factor of 10 of the true HEPs, with 38% of all estimates being
within a factor of three of the true values. Techniques also tended to be pessimistic rather than
optimistic, when they were imprecise. These results lend support to the empirical validity of
these three approaches. Copyright © 1996 Elsevier Science Ltd.

Keywords: human reliability; human error data; validation; accuracy and precision

Background

The need for validation of Human Reliability Assessment (HRA) techniques was identified in the first paper (Kirwan, 1996) of this series of three, and three techniques were specified as most urgently requiring empirical validation, i.e. testing their predictive accuracy against known human error probability data. These techniques were the Technique for Human Error Rate Prediction (THERP: Swain and Guttmann, 1983); the Human Error Assessment and Reduction Technique (HEART: Williams, 1986; 1988; 1992); and the Justification of Human Error Data Information (JHEDI: Kirwan, 1990; 1994) technique (these techniques are detailed in the first paper). These are the dominant techniques in use in the UK today, particularly in the Nuclear Power and Reprocessing (NP&R) industries, and hence were targeted for validation.

The first paper summarised the main questions which the study was attempting to answer as follows:

(i) Does quantitative HRA appear to work (i.e. does any HRA technique work)?
(ii) If more than one technique works, which is best?
(iii) Where techniques are imprecise, are they optimistic or pessimistic?
(iv) Are assessors consistently using the techniques, when they are accurately assessing the tasks?
(v) Are there any task types that appear to be difficult for one or more techniques to predict?
(vi) Do assessors know when they are being accurate, i.e. are they well-calibrated?
(vii) How can we improve the techniques or their usage?
(viii) Do these techniques lead to the reliable identification of effective error reduction mechanisms?

This paper presents results mainly relating to the first three of the above questions, with the third paper (Kirwan, 1997) discussing the remaining issues, as these require more detailed analysis of the actual usage of the techniques during the validation exercise.

Also in the first paper, a number of considerations required when attempting to carry out validations of HRQ approaches were raised, the main ones being the following:


• what type of scenarios to assess in the validation;
• where to get robust, industrially relevant data;
• how to select assessors;
• whether to pre-model scenarios or let the assessor model them, and how far to 'decompose' the tasks into errors or error sequences;
• how to prevent the experiment from being biased to any one technique;
• how to ensure that the results transfer to real HRA practice;
• whether and how to ensure subjects know their own overall performance level.

These are each discussed in the main report for the validation (Kirwan et al, 1994), and the results of these considerations are evident from the Method section in this paper.

Objectives

The objectives of this study as a whole were as follows:

(i) To carry out a validation exercise for the techniques HEART, JHEDI and THERP, to determine their predictive accuracy and precision (empirical validity).
(ii) To determine the consistency of usage of the techniques and (where appropriate) how to improve the predictive performance of the three techniques.
(iii) To consider the validity of using HRA quantification techniques for the determination of error reduction measures.

This second paper deals with the first objective, with the third paper detailing the results of the further analysis which leads to provisional practical guidance on the usage/development of the techniques, and the consideration of the validity of these techniques for error reduction purposes.

Scope

The validation dealt mainly with slips and lapses, i.e. so-called skill and rule-based (after Rasmussen et al, 1981) task/error quantification. It therefore dealt less with knowledge-based mistakes such as misdiagnosis, and routine or exceptional rule violations. It should also be noted that the experiment used HEART-II, as commissioned by Nuclear Electric. This is not within the public domain, but is the latest (non-computerised) version of the technique, and Nuclear Electric agreed to its usage by the assessors. It is believed by the authors that the results are nevertheless applicable to the version of HEART in the public domain, since the usage of the extra categories in HEART-II was slight in this experiment.

Method

Selection of scenarios

The scenarios were selected from the public domain and the developing CORE-DATA system, and are detailed in full in Taylor-Adams et al (1994). There were four main criteria for the selection of validation tasks or scenarios:

• Predominantly skill and rule-based tasks
• Robust data (e.g. preferably field data rather than simulator data, etc.)
• Broad spread of Human Error Probabilities (HEPs: from 0.27 to 0.00005)
• Representative tasks for the NP&R industries

In total, 30 HEPs and their associated tasks were elicited from the developing database and available data sources, 23 from real incident experience, five from simulator studies, and one each of expert-judgement-based and experimental-literature-based HEP data. The data are presented in summary form in Table 1. Note that this table presents only a summary of the data, and so the descriptors and associated HEPs should not be used in assessments.

Modelling of scenarios

The data, once selected, were then modelled as follows:

• general description of the scenario;
• inclusion of relevant PSF information in the description;
• provision of simple linear task analysis;
• provision of diagrams if necessary and relevant;
• statement of exact human error requiring quantification;
• review by the project team;
• pilot testing for each technique with one assessor;
• editing of descriptions where appropriate.

Three single-subject pilot studies (one for each technique) in fact yielded only marginal changes to the descriptive text, in terms of either clarification of the exact scenario details (sequence of events; location of personnel, etc.) or in terms of Performance Shaping Factor (PSF) information (e.g. how good was the interface; how much time would be available; etc.). Because these changes were not significant, this also meant that the results of the pilot studies could be assumed to be homogeneous with the main assessor trials, and could be incorporated accordingly.

Selection of assessors

The next stage of the project involved the selection of assessors to participate in the study. It was critical that assessors were selected who had adequate experience and/or training with the techniques. Otherwise, another factor (independent variable) of experience would have been likely to have a strong influence over the results and any interpretations based on them. In order to have high external validity of results, the ultimate criterion for the study was whether an assessor had actually used the technique in a real risk assessment. This criterion caused some initial problems in that many people had used a technique, but in a more informal setting than a formalised risk assessment. This happened to a limited extent with HEART, but to a far greater extent with THERP (JHEDI is only used in a formalised risk assessment framework). Identifying THERP assessors therefore took longer than for the other techniques. This is in itself an interesting shift in the predominant assessment technique used in the UK, since in the Kirwan (1988) validation it was the HEART assessors, not the THERP assessors, who were the more difficult ones to locate (JHEDI was only developed in 1990).

Table 1 Summarised details of data used in validation experiment

Task number & HEP | Description | Source of data | Data pedigree
1. 0.03 | Operator sets an incorrect calibration pressure | Kirwan et al (1990) | Real data
2. 0.001 | Operator performs valve calibration procedure incorrectly | Kirwan et al (1990) | Real data
3. 0.0005 | Container moved by crane while still attached to equipment | Kirwan et al (1990) | Real data
4. 0.003 | West pump maintained instead of east pump | Kirwan et al (1990) | Real data
5. 0.0007 | Wrong fuel container moved in error in highly controlled area | Kirwan et al (1990) | Real data
6. 0.0007 | Operators open discharge valves on the wrong tank | Kirwan et al (1990) | Real data
7. 0.05 | Gasket not fitted correctly | Stewart (1981) | Real data (reported)
8. 0.03 | Bearings are installed incorrectly during maintenance | Stewart (1981) | Real data (reported)
9. 0.0016 | Operator sets switch to wrong position | Beare et al (1984) | Simulator
10. 0.27 | Numerical calculation error (10 problems) | Agate and Drury (1980) | Ergonomics literature
11. 0.195 | Inspector fails to find 15 defects in an electrical unit within 3 h | Jacobson, cited in Kirwan (1982) | Real data
12. 0.064 | Errors on a touchscreen (missing a target area) | Stammers and Bird (1980) | Simulator
13. 0.00005 | Worker omits a solder joint in a unit; very high standards in the organisation | Swain, cited in Kirwan (1982) | Real data
14. 0.011 | Solder error on an electrical panel (semi-skilled apprentices) | Williams and Willey (1985) | Real data
15. 0.0048 | Worker selects an unsuitable component for an electrical panel | Williams and Willey (1985) | Real data
16. 0.00048 | Operator puts active waste into the wrong flask | Kirwan et al (1990) | Real data
17. 0.03 | Operator stores fuel in an area not cleared for fuel storage | Kirwan et al (1990) | Real data
18. 0.01 | Operator moves material before obtaining a permit to work | Kirwan et al (1990) | Real data
19. 0.042 | Welder works on the wrong pipe | Kirwan et al (1990) | Real data
20. 0.01 | Operator leaves valve open at end of task | Kirwan et al (1990) | Real data
21. 0.029 | Omits a procedural step in a nuclear power plant scenario | Kozinsky (1981) | Simulator
22. 0.0042 | Operator attempts an illegal operation on a control panel | Confidential source cited in Kirwan (1982) | Real data
23. 0.003 | Operator enters set-point outside set-point range on a panel | Confidential source cited in Kirwan (1982) | Real data
24. 0.16 | Trainee fails to make a correct diagnosis using learned rules | Marshall et al (1981) | Simulator
25. 0.0023 | Operator requests an invalid computer routine on a panel | Confidential source cited in Kirwan (1982) | Real data
26. 0.02 | Maintenance staff fail to isolate a subsystem before commencing maintenance | Kirwan et al (1990) | Real data
27. 0.003 | Operator fails to realise that a valve is in the wrong position during a proceduralised check | Comer et al (1984) | Expert judgement data
28. 0.003 | Nuclear power plant fuel storage limits are exceeded | Kirwan et al (1990) | Real data
29. 0.0005 | Radiation alarm is disabled on a transporter | Kirwan et al (1990) | Real data
30. 0.0007 | Chemicals of unsuitably high concentration are inadvertently discharged into the environment during an operation | Kirwan et al (1990) | Real data
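For readers who wish to check the summary statistics quoted later in the Discussion (a geometric mean of roughly 5E-3, and a range of 0.00005 to 0.27), the Table 1 HEPs can be summarised in a few lines. The following is a minimal illustrative sketch in Python (not part of the original study; it assumes only numpy):

    import numpy as np

    # True HEPs from Table 1 (summary values only; the full task
    # descriptions are in Taylor-Adams et al, 1994).
    true_heps = np.array([
        0.03, 0.001, 0.0005, 0.003, 0.0007, 0.0007, 0.05, 0.03, 0.0016,
        0.27, 0.195, 0.064, 0.00005, 0.011, 0.0048, 0.00048, 0.03, 0.01,
        0.042, 0.01, 0.029, 0.0042, 0.003, 0.16, 0.0023, 0.02, 0.003,
        0.003, 0.0005, 0.0007,
    ])

    # HEPs are treated on a log scale throughout the paper, so the
    # geometric (not arithmetic) mean characterises the central tendency.
    geo_mean = 10 ** np.log10(true_heps).mean()
    print(f"n = {true_heps.size}, "
          f"range = {true_heps.min():.0e} to {true_heps.max():.2f}, "
          f"geometric mean = {geo_mean:.1e}")  # of the order of 5E-3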

All assessors came from either consultancies that service the NP&R industries, or from NP&R companies themselves. The range of experience was approximately 6 months to 10 years (average 4.2 years), with assessors having each carried out a number of risk assessments, mainly in the NP&R, offshore/onshore petrochemical and transport industrial sectors. Assessors were predominantly from an engineering background, with about one third from an ergonomics/human factors or psychology academic background. A summary of the general experience and background of assessors can be found in Kennedy et al (1994). Each assessor was paid (or their company paid) for their time in the study, which was limited to two days as a maximum, at a standard consultancy rate. The assessors selected are believed to be an adequate representation of the assessors active in HRA in the UK today.

Experimental design and preparation

There were two primary experimental considerations which would critically affect the outcomes of the validation exercise: whether to have each assessor using only one technique or whether some assessors could utilise more than one approach; and the determination of how many scenarios/errors/tasks each assessor should be required to assess. Additionally, there were the issues of assessor instructions, formalisation of assessor material and experimental protocols, and the types of parametric and non-parametric testing of results that would be possible (for a full discussion see Kirwan et al, 1994; Lambert, 1993).

Although a within-subjects design (each subject uses each technique) is preferable both in terms of statistical power and in terms of determining the extent of the subject effect (see paper III), very few assessors had used more than one technique in a formal risk assessment setting, and so a between-subjects design was utilised (each subject used only one technique). The experiment designers then had to consider the resources available for the experiment, the possible effects of fatigue on the assessors, and data availability for the validation.

The outcome of consideration of all these factors suggested a target of 30 HEPs and 30 assessors, 10 per technique.
A pilot study was carried out with one assessor using the most resource-intensive of the three techniques, THERP, and based on this study it was determined that two days per assessor (as a maximum) would suffice, and that beyond this fatigue effects would produce negative performance results. This was generally confirmed by the reports of the remaining assessors after they had each finished the exercise.

Assessor instructions, materials and experimental protocol

Each subject worked through a detailed and illustrated (with diagrams relevant to the task scenarios) question booklet in the pre-specified randomised order over the two-day period (some assessors finished after a day and a half; most finished towards the end of day 2). Throughout the experimental sessions (of which there were nine including the three pilot studies, in different locations on different dates), one or two 'invigilators' were present. These were part of the project team, and their role was to ensure that the tasks were completed, and to answer questions of clarification. Such answers would also be transmitted to other assessors so that all assessors had equal information. The invigilators' role was also to prevent discussions between two or more assessors, since this would have allowed uncontrollable bias into the experiment.

It should be noted that one of the authors (Kirwan) was the developer of the technique JHEDI. To prevent any undue bias, the validation tasks were selected by other members of the project team (see Lambert, 1993; Taylor-Adams et al, 1994) and ratified by Kirwan, and the experimental tasks of randomisation and invigilation, and the 'first-cut' analyses of the data, were all undertaken by the other team members (see Kennedy et al, 1994). This avoided bias towards JHEDI in the conducting of the experimental trials themselves, the data collection and the preliminary analysis.

Results

The experiment was successfully carried out in that 30 assessors took part, and out of a total of 900 required HEP estimates only five missing values occurred, where assessors felt they could not quantify the HEP using their respective technique. This means that certain results quoted, namely the percentages, may not add up to exactly 100%. The results are given first for the overall performance of the techniques, then for individual techniques; then consistency of usage is considered, followed by the apparent calibration of the assessors. The results section aims to answer the questions defined earlier. The implications of the results are then considered in the Discussion section (complete quantitative results and raw data are contained in Kennedy et al, 1994).

Overall HRA performance

Predictive validity. The analysis of all the data (i.e. all 895 estimated HEPs) shows a significant correlation between estimates and their corresponding true values (Kendall's coefficient of concordance: Z = 11.807, p < 0.01). This supports the validity of the HRA quantification approach as a whole, especially as no assessors or outliers have been excluded.

Individual correlations for all subjects are shown in Table 2. There are 23 significant correlations (some significant at the p < 0.01 level) out of a possible 30 correlations. This is a very positive result, supporting the validity of the HRQ approach.

Table 2 Correlations for each subject for the three techniques

Rank | THERP | HEART | JHEDI
1 | 0.615** (8) | 0.577** (1) | 0.633** (7)
2 | 0.581** (2) | 0.558** (9) | 0.551** (9)
3 | 0.540** (2) | 0.473** (8) | 0.533** (5)
4 | 0.521** (4) | 0.440** (2) | 0.452** (10)
5 | 0.437** (10) | 0.389* (5) | 0.436** (2)
6 | 0.311* (5) | 0.370* (3) | 0.423* (4)
7 | 0.298 (3) | 0.351* (7) | 0.418* (1)
8 | 0.297 (1) | 0.347* (4) | 0.401* (3)
9 | 0.254 (7) | 0.217 (10) | 0.386* (8)
10 | 0.078 (9) | 0.124 (6) | 0.275 (6)

*Significant p < 0.05; **Significant p < 0.01
Numbers in brackets refer to the assessor number for that technique and correspond to the assessor numbers on the scatterplots (e.g. see Figures 2-4; raw data and confidence levels: Kennedy et al, 1994)

Precision. Table 3 shows that there is an overall average of 72% precision (within a factor of 10) for all assessors, irrespective of whether they were significantly correlated or not. This figure includes all data estimates, even the apparent outliers that have been identified in the study (see the Task outliers section). This is therefore a reasonably good result, supporting HRA quantification as a whole. Furthermore, no single assessor dropped below 60% precision in the study. The precision within a factor of 3 is approximately 38% for all techniques (see Table 3). This is a fairly high percentage given the precision level of a factor of 3. The degree of optimism and pessimism is not too large, as shown in Table 3 and also in the histogram in Figure 1, with only a small percentage of estimates at the extreme optimistic and pessimistic ends of the histogram (i.e. greater than a factor of 100 from the actual estimate). Certainly there is room for improvement, but the optimism and pessimism are not in themselves dominating the results, and estimates were more likely to be pessimistic (17.5%) than optimistic (9.7%). The degree of optimism is further discussed below under individual techniques and in the Discussion.

Table 3 Precision for the three techniques for all HRA assessors

Technique | Factor 3 | Factor 10 | Optimistic | Pessimistic
THERP | 38.33% | 72% | 13.67% | 13.33%
HEART | 32.67% | 70.33% | 11% | 18.33%
JHEDI | 43.67% | 75% | 3.33% | 21.33%

Figure 1 Overall HRA optimism and pessimism [histogram of all estimates by accuracy band: exact; optimistic or pessimistic within a factor of 3, 10 or 100; and beyond a factor of 100]

Individual technique performance

Predictive accuracy. The analysis of all data for each individual technique (295 data points) shows a significant correlation in each case (using Kendall's coefficient of concordance): THERP Z = 6.86; HEART Z = 6.29; JHEDI Z = 8.14; all significant at p < 0.01.

Table 2 has already shown the individual correlations achieved by individual assessors (their names are excluded for reasons of confidentiality) using the three techniques, with 23 out of 30 assessors achieving a significant level of correlation. This is an encouraging result since more than three quarters of the assessors, some relatively inexperienced in terms of years of training, have achieved a significant correlation. Although the JHEDI technique has more significant correlations than the other two techniques, this difference is believed to be marginal and not significant (see Discussion).
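The per-assessor figures in Table 2 are rank correlations between an assessor's 30 estimates and the 30 true HEPs. A hedged sketch of this kind of check follows, using Kendall's tau from scipy; note that the paper reports Kendall's coefficient of concordance Z scores, so this is an illustrative analogue rather than the authors' exact procedure:

    from scipy.stats import kendalltau

    def predictive_accuracy(estimates, true_heps):
        """Rank correlation between one assessor's HEP estimates and the
        true values. Kendall's tau is invariant under monotone transforms,
        so raw HEPs and log10 HEPs give the same result."""
        tau, p_value = kendalltau(estimates, true_heps)
        return tau, p_value

    # Hypothetical usage: est and true are paired sequences of 30 HEPs,
    # with any missing estimates dropped pairwise beforehand.
    # tau, p = predictive_accuracy(est, true)
    # significant = p < 0.05   # the criterion marked * in Table 2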
Precision for individual techniques. Whilst some assessors are more optimistic/pessimistic than others, precision never drops below 60% for any assessor or any technique, irrespective of whether the assessor achieved a significant correlation. The highest and lowest precision values for the techniques within a factor of 3 and within a factor of 10 were as follows (Table 4):

Table 4 Precision for the three techniques

Technique | Highest precision, factor of 3 | Highest precision, factor of 10 | Lowest precision, factor of 3 | Lowest precision, factor of 10
THERP | 56.67% | 80% | 10% | 63.33%
HEART | 46.67% | 76.67% | 16.67% | 60%
JHEDI | 66.67% | 86.67% | 24.67% | 70%

Figures 2-4 show scatterplots of the best performance for each technique, i.e. the most precise estimates for a single assessor.

Figure 2 [scatterplot of estimated against true HEPs (log scale) for Assessor 10]

Figure 3 [scatterplot of estimated against true HEPs (log scale) for Assessor 1]

Figure 4 [scatterplot of estimated against true HEPs (log scale) for Assessor 7]

Table 3 earlier contained the overall optimism and pessimism results for each technique, and it appears that JHEDI in particular is a less optimistic technique, at the cost of being the more pessimistic of the three. HEART also appears more pessimistic than THERP. The best and worst optimism results for individual techniques were as follows (Table 5):

Table 5 Optimism and pessimism for the three techniques

Technique | Most optimistic | Least optimistic | Most pessimistic | Least pessimistic
THERP | 26.67% | 10% | 20% | 6.67%
HEART | 26.67% | 3.33% | 33.33% | 13.33%
JHEDI | 6.67% | 0% | 26.67% | 13.33%
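The precision and optimism/pessimism percentages in Tables 3-5 and Figure 1 are all derived from the ratio of estimated to true HEP. The following is a minimal sketch of the banding logic; the exact cut-off treatment is an assumption, but it is consistent with the three Table 3 columns for each technique summing to roughly 100% (less the missing values):

    import numpy as np

    def precision_bands(estimates, true_heps):
        """Classify estimates by their ratio to the true HEP. 'Pessimistic'
        means the estimated HEP is too high (conservative); 'optimistic'
        means it is too low (non-conservative)."""
        ratio = np.asarray(estimates) / np.asarray(true_heps)
        within = lambda k: np.mean((ratio >= 1.0 / k) & (ratio <= k))
        return {
            'factor_3': within(3.0),
            'factor_10': within(10.0),
            'pessimistic': np.mean(ratio > 10.0),  # beyond a factor of 10, high
            'optimistic': np.mean(ratio < 0.1),    # beyond a factor of 10, low
        }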

Consistency of usage of techniques

There was variation in the usage of the techniques by the different assessors, even though there was often quantitative agreement in the HEP reached. This is dealt with further in the third paper.

Assessor calibration

The average degree of calibration is shown in Table 6, where 'calibrated' means that the subjective confidence rating by the assessor (e.g. within a factor of 3; or 10; or 100; or potentially outside a factor of 100) equated with the actual corresponding accuracy score [see Kennedy et al (1994) for full details].

Table 6 Confidence and calibration

Technique | Over-confident | Under-confident | Calibrated
THERP | 25.33% | 47.33% | 24.67%
HEART | 34% | 42.67% | 21.33%
JHEDI | 37% | 40% | 22.33%

These calibration results suggest there is room for improvement. It is, however, perhaps not surprising that the average calibration across all assessments is relatively low (23%), since calibration within a factor of 3 is a comparatively strict criterion, and assessors so rarely gain true feedback about their performance. Whilst under-confidence (i.e. assessors being more accurate than they thought), at an average level of 43.33%, is higher than average over-confidence (32.22%), the over-confidence level appears undesirably high. Overall, calibration was low: subjects were not necessarily accurate when they thought they were accurate, nor were they necessarily inaccurate when they thought they were inaccurate.

Task outliers

The results showed that task 13 in particular was consistently poorly assessed by all techniques, with a few other specific tasks for each technique being potential outliers. These are examined in more detail in the third paper. However, at present the results are favourable for HRA and the three individual techniques, even with the potential outliers included in the analysis.

Summary of results

The above results can be summarised in terms of the original questions listed in the Background:

(i) All three techniques have achieved a reasonable level of accuracy.
(ii) There is some variation in assessors' usage of individual techniques.
(iii) HRA appears to work, at least at a gross empirical level.
(iv) No technique has out-performed the other(s).
(v) Where techniques are imprecise, there is a tendency towards pessimism.
(vi) At least one task appeared difficult (task 13) for all techniques, and certain tasks were consistently inaccurately assessed by individual techniques.
(vii) Assessors are not generally well-calibrated.

These and other points raised in the Introduction are discussed in the next section.

Discussion

The discussion firstly considers the adequacy of the experiment. This determines the validity of the results, and so is the logical precursor to interpreting the results and considering their implications. Residual areas for further investigation, and the implications of the results of the study, will also be addressed before concluding this paper.

Adequacy of the experiment

The experiment has to be valid in several respects, as identified in the Introduction:

• adequacy of data
• adequacy of subjects
• modelling of tasks
• prevention of bias
• transferability of results to real HRA practice

These are each discussed below.

Adequacy of data. The range of data was not as evenly spread as the experimenters would have liked, but due to data limitations this was unavoidable. Thus, the data themselves do cluster around the HEP value of 1E-2, with proportionately fewer real HEPs below 1E-4 (one value) and above 1E-1 (three values). The geometric mean of all of the real data is approximately 5E-3. There is little that can be done about this in such a validation experiment, and perhaps this clustering of observed data around 1E-2 is reflecting reality. It certainly appears to be the case, though, that assessors have difficulty in accurately quantifying the lower end of the HEP range. However, since this means that assessors will be pessimistic, rather than optimistic, this suggests that the safety of assessments is not being compromised. This assumes that assessors are similarly pessimistic (or conservative) when carrying out real assessments.

Adequacy of subjects. Thirty subjects participated in the experiment, which is a relatively large proportion of the active assessors in the UK today. The average experience level of 4 years fell short of that desired by the experimenters, but this was a matter of practicalities and assessor availability. However, since the results appear positive even with some relatively low-experience subjects, the adequacy of subjects has been demonstrated. Although this was not formally analysed in the results, since such analysis would breach the confidentiality of the subjects, there does not appear to be a correlation between experience and accuracy (in this experiment at least). This is at first sight surprising, since one would naturally expect the more experienced subjects to be more accurate. However, in reality, the more experienced assessors will be more familiar with the techniques, but may not necessarily be more familiar with real data. Thus, what is important to achieve good calibration is to gain feedback on one's assessments. If there is no feedback to assessors on whether their assessments were right or wrong, then assessors may gain a false sense of confidence from carrying out more assessments. In the pure expert judgement field it is known that experts do not perform well if they gain no feedback, or at the least, their performance is variable.
Modelling of tasks. The task level of the data appeared adequate to lead to positive results, but a number of subjects declared that they were working with less information than would normally be the case. Significant re-modelling of the tasks did not occur with any subjects, though a number broke the task down into small event trees, particularly with THERP. Even with THERP, this by no means occurred with all tasks, since some tasks could be estimated simply by extracting a HEP from a table. The modelling therefore appeared to be reasonable. This issue is, however, explored in more depth in the third paper.

Prevention of bias. The reader may have already noted that JHEDI appeared to work better than the other techniques in terms of precision and correlational accuracy, and was less optimistic. Whilst each one of these alone may not be significant, together they suggest that JHEDI out-performed HEART and THERP. This conclusion is not drawn from this experiment, however, for the following two main reasons. Firstly, as already noted, a high proportion of the real data comes from the Nuclear Chemical industry, where most JHEDI assessors work. There is therefore a potential bias towards JHEDI assessors. However, in contrast, the remainder of the tasks were predominantly nuclear power, which will have put the JHEDI assessors at a disadvantage (nuclear power plants and reprocessing plants are significantly different).

Secondly, a small proportion (two) of the non-incident-based HEPs are related to data already within the HRMS technique's database. Such data have been re-interpreted, re-described, made generic and rendered conservative before being utilised in JHEDI, and therefore are not immediately recognisable from the task descriptions. However, several of the assessors did perform well on these two tasks. This would not make a significant impact on the results (i.e. in terms of correlations or precision), but might have given JHEDI a small advantage with respect to these tasks.

Of these two factors, it is probably the former that has most facilitated JHEDI. In terms of the implication for the results, these two factors may therefore mean that, in another validation with less Nuclear-Chemical-dominated data, JHEDI's performance may have been more similar to HEART's and THERP's, at least with respect to accuracy and precision. The pessimistic nature of JHEDI (i.e. rather than optimistic), though, would be unlikely to change from the results of this experiment, as it is not related to these potential biasing factors.

Transferability of results to real HRA practice. Most of the assessors argued that they would have more confidence in their results, and would hence expect them to be more accurate, if they had had more time to make the assessments, and if they had had more robust information on the tasks and their PSF. Typically, even when using HEART or JHEDI (the quicker techniques), assessors would spend a couple of days on a HEP, even though the calculation might only take half an hour. This extra time is spent on giving the task due consideration, re-checking the task analysis, establishing PSF presences and levels, etc. This qualitative analysis is generally seen as essential by HRA assessors, and as a principal part of the quantification process.

It should also be noted that most (if not all) current HRAs for risk assessments have quite rigid quality assurance procedures. This means that in practice the assessors would have their assessments checked, queried and corroborated by at least one other assessor, as well as having them further checked by a supervisor. This quality assurance process could be expected to improve performance in a real HRA. Some assessors would therefore argue that the precision and accuracy results derived here are relatively low compared with what real practice would achieve.

Furthermore, assessments are rarely carried out in such a time-pressured way, where there is approximately 30 min from first encountering a scenario to giving it a final HEP. This compressed time frame does not give the assessors enough time to consider the scenario fully; they simply have to accept the modelling as being correct. Assessors would usually have time to think about the scenario, and would model it themselves. This allows 'internalisation' of the scenario, and enables the assessor to view it from the operator's perspective. Some assessors would argue that this 'incubation period' is necessary to carry out an accurate and comprehensive assessment. Most real assessments therefore involve a significantly longer period of qualitative analysis, of the order of days (and sometimes weeks), and the quantification process itself usually takes at least half a day per scenario, and in some cases takes days if a detailed assessment is being carried out. This validation must therefore be seen as stretching the capabilities of HRA and the assessors themselves significantly, and the positive results are probably themselves pessimistic compared to what might be seen if assessors carried out the quantification of the scenarios as they would normally operate in their usual work environment.

Overall, therefore, the experimental results are believed to be valid (and possibly conservative). The remainder of the discussion considers how they should be interpreted, and their implications.

Interpretation and implications of results

Accuracy and precision. Despite the methodological drawbacks associated with carrying out 30 assessments with limited time and information, the results appear to show that, overall, Human Reliability quantification has validity as an approach. What is particularly noteworthy is that this conclusion stands based on all the data, i.e. without the need to remove outliers or non-significant subjects. Whilst there could obviously be improvement, the average 72% precision, and the relatively tight precision range of 60-87%, is promising in a validation experiment.

The generally equivalent performance of the techniques has at least three main possible interpretations associated with it. The first is simply that the techniques are all roughly equivalent in their performance, at least within the experimental confines of this study. This could be due to the fact that, as they are 'competitors', all three techniques are as fine-tuned as has been possible, and the general 72% level of precision is the best that can be hoped for in a validation, and possibly in practice as well. If this is the case, then effort should be expended on enhancing consistency of results and avoidance of 'extreme' inaccuracies (outside a factor of 100), and then on improving accuracy by further incorporation of better data into the techniques themselves.

The second interpretation is that the 'subject' effect is actually dominating the results, rather than the 'technique' effect or variable. This means that equivalent performance has been achieved because subjects have effectively been randomly selected from the available pool, and so the subject effect has averaged out the technique effect.
The implications of this interpretation are less clear. It implies that there are 'good' assessors and 'not-so-good' assessors. What a company wants to know, then, is not which technique to use, since it may not matter as much as which assessor to use. In fact, from the company or the regulatory perspective, the implications are perhaps somewhat unpalatable.

A strong argument against this interpretation is that some assessors with relatively little experience have performed well. If this is a product of training and feedback, rather than one of natural intuition, then proper training can be devised and a general minimum standard of assessment performance can be established and maintained. Such a situation would fit well into the organisational values of the NP&R industries, since it would 'de-mystify' the art of HRA, and enable more accountable regulation both by the industries themselves and by their respective regulators.

Nevertheless, this 'subject-effect' interpretation, which certainly echoes the comments of a number of practitioners and clients, deserves further examination, in terms of further investigation of the consistency of usage of the three techniques. The only alternative way of examining the subject effect is another validation experiment using a within-subjects design. As noted earlier, however, there is a paucity of multi-technique assessors, and until such time as this situation changes, such a within-subjects validation experiment cannot take place.

The third main potential implication of these results is that there is a third, artifactual (i.e. uncontrolled) variable causing the results. The most likely contender is the data range, or clustering of real HEPs around the 1E-2 region. This could imply that, for example, if assessors simply assessed every HEP as 1E-2, then adequate precision would be obtained. In fact this appears at first sight correct, as such an assessor would have obtained over 60% precision. However, no assessor kept to the 1E-2 range. Furthermore, an assessor who did so would obtain a correlation approaching zero. Assessors, rather, did attempt to use the full range of HEPs (i.e. from 1E-5 to 1.0), although JHEDI in particular generally prevents assessors quantifying below 1E-4 (see the first paper). Therefore this interpretation is given least credibility in this report.

Therefore, it is believed that the results are valid, and that these results lend empirical support to the validity of the techniques.

Optimism, calibration and consistency of usage. Optimism is undesirable in any HRA technique used in PSA, and the results suggest that pessimism rather than optimism is the norm. However, there has been some optimism, with a small proportion of estimates being significantly optimistic (i.e. by greater than a factor of 100). Such optimism needs to be dealt with in two ways. The first is to improve the calibration of assessors (as shown, their calibration could be significantly improved). Calibration is always a function of feedback. Some initial feedback has already occurred, via formal feedback to the assessors, but this only extends to those assessors who participated in the study, and to a limited number of HEPs. To further help in the more general calibration process, more data need to be accessible to assessors. This is occurring via the CORE-DATA data collection programme (Taylor-Adams and Kirwan, 1995). Such data availability should be considered a priority to improve HRA performance generally in the UK NP&R and other fields.

The second way of dealing with optimism is to review the techniques themselves, and to produce more structured guidance on carrying out the assessments. What would perhaps be most beneficial would be the provision, for each technique, of a number of 'benchmark' HRA quantification exercises. Such benchmark examples could in fact be derived from the validation experiment results, possibly with some input from the technique developers themselves.

Analysis of consistency of usage, provision of benchmark examples and investigation of outliers are probably the best ways to determine how to improve the techniques' performance. Each of these aspects is dealt with in the further analysis reported in the third paper.

Further fundamental HRA issues

Once consistency of usage of the techniques is examined in detail, more fundamental issues may be examined. This current paper has to an extent vindicated the three quantification techniques THERP, HEART and JHEDI, and has more generally shown that HRA is a potentially valid approach. HRA, however, is more than just quantification, with significant analysis required to analyse tasks and identify errors, and to reduce error impact if required, as is often the case in current assessments. The three quantification techniques reviewed can all be used to prioritise error reduction measures, since all three techniques (but particularly HEART and JHEDI, and also SLIM-MAUD: Embrey et al, 1984) utilise Performance Shaping Factors (PSF) in their quantification algorithms. However, to many ergonomists, PSF such as 'training' or 'quality of the interface' are simply too gross a measure of the ergonomics status of a scenario to be meaningful (e.g. a 300-question interface audit, which an ergonomist might use to evaluate the interface, must be reduced to an assessor rating the interface as a whole on a single monotonic dimension on a scale of 1-9, called 'quality of interface'). It may be, therefore, that although such PSF are necessarily gross for HRA purposes, and are relatively useful in predicting error rates, such PSF should be considered inappropriate for error reduction purposes. This to an extent depends on theoretical considerations, but it can also be examined empirically by reviewing the detailed results of how different assessors used the PSF to arrive at the same quantitative HEP. If different assessors used different PSF but produced the same (accurate) HEPs, then such PSF should not be used for error reduction, since there is a logical incoherence in the quantification process, possibly explained by the subject effect or by inherent flexibility or uncertainty in the technique's construction. The third paper will therefore examine the consistency of usage from this perspective, to determine firstly whether HRA techniques are accurate but internally inconsistent, and secondly whether HRA quantification techniques can be used for error reduction purposes. If the answer to the second issue is negative, then other measures (task analysis and error identification) should have priority in determining how to reduce error likelihood.
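The PSF-coherence check described above lends itself to a simple screening pass over the assessment records. The following sketch assumes a hypothetical record format (task, assessor, HEP, set of PSF invoked), not the study's actual data structure; the real analysis is reported in the third paper. It flags pairs of assessors who reached near-identical HEPs on the same task from different PSF profiles:

    from itertools import combinations

    def psf_incoherence(assessments, hep_tolerance=1.5):
        """assessments: iterable of (task_id, assessor_id, hep, psf_set)
        records, where psf_set is a frozenset of PSF names (a hypothetical
        format). Returns, per task, the assessor pairs who produced
        near-identical HEPs from different PSF profiles, together with the
        PSF on which they differed."""
        by_task = {}
        for task, assessor, hep, psfs in assessments:
            by_task.setdefault(task, []).append((assessor, hep, psfs))
        flagged = {}
        for task, entries in by_task.items():
            for (a1, h1, p1), (a2, h2, p2) in combinations(entries, 2):
                same_hep = max(h1 / h2, h2 / h1) <= hep_tolerance
                if same_hep and p1 != p2:
                    # symmetric difference: the PSF used by one but not both
                    flagged.setdefault(task, []).append((a1, a2, p1 ^ p2))
        return flagged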
Conclusions

(1) The validation experiment successfully elicited 30 HEP assessments from each of 30 assessors, each using one of the three techniques HEART, THERP and JHEDI.

(2) The results show a significant overall correlation of all estimates with the known true values, with 23 individual significant correlations, and a general precision range of 60-87%, the average precision being 72%.

(3) All techniques performed similarly well, though JHEDI appeared to perform slightly better than the other two techniques in terms of correlations, precision and its optimistic/pessimistic ratio. However, there is currently no means of testing whether this is a statistical difference, and it may be due to a slight bias in favour of Nuclear Reprocessing tasks.

(4) Subjects were not consistent in their usage of the techniques, and calibration was low. The consistency of usage of the techniques could therefore benefit from further investigation, and from the provision of 'benchmark' examples based on the results of the validation study. The determination of whether certain tasks are beyond the scope of certain techniques could also help improve the performance of the techniques. This further work, along with discussion of more fundamental issues of the use of these techniques for explicit error reduction, is reported in the third paper.

Acknowledgements

This work was carried out as part of the UK Health & Safety Executive (HSE) Generic Safety Nuclear Research programme for 1993/94. The authors would like to thank Mike Gray of HSE, who was the project officer, and all the personnel and companies who took part in the validation exercises. The contents of this paper are however the opinions of the authors and do not necessarily reflect those of the HSE or of any of the companies/participants who took part in the study.

References

Agate, S. J. and Drury, C. G. 1980 'Electronic calculators - which notation is better?' Applied Ergonomics 11, 2-6
Beare, A. N., Dorris, R. E., Crowe, D. S. and Kozinsky, E. J. 1984 'A simulator-based study of human errors in nuclear power plant control room tasks' USNRC report NUREG/CR-3309, Washington DC 20555
Comer, M. K., Seaver, D. A., Stillwell, W. G. and Gaddy, C. D. 1984 Generating Human Reliability Estimates Using Expert Judgement, Vols. 1 and 2, NUREG/CR-3688 (SAND 84-7115), Sandia National Laboratory, Albuquerque, New Mexico 87185, for Office of Nuclear Regulatory Research, US Nuclear Regulatory Commission, Washington DC 20555
Embrey, D. E., Humphreys, P. C., Rosa, E. A., Kirwan, B. and Rea, K. 1984 'SLIM-MAUD: an approach to assessing human error probabilities using structured expert judgement' USNRC, NUREG/CR-3518, Washington DC 20555
Kennedy, R., Kirwan, B. and Taylor-Adams, S. 1994 'Detailed results of the validation of THERP, HEART, and JHEDI' Vol. 2, Industrial Ergonomics Group, University of Birmingham
Kirwan, B. 1982 'A comparative evaluation of three subjective human reliability quantification techniques' MSc thesis, Department of Engineering Production, University of Birmingham
Kirwan, B. 1988 'A comparative evaluation of five human reliability assessment techniques' in Sayers, B. A. (ed) Human Factors and Decision Making, Elsevier, London, pp. 87-109
Kirwan, B. 1990 'A resources flexible approach to human reliability assessment for PRA' Safety and Reliability Symposium, Altrincham, September, Elsevier Applied Sciences, London
Kirwan, B., Martin, B. R., Rycraft, H. and Smith, A. 1990 'Human error data collection and data generation' International Journal of Quality and Reliability Management 7, 34-66
Kirwan, B. 1994 A Guide to Practical Human Reliability Assessment, Taylor and Francis, London
Kirwan, B., Kennedy, R., Taylor-Adams, S. and Lambert, B. 1994 'Validation of three Human Reliability Assessment techniques: THERP, HEART, and JHEDI' Vol. 1, Industrial Ergonomics Group, University of Birmingham
Kirwan, B. 1996 'The validation of three Human Reliability Quantification techniques - THERP, HEART and JHEDI: Part 1 - Technique descriptions and validation issues' Applied Ergonomics 27, 359-373
Kirwan, B. 1997 'The validation of three Human Reliability Quantification techniques: THERP, HEART and JHEDI - Part III - Practical aspects of the usage of the techniques' Applied Ergonomics 28, 27-39
Kozinsky, E. J. 1981 'Human factors research on power plant simulators' Proceedings of the 25th Human Factors Society Annual Meeting, Rochester, New York, October 12-16, pp. 173-177
Lambert, B. 1993 'Validation of human reliability quantification techniques' MSc thesis, University of Birmingham
Marshall, E. C., Duncan, K. D. and Baker, S. M. 1981 'The role of withheld information in the training of process plant fault diagnosis' Ergonomics 24, 711-724
Rasmussen, J., Pedersen, O. M., Carnino, A., Griffon, M., Mancini, C. and Gagnolet, P. 1981 'Classification system for reporting events involving human malfunctions' RISO-M-2240, DK-4000, Riso National Laboratories, Roskilde, Denmark
Stammers, R. B. and Bird, J. M. 1980 'Controller evaluation of a touch input traffic data system: an indelicate experiment' Human Factors 22, 581-590
Stewart, C. 1981 'Human error probabilities and the impact of communications (maintenance)' US Department of Energy, Idaho Operations Office, EG&G Report SSBC-5572, Idaho
Swain, A. D. and Guttmann, H. E. 1983 'A handbook of human reliability analysis with emphasis on nuclear power plant applications' USNRC, NUREG/CR-1278, Washington DC 20555
Taylor-Adams, S. E. and Kirwan, B. 1995 'Human reliability data requirements' International Journal of Quality and Reliability Management 12, 24-46, MCB, Bradford
Taylor-Adams, S., Lambert, B., Kennedy, R. and Kirwan, B. 1994 'Task data used in the validation of THERP, HEART and JHEDI' Vol. 3, Industrial Ergonomics Group, University of Birmingham
Williams, J. C. 1986 'HEART - a proposed method for assessing and reducing human error' in Ninth Advances in Reliability Technology Symposium, NEC, Birmingham, June, AEA Technology, Culcheth, Warrington
Williams, J. C. 1988 'A data-based method for assessing and reducing human error to improve operational performance' in Proceedings of the IEEE Fourth Conference on Human Factors and Power Plants, Monterey, California, 5-9 June, pp. 436-450, IEEE, New York
Williams, J. C. and Willey, J. 1985 'Quantification of human error in maintenance for process plant probabilistic risk assessment' in Proceedings of the Assessment and Control of Major Hazards, Institution of Chemical Engineers, Symposium Series No. 93, EFCE No. 322, pp. 353-365
Williams, J. C. 1992 'Toward an improved evaluation tool for users of HEART' Proceedings of the International Conference on Hazard Identification, Risk Analysis, Human Factors and Human Reliability in Process Safety, Orlando, February, Center for Chemical Process Safety (CCPS)
