
Good Enough V&V for Simulations: Some Possibly Helpful Thoughts from the Law & Ethics of Commercial Software


Cem Kaner
Florida Institute of Technology
Department of Computer Sciences
150 West University Blvd.
Melbourne, FL 32901
[email protected]
321-674-7137

Stephen J. Swenson
AEgis Technologies Group
[email protected]
401-539-2504

Keywords:
Ethics, software testing, liability, cost-benefit analysis, negligence analysis, cost of quality, VV&A

ABSTRACT: Historically, the focus of VV&A research and development has been on developing new techniques and metrics and on developing standards to ensure consistency across a wide range of applications. This work has proven technically beneficial to developers of M&S applications as “how to” guidance for conducting V&V. We all understand that how much V&V is needed is driven by the needs of the target application. But the guidance is vague and doesn’t, in any substantive way, address the value of V&V in relation to the ramifications of inadequate testing. The courts have long wrestled with the responsibility of product and service providers to adequately protect consumers and bystanders against failure. The theme of the 2008 Spring SIW is “Innovation at the Intersections.” This paper will address the intersection of modeling and simulation VV&A with civil law, suggesting a style of cost/benefit analysis that might be specifically applicable to M&S VV&A, how it might be used to frame discussion of ethical commitment to adequate M&S testing, and how some of the information needed for that analysis might be collected.

1. Introduction

In 1944, a group of barges including the Anna C broke free of their moorings in New York Harbor and drifted together. They stopped when the Anna C bumped a tanker whose propeller broke a hole in her. No one was aboard the Anna C and therefore no one noticed she was taking on water. She sank. Had someone noticed the water, she could have been saved by the people who were on the scene dealing with this multi-barge situation. The ensuing lawsuit required the judge to determine who should pay for the loss of cargo on the Anna C.

Judge Learned Hand first determined there was not yet a general rule in American maritime law for allocating damages when a barge breaks free and the bargee (a person assigned to stay on the barge) is not on the barge. Bargees sometimes leave barges for legitimate reasons, and if the risk of their absence is low enough, this is not unreasonable. Rather than working through the mass of individual, conflicting precedents, Learned Hand laid out a fundamental principle for decisions of this type:

“Since there are occasions when every vessel will break from her moorings, and since, if she does, she becomes a menace to those about her; the owner's duty, as in other similar situations, to provide against resulting injuries is a function of three variables: (1) The probability that she will break away; (2) the gravity of the resulting injury, if she does; (3) the burden of adequate precautions. Possibly it serves to bring this notion into relief to state it in algebraic terms: if the probability be called P; the injury, L; and the burden, B; liability depends upon whether B is less than L multiplied by P: i.e., whether B less than PL.”

This case, United States v. Carroll Towing Co. [1], is one of the most famous in American products liability law.

Learned Hand is widely recognized (on all sides of the political spectrum) as one of the great judges of American history. For example, his work was foundational for the Law and Economics tradition of legal analysis, which has played a large role in the “tort reform” movement that has largely limited the reach of products liability litigation.

The T.J. Hooper [2] is another key case decided by Learned Hand. This case involves the sinking of two barges in a gale. Learned Hand determined that the tugs that were towing the barges were unseaworthy because they did not have radios and therefore could not call for help. The owners argued that no legislation or regulations required radios on tugs and that industry standard practice was that tugs did not have them. Learned Hand’s response:

“Is it then a final answer that the business had not yet generally adopted receiving sets? There are yet, no doubt, cases where courts seem to make the general practice of the calling the standard of proper diligence; we have indeed given some currency to the notion ourselves…. Indeed in most cases reasonable prudence is in fact common prudence; but strictly it is never its measure; a whole calling may have unduly lagged in the adoption of new and available devices. It may never set its own tests, however persuasive be its usages. Courts must in the end say what is required; there are precautions so imperative that even their universal disregard will not excuse their omission.… We hold the tugs therefore because had they been properly equipped, they would have got the Arlington reports. The injury was a direct consequence of this unseaworthiness. Decree affirmed.”

Put succinctly, industry standards are not legal standards. The argument that you were following industry-standard practices is no shield for inadequate work.

In this article, we argue that Learned Hand’s cost-benefit analysis is a useful approach for VV&A of simulations.

We are not recommending a liability standard in this paper. We are not arguing that courts should adopt a products liability standard for simulations. Most complex simulations are created for government clients, typically for Defense applications. In the United States, developers of custom products for the United States DoD are not subject to civilian products liability litigation. Many factors militate against applying civilian risk regulations to military decision-making, and we are not considering any aspect of the history or appropriateness of that well-established rule in this paper.

What we are suggesting is that practitioners of VV&A for simulations might find this analytical approach useful:

• as an ethical framework in a context (simulations) that is so different from traditional DoD-related VV&A that many of the ethical/analytical touchstones are not applicable;
• as a practical guideline for considering the types of investigation that the V&V practitioner might undertake to determine whether a product is good enough, or to persuasively build an argument that it is not.

2. The Problem

Software products for DoD clients are usually created under a development model in which the key decisions are made early and documented in detail in specifications of the project/product requirements and the internal and external product design. Special care is taken to create applicable oracles, decision rules that can be applied to determine whether the product is operating within expectations or not. Not all projects are done this way, but in the context of a contracting and development culture that assumes this as the normal style, significant deviations from this approach carry significant development, cost, and quality risk.

Verification and validation traditions have evolved in the context of this development culture. V&V analysts vigorously encourage project managers and others to develop clearer and more comprehensive specifications and oracles and to relate them to the specific objectives of the model/simulation [3-5]. To the extent that the specifications can be considered complete, thorough testing tied directly to those specifications is desirable and is often taken as sufficient for system-level testing. In DoD projects, verification also includes assessment of the development process and practices employed [3, 6], comparing them to standards-driven expectations.

A diligent, ethical practitioner in traditional V&V has a broad collection of standards and cultural support for doing her or his work in a way that will be seen as competent and respectable.

Simulations are different.

A simulation is an implementation of a model over time [6]. “All models are wrong, but some are useful” [7, p. 424]. An essential part of the evaluation of a simulation is its utility, its value to its intended stakeholders. It is not only the actual implementation that is important, but the conceptualizations behind it [8, 9]. A perfectly implemented simulation that starts from the wrong theory will yield perfectly wrong results.
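Hand's algebraic formulation in the Carroll Towing opinion quoted in Section 1 reduces to a one-line decision rule: a precaution is warranted whenever its burden B is less than the probability of loss P times the gravity of loss L. A minimal Python sketch of that rule; the function name and all numbers are purely hypothetical illustrations, not drawn from the case or this paper:

```python
def precaution_required(burden: float, probability: float, loss: float) -> bool:
    """Hand's rule from United States v. Carroll Towing: an actor is
    negligent for omitting a precaution when B < P * L, i.e., when the
    precaution costs less than the expected loss it would prevent."""
    return burden < probability * loss

# Hypothetical: a $5,000 extra test effort against a 2% chance of a
# $1,000,000 failure (expected loss $20,000) is clearly warranted.
print(precaution_required(5_000, 0.02, 1_000_000))    # True
print(precaution_required(25_000, 0.02, 1_000_000))   # False
```

This is the same comparison that the quality-cost and negligence analyses discussed later in the paper put into dollar terms.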
The situations that simulations model are typically not perfectly understood. That is, in a typical simulation, there are many major unknowns. If everything important was known, we wouldn’t need a simulation. Therefore we are unlikely to have a strong set of specifications (requirements or design) or reliable oracles for evaluating the simulation. This requires us to rethink the role and strategy of verification and validation [10].

If part of a simulation specification is thoroughly and credibly developed, this probably means that the development team (defined broadly to include subject matter experts and the customer) has such a good understanding of this aspect of the situation being simulated that they can provide clear instructions. Coding errors and misunderstandings of the specification are certainly possible, but we should expect more problems, and worse problems, with all of those aspects of the simulation that were developed without the benefit of a good specification, because these are probably the parts of the system that are not well understood and do not benefit from stabilized human expertise.

Another complication is that the user’s understanding of the system or situation being simulated will probably evolve over time, partially through interaction with draft versions of the simulation. In other words, requirements are likely to drift significantly [11].

These characteristics of simulations make it difficult to rely on traditional stopping rules for software testing, such as:

• Structural coverage: Execution of every statement, or every statement and branch, or similar simple coverage criteria. The code that is there might work, but is it the right code? Does it do the right tasks? What about the conditions not considered, and therefore not included in the code?
• Coverage of the specification: Execution of one or a few test cases for each statement or condition or data item listed in the specification. If we already know that the specification is significantly incomplete and incorrect, then we know that tests focused on the specification might be useful, but they are far from sufficient even for a superficial test of the software application.
• Adherence to a traditional development process: Effective simulation development might look much more like an agile development project, an approach that relies on rapid iteration and close interaction with users or their subject-matter-expert representatives, rather than implementation of a well-understood solution to a well-understood problem.
• Low failure rates from testing: The rate of failures from a suite of regression tests is not very informative if you don’t know whether the tests in the suite are representative of the usage in the field [12]. The specification cannot be an effective surrogate for understanding usage in the field because it is incomplete and incorrect.

3. An Analogy: Testing Commercial Software Products

In two important ways, commercial software projects are often like simulations:

1. A software product attempts to automate some task or system, but the humans who do the tasks that are being automated often use personal knowledge and heuristics that are hard to capture. In effect, the specification (if there is one) is an imperfect model of the human system being automated, and the implementation of that specification is a simulation of that human system.
2. These projects often have incomplete, evolving specifications and multiple influential stakeholders who have conflicting interests (and therefore conflicting success criteria for the project).

In practice, developing stopping rules for commercial software testing has been very difficult. Summarizing the results of several working meetings of test managers and consultants, Kaner [13] listed a few hundred different measures of testing progress that are in use in the field. There is remarkably little agreement on which subset of measures, or even which subset of types of measures, should be used to determine how much testing has been done and to estimate how much testing is left.

As with simulations, the oracle problem is difficult for commercial software. In an influential presentation to practitioners, Doug Hoffman [14] argued that all oracles for commercial software products are heuristic (useful decision rules, but not guaranteed to be correct). Hoffman made several arguments, but his argument from complexity is the most relevant to simulations:

1. An oracle provides a mechanism for determining whether the program’s behavior was correct, given a set of preconditions, a specified action, and the observed results.
2. The first problem is that we never fully specify the preconditions. (How much free memory is in the system? When was the last garbage collection cycle? How fragmented is the hard disk? Did the test just before this one send data to the printer?) One could argue that many conditions like these are (or should be) irrelevant, so long as the values are “reasonable.” However, many hard-to-reproduce
bugs turn out to have unexpected values in variables that the testers and programmers didn’t expect to be relevant and therefore didn’t formally observe and control. The list of potentially-relevant (not probably-relevant, but potentially-relevant) variables is enormous.
3. The second problem is that the user action is often loosely specified. For example, timing of a test (time between keystrokes, for example) is sometimes critical to reproducing a failure, but this is rarely specified. On systems (like Windows) that have ongoing background task processing by the operating system, it seems impossible to know what else (what other user demands) competes for CPU attention, memory, and other system resources.
4. The third problem is that the output is not fully specified. For example, if a program’s task is to add integers, and it adds 2+7, obtaining 9, is this correct? It might look correct, but what if the program took 6 hours to add those two numbers? What if, along with displaying the answer on the screen (as expected), it also sent the result as an email to everyone in your email client’s address book? These examples might seem to obviously violate a rule of reason, but how many test specifications for a function that adds two numbers would instruct the tester to check for timing, memory leaks, file corruption, email activation, or the many other things that could theoretically go wrong (and sometimes do)?
5. Given the unspecified aspects of the test (almost every aspect of the test as it is actually run is unspecified), the program might appear to pass but actually fail, or it might appear to fail but actually be behaving correctly under the circumstances. That is, the decision rule provided by the oracle is probably correct most of the time, but it is not guaranteed. It is a heuristic, rather than a genuine rule.
6. If an oracle is a mechanism for deciding whether a product’s behavior is correct or not, an oracle heuristic is an imperfect but useful mechanism for deciding whether the behavior is correct or not. Hoffman’s argument was that all software oracles are heuristic oracles.

James Bach extended Hoffman’s idea while he was developing Microsoft’s testing process for qualifying products as Windows 2000 compatible [15]. He identified a collection of patterns in the ways that Microsoft engineers argued that a product was or was not defective. These, he reasoned, were the implicit oracle heuristics that Microsoft’s development staff relied on. Kaner & Bach [16] provide a video lecture that presents and discusses Bach’s heuristic oracles. Here is the list of heuristics from that video (see [17] for additional course slides):

• Consistent within product: The behavior of a function should be consistent with the behavior of comparable functions or functional patterns within the product.
• Consistent with comparable products: The behavior of a function should be consistent with that of similar functions in comparable products.
• Consistent with history: The present behavior of a function should be consistent with past behavior.
• Consistent with our image: The behavior of a function should be consistent with the image the organization wants to project.
• Consistent with claims: The behavior of a function should be consistent with documentation or ads.
• Consistent with specifications or regulations: The behavior of a function should be consistent with claims that must be met.
• Consistent with user’s expectations: The behavior of a function should be consistent with what we think users want.
• Consistent with purpose: The behavior of a function should be consistent with the product or function’s apparent purpose.

This heuristic approach is relevant to simulations because the exact, correct behavior of the simulation is probably unknown (for many systems, if we knew them completely, we wouldn’t have to simulate them). Even if we have theoretically complete knowledge of the system, the simulation is a simplification for teaching, demonstration, or some other practical purpose, and the ultimate criterion is whether the simulation is valuable, not whether it is perfectly correct in all details.

The shared complexities of simulation software and (much) commercial software provide a basis for thinking that decision rules used to manage the complexity of commercial software testing might be relevant to simulation evaluation.

We will focus the rest of this paper on one type of decision rule used in commercial projects, the cost/benefit analysis, as applied to project management. We’ll consider two variations on this theme:

1. Quality cost analysis as developed by the American Society for Quality (e.g. [18]).
2. Negligence analysis as applied to the American law of products liability (e.g. [19]).
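The idea of a heuristic oracle can be made concrete. The sketch below is our illustration, not Hoffman's or Bach's code; it checks one behavior against two of the consistency heuristics above: agreement with a comparable implementation, and Hoffman's point that an "obviously correct" answer can still fail on unspecified dimensions such as timing. All names and thresholds are invented:

```python
import time

def reference_add(a: int, b: int) -> int:
    # Stand-in for a comparable product, an earlier version, or a trusted
    # model ("consistent with comparable products" / "consistent with history").
    return a + b

def heuristic_add_oracle(candidate, a: int, b: int, max_seconds: float = 1.0):
    """Return a list of observed inconsistencies. An empty list is evidence
    of correct behavior, not a guarantee: the oracle is a heuristic, and
    unchecked dimensions (memory, side effects, ...) may still be wrong."""
    start = time.perf_counter()
    actual = candidate(a, b)
    elapsed = time.perf_counter() - start

    problems = []
    if actual != reference_add(a, b):
        problems.append(f"result {actual} disagrees with the reference")
    if elapsed > max_seconds:
        problems.append(f"took {elapsed:.1f}s; adding integers should be fast")
    return problems

print(heuristic_add_oracle(lambda a, b: a + b, 2, 7))      # [] -- nothing suspicious seen
print(heuristic_add_oracle(lambda a, b: a + b + 1, 2, 7))  # one inconsistency reported
```

A simulation analogue would compare a run against field data, an earlier validated version, or a simplified analytical model, and route any flagged inconsistency to a subject matter expert rather than declaring failure outright.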
4. Overview of Cost-Benefit Analysis

We use the phrase “managing a project under uncertainty” as a shorthand for the complexity of the problem space when stakeholder interests conflict, stakeholder requirements shift over time, development starts without full knowledge (let alone specification) of the solution to be achieved, oracles are sometimes informal, unavailable, or untrustworthy outside a narrow range of parameters, exhaustive testing is impossible, and it is difficult to meaningfully measure how much testing is completed, let alone how much risk has been mitigated.

How can a project manager or a test manager operate effectively and ethically when working with a project under uncertainty?

Rather than focusing on the specifics of the individual decisions or decision criteria, the quality-cost literature and the products liability literature create a framework for thinking about the pattern of decisions as a whole.

5. Quality Cost Analysis

“Because the main language of [corporate management] was money, there emerged the concept of studying quality-related costs as a means of communication between the quality staff departments and the company managers.” [20, p. 4.2]

Joseph Juran, one of the world’s leading quality theorists, began advocating analysis of quality-related costs in 1951, when he published the first edition of his Quality Control Handbook. Feigenbaum made it one of the core ideas underlying the Total Quality Management movement [21]. It is a tremendously powerful tool for product quality, including software quality.

Quality costs are the costs associated with preventing, finding, and correcting defective work. For commercial products, these costs are huge, running at 20%-40% of sales [20, p. 4.2]. (We are not aware of credible data on quality costs for commercial software.) Many of these costs can be significantly reduced or completely avoided.

One fundamental objective of quality engineering is to drive down the total cost of quality associated with a product or product line. Notice that this is full lifecycle cost, not just cost of development.

Here are six useful definitions, as applied to software products. Figure 1 gives examples of the types of cost. Most of Figure 1’s examples are (hopefully) self-explanatory, but we’ll provide some additional notes on a few of the costs. (These are our translations of the ideas for a software development audience. More general, and more complete, definitions are available in [18] as well as in Juran’s and Feigenbaum’s works.)

• Prevention Costs: Costs of activities specifically designed to prevent poor quality. Examples of “poor quality” include coding errors, design errors, mistakes in user manuals, as well as badly documented or unmaintainably complex code. (Note that most prevention costs don’t fit within a Verification Group’s budget. This money is spent by the programming, design, and marketing staffs.)
• Appraisal Costs: Costs of activities designed to find quality problems, such as code inspections and any type of testing. Design reviews are part prevention and part appraisal. (To the degree that you’re looking for errors in the proposed design itself when you do the review, you’re doing an appraisal. To the degree that you are looking for ways to strengthen the design, you are doing prevention.)
• Failure Costs: Costs that result from poor quality, such as the cost of fixing bugs and the cost of dealing with customer complaints.
• Internal Failure Costs: Failure costs that arise before a company supplies its product to the customer. Along with costs of finding and fixing bugs are many internal failure costs borne by groups outside of Product Development. If a bug blocks someone in the vendor company from doing her job, the costs of her wasted time, missed milestones, and overtime to get back onto schedule are all internal failure costs. (Including costs like lost opportunity and cost of delays in numerical estimates of the total cost of quality can be controversial. Campanella doesn’t include these in a detailed listing of examples [18, Appendix B]. Gryna [20, 4.9-4.12] recommends against including costs like these in the published totals because fallout from the controversy over them can kill the entire quality cost accounting effort. We include them here because in Kaner’s industrial experience in Silicon Valley, as a project manager, test manager, and development director, these were very useful in practice, even if it might not make sense to include them in a balance sheet.)
• External Failure Costs: Failure costs that arise after a company supplies the product to the customer, such as the costs of customer service, maintenance, warranty repairs, and public relations efforts to soften the impact of a bad failure on the vendor’s reputation.
• Total Cost of Quality: The sum of the costs: Prevention + Appraisal + Internal Failure + External Failure.
PREVENTION
• Staff training
• Requirements analysis
• Early prototyping
• Fault-tolerant design
• Defensive programming
• Usability analysis
• Clear specification
• Accurate internal documentation
• Evaluation of the reliability of development tools (before buying them) or of other potential components of the product

APPRAISAL
• Design review
• Code inspection
• Glass box testing
• Black box testing
• Training testers
• Beta testing
• Test automation
• Usability testing
• Pre-release out-of-box testing by customer service staff

INTERNAL FAILURE
• Bug fixes
• Regression testing
• Wasted in-house user time
• Wasted tester time
• Wasted writer time
• Wasted marketer time
• Wasted advertisements (1)
• Direct cost of late shipment (2)
• Opportunity cost of late shipment

EXTERNAL FAILURE
• Technical support calls
• Preparation of support database
• Investigation of customer complaints
• Refunds and recalls
• Coding / testing of interim bug fix releases
• Shipping of updated product
• Added expense of supporting multiple versions of the product in the field
• PR work to soften drafts of harsh reviews
• Lost sales
• Lost customer goodwill
• Discounts to resellers to encourage them to keep selling the product
• Warranty costs
• Liability costs
• Government investigations (3)
• Penalties
• All other costs imposed by law

Figure 1. Examples of quality costs associated with software products (4)

Here are a few additional notes on the figure:


(1) Imagine buying advertisements for a product that should release to the public in early 2009 and then releasing it in
2010.
(2) If bug fixes caused late shipment of a product, the direct cost of late shipment includes rush shipping fees and lost
sales. The opportunity cost includes costs of delaying other projects while everyone finishes this one.
(3) Cost of cooperating with a government investigation, including legal fees, whatever the outcome.
(4) Gryna [20] cautions against presenting estimates that include costs that other managers might challenge as not quality-related. A perception that you are padding the numbers for dramatic effect can destroy the credibility of your estimates. Consistent with this, in Figure 1 we omit such costs as high turnover (staff quitting over quality-related frustration) and lost pride (people will work less hard, with less care, if they believe the final product will be low quality no matter what they do). We don’t know how to credibly include them in our totals, but that doesn’t mean we shouldn’t bear them in mind. They can be very important.
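Figure 1's categories roll up into the total cost of quality defined above (Prevention + Appraisal + Internal Failure + External Failure). A toy sketch of that bookkeeping, with invented numbers; a real program would track far more line items, and, per Gryna's caution in note (4), every line item must survive challenge:

```python
# Invented annual figures, in thousands of dollars, for illustration only.
quality_costs = {
    "prevention":       {"staff training": 50, "requirements analysis": 80},
    "appraisal":        {"code inspection": 120, "black box testing": 200},
    "internal failure": {"bug fixes": 300, "wasted tester time": 90},
    "external failure": {"technical support calls": 400, "refunds and recalls": 60},
}

# Subtotal each category, then sum the four subtotals into the total cost
# of quality (Prevention + Appraisal + Internal Failure + External Failure).
subtotals = {category: sum(items.values())
             for category, items in quality_costs.items()}
total_cost_of_quality = sum(subtotals.values())

print(subtotals["external failure"])  # 460
print(total_cost_of_quality)          # 1300
```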
The power of the quality-cost approach is that it translates qualitative concerns into financial estimates that can attract the attention of more senior managers.

Some companies treat quality-related costs very formally and in great detail, with quality-cost management databases. Others use this concept more tactically. Here’s an example of tactical use:

Suppose you think the user interface, as designed, will lead to user errors as a person uses a product to control a device or analyze a situation under time pressure. You run some quick usability tests and demonstrate that in your pool of users, people made some characteristic mistakes. Approaching other members of the product development staff, you could build arguments, with them, that helping people avoid these mistakes in real use in the field will require higher documentation costs (get an estimate from the writers), higher training costs (get estimates from trainers or training materials developers), higher post-sales support costs, reputational damage if the mistakes are serious, and goodwill damage with the users if an investigation of mistakes made in the field blames them on “pilot error.” The estimates of these costs don’t have to be perfect, but each one must come from a credible source and be in some way plausible and justifiable.

Along with arguing about individual bugs or decisions, this approach opens opportunities to make business cases on broader issues. Here are three examples that verification staff might be involved in:

• What savings could be realized if some programming staff and some verification staff spent time with users in the field, developing an understanding of the subject matter and the complexity of the situations being simulated? If the design is more robust (because the programmers / low-level designers have more intuition about the types of changes that are likely) and the tests are more realistic and more powerful, how much time and money is this worth?
• Could savings be achieved by increasing the programming staff and requiring them to develop and maintain suites of unit tests that check for common errors (such as boundary conditions) in every variable, compared to doing this work through verification at the black box system level later on? What additional code inspection (inspection of the unit tests) would be required, and who should do it? With the added programming and inspection costs, is this a cost-favorable, neutral, or cost-wasteful change?
• One of the common assertions from the agile development movement [22] is that intense automated unit testing provides a foundation for refactoring [23] and, through that, relatively safe and cost-controlled maintenance. Agile development was created to cope with poorly developed, rapidly changing user requirements. Would a shift in methodology that included some of the agile practices, such as using extensive unit test suites to improve maintainability, be appropriate for a simulation project? How much conceptual rework is likely on this particular simulation? How much will require complete discarding of implemented parts of the system, rather than rework? How would those impact the cost and value of the unit test suites?

By providing an economic structure for these discussions, quality-cost analysis describes rules of research and engagement for staff who are inclined to vigorously argue for or against choices of broad process or specific practices. That is, people who can make a credible quality-cost argument for a given decision are more likely to be taken seriously, and people (of comparable stature in the company) who cannot or will not develop cost/benefit estimates are more likely to be ignored or bluntly told to quit wasting people’s time on subjective hunches about project management and product quality.

The problem with this approach is that quality cost analysis looks at the company’s costs, not the customer’s costs. The manufacturer and seller are definitely not the only people who suffer quality-related costs. When a manufacturer sells a bad product, the customer faces significant expenses in dealing with that bad product.

The Ford Pinto litigation provided a classic example of quality cost analysis. Among the documents produced in these cases was the Grush-Saunby report [24, p. 841; 25, p. 225], which looked at costs associated with fuel tank integrity. The key calculations appeared in Table 3 of their report:
Benefits and Costs Relating to Fuel Leakage Associated with the Static Rollover Test Portion of FMVSS 208

Benefits
Savings: 180 burn deaths, 180 serious burn injuries, 2100 burned vehicles
Unit Cost: $200,000 per death, $67,000 per injury, $700 per vehicle
Total Benefit: 180 x ($200,000) + 180 x ($67,000) + 2100 x ($700) = $49.5 million

Costs
Sales: 11 million cars, 1.5 million light trucks
Unit Cost: $11 per car, $11 per truck
Total Cost: 11,000,000 x ($11) + 1,500,000 x ($11) = $137 million

Figure 2. The cost of quality analysis for the Ford Pinto
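The arithmetic behind Figure 2 is easy to replay. The benefit side multiplies out to exactly the $49.5 million shown; the cost side multiplies out to $137.5 million, which the table states as $137 million:

```python
# Benefit side: losses Ford expected to avoid by fixing the fuel tanks.
deaths, injuries, vehicles = 180, 180, 2100
total_benefit = deaths * 200_000 + injuries * 67_000 + vehicles * 700
print(total_benefit)   # 49530000 -- the $49.5 million in Figure 2

# Cost side: $11 per unit across 11 million cars and 1.5 million light trucks.
total_cost = (11_000_000 + 1_500_000) * 11
print(total_cost)      # 137500000 -- stated in the table as $137 million

# The memo's internal logic: the fix fails a company-cost comparison,
# because fixing costs more than the (company-borne) losses avoided.
print(total_cost < total_benefit)   # False
```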

This is an internal cost of quality analysis. Ford estimated that it would cost $137 million to fix the gas tanks of these cars, and that if they did not, they would be held accountable for 180 burn deaths, 180 serious burn injuries, and 2100 burned vehicles. (In fact, there would probably be more deaths and injuries, but this is the estimated number of cases that could be proved to have been caused by the exploding gas tanks.)

To Ford, the estimated cost of causing the death of a customer was $200,000 and the cost of serious burn injuries was $67,000. To Ford’s customers, the children of the dead father, the people who were disfigured for life with burns, those costs were probably much higher. How much pain would it cause for you if your mother or your daughter died? That’s not an economic loss, but it is a huge cost for the customer. Not for Ford.

From the point of view of quality cost analysis as it is documented by the American Society for Quality and typically used in practice, it doesn’t matter how high the cost of a defect is to a customer. That cost is made invisible. All that matters is cost to the company.

SELLER: EXTERNAL FAILURE COSTS

These are the types of costs absorbed by the seller that releases a defective product.

• Technical support calls
• Preparation of support database
• Investigation of customer complaints
• Refunds and recalls
• Coding / testing of interim bug fix releases
• Shipping of updated product
• Added expense of supporting multiple versions of the product in the field
• PR work to soften drafts of harsh reviews
• Lost sales
• Lost customer goodwill
• Discounts to resellers to encourage them to keep selling the product
• Warranty costs
• Liability costs
• Government investigations
• Penalties
• All other costs imposed by law

CUSTOMER: FAILURE COSTS

These are the types of costs absorbed by the customer who buys a defective product.

• Wasted time
• Lost data
• Lost business
• Embarrassment
• Frustrated employees quit
• Demos or presentations to potential customers fail because of the software
• Failure when attempting other tasks that can only be done once
• Cost of replacing product
• Cost of reconfiguring the system
• Cost of recovery software
• Cost of tech support
• Injury / death

Figure 3. Comparing sellers' costs and customers' costs
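Figure 3's point can be made concrete with a toy spreadsheet. All of the dollar amounts below are invented for illustration; the point is structural: a conventional cost-of-quality total sums only the seller's column, however large the customer's losses are.

```python
# Hypothetical failure costs for one defective release (invented numbers).
seller_costs = {          # visible in a cost-of-quality analysis
    "technical support calls": 250_000,
    "refunds and recalls": 400_000,
    "interim bug-fix release": 150_000,
}
customer_costs = {        # invisible in a cost-of-quality analysis
    "wasted time": 2_000_000,
    "lost data": 5_000_000,
    "lost business": 8_000_000,
}

cost_of_quality = sum(seller_costs.values())   # what the seller's spreadsheet totals
societal_cost = cost_of_quality + sum(customer_costs.values())

print(f"Seller's external failure cost: ${cost_of_quality:,}")  # $800,000
print(f"Cost to seller and customers:   ${societal_cost:,}")    # $15,800,000
```

In this example the spreadsheet reports about 5% of the total harm, which is exactly the invisibility described above.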


When costs to the customer are high, cheated or injured customers have an incentive to transfer some of their costs back to the seller. They do this through litigation.

Thus, in cases like Grimshaw v. Ford [26], juries awarded millions in punitive damages to the victims of the exploding gas tanks. Punitive damages are awarded when the defendant's actions are determined to be sufficiently outrageous. They are awarded not to compensate the victim but to punish the defendant, to change the economics of the defendant's actions. Big products liability verdicts change the cost of selling dangerously defective cars.

The Ford case is a classic example of a chronic problem. Another example from the automotive industry is the more recent General Motors Corp. v. Johnston [27]: a PROM controlled the fuel injector in a pickup truck. The truck stalled because of a defect in the PROM and, in the ensuing accident, Johnston's seven-year-old grandchild was killed. The Alabama Supreme Court justified an award of $7.5 million in punitive damages against GM by noting that GM "saved approximately $42,000,000 by not having a recall or otherwise notifying its purchasers of the problem related to the PROM."

The well-publicized cases involve disastrous personal injuries, but there are plenty of cases against computer companies and software companies for breach of contract, breach of warranty, fraud, etc.

The fundamental problem of cost-of-quality analysis is that it sets sellers up to underestimate customer dissatisfaction and litigation risks. Many sellers' staff think, when they have estimated the total cost of quality associated with a project, that they have done a fairly complete analysis. But if sellers don't take customers' external failure costs into account, then when those customers resort to litigation to transfer back to the sellers some of the enormous costs the sellers thought they had successfully pushed off onto the customers, the sellers' staff can be surprised by huge new costs (lawsuits) arising from decisions that they thought, in their incomplete analyses, were safe and reasonable.

6. Negligence Analysis

The greatest strength of quality cost analysis is that it is persuasive. If you can demonstrate to a business that it can reduce its costs by X% by spending less than this to improve the company's quality processes or products, most executives will listen carefully. They will want to make these changes.

The greatest weakness of quality cost analysis is that it ignores large categories of costs, such as costs borne by the customer instead of the seller. It creates an incentive for hiding costs and risks by externalizing them (pushing them onto the customer). As another example, technical support costs used to cost software publishers an average of $23 per call [28], creating enormous incentives to improve the quality of consumer software. Those incentives flipped when publishers started charging for technical support. Customer dissatisfaction lost its biggest indicator in the quality cost spreadsheets.

Let's bring this back to simulations: the Ford cases are all the more compelling when we think about software used to help people deal with (for example) the complexities of battlefield conditions. Lives are at stake. Analyses that focus only on producers' costs are, arguably, a fundamental breach of the trust under which the contract for developing the simulation was created. The objective of the effort is to protect the warfighter, not to protect the profitability of the vendor (although, certainly, that profitability is a justifiably essential requirement for any company that wants to survive in this market space). The solution is not to ignore the vendor's costs but to find a way to also include costs (risks) to the customer / user.

It is hard to take quality cost analysis seriously as an ethical guide for responsible, professional conduct by the verification engineer or the other development staff, because it ignores the risks to the warfighter. The analysis is too deeply self-serving to serve as a guide for how to deal with others with integrity.

Negligence analysis, as laid out by Learned Hand, provides an alternative. It looks almost like quality cost analysis, but is different in one critical way.

Here again are his factors, mapped to failure of a product:

• A potential failure (F), causing a loss (L)
• The probability that this failure will happen in the field (P)
• Measures that can be taken to prevent that loss, with a cost of C

Preventative measures are called for if C < P*L.

Preventative measures are unreasonably expensive and need not be done if C > P*L.

The difference between quality cost analysis and negligence analysis is this:

• Under a quality cost analysis, we estimate L by totaling all of the manufacturer's costs associated with the failure.
• Under a negligence analysis, we estimate L by totaling all of society's costs (costs of failure to the user and anyone else impacted by the failure).

Unlike quality cost analysis, negligence analysis tells the manufacturer to balance its own costs against the risks that its products create for society.

It's important to distinguish here between the use of this analysis as a personal guide to ethical analysis of a situation and its use as a legal criterion for liability.

As a tool for determining legal liability, negligence analysis has been criticized on several grounds. Here are some of the concerns raised:

• In the press of multimillion-dollar litigation, there are enormous incentives to bias the estimates of costs and risks. Expert witness testimony to juries will conflict, and jury opinions may depend more on their perception of the credibility of the expert than on the underlying soundness of the estimates.
• In the American justice system, juries are charged with making all decisions about the facts in a case. They decide what is true and what is false. The judge then applies the law to those facts. It is difficult to understand how a jury of people who have little mathematical knowledge and no engineering background can decide what the engineering facts of a case are.
• A design flaw exists whenever all copies of the product have the same flaw. Software bugs are all design flaws. A small flaw can have an enormous cumulative effect. The cumulative cost of 10 million people being slightly inconvenienced can be huge. Unless we change the criterion, to require correction of a problem only when social cost is much greater than manufacturer's cost, manufacturers of mass-market products will face so much litigation over tiny defects, involving decisions that are even harder for juries because they are close calls that require more careful judgment, that liability protection will become one of the main concerns of the business (for those manufacturers who don't simply abandon the field).

The American Law Institute [29] published a rebalancing of products liability law to deal with issues like these. While this work has been very influential in the legal community, the process and result were enormously difficult and controversial.

We don't have to express opinions about these very difficult issues in this paper, or join debates about how well the legal system is handling them, because we are not suggesting this approach as a legal framework. We are suggesting it as a way for V&V professionals to think about the implications of their work in a context that strips them of other familiar touchstones for determining whether the decisions and practices around them are responsible and appropriate or not.

In short, our proposal is this:

Suppose that we drove the scope of testing by considering risks and the ways to mitigate them [30]:

• Given a potential failure (a hazard), we could ask what kind of testing would be required to determine whether the software could fail in this way, how likely the failure is, and what it would take to fix it.
• What is our best starting estimate (guess) of the cost of doing the testing needed to expose failures of this kind?
• What is our best starting estimate of the severity of such a failure, taking into account cumulative impact on all stakeholders (the vendor, the warfighter, the agency paying for the software, etc.)?
• What is our best starting estimate of the probability of such a failure?

If the estimated additional cost of testing (and fixing) is significantly less than the estimated risk (cost times probability), we should do the testing. Similarly, if there is some other way to mitigate the risk (such as better programming rather than better testing) that costs less than the testing, we should do that instead.

If the estimated additional cost of mitigation is significantly greater than the risk, we should not mitigate the risk. It is too expensive.

If the estimated cost of mitigation is close to the risk, we should make the decision based on other heuristics, such as asking for guidance from the client who is paying for the simulation.

7. Stakeholder Analysis

To estimate the cumulative impact on all stakeholders, we have to identify all of the stakeholders and estimate the potential impact on each.

A stakeholder is normally defined as "anyone who could be materially affected by the implementation of a new system or application" [31, p. 40].

We prefer a refinement of this definition suggested by Gause & Weinberg [32]. Some stakeholders are favored, some are disfavored, and some we ignore.
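The decision rule from Section 6 combines naturally with this stakeholder framing: estimate L by summing the losses of every affected stakeholder, then compare the mitigation cost C against the risk P*L. The sketch below is ours; the function name, the 20% "significantly different" margin, and all of the numbers are invented illustrations, not a standard.

```python
def mitigation_decision(cost_to_mitigate, failure_probability,
                        stakeholder_losses, margin=0.2):
    """Learned Hand style test: compare C with P*L, where L totals the
    losses of all stakeholders, not just the vendor's costs."""
    loss = sum(stakeholder_losses.values())  # L: society-wide loss
    risk = failure_probability * loss        # P * L
    if cost_to_mitigate < (1 - margin) * risk:
        return "mitigate: C is significantly less than P*L"
    if cost_to_mitigate > (1 + margin) * risk:
        return "skip: C is significantly greater than P*L"
    return "close call: ask the client paying for the simulation"

# A hazard with a 1% chance of occurring in the field (invented figures).
losses = {"vendor": 200_000, "warfighter": 20_000_000, "sponsor": 1_000_000}
print(mitigation_decision(50_000, 0.01, losses))
# mitigate: C is significantly less than P*L
```

The three return branches correspond to the three cases discussed above; the margin models the word "significantly" and would itself be a judgment call in practice.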
• Most discussions of stakeholders are of favored stakeholders—the best system affects these people positively.
• In contrast, the successful system affects disfavored stakeholders negatively. For example, an embezzler is a disfavored stakeholder of a bank security system. An enemy combatant is a disfavored stakeholder of any system designed to protect our troops or improve their combat capabilities.

A complete stakeholder analysis should consider both classes. A system that protects our troops might be an utter failure if, in achieving this, it protects enemy combatants even more.

Robertson & Robertson [33] and Gause & Weinberg [32] provide guidance, including brainstorming suggestions, for identifying the many potential stakeholders for a project or product, classifying them (favored, disfavored), grouping them, and deciding how important they are. If you learn better from videos, Kaner [34] created a worked example on video for students in Computer Law & Ethics courses at Florida Tech and the University of Illinois at Springfield.

Some of the most important lessons of stakeholder analysis are:

• There are more stakeholders than you think.
• Disfavored stakeholders may be very important.
• A stakeholder can be affected by the same system, or even by the same aspect of the same system, in more than one way.
• Favored stakeholders often have conflicts of interest: the same aspect of the system may both benefit and interfere with the interests of the same person (for example, a particularly effective development process might reduce a vendor's costs to complete the project, thereby increasing its profit on a fixed-price contract, but might also reduce the need for later maintenance and thus the opportunities for maintenance-related profits).
• The interests of different favored stakeholders may conflict.
• A requirements document reflects a balance of power/influence among the stakeholders at the time it was written. Over time, the stakeholders change, their respective power changes, and the agreement reflected in the document no longer reflects the realities of the stakeholder group.

If you try too hard to do a complete stakeholder analysis, you will be paralyzed by the scope of the work: this is an open-ended creative process. It is probably impractical to identify every stakeholder, and it is probably impossible to tell whether you have identified all of them. It is often impractical to treat each stakeholder (such as every soldier who carries a certain type of weapon) individually. It is probably impossible to tell whether you have identified all of the interests (and thus the roots of all of the relevant costs and benefits) of a given stakeholder or stakeholder group.

Thus, a stakeholder-focused cost/benefit analysis will always be incomplete and uncertain. This is not a precise estimation task, because precision is not possible. It is a way of thinking about a system's effects, the tasks that you might consider that could improve those effects, and the likely net cost or net benefit of each such task.

The likely win from this type of analysis is that in doing the brainstorming, and in training yourself and your group to think along these lines, you will sometimes identify situations that will have a big effect on a class of stakeholders, where the impact is obviously significant compared to the cost of mitigating it. These make for easy decisions—once you identify them.

8. Developing Tests for Simulations

Much of traditional verification and validation involves relatively simple tests that focus on some part of a specification (requirements spec, design spec, etc.). Testing coverage is measured by checking that all parts of all of the controlling specifications have tests associated with them, not by how powerful or insight-provoking those tests might be [35].

If a product is fully and accurately specified by its collection of specification documents, and if the tests cover all of the parts of the specifications in a way that includes tests for each way that the product might fail to meet one of the specification parts, then one could argue that this testing fully verifies the behavior of the product. Kaner would consider that argument incorrect even in this case, but that is a controversy that we can avoid in this paper.

The problem is that even if "complete" specification-driven testing were sufficient when the specifications are complete and accurate, when they are not (as in simulations), testing that is solely driven by the specifications cannot be complete, or even adequate, by any reasonable criterion [36].

When dealing with complex systems, there is certainly value in creating a collection of fairly simple tests. If the program cannot do a specific function correctly, then running a test that involves that one function in combination with many others will involve a lot of work
to expose a simple failure, both in running the test and in troubleshooting the failure to the extent needed to reveal that the problem is as narrow and simple as it is. However, once the individual features seem to work on their own, there is a lot of value in tests that run these features together in ways that stakeholders will find meaningful.

In a simulation, if the system is not fully specifiable and the requirements / expected behavior are not fully understood, we won't always know what behavior to expect when we create a realistic test case. By "realistic test case", we mean a test in which the software is used to do something that a stakeholder will actually want to get done. A well-designed test of this type provides a sample of the program's behavior that a stakeholder can respond to and criticize. For example, a subject matter expert might say that it doesn't matter what the specification says: under these circumstances, the simulation should not behave this way. Or a person who will use the equipment or follow the process being simulated might react to a test by saying that it is unworkable, that if the real equipment (or the real process) works this way, it will be too complex (too slow, too confusing, too something-bad) to be useful. Tests of this class thus mix verification (checking the actual behavior of the functions tested in combination against the behavior you would predict from the specifications) and validation (checking whether the behavior is desirable).

Commercial software testers often call this type of testing scenario testing [37-39].

In a commercial software test, the typical scenario has five characteristics:

• The test is based on a story about how the program is used, including information about the motivations of the people involved.
• The story is motivating. A stakeholder with influence would push to fix a program that failed this test. (Anyone affected by a program is a stakeholder. A person who can influence development decisions is a stakeholder with influence.)
• The story is credible. It not only could happen in the real world; stakeholders would believe that something like it probably will happen.
• The story involves a complex use of the program or a complex environment or a complex set of data.
• The test results are easy to evaluate. This is valuable for all tests, but is especially important for scenarios because they are complex.

The Online M&S Glossary [6] defines scenarios differently: a simulation scenario under that definition might require a setup of the actual conditions being simulated, such as a realistic, live demonstration. That's more extreme than we intend.

Kaner & Bach [40] provide a video and slide set on the design of scenario tests. They identify 16 ways in which people generate scenarios. Each of these questions, taken on its own, might lead a tester to imagine a suite of related scenarios:

• Write life histories for objects in the system. How was the object created, what happens to it, how is it used or modified, what does it interact with, when is it destroyed or discarded?
• List possible users; analyze their interests and objectives.
• Consider disfavored users: how do they want to abuse your system?
• List system events. How does the system handle them?
• List special events. What accommodations does the system make for these?
• List benefits and create end-to-end tasks to check them.
• Look at the specific transactions that people try to complete, such as opening a bank account or sending a message. What are all the steps, data items, outputs, displays, etc.?
• What forms do the users work with? Work with them (read, write, modify, etc.).
• Interview users about famous challenges and failures of the old system.
• Work alongside users to see how they work and what they do.
• Read about what systems like this are supposed to do. Play with competing systems.
• Study complaints about the predecessor to this system or its competitors.
• Create a mock business. Treat it as real and process its data.
• Try converting real-life data from a competing or predecessor application.
• Look at the output that competing applications can create. How would you create these reports / objects / whatever in your application?
• Look for sequences: people (or the system) typically do task X in an order. What are the most common orders (sequences) of subtasks in achieving X?

These are much like the activities / analyses you would expect from a requirements analyst, but there are differences:

• Requirements analysts try to foster agreement about the system to be built. Testers exploit disagreements to predict problems with the system.
• Testers don’t have to reach conclusions or make 6. United States Department of Defense: Defense
recommendations about how the product should Modeling and Simulation Office, “Online M&S
work. They have to expose credible concerns to the Glossary (DoD 5000.59-M),” Book Online M&S
stakeholders. Glossary (DoD 5000.59-M), Series Online M&S
• Testers don’t make the product design tradeoffs. Glossary (DoD 5000.59-M), ed., Editor ed.^eds., pp.
They expose consequences of those tradeoffs, 7. G.E.P. Box and N.R. Draper, Empirical Model-
especially consequences that are unanticipated or Building and Response Surfaces, Wiley, 1987.
more serious than expected. 8. D.K. Pace, “The Value of a Quality Simulation
• Testers don’t have to respect agreements that were Conceptual Model,” 2002;
recorded in the requirements documents or other http://www.modelingandsimulation.org/MandS0101
specifications. They can report anything as a /Pace0101.html.
problem that they think a stakeholder with influence 9. R.G. Sargent, “Verification and Validation of
should (but does not yet) know. Simulation Models,” Proc. 31st Winter Simulation
• The set of scenario tests need not be exhaustive, just Conference, Society for Computer Simulation, 2000.
useful. 10. P. Castro, ed., Modeling and Simulation in
Manufacturing and Defense Acquisition,
A narrower definition of the “scenario” is an instantiation Enhancements for 21st Century Manufacturing and
of a use case—essentially a description of a sequence of Defense Acquisition, National Academies Press,
tasks or steps that the system should go through, along 2002.
with the data it should receive and the responses it should 11. S. Youngblood, RPG Special Topics: Requirements,
generate [41]. In this tradition, the use case focuses on Defense Modeling and Simulation Office, 2004.
what the system should do, abstracting out the individual 12. J. Musa, Software Reliability Engineering, McGraw-
and her or his motivations [42, 43]. This is often useful Hill, 1999.
for systems analysis, but not for our purposes in this 13. C. Kaner, “Measurement issues and software testing
paper. (Keynote address),” Book Measurement issues and
software testing (Keynote address), Series
We want a testing approach that helps us imagine why Measurement issues and software testing (Keynote
someone would be unhappy with the behavior of a system address), ed., Editor ed.^eds., 2001, pp.
and how unhappy it would make them, because that is the 14. D. Hoffman, “A taxonomy of test oracles,” Book A
type of information that helps us estimate the cost of this taxonomy of test oracles, Series A taxonomy of test
(mis)behavior, which is exactly what we need for oracles, ed., Editor ed.^eds., 1998, pp.
cost/benefit analysis. 15. J. Bach, “General Functionality and Stability Test
Procedure for Microsoft Windows 2000 Application
References Certification,” 1999;
http://www.satisfice.com/tools/procedure.pdf.
16. C. Kaner and J. Bach, “A Course in Black Box
1. United States v. Carroll Towing Co., vol. 159, 1947 Software Testing: Fundamental Issues in Software
p. 169 (United States Circuit Court of Appeals, Testing: The Oracle Problem ” 2005;
Second Circuit (Learned Hand)). http://www.testingeducation.org/k04/video/Overvie
2. The T.J. Hooper, vol. 60, 1932 p. 737 (United States wPartC.wmv.
Court of Appeals, Second Circuit). 17. J. Bach and M. Bolton, “Rapid Software Testing
3. S. Youngblood, “DoD Verification, Validation and (course slides),” Book Rapid Software Testing
Accreditation Recommended Practices Guide,” (course slides), Series Rapid Software Testing
Book DoD Verification, Validation and (course slides), Version 2.1.3 ed., Editor ed.^eds.,
Accreditation Recommended Practices Guide, Series 2007, pp.
DoD Verification, Validation and Accreditation 18. J. Campanella, Principles of Quality Costs, ASQ
Recommended Practices Guide, ed., Editor ed.^eds., Quality Press, 1999.
1996, pp. 19. W.P. Keeton, et al., Prosser and Keeton on Torts,
4. R.G. Sargent, “An Overview of Verification and West Publishing, 1984.
Validation of Simulation Models,” Proc. Winter 20. F.M. Gryna, “Quality Costs,” Juran's Quality
Simulation Conference, Society for Computer Control Handbook, 4th ed., McGraw Hill, 1988.
Simulation, 1987. 21. A.V. Feigenbaum, Total Quality Control McGraw-
5. M. Kilikauskas and D. Hall, “The Use of M&S Hill, 1991.
VV&A as a Risk Mitigation Strategy in Defense 22. K. Beck, Extreme Programming Explained:
Acquisition,” Journal of Defense Modeling and Embrace Change, Addison-Wesley, 2005.
Simulation, vol. 2, no. 4, 2005, pp. 209-216.
23. M. Fowler, Refactoring: Improving the Design of Existing Code, Addison-Wesley, 2000.
24. W.P. Keeton, et al., Products Liability and Safety: Cases and Materials, Foundation Press, 1989.
25. R.A. Posner, Tort Law: Cases and Economic Analysis, Little Brown & Co., 1982.
26. Grimshaw v. Ford Motor Co., vol. 174, 1981, p. 348 (California Court of Appeal).
27. General Motors Corp. v. Johnston, vol. 592, 1992, p. 1054 (Supreme Court of Alabama).
28. C. Kaner and D.L. Pels, Bad Software: What To Do When Software Fails, Wiley, 1998.
29. American Law Institute, Restatement of the Law Third, Torts: Products Liability, American Law Institute Publishers, 1998.
30. United States Coast Guard, "Risk-based Decision-making Guidelines," http://www.uscg.mil/hq/g-m/risk/e-guidelines/hazop.htm.
31. D. Leffingwell and D. Widrig, Managing Software Requirements: A Unified Approach, Addison-Wesley, 2000.
32. D.C. Gause and G.M. Weinberg, Exploring Requirements: Quality Before Design, Dorset House, 1989.
33. S. Robertson and J. Robertson, Mastering the Requirements Process, Addison-Wesley, 1999, p. 404.
34. C. Kaner, "Interest Analysis," 2006; http://www.testingeducation.org/k04/video/CopyrightInterestAnalysis.wmv.
35. P.K. Davis, "Generalizing Concepts and Methods of Verification, Validation and Accreditation (VV&A) for Military Simulations," RAND, 1992.
36. D. Brade, "Enhancing M&S Accreditation by Structuring Verification and Validation Results," Proc. Winter Simulation Conference, Society for Computer Simulation, 2000.
37. C. Kaner, "The power of 'What If…' and nine ways to fuel your imagination: Cem Kaner on scenario testing," Software Testing & Quality Engineering, vol. 5, 2003, pp. 16-22.
38. J.M. Carroll, ed., Scenario-Based Design: Envisioning Work and Technology in System Development, John Wiley & Sons, 1995.
39. R.M. Young and P.B. Barnard, "The use of scenarios in human-computer interaction research: Turbocharging the tortoise of cumulative science," Proc. CHI+GI'87: Conference on Human Factors in Computing Systems and Graphics Interface, ACM Press, 1987, pp. 291-296.
40. C. Kaner and J. Bach, "A Course in Black Box Software Testing: Scenario Testing," 2004; http://www.testingeducation.org/BBST/ScenarioTesting.html.
41. J. Rumbaugh, et al., The Unified Modeling Language Reference Manual, Addison-Wesley, 1999.
42. A. Cockburn, Writing Effective Use Cases, Addison-Wesley, 2001.
43. S. Adolph and P. Bramble, Patterns for Effective Use Cases, Pearson Education, 2003.
44. C. Kaner, "Quality cost analysis: Benefits and risks," Software QA, vol. 3, no. 1, 1996, p. 23.

Acknowledgements

These notes are partially based on research that was supported by NSF Grant 0629454, "Learning Units on Law and Ethics in Software Engineering," to Cem Kaner. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Much of the quality cost discussion, including Figure 1, updates work published in [44]. Much of the scenario discussion was previously published in [37] and [40].
