
BEST TEST DESIGN

Benjamin D. Wright
Mark H. Stone
University of Chicago

MESA PRESS
Chicago
1979

MESA PRESS
5835 Kimbark Ave.
Chicago, IL 60637

Copyright © 1979 by Benjamin D. Wright and Mark H. Stone

All rights reserved

Printed in the United States of America

Library of Congress Catalog Card Number 79-88489

Cover Design: Andrew W. Wright

Graphics and Typesetting: Betty Stonecipher
ACKNOWLEDGEMENTS

This book owes more to Georg Rasch than words can convey. The many months of painstaking and inspired tutoring with which he introduced me to the opportunities and necessities of measurement during 1960 are the foundation on which Mark Stone and I have built. I am also indebted to Jane Loevinger for her common sense during the 1965 Midwest Psychological Association Symposium on "Sample-free Probability Models in Psychological Measurement," to Darrell Bock for encouraging me to talk about "Sample-free Test Construction and Test-free Ability Measurement" at the March, 1967 meeting of the Psychometric Society and to Benjamin Bloom for insisting on "Sample-free Test Calibration and Person Measurement" at the 1967 ETS Invitational Conference on Testing Problems.
Benjamin D. Wright

David Farr initiated an AERA Training Session on "Sample-free Test Calibration and Person Measurement in Educational Research" in 1969. Since then, thanks to the support of Richard Jaeger and William Russell, there have been several AERA "presessions." These forums for debate on the theory and practice of Rasch measurement have fostered many instructive collaborations. For these lessons in how to make Rasch measurement useful we are especially indebted to Louis Bashaw, Richard Woodcock, Robert Rentz and Fred Forster.

The long hard work of mathematical analysis, computer programming and empirical verification which makes Rasch measurement not only practical but even easy to use was done with friends in the MESA program at The University of Chicago. This great adventure began in 1964 with Bruce Choppin and Nargis Panchapakesan and continues with David Andrich, Graham Douglas, Ronald Mead, Robert Draba, Susan Bell and Geoffrey Masters. We dedicate our book to them.

Benjamin D. Wright
Mark H. Stone
The University of Chicago
March 30, 1979
FOREWORD

0.1 HOW TO USE THIS BOOK

This is a handbook for learning how to do Rasch measurement. We give some theoretical explanations, but we emphasize practice. Chapters 2, 4, 5 and 7 use a small problem to illustrate the application of Rasch measurement in complete detail. To those of you who learn best by doing we recommend going directly to Chapters 2 and 4 and working through the practical problem developed there. Next study the sections of Chapters 5 and 7 that continue the analysis of this problem and then go back to Chapter 1.

Behind the practical procedures of Rasch measurement are their reasons. The methodological issues that motivate and govern Rasch measurement are developed in Chapters 1, 5 and 6. To those of you who like to begin with theory, we recommend reading these chapters before working the problem in Chapters 2 and 4. Finally, if Rasch measurement is entirely new to you, you might want to begin with Section 0.3 of this Foreword, which is an introduction to the topic given at the October 28, 1967 ETS Invitational Conference on Testing Problems (Wright, 1968).

Section 0.2 reviews the motivation and history of the ideas that culminated in Rasch's Probabilistic Models for Some Intelligence and Attainment Tests (1960). The references cited there and elsewhere in this book are focused on work that (1) bears directly on the discussion, (2) is in English and (3) is either readily available in a university library or can be supplied by us.

0.2 MOTIVATION AND HISTORY

Fifty years ago Thorndike complained that contemporary intelligence tests failed to specify "how far it is proper to add, subtract, multiply, divide, and compute ratios with the measures obtained." (Thorndike, 1926, 1). A good measurement of ability would be one "on which zero will represent just not any of the ability in question, and 1, 2, 3, 4, and so on will represent amounts increasing by a constant difference." (Thorndike, 1926, 4). Thorndike had the courage to complain because he believed he had worked out a solution to the problem for his own intelligence test. So did Thurstone (1925).

Thurstone's method was to transform the proportion in an age group passing any item into a unit normal deviate and to use these values as the basis for scaling. Common scale values for different age groups were obtained by assuming a linear relationship between the different scale values of items shared by two or more test forms, using the different group means and standard deviations as the parameters for a transformation onto a common scale. Thurstone redid a piece of Thorndike's work to show that his method was better (Thurstone, 1927). His "absolute scale" (1925, 1927) yields a more or less interval scale, but one which is quite dependent on the ability distribution of the sample used. In addition to item homogeneity, the Thurstone method requires the assumption that ability is normally distributed within age groups and that there exist relevant fixed population parameters for these distributions. Should the specification of population be inappropriate, so will the estimated scale values be. Should the sampling of intended populations be inadequate in any way, so will the estimated scale values be. They cannot be invariant to sampling. Samples differing in their ability distributions will produce scale values different in magnitude and dispersion.

Thurstone used the 1925 version of his method for the rest of his life, but the majority of test calibrators have relied on the simpler techniques of percentile ranks and standard scores. The inadequacies of these methods were clarified by Loevinger's 1947 analysis of the construction and evaluation of tests of ability (Loevinger, 1947).

Loevinger showed that test homogeneity and scale monotonicity were essential criteria for adequate measurement. In addition, "An acceptable method of scaling must result in a derived scale which is independent of the original scale and of the original group tested." (Loevinger, 1947, 46). Summing up the test calibration situation in 1947, Loevinger says, "No system of scaling has been proved adequate by the criteria proposed here, though these criteria correspond to the claims made by Thurstone's system." (Loevinger, 1947, 43). As for reliabilities based on correlations, "Until an adequate system of scaling is found, the correlation between tests of abilities, even between two tests of the same ability, will be accidental to an unknown degree." (Loevinger, 1947, 46).

In 1950 Gulliksen concluded his Theory of Mental Tests with the observation that

Relatively little experimental or theoretical work has been done on the effect of group changes on item parameters. If we assume that a given item requires a certain ability, the proportion of a group answering that item correctly will increase and decrease as the ability level of the group changes. . . . As yet there has been no systematic theoretical treatment of measures of item difficulty directed particularly toward determining the nature of their variation with respect to changes in group ability. Neither has the experimental work on item analysis been directed toward determining the relative invariance of item parameters with systematic changes in the ability level of the group tested (Gulliksen, 1950, 392-393).

At the 1953 ETS Invitational Conference on Testing Problems, Tucker suggested that, "An ideal test may be conceived as one for which the information transmitted by each of the possible scaled scores represents a location on some unitary continuum so that uniform differences between scaled scores correspond to uniform differences between test performances for all score levels." (Tucker, 1953, 27). He also proposed the comparison of groups differing in ability as a strong method for evaluating test homogeneity (Tucker, 1953, 25). But the other participants in the conference belittled his proposals as impractical and idealistic.

In 1960 Angoff wrote in his encyclopedia article on measurement and scaling that

Most of the test scales now in use derive their systems of units from data taken from actual test administrations, and thus are dependent on the performance of the groups tested. When so constructed, the scale has meaning only so long as the group is well defined and has meaning, and bears a resemblance in some fashion to the groups or individuals who later take the test for the particular purposes of selection, guidance, or group evaluation. However, if it is found that the sampling for the development of a test scale has not been adequate, or that the group on which the test has been scaled has outlived its usefulness, possibly because of changes in the defined population or because of changes in educational emphases, then the scale itself comes into question. This is a serious matter. A test which is to have continued usefulness must have a scale which does not change with the times, which will permit acquaintance and familiarity with the system of units, and which will permit an accumulation of data for historical comparisons (Angoff, 1960, 815).

And yet the faulted methods referred to and criticized by Loevinger, Gulliksen and Angoff are still widely used in test construction and measurement. This is in spite of the fact that considerable evidence has accumulated in the past twenty-five years that much better methods are possible and practical.

These better methods have their roots in the 19th century psychophysical models of Weber and Fechner. They are based on simple models for what it seems reasonable to suppose happens when a person responds to a test item. Two statistical distributions have been used to model the probabilistic aspect of this event. The normal distribution appears as a basis for mental measurement in Thurstone's Law of Comparative Judgement in the 1920's. The use of the normal ogive as an item response model seems to have been initiated by Lawley and Finney in the 1940's. Lord made the normal ogive the cornerstone of his approach to item analysis until about 1967, when under Birnbaum's influence, he switched to a logistic response model (Lord, 1968).

The logistic distribution was used by biometricians to study growth and mortality rates in the 1920's and Berkson has championed its practical advantages over the normal distribution ever since. These biometric applications were finally picked up, probably through the work of Bradley and Terry in the 1950's, and formulated into a logistic response model for item analysis by Birnbaum (1968) and Baker (1961). Baker developed computer programs for applying logit and probit item analysis and studied their performance with empirical and simulated data (Baker, 1959, 1963).

In all of these approaches to item analysis, however, at least two parameters are sought for each item. Attempts are made to estimate not only an item difficulty, the response ogive's horizontal intercept at probability one-half, but also an item discrimination, the ogive's slope at this intercept. Unfortunately this seemingly reasonable elaboration of the problem introduces an insurmountable difficulty into applying these ideas in practice. There has been a running debate for at least fifteen years as to whether or not there is any useful way by which some kind of estimates of item parameters like item discrimination and item "guessing" can be obtained.

The inevitable resolution of this debate has been implicit ever since Fisher's invention of sufficient estimation in the 1920's and Neyman and Scott's work on the consistency of conditional estimators in the 1940's. Rasch (1968), Andersen (1973, 1977) and Barndorff-Nielsen (1978) each prove decisively that only item difficulty can actually be estimated consistently and sufficiently from the right/wrong item response data available for item analysis. These proofs make it clear that the dichotomous response data available for item analysis can only support the estimation of item difficulty and that attempts to estimate any other individual item parameters are necessarily doomed.

The mathematics of these proofs need not be mastered to become convinced of their practical implications. Anyone who actually examines the inner workings of the various computer programs advertised to estimate item discriminations and tries to apply them to actual data will find that the resulting estimates are highly sample dependent. If attempts are made in these computer programs to iterate to an apparent convergence, this "convergence" can only be "reached" by interfering arbitrarily with the inevitable tendency of at least one of the item discrimination estimates to diverge to infinity. In most programs this insurmountable problem is sidestepped either by not iterating at all or by preventing any particular discrimination estimate from exceeding some entirely arbitrary ceiling such as 2.0.

As far as we can tell, it was the Danish mathematician Georg Rasch who first understood the possibilities for truly objective measurement which reside in the simple logistic response model. Apparently it was also Rasch who first applied the logistic function to the actual analysis of mental test data for the practical purpose of constructing tests. Rasch began his work on psychological measurement in 1945 when he standardized a group intelligence test for the Danish Department of Defense. It was in carrying out that item analysis that he first "became aware of the problem of defining the difficulty of an item independently of the population and the ability of an individual independently of which items he has actually solved." (Rasch, 1960, viii). By 1952 he had laid down the basic foundations for a new psychometrics and worked out two probability models for the analysis of oral reading tests. In 1953 he reanalyzed the intelligence test data and developed the essentials of a logistic probability model for item analysis.

Rasch first published his concern about the problem of sample dependent estimates in his 1953 article on simultaneous factor analysis in several populations (Rasch, 1953). But his work on item analysis was unknown in this country until the spring of 1960 when he visited Chicago for three months, gave a paper at the Berkeley Symposium on Mathematical Statistics (Rasch, 1961), and published Probabilistic Models for Some Intelligence and Attainment Tests (Rasch, 1960).

In her 1965 review of person and population as psychometric concepts Loevinger wrote,

Rasch (1960) has devised a truly new approach to psychometric problems . . . He makes use of none of the classical psychometrics, but rather applies algebra anew to a probabilistic model. The probability that a person will answer an item correctly is assumed to be the product of an ability parameter pertaining only to the person and a difficulty parameter pertaining only to the item. Beyond specifying one person as the standard of ability or one item as the standard of difficulty, the ability assigned to an individual is independent of that of other members of the group and of the particular items with which he is tested; similarly for the item difficulty . . . Indeed, these two properties were once suggested as criteria for absolute scaling (Loevinger, 1947); at that time proposed schemes for absolute scaling had not been shown to satisfy the criteria, nor does Guttman scaling do so. Thus, Rasch must be credited with an outstanding contribution to one of the two central psychometric problems, the achievement of nonarbitrary measures. Rasch is concerned with a different and more rigorous kind of generalization than Cronbach, Rajaratnam, and Gleser. When his model fits, the results are independent of the sample of persons and of the particular items within some broad limits. Within these limits, generality is, one might say, complete (Loevinger, 1965, 151).

0.3 AN INTRODUCTION TO THE MEASUREMENT PROBLEM

My topic is a problem in measurement. It is an old problem in educational testing. Alfred Binet worried about it 60 years ago. Louis Thurstone worried about it 40 years ago. The problem is still unsolved. To some it may seem a small point. But when you consider it carefully, I think you will find that this small point is a matter of life and death to the science of mental measurement. The truth is that the so-called measurements we now make in educational testing are no damn good!

Ever since I was old enough to argue with my pals over who had the best IQ (I say "best" because some thought 100 was perfect and 60 was passing), I have been puzzled by mental measurement. We were mixed up about the scale. IQ units were unlike any of those measures of height, weight, and wealth with which we were learning to build a science of life. Even that noble achievement, 100 percent, was ambiguous. One hundred might signify the welcome news that we were smart. Or it might mean the test was easy. Sometimes we prayed for easier tests to make us smarter.

Later I learned one way a test score could more or less be used. If I were willing to accept as a whole the set of items making up a standardized test, I could get a relative measure of ability. If my performance put me at the eightieth percentile among college men, I would know where I stood. Or would I? The same score would also put me at the eighty-fifth percentile among college women, at the ninetieth percentile among high school seniors, and above the ninety-ninth percentile among high school juniors. My ability depended not only on which items I took but on who I was and the company I kept!

The truth is that a scientific study of changes in ability, of mental development, is far beyond our feeble capacities to make measurements. How can we possibly obtain quantitative answers to questions like: How much does reading comprehension increase in the first three years of school? What proportion of ability is native and what learned? What proportion of mature ability is achieved by each year of childhood?

I hope I am reminding you of some problems which afflict present practice in mental measurement. The scales on which ability is measured are uncomfortably slippery. They have no regular unit. Their meaning and estimated quality depend upon the specific set of items actually standardized and the particular ability distribution of the children who happened to appear in the standardizing sample.

If all of a specified set of items have been tried by a child you wish to measure, then you can obtain his percentile position among whatever groups of children were used to standardize the test. But how do you interpret this measure beyond the confines of that set of items and those groups of children? Change the children and you have a new yardstick. Change the items and you have a new yardstick again. Each collection of items measures an ability of its own. Each measure depends for its meaning on its own family of test takers. How can we make objective mental measurements and build a science of mental development when we work with rubber yardsticks?

The growth of science depends on the development of objective methods for transforming observation into measurement. The physical sciences are a good example. Their basis is the development of methods for measuring which are specific to the measurement intended and independent of variation in the other characteristics of the objects measured or the measuring instruments used.

When we want a physical measurement, we seldom worry about the individual identity of the measuring instrument. We never concern ourselves with what objects other than the one we want to measure might sometime be, or once have been, measured with the same instrument. It is sufficient to know that the instrument is a member in good standing of the class of instruments appropriate for the job.

When a man says he is at the ninetieth percentile in math ability, we need to know in what group and on what test before we can make any sense of his statement. But when he says he is five feet eleven inches tall, do we ask to see his yardstick? We know yardsticks differ in color, temperature, composition, weight, even size. Yet we assume they share a scale of length in a manner sufficiently independent of these secondary characteristics to give a measurement of five feet eleven inches objective meaning. We expect that another man of the same height will measure about the same five feet eleven even on a different yardstick. I may be at a different ability percentile in every group I compare myself with. But I am the same 175 pounds in all of them.

Let us call measurement that possesses this property "objective". Two conditions are necessary to achieve it. First, the calibration of measuring instruments must be independent of those objects that happen to be used for calibration. Second, the measurement of objects must be independent of the instrument that happens to be used for measuring. In practice, these conditions can only be approximated. But their approximation is what makes measurement objective.

Object-free instrument calibration and instrument-free object measurement are the conditions which make it possible to generalize measurement beyond the particular instrument used, to compare objects measured on similar but not identical instruments, and to combine or partition instruments to suit new measurement requirements.

The guiding star toward which models for mental measurement should aim is this kind of objectivity. Otherwise how can we ever achieve a quantitative grasp of mental abilities or ever construct a science of mental development? The calibration of test-item difficulty must be independent of the particular persons used for the calibration. The measurement of person ability must be independent of the particular test items used for measuring.

When we compare one item with another in order to calibrate a test, it should not matter whose responses to these items we use for the comparison. Our method for test calibration should give us the same results regardless of whom we try the test on. This is the only way we will ever be able to construct tests which have uniform meaning regardless of whom we choose to measure with them.

When we expose persons to a selection of test items in order to measure their ability, it should not matter which selection of items we use or which items they complete. We should be able to compare persons, to arrive at statistically equivalent measurements of ability, whatever selection of items happens to have been used, even when they have been measured with entirely different tests.

Exhortations about objectivity and sarcasm at the expense of present practices are easy. But can anything be done about the problem? Is there a better way? In the old way of doing things, we calibrate a test item by observing how many persons in a standard sample succeed on that item. The traditional item "difficulty" is the proportion of correct responses in some standardizing sample. Item quality is judged from the correlation between these item responses and test scores. Person ability is a percentile standing in the same "standard" sample. Obviously this approach leans very heavily on assumptions concerning the appropriateness of the standardizing sample of persons.

A quite different approach is possible, one in which no assumptions need be made about the ability distribution of the persons used. This new approach assumes instead a very simple model for what happens when any person encounters any item. The model says simply that the outcome of the encounter shall be taken to be entirely governed by the difference between the ability of the person and the difficulty of the item. Nothing more. The more able the person, the better their chances for success with any item. The easier the item, the more likely any person is to solve it. It is as simple as that.

But this simple model has surprising consequences. When measurement is governed by this model, it is possible to take into account whatever abilities the persons in the calibration sample happen to demonstrate and to free the estimation of item difficulty from the particulars of these abilities. The scores persons obtain on the test can be used to remove the influence of their abilities from the estimation of item difficulty. The result is a sample-free item calibration.

The same thing can happen when we measure persons. The scores items receive in whatever sample happens to provide their calibrations can be used to remove the influence of item difficulty from the estimation of person ability. The result is a test-free person measurement.¹

¹Adapted from Proceedings of the 1967 Invitational Conference on Testing Problems. Copyright © 1968 by Educational Testing Service. All rights reserved. Reprinted by permission.
CONTENTS

ACKNOWLEDGEMENTS v

FOREWORD vii

1 THE MEASUREMENT MODEL 1
1.1 How Tests are Used to Measure 1
1.2 How Scores are Used 4
1.3 What Happens When a Person Takes an Item 9
1.4 The Rasch Model 15
1.5 Using the Rasch Model for Calibrating and Measuring 17
1.6 A Simple Useful Estimation Procedure 21
1.7 How Traditional Test Statistics Appear in Rasch Measurement 24

2 ITEM CALIBRATION BY HAND 28
2.1 Introduction 28
2.2 The Knox Cube Test 28
2.3 The Data for Item Analysis 29
2.4 Calibrating Items and Measuring Persons 30
2.5 Discussion 44

3 ITEM CALIBRATION BY COMPUTER 46
3.1 Introduction 46
3.2 BICAL Output for a PROX Analysis of the Knox Cube Test Data 46
3.3 Comparing PROX by Hand with PROX by Computer 55
3.4 Analyzing KCT with the UCON Procedure 56
3.5 Comparing UCON to PROX with the KCT Data 60
3.6 A Computing Algorithm for PROX 61
3.7 The Unconditional Procedure UCON 62

4 THE ANALYSIS OF FIT 66
4.1 Introduction 66
4.2 The KCT Response Matrix 66
4.3 The Analysis of Fit by Hand 69
4.4 Misfitting Person Records 76
4.5 Misfitting Item Records 77
4.6 Brief Summary of the Analysis of Fit 79
4.7 Computer Analysis of Fit 80

5 CONSTRUCTING A VARIABLE 83
5.1 Generalizing the Definition of a Variable 83
5.2 Defining the KCT Variable 83
5.3 Intensifying and Extending the KCT Variable 87
5.4 Control Lines for Identity Plots 94
5.5 Connecting Two Tests 96
5.6 Building Item Banks 98
5.7 Banking the KCTB Data 106
5.8 Common Person Equating with the KCTB 109
5.9 Common Item Equating with the KCTB 112
5.10 Criterion Referencing the KCT Variable 118
5.11 Item Calibration Quality Control 121
5.12 Norm Referencing the KCT Variable 126

6 DESIGNING TESTS 129
6.1 Introduction 129
6.2 The Measurement Target 129
6.3 The Measuring Test 131
6.4 The Shape of a Best Test 133
6.5 The Precision of a Best Test 134
6.6 The Error Coefficient 135
6.7 The Design of a Best Test 137
6.8 The Complete Rules for Best Test Design 139

7 MAKING MEASURES 141
7.1 Using a Variable to Make Measures 141
7.2 Converting Scores to Measures by UCON, PROX and UFORM 142
7.3 Measures from Best Tests by UFORM 145
7.4 Individualized Testing 151
7.5 Status Tailoring 153
7.6 Performance Tailoring 156
7.7 Self-Tailoring 161
7.8 Person Fit and Quality Control 165
7.9 Diagnosing Misfit 170
7.10 Correcting a Measure 181

8 CHOOSING A SCALE 191
8.1 Introduction 191
8.2 Formulas for Making New Scales 192
8.3 The Least Measurable Difference 192
8.4 Defining the Spacing Factor 195
8.5 Normative Scaling Units: NITS 198
8.6 Substantive Scaling Units: SITS 199
8.7 Response Probability Scaling Units: CHIPS 201
8.8 Reporting Forms 205

APPENDICES 211
Table A 212 & 213
Table B 214 & 215
Table C 216

REFERENCES 217

INDEX 220
1 THE MEASUREMENT MODEL

1.1 HOW TESTS ARE USED TO MEASURE

This book is about how to make and use mental tests. In order to do this successfully we must have a method for turning observations of test performance into measures of mental ability. The idea of a measure requires an idea of a variable on which the measure is located. If the variable is visualized as a line, then the measure can be pictured as a point on that line. This relationship between a measure and its variable is pictured in Figure 1.1.1.

FIGURE 1.1.1
A MEASURE ON A VARIABLE
[Figure: the variable drawn as a line, with the measure marked as a point on it.]

When we test a person, our purpose is to estimate their location on the line implied by the test. Before we can do this we must construct a test that defines a line. We must also have a way to turn the person's test performance into a location on that line. This book shows how to use test items to define lines and how to use responses to these items to position persons on these lines.

In order for a test to define a variable of mental ability, the items out of which the test is made must share a line of inquiry. This common line and its direction towards increasing ability can be pictured as an arrow with high ability to the right and low ability to the left. The meaning of this arrow is given by the test items which define it. If we use the symbols δ₁, δ₂, . . ., δᵢ, . . . to represent the difficulty levels of items, then each δᵢ marks the location of an item on the line. The δ's are the calibrations of the items along the variable and these calibrated items are the operational definition of what the variable measures. Hard items which challenge the most able persons define the high, or right, end of the line. Easy items which even the least able persons can usually do successfully define the low, or left, end of the line. Figure 1.1.2 shows a variable defined by four items spread across its length.

FIGURE 1.1.2
DEFINING A VARIABLE
[Figure: the line of the variable with four item calibrations marked from easiest to hardest, and a person measure located among them.]

A variable begins as a general idea of what we want to measure. This general idea is given substance by writing test items aimed at eliciting signs of the intended variable in the behavior of the persons. These test items become the operational definition of the variable. The intuition of the test builder and the careful construction of promising test items, however, are not enough. We must also gather evidence that a variable is in fact realized by the test items. We must give the items to suitable persons and analyze the resulting response patterns to see if the items fit together in such a way that responses to them define a variable.

In order to locate a person on this variable we must test them with some of the items which define the variable and then determine whether their responses add up to a position on the line. If we use the symbol β to represent the ability level of the person, then β marks their location on the line.

The person measure β shown in Figure 1.1.2 locates this person above the three easiest items and below the hardest one. Were this person to take a test made up of these four items, their most probable test score would be three and we would expect them to get the three easiest items correct and the fourth, hardest item, incorrect. This observation is more important than it might seem because it is the basis of all our methods for estimating person measures from test scores. When we want to know where a person is located on a variable, we obtain their responses to some of the items which define the variable. The only reasonable place to estimate their location from these data is in the region where their responses shift from mostly correct on easier items to mostly incorrect on harder ones.

Before we can estimate a person's measure from their score, however, we must examine their pattern of responses. We must see if their pattern is consistent with how we expect their items to elicit responses. When the items with which a person is tested have been calibrated along a variable from easy to hard, then we expect the person's response pattern to be more or less consistent with the difficulty order of these items along the variable. We expect the person to succeed on items that ought to be easy for them and to fail on items that ought to be hard for them.

Figure 1.1.3 shows two response patterns to the same ten item test. The ten items are located along the variable at their levels of difficulty. Each pattern of responses is recorded above the line of the variable. The 1's represent correct answers. The 0's represent incorrect answers. Both patterns produce a score of six.

In Pattern A the six easiest items are correct and the four hardest ones incorrect. It seems inconceivable to locate this person anywhere except in the region above δ₆, the most difficult item they get correct, but below δ₇, the least of the even more difficult items they get incorrect.

Pattern B, however, is very difficult to reconcile with the implications of a score of six. This person gets the six hardest items correct and the four easiest ones incorrect! If we try to locate this person above δ₁₀, the hardest item they get correct, we have to explain how they got the four easiest items incorrect. Could anyone be that careless? If, on the other hand, we try to locate them below δ₁, the easiest item they get incorrect, then how do we explain their getting the six hardest items correct? Every other location along the variable, such as between δ₆ and δ₇ for a score of six, is equally unsatisfactory as a "measure" for the person who produced Pattern B. This pattern of responses is not consistent with any location on the variable defined by these items. We are forced to conclude that something is wrong. Either the items used are miscalibrated or this person did not take them in the way we intended. In any case, no reasonable measure can be derived from Pattern B.
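To make the contrast concrete, here is a small sketch of ours, not the book's formal analysis of fit (that comes in Chapter 4): it scores each pattern and counts the responses that contradict the easy-to-hard ordering of the ten items.

```python
# An informal sketch (not the book's fit statistics) comparing Pattern A and
# Pattern B once the ten items are ordered from easiest to hardest. Both
# patterns earn a score of six, but only Pattern A places its correct
# responses on the easier items.

pattern_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # six easiest correct, four hardest incorrect
pattern_b = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # four easiest incorrect, six hardest correct

def surprises(responses):
    """Count responses that contradict the easy-to-hard ordering implied by the score.

    With a score of r on items ordered by difficulty, the most plausible pattern
    is r ones followed by zeros; every departure from that is counted here.
    """
    score = sum(responses)
    expected = [1] * score + [0] * (len(responses) - score)
    return sum(1 for got, exp in zip(responses, expected) if got != exp)

print(sum(pattern_a), surprises(pattern_a))  # 6 0 -> consistent with a location between items 6 and 7
print(sum(pattern_b), surprises(pattern_b))  # 6 8 -> no location on the variable makes sense
```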

The Pattern B example is an important one because it shows us that even when we have constructed items that can define a valid variable we still have also to validate every person's response pattern before proceeding to use their score as a basis for estimating their measure. When item calibrations have been validated by enough suitable persons, then most of the response patterns we encounter among suitable persons will approximate Pattern A. However, the possibility of occurrences verging on Pattern B forces us to examine and validate routinely the response pattern of every person tested before we can presume to estimate a measure from their test score.

Four steps must be taken to use a test to measure a person. First, we must work out a clear idea of the variable we intend to make measures on. Second, we must construct items which are believable realizations of this idea and which can elicit signs of it in the behavior of the persons we want to measure. Third, we must demonstrate that these items when taken by suitable persons can lead to results that are consistent with our intentions. Finally, before we can use any person's score as a basis for their measure, we must determine whether or not their particular pattern of responses is, in fact, consistent with our expectations.

1.2 HOW SCORES ARE USED

A test score is intended to locate a person on the variable defined by the test items taken. Nearly everyone who uses test scores supposes that the person's location on the variable is satisfactorily determined either by the score itself, or by some linear function of the score such as a percent correct or a norm-based scale value. It is taken for granted that the score, or its scale equivalent, tells us something about the person tested that goes beyond the moment or materials of the testing. It is also taken for granted that scores are suitable for use in the arithmetic necessary to study growth and compare groups. But do scores actually have the properties necessary to make it reasonable to use them in these ways?

In order for a particular score to have meaning it must come from a response pattern which is consistent with items that define a variable. But even the demonstration of item validity and response validity does not guarantee that the score will be useful. In order to generalize about the person beyond their score, in order to discover what their score implies, we must also take into account and adjust for the particulars of the test items used. How, then, does a person's test score depend on the characteristics of the items in the test they take?

FIGURE 1.2.1
HOW SCORES DEPEND ON THE LEVEL AND SPREAD OF TEST ITEM DIFFICULTIES
[Figure: one person's position on the variable is shown against five eight-item tests (Very Easy, Very Hard, Narrow Hard, Narrow Easy and Wide Easy), each drawn on its own copy of the line with its item difficulties, its test center and the person's expected score.]

Figure 1.2.1 shows what can happen when one person at a particular ability level takes five different tests all of which measure on the same variable but which differ in the level and spread of their item difficulties from easy to hard and narrow to wide. The difficulties of the eight items in each test are marked on the line of the variable. In order to see each test separately we have redrawn the line of the variable five times, once for each test.

The ability of the person being measured is also marked on each line so that we can see how this person stands with respect to each test. While each test has a different position on the variable depending on the difficulties of its items, this person's position, of course, is the same on each line. Figure 1.2.1 also shows the scores we would expect this person most often to get on these five tests.

The first, Very Easy Test, has items so easy for this person that we expect a test score of eight. The second, Very Hard Test, has such hard items that we expect a score of zero. The third, Narrow Hard Test, has seven of its items above the person's ability and one below. In this situation the score we would expect most often to see would be a one. The fourth, Narrow Easy Test, has seven of its items below the person's ability and so we expect a score of seven. Finally the fifth, Wide Easy Test, has five items which should be easy for them. Even though this test is centered at the same position on the variable as the Narrow Easy Test just above it in Figure 1.2.1 and so has the same average difficulty level, nevertheless, because of its greater width in item difficulty, we expect only a score of five.

For one person we have five expected scores: zero, one, five, seven and eight! Although we know the person's ability does not change, the five different scores, as they stand, suggest five different abilities. Test scores obviously depend as much on the item characteristics of the test as on the ability of the person taking the test.
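To see how such expected scores arise, here is a rough numerical sketch. It assumes the logistic response model introduced in Section 1.4 and uses hypothetical item difficulties invented only to imitate the five tests of Figure 1.2.1; the expected score on a test is simply the sum, over its items, of the person's probabilities of success.

```python
# A rough numerical sketch of Figure 1.2.1, assuming the logistic response model
# of Section 1.4 and hypothetical item difficulties (in logits) chosen here to
# imitate the five tests. The expected score on a test is the sum, over its
# items, of the probabilities of a correct response.
from math import exp

def p_correct(ability, difficulty):
    """Logistic model probability of a correct response (see Equation 1.4.1)."""
    return exp(ability - difficulty) / (1.0 + exp(ability - difficulty))

def expected_score(ability, difficulties):
    return sum(p_correct(ability, d) for d in difficulties)

ability = 0.0  # the one person of Figure 1.2.1, placed at 0 logits

tests = {
    "Very Easy":   [-6.0, -5.5, -5.0, -4.5, -4.0, -3.5, -3.0, -2.5],
    "Very Hard":   [ 2.5,  3.0,  3.5,  4.0,  4.5,  5.0,  5.5,  6.0],
    "Narrow Hard": [-2.0,  2.0,  2.5,  3.0,  3.5,  4.0,  4.5,  5.0],
    "Narrow Easy": [ 2.0, -2.0, -2.5, -3.0, -3.5, -4.0, -4.5, -5.0],
    "Wide Easy":   [-9.0, -8.0, -6.0, -4.0, -2.0,  1.0,  2.0,  4.0],
}

for name, difficulties in tests.items():
    print(f"{name:12s} expected score = {expected_score(ability, difficulties):.1f}")
# Prints roughly 7.8, 0.2, 1.2, 6.8 and 5.3, echoing the pattern of
# expected scores 8, 0, 1, 7 and 5 described for Figure 1.2.1.
```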

If the meaning of a test score depends on the characteristics of the test items, however, then before we can determine a person's ability from their test score we must "adjust" their score for the effects of the particular test items from which that particular score comes. This adjustment must be able to turn test-bound scores into measures of person ability which are test-free.

Unfortunately, with test scores like zero, in which there is no instance of success, and the eight of our example, in which there is no instance of failure, there is no satisfactory way to settle on a finite measure for the person. All we can do in those situations is to observe that the person who scored all incorrect or all correct is substantially below or above the operating level of the test they have taken. If we wish to estimate a finite measure for such a person, then we will have to find a test for them which is more appropriate to their level of ability.

We might be tempted to interpret perfect scores as "complete mastery." But unless the test in question actually contained the most difficult items that could ever be written for this variable there would always be the possibility of other items which were even more difficult. These more difficult items might produce incorrect answers, even with our perfectly scoring person, revealing that mastery was not complete after all. When a test is extremely easy, of course, everyone recognizes that even a perfect score is quite consistent with intermediate ability.

The dependence of test scores on item difficulty is a problem with which most test users are familiar. Almost everyone realizes that fifty percent correct on an easy test does not mean as much as fifty percent correct on a hard test. Some test users even realize that seventy-five percent correct on a narrow test does not imply as much ability as seventy-five percent correct on a wide test. But there is another problem in the use of test scores which is often overlooked.

It is common practice to compute differences in test scores to measure growth, to combine test scores by addition and subtraction in order to compare groups and to add and subtract squares and cross-products of test scores in order to do regression analysis. But when these simple arithmetic operations are applied to test scores the results are always slightly distorted and can be substantially misleading. Although test scores usually estimate the order of persons' abilities rather well, they never estimate the spacing satisfactorily. Test scores are not linear in the measures they imply and for which they are used.

In the statistical use of test scores, floor and ceiling effects are occasionally recognized. But they are almost never adjusted for. These boundary effects cause any fixed differences of score points to vary in meaning over the score range of the test. The distance on the variable a particular difference in score points implies is not the same from one end of the test to the other. A difference of five score points, for example, implies a larger change in ability at the ends of a test than in the middle.

Figure 1.2.2 illustrates this problem with test scores. We show two persons with measures βA and βB who are a fixed distance apart on the same variable. Both persons are administered five different tests all measuring on this variable. The persons' locations and hence their measurable difference on the variable remain the same from test to test, but their most probable scores vary widely. This is because the five tests differ in their item difficulty level, spread and spacing. Let's see how the resulting expected scores reflect the fixed difference between these two persons.

Test I is composed of eight items all of which fall well between Person A and Person B. We expect Person A to get none of these items correct for a score of zero while we expect Person B to get all eight items correct for a score of eight. On this test their abilities will usually appear to be eight score points apart. That is as far apart in ability as it is possible to be on this test.

Test II is composed of eight items all of which are well below both persons. We expect both persons to get scores of eight because this test is too easy for both of them. Now their expected difference in test scores is zero and their abilities will usually appear to be the same!

Test III is composed of eight very hard items. Now we expect both persons to get scores of zero because this test is too hard for them. Once again their expected score difference is zero and their abilities will usually appear to be the same.

Test I was successful in separating Persons A and B. Tests II and III failed because they were too far off target. Perhaps it is only necessary to center a test properly in order to observe the difference between two persons.
FIGURE 1.2.2
THE NONLINEARITY OF SCORES
[Figure: persons A and B, a fixed distance apart at βA and βB, are shown against five tests (I through V) that differ in item difficulty level, spread and spacing, together with each person's expected score and the expected score difference on each test.]

Test IV is centered between Person A and Person B but its items are so spread out that there is a wide gap in its middle into which Person A and Person B both fall. The result is that both persons can be expected to achieve scores of four because four items are too easy and four items are too hard for both of them. Even for this test which is more or less centered on their positions, their expected score difference is zero and their abilities will still usually appear to be the same.

Test V, at last, is both wide and fairly well centered on Persons A and B. It contains two items which fall between their positions and therefore separate them. We expect Person A to get the four easiest items correct for a most probable score of four. As for Person B, however, we expect them not only to get the same four items correct but also the next two harder ones because these two items are also below Person B's ability level. Thus on Test V the expected difference in scores between Persons A and B becomes two. On this test their abilities will usually appear to be somewhat, but not extremely, different.

What can we infer about the differences in ability between Persons A and B from scores like these? Persons A and B will tend to appear equally able on Tests II, III, and IV, somewhat different on Test V and as different as possible on Test I. If differences between the test scores of the same two persons can be made to vary so widely merely by changing the difficulties of the items in the test, then how can we use differences in test scores to study ability differences on a variable?

The answer is, we can't. Not as they stand. In order to use test scores, which are not linear in the variable they imply, to analyze differences we must find a way to transform the test scores into measures which approximate linearity.

Test scores always contain a potentially misleading distortion. If we intend to use test results to study growth and to compare groups, then we must use a method for making measures from test scores which marks locations along the variable in an equal interval or linear way.

In this section we have illustrated two serious problems with test scores. The first illustration shows how test scores are test-bound and how we have to adjust them for the characteristics of their test items before we can use the scores as a basis for measurement. The second illustration shows how test scores do not mark locations on their variable in a linear way and how we need to transform test scores into measures that are linear before we can use them to study growth or to compare groups.

1.3 WHAT HAPPENS WHEN A PERSON TAKES AN ITEM

The discussions in Sections 1.1 and 1.2 establish our need for 1) valid items which can be demonstrated to define a variable, 2) valid response patterns which can be used to locate persons on this variable, 3) test-free measures that can be used to characterize persons in a general way and 4) linear measures that can be used to study growth and compare groups. Now we must build a method that comes to grips with these requirements.

The responses of individual persons to individual items are the raw data with which we begin. The method we develop must take these data and make from them item calibrations and person measures with the properties we require. Figure 1.3.1 shows a very simple data matrix containing the responses of eight persons to a five item test. The five items are named at the top of the matrix. The eight persons are named at the left. The response of each person to each item is indicated by "1" for a correct response and "0" for an incorrect response. Notice that the responses in Figure 1.3.1 have been summed across the items and entered on the right side of the matrix as person scores and down the persons and entered at the bottom of the matrix as item scores.

FIGURE 1.3.1
A DATA MATRIX OF OBSERVED RESPONSES

                        Item Name
Person Name    1    2    3    4    5    Person Score

     a         1    0    0    0    0         1
     b         0    1    0    0    0         1
     c         1    1    0    0    0         2
     d         1    0    1    0    0         2
     e         1    1    1    0    0         3
     f         1    1    0    1    0         3
     g         1    1    1    1    0         4
     h         1    1    1    0    1         4

Item Score     7    6    4    2    1        20
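The margins of Figure 1.3.1 are easy to reproduce. The small sketch below (plain Python, not a program from the book) stores the eight response records and recovers the person scores and item scores shown in the figure.

```python
# A small sketch reproducing the margins of Figure 1.3.1: summing each row of
# the response matrix gives the person scores, summing each column gives the
# item scores.

responses = {                 # persons a..h by items 1..5, 1 = correct, 0 = incorrect
    "a": [1, 0, 0, 0, 0],
    "b": [0, 1, 0, 0, 0],
    "c": [1, 1, 0, 0, 0],
    "d": [1, 0, 1, 0, 0],
    "e": [1, 1, 1, 0, 0],
    "f": [1, 1, 0, 1, 0],
    "g": [1, 1, 1, 1, 0],
    "h": [1, 1, 1, 0, 1],
}

person_scores = {name: sum(row) for name, row in responses.items()}
item_scores = [sum(column) for column in zip(*responses.values())]

print(person_scores)     # {'a': 1, 'b': 1, 'c': 2, 'd': 2, 'e': 3, 'f': 3, 'g': 4, 'h': 4}
print(item_scores)       # [7, 6, 4, 2, 1]
print(sum(item_scores))  # 20, the total count of correct responses
```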

Figure 1.3.1 shows what the basic data look like. But before we can put these data to work we must answer a fundamental question. Where do we think these data come from? What are these item and person scores supposed to tell us about items and persons? How do we think these patterns of 1's and 0's are produced? In order to figure out how to use these data we must set up a reasonable model for what we suppose happens when a person attempts to answer an item.

We would like a person v's ability βᵥ, that is their location on the variable, to govern how far along the variable we can expect them to produce correct responses to items. Indeed that is the only situation in which we can use item difficulties and a person's responses to them as the basis for measuring the person.

Of course we can think of other factors which might affect a person's responses. If items are multiple-choice, some guessing is bound to occur and persons differ in how much guessing they are willing to engage in. The possibilities of disturbing influences which interfere with the clear expression and hence the unambiguous observation of ability are endless. But, if it is really the person's ability that we hope to measure, then it would be unreasonable not to do our best to arrange things so that it is the person's ability which dominates their test behavior. Indeed, isn't that what good test administration practices are for, namely, to control and minimize the intrusion of interfering influences?

We would also like item i's difficulty δᵢ, that is its location on the variable, to determine how far along the variable we can expect correct responses to that item to occur. As with persons, we can think up item characteristics, such as discrimination and vulnerability to guessing, which might modify persons' responses to them. Some psychometricians attempt to estimate these additional item characteristics even though there are good reasons to expect that all such attempts must, in principle, fail. But, again, it hardly seems reasonable not to do our best to arrange things so that it is an item's difficulty which dominates how persons of various abilities respond to that item. In any case, the fact is that whenever we use unweighted scores as our test results we are assuming that, for all practical purposes, it is item difficulties, and person abilities, that dominate person responses.

FIGURE 1.3.2
THE ESSENTIAL CONDITIONS CAUSING A RESPONSE
[Figure: person ability βᵥ and item difficulty δᵢ act together to produce the observed response xᵥᵢ. When the response is "correct", x = 1. When the response is "incorrect", x = 0.]

These considerations lead us to set up a response model that is the simplest representation possible. Figure 1.3.2 diagrams person v with ability βᵥ acting on item i with difficulty δᵢ to produce the response xᵥᵢ. These are the essential elements we will take into account when we try to explain the data in Figure 1.3.1. Figure 1.3.2 proposes that the response xᵥᵢ which occurs when person v takes item i can be thought of as governed by the person's ability βᵥ and the item's difficulty δᵢ and nothing else.

Our next step is to decide how we want person ability βᵥ and item difficulty δᵢ to interact in order to produce xᵥᵢ. What is a reasonable and useful way to set up a mathematical relation between βᵥ and δᵢ? Since we require that βᵥ and δᵢ represent locations along one common variable which they share, it is their difference (βᵥ - δᵢ) which is the most convenient and natural formulation of their relation.

Identifying the difference (βᵥ - δᵢ), however, does not finish our work because we must also decide how we want this difference to govern the value of the response xᵥᵢ. Even when a person is more able than an item is difficult, so that their βᵥ is greater than the item's δᵢ, it will occasionally happen that this person nevertheless fails to give a correct answer to that relatively easy item so that the resulting value of xᵥᵢ is "0". It will also happen occasionally that a person of moderate ability nevertheless succeeds on a very difficult item. Obviously it is going to be awkward to force a deterministic relationship onto the way (βᵥ - δᵢ) governs the value of response xᵥᵢ. A better way to deal with this problem is to acknowledge that the way the difference (βᵥ - δᵢ) influences the response xᵥᵢ can only be probabilistic and to set up our response model accordingly.

Figure 1.3.3 shows how it would be most reasonable to have the difference (βᵥ - δᵢ) affect the probability of a correct response. When βᵥ is larger than δᵢ, so that the ability level of person v is greater than the difficulty level of item i and their difference (βᵥ - δᵢ) is greater than zero, then we want the probability of a correct answer to be greater than one half. When, on the other hand, the ability level of person v is less than the difficulty level of item i, so that their difference (βᵥ - δᵢ) is less than zero, then we want the probability of a correct answer to be less than one half. Finally, when the levels of person ability and item difficulty are the same so that their difference (βᵥ - δᵢ) is zero, then the only probability that seems reasonable to assign to a correct (or to an incorrect) answer is exactly one half.

The curve in Figure 1.3.4 summarizes the implications of Figure 1.3.3 for all reasonable relationships between probabilities of correct responses and differences between person ability and item difficulty. This curve specifies the conditions our response model must fulfill. The differences (βᵥ - δᵢ) could arise in two ways. They could arise from a variety of person abilities reacting to a single item or they could arise from a variety of item difficulties testing the ability of one person. When the curve is drawn with ability β as its variable so that it describes an item, it is called an item characteristic curve (ICC) because it shows the way the item elicits responses from persons of every ability. When the curve is drawn with difficulty δ as its variable so that it describes how a person responds to a variety of items, we can call it a person characteristic curve (PCC).

FIGURE 1.3.3
HOW DIFFERENCES BETWEEN PERSON ABILITY AND ITEM DIFFICULTY
OUGHT TO AFFECT THE PROBABILITY OF A CORRECT RESPONSE

1. When βᵥ > δᵢ, so that (βᵥ - δᵢ) > 0, then P{xᵥᵢ = 1} > 1/2.

2. When βᵥ < δᵢ, so that (βᵥ - δᵢ) < 0, then P{xᵥᵢ = 1} < 1/2.

3. When βᵥ = δᵢ, so that (βᵥ - δᵢ) = 0, then P{xᵥᵢ = 1} = 1/2.

The curve in Figure 1.3.4 is a picture of the response model we require in order to solve the problem of how the parameters βᵥ and δᵢ, which we want to estimate, depend on the data we can observe. To measure a person, we must estimate βᵥ and to calibrate an item we must estimate δᵢ. In order to estimate either of these parameters from the observed responses of persons to items we must construct a mathematical formulation which is true to the relationship drawn in Figure 1.3.4 and which relates βᵥ, δᵢ and xᵥᵢ in a useful way. This formulation must also be able to show us how to use data of the kind given in Figure 1.3.1 to make estimates of person ability which are test-free and estimates of item difficulty which are sample-free.

FIGURE 1.3.4

THE RESPONSE CURVE

[The ogive relating the probability of a correct response to the difference (β_v - δ_i) is not reproduced here.]

1.4 THE RASCH MODEL

In order to construct a workable mathematical form for the curve in Figure 1.3.4 we begin by combining the parameters, β_v for person ability and δ_i for item difficulty, through their difference (β_v - δ_i). We want this difference to govern the probability of what is supposed to happen when person v uses their ability β_v against the difficulty δ_i of item i. But the difference (β_v - δ_i) can vary from minus infinity to plus infinity while the probability of a successful response must remain between zero and one. To deal with this we apply the difference (β_v - δ_i) as an exponent of the natural constant e = 2.71828 . . . and write the result as

e^(β_v - δ_i) = exp(β_v - δ_i).

This exponential expression varies between zero and plus infinity and we can bring it into the interval between zero and one by forming the ratio

exp(β_v - δ_i) / [1 + exp(β_v - δ_i)].

This formulation has a shape which follows the ogive in Figure 1.3.4 quite well. It can be used to specify the probability of a successful response as

P{x_vi = 1 | β_v, δ_i} = exp(β_v - δ_i) / [1 + exp(β_v - δ_i)]    [1.4.1]

which is the Rasch model.
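As a quick numerical check of Equation 1.4.1, the short sketch below (in Python; an illustration of ours, not part of the original text, and the function name and example values are our own) evaluates the Rasch probability for a few ability-difficulty differences.

    import math

    def rasch_probability(beta, delta):
        """Probability of a correct response under Equation 1.4.1:
        P{x = 1 | beta, delta} = exp(beta - delta) / [1 + exp(beta - delta)]."""
        return math.exp(beta - delta) / (1.0 + math.exp(beta - delta))

    # A person one logit above an item succeeds with probability about .73;
    # a person one logit below it succeeds with probability about .27.
    print(round(rasch_probability(1.0, 0.0), 2))   # 0.73
    print(round(rasch_probability(-1.0, 0.0), 2))  # 0.27
    print(round(rasch_probability(0.0, 0.0), 2))   # 0.50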

Any mathematical form which describes an ogive of the shape in Figure 1.3.4 could provide a solution to the linearity problem by transforming scores which are restricted between 0 and 100 percent into "measures" which run from minus infinity to plus infinity.

Any mathematical form which relates the probability of x_vi to the difference between β_v and δ_i and which has estimable parameters could allow us to study item and response validity. All we have to do is to specify a workable model for how (β_v - δ_i) governs the probability of x_vi, use this model to estimate β_v and δ_i from some data and then examine the way these data fit with predictions calculated from the model.

Any ogive and any formulation, however, will not do. In fact, only the formulation of Equation 1.4.1, the Rasch model, allows us to estimate β_v and δ_i independently of one another in such a way that the estimates of β_v are freed from the effects of the δ_i's and the estimates of δ_i are freed from the effects of the β_v's.

The logistic function in Equation 1.4.1 provides a simple, useful response model that makes both linearity of scale and generality of measure possible. Although biometricians have used the logistic function since 1920, it was the Danish mathematician Georg Rasch (1960) who first appreciated its psychometric significance. Rasch calls the special characteristic of the simple logistic function which makes generality in measurement possible "specific objectivity." He and others have shown that there is no alternative mathematical formulation for the ogive in Figure 1.3.4 that allows estimation of the person measures β_v and the item calibrations δ_i independently of one another (Rasch, 1961, 1967; Andersen, 1973, 1977; Barndorff-Nielsen, 1978). When the estimators for β_v and δ_i are derived by maximizing a conditional likelihood they are unbiased, consistent, efficient, and sufficient (Andersen, 1970, 1971, 1972a, 1973, 1977; Haberman, 1977). Simple approximations for these conditional maximum likelihood estimators which are accurate enough for almost all practical purposes are described in Wright and Panchapakesan (1969), Wright and Douglas (1975a, 1975b, 1977a, 1977b) and Wright and Mead (1976). These procedures have been useful in a wide variety of applications (Connolly, Nachtman and Pritchett, 1971; Woodcock, 1974; Willmott and Fowles, 1974; Rentz and Bashaw, 1975, 1977; Andrich, 1975; Mead, 1975; Wright and Mead, 1977; Cornish and Wines, 1977; Draba, 1978; Elliott, Murray and Pearson, 1977).

_______________________ | TABLE 1.4.1 | _______________________

PERSON ABILITY AND ITEM DIFFICULTY IN LOGITS
AND THE RASCH PROBABILITY OF A RIGHT ANSWER

Person    Item                                  Right Answer   Information
Ability   Difficulty   Difference   Odds        Probability    in a Response
β_v       δ_i          (β_v - δ_i)  exp(β_v-δ_i)   π_vi           I_vi

 5         0             5           148.          .99            .01
 4         0             4            54.6         .98            .02
 3         0             3            20.1         .95            .05
 2         0             2             7.39        .88            .11
 1         0             1             2.72        .73            .20
 0         0             0             1.00        .50            .25
 0         1            -1             0.368       .27            .20
 0         2            -2             0.135       .12            .11
 0         3            -3             0.050       .05            .05
 0         4            -4             0.018       .02            .02
 0         5            -5             0.007       .01            .01

π_vi = exp(β_v - δ_i) / [1 + exp(β_v - δ_i)]

I_vi = π_vi (1 - π_vi)

We can see in Equation 1.4.1 that when person v is smarter than item i is difficult, then β_v is more than δ_i, their difference is positive and the probability of success on item i is greater than one half. The more the person's ability surpasses the item's difficulty, the greater this positive difference and the nearer the probability of success comes to one. But when the item is too hard for the person, then β_v is less than δ_i, their difference is negative and the person's probability of success is less than one half. The more the item overwhelms the person, the greater this negative difference becomes and the nearer the probability of success comes to zero.

The mathematical units for β_v and δ_i defined by this model are called "logits." A person's ability in logits is their natural log odds for succeeding on items of the kind chosen to define the "zero" point on the scale. And an item's difficulty in logits is its natural log odds for eliciting failure from persons with "zero" ability.

Table 1.4.1 gives examples of various person abilities and item difficulties in logits, their differences (β_v - δ_i) and the success probabilities which result. The first six rows illustrate various person abilities and their success probabilities when provoked by items of zero difficulty. The last six rows give examples of various item difficulties and the probabilities of success on them by persons with zero ability.

The origin and scale of the logits used in Table 1.4.1 are arbitrary. We can add any constant to all abilities and all difficulties without changing the difference (β_v - δ_i). This means that we can place the zero point on the scale so that negative difficulties and abilities do not occur. We can also introduce any scaling factor we find convenient, including one large enough to eliminate any need for decimal fractions. Chapter 8 investigates these possibilities in detail.

The last column of Table 1.4.1 gives the relative information I_vi = π_vi(1 - π_vi) available in a response observed at each (β_v - δ_i). When item difficulty δ_i is within a logit of person ability β_v, the information about either δ_i or β_v in one observation is greater than .20. But when item difficulty is more than two logits off target, the information is less than .11, and for |β_v - δ_i| > 3 it is less than .05. The implications for efficient calibration sampling and best test design are that responses in the |β_v - δ_i| < 1 region are worth more than twice as much for calibrating items or measuring persons as those outside of |β_v - δ_i| > 2 and more than four times as much as those outside of |β_v - δ_i| > 3.
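These claims are easy to verify numerically. The sketch below (Python; our own illustration, not part of the original text) recomputes, up to rounding, the probability and information columns of Table 1.4.1 from the difference (β_v - δ_i).

    import math

    def rasch_p(diff):
        # diff = beta_v - delta_i, as in Equation 1.4.1
        return math.exp(diff) / (1.0 + math.exp(diff))

    def information(diff):
        # I_vi = pi_vi * (1 - pi_vi), the last column of Table 1.4.1
        p = rasch_p(diff)
        return p * (1.0 - p)

    for diff in [5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5]:
        print(f"{diff:3d}  p = {rasch_p(diff):.2f}  info = {information(diff):.2f}")
    # Within a logit of target the information stays near .20-.25 per response;
    # beyond two logits it falls below about .11, and beyond three logits below .05.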

1.5 USING THE RASCH MODEL FOR CALIBRATING AND MEASURING

We have established the need for an explicit approach to measurement and shown how measurement problems can be addressed with a model for what happens when a person takes an item. Now we are ready to work through the mathematics of this model in order to find out how we can use the model to calibrate items and measure persons. The model specifies the probability of person v with ability β_v giving response x_vi to item i with difficulty δ_i as

P{x_vi | β_v, δ_i} = exp[x_vi(β_v - δ_i)] / [1 + exp(β_v - δ_i)]    [1.5.1]

The response x_vi takes only two values,

x_vi = 0 when the response is incorrect and

x_vi = 1 when the response is correct.

When we insert each of these values of x_vi into Equation 1.5.1 we find that it breaks down into the complementary expressions

P{x_vi = 1 | β_v, δ_i} = exp(β_v - δ_i) / [1 + exp(β_v - δ_i)]    [1.5.2]

for a correct response and

P{x_vi = 0 | β_v, δ_i} = 1 / [1 + exp(β_v - δ_i)]    [1.5.3]



for an incorrect response. These two expressions add up to one because together they cover everything that can happen to x_vi.

When a set of L items is administered to a sample of N persons, the result is an N by L collection of responses ((x_vi)) for which v = 1, N and i = 1, L, and the double parentheses (( )) remind us that a whole N by L table of x_vi's is implied. These data can be displayed in a matrix like the one shown in Figure 1.5.1. The marginal sums of the rows and columns of this data matrix are the person scores

r_v = Σ_i x_vi    v = 1, N

and the item scores

s_i = Σ_v x_vi    i = 1, L

FIGURE 1.5.1

DATA MATRIX OF OBSERVED RESPONSES

[An N by L matrix of responses x_vi, with persons v = 1, N as rows and items i = 1, L as columns. Row sums give the person scores Σ_i x_vi = r_v and column sums give the item scores Σ_v x_vi = s_i.]

What happens when we analyze these data as though they were governed by the model of Equation 1.5.1? According to that model the only systematic influences on the production of the x_vi's are the N person abilities (β_v) and the L item difficulties (δ_i). As a result, apart from these parameters, the x_vi's are modeled to be quite independent of one another. This means that the probability of the whole data matrix ((x_vi)), given the model and its parameters (β_v) and (δ_i), can be expressed as the product of the probabilities of each separate x_vi given by Equation 1.5.1 continued over all v = 1, N and all i = 1, L.

This continued product is

P{((x_vi)) | (β_v), (δ_i)} = Π_v Π_i { exp[x_vi(β_v - δ_i)] / [1 + exp(β_v - δ_i)] }    [1.5.4]

When we move the continued product operators Π_v and Π_i in the numerator of Equation 1.5.4 into the exponential expression

exp[x_vi(β_v - δ_i)],

they become the summation operators Σ_v and Σ_i so that

Π_v Π_i exp[x_vi(β_v - δ_i)] = exp[ Σ_v Σ_i x_vi(β_v - δ_i) ].

Then, since

Σ_v Σ_i x_vi β_v = Σ_v r_v β_v

and

Σ_v Σ_i x_vi δ_i = Σ_i s_i δ_i,

Equation 1.5.4 becomes

P{((x_vi)) | (β_v), (δ_i)} = exp[ Σ_v r_v β_v - Σ_i s_i δ_i ] / Π_v Π_i [1 + exp(β_v - δ_i)]    [1.5.5]
v i

Equation 1.5.5 is important because it shows that in order to estimate the parameters (β_v) and (δ_i), we need only the marginal sums of the data matrix, (r_v) and (s_i). This is because that is the only way the data ((x_vi)) appear in Equation 1.5.5. Thus the person scores (r_v) and item scores (s_i) contain all the modelled information about person measures and item calibrations.
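Because only the margins matter, the bookkeeping is trivial. Here is a small sketch (Python; the matrix shown is an invented toy example, not the Knox Cube Test data) that forms the person scores r_v and item scores s_i from a 0/1 response matrix.

    # Toy response matrix ((x_vi)): rows are persons v, columns are items i.
    responses = [
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [1, 0, 0, 0],
    ]

    # Person scores r_v: row sums.
    person_scores = [sum(row) for row in responses]      # [2, 3, 1]

    # Item scores s_i: column sums.
    item_scores = [sum(col) for col in zip(*responses)]  # [3, 2, 1, 0]

    print(person_scores, item_scores)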

Finally, the numerator of Equation 1.5.5 can be factored into two parts so that the model probability of the data matrix becomes

P{((x_vi)) | (β_v), (δ_i)} = [exp(Σ_v r_v β_v)] [exp(-Σ_i s_i δ_i)] / Π_v Π_i [1 + exp(β_v - δ_i)]    [1.5.6]

Equation 1.5.6 is important because it shows that the person and item parameters can be estimated independently of one another. The separation of

(Σ_v r_v β_v)

and

(Σ_i s_i δ_i)

in Equation 1.5.6 makes it possible to condition either set of parameters out of Equation 1.5.6 when estimating the other set. This means, in the language of statistics, that the scores (r_v) and (s_i) are sufficient for estimating the person measures and the item calibrations.

Because of this we can use the person scores (r_v) to remove the person parameters (β_v) from Equation 1.5.6 when calibrating items. This frees the item calibrations from the modelled characteristics of the persons and in this way produces sample-free item calibrations. As for measuring persons, we could use the item scores (s_i) to remove the item parameters (δ_i) from Equation 1.5.6. When we come to person measurement, however, we will find it more convenient to work directly from the estimated item calibrations (d_i).

There are several ways that Equation 1.5.6 can be used to estimate values for β_v and δ_i. The ideal way is to use the sufficient statistics for persons (r_v) to condition the person parameters (β_v) out of the equation. This leaves a conditional likelihood involving only the item parameters (δ_i) and they can be estimated from this conditional likelihood (Fischer and Scheiblechner, 1970; Andersen, 1972a, 1972b; Wright and Douglas, 1975b, 1977b; Allerup and Sorber, 1977; Gustafsson, 1977).

But this ideal method is impractical and unnecessary. Computing times are excessive. Round-off errors limit application to tests of fifty items at most. And, in any case, results are numerically equivalent to those of quicker and more robust methods. A convenient and practical alternative is to use Equation 1.5.6 as it stands. To learn more about this unconditional estimation of item parameters see Wright and Panchapakesan (1969), Wright and Douglas (1975b, 1977a, 1977b) and Chapter 3, Section 3.4 of this book.

Even this unconditional method, however, is often unnecessarily detailed and costly for practical work. If the persons we use to calibrate items are not too unsymmetrically distributed in ability and not too far off target, so that the impact of their ability distribution can be more or less summarized by its mean and variance, then we can use a very simple and workable method for estimating item difficulties. This method, called PROX, was first suggested by Leslie Cohen in 1973 (see Wright and Douglas, 1977a; Wright, 1977).

1.6 A SIMPLE USEFUL ESTIMATION PROCEDURE

Three methods of parameter estimation will be used in this book. The general unconditional method called UCON requires a computer and a computer program such as BICAL (Wright and Mead, 1976). UCON is discussed and illustrated in Chapter 3. A second method called UFORM, which can be done by hand with the help of the simple tables given in Appendix C, is discussed and applied in Chapter 7. The third method, PROX, is completely manageable by hand. In addition the simplicity of PROX helps us to see how the Rasch model works to solve measurement problems. The derivations of the UFORM and PROX equations are given in Wright and Douglas (1975a, 1975b).

PROX assumes that person abilities (β_v) are more or less normally distributed with mean M and standard deviation σ and that item difficulties (δ_i) are also more or less normally distributed with average difficulty H and difficulty standard deviation ω.

If

β_v ~ N(M, σ²)

and

δ_i ~ N(H, ω²),

then for any person v with person score r_v on a test of L items it follows that

b_v = H + X ln[r_v/(L - r_v)]    [1.6.1]

and for any item i with item score s_i in a sample of N persons it follows that

d_i = M + Y ln[(N - s_i)/s_i]    [1.6.2]

The coefficients X and Y are expansion factors which respond in the case of X to the difficulty dispersion of items and in the case of Y to the ability dispersion of persons. In particular

X = (1 + ω²/2.89)^½    [1.6.3]

and

Y = (1 + σ²/2.89)^½    [1.6.4]

The value 2.89 = 1.7² comes from the scaling factor 1.7 which brings the logistic ogive into approximate coincidence with the normal ogive. This is because the logistic ogive for values of 1.7z is never more than one percent different from the normal ogive for values of z.

The estimates b_v and d_i have standard errors

SE(b_v) = X [L/r_v(L - r_v)]^½    [1.6.5]

SE(d_i) = Y [N/s_i(N - s_i)]^½    [1.6.6]



This estimation method can be applied directly to observed item scores (s_i) by calculating the sample score logit of item i as

x_i = ln[(N - s_i)/s_i]    [1.6.7]

and the item score logit of person v as

y_v = ln[r_v/(L - r_v)]    [1.6.8]

The expansion factors X and Y are then estimated by the expressions

X = [(1 + U/2.89) / (1 - UV/8.35)]^½,    [1.6.9]

for the person logit expansion factor, and

Y = [(1 + V/2.89) / (1 - UV/8.35)]^½,    [1.6.10]

for the item logit expansion factor.

In these expressions 2.89 = 1.7² and 8.35 = 2.89² = 1.7⁴ and

U = (Σ_i x_i² - L x.²) / (L - 1),    [1.6.11]

the item logit variance, and

V = (Σ_v y_v² - N y.²) / (N - 1),    [1.6.12]

the person logit variance.

To complete this estimation, we set the test center at zero so that H = 0. Then

d_i = M + Y x_i = Y(x_i - x.)    [1.6.13]

for each item difficulty, and

b_v = H + X y_v = X y_v    [1.6.14]

for each person ability.

Standard errors are

SE(d_i) = Y [N/s_i(N - s_i)]^½ ≈ 2.5/N^½    [1.6.15]

and

SE(b_v) = X [L/r_v(L - r_v)]^½ ≈ 2.5/L^½    [1.6.16]

Finally the estimated person sample mean and standard deviation become

M = -Y x.    [1.6.17]

σ = 1.7(Y² - 1)^½    [1.6.18]
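To make the arithmetic of Equations 1.6.7 through 1.6.16 concrete, here is a compact sketch of the PROX calculations (Python; an illustration of ours with invented scores, not data or code from the book, and it assumes the data have already been edited so that no score is zero or perfect).

    import math

    def prox(item_scores, person_scores, N, L):
        """PROX estimates following Equations 1.6.7-1.6.16 (a sketch, not BICAL)."""
        # Item and person logits (1.6.7, 1.6.8)
        x = [math.log((N - s) / s) for s in item_scores]
        y = [math.log(r / (L - r)) for r in person_scores]

        x_mean = sum(x) / L
        y_mean = sum(y) / N
        U = sum((xi - x_mean) ** 2 for xi in x) / (L - 1)  # item logit variance (1.6.11)
        V = sum((yi - y_mean) ** 2 for yi in y) / (N - 1)  # person logit variance (1.6.12)

        # Expansion factors (1.6.9, 1.6.10)
        X = math.sqrt((1 + U / 2.89) / (1 - U * V / 8.35))
        Y = math.sqrt((1 + V / 2.89) / (1 - U * V / 8.35))

        # Centered item difficulties and person measures (1.6.13, 1.6.14)
        d = [Y * (xi - x_mean) for xi in x]
        b = [X * yi for yi in y]

        # Standard errors (1.6.15, 1.6.16)
        se_d = [Y * math.sqrt(N / (s * (N - s))) for s in item_scores]
        se_b = [X * math.sqrt(L / (r * (L - r))) for r in person_scores]
        return d, se_d, b, se_b

    # Example call with invented scores for a 5-item test taken by 6 persons:
    d, se_d, b, se_b = prox(item_scores=[5, 4, 3, 2, 1],
                            person_scores=[1, 2, 2, 3, 4, 3], N=6, L=5)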



Once we have estimated b_v and d_i we can use them to obtain the difference between what the model predicts and the data we have actually observed. These residuals from the model are calculated by estimating the model expectation at each x_vi from b_v and d_i and subtracting this expectation from the x_vi which was observed. The model expectation for x_vi is

E{x_vi} = π_vi

with model variance

V{x_vi} = π_vi(1 - π_vi)

where

π_vi = exp(β_v - δ_i) / [1 + exp(β_v - δ_i)].

A standardized residual would be

z_vi = (x_vi - π_vi) / [π_vi(1 - π_vi)]^½.    [1.6.19]

If the data fit the model this standardized residual ought to be distributed more or less normally with mean zero and variance one.

If we estimate π_vi from p_vi where

p_vi = exp(b_v - d_i) / [1 + exp(b_v - d_i)]    [1.6.20]

then we can use the error distributions

z_vi ~ N(0, 1) and

z_vi² ~ χ²₁

as guidelines for evaluating the extent to which any particular set of data can be managed by our measurement model.

We can calculate the sum of their squared residuals z_vi² for each person. According to the model this sum of squared normal deviates should approximate a chi-square distribution with about (L - 1) degrees of freedom. This gives us a chi-square statistic

Σ_i z_vi² = c_v² ~ χ²_fv    [1.6.21]

with degrees of freedom

f_v = (L - 1)(N - 1)/N    [1.6.22]

and a mean square statistic

v_v = c_v²/f_v ~ F_fv,∞    [1.6.23]

which approximates an F-distribution when the person's responses fit the model.

The sum of squared residuals for each item can be used in the same way to evaluate item fit. For items

Σ_v z_vi² = c_i² ~ χ²_fi    [1.6.24]

with

f_i = (N - 1)(L - 1)/L    [1.6.25]

and

v_i = c_i²/f_i ~ F_fi,∞.    [1.6.26]

Finally, since x_vi can only equal one or zero, we can use the definition of p_vi given in Equation 1.6.20 to calculate z_vi and z_vi² directly as

z_vi = (2x_vi - 1) exp[(2x_vi - 1)(d_i - b_v)/2]    [1.6.27]

and

z_vi² = exp[(2x_vi - 1)(d_i - b_v)]    [1.6.28]

This relation can also be worked backwards. If we already have a z_vi² and wish to calculate the probability of the observed response x_vi to which it refers, in order to decide whether or not that response is too improbable to believe, then we can use

P{x_vi | b_v, d_i} = 1/(1 + z_vi²).    [1.6.29]

In contrast with the p_vi of Equation 1.6.20, which is the estimated probability of a correct answer, the probability of Equation 1.6.29 applies to x_vi whatever value it takes, whether x_vi = 1 for a correct answer or x_vi = 0 for an incorrect one.
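A small sketch of these fit calculations (Python; our own illustration, not the BICAL implementation) may help. It uses Equation 1.6.20 for the estimated probability, Equation 1.6.19 for the standardized residual, and accumulates the person mean square of Equations 1.6.21 through 1.6.23. The example values at the bottom are invented.

    import math

    def p_correct(b, d):
        # Equation 1.6.20
        return math.exp(b - d) / (1.0 + math.exp(b - d))

    def standardized_residual(x, b, d):
        # Equation 1.6.19; x is the observed 0/1 response
        p = p_correct(b, d)
        return (x - p) / math.sqrt(p * (1.0 - p))

    def person_mean_square(responses, b, item_difficulties, N):
        """Person fit mean square v_v = c_v^2 / f_v (Equations 1.6.21-1.6.23)."""
        L = len(item_difficulties)
        c2 = sum(standardized_residual(x, b, d) ** 2
                 for x, d in zip(responses, item_difficulties))
        f = (L - 1) * (N - 1) / N          # degrees of freedom (1.6.22)
        return c2 / f

    # Invented example: a person measured at b = 0.5 logits answering five items.
    difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]
    responses    = [1, 1, 1, 0, 0]
    print(round(person_mean_square(responses, 0.5, difficulties, N=34), 2))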

1.7 HOW TRADITIONAL TEST STATISTICS APPEAR IN RASCH MEASUREMENT

Sections 1.1, 1.2 and 1.3 discuss the purpose of tests, the use of test scores and the problems of generality and linearity in making measures. Sections 1.4, 1.5 and 1.6 describe a simple and practical solution to these measurement problems. Because the mathematics are new it might seem that using the Rasch model will take us far away from the traditional item statistics with which we are familiar. This is not so.

Applying the Rasch model in test development gives us new versions o f the old
statistics. These new statistics contain all o f the old familiar information, but in a form
which solves most o f the measurement problems that have always beset traditional
test construction. T o show this we will examine the three most common traditional item
and person statistics and see how closely they relate to their corresponding Rasch mea­
surement statistics.

The Item P-Value

The most familiar traditional item statistic is the item "p-value." This is the proportion of persons in a specified sample who get that item correct. The PROX estimation equation (1.6.2) gives us a convenient way to formulate the relationship between the traditional item p-value and Rasch item difficulty. If the p-value for item i is expressed as

p_i = s_i/N

in which s_i is the number of persons in the sample of N persons who answered item i correctly, then the PROX estimated Rasch item difficulty is

d_i = M + (1 + σ²/2.89)^½ ln[(1 - p_i)/p_i]    [1.7.1]

Equation 1.7.1 shows that the Rasch item difficulty d_i is in a one-to-one relation with the item p-value represented by p_i. It also shows that this one-to-one relation is curvilinear and involves the ability mean M and variance σ² of the calibrating sample.

What the Rasch model does is to use the logit function

ln[(1 - p_i)/p_i]

to transform the item p-value, which is not linear in the implied variable, into a new value which is. This new logit value expresses the item difficulty on an equal interval scale and makes the subsequent correction of the item's p-value for the ability mean M and variance σ² of the calibrating sample easy to accomplish.

This correction is made by scaling the logit to remove the effects of sample variance σ² and translating this scaled logit to remove the effects of sample mean M. The resulting Rasch item difficulties are not only on an equal interval scale but they are also freed of the observed ability mean and variance of the calibrating sample. Just as the item p-value p_i has a binomial standard error of

SE(p_i) = [p_i(1 - p_i)/N]^½    [1.7.2]

so the PROX item difficulty d_i has its own closely related standard error of

SE(d_i) = (1 + σ²/2.89)^½ [1/N p_i(1 - p_i)]^½.    [1.7.3]

But there are two important differences between Equations 1.7.2 and 1.7.3. Unlike the p-value standard error in Equation 1.7.2, the Rasch standard error in Equation 1.7.3 is corrected for the ability variance σ² of the calibrating sample. The second difference between these two formulations is more subtle, but even more important.

The traditional item p-value standard errors in Equation 1.7.2 are maximum in the middle at p-values near one-half and zero at the extremes at p-values of zero or one. This makes it appear that we know the most about an item, that is, have the smallest standard error for its p-value, when, in fact, we actually know the least. This is because the item p-value is focused on the calibrating sample as well as on the item. As the sample goes off target for the item, the item p-value nears zero or one and its standard error nears zero. This assures us that the item p-value for this particular sample is extreme but it

tells us nothing else about the item. Thus even though our knowledge of the item's p-value is increasing, our information concerning the actual difficulty of the item is decreasing. When item p-values are zero or one, the calibrating sample which was intended to tell us how that item works is shown to be too able or too unable to interact with the item. We know exactly in which direction to look for the item difficulty, but we have no information as to where in that direction it might be.

In contrast, the Rasch standard error for d_i varies in a more reasonable manner. The expression p_i(1 - p_i), which goes to zero as p_i goes to zero or one, appears in the denominator of Equation 1.7.3 instead of in the numerator, as it does in Equation 1.7.2. Therefore, the Rasch standard error is smallest at p_i = .5, where the sample is centered on the item and thus gives us the most information about how that item functions. At the extremes, however, where we have the least information, the Rasch standard error goes to infinity, reminding us that we have learned almost nothing about that item from this sample.

The Item Point-Biserial

The second most widely used traditional item statistic is the point biserial correlation
between the sampled persons’ dichotomous responses to an item and their total test
scores. The item point-biserial has tw o characteristics which interfere with its usefulness
as an index o f how well an item fits with the set o f items in which it appears. First, there
is no clear basis fo r determining what magnitude item point-biserial establishes item
acceptability. Rejecting the statistical hypothesis that an item point-biserial is zero does
not produce a satisfactory statistical criterion fo r validating an item. The second inter­
fering characteristic is that the magnitude o f the point-biserial is substantially influenced
by the score distribution o f the calibrating sample. A given item ’s point-biserial is'largest
when the persons in the sample are spread out in scores and centered on that item. Con­
versely as the variance in person scores decreases or the sample level moves away from the
item level, so that the p-value approaches zero or one, the point-biserial decreases to zero
regardless o f the quality o f the item.

The Rasch statistic that corresponds in meaning to the item point-biserial is the
item ’s mean square residual given in Equation 1.6.26. This mean square residual is not
only sensitive to items which fail to correlate with the test score, but also to item point-
biserials which are unexpectedly large. This happens, fo r example, when an additional
and unmodelled variable produces a local interaction between a unique feature o f the
item in question and a corresponding idiosyncrasy among some members o f the cali­
brating sample.

In contrast with the point-biserial, the Rasch item mean square residual has a useful statistical reference distribution. The reference value for testing the statistical hypothesis that an item belongs in the test is a mean square of one with a standard error of (2/f)^½ for f degrees of freedom. Thus the extent to which an observed mean square exceeds the expected value of one can be tested for its statistical significance at whatever significance level is considered useful.

The Rasch item mean square is also very nearly indifferent to the ability distribution
o f the calibrating sample. This provides a test o f item fit which is focused on just those
sample and item characteristics which remain when the modelled values fo r item d iffi­
culty and person abilities are removed.

The Person Test Score

The most familiar traditional person statistic is the test score, the number of correct answers the person earns on the test taken. Once again we can use the PROX estimation procedure to show the connection between the traditional test score and Rasch person ability. Using the PROX estimation equation (1.6.1) we have

b_v = H + (1 + ω²/2.89)^½ ln[r_v/(L - r_v)]    [1.7.4]

with a standard error of

SE(b_v) = (1 + ω²/2.89)^½ [L/r_v(L - r_v)]^½    [1.7.5]

in which

r_v = the test score of person v,

L = the number of items in the test,

H = the average difficulty level of the test and

ω² = the variance in difficulties of the test items.

As with the item p-values, we see the logit function transforming the person scores, which are not linear in the variable they imply, into an approximately linear metric. We also see this logit being scaled for test width, which is represented in Equation 1.7.4 by the item difficulty variance ω², and then being shifted to adjust for test difficulty level H so that the resulting estimated person ability is freed from the local effects of the test and becomes a test-free measure.

The standard error of this measure is minimum at scores near 50 percent correct, where we have the most information about the person, and goes to infinity at scores of zero and 100 percent, where we have the least information about the person.

While traditional test practices almost always emphasize the analysis of item validity, hardly any attention is ever given to the validity of the pattern of responses leading to a person score. As far as we know no one calculates a person point-biserial coefficient in order to determine the relationship between the responses that person gives to each item and the supposedly relevant item p-values. This would be a reasonable way to apply the traditional point-biserial correlation coefficient to the supervision of person score validity.

The Rasch approach to person score validity is outlined in Equations 1.6.19 through 1.6.23 and discussed and illustrated at length in Chapters 4 and 7.

There are other connections that can be made between traditional test statistics and Rasch statistics. We could review here the various ways that traditional test reliability and validity, norm referencing, criterion referencing, form equating and mastery testing are handled in Rasch measurement. But each of these topics deserves a thorough discussion and that, in fact, is the purpose of the chapters which follow. Our next step now is to see how the PROX estimation procedure works to solve a simple problem in test construction.
2 ITEM CALIBRATION BY HAND

2.1 INTRODUCTION

This chapter describes and illustrates in detail an extremely simple procedure for the Rasch calibration of test items. The procedure, called PROX, approximates the results obtained by more elaborate and hence more accurate procedures extremely well. It achieves the basic aims of Rasch item analysis, namely linearization of the latent scale and adjustment for the local effects of sample ability distribution. The assumption which makes PROX simple is that the effects on item calibration of sample ability distribution can be adequately accounted for by just a mean and standard deviation. This assumption makes PROX so simple that it can easily be applied by hand.

In practice, it will often be convenient to let item calibration be done by computer. However, PROX provides an opportunity to illustrate Rasch item analysis in minute detail, thereby exposing the process involved to complete comprehension. And where computing facilities are remote, or it is urgent to check computer output for plausibility, PROX provides a method for calibrating items which requires nothing more than the observed distributions of item score and person score, a hand calculator (or adding machine) and paper and pencil.

The data for illustrating PROX come from the administration of the 18-item Knox Cube Test, a subtest of the Arthur Point Scale (Arthur, 1947), to 35 students in Grades 2 to 7. Our analysis of these data shows how Rasch item analysis can be useful for managing not only the construction of national item banks but also the smallest imaginable measurement problem, i.e., one short test given to one roomful of examinees.

Using student correct/incorrect responses to each item of the test, we work out in detail each step of the procedure for PROX item analysis. Then, Chapter 3 reviews comparable computer analyses of the same data by both the PROX procedure and the more accurate UCON procedure used in most computer programs for Rasch item analysis. These detailed steps offer a systematic illustration of the item analysis procedure with which to compare and by which to understand computer outputs. They also demonstrate the ease of hand computations using PROX (PROX is derived and described at length in Wright and Douglas, 1975b, 1976, 1977a; Cohen, 1976, and Wright, 1977). Finally, they illustrate the empirical development of a latent trait or variable. Each step moves from the observed data toward the inferred variable, from the confines of the observed test-bound scores to the reaches of the inferred test-free measurements.

2.2 THE KNOX CUBE TEST

While the Arthur Point Scale covers a variety of mental tasks, the Knox Cube Test implies a single latent trait. Success on this subtest requires the application of visual attention and short-term memory to a simple sequencing task. It appears to be free from school-related tasks and hence to be an indicator of nonverbal intellectual capacity.


The Knox Cube Test uses five one-inch cubes. Four of the cubes are fixed two inches apart on a board, and the fifth cube is used to tap a series on the other four. The four attached cubes will be referred to, from left to right, as "1," "2," "3" and "4" to avoid confusion when specifying any particular series to be tapped. In the original version of the test used for this example, there are 18 such series going from the two-step sequences (1-4) and (2-3) to the seven-step sequence (4-1-3-4-2-1-4). Usually, a subject is administered this test twice with another subtest from the battery intervening. However, we need use only the first administration for our analysis.

The 18 series are given in Figure 2.2.1. These are the 18 "items" of the test. Note that Items 1 and 2 require a two-step sequence; Items 3 through 6, a three-step sequence; Items 7 through 10, a four-step sequence; Items 11 through 13, a five-step sequence; Items 14 through 17, a six-step sequence; and Item 18, a seven-step sequence.

FIGURE 2.2.1

ITEM NAME AND TAPPING ORDER FOR THE KNOX CUBE TEST

ITEM NAME     TAPPING ORDER

 1            1 4
 2            2 3
 3            1 2 4
 4            1 3 4
 5            2 1 4
 6            3 4 1
 7            1 4 3 2
 8            1 4 2 3
 9            1 3 2 4
10            2 4 3 1
11            1 3 1 2 4
12            1 3 2 4 3
13            1 4 3 2 4
14            1 4 2 3 4 1
15            1 3 2 4 1 3
16            1 4 2 3 1 4
17            1 4 3 1 2 4
18            4 1 3 4 2 1 4

2.3 THE DATA FOR ITEM ANALYSIS

The responses of 35 students to a single administration of the 18-item Knox Cube Test are given in Table 2.3.1. These responses are arranged in a person-by-item data matrix. A correct response by a student to an item is recorded as a 1, and an incorrect response as a 0. The items have been listed across the top in the order of administration.

Student scores, the number o f correct responses achieved by each student, are given
at the end o f each row in the last column on the right. Item scores, the total number o f
correct responses to each item, are given at the bottom o f each column.

Inspection o f Table 2.3.1 shows that the order o f administration is very close to the
order o f difficulty. Items 1, 2 and 3 are answered correctly by all students. A second,
slightly greater, level o f d ifficu lty is observed in Items 4 through 9. Then Items 10 and 11
show a sharp increase in difficulty. Items 12 through 17 are answered correctly by only a
few students, and no student succeeds on Item 18. Only 12 students score successfully at
least once on Items 12 through 17, and only five o f these students do one or more o f the
six-tap items successfully.

2.4 CALIBRATING ITEMS AND MEASURING PERSONS

The general plan for accomplishing item analysis begins with editing the data in
Table 2.3.1 to remove persons and items fo r which no definite estimates o f ability or
d ifficu lty can be made, i.e., those with all correct or all incorrect responses. This means
that Person 35 and Items 1, 2, 3 and 18 must be set aside, leaving 34 persons and 14
items for analysis. Then the remaining inform ation about persons and items is
summarized into a distribution o f person scores and a distribution o f item scores.

Next these score distributions are rendered as proportions of their maximum possible value and their frequency of occurrence is recorded. The proportions are then converted to log odds, or logits, by taking for items the natural log of the proportion incorrect divided by the proportion correct and for persons the natural log of the proportion of successes divided by the proportion of failures. This converts proportions, which are bounded by 0 and 1, to a new scale which extends from -∞ to +∞ and is linear in the underlying variable.

For "item difficulty" this variable increases with the proportion of incorrect responses. For "person ability" it increases with the proportion of correct responses. The mean and variance for each distribution of logits are then computed, and the mean item logit is used to center the item logits at zero. This choice of origin for the new scale is inevitably arbitrary but must be made. Basing it on items rather than on persons and placing it in the center of the current items is natural and convenient.

The item logit and person logit variances are used to calculate tw o expansion factors,
one fo r items and one fo r persons. These factors are used to calculate the final
sample-free item difficulties and test-free person abilities. They are needed because the
apparent relative difficulties o f the items depend upon how dispersed in ability the
sample o f persons is. The more dispersed the persons, the more similar in difficu lty will
items appear. This is also true for apparent ability. The more dispersed the test in item
difficulty, the more similar in ability will persons appear. These effects o f sample spread
and test width must be removed from the estimates o f item difficu lty and person ability,
if these estimates are to be made sample-free and test-free.

Finally, the standard errors o f these estimates are calculated. The standard errors
are needed to assess the precision o f the estimates. T h ey depend on the same expansion
factors plus the extent to which the item d ifficu lty is centered among the person abilities
and the person ability is centered among the item difficulties. The more that items or
persons are centered on target, the more precise are their estimates and hence the smaller
their standard errors.

TABLE 2.3.1

ORIGINAL RESPONSES OF 35 PERSONS TO 18 ITEMS ON THE KNOX CUBE TEST

[The person-by-item matrix of 0/1 responses, with person scores in the right-most column and item scores in the bottom row, is not reproduced here.]

Step 1. Organizing the Data Matrix

The data matrix in Table 2.3.1 has been rearranged in Table 2.4.1 so that person
scores are ordered from low to high with their respective proportions given in the
right-most column, and item scores are ordered from high to low with their proportions
given in the bottom row.

Step 2. Editing the Data Matrix

The data matrix o f person-by-item responses in Table 2.4.1 has also been edited by
removing all items that were answered correctly by everyone or no one, and by removing
all persons who had perfect scores or who had not answered any items correctly.

The boundary lines drawn in Table 2.3.1 show the items and persons removed by
the editing process. Items 1, 2 and 3 were removed because they were answered correctly
by everyone. Rem oving these three items then brought about the removal o f Person 35
because this person had only these three items correct and hence none correct after these
items were removed. Item 18 was removed because no person answered this item
correctly.

Editing a data matrix may require several such cycles because removing items can
necessitate removing persons and vice versa. F or example, had there been a person who
had succeeded on all but Item 18, then removal o f Item 18 would have le ft this person
with a perfect score on the remaining items and so that person would also have had to be
removed.

Why were some items and some persons removed? When no one in a sample of persons gets an item correct, that shows that the item is too difficult for this sample of persons. However, no further information is available as to just how much too difficult it actually is. When everyone gets an item correct, that shows that the item is too easy for these persons, but again, no further information is available as to exactly how much too easy the item actually is. To make a definite estimate for a very easy item we must find at least one measurable person who gets it incorrect, and for a very hard item, at least one measurable person who gets it correct. That is, we must "bracket" the item between persons at least one of whom is more and at least one of whom is less able than the item is difficult. Of course, only one person below a very easy item or above a very hard one does not give a very precise estimate of that item's difficulty.

Thus, w e have insufficient data in our example to evaluate the extreme Items 1, 2, 3
and 18. We know that Items 1, 2 and 3 appear very easy and that Item 18 appears to be
very hard fo r these persons, but we d o not have enough inform ation to specify definite
estimates o f the difficulties o f these four items.

As for extreme persons, do persons with a zero score know nothing? A re scores o f
100% indicative o f persons who “ know it all” or have they only answered easy questions?
T o make a definite estimate fo r a person, we must bracket that person between items that
are both easier and harder than the person is able.

The boundary scores o f zero and 100%, whether for items or for persons, represent
incomplete information. They tell us in which direction to look for an estimate o f the
person’s ability or the item ’s difficulty, but they do not tell us how far to go in that
direction. For sufficient information to make a definite estimate o f where the person or

the item is on the latent variable, w e must find some items to o easy and some items too
hard for these persons, and some persons to o smart and others too dumb for these items,
so that each item and person is bracketed by observations. Then we can make an estimate
o f where they are on the variable.

Step 3. Obtaining Initial Item Calibrations

From the edited data matrix in Table 2.4.1, we build a grouped distribution o f the
10 different item scores and their logits incorrect, and compute the mean and variance o f
the distribution o f these item logits over the test o f 14 items. This is done in Table 2.4.2.

EXPLANATION OF TABLE 2.4.2 NOTATION AND FORMULAE

Column 1 of Table 2.4.2 gives the item names collected into each item score group.

Column 2 gives the item score which characterizes each item score group, indexed i = 1, G. Since there are 10 different item scores in this example, G = 10 and the item score group index i goes from 1 to 10.

Column 3 gives the frequency of items at each score, f_i. The sum of these frequencies over the G = 10 item score groups comes to the L = 14 items being calibrated: L = Σ f_i.

Column 4 converts the item scores into proportions correct among the sample of N = 34 persons: p_i = s_i/N.

Column 5 is the conversion of proportion correct p_i into the proportion incorrect: 1 - p_i = (N - s_i)/N.

Column 6 is the conversion of this proportion into logits incorrect: x_i = ln[(1 - p_i)/p_i]. Each item score group logit is the natural log of its proportion incorrect divided by its proportion correct. This conversion is facilitated by the values of the logits ln[p/(1 - p)] given in Table 2.4.3.

Column 7 is the product of item frequency and logit incorrect, f_i x_i.

Column 8 is the product of item frequency and logit incorrect squared, f_i x_i².
TABLE 2.4.2

[The grouped distribution of item scores and the initial item calibrations is not reproduced here.]

*These values come from Table 2.4.3, where ln[.06/.94] = -2.75. Were these calculations made with s_i and N as in ln[(N - s_i)/s_i], then ln[2/32] = -2.77. The difference between -2.75 and -2.77 is due to the rounding in s_i/N = 32/34 = 0.941176 . . . ≈ 0.94.
TABLE 2.4.3

LOGITS ln[p/(1 - p)] FROM PROPORTIONS p

[The conversion table of logit values for proportions is not reproduced here.]
The mean and variance for the item logits in Column 6 are then computed from the values in Columns 7 and 8 and given beneath these columns:

x. = Σ_i f_i x_i / L

U = (Σ_i f_i x_i² - L x.²) / (L - 1)

Column 9 gives the values of Column 6 centered by subtracting their mean. These are the initial item calibrations ready to be corrected for the effect of sample spread.

In this example, the mean and variance have been computed from the values in Columns 7 and 8. Hand calibration can be facilitated even further by a short-cut expression for a standard deviation proposed by Mason and Odeh (1968). To do this, sum the item logits in Column 9 (or Column 6) for the top and bottom sixth of the items ordered by difficulty, and take the square of twice the difference of these sums divided by one less than the number of items.

For the data in Table 2.4.2:

a. One-sixth of the items is 14/6 = 7/3, or 2.33 items at each end.

b. The item logits incorrect for the top three items in Column 9 are 3.29, which times 2.33 is 3.29 x 2.33 = 7.67.

c. The item logit incorrect for the bottom item is -2.94 and for the next two items is -2.50. So, taking the lowest item, -2.94, plus 7/3 - 3/3 = 4/3 of the next two items, gives (-2.94) + (-2.50 x 1.33) = -6.27.

d. The difference between 7.67 and -6.27 is 13.94.

e. Twice this amount divided by the number of items minus one and squared becomes the variance estimate [2(13.94)/13]² = 4.6.

This short-cut value of 4.6 is somewhat smaller than 5.7, but the number of items is small and the distribution is flatter than the normal distribution assumed by the short-cut.

Completion of the steps in Table 2.4.2 provides initial values for item difficulties in preparation for the adjustment which will compensate for the effect of sample spread.
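The arithmetic of Step 3 is easy to script. The sketch below (Python; the item scores shown are invented placeholders standing in for Column 2, since the full Table 2.4.2 is not reproduced here) computes item logits incorrect, their mean and variance, and the centered initial calibrations of Column 9.

    import math

    N = 34  # persons remaining after editing

    # Invented item scores s_i (number correct out of N), one per item:
    item_scores = [32, 31, 30, 28, 27, 25, 24, 20, 12, 11, 8, 6, 4, 2]

    # Column 6: logits incorrect x_i = ln[(1 - p_i)/p_i] with p_i = s_i / N
    x = [math.log((N - s) / s) for s in item_scores]

    L = len(item_scores)                       # 14 items
    x_mean = sum(x) / L
    U = (sum(xi * xi for xi in x) - L * x_mean ** 2) / (L - 1)   # item logit variance

    # Column 9: initial calibrations, centered at the mean item logit
    initial_calibrations = [xi - x_mean for xi in x]
    print(round(x_mean, 2), round(U, 2))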

Step 4. Obtaining Initial Person Measures

In Table 2.4.4, we take identical steps with a grouped distribution of person scores in order to obtain the distribution of person score logits and hence initial values for the abilities that go with each possible score on the test.

EXPLANATION OF TABLE 2.4.4 NOTATION AND FORMULAE

Column 1 of Table 2.4.4 gives each possible person score from 1 to 13: r = 1, L - 1.


Column 2 gives the frequency of persons observed at each score, n_r. The total number of persons N = 34 equals the sum of these frequencies from r = 1 to r = 13: N = Σ_r n_r.

Column 3 is the proportion of each score on a test of L = 14 items: p_r = r/L.

Column 4 is the logit correct for that proportion, y_r = ln[p_r/(1 - p_r)], using Table 2.4.3.

Column 5 is the product of person frequency and logit correct, n_r y_r.

Column 6 is the product of the person frequency and logit correct squared, n_r y_r².

Column 7 repeats the values of Column 4 because, as far as this test is concerned, the score logits are already centered by the centering of the item logits. These are the initial person measures prior to correction for test width.

The mean and variance for the distribution of score logits over persons are given at the base of Table 2.4.4, as is the short-cut estimate of the variance.

Note that because we are interested not only in the scores observed in this sample but also in the measurements implied by any possible score which might be observed on this test of 14 items, unobserved scores of 1, 12 and 13 have been added to Table 2.4.4, together with the initial measures for these scores. The measurement model specifies what measures are equivalent to these scores even when no persons in the sample actually earn them.

To summarize the procedure thus far (now letting each item define its own item score group for notational simplicity, so that the item index i now runs from 1 to 14 items instead of from 1 to 10 item score groups):

For a test of L* items given to N* persons, we delete all items no one gets correct and no one gets incorrect, and all persons with none correct and none incorrect, until no such items or persons remain.

Letting s_i be the number of persons who got item i correct, for i = 1 through L, and n_r be the number of persons who got r items correct, for r = 1 through L - 1, we find the mean and variance over items of the log odds incorrect answers (or item logits incorrect) in the sample to each of the L items and the mean and variance over persons of the log odds correct answers (or score logits correct) on the test by each of the N persons.

Thus we obtain, for each item i, its logit incorrect answers among the sample of N persons,

x_i = ln[(N - s_i)/s_i]

and the mean and variance over L items of these item logits:

x. = Σ_i x_i / L

U = Σ_i (x_i - x.)² / (L - 1)

And we obtain for each score r its logit correct answers on the test of L items,

y_r = ln[r/(L - r)]

and the mean and variance over N persons of their score logits:

y. = Σ_r n_r y_r / N

V = Σ_r n_r (y_r - y.)² / (N - 1)

Now we are ready to adjust the initial calibrations and measures in Tables 2.4.2 and 2.4.4 for the local effects of the person ability distribution of the sample and the item difficulty distribution of the test.

Step 5. Calculating the Expansion Factors

We compute expansion factors for the initial estimates of item calibrations and person measures in order to correct the item calibrations for sample spread and the person measures for test width. From Tables 2.4.2 and 2.4.4 we have U = 5.72 and V = 0.46 (or the short-cut values U' = 4.6 and V' = 0.5).

a. The person ability expansion factor X¹ due to test width is

X = [(1 + U/2.89) / (1 - UV/8.35)]^½

  = [(1 + 5.72/2.89) / (1 - (5.72)(0.46)/8.35)]^½

  = [2.98/0.68]^½ = 2.09

or short-cut value

X' = [(1 + U'/2.9) / (1 - U'V'/8.4)]^½

   = [(1 + 4.6/2.9) / (1 - (4.6)(0.5)/8.4)]^½

   = [2.59/0.73]^½ = 1.9

¹For explanation see Chapter One, Section 1.6.


b. The item difficulty expansion factor Y due to sample spread is

Y = [(1 + V/2.89) / (1 - UV/8.35)]^½

  = [(1 + 0.46/2.89) / (1 - (5.72)(0.46)/8.35)]^½

  = [1.16/0.68]^½ = 1.31

or short-cut value

Y' = [(1 + V'/2.9) / (1 - U'V'/8.4)]^½

   = [(1 + 0.5/2.9) / (1 - (4.6)(0.5)/8.4)]^½

   = [1.17/0.73]^½ = 1.3
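These two calculations can be checked with a few lines of code. The sketch below (Python; simply a re-computation of the numbers already given in Step 5, not new results) reproduces X and Y from U = 5.72 and V = 0.46; the small difference in the last decimal of Y comes from the intermediate rounding used in the hand calculation.

    import math

    U, V = 5.72, 0.46   # item and person logit variances from Tables 2.4.2 and 2.4.4

    X = math.sqrt((1 + U / 2.89) / (1 - U * V / 8.35))   # person ability expansion factor
    Y = math.sqrt((1 + V / 2.89) / (1 - U * V / 8.35))   # item difficulty expansion factor

    print(round(X, 2), round(Y, 2))   # 2.09 and about 1.3 (the hand value is 1.31)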

Step 6. Correcting Item Calibrations for the Effect of Sample Spread

In Table 2.4.5 we obtain the final corrected item calibrations and their standard
errors from the sample spread expansion factor Y .

EXPLANATION OF TABLE 2.4.5 NOTATION AND FORMULAE

Column 1 gives the item name.

Column 2 repeats the initial item calibrations from Column 9 of Table 2.4.2: d_i° = x_i - x., i = 1, G. (Recall that when items are grouped by item score, then i runs from 1 to G, the number of item score groups, instead of from 1 to L indexing the individual items.)

Column 3 is the item difficulty expansion factor Y = 1.31 due to sample spread.

Column 4 is the corrected item calibrations obtained by multiplying each initial value in Column 2 by the expansion factor of 1.31: d_i = Y d_i° = Y(x_i - x.).

Column 5 reminds us of the number of persons who got the items in each item score group correct.

TABLE 2.4.6

FINAL ESTIMATES OF PERSON MEASURES
FOR ALL POSSIBLE SCORES ON THE 14 ITEM TEST

   1            2            3            4            5
POSSIBLE                  TEST WIDTH                 MEASURE
  TEST        INITIAL     EXPANSION     CORRECTED    STANDARD
 SCORE        MEASURE      FACTOR        MEASURE       ERROR
   r           b_r°          X          b_r = Xb_r°   SE(b_r)

   1          -2.59         2.09         -5.41         2.17
   2          -1.82         2.09         -3.80         1.60
   3          -1.32         2.09         -2.76         1.36
   4          -0.90         2.09         -1.88         1.24
   5          -0.58         2.09         -1.21         1.17
   6          -0.28         2.09         -0.59         1.13
   7           0.00         2.09          0.00         1.12
   8           0.28         2.09          0.59         1.13
   9           0.58         2.09          1.21         1.17
  10           0.90         2.09          1.88         1.24
  11           1.32         2.09          2.76         1.36
  12           1.82         2.09          3.80         1.60
  13           2.59         2.09          5.41         2.17

L = 14

SE(b_r) = X [L/r(L - r)]^½
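Table 2.4.6 can be regenerated directly from the expansion factor. The sketch below (Python; our own check, and its values differ from the table in the last digit or two because the hand calculation reads its logits from the rounded proportions of Table 2.4.3) computes the corrected measure and standard error for every possible score.

    import math

    L, X = 14, 2.09   # test length and person ability expansion factor

    for r in range(1, L):
        initial = math.log(r / (L - r))             # initial measure y_r
        corrected = X * initial                     # b_r = X * y_r
        se = X * math.sqrt(L / (r * (L - r)))       # SE(b_r) = X [L / r(L - r)]^1/2
        print(f"{r:2d}  {initial:6.2f}  {corrected:6.2f}  {se:5.2f}")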

EXPLANATION OF TABLE 2.4.5 NOTATION AND FORMULAE (continued)

Column 6 is the standard error of the corrected item calibrations: SE(d_i) = Y [N/s_i(N - s_i)]^½.

Step 7. Correcting Person Measures for the Effect of Test Width

In Table 2.4.6 we obtain the final corrected person measures and their standard errors from the test width expansion factor X.

EXPLANATION OF TABLE 2.4.6 NOTATION AND FORMULAE

Column 1 gives all possible scores because we want to have measures available for every possible test score from 1 to L - 1, whatever scores were actually observed: r = 1, L - 1.

Column 2 repeats the initial person measures from Column 7 of Table 2.4.4: b_r° = y_r.

Column 3 is the person ability expansion factor X = 2.09 due to test width.

Column 4 is the corrected person measures obtained by multiplying each initial value in Column 2 by the expansion factor of 2.09: b_r = X b_r° = X y_r.

Column 5 is the standard error of the corrected person measures: SE(b_r) = X [L/r(L - r)]^½.

2.5 DISCUSSION

The PROX item analysis procedure has been carefully described not only because it accomplishes item calibration and hence person measurement but also because it embodies in a logical and straightforward manner the simplest possible analysis of the interaction between items and persons. The decisive idea on which this analysis is based is that the probability of success is dominated by the person's ability and the item's difficulty. A more able person is supposed always to have a greater chance of success on any item than is a less able person. Any particular person is supposed always to have a better chance of success on an easy item than on a difficult one. To the extent this is the case the probability of any person's success on any item can be specified as the consequence of the difference between the person's position on a single variable and the item's position on that same variable. That is the Rasch model for item analysis and test construction and, indeed, the fundamental model implicit in the item analysis of all those who work with unweighted scores (Andersen, 1977).

A ll the inform ation observed about a person’s position on the variable, e.g., his
ability, is assumed to be expressed in his responses to the set o f items he takes as
summarized in the unweighted count o f the number o f items he gets correct. For item
d ifficu lty, the inform ation observed is assumed to be com pletely contained in the
unweighted count o f persons in the sample w h o responded correctly to that item.

O f course, this m odeling o f the interaction between person and item is an


idealization and can on ly approxim ate whatever actually happens. A ll ideas, however, are,
in the end, o n ly approxim ations o f, or abstractions from , experience. Their value can
o n ly be judged in terms o f their usefulness, that is, their demonstrable relevance to the
situation under study, and their sim plicity. This chapter has illustrated the sim plicity and
potential convenience o f Rasch item analysis. Its u tility is testified to by hundreds o f
applications. Our n ext task is to show how the same data just analyzed by hand com e out
when the P R O X procedure and its m ore elaborate and accurate parent procedure, U CON,
are applied to them by com puter.
3 ITEM CALIBRATION BY COMPUTER

3.1 IN T R O D U C T IO N

In this chapter we display and describe computer output fo r Rasch item calibration
using the estimation procedures P R O X and U C ON (Wright and Panchapakesan, 1969;
Wright and Douglas, 1975b, 1977a, 1977b). The K nox Cube Test data analyzed by hand
in Chapter 2 are used for illustration. The estimation procedures are performed by the
computer program B IC A L (Wright and Mead, 1976). The accuracy and utility o f the
hand calibration described in Chapter 2 are evaluated by comparing the “ hand” estimates
with those produced by a computer analysis o f the same data. Knowing the steps by
which these procedures can be applied by hand should facilitate understanding and using
the computer output.

We will move through the computer output step by step in order to bring out its
organization and use. B IC A L produces the output given in Tables 3.2.1 to 3.2.9. A
comparison o f the item and person statistics from P R O X by hand with those from P R O X
by computer is given in Tables 3.3.1 through 3.3.5.

3.2 B IC A L O U T P U T FOR A PRO X A N A L Y S IS O F TH E K N O X CUBE TES T D A T A

The first page o f the output, in Table 3.2.1, recaps the control specifications neces­
sary to apply the 1976 version o f B IC A L to this calibration jo b (fo r details consult the
manual that goes with your B IC A L program). A t the top we begin with the jo b title, the
various control parameters, the record (o r card) columns read, the test scoring key and a
copy o f the first person record read. Finally, the output reports that 18 items and 34 per­
sons went into this analysis.

TABLE 3.2.1

PROGRAM CONTROL SPECIFICATIONS

K N O X CUBE TEST

C O N TR O L PAR AM ETERS
N IT E M NGROP Ml NSC MAXSC LREC KCAB SCORE

18 10 1 17 21 1 0

C O LUM NS SELECTED

1 2 3 4
1* ......................

111111111111111111
K EY
111111111111111111
F IR S T SUBJECT
001111111100000000000

NUM BER OF ITEM S 18


NUM BER OF SUBJT 34


T he con trol parameters fo r this jo b were:

1. Num ber o f items (N IT E M ): 18 There were originally 18 items in the KCT.


2. Smallest subgroup size (N G R O P ): 10 Subgroups o f at least 10 persons are to be
form ed fo r analyzing item fit.
3. Minim um score (M IN S C ): 1 The minimum score to be used is 1.
4. M axim um score (M A X S C ): 17 The maximum score to be used is 17.
5. R ecord length to be read (L R E C ): 21 The data comes from the first 21 positions
o f each person record. [T h e column select
card (listed in Table 3.2.1 under “ C O LU M N
S E LE C T E D ” ) specifies the 18 columns
that contain these test responses.]
6. Calibration procedure (K C A B ): The calibration procedure to be used is
PROX.
[T h e selection code is: 1 = P R O X , 2 =
UCON]
7. Scoring op tion (S C O R E ): The data are already scored.
[T h e full control code is: 0 = data to be
scored dichotom ously according to key
supplied; 1 = data are successive integers;
2 = score data “ correct” , i f response value
equal to or less than key supplied, else
“ incorrect” ; 3 = score data “ correct” , if
response value equal to or greater than key
supplied, else “ incorrect” .]

Table 3.2.2 gives each item ’s response frequencies fo r each response value. This table
can accom m odate up to five response values as specified by the user. A n “ unknown”
value colum n records the count o f all other values encountered. The final column is fo r
the key. T h e key marks the value specified as correct when the data is still to be scored.
As Table 3.2.2 shows the K C T data was entered in scored form . The appropriate key,
therefore, is the vector o f ‘ l ’s shown in Tables 3.2.1 and 3.2.2. Each item is identified
on the le ft b y its sequence number in the original order o f test items as read into B IC A L.
A four-character item name can also be used to id en tify test items. F o r the K C T w e have
named the items b y the number o f taps required.

Table 3.2.2 enables us to exam ine the observed responses fo r obvious disturbances
to our test plan and w ill often suggest possible explanations fo r gross misfits. The distri­
bution o f responses over m ultiple-choice distractors, fo r example, can reveal the undue
influence o f particular distractors. The effects o f insufficient tim e show up in the piling
up o f responses in the U N K N column tow ard the end o f the test. The effects o f wide­
spread inexperience in test taking show up in the pile-up o f U N K N responses in the first
one or tw o items o f the test.

We see again in Table 3.2.2 what w e already learned from Table 2.3.1, namely that
the first three items are answered correctly by all 34 persons, that Item 18 was n ot an­
swered correctly b y anyone and that there is a rapid shift from largely correct responses
to largely incorrect responses between Items 9 and 11. Since IT E M N A M E gives the
number o f taps in the series, w e see that this shift occurs when the task moves from a
series o f fo u r taps up to five taps.

TABLE 3.2.2

RESPONSE FREQUENCIES FOR EACH RESPONSE ALTER NATIVE

A L T E R N A T IV E RESPONSE FR EQ U ENC IES

SEQ ITEM 0 1 U NK N KEY


NUM NAM E

1 2 0 34 0 0 0 0 1
2 2 0 34 0 0 0 0 1
3 3 0 34 0 0 0 0 1
4 3 2 32 0 0 0 0 1
5 3 3 31 0 0 0 0 1
6 3 4 30 0 0 0 0 1
7 4 3 31 0 0 0 0 1
8 4 7 27 0 0 0 0 1
9 4 4 30 0 0 0 0 1
10 4 10 24 0 0 0 0 1
11 5 22 12 0 0 0 0 1
12 5 28 6 0 0 0 0 1
13 5 27 7 0 0 0 0 1
14 6 31 3 0 0 0 0 1
15 6 33 1 0 0 0 0 1
16 6 33 1 0 0 0 0 1
17 6 33 1 0 0 0 0 1
18 6 34 0 0 0 0 0 1

Table 3.2.3 reports the editing process. It summarizes the work o f the editing rou­
tine which successively removes person records with zero or perfect scores and items
correctly answered by all persons or n ot answered correctly by any persons, until all such
persons or items are detected and set aside. The editing process determines the final
matrix o f item-by-person responses that is analyzed.

Table 3.2.3 shows that initially there were no persons with perfect o r zero scores,
and that 18 items entered the run, with no person scoring below 1 o r above 17, leaving
34 persons fo r calibration (The 35th person appearing in Table 2.3.1 had already been
removed from the data deck by hand.). Items 1, 2 and 3 are then removed by the editing
process because they were answered correctly by all subjects and Item 18 is removed
because no one answered it correctly. A fte r this editing the calibration sample still con­
sists o f 34 subjects, but now only the 14 items which can be calibrated remain, with the
same minimum score o f 1 and a new maximum score o f 13.

Table 3.2.4 shows the distribution o f persons over the K C T scores. The histogram is
scaled according to the scale factor printed below the graph. The distribution o f person
scores gives a picture o f how this sample responded to these items. It shows how well the
items were targeted on the persons and how relevant the persons selected were fo r this
calibration. For the best calibration, persons should be more or less evenly distributed
over a range o f scores, around and above the center o f the test. In our sample we see a
symmetrical distribution around a modal score o f 7.

T A B L E 3.2 .3

TH E E D ITIN G PROCESS

K N O X CUBE T E S T

N U M B E R O F Z E R O SCORES 0
N U M B E R O F P E R F E C T SCORES 0

N U M B E R O F IT E M S S E L E C T E D 18
N U M B E R O F IT E M S N A M E D 18

SUBJECTS B ELO W 1 0
SUBJECTS A B O V E 17 0
SUBJECTS R E M A IN IN G 34

T O T A L SUBJECTS 34

REJE C TE D IT E M S

IT E M IT E M ANSW ERED
NUMBER NAME CORRECTLY

1 2 34 H IG H SCORE
2 2 34 H IG H SCORE
3 3 34 H IG H SCORE
18 6 0 LOW SCORE

SUBJECTS D E L E T E D = 0
SUBJECTS R E M A IN IN G = 34

IT E M S D E L E T E D = 4
IT E M S R E M A IN IN G = 14

M IN IM U M SCORE = 1
M A X IM U M SCORE = 13

T A B L E 3.2 .4

SAMPLE PERSON A B IL IT Y DISTRIBUTIO N

SCORE D IS T R IB U T IO N O F A B IL IT Y

COUNT P R O P O R T IO N 2

1 0 0.0
2 1 0 .03 X
3 2 0 .06 XX
4 2 0 .06 XX
5 2 0 .06 XX
6 3 0 .09 XXX
7 12 0 .36 xxxxxxxxxxxx
8 5 0 .15 xxxxx
9 4 0 .12 xxxx
10 1 0 .03 X
11 2 0 .06 XX
12 0 0.0
13 0 0.0
14 0 0.0

EACH X = 2 .94 P ER C EN T

Table 3.2.5 shows the distribution o f item easiness. The scale is given below the
graph. Items 4 through 9 are seen to be fairly easy, with Item 8 the most difficu lt among
them. Item 10 is slightly more difficult. Item 11 is much more difficult, follow ed by
Items 12 and 13, more difficult still. Finally, items 15, 16 and 17 are so difficu lt that
only one person answered these items correctly.

TABLE 3.2.5

TEST ITEM EASINESS DISTRIBUTION

ITEM D IS T R IB U T IO N OF EASINESS

CO U NT PROPORTION 2 4 6 8 10

4 32 0.94 x x x x x x x x x x x x x x xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
5 31 0.91 x x x x x x x x x x x x x x x x x xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
6 30 0.88 x x x x x x x x x x x x x x x x x x x xxxxxxxxxxxxxxxxxxxxxxxxx
7 31 0.91 x x x x x x x x x x x x x x x x x xxxxxxxxxxxxxxxxxxxxxxxxxxxxx
8 27 0.79 x x x x x x x x x x x x x xxxxxxxxxxxxxxxxxxxxxxxxxxx
9 30 0.88 x x x x x x x x x x x x x x x x x x x xxxxxxxxxxxxxxxxxxxxxxxxx
10 24 0.71 x x x x x x x x x x x x x x xxxxxxxxxxxxxxxxxxxxx
11 12 0.35 xxxxxxxxxxxxxxxxxx
12 6 0.18 xxxxxxxxx
13 7 0.21 xxxxxxxxxx
14 3 0.09 xxxx
15 1 0.03 X
16 1 0.03 X
17 1 0.03 X

EACH X = 2.00 PERCENT

Table 3.2.6 gives the estimation information. A t the top are the P R O X d ifficu lty
and ability expansion factors. N otice that these values are identical to those w e obtained
by hand in Chapter 2. Within the table, the first fou r columns give the item sequence
number, item name, item d ifficu lty and standard error.

TABLE 3.2.6

CALIBRATION BY PROX

D IF F IC U L T Y EXPANSIO N FACTOR 1.31


A B IL IT Y EXPANSIO N FACTOR 2.10

SEQUENCE ITE M ITEM S TA N D A R D


NUMBER NAM E D IF F IC U L T Y ERROR

4 3 -3 .8 6 5 0.833
5 3 -3 .2 9 4 0.691
6 3 -2 .8 7 6 0.608
7 4 -3 .2 9 4 0.691
8 4 -2 .0 0 7 0.485
9 4 -2 .8 7 6 0.608
10 4 -1 .3 8 8 0.430
11 5 0.547 0.410
12 5 1.767 0.514
13 5 1.518 0.485
14 6 2.805 0.691
15 6 4.321 1.160
16 6 4.321 1.160
17 6 4.321 1.160

Table 3.2.7 gives the lo git ability measure and its standard error fo r each score on
the K C T and the number o f persons in the sample obtaining each score. F or each raw
score we can see the sample frequency at that score and the ability and standard error
im plied by that score. Th e sample ability mean and standard deviation are given at the
b otto m o f the table.

T A B L E 3.2 .7

M EASUREM ENT BY PROX

C O M P L E T E SCORE E Q U IV A L E N C E T A B L E

RAW PERSON STANDARD


SCORE COUNT A B IL IT Y ERROR

13 0 5 .40 1.51
12 0 3 .77 1.11
11 2 2.73 0 .94
10 1 1.93 0 .86
9 4 1.24 0.81
8 5 0.61 0 .78
7 12 0 .0 0 0 .78
6 3 -0 .6 1 0 .78
5 2 - 1 .2 4 0.81
4 2 - 1 .9 3 0 .86
3 2 - 2 .7 3 0 .94
2 1 - 3 .7 7 1.11
1 0 - 5 .4 0 1.51

M E A N A B IL IT Y = - 0 .0 6
SD O F A B IL IT Y = 1.14

Table 3.2.8 provides item characteristic curves and fit statistics. The tests o f fit
include a division o f the calibration sample in to ability subgroups by score level. Three
groups have been made ou t o f the K C T sample, the 10 persons with scores from 1 to 6,
the 12 persons at score 7 and the 12 persons with scores from 8 to 13. Control over
group size and hence over the number o f groups used is asserted through the control
parameter N G R O P . A n evaluation o f item d ifficu lty invariance over these ability groups is
made b y com paring fo r each item its d ifficu lty estimates over the differen t groups. The
tests o f fit are thus sample-dependent. H owever, i f the d ifficu lty estimates they use pass
these tests, then those estimates are sample-free as far as that sample is concerned. O f
course, successful item fit in one sample does n ot guarantee fit in another. However, as
the ability groups within a given sample are arranged by scores, w e do obtain inform ation
about the stability o f item difficu lties over various abilities and therefore can see whether
our items are displaying sufficient invariance over these particular ability groups to
qu alify the items fo r use as instruments o f objective measurement.
TABLE 3.2.8

ITEM CHARACTERISTIC CURVES AND ANALYSIS OF FIT

[The body of this table did not survive reproduction. For each calibrated item it reports the observed proportion correct in each of the three ability score groups (the item characteristic curve), the departures of these proportions from the model, the total, between-group and within-group fit mean squares, the discrimination index and the point biserial correlation discussed below.]

In the ‘Item Characteristic Curve” panel o f Table 3.2.8 we have the proportion o f
correct answers given by each ability group to each item. The score range and mean
ability fo r each group are given at the b ottom o f each column. We expect these ICCs to
increase as w e m ove from le ft to right, from less able to more able score groups, and fo r
the m ost part we see that in Table 3.2.8 they do. However, Item 7 does show a rather
implausible pattern. A greater proportion o f persons get it correct in the low est score
group than in the m iddle one!

In the m iddle panel o f Table 3.2.8 w e have the differences in ICC proportions fo r
each ability group between those observed and those predicted by the Rasch measurement
m odel. Here we can see where the largest proportional departures occur and in which
direction they go. Again, Item 7 is out o f line with the other items, especially fo r the
low est ability group.

In the “ Analysis o f F it” panel o f Table 3.2.8 w e have a series o f fit mean squares.
These fit statistics are mean square standardized residuals fo r item-by-person responses
averaged over persons, and partitioned into tw o components, one between ability groups
and the other within ability groups. These mean squares increase in magnitude away from
a reference value o f 1 as the observed ICC departs from the expected ICC, i.e., when to o
many high-ability persons fail an easy item o r to o many low -ability persons succeed on a
d ifficu lt one. T h e statistical significance o f large values can be judged by comparing the
observed mean squares with their expected value o f 1 in terms o f the expected standard
errors given at the b ottom o f the table.

The “ to ta l” mean square evaluates the general agreement between the variable
defined b y the item and the variable defined b y all other items over the whole sample.
O n ly Item 7 is significantly ou t o f line, with an observed mean square o f 1.73, more than
three times its expected standard error o f 0.24 above its expected value o f 1.

T h e “ between-group” mean square evaluates the agreement between the observed


item characteristic curve and the best fittin g Rasch m odel curve over the ability sub­
groups. Again, Item 7 is ou t o f line with a mean square o f 3.89, m ore than three times
its standard error o f 0.82 above 1.

Th e “ within-group” mean square summarizes the degree o f m isfit remaining within


ability groups after the “ between-group” m isfit has been rem oved from the “ to ta l” . Here,
Item 7 shows a m isfit o f 1.52 against an expected value o f 1 and a standard error o f 0.25.
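These mean squares are, in essence, averages of squared standardized residuals, the residuals developed fully in Chapter 4. The following sketch of the "total" mean square for one item is our own illustration, not the BICAL computation, and it ignores the between/within partition, which applies the same idea within the ability score groups:

```python
import math

def total_fit_mean_square(responses, abilities, difficulty):
    """Total fit mean square for one item: squared standardized residuals
    for the item-by-person responses, averaged over persons.

    responses  -- list of 0/1 responses to this item, one per person
    abilities  -- matching list of estimated person abilities b_v
    difficulty -- estimated difficulty d_i of this item
    """
    total = 0.0
    for x, b in zip(responses, abilities):
        p = math.exp(b - difficulty) / (1.0 + math.exp(b - difficulty))
        z = (x - p) / math.sqrt(p * (1.0 - p))   # standardized residual
        total += z * z
    return total / len(responses)
```

Values near 1 are expected when an item fits; values several standard errors above 1, like the 1.73 observed for Item 7, signal misfit.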

Th e discrimination index shown in the n ext to last column o f Table 3.2.8 describes
the linear trend o f departures from the m odel across ability groups expressed around a
m odel value o f 1. When this index is near 1, then the observed and expected ICCs are
close together over the reference points defined by the ability grouping.

When the index is substantially less than 1, then the observed ICC is flatter than
expected and the particular item is failing to differentiate among abilities as well as the
other items do. This condition, of course, tends to go with a lower point biserial cor­
relation between item response and total test score. However, the discrimination index is
less influenced in its magnitude than the point biserial by how central the item is to the
sample or how dispersed in ability the sample is.

When the index is substantially greater than 1, then the item gives the appearance o f
differentiating abilities m ore distinctly than the average items in the test. The cause o f
this unusual “ discrim ination” must then be investigated. It is almost always found to be
caused b y a local interaction between a secondary characteristic o f the item and a

secondary characteristic o f the sample, a sample-dependent condition which, upon identi­


fication, is generally judged to be to o idiosyncratic to be useful in a general measurement
system.
The fit statistics in Table 3.2.8 show that Item 7 misfits both “ between” and “ within”
ability groups and that its item characteristic curve is on the flat side. N o other item
shows a significant misfit. Item 14 does show a low point biserial but its fit statistics are
not out o f line and, like Items 15, 16, and 17, its low point biserial is due primarily to
its difficulty fo r these persons. N otice how the magnitude o f the biserial correlation varies
widely with the level o f the ICC quite independently o f how well the items fit!
This leaves us with the misfit observed fo r Item 7. What shall we conclude about this
misfit? Is it due to a general flaw in Item 7 or is it due to an interaction between Item 7
and a few aberrant person response patterns? It could be that Item 7 functions satisfac­
torily with most persons and that the m isfit observed here can be traced to the irregular
responses o f just a few persons. Should that be the case we might decide to retain Item 7
in the test and to question, instead, the plausibility o f the response patterns o f these few
unusual persons. We will discuss item and person fit in detail in Chapter 4.

TA B LE 3.2.9

ITEM CALIBRATION SUMMARY


BY PROX

S E R IA L O R D ER D IF F IC U L T Y O R D E R

SEQ ITEM ITEM DISC F IT SEQ ITE M ITEM DISC F IT


NUM NAME D IF F IN D X MN SQ NUM NAM E D IF F IN D X MN SQ

4 3 -3 .8 7 1.30 0.49 4 3 -3 .8 7 1.30 0.49


5 3 -3 .2 9 1.35 0.64 5 3 -3 .2 9 1.35 0.64
6 3 -2 .8 8 1.07 0.90 7 4 -3 .2 9 0.48 1.73
7 4 -3.29 0.48 1.73 6 3 -2.88 1.07 0.90
8 4 -2 .0 1 1.49 0.48 9 4 -2 .8 8 1.44 0.27
9 4 -2 .8 8 1.44 0.27 8 4 -2 .0 1 1.49 0.48
10 4 -1 .3 9 1.40 0.83 10 4 -1 .3 9 1.40 0.83
11 5 0.55 1.04 0.71 11 5 0.55 1.04 0.71
12 5 1.77 0.80 0.72 13 5 1.52 1.49 0.44
13 5 1.52 1.49 0.44 12 5 1.77 0.80 0.72
14 6 2.80 0.81 0.91 14 6 2.80 0.81 0.91
15 6 4.32 1.33 0.17 16 6 4.32 1.33 0.17
16 6 4.32 1.33 0.17 17 6 4.32 1.33 0.17
17 6 4.32 1.33 0.17 15 6 4 .32 1.33 0.17

MEAN 0.00 1.19 0.62


S.D. 3.15 0.31 0.42
F IT O R D ER
SEQ ITEM ITEM DISC F IT PO INT
NUM NAME D IF F IN D X MN SQ BISER
16 6 4.32 1.33 0.17 0.33
17 6 4.32 1.33 0.17 0.33
15 6 4.32 1.33 0.17 0.33
9 4 -2 .8 8 1.44 0.27 0.61
13 5 1.52 1.49 0.44 0.58
8 4 -2 .0 1 1.49 0.48 0.69
4 3 -3 .8 7 1.30 0.49 0.40
5 3 -3 .2 9 1.35 0.64 0.42
11 5 0.55 1.04 0.71 0.54
12 5 1.77 0.80 0.72 0.42
10 4 -1 .3 9 1.40 0.83 0.54
6 3 -2 .8 8 1.07 0.90 0.40
14 6 2.80 0.81 0.91 0.20
7 4 -3 .2 9 0.48 1.73 0.23

Table 3.2.9 summarizes the item calibration inform ation in three useful arrangements.
We have there fo r each item its name, d ifficu lty, discrimination index and total fit mean
square listed first by serial order, second b y d ifficu lty order, and third by fit order. While
in the K C T exam ple w e have on ly a fe w items to deal with, on longer tests the convenient
reordering o f these items b y d ifficu lty and by fit helps us to find m isfitting items and to
grasp the pattern o f m isfit, i f there is one. In our example we see again that the item with
the greatest m isfit, Item 7, is identified fo r us at the b ottom o f the third panel o f Table 3.2.9.

3 .3 C O M P A R IN G P R O X B Y H A N D W IT H P R O X B Y C O M P U TE R

N o w w e can com pare the P R O X estimation results fo r item difficulties and person
measures obtained by hand with those produced by computer. The data on P R O X by
hand fo r item difficu lties and person measures comes from Tables 2.4.5 and 2.4.6 in
Chapter 2. T h e data on P R O X by com puter com e from Tables 3.2.6 and 3.2.7. These
data have been com piled in to Tables 3.3.1 and 3.3.2.

In Table 3.3.1, each item is listed with its calibration by hand and by computer. The
standard error fo r each item as com puted by hand and by com puter is also given. The
results fro m P R O X b y hand and P R O X by com puter are virtually the same.

TABLE 3.3.1

A COMPARISON OF ITEM CALIBRATIONS


AND TH E IR STANDARD ERRORS FOR
PROX BY HAND AND BY COMPUTER

C A L IB R A T IO N STA N D A R D ERROR

Item    Hand    Computer    Difference    Hand    Computer    Difference

4 -3 .9 -3 .9 0.0 1.0 0.8 0.2


5 -3 .3 -3 .3 0.0 0.8 0.7 0.1
6 -2 .9 -2 .9 0.0 0.7 0.6 0.1
7 -3 .3 -3 .3 0.0 0.8 0.7 0.1
8 -2 .0 -2 .0 0.0 0.6 0.5 0.1
9 -2 .9 -2 .9 0.0 0.7 0.6 0.1
10 -1 .4 -1 .4 0.0 0.5 0.4 0.1
11 +0.6 +0.5 0.1 0.5 0.4 0.1
12 +1.7 +1.8 -0 .1 0.6 0.5 0.1
13 +1.5 +1.5 0.0 0.6 0.5 0.1
14 +2.8 +2.8 0.0 0.8 0.7 0.1
15 +4.3 +4.3 0.0 1.3 1.2 0.1
16 +4.3 +4.3 0.0 1.3 1.2 0.1
17 +4.3 +4.3 0.0 1.3 1.2 0.1

MEAN 0 .00 0 .0 0 0 .82 0.71

SD 3 .13 3 .15 0 .29 0 .29

The ability measures and their standard errors are given in Table 3.3.2. Again, the
differences between the tw o methods are minimal. Only the standard errors o f measure­
m ent show a difference o f any magnitude. This difference is due to the use o f a more
accurate but also more laborious form ula in P R O X by computer. Thus, with the mild
exception o f the standard errors o f measurement, the very simple P R O X by hand and
P R O X b y com puter produce virtually the same results.

TABLE 3.3.2


A COMPARISON OF PERSON MEASURES
AND THEIR STANDARD ERRORS FOR PROX
BY HAND AND BY COMPUTER

MEASURE S T A N D A R D ERROR

Score Hand Computer Difference Hand Computer Difference

1 -5 .4 -5 .4 0.0 2.2 1.5 0.7


2 -3 .8 -3 .8 0.0 1.6 1.1 0.5
3 -2 .8 -2 .7 -0 .1 1.4 0.9 0.5
4 -1 .9 -1 .9 0.0 1.2 0.9 0.3
5 -1 .2 -1 .2 0.0 1.2 0.8 0.4
6 -0 .6 -0 .6 0.0 1.1 0.8 0.3
7 0.0 0.0 0.0 1.1 0.8 0.3
8 +0.6 +0.6 0.0 1.1 0.8 0.3
9 +1.2 +1.2 0.0 1.2 0.8 0.4
10 +1.9 +1.9 0.0 1.2 0.9 0.3
11 +2.8 +2.7 0.1 1.4 0.9 0.5
12 +3.8 +3.8 0.0 1.6 1.1 0.5
13 +5.4 +5.4 0.0 2.2 1.5 0.7

MEAN 0.00 0.00 1.42 0.98

SD 3.08 3.06 0.39 0.25

3.4 A N A L Y Z IN G KCT W IT H T H E UCON PRO CEDURE

N o w that we have seen how P R O X by hand compares with P R O X by computer, we


can turn to a slightly more elaborate and also more accurate procedure which is not
suitable fo r hand work but is convenient and economical to apply by computer. This is
the UCON procedure, developed by Wright and Panchapakesan in 1966 (1969) and
further reviewed and tested by Wright and Douglas in 1974 (1975b, 1977a) and Wright
and Mead in 1975 (1975, 1976). The computer output from U C O N is similar in form to
that from PR O X . The UCON analysis o f the K C T data is shown in Tables 3.4.1 through
3.4.4. Only those tables which contain results different from the P R O X analysis are
presented.

Table 3.4.1 gives the test items with their UCON calibrations and standard errors.
UCON uses P R O X item difficulties as its point o f departure and these are given in the far
right column o f the table. Table 3.4.2 gives the U C ON ability measure associated with
each score and the standard error fo r each measure. The larger standard errors at scores 7
and 8, fo r abilities between ± 2 logits are caused by the bimodal distribution o f item
difficulties shown in Table 3.4.1. Six o f the 14 items have difficulties below -3 .2 logits,
while another six have difficulties greater than +1.8 logits. This leaves only tw o items to
function in the 5 logit range between -3 .2 and +1.8 and the standard errors o f measure­
ment in that region are accordingly higher.

TABLE 3.4.1

CA LIBRA TIO N BY UCON

D IF F IC U L T Y E X P A N S IO N F A C TO R 1.31

A B IL IT Y E X P A N S IO N F A C T O R 2 .10

N U M B E R O F IT E R A T IO N S = 7

S EQ U E N C E IT E M IT E M STANDARD LA S T D IF F PRO X
NUMBER NAME D IF F IC U L T Y ERR O R CHANGE D IF F

4 3 -4 .1 8 6 0.8 1 6 -0 .0 2 5 -3 .8 6 5
5 3 -3 .6 4 8 0.7 0 9 -0 .0 2 3 -3 .2 9 4
6 3 -3 .2 2 0 0.6 4 7 -0 .0 2 1 -2 .8 7 6
7 4 -3 .6 4 8 0.7 0 9 -0 .0 2 3 -3 .2 9 4
8 4 -2 .2 4 1 0.5 4 7 -0 .0 1 5 -2 .0 0 7
9 4 -3 .2 2 0 0.6 4 7 -0 .0 2 1 -2 .8 7 6
10 4 -1 .4 9 8 0.4 8 9 -0 .0 0 9 -1 .3 8 8
11 5 0 .7 6 0 0 .4 5 6 0 .0 0 6 0.5 4 7
12 5 2.1 3 5 0.5 5 6 0 .0 1 5 1.767
13 5 1.861 0.5 2 9 0.014 1.518
14 6 3.2 1 4 0.7 0 5 0.0 2 2 2.805
15 6 4 .5 6 4 1.076 0.0 2 7 4.321
16 6 4 .5 6 4 1.076 0.027 4.321
17 6 4 .5 6 4 1.076 0.0 2 7 4.321

R O O T M E A N S Q U A R E = 0.0 2 2

T A B L E 3.4.2

MEASUREMENT BY UCON

C O M P L E T E SCORE E Q U IV A L E N C E T A B L E

RAW PERSON STANDARD


SCORE COUNT A B IL IT Y ERR O R

13 0 5.09 1.14
12 0 4.11 0.95
11 2 3.31 0.92
10 1 2.53 0.93
9 4 1.71 0 .96
8 5 0.81 1.03
7 12 - 0 .2 2 1.07
6 3 - 1 .1 9 0.97
5 2 - 1 .9 6 0.86
4 2 -2 .6 1 0.81
3 2 -3 .2 1 0.81
2 1 - 3 .8 6 0.88
1 0 - 4 .7 3 1.10

M E A N A B IL IT Y = - 0 .1 6

SD O F A B IL IT Y = 1.45
TABLE 3.4.3

ITEM CHARACTERISTIC CURVES AND ANALYSIS OF FIT BY UCON

[The body of this table did not survive reproduction. As described below, it repeats the observed item characteristic curves of Table 3.2.8 together with their departures from the model ICCs expected by the UCON estimates and the fit mean squares from the UCON analysis.]
ITEM C A L IB R A T IO N BY COMPUTER 59

Table 3.4.3 gives the observed Item Characteristic Curve shown in Table 3.2.8, the
departures o f this ICC from the m odel ICC as expected by U C O N estimates and the fit
mean squares resulting from the U C O N analysis. Table 3.4.4 summarizes the U C ON
calibration in the same form as Table 3.2.9 summarizes the P R O X calibration.

TABLE 3.4.4

ITEM CALIBRATION SUMMARY
BY UCON

S E R IA L O R D E R D IF F IC U L T Y O R D E R

SEQ IT E M IT E M DISC F IT SEQ IT E M IT E M DISC F IT


NUM NAME D IF F IN D X M N SQ NUM NAME D IF F IN D X M N SQ

4 3 -4 .1 9 1.13 0.37 4 3 - 4 .1 9 1.13 0.37


5 3 - 3 .6 5 1.17 0.53 5 3 - 3 .6 5 1.17 0.53
6 3 - 3 .2 2 0.97 0 .9 0 7 4 - 3 .6 5 0 .6 0 1.98
7 4 - 3 .6 5 0 .6 0 1.98 6 3 - 3 .2 2 0 .97 0 .90
8 4 - 2 .2 4 1.20 0.44 9 4 - 3 .2 2 1.22 0.23
9 4 - 3 .2 2 1.22 0 .23 8 4 -2 .2 4 1.20 0.44
10 4 -1 .5 0 1.10 0.79 10 4 - 1 .5 0 1.10 0 .79
11 5 0 .7 6 0 .9 6 0.77 11 5 0 .7 6 0 .96 0.77
12 5 2 .14 0 .8 2 0.97 13 5 1.86 1.34 0 .40
13 5 1.86 1.34 0 .40 12 5 2.14 0 .82 0.97
14 6 3.21 0 .8 2 1.33 14 6 3.21 0 .82 1.33
15 6 4 .5 6 1.08 0 .13 16 6 4 .5 6 1.08 0 .13
16 6 4 .5 6 1.08 0 .13 17 6 4 .5 6 1.08 0.13
17 6 4 .5 6 1.08 0 .13 15 6 4 .5 6 1.08 0 .13

MEAN 0 .0 0 1.04 0 .65


S.D. 3 .44 0 .1 9 0.53

F IT O R D E R
SEQ IT E M IT E M DISC F IT P O IN T
NUM NAME D IF F IN D X M N SQ BISER

16 6 4 .5 6 1.08 0.13 0.33


17 6 4 .5 6 1.08 0.13 0.33
15 6 4 .5 6 1.08 0.13 0.33
9 4 -3 .2 2 1.22 0.23 0.61
4 3 -4 .1 9 1.13 0.37 0.40
13 5 1.86 1.34 0.40 0.58
8 4 -2 .2 4 1.20 0.44 0.69
5 3 - 3 .6 5 1.17 0.53 0.42
11 5 0 .76 0 .96 0.77 0.54
10 4 -1 .5 0 1.10 0.79 0.54
6 3 -3.22 0.97 0.90 0.40
12 5 2.14 0 .82 0.97 0.42
14 6 3.21 0 .82 1.33 0.20
7 4 -3.65 0.60 1.98 0.23

In Chapter 2 we demonstrated the feasibility o f hand calibration and showed the


computation in detail. This was done to provide a basis fo r understanding the computer
programs which accomplish the same task and their resulting outputs. The comparison
o f P R O X by hand to P R O X by computer demonstrates their comparability. In UCON
we have a program which provides greater accuracy. Our next step is to compare UCON
to PRO X.

3.5 C O M PA R IN G UCON TO PRO X W IT H TH E KCT D A T A

Table 3.5.1 gives the calibrations and standard errors fo r the K C T data produced
by the UCON and P R O X methods. The calibration differences between UCON and P R O X
run about ± .3 logits. The difference between their standard errors is at most ± .1 logits.

TABLE 3.5.1

A COMPARISON OF ITEM CALIBRATIONS


AND STANDARD ERRORS FOR
UCON AND PROX BY COMPUTER

C A L IB R A T IO N S T A N D A R D ERROR
Item UCON PROX Difference UCON PRO X Difference
4 -4 .2 -3 .9 -0 .3 0.8 0.8 0.0
5 -3 .6 -3 .3 -0 .3 0.7 0.7 0.0
6 -3 .2 -2 .9 -0 .3 0.6 0.6 0.0
7 -3.6 -3.3 -0.3 0.7 0.7 0.0
8 -2 .2 -2 .0 -0 .2 0.6 0.5 0.1
9 -3 .2 -2 .9 -0 .3 0.6 0.6 0.0
10 -1 .5 -1 .4 -0 .1 0.5 0.4 0.1
11 +0.8 +0.5 0.3 0.5 0.4 0.1
12 +2.1 +1.8 0.3 0.6 0.5 0.1
13 +1.9 +1.5 0.4 0.5 0.5 0.0
14 +3.2 +2.8 0.4 0.7 0.7 0.0
15 +4.6 +4.3 0.3 1.1 1.2 -0 .1
16 +4.6 +4.3 0.3 1.1 1.2 -0 .1
17 +4.6 +4.3 0.3 1.1 1.2 -0 .1

M EAN 0.00 0.00 0.74 0.71

SD 3.44 3.15 0.22 0.29

Table 3.5.2 gives the person measures and their standard errors fo r UCON and
PR O X . There the differences between UCON and P R O X methods run as much as ± .7
logits fo r the measures.

We see that using the more accurate UCON procedure which takes into account the
particular distributions o f item difficulties and person abilities does make a tangible d if­
ference fo r the K C T data. As we have seen these K C T items have a distinctly bimodal
distribution not well handled by the P R O X procedure. Although, these differences
between P R O X and UCON are never as much as a standard error, and hence could n ot be

considered statistically significant, nevertheless, they might trouble some practitioners.


T heir cause, however, is the brevity o f this K C T example and the bim odality o f its item
difficulties. F o r larger data sets and fo r more uniform item d ifficu lty distributions, the
results o f P R O X and U C O N are virtually indistinguishable.

T A B L E 3.5.2

A COMPARISON OF PERSON MEASURES


AND STANDARD ERRORS FOR
UCON AND PROX BY COMPUTER

MEASURE S TA N D A R D ERROR
Score UCO N PROX Difference UCON PRO X Difference
1 -4 .7 -5 .4 0.7 1.1 1.5 - 0 .4
2 -3 .9 -3 .8 -0 .1 0.9 1.1 - 0 .2
3 -3 .2 -2 .7 - 0 .5 0.8 0.9 -0 .1
4 - 2 .6 -1 .9 -0 .7 0.8 0.9 -0 .1
5 -2 .0 - 1 .2 -0 .8 0.9 0.8 0.1
6 -1 .2 -0 .6 -0 .6 1.0 0.8 0.2
7 - 0 .2 0.0 -0 .2 1.0 0.8 0.2
8 0.8 0.6 0.2 1.0 0.8 0.2
9 1.7 1.2 0.5 1.0 0.8 0.2
10 2.5 1.9 0 .6 0.9 0.9 0.0
11 3.3 2.7 0.6 0.9 0.9 0.0
12 4.1 3.8 0.3 1.0 1.1 -0 .1
13 5.1 5.4 -0 .3 1.1 1.5 -0 .4

MEAN 0.0 0.0 1.0 1.0

SD 3.2 3.1 0.1 0.3

3 .6 A C O M P U T IN G A L G O R IT H M FO R P R O X

Here is a concise im plem entation o f the P R O X procedure, suitable fo r com puter


programming:

1. E dit the binary data matrix o f person-by-item responses such that no


person has a zero o r a p erfect score and no item has a zero o r a perfect
score. This editing may go beyond a single stage when the removal o f an
item necessitates the removal o f some persons, and vice versa. The final
outcom e is a vector o f item scores (Sj) where i goes from 1 to L and a
vector o f person score frequencies (n r) where r goes from 1 to L - l .

2. Let:

   x_i = ln[(N − s_i)/s_i]                            i = 1, L          [3.6.1]

   x. = Σ_i x_i / L                                                     [3.6.2]

   y_r = ln[r/(L − r)]                                r = 1, L−1        [3.6.3]

   y. = Σ_r n_r y_r / N        (summing r from 1 to L−1)                [3.6.4]

   D = Σ_i (x_i − x.)² / [2.89(L − 1)]                                  [3.6.5]

   B = Σ_r n_r (y_r − y.)² / [2.89(N − 1)]                              [3.6.6]

   G = BD                                                               [3.6.7]

3. Calculate the expansion factors:

   X = [(1 + D)/(1 − G)]^½                                              [3.6.8]

   Y = [(1 + B)/(1 − G)]^½                                              [3.6.9]

4. Estimate the item difficulties as:

   d_i = Y(x_i − x.)                                  i = 1, L          [3.6.10]

5. With standard errors of:

   SE(d_i) = Y [N / s_i(N − s_i)]^½                                     [3.6.11]

6. The ability estimates for this set of items are given by:

   b_r = X y_r                                        r = 1, L−1        [3.6.12]

7. With standard errors of:

   SE(b_r) = X [L / r(L − r)]^½                                         [3.6.13]
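As an illustration of how this algorithm might be coded (a sketch for study, not the BICAL source), the steps above translate almost line for line into the following routine. It assumes the response matrix has already been edited as in step 1.

```python
import math

def prox(matrix):
    """PROX estimates from an edited 0/1 response matrix (one row per person).

    Assumes the editing of step 1 has been done, so no person and no item
    has a zero or perfect score.  Returns item difficulties with standard
    errors and ability estimates with standard errors for scores 1..L-1.
    """
    N = len(matrix)                                        # persons
    L = len(matrix[0])                                     # items

    s = [sum(row[i] for row in matrix) for i in range(L)]  # item scores s_i
    n = [0] * L                                            # n[r] = persons with score r
    for row in matrix:
        n[sum(row)] += 1

    x = [math.log((N - si) / si) for si in s]                  # item logits      [3.6.1]
    x_bar = sum(x) / L                                         #                  [3.6.2]
    y = [0.0] + [math.log(r / (L - r)) for r in range(1, L)]   # score logits     [3.6.3]
    y_bar = sum(n[r] * y[r] for r in range(1, L)) / N          #                  [3.6.4]

    D = sum((xi - x_bar) ** 2 for xi in x) / (2.89 * (L - 1))                     # [3.6.5]
    B = sum(n[r] * (y[r] - y_bar) ** 2 for r in range(1, L)) / (2.89 * (N - 1))   # [3.6.6]
    G = B * D                                                                     # [3.6.7]

    X = math.sqrt((1 + D) / (1 - G))                       # ability expansion      [3.6.8]
    Y = math.sqrt((1 + B) / (1 - G))                       # difficulty expansion   [3.6.9]

    d = [Y * (xi - x_bar) for xi in x]                                            # [3.6.10]
    se_d = [Y * math.sqrt(N / (si * (N - si))) for si in s]                       # [3.6.11]
    b = {r: X * y[r] for r in range(1, L)}                                        # [3.6.12]
    se_b = {r: X * math.sqrt(L / (r * (L - r))) for r in range(1, L)}             # [3.6.13]

    return d, se_d, b, se_b
```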

3.7 T H E U N C O N D IT IO N A L PRO CEDURE UCON

The Rasch model for binary observations defines the probability of a response x_vi to item i by person v as

   P{x_vi | β_v, δ_i} = exp[x_vi(β_v − δ_i)] / [1 + exp(β_v − δ_i)]            [3.7.1]

where x_vi = 1 if correct and 0 otherwise,

   β_v = ability parameter of person v,

   δ_i = difficulty parameter of item i.

The likelihood Λ of the data matrix ((x_vi)) is the continued product of Equation [3.7.1] over all values of v and i, where L is the number of items and N is the number of persons with test scores between 0 and L, since scores of 0 and L lead to infinite ability estimates.

   Λ = exp[Σ_v Σ_i x_vi(β_v − δ_i)] / Π_v Π_i [1 + exp(β_v − δ_i)]             [3.7.2]

Upon taking logarithms and letting

   Σ_i x_vi = r_v be the score of person v

and

   Σ_v x_vi = s_i be the score of item i,

the log likelihood λ becomes

   λ = ln Λ = Σ_v r_v β_v − Σ_i s_i δ_i − Σ_v Σ_i ln[1 + exp(β_v − δ_i)] .     [3.7.3]

The reduction of the data matrix ((x_vi)) to its margins (r_v) and (s_i) and the separation of r_v β_v and s_i δ_i in Equation 3.7.3 establish the sufficiency of r_v for estimating β_v and of s_i for estimating δ_i, as well as the objectivity of these estimates.

It is important to recognize, of course, that although r_v and s_i lead to sufficient estimates of β_v and δ_i they themselves are not satisfactory as estimates. Person score r_v is not free from the particular item difficulties encountered in the test. Nor is item score s_i free from the ability distribution of the persons who happen to be taking the item. To achieve independence from these local factors requires adjusting the observed r_v and s_i for the related item difficulty and person ability distributions to produce the test-free person measures and sample-free item calibrations desired.

With the side condition Σ_i δ_i = 0 to restrain the indeterminacy of origin in the response parameters, the first and second partial derivatives of λ with respect to β_v and δ_i become

   ∂λ/∂β_v = r_v − Σ_i π_vi                         v = 1, N          [3.7.4]

   ∂²λ/∂β_v² = −Σ_i π_vi(1 − π_vi)                                    [3.7.5]

and

   ∂λ/∂δ_i = −s_i + Σ_v π_vi                        i = 1, L          [3.7.6]

   ∂²λ/∂δ_i² = −Σ_v π_vi(1 − π_vi)                                    [3.7.7]

where π_vi = exp(β_v − δ_i)/[1 + exp(β_v − δ_i)] .

These are the equations necessary fo r unconditional maximum likelihood estima­


tion. The solutions fo r item d ifficu lty estimates in Equations 3.7.6 and 3.7.7 depend on
the presence o f values fo r the person ability estimates. Because unweighted test scores are
the sufficient statistics fo r estimating abilities, all persons with identical scores obtain
identical ability estimates. Hence, we may group persons b y their score, letting

b r be the ability estimate fo r any person with score r,


d, be the d ifficu lty estimate o f item i,
nr be the number o f persons with score r,

and write the estimated probability that a person with a score r will succeed on item i as

   p_ri = exp(b_r − d_i)/[1 + exp(b_r − d_i)] .                       [3.7.8]

Then Σ_v π_vi ≈ Σ_r n_r p_ri (summing v from 1 to N and r from 1 to L−1), as far as estimates are concerned.

A convenient algorithm for computing estimates (d_i) is:

1. Define an initial set of (b_r) as

   b_r^(0) = ln[r/(L − r)]                          r = 1, L−1        [3.7.9]

2. Define an initial set of (d_i), centered at d. = 0, as

   d_i^(0) = ln[(N − s_i)/s_i] − Σ_j ln[(N − s_j)/s_j] / L        i = 1, L        [3.7.10]

where b_r^(0) is the maximum likelihood estimate of β for a test of L equivalent items centered at zero and d_i^(0) is the similarly centered maximum likelihood estimate of δ_i for a sample of N equal-ability persons.

3. Apply Newton's method to Equation 3.7.6 to improve each d_i according to

   d_i^(j+1) = d_i^(j) − [ −s_i + Σ_r n_r p_ri^(j) ] / [ −Σ_r n_r p_ri^(j) (1 − p_ri^(j)) ]        i = 1, L        [3.7.11]

   until convergence at |d_i^(j+1) − d_i^(j)| < .01

   where p_ri^(j) = exp(b_r − d_i^(j))/[1 + exp(b_r − d_i^(j))]                                    [3.7.12]

   and the current set of (b_r) are given by the previous cycle.

4. Recenter the set of (d_i) at d. = 0.

5. Using this improved set of (d_i), apply Newton's method to Equation 3.7.4 to improve each b_r according to

   b_r^(m+1) = b_r^(m) − [ r − Σ_i p_ri^(m) ] / [ −Σ_i p_ri^(m) (1 − p_ri^(m)) ]        r = 1, L−1        [3.7.13]

   until convergence at |b_r^(m+1) − b_r^(m)| < .01

   where p_ri^(m) = exp(b_r^(m) − d_i)/[1 + exp(b_r^(m) − d_i)]                          [3.7.14]

   and the current set of (d_i) are given by the previous cycle.

6. Repeat steps (3) through (5) until successive estimates of the whole set of (d_i) become stable at

   Σ_i (d_i^(k+1) − d_i^(k))² / L < .0001 ,                           [3.7.15]

   which usually takes three or four cycles.

7. Use the reciprocals of the negative square roots defined in Equation 3.7.7 as asymptotic estimates of the standard errors of the difficulty estimates,

   SE(d_i) = [ Σ_r n_r p_ri (1 − p_ri) ]^(−½) .       i = 1, L        [3.7.16]

Andersen (1973) has shown that the presence of the ability parameters (β_v) in the likelihood equation of this unconditional approach leads to biased estimates of the item difficulties (δ_i). Simulations undertaken to test UCON in 1966 indicated that multiplying the centered item difficulty estimates by the coefficient [(L − 1)/L] compensates for most of this bias. (For a discussion and evaluation of the unbiasing coefficient [(L − 1)/L] see Wright and Douglas, 1975b or 1977a.)
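As with PROX, the estimation equations above can be sketched compactly in code. The following is an illustration of the iterative scheme only; the names are ours, it is not the BICAL implementation, and it omits the editing, grouping and fit analysis that surround the estimation in practice.

```python
import math

def logistic(b, d):
    """Estimated probability of success, Equation 3.7.8."""
    e = math.exp(b - d)
    return e / (1.0 + e)

def ucon(item_scores, score_counts, tol=0.01, overall_tol=0.0001):
    """Sketch of UCON estimation from an already edited data set.

    item_scores  -- item scores s_i for the L calibrated items
    score_counts -- {r: n_r}, number of persons at each raw score r = 1..L-1
    """
    L = len(item_scores)
    N = sum(score_counts.values())

    # Initial values, Equations 3.7.9 and 3.7.10
    b = {r: math.log(r / (L - r)) for r in score_counts}
    d = [math.log((N - s) / s) for s in item_scores]
    d = [di - sum(d) / L for di in d]                    # center at d. = 0

    while True:
        d_old = d[:]
        # Step 3: Newton's method for each item difficulty, Equation 3.7.11
        for i, s_i in enumerate(item_scores):
            change = 1.0
            while abs(change) >= tol:
                num = -s_i + sum(n * logistic(b[r], d[i]) for r, n in score_counts.items())
                den = -sum(n * logistic(b[r], d[i]) * (1 - logistic(b[r], d[i]))
                           for r, n in score_counts.items())
                change = -num / den
                d[i] += change
        d = [di - sum(d) / L for di in d]                # Step 4: recenter

        # Step 5: Newton's method for each ability, Equation 3.7.13
        for r in b:
            change = 1.0
            while abs(change) >= tol:
                num = r - sum(logistic(b[r], di) for di in d)
                den = -sum(logistic(b[r], di) * (1 - logistic(b[r], di)) for di in d)
                change = -num / den
                b[r] += change

        # Step 6: overall convergence, Equation 3.7.15
        if sum((di - do) ** 2 for di, do in zip(d, d_old)) / L < overall_tol:
            break

    # Step 7: standard errors, Equation 3.7.16
    se_d = [1.0 / math.sqrt(sum(n * logistic(b[r], di) * (1 - logistic(b[r], di))
                                for r, n in score_counts.items())) for di in d]
    # Unbiasing coefficient (L - 1)/L applied to the centered difficulties
    d = [di * (L - 1) / L for di in d]
    return d, se_d, b
```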
4 THE ANALYSIS OF FIT

4.1 IN T R O D U C T IO N

Procedures fo r item calibration by hand were given in Chapter 2, and calibration


output from computer programs was discussed in Chapter 3. However, these calibration
procedures are only part o f a complete analysis o f a sample o f data. The Rasch model
makes certain plausible assumptions about what happens when a person takes an item,
and a complete analysis must include an evaluation o f how well the data fit these assump­
tions. When, fo r example, a person answers all the hard items o f a test correctly but then
misses several easy items, we are surprised by the resulting implausible pattern o f re­
sponses. While we could examine individual records by eye fo r their implausibility, in
practice we want to put such evaluations on a systematic and manageable basis. We want
to be able to be specific and objective in our reactions to implausible observations.

Even i f the measurement model tends to fit a particular application, we cannot


predict in advance how well new items (o r even old ones) will continue to work in every
situation in which they might be applied, nor can w e know in advance how all persons
w ill always respond. Therefore, if w e are serious in our attempts to measure, we must
examine every application to see how w ell each set o f responses corresponds to our model
expectations. We must evaluate not only the plausibility o f the sample o f persons’
responses, but also the plausibility of each person's responses to the set of items in his
test. T o do this we must examine the response o f each person to each item to determine
whether it is consistent with the general pattern o f responses observed.

4.2. T H E K CT RESPONSE M A T R IX

We begin the study o f fit analysis by returning to the item-by-person data matrix o f
the K C T given in Table 2.4.1. In this table we have the edited and ordered responses o f
34 persons to 14 K C T items. The editing process removed items answered correctly by
everyone or no one, and persons answering correctly all or none o f the items. The
remaining persons and items have been arranged in order o f increasing item and person
score.

This item-by-person matrix o f l ’s and 0 ’s is the com plete record o f usable person
responses to the items o f the test. By inspection we see that the increasing d ifficu lty o f
the K C T items has divided the matrix roughly into tw o triangles: a lower left triangle
dominated by correct responses signified by l ’s and an upper right triangle dominated by
incorrect responses signified by 0 ’s.

This is the pattern w e expect. As items get harder, going from left to right in Table
2.4.1, any particular person’s string o f successes should gradually peter out and end in a
string o f failures on the items much to o hard fo r that person. Similarly, when we examine
the pattern o f responses for any item by proceeding from the bottom o f Table 2.4.1 up


that item ’s colum n over persons o f decreasing ability, w e expect the string o f successes at
the b otto m to peter ou t into failures as the persons becom e to o low in ability to succeed
on this item.

From our calibration o f the K C T items we have estimates o f the item difficulties
( dj) and o f the abilities ( b r) which go with the possible scores (r) on this test. In Table
4.2.1 w e show the m atrix o f responses from Table 2.4.1 to which we have added, from
our calibration, the item difficu lties (d;) across the bottom and the abilities ( b r ) asso­
ciated w ith each score dow n the right column. The item difficulties and score abilities in
Table 4.2.1 are those estimated with P R O X by hand from Chapter 2.

N o tic e h ow Table 4.2.1 is arranged into six sections in order to bring out the pattern
o f responses. T h e 14 items are partitioned into the 7 easier and the 7 harder. The 34
persons are partitioned into the 10 scoring b elow seven, the 12 scoring exactly seven and
the 12 scoring above seven. In the lo w er le ft section there are on ly l ’s. Every higher
ability person go t every easier item correct. In the upper right section there are on ly 0 ’s.
Every lo w er ability person g o t every harder item incorrect. But in the other fou r sections
there is a pattern o f l ’s and 0 ’s that must be analyzed.

When w e exam ine the pattern o f responses in these data fo r unexpected “ corrects”
and “ incorrects,” w e find that Table 4.2.1 shows several exceptions to a pattern o f all l ’s
fo llo w e d b y all 0 ’s. O f course, w e d o n ot exp ect every single person to fail fo r the first
tim e at a particular p oin t and then always to continue to d o so on all harder items. We
expect to find a run o f successes and failures leading finally to a run o f failures as the
items finally becom e to o d ifficu lt. H ow ever, some o f the exceptions in Table 4.2.1 seem
to exceed even this expectation. T o facilitate their examination w e have circled those
responses which seem m ost unexpected given the overall pattern.

T h e locations o f these apparently surprising responses lead us to examine more


closely som e o f the person records in Tables 4.2.1.

F o r Person 2, the pattern o f responses is almost to o reasonable: all l ’s


fo llo w e d b y all 0 ’s.

F o r Person 29, in contrast, the pattern is quite puzzling; it shows both


failures on easy items and success on a hard one.

Th e expected pattern is the one w e see in the records o f Persons 12 o r 23. Here each
record shows a string o f l ’s w ith a few adjacent and alternating l ’s and 0 ’s, follow ed by a
string o f 0 ’s.

Turning to the fou r most questionable records, w e see:

Person 11, failed Item 4 but passed Items 5 through 9 before failing
all the remaining items.

Person 17, passed Item 4, missed Item 5, passed Items 6 through 10


and then failed the remaining items.

Person 13, passed Items 4 and 5, missed Items 6 and 7, passed Items 8
through 12 and then missed the remaining ones.

blf.
a. -
o
um d .o-

z
o

id in in in in i

bi
tr w 00 CO CO CO CO (
O
o
</>

oooooooooo oooooooooooo ooooooooooo-

oooooooooo oooooooooooo ooooooooooo—


RESPONSE PATTERNS OF 34 PERSONS TO 14 ITEMS FROM TABLE 2.4.1

oooooooooo oooooooooooo OO OO O O O O O O —O

o
cc oooooooooo ooooooooo
o o o o o o o o ——o o o
<
z

oooooooooo o o o o o o —o ———o

oooooooooo oooooooooooo O — O O — — O — O — — —

oooooooooo o o o o o o ——o ——o —o ——o ——oo<

o o —o o o ——o o

oo ooo © -
- •

o —o o o «

>
</> © "©■"' - © ...........
<

Tables 2.4.5 and 2.4.6


_0_ - -0 -
©
© 0_, O) <*»
Item difficulties (d () and score abilities (b r ) from

© ©---

h
a<
bi Z
a.

5 s£
o a>it
E £
—ja. ~ a

Person 29, passed Items 4 through 6 , missed Items 7 and 8 , passed


Items 9 through 11, missed Items 12 and 13 and then
passed Item 14 b efore missing all the remaining items.

There are a fe w oth er records that might also be examined such as Persons 3 and 12,
but as w e “ eyeb all” this small matrix, w e can see that the other records are less
exceptional.

N o w that w e have found some instances o f possibly irregular responses, w e want a


systematic w ay to judge the degree o f unexpectedness seen in these response patterns.

T h e Rasch m odel bases calibration and measurement on tw o expectations: (1 ) that


a m ore able person should always have a greater probability o f success on any item than a
less able person, and ( 2 ) that any person should always be m ore likely to d o better on an
easier item than on a harder one. When an observed pattern o f responses shows significant
deviations from these expectations, w e can use the particulars o f the m odel and the
person and item estimates to calculate an index o f how unexpected any particular person
or item record is.

4 .3 T H E A N A L Y S IS O F F IT B Y H A N D

T h e first step in our analysis o f response plausibility or fit is to observe the


d ifferen ce (b „ - d j) between the estimates o f ability bj, and d ifficu lty dj fo r each person
and item. When this d ifferen ce is positive, it means that the item should be easy fo r the
person. T h e m ore positive the differen ce, the easier the item and hence the greater our
expectation that the person w ill succeed. Similarly, as the difference between person
ability and item d ifficu lty becom es m ore and m ore negative, the item should be m ore and
m ore d ifficu lt fo r that person, and our expectation o f his failure increases.

In order to focus our application o f these ideas, w e have taken from Table 4.2.1 the
responses o f the six persons w ith the m ost implausible patterns to the seven items on
w hich their implausible responses occur. These selected responses com prise Table 4.3.1.
W ith this table w e can m ore easily study the outstanding unexpected “ correct” or
“ in correct” responses.

T o begin w ith, w e can tabulate the number o f unexpected responses fo r each person
and item in Table 4.3.1 to arrive at a simple count with which to describe what is
occurring. W e see that Persons 13 and 29 make the worst showing with three unexpected
responses each. H ow ever, this simple count does n ot tell us how to weigh and hence how
to judge the degree o f unexpectedness in these responses.

One way statisticians think about the outcomes of probabilistic events like dice-rolling, coin-tossing and getting an item correct on a test is to define the expected value of the variable realized in any response x_vi, say of person v to item i, as the probability π_vi of that response occurring. This is useful because, if we were to obtain response x_vi a great many times and its genesis were more or less governed by the probability π_vi, then we would expect success to occur about π_vi of the time, just as we expect "6" to come up about one-sixth of the time when we roll dice and "heads" to come up about one-half of the time when we toss coins.

TA B LE 4.3.1

SELECTED PERSON-TO-ITEM RESPONSES (*„,) WITH


UNEXPECTED RESPONSES CIRCLED

ITEM NUMBER OF
UNEXPECTED rcnaun
PERSON 4 5 7 6 8 12 14 RESPONSES A B IL IT Y *

11 1 1 1 1 0 0 1 -1.2

12 1 1 1 ® ® 0 0 2 -1.2

17 1 1 1 1 0 0 1 -0.6

3 1 1 1 1 1 © 0 1 0.0

13 1 1
® © 1 © 0 3 0.0

29 1 1
® 1 ® 0 © 3 0.0

Number of
Unexpected
Responses 1 1 2 2 2 2 1 11

Item
*
Difficulty -3 .9 - 3.3 - 3.3 - 2 . 9 - 2.0 1.7 2.8

" 1 " expected ”0" expected


"0" unexpected "1" unexpected
*From Tables 2.4.5 and 2.4.6

Our model estimates the probability of instances of response x_vi as

   p_vi = exp(b_v − d_i)/[1 + exp(b_v − d_i)]

where b_v = the estimated ability measure of person v

and d_i = the estimated difficulty calibration of item i.

Thus we can use p_vi as an estimate of the expected value of instances of x_vi.

The same theory tells us that the expected variance of instances of x_vi is π_vi(1 − π_vi), which we can estimate with p_vi(1 − p_vi). The result is an estimated standard residual z_vi from any x_vi of

   z_vi = (x_vi − p_vi)/[p_vi(1 − p_vi)]^½ .                          [4.3.1]

To estimate this standard residual z_vi, we subtract from the observed x_vi its estimated expected value p_vi and standardize this residual difference by the divisor

   [p_vi(1 − p_vi)]^½

which is the estimated binomial standard deviation of such observations. To the extent that our data approximate the model, we expect this estimated residual z_vi to be distributed more or less normally with a mean of about 0 and a variance of about 1.

Thus, as a rough but useful criterion for the fit of the data to the model, we can examine the extent to which these standard residuals approximate a normal distribution, i.e.

   z_vi ~ N(0, 1)

or their squares approximate a one degree of freedom chi-square distribution, i.e.

   z_vi² ~ χ²(1)

The reference values of 0 for the mean and 1 for the standard deviation and the reference distributions of N(0, 1) and χ²(1) help us to see if the estimated standard residuals deviate significantly from their model expectations. This examination of residuals will suggest whether we can proceed to use these items to make measurements, or whether we must do further work on the items and the testing situation to bring them into line with reasonable expectations. It will also indicate when particular persons have failed to respond to the test in a plausible manner.

When a particular squared residual z_vi² becomes very large, we wonder if something unexpected happened when person v took item i. Of course, a single unexpected response is less indicative of trouble than a string of unexpectedly large values of z_vi². Then the accumulated impact of these values taken over items for a person or over persons for an item is bound to produce concern for the plausibility of the person's measure or of the item's calibration and hence to put into doubt the meaning of that person's measurement or of that item's calibration.

Since x_vi takes only the two values of "0" and "1" we can express these standard residuals in terms of the estimates b_v and d_i.

From Equation 4.3.1 we have

   z_x = (x − p)/[p(1 − p)]^½ .

So when x = 0 then z_0 = (−p)/[p(1 − p)]^½ = −[p/(1 − p)]^½

and when x = 1 then z_1 = (1 − p)/[p(1 − p)]^½ = +[(1 − p)/p]^½ .

Now since p = exp(b − d)/[1 + exp(b − d)]

then p/(1 − p) = exp(b − d)

and (1 − p)/p = exp(d − b) .

So z_0 = −exp[(b − d)/2]     and     z_0² = exp(b − d)

and z_1 = +exp[(d − b)/2]     and     z_1² = exp(d − b)

or in general

   z = (2x − 1) exp[(2x − 1)(d − b)/2]                                [4.3.2]

   z² = exp[(2x − 1)(d − b)] .                                        [4.3.3]

Thus, exp(b − d) indicates the unexpectedness of an incorrect response to a relatively easy item, while exp(d − b) indicates the unexpectedness of a correct response to a relatively hard item. The values of z_0² = exp(b − d) and z_1² = exp(d − b) can be ascertained for each x_vi of 0 or 1 and then accumulated over items to evaluate the plausibility of any person measure, or over persons to evaluate the plausibility of any item calibration.
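As a quick numerical check of Equation 4.3.3 we can evaluate it for one of the circled responses, using the ability and difficulty estimates shown in Table 4.3.1. The function name below is ours, chosen for illustration:

```python
import math

def squared_standard_residual(x, b, d):
    """Squared standardized residual z^2 of Equation 4.3.3.

    x -- observed response (0 or 1)
    b -- estimated person ability
    d -- estimated item difficulty
    """
    return math.exp((2 * x - 1) * (d - b))

# Person 11 failing Item 4 (b = -1.2, d = -3.9): an unexpected incorrect response.
print(round(squared_standard_residual(0, -1.2, -3.9)))   # about 15, as in Table 4.3.4
```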

To evaluate the unexpected responses in Table 4.3.1 we replace each instance of an unexpected response by the difference between the ability measure for that person and the difficulty calibration for that item. For Person 11 on Item 4 the unexpected incorrect response associated with a person ability b_v of −1.2 and an item difficulty d_i of −3.9 leads to a difference (b_v − d_i) of (−1.2) − (−3.9) = +2.7.

This difference of 2.7 for Person 11 on Item 4 is placed at the location of that unexpected response in the matrix in Table 4.3.2 where we have also computed the differences for each instance of an unexpected response given in Table 4.3.1.

TA B LE 4.3.2

A B IL ITY - D IFFIC U LTY DIFFERENCES (b^-dj)


FOR UNEXPECTED RESPONSES

ITEM
PERSON
PERSON 4 5 7 6 8 12 14 AB IL ITY

11 2.7 - 1.2

12 1.7 0.8 - 1.2

17 2.7 -0 .6

3 1.7 0 .0

13 3.3 2.9 1.7 0 .0

29 3.3 2 .0 2.8 0 .0

Item
Difficulty -3 .9 -3 .3 -3 .3 -2 .9 -2 .0 1.7 2.8

Since Since
" 1 " expected " 0 " expected
" 0 " unexpected " 1 " unexpected
entry is (b - d) entry is (d - b)

Unexpected incorrect answers have been recorded as (b - d), but unexpected correct answers have been recorded as (d - b). This is because when the response is incorrect, i.e., x = 0, the index of unexpectedness is exp(b - d), but when the response is correct, i.e., x = 1, the index is exp(d - b).

The earmark of unexpectedness in Table 4.3.2 is a positive difference, whether from (b - d) or (d - b). Corresponding values for z² can be looked up in Table 4.3.3, which gives either values of z_0² = exp(b - d) for unexpected incorrect answers or values of z_1² = exp(d - b) for unexpected correct answers. The entry C_x in Column 1 of Table 4.3.3 is either C_0 = (b - d) when x = 0 and the response is incorrect or C_1 = (d - b) when x = 1 and the response is correct.

TABLE 4.3.3

MISFIT STATISTICS

DIFFERENCE BETWEEN    SQUARED        IMPROBABILITY   RELATIVE         NUMBER OF ITEMS
PERSON ABILITY AND    STANDARDIZED   OF THE          EFFICIENCY OF    NEEDED TO MAINTAIN
ITEM DIFFICULTY       RESIDUAL       RESPONSE        THE OBSERVATION  EQUAL PRECISION
C_x*                  z²             p               I                1000/I

-0.6, 0.4              1             .50             100               10
 0.5, 0.9              2             .33              90               11
 1.0, 1.2              3             .25              75               13
 1.3, 1.5              4             .20              65               15
 1.6, 1.7              5             .17              55               18
 1.8                   6             .14              50               20
 1.9, 2.0              7             .12              45               22
 2.1                   8             .11              40               25
 2.2                   9             .10              36               28
 2.3                  10             .09              33               30
 2.4                  11             .08              31               32
 2.5                  12             .08              28               36
 2.6                  13             .07              25               40
 2.7                  15             .06              23               43
 2.8                  16             .06              21               48
 2.9                  18             .05              20               50
 3.0                  20             .05              18               55
 3.1                  22             .04              16               61
 3.2                  25             .04              15               66
 3.3                  27             .04              14               73
 3.4                  30             .03              12               83
 3.5                  33             .03              11               91
 3.6                  37             .03              10              100
 3.7                  40             .02               9              106
 3.8                  45             .02               9              117
 3.9                  49             .02               8              129
 4.0                  55             .02               7              142
 4.1                  60             .02               6              156
 4.2                  67             .02               6              172
 4.3                  74             .01               5              189
 4.4                  81             .01               5              209
 4.5                  90             .01               4              230
 4.6                  99             .01               4              254

* For incorrect responses when x = 0, C_0 = (b - d). For correct responses when x = 1, C_1 = (d - b).

TABLE 4.3.4

FIT MEAN SQUARES (z²)
FOR UNEXPECTED RESPONSES

                           ITEM                          PERSON MISFIT
PERSON      4     5     7     6     8    12    14        TOTAL

  11       15                                              15
  12                          6     2                       8
  17             15                                        15
   3                                      6                 6
  13                   27    18           6                51
  29                   27           7          17          51

Item
Misfit     15    15    54    24     9    12    17         146
Total

For Items 4 through 8, "1" is expected and "0" is unexpected; for Items 12 and 14, "0" is expected and "1" is unexpected.

We can locate the difference +2.7 for the (b - d) of Person 11 on Item 4 in the first column of Table 4.3.3 and read the corresponding z² in Column 2 as 15. This value and all of the other values for the differences in Table 4.3.2 have been recorded in Table 4.3.4, which now contains all the z² for every instance of unexpectedness that we have observed for the six persons and seven items. In the margins of Table 4.3.4 are the sums of these z² for each person and item. These sums indicate how unexpected the person or item pattern of responses is.

In Column 3 of Table 4.3.3 we show p = 1/(1 + z²), the improbability of the observed response. This provides a significance level for the null hypothesis of fit for any particular response. With our example of a (b - d) of 2.7 we have a significance level of .06 against the null hypothesis that the response of Person 11 to Item 4 is according to the model. The z² themselves are approximately chi-square distributed with almost 1 degree of freedom each. When they are accumulated over items for a person or over persons for an item, the resulting sums are approximately chi-square distributed with (L - 1) degrees of freedom for a person and (N - 1) degrees of freedom for an item.

In Column 4 of Table 4.3.3 we show I = 400p(1 - p), an index of the relative efficiency with which an observation at that (b - d) provides information about the person and item interaction. This index is scaled by the factor 400 so that it gives the amount of information provided by the observation as a percentage of the maximum information that one observation at (b - d) = 0, i.e., right on target, could provide. The percent information in an observation can be used to judge the value of any particular item for measuring a person. This can be done by considering how much information would be lost by removing that item from the test. Thus, the I of 23% for Person 11 on Item 4 gives us an indication of how much we gain by including Item 4 in the measurement of Person 11 or of how much we would lose were we to remove Item 4.

The way the idea of information or efficiency enters into judging the value of an observation is through its bearing on the precision of measurement. Measurement precision depends on the number of items in the record and on the relevance of each item to the particular person. We can simplify the evaluation of each item's contribution to our knowledge of the person by calculating what percent of a best possible item the item in question contributes. That is what the values of I in Column 4 provide.
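A small sketch of how the columns of Table 4.3.3 can be generated (the function name is ours; the formulas are those given above):

```python
import math

def misfit_statistics(c):
    """For an ability-difficulty difference c attached to an unexpected response:
    squared residual z2 = exp(c), improbability p = 1/(1 + z2), relative
    efficiency I = 400 p (1 - p) in percent, and the approximate number of
    such off-target items needed to equal ten right-on-target items."""
    z2 = math.exp(c)
    p = 1.0 / (1.0 + z2)
    eff = 400.0 * p * (1.0 - p)
    needed = 1000.0 / eff
    return z2, p, eff, needed

print([round(v, 2) for v in misfit_statistics(2.7)])
# about 14.9, .06, 23.6 and 42 -- close to the 15, .06, 23, 43 row of Table 4.3.3
```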

When the item and person are close to one another, i.e., on target, then the item contributes more to the measure of the person than when the item and person are far apart. The greater the difference between item and person, the greater the number of items needed to obtain a measure of comparable precision and, as a result, the less efficient each item.

For example, it requires five 20% items to provide as much information about a person as could be provided by one 100% item. Thus, when (b - d) is about 3.0, it takes four to five times as many items to provide as much information as could be had from items that fell within one logit of the person, i.e., in the |b - d| < 1 region.

In general, the test length necessary to maintain a specified level of measurement precision is inversely proportional to the relative efficiency of the items used. The number L of less efficient items necessary to match the precision of 10 right-on-target items is given in the last column of Table 4.3.3.

To facilitate the use of Table 4.3.3, it has been arranged in four sections:

Right on target, |b - d| < 2:  Item efficiency is 45% or better; in the |b - d| < 1 region, 79% or better. Misfit is difficult to detect.

Slightly off target, 2 < |b - d| < 3:  Efficiency is poor, less than 45%. Misfit becomes detectable when unexpected responses accumulate.

Rather off target, 3 < |b - d| < 4:  Efficiency is very poor, less than 18%. Even single unexpected responses can indicate significant response irregularities.

Extremely off target, 4 < |b - d|:  Efficiency is virtually nil, less than 7%. Unexpected responses are always unacceptable.

4.4 MISFITTING PERSON RECORDS

Upon examining the rows of Table 4.3.4 for high z² values in person records, we find that the highest accumulated values are for Persons 13 and 29. These are the two persons whose test behavior is most questionable, and so we will examine their records in more detail.


Table 4.4.1 displays the response vectors for Persons 13 and 29 over all 14 items. For each person we show their responses of 0 or 1, the concomitant (b - d) or (d - b) differences, depending upon whether the response is 0 for incorrect or 1 for correct, and the consequent value of z². The sums of the row of z² for Person 13 and Person 29 are, coincidentally, 53. According to the model, these accumulated z²'s ought to follow a chi-square distribution with 1 degree of freedom for each z² minus the degree of freedom necessary to estimate the person measure b.

Further, any sum of z²'s, when divided by its degrees of freedom, should follow a mean square v = Σz²/f distribution which can conveniently be evaluated as the t-statistic:

v = Σz²/f with f = L - 1                          [4.4.1]

t = [ln(v) + v - 1] [f/8]^½ ~ N(0,1)              [4.4.2]

which has approximately a unit normal distribution.

For Person 13 we have

v_13 = Σ(i=1 to 14) z²_13i /(14 - 1) = 53/13 = 4.1

for which

t_13 = [ln(v_13) + v_13 - 1] [13/8]^½ = [1.4 + 4.1 - 1] [1.3] = 5.8 ,

which is a rather improbable value for t, if this person's performance fits the model.

For Person 29 we observe the same results and the same t-statistic. With such significant misfit it would seem reasonable to diagnose these two records as unsuitable data sources, either for the measurement of these two persons or for the calibration of these items.
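As a minimal sketch of Equations 4.4.1 and 4.4.2 in code (the function name is ours; the inputs are from the worked example):

```python
import math

def person_fit_t(sum_z2, L):
    """Person fit mean square v = (sum of z2)/(L - 1) and the approximate
    unit-normal statistic t = [ln(v) + v - 1][(L - 1)/8]^0.5
    (Equations 4.4.1 and 4.4.2)."""
    f = L - 1
    v = sum_z2 / f
    t = (math.log(v) + v - 1.0) * math.sqrt(f / 8.0)
    return v, t

# Person 13: the squared residuals over 14 items sum to 53.
v13, t13 = person_fit_t(53.0, 14)
print(round(v13, 1), round(t13, 1))  # about 4.1 and 5.7; the rounded hand calculation gives 5.8
```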

4.5 MISFITTING ITEM RECORDS

We can also see in Table 4.3.4 that Items 7 and 6 show the greatest misfit among items, especially Item 7 with an accumulated z² of 54. In Table 4.5.1 we analyze the complete data vectors of these two items, showing for each person's response of 0 or 1 the associated (b - d) or (d - b) with their respective z².

For Item 7

v_7 = Σ(v=1 to 34) z²_v7 /(34 - 1) = 57/33 = 1.7

for which

t_7 = [ln(v_7) + v_7 - 1] [33/8]^½ = [0.5 + 1.7 - 1] [2.0] = 2.4 ,

which is also a somewhat improbable value for t, if this item fits the model.

TABLE 4.5.1

COMPLETE FIT ANALYSIS FOR ITEMS 7 AND 6

                       ITEM 7 (d = -3.3)          ITEM 6 (d = -2.9)
PERSON  ABILITY   RESPONSE  DIFFERENCE*  z²   RESPONSE  DIFFERENCE*  z²
           b          x                           x

  25     -3.8         0       -0.5        1       1       +0.9        3
   4     -2.8         1       -0.5        1       0       +0.1        1
  33     -2.8         1       -0.5        1       0       +0.1        1
   1     -1.9         1       -1.4        0       1       -1.0        0
  27     -1.9         1       -1.4        0       1       -1.0        0
  11     -1.2         1       -2.1        0       1       -1.7        0
  12     -1.2         1       -2.1        0       0       +1.7        6
  17     -0.6         1       -2.7        0       1       -2.3        0
  19     -0.6         1       -2.7        0       1       -2.3        0
  30     -0.6         1       -2.7        0       1       -2.3        0
   2      0.0         1       -3.3        0       1       -2.9        0
   3      0.0         1       -3.3        0       1       -2.9        0
   5      0.0         1       -3.3        0       1       -2.9        0
   6      0.0         1       -3.3        0       1       -2.9        0
   8      0.0         1       -3.3        0       1       -2.9        0
   9      0.0         1       -3.3        0       1       -2.9        0
  13      0.0         0       +3.3       27       0       +2.9       18
  16      0.0         1       -3.3        0       1       -2.9        0
  26      0.0         1       -3.3        0       1       -2.9        0
  28      0.0         1       -3.3        0       1       -2.9        0
  29      0.0         0       +3.3       27       1       -2.9        0
  31      0.0         1       -3.3        0       1       -2.9        0
  10     +0.6         1       -3.9        0       1       -3.5        0
  18     +0.6         1       -3.9        0       1       -3.5        0
  14     +0.6         1       -3.9        0       1       -3.5        0
  32     +0.6         1       -3.9        0       1       -3.5        0
  20     +0.6         1       -3.9        0       1       -3.5        0
  21     +1.2         1       -4.5        0       1       -4.1        0
  22     +1.2         1       -4.5        0       1       -4.1        0
  23     +1.2         1       -4.5        0       1       -4.1        0
  34     +1.2         1       -4.5        0       1       -4.1        0
  15     +1.9         1       -5.2        0       1       -4.8        0
   7     +2.8         1       -6.1        0       1       -5.7        0
  24     +2.8         1       -6.1        0       1       -5.7        0

SUM OF SQUARES                           57                          29

* The difference shown is (b - d) when x = 0 and (d - b) when x = 1.

For Item 6

v_6 = Σ(v=1 to 34) z²_v6 /(34 - 1) = 29/33 = 0.9

for which

t_6 = [ln(v_6) + v_6 - 1] [33/8]^½ = [-0.1 + 0.9 - 1] [2.0] = -0.4 ,

obviously not a significant misfit.

We find that the mean square for Item 7 is significant but that the mean square for Item 6 is not. However, when we examine Table 4.5.1 again, we see that it is the two significantly misfitting persons, 13 and 29, who contribute most to the misfit values for these two items. Now we have the opportunity of improving the fit of the data to the model, either by removing Item 7 and observing what happens then or by removing Persons 13 and 29.

4.6 BRIEF SUMMARY OF THE ANALYSIS OF FIT

For any response of Person v to Item i

x_vi = 0 if "incorrect" and

x_vi = 1 if "correct."

The standard square residual becomes

z_vi² = exp(b - d), for x_vi = 0, incorrect, and

z_vi² = exp(d - b), for x_vi = 1, correct.

To evaluate the overall fit of person v, we sum his vector of standard square residuals (z_vi²) over the test of i = 1, L items, and calculate his person misfit statistic as

v_v = Σ(i=1 to L) z_vi² /(L - 1) ~ F(L-1, ∞)                 [4.6.1]

with t_v = [ln(v_v) + v_v - 1] [(L - 1)/8]^½ ~ N(0,1)        [4.6.2]

To evaluate the fit of Item i, we sum the item's vector of standard square residuals (z_vi²) over the sample of v = 1, N persons, and calculate the item misfit statistic as

v_i = Σ(v=1 to N) z_vi² /(N - 1) ~ F(N-1, ∞)                 [4.6.3]

with t_i = [ln(v_i) + v_i - 1] [(N - 1)/8]^½ ~ N(0,1)        [4.6.4]

4.7 COMPUTER ANALYSIS OF FIT

In the analysis of fit done by hand we saw that certain person records and items had residuals evaluated as significant. Having shown the procedures for the analysis of fit by hand, we turn to computer analysis and return to our calibration of the KCT with 18 items and 34 persons. In the calibration of the KCT we see from the fit mean square, given in the left panel of Table 4.7.1, that Item 7 produces the greatest misfit with a value of 1.98, not far from the 1.7 found in our hand computation. From our analysis of person misfit we know that Persons 13 and 29 greatly contributed to this misfit in Item 7. Without this information at the time of our calibration, however, we might have considered the possible deletion of Item 7 because of its high fit mean square. With this much lack of fit for Item 7 we might have chosen to recalibrate with Item 7 removed. This has been done and the results are given in the middle panel of Table 4.7.1. Now we see that Item 6 has acquired a misfit of 2.73 even though previously, when we calibrated all 14 items, Item 6 had a fit mean square of only 0.90. This change in the status of Item 6 is troublesome. We do not seem to be focusing in on a set of suitable items. Nevertheless we go one step further and recalibrate once more, this time removing both Item 7 and Item 6. The results are in the right panel of Table 4.7.1. Alas, now we find that Item 8 has become a misfit. These attempts to find a properly fitting set of items appear doomed.

TABLE 4.7.1

ANALYSIS OF FIT WITH UCON:
ITEM DELETIONS

All Persons and All Items (L = 14, N = 34), items in fit order:

SEQ   ITEM   ITEM    FIT
NUM   NAME   DIFF    MN SQ
 16    6     4.56    0.13
 17    6     4.56    0.13
 15    6     4.56    0.13
  9    4    -3.22    0.23
  4    3    -4.19    0.37
 13    5     1.86    0.40
  8    4    -2.24    0.44
  5    3    -3.65    0.53
 11    5     0.76    0.77
 10    4    -1.50    0.79
  6    3    -3.22    0.90
 12    5     2.14    0.97
 14    6     3.21    1.33
  7    4    -3.65    1.98

Deleting Item 7 (L = 13, N = 34), items in fit order:

SEQ   ITEM    FIT
NUM   DIFF    MN SQ
 16    4.45   0.13
 17    4.45   0.13
 15    4.45   0.13
  9   -3.75   0.22
  4   -4.75   0.40
 13    1.77   0.42
  5   -4.20   0.54
 11    0.65   0.68
 14    3.11   0.74
 12    2.04   0.83
 10   -1.81   0.88
  8   -2.66   1.03
  6   -3.75   2.73

Deleting Items 7 and 6 (L = 12, N = 34), items in fit order:

SEQ   ITEM    FIT
NUM   DIFF    MN SQ
 16    4.29   0.13
 17    4.29   0.13
 15    4.29   0.13
  9   -4.30   0.18
  4   -5.38   0.35
 13    1.61   0.44
  5   -4.79   0.64
 11    0.43   0.67
 14    2.96   0.77
 10   -2.19   0.78
 12    1.89   0.82
  8   -3.11   1.29

Suppose, instead, we decide, subsequent to our first calibration of the KCT items, to evaluate person fit. The computer analysis for person misfit, shown in Table 4.7.2, also identifies Persons 13 and 29 as producing the highest fit statistics. So let us recalibrate all 14 of the items but with these two persons removed. Now, in Table 4.7.3, we see that the fit mean squares for all of the items are small enough to satisfy us. Removing the two unsuitable person records has brought all of the items into agreement.

TABLE 4.7.2

ANALYSIS OF PERSON FIT WITH UCON

                   UCON       UCON
PERSON   SCORE     ABILITY    MISFIT
           r          b          v

  25       2        -4.4       0.5
   4       3        -3.7       0.4
  33       3        -3.7       0.9
   1       4        -3.1       0.3
  27       4        -3.1       0.3
  11       5        -2.3       0.8
  12       5        -2.3       0.5
  17       6        -1.4       1.0
  19       6        -1.4       0.2
  30       6        -1.4       0.2
   2       7        -0.3       0.1
   3       7        -0.3       1.4
   5       7        -0.3       0.1
   6       7        -0.3       0.1
   8       7        -0.3       0.1
   9       7        -0.3       0.1
  13       7        -0.3       5.7    (Hand PROX = 4.1)
  16       7        -0.3       0.6
  26       7        -0.3       0.1
  28       7        -0.3       0.6
  29       7        -0.3       6.6    (Hand PROX = 4.1)
  31       7        -0.3       0.1
  10       8        +1.0       0.2
  14       8        +1.0       0.2
  18       8        +1.0       0.4
  20       8        +1.0       0.4
  32       8        +1.0       0.2
  21       9        +2.0       0.2
  22       9        +2.0       0.2
  23       9        +2.0       0.7
  34       9        +2.0       0.7
  15      10        +3.0       0.2
   7      11        +3.9       0.4
  24      11        +3.9       0.9

Mean                            0.7
Standard Deviation              1.6


TABLE 4.7.3

ANALYSIS OF FIT WITH UCON:
PERSON DELETIONS

SEQ   ITEM   ITEM    FIT
NUM   NAME   DIFF    MN SQ

  7    4    -5.70    0.10
 16    6     5.27    0.13
 17    6     5.27    0.13
 15    6     5.27    0.13
  8    4    -2.87    0.17
  9    4    -3.73    0.21
  6    3    -4.24    0.34
 13    5     2.34    0.38
  4    3    -4.84    0.40
 14    6     4.43    0.55
  5    3    -4.24    0.64
 11    5     1.51    0.70
 12    5     3.01    0.99
 10    4    -1.48    1.03

Deleting Persons 13 and 29
L = 14, N = 32

It seems clear that it was the test records of these two unpredictable persons which caused Item 7 and then Item 6 to seem to misfit. Thus, we learn that successive deletions of items without analyzing person fit can lead us to believe that items are misfitting when, in fact, it is the response records of a few irregular persons which are causing the trouble. While the very small sample size used in our example exaggerates the impact of the two irregular persons, even large samples do not completely obliterate the contaminating influence of irregular person records, and in a large sample such flawed records may be harder to spot and so remain unknown unless explicit tests of person fit are routinely made.
5  CONSTRUCTING A VARIABLE

5.1 GENERALIZING THE DEFINITION OF A VARIABLE

In Chapters 2, 3 and 4 we have shown how to expose and evaluate the observed relationship between intended measuring instruments, the test items, and the objects they are intended to measure, the persons. This prepares us for the present chapter, which is concerned with how to define a variable.

With a workable calibration procedure and a method for the evaluation of fit, it becomes practical to turn our attention to a far more important activity, namely a critical examination of the calibrated items to see what it is that they imply about the possibility of a variable of some useful generality. We want to find out whether our calibrated items spread out in a way that shows a coherent and meaningful direction. If they are not spread out at all, then all we have achieved is to define a point, perhaps on some variable, perhaps not. But the variable itself, whatever it may be, remains obscure.

Our intention now is to show how calibrated items can be used to define a variable and how to find out whether the resulting operational definition of the variable makes sense. We will begin by examining the degree to which the spread of item difficulties substantially exceeds the standard error of their estimates, that is, the degree to which the data has given a direction to the variable. For example, suppose we consider the estimates of two item difficulties with their respective standard errors. In order for these two items to define a line between them, the difference between their estimates must be substantially greater than the standard error of this difference. Only if the two estimates are well separated by several such standard errors will we begin to see a line between the two items suggesting a direction for the variable which they define.

If, however, when we compare these two estimates by a standard error or two, they overlap substantially, then we cannot assume that the two values differ, and as a result no direction for a variable has been defined. Instead the items define a point without direction.

Figure 5.1.1 illustrates this. In Example 1 we have Items A and B separated from each other by several standard errors. Even with two items we begin to see a direction to the variable, at least as defined by these two items. In the second example, however, we find the two items so close to each other that, considering their standard errors, they are not separable. We have found a point. But no direction has been established and so no variable has as yet been implied.

As an example of variable definition, we will continue our study of the KCT data to see how well the KCT items succeed in defining a variable and just what that variable seems to be.

5.2 DEFINING THE KCT VARIABLE

The items of the KCT form a tapping series that grows in length by increasing the number of taps and grows in complexity by the distance between adjacent taps and the number of reverses in direction of movement.

Figure 5.2.1 lists the 18 items comprising the original KCT. Each item is described by its numerical name, tapping series and tapping order pattern.

Table 5.2.1 focuses on those 14 KCT items that were calibrated in Chapters 2 and 3. Items 1, 2 and 3 are not included because they were too easy for the 34 persons in that sample and Item 18 is not included because it was too hard. Table 5.2.1 gives the item names, tapping series, item difficulties and their standard errors. The difficulty range of these 14 items is from -4.2 logits to +4.6 logits.

The item difficulties in Table 5.2.1 make it possible to be quantitatively explicit in our definition of the KCT variable by placing the 14 items at their calibrated positions along the line of the variable. This is done in Figure 5.2.2. As several items have either the same difficulty or are so close in terms of their standard errors that they can hardly be differentiated, we have shown only the eight items that best mark out the extent of the KCT variable. The semicircles in Column 2 of Figure 5.2.2 show an allowance of one standard error around each estimated difficulty. We can see that Items 4, 6, 8 and 10 define the easy end of the variable. Then there is a rather wide undefined gap in the middle. Finally, Items 11, 12, 14 and 16 define the hard end. The tapping patterns in Column 3 show what movement along the variable means in terms of the increasing number of taps and pattern complexity. Column 4 gives the distribution along this KCT variable of the 34 persons who participated in the initial calibration.

TABLE 5.2.1

CALIBRATION OF THE KCT VARIABLE
WITH ITEMS IN ORDER OF DIFFICULTY

ITEM    TAPPING          ITEM           STANDARD
NAME    SERIES           CALIBRATION    ERROR

  4                        -4.2          0.8
  5     2-1-4              -3.6          0.7
  7                        -3.6          0.7
  6                        -3.2          0.6
  9                        -3.2          0.6
  8                        -2.2          0.6
 10                        -1.5          0.5
 11     1-3-1-2-4           0.8          0.5
 13                         1.9          0.5
 12                         2.1          0.6
 14                         3.2          0.7
 15                         4.6          1.1
 16     1-4-2-3-1-4         4.6          1.1
 17                         4.6          1.1

Mean                        0.0
Standard Deviation          3.4

UCON Calibration from Table 3.4.1

We see that most of the persons in this sample fall in the center of the test. But that is just where we have a large gap in test items. We have discovered something important and useful to us, namely that our test instrument is weakest at the mode of our sample. It becomes clear that, if we want to discriminate among the majority of persons found in the middle range of the KCT, then we must construct some additional middle range items which will be more appropriate to middle range abilities.

5.3 INTENSIFYING AND EXTENDING THE KCT VARIABLE

To improve measurement along the KCT variable, especially in the middle range, further item development is required. We need items to fill the gap in the original definition of the variable and we need easier and harder test items in order to extend the variable's range. However, since all of our sample passed the three easiest items, extending the KCT variable down to easier levels may prove difficult. We would have to locate some much less able persons than were found among our original 34 in order to calibrate easier items. On the other hand, only one hard item was failed by all persons in our KCT sample. It might be fruitful to try to add some items which are more difficult than Items 15, 16 and 17, under the assumption that with a sample of more able persons we could obtain useful calibrations of these more difficult items and thus extend the KCT variable upward.

With these considerations in mind, further development of the KCT variable was undertaken. All 18 items from the original KCT were retained, and ten new items were added. The original KCT was from Form II of the Arthur Point Scale. We examined Form I and found three items not used in Form II (Arthur, 1943). To these three items we added seven more. Five items were designed to fill the middle range gap, four items were designed to extend the KCT variable upward and one of the Form I items was expected to fit near old Items 5, 6 and 7. The tapping series for these additional items and their intended locations on the KCT variable are shown in Figures 5.3.1 and 5.3.2.

Figure 5.3.1 shows the one item from Form I and the five new items designed to fill the gap between the old KCT Items 10 and 11. The four items designed to extend the KCT in the region of Item 18 are shown in Figure 5.3.2. The result is a new test form, KCTB, which contains all 18 old items and, in addition, 10 new items. This new instrument of 28 items was administered to a sample of 101 persons and Items 2 through 25 were calibrated. Item 1 was still too easy and Items 26, 27 and 28 were still too hard to be calibrated.

Column 6 in Table 5.3.1 gives these new KCTB calibrations. The rest of Table 5.3.1 shows the relationship between the old KCT and the new KCTB calibrations. Column 1 names the 14 old KCT items. Column 2 shows their original calibrations from Table 3.4.4. Notice in Column 6 that we have now obtained calibrations on old KCT Items 2, 3 and 18, three of the original items which remained uncalibrated in our first study with 34 persons.

Column 3 of Table 5.3.1 applies the necessary adjustment to bring the old KCT calibrations into line with their new calibrations on the new KCTB. This is done by shifting the calibrations in Column 2 by the constant 0.4, which is the mean position of the old KCT items in the new KCTB calibrations. This causes Column 3 and Column 5 to have the same mean of 0.4.

In Table 5.3.1 we see that the new KCTB Items 12 through 16 fall more or less where expected, if somewhat on the easy side. KCTB Item 25 along with KCT Item 18 extend the reach of the KCT variable 2 logits further upwards, but we have found no one who succeeds on KCTB Items 26, 27 and 28.

Figure 5.3.3 compares the difficulties of those items which appeared in both the KCT and KCTB calibrations. Each of the 14 items is located in Figure 5.3.3 by its pair of difficulty estimates. If the items fit the measurement model, then we expect these independent estimates of their difficulties to be statistically equivalent.

Thus the extent to which the 14 points fall along the identity line tests the invariance of these 14 item difficulties. As Figure 5.3.3 shows, the 14 points all lie well within the 95% quality control lines. This is the pattern that the model says they must approximate in order to be useful as instruments of measurement.

Old KCT Items 1-6 become new KCTB Items 1-6; old KCT Items 7-9 become new KCTB Items 8-10.

Old KCT Items 12-16 become new KCTB Items 18-22.

TABLE 5.3.1

CALIBRATION OF KCTB

      1           2            3            4           5            6
   Old KCT       KCT Calibration          New KCTB    KCTB Calibration
   Item Name   Unadjusted   Adjusted*     Item Name   Old Items   All Items

                                              2                     -6.0
                                              3                     -5.6
      4          -4.2         -3.8            4         -3.8        -3.8
      5          -3.6         -3.2            5         -2.3        -2.3
      6          -3.2         -2.8            6         -2.5        -2.5
                                              7                     -4.0
      7          -3.6         -3.2            8         -2.3        -2.3
      8          -2.2         -1.8            9         -1.8        -1.8
      9          -3.2         -2.8           10         -1.8        -1.8
     10          -1.5         -1.1           11         -0.8        -0.8
                                             12                      0.1
                                             13                     -0.6
                                             14                     -0.3
                                             15                     -1.3
                                             16                     -0.5
     11           0.8          1.2           17          2.2         2.2
     12           2.1          2.5           18          1.6         1.6
     13           1.9          2.3           19          2.2         2.2
     14           3.2          3.6           20          3.1         3.1
     15           4.6          5.0           21          3.6         3.6
     16           4.6          5.0           22          3.6         3.6
     17           4.6          5.0           23          4.7         4.7
     18                                      24                      6.5
                                             25                      6.0

   Mean           0.0          0.4                        0.4         0.0
   Standard
   Deviation      3.4          3.4                        2.8         3.4

   KCT: L = 14, N = 34          KCTB: L = 24, N = 101

* The Chapter 3 calibrations of the 14 old KCT items in Column 2 have been shifted along the variable by 0.4 logits so that the mean of these Chapter 3 calibrations equals their mean calibration in the new KCTB calibrations. This new mean was calculated from Column 5.

FIGURE 5.3.3

PLOT OF ITEM CALIBRATIONS, KCT VERSUS KCTB
(KCT item calibrations plotted against KCTB item calibrations with 95% control lines; items identified by old KCT names.)


FIGURE 5.4.1

HOW TO FORM 68% AND 95% QUALITY CONTROL LINES
(The identity line with its 95% boundaries drawn around it.)

5.4 CONTROL LINES FOR IDENTITY PLOTS

Figure 5.3.3 contains a pair of 95% quality control lines which help us see the extent to which the 14 item points conform to our model expectation of item difficulty invariance. In plots which are used to evaluate the invariance of item difficulty, and hence the quality of items, these 95% lines make it easy to see how satisfactorily the item points in the plot follow the expected identity line.

Figure 5.4.1 shows how such lines are drawn. Each plot compares a series of paired item calibrations. Each item has a difficulty d_i and a standard error s_i from each of two independent calibrations in which the item appeared. Thus for each item i we have (d_i1, s_i1) and (d_i2, s_i2). Since each pair of calibrations applies to one item, we expect the two difficulties d_i1 and d_i2, after a single translation necessary to establish an origin common to both sets of items, to estimate the same difficulty δ_i. We also expect the error of these estimates to be estimated by s_i1 and s_i2.

This gives us a statistic for testing the extent to which the two d_i's estimate the same δ_i, namely

t_i12 = (d_i1 - d_i2)/(s_i1² + s_i2²)^½ ~ N(0,1)              [5.4.1]

in which (s_i1² + s_i2²)^½ estimates the expected standard error of the difference between the two independent estimates d_i1 and d_i2 of the one parameter δ_i. We can introduce this test for the quality of each item point into the plot by drawing quality control boundaries at about two of these standard errors away from the identity line on each side.

Since the standard unit of difference error parallel to either axis of the plot is (s_i1² + s_i2²)^½, the unit of error perpendicular to the 45 degree identity line must be

[(s_i1² + s_i2²)/2]^½ .

Two of these error units perpendicular to the identity line in each direction yields a pair of approximately 95% quality control lines. The perpendicular distance D_i12 between these quality control lines and the identity line thus becomes

D_i12 = 2[(s_i1² + s_i2²)/2]^½ .                              [5.4.2]

When s_i1 and s_i2 are sufficiently similar so that the mean of their squares is approximately the same as the square of their mean, that is

(s_i1² + s_i2²)/2 ≈ [(s_i1 + s_i2)/2]² ,

then the distance D_i12 from the identity line to a 95% confidence boundary can be approximated by

D_i12 = 2[(s_i1² + s_i2²)/2]^½ ≈ 2[(s_i1 + s_i2)/2] = s_i1 + s_i2 .

Thus for the i = 1, K items for which paired calibrations are available, the distances (s_i1 + s_i2) perpendicular to the identity line drawn through each item point can be used to locate 95% confidence lines for evaluating the overall stability of the item calibrations shown in the plot.
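A minimal sketch of Equation 5.4.2 in code (the function name and the example standard errors are ours):

```python
import math

def control_line_distance(s1, s2):
    """Perpendicular distance from the identity line to the approximate 95%
    quality control lines for an item whose two independent calibrations have
    standard errors s1 and s2 (Equation 5.4.2)."""
    return 2.0 * math.sqrt((s1**2 + s2**2) / 2.0)

# When the two errors are similar this is close to s1 + s2:
print(round(control_line_distance(0.36, 0.49), 2))   # about 0.86, versus 0.36 + 0.49 = 0.85
```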

5.5 CONNECTING TWO TESTS

The usual method for equating tests is based on the equation of equal-percentile scores. This procedure requires a sample of persons large enough and broadly enough distributed to assure an adequate definition of each score-to-percentile connection. With Rasch measurement a more economical and better controlled method for building an item bank becomes possible. Links of 10 to 20 common items can be embedded in pairs of otherwise different tests. Each test can then be administered to its own separate sample of persons. No person need take more than one test. But all items in all tests can be subsequently connected through the network of common item links.

To begin with a simple example, a traditional approach to equating two 60-item tests A and B might be to give them simultaneously to a sample of at least 1200 persons as depicted in the upper part of Figure 5.5.1. This is a likely plan since a detailed definition of score percentiles is necessary for successful percentile equating. Each person must take both tests, 120 items.

In contrast, a Rasch approach could do the same job with each person taking only one test of 60 items. To accomplish this a third 60-item test C is made up of 30 items from each of the original tests A and B. Then each of these three tests is given to a sample of 400 persons as depicted in the lower part of Figure 5.5.1. Now each person takes only one test, but all 120 items are calibrated together through the two 30-item links connecting the three tests. The testing burden on each person is one-half of that required by the equal-percentile plan.

In Rasch equating the separate calibrations of each test produce a pair of independent item difficulties for each linking item. According to the model, the estimates in each pair are statistically equivalent except for a single constant of translation common to all pairs in the link. If two tests, A and B, are joined by a common link of K items, each test is given to its own sample of N persons, and d_iA and d_iB are the estimated difficulties of item i in each test with standard errors of about 2.5/N^½, then the single constant necessary to translate all item difficulties in the calibration of Test B onto the scale of Test A is

G_AB = Σ(i=1 to K) (d_iA - d_iB)/K                            [5.5.1]

with a standard error of about 3.5/(NK)^½ logits.

The quality of this link can be evaluated by the fit statistic

Σ(i=1 to K) (d_iA - d_iB - G_AB)² (N/12) [K/(K - 1)] ~ χ²_K   [5.5.2]

which according to the model should be approximately chi-square with K degrees of freedom.

The individual fit of any item in the link can be evaluated by

(d_iA - d_iB - G_AB)² (N/12) [K/(K - 1)] ~ χ²_1               [5.5.3]

which according to the model should be approximately chi-square with one degree of freedom.
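As a minimal sketch of Equations 5.5.1 and 5.5.2 in code (the function name is ours; the example difficulties are the six KCTB linking items used later in Section 5.9, with calibration samples of about 50 persons each):

```python
def link_shift_and_fit(dA, dB, N):
    """Link translation constant G_AB (Equation 5.5.1) and the approximate
    chi-square link fit statistic (Equation 5.5.2) for K common items with
    difficulties dA and dB from two separate calibrations, each based on
    about N persons."""
    K = len(dA)
    G = sum(a - b for a, b in zip(dA, dB)) / K
    chi_sq = sum((a - b - G) ** 2 for a, b in zip(dA, dB)) * (N / 12.0) * (K / (K - 1.0))
    return G, chi_sq

dE = [0.97, 2.08, 1.58, 1.95, 0.84, 1.21]
dH = [-2.24, -1.83, -3.22, -2.80, -3.90, -2.02]
print(link_shift_and_fit(dE, dH, 50))   # a shift of about 4.11 logits, as in Section 5.9
```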

FIGURE 5.5.1

TRADITIONAL AND RASCH EQUATING DESIGNS
(Upper panel: traditional equal-percentile equating. Lower panel: Rasch common item equating.)

In using these chi-square statistics to judge link quality we must not forget how they are affected by sample size. When N exceeds 500 these chi-squares can detect link flaws too small to make any tangible difference in G_AB. When calibration samples are large the root mean square misfit is more useful. This statistic can be used to estimate the logit increase in calibration error caused by link flaws.

In deciding how to act on evaluations of link fit, we must also keep in mind that random uncertainty in item difficulty of less than .3 logits has no practical bearing on person measurement (Wright and Douglas, 1975a, 35-39). Because of the way sample size enters into the calculation of item difficulty and hence into the evaluation of link quality, we can deduce that samples of 200 persons and links of 10 good items will always be more than enough to supervise link validity at better than .3 logits. In practice we have found that we can construct useful item banks with sample units as small as 100 persons.

5.6 BUILDING ITEM BANKS

As we establish and extend the definition of a variable by the addition of new items we have the beginning of an item bank. With careful planning we can introduce additional items systematically and in this way build up a bank of calibrated items useful for an increasing variety of measurement applications. As the number of items increases, the problems of managing such a bank multiply. There is not only the question of how best to select and combine items and persons, but of how to manage effectively the consequent collection of calibrated items. Rasch measurement provides a specific well-defined approach to managing item banking.

The basic structure necessary to calibrate many items onto a single variable is the common item link, in which one set of linking test items is shared by and so connects together two otherwise different tests. An easy and a hard test could be linked by a common set of items as pictured in Figure 5.6.1. In this example the linking items are the "hard" items in the EASY test but the "easy" items in the HARD test.

With two or more test links we can build a chain of the kind shown in Figure 5.6.2. The representation in Figure 5.6.2, however, is awkward. The linking structure can be conveyed equally well by the simpler scheme in Figure 5.6.3 which emphasizes the links and facilitates diagramming more complicated structures.

As the number and difficulty range of the items introduced into an item bank grows beyond the test-taking capacity of any one person, the chain of items must be parceled into test forms of manageable length and difficulty range. In Figure 5.6.3 each circle indicates a test sufficiently narrow in range of item difficulties to be manageable by a suitably chosen sample of persons. Each line connecting a circle represents a link of common items shared by the two tests it joins. Tests increase in difficulty horizontally along the variable and are comparable in difficulty vertically.

FIGURE 5.6.2

A CHAIN WITH TWO LINKS
(Link AB and Link BC join three tests arranged from easy to hard along the variable.)

FIGURE 5.6.3

A CHAIN OF TWO LINKS (Simplified)
(The same two links drawn in the simpler scheme, from easy to hard along the variable.)

Three links can be constructed to form a loop as in Figure 5.6.4. This loop is an important linking structure because it yields an additional test of link coherence. If the three links in a loop are consistent, then the sum of their three link translations should estimate zero,

G_AB + G_BC + G_CA ≈ 0 .

Notice that G_AB means the shift from Test A to Test B as we go around the loop clockwise, so that G_CA means the shift from Test C back to Test A. Estimating zero "statistically" means that the sum of these shifts should come to within a standard error or two of zero. The standard error of the sum G_AB + G_BC + G_CA will be about

3.5 (1/N_AB K_AB + 1/N_BC K_BC + 1/N_CA K_CA)^½

in which the N's are the calibration sample sizes and the K's are the number of items in each link.
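A minimal sketch of this loop check in code (the function name and the example shifts are ours, purely hypothetical):

```python
import math

def loop_check(shifts, samples, link_sizes):
    """Sum of the translation constants around a loop of three links and the
    approximate standard error of that sum. A coherent loop should give a
    sum within a standard error or two of zero."""
    total = sum(shifts)                      # G_AB + G_BC + G_CA
    se = 3.5 * math.sqrt(sum(1.0 / (n * k) for n, k in zip(samples, link_sizes)))
    return total, se

# Hypothetical loop: three 10-item links, each calibrated on 200 persons.
total, se = loop_check([1.30, 0.85, -2.05], [200, 200, 200], [10, 10, 10])
print(round(total, 2), round(se, 2))   # sum 0.10 against a standard error of about 0.14
```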

With four or more tests we can construct a network of loops. For example, a sequence of increasingly difficult tests could be commonly calibrated by a series of connecting links as shown in Figure 5.6.5. These ten tests mark out seven levels of difficulty from Tests A through D. This network could connect ten 60-item tests by means of nineteen 10-item links to cover 600 - 190 = 410 items. If 200 persons were used for each test, then 410 items could be evaluated for possible calibration together from the responses of only 2,000 persons. Even 1,000 persons, at 100 per test, would provide a substantial purchase on the possibilities for building an item bank out of the best of the 410 items.

The building blocks of a test network are the loops of three tests each. If a loop fits the Rasch model, then its three translations should sum to within a standard error or two of zero. Thus the success of the network at linking item calibrations can be evaluated from the magnitudes and directions of these loop sums. Shaky regions can be identified and steps taken to avoid or improve them.

The implementation of test networks can lead to banks of commonly calibrated items far larger in number and far more dispersed in difficulty than any single person can handle. The resulting banks, because of the calibration of their items onto one common variable, can provide the item resources for a prolific family of useful tests, long or short, easy or hard, widely spaced in item difficulty or narrowly focused, all automatically equated in the measures they imply.

These methods for building item banks can be applied to existing tests, if they have been carefully constructed. Suppose we have two non-overlapping, sequential series of tests A1, A2, A3, A4 and B1, B2, B3, B4 which we want to equate by Rasch methods. All eight tests can be equated by connecting them with a new series of intermediate tests X, Y and Z made up entirely from items common to both series, as shown in Figure 5.6.6. Were the A and B series of tests in Figure 5.6.6 still in the planning stage, they could also be linked directly by embedding common items in each test according to the pattern shown in Figure 5.6.7.

Since coherence is a vital concern in the building of an item bank, we are especially interested in linking structures which maximize statistical control over the joint coherence of all item calibrations. Networks which maximize the number of links among test forms, so that each form is linked to as many other forms as possible, do this. In the extreme, this leads to a web in which every individual item in a form links that form to another different form.

FIGURE 5.6.6

CONNECTING TWO NON-OVERLAPPING TEST SERIES BY INTERMEDIATE LINKING TESTS
(Test Series A and Test Series B, arranged from easy to hard along the variable, joined by intermediate linking tests.)

FIGURE 5.6.7

CONNECTING TWO TEST SERIES BY EMBEDDING COMMON ITEMS
(Test Series A and Test Series B, arranged from easy to hard along the variable, linked by items embedded in both series.)

To illustrate, we take a very small banking problem where we use 10 items per form in a web in which each of these 10 items also appears in one of 10 other different forms. The complete set of 10 + 1 = 11 forms constitutes a web woven out of 11 x 10/2 = 55 individual linking items. Every one of the 11 forms is woven to every other form. The pattern looks like the picture in Figure 5.6.8.

We will call this bank building design a "complete" web because every form is woven to every other form. In the design of useful webs, however, there are three constraints which affect their construction. These are the total number of items we want to calibrate into the bank, the maximum number of items which we can combine into a single form and the extent to which the bank we have in mind reaches out in difficulty beyond the capacity of any one person.

The testing situation and the capacity of the persons taking the test forms will limit the number of items we can put into a single form. It will usually happen, however, that we want to calibrate many more items than we can use up in a complete web like the one illustrated in Figure 5.6.8. There are two possibilities for including extra items. The simplest, but not the best statistically, is to design a "nuclear" complete web which uses up some portion of the items we can include in a single form. We then fill out the required form length with additional "tag" items. These tag items are calibrated into the bank along with the link items in their form. Unlike the link items, however, which always

FIGURE 5.6.8

A COMPLETE WEB FOR PARALLEL FORMS

            Forms
        A   B   C   D   E   F   G   H   I   J   K
    A   \   1   2   3   4   5   6   7   8   9  10
    B       \  11  12  13  14  15  16  17  18  19
    C           \  20  21  22  23  24  25  26  27
    D               \  28  29  30  31  32  33  34
    E                   \  35  36  37  38  39  40
    F                       \  41  42  43  44  45
    G                           \  46  47  48  49
    H                               \  50  51  52
    I                                   \  53  54
    J                                       \  55
    K                                           \

11 Forms
10 Items per form
(11 x 10)/2 = 55 Items

The number entered in each cell is the identification of the item linking the two forms which define the position of that cell.

appear in two forms, the tag items appear in only one form and so give no help with linking forms together into one commonly calibrated bank.

Another possibility, which is better statistically, is to increase the number of forms used while keeping the items per form fixed at the required limit. This opens the web in a systematic way but still uses every item twice so that the paired data on that item can be used to evaluate the coherence of bank calibrations. Figure 5.6.9 shows an "incomplete" web for a 21 form design with 10 items per form, as in Figure 5.6.8, but with nearly twice as many items used in the incomplete web.

The incomplete web in Figure 5.6.9 is suitable for linking a set of parallel test forms. When the reach of the bank goes beyond the capacity of any one person, however, neither of the webs in Figures 5.6.8 and 5.6.9 will suffice, because we will be unable to combine items from the easy and hard ends of the bank into the same forms. The triangle of linking items in the upper right corners of Figures 5.6.8 and 5.6.9 will not be functional and will have to be deleted. In order to maintain the balance of linking along the variable we will have to do something at each end of the web to fill out the easiest and hardest forms so that the extremes are as tightly linked as the center. Figure 5.6.10 shows how this can be done systematically for a set of 21 sequential forms. We still have 10 items per form but now only adjacent forms are linked together. There are no common items connecting the easiest forms directly with the hardest forms. But over the range of the variable the forms near to one another in difficulty level are woven together with the maximum number of item links.

Each linking item in the webs shown in Figures 5.6.8, 5.6.9 and 5.6.10 could in fact refer to a cluster of two or more items which appear together in each of the two forms they link. Sometimes the design or printing format of items forces them into clusters. This happens typically in reading comprehension tests where clusters of items are attached to reading passages. It also occurs naturally on math and information retrieval tests where clusters of items refer to common graphs. Clustering, of course, increases the item length of each form by a factor equal to the cluster size.

The statistical analysis of a bank-building web is simple, if the web is complete as in Figure 5.6.8. The row means of the corresponding matrix of form links are least squares estimates of the form difficulties. We need only be careful about signs. If the web cell entry G_jk estimates the difference in difficulty (δ_j - δ_k) between forms j and k, and the form difficulties are centered at zero so that the mean form difficulty δ. = 0, then

G_j. = Σ(k=1 to M) G_jk /M

The row means of the link matrix calibrate the forms onto one common variable. Once form difficulties are obtained they need only be added to the item difficulties within forms to bring all items onto the common variable shared by the forms.
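A minimal sketch of this row-mean calculation in code (the function name and the small three-form matrix are ours, purely illustrative):

```python
def form_difficulties(G):
    """Row means of a complete, skew-symmetric link matrix G, where G[j][k]
    estimates the difficulty difference between forms j and k. With the form
    difficulties centered at zero, these row means are least squares estimates
    of the form difficulties."""
    M = len(G)
    return [sum(row) / M for row in G]

# Hypothetical 3-form web with form difficulties of roughly -1, 0 and +1:
G = [[0.0, -1.1, -1.9],
     [1.1,  0.0, -0.9],
     [1.9,  0.9,  0.0]]
print([round(d, 2) for d in form_difficulties(G)])   # about [-1.0, 0.07, 0.93]
```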

FIGURE 5.6.9

AN INCOMPLETE WEB FOR PARALLEL FORMS
(21 forms, A through U, with 10 items per form; each of the (21 x 10)/2 = 105 items appears in exactly two forms.)

Formulation: N = ML/2
where N = number of items (or links) in the bank,
      M = number of forms, i.e., 2N/L, and
      L = number of items (or links) per form, which must be even.

FIGURE 5.6.10

AN INCOMPLETE WEB FOR SEQUENTIAL FORMS
(21 forms, A through U, ordered from easy to hard, with 10 items per form and 21 x 10/2 + 3 = 108 items; only forms near one another in difficulty are linked.)

Formulation: N = ML/2 + K
where N = number of items (or links) in the bank,
      M = number of forms, i.e., 2(N - K)/L,
      L = number of items (or links) per form, which must be even, and
      K = L/4 if L/2 is even, or (L + 2)/4 if L/2 is odd.

The incomplete webs in Figures 5.6.9 and 5.6.10 require us to estimate row means
from a matrix with missing data. The skew symmetry o f link matrices helps the solution
to this problem which can be done satisfactorily by iteration or regression.

5.7 BANKING THE KCTB DATA

The KCTB is a short test, so it was practical to ask all 101 persons to attempt all 23 items, giving us the response matrix illustrated in Figure 5.7.1. However, most item banking projects involve the calibration of hundreds of items given to thousands of examinees. It is then impossible to ask every person to take every item. Fortunately, building an item bank does not require such an undertaking. As we saw in Section 5.6, items can be joined together by a network of links. In general, two types of form equating are possible, common persons and common items.

One way to link separate forms is to administer them both to the same sample of persons. We illustrate "common person" equating with our KCTB data by defining two non-overlapping sequential tests, EASY and HARD, and finding everyone who produced measurable responses simultaneously in both tests. This is an attempt at the vertical equating of an easy and a hard test and we can expect persons with usable scores on both tests to be scarce. With our KCTB example there are only 29 such persons out of 101. The picture of this common person equating in Figure 5.7.2 shows the core of 29 persons from the total sample linking two non-overlapping parts of the KCTB, a 9-item EASY test and an 8-item HARD test.

FIGURE 5.7.1

COMMON PERSONS AND COMMON ITEMS WITH KCTB
(The 101-person by 23-item response matrix for Items 3 through 25. Item 2 has been dropped because it is too easy to be useful.)


FIGURE 5.7.2

COMMON PERSON EQUATING WITH KCTB
(9 easy items: 5-11, 13, 15. 8 hard items: 12, 14, 16-20, 22. The 29 common persons out of the total sample of 101 link the two parts.)

A better way to equate forms is by using com mon items. This approach to K CTB
is shown in Figure 5.7.3. There we show eight easy items connected to nine hard items by
a six item link producing a 14 item E A S Y + L IN K form taken by the 50 lowest scoring
persons and a 15 item L IN K + H A R D form taken by the 51 highest scoring persons.

5.8 COMMON PERSON EQUATING WITH THE KCTB

Among the 101 persons taking KCTB Items 3 through 25 we found two non-overlapping sequential forms, called EASY and HARD, for which 29 persons had a pair of usable scores. The EASY form was made from Items 5 through 11, 13 and 15. The HARD form was made from Items 12, 14, 16 through 20 and 22. The 29 persons were those who remained after high scoring persons were removed because of perfect scores on the EASY test and low scoring persons were removed because of zero scores on the HARD test.

The measurements of these 29 persons on each form constitute the common person data for linking the EASY and HARD forms together. It is the difference in the two ability means which estimates the shift required to bring the EASY and HARD forms onto a common scale. The ability statistics for the 29 persons on each form are

                       EASY Form    HARD Form    Difference
Mean Ability              1.49        -0.57         2.06
Standard Deviation        0.80         0.43

The equating procedure is as follows:

1. Use the observed difference in sample mean ability, 1.49 - (-0.57) = 2.06, as the estimated difficulty difference between the two forms.

2. Apportion this difference over the nine EASY items and the eight HARD items so that the average difficulty of all 17 items becomes zero.

   For the nine EASY items use

   [(17 - 9)/17] (2.06) = 0.97

   For the eight HARD items use

   [(17 - 8)/17] (2.06) = 1.09

3. Bring the two forms onto a common scale by subtracting 0.97 from each EASY form item difficulty and adding 1.09 to each HARD form item difficulty, as sketched in code below.
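A minimal sketch of these three steps (the function name is ours; the inputs are the sample statistics above):

```python
def common_person_shifts(mean_easy, mean_hard, n_easy, n_hard):
    """Apportion the difference in mean ability between the two forms so that
    the average difficulty of all items becomes zero (Section 5.8, steps 1-3).
    Returns the amount to subtract from EASY item difficulties and the amount
    to add to HARD item difficulties."""
    total = n_easy + n_hard
    diff = mean_easy - mean_hard                  # estimated form difficulty difference
    easy_shift = (total - n_easy) / total * diff  # subtract from each EASY difficulty
    hard_shift = (total - n_hard) / total * diff  # add to each HARD difficulty
    return easy_shift, hard_shift

print([round(s, 2) for s in common_person_shifts(1.49, -0.57, 9, 8)])   # [0.97, 1.09]
```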

These computations are displayed in Table 5.8.1. Column 1 gives the KCTB item name for the 17 items used in the EASY and HARD forms. Column 2 gives the separate item calibrations for the EASY form. Column 3 gives the separate calibrations for the HARD form. Because these separate calibrations are each centered within their own form, Columns 2 and 3 each sum to zero.

Converting the calibrations in Columns 2 and 3 to a centered common person scale requires subtracting 0.97 from the EASY form item difficulties in Column 2 and adding 1.09 to the HARD form item difficulties in Column 3. This is done in Columns 4 and 5, resulting in a common person scale for all 17 items centered at 0.0.

In order to evaluate the efficacy of this common person equating we obtained a combined calibration of all 17 items from the same 29 persons. Column 6 gives these reference calibrations and Column 7 gives the differences between the common person scale and the reference scale.

Figure 5.8.1 compares the common person scale and the reference scale. The small differences between the two scales show that the common person technique can produce results equivalent to a combined calibration of both tests.

TABLE 5.8.1

EQUATING EASY AND HARD FORMS USING COMMON PERSONS

         1         2        3            4                5              6            7
              Separate Calibrations        Common Person Scale        Reference
Item        EASY     HARD        EASY              HARD              Calibration*   Difference
Name         dE       dH         dC = dE - 0.97    dC = dH + 1.09       dR

  5         0.03                 -0.94                                -1.04          -0.10
  6         0.03                 -0.94                                -1.04          -0.10
  7        -0.94                 -1.91                                -2.05          -0.14
  8         0.03                 -0.94                                -1.04          -0.10
  9         0.24                 -0.73                                -0.82          -0.09
 10         0.43                 -0.54                                -0.62          -0.08
 11         1.36                  0.39                                 0.35          -0.04
 12                  -1.44                          -0.32             -0.10           0.22
 13        -0.22                 -1.19                                -1.30          -0.11
 14                  -1.25                          -0.16              0.05           0.21
 15        -0.94                 -1.91                                -2.05          -0.14
 16                  -2.66                          -1.57             -1.30           0.27
 17                  -0.12                           0.97              1.10           0.13
 18                   0.65                           1.74              1.81           0.07
 19                   0.65                           1.74              1.81           0.07
 20                   1.83                           2.92              2.90          -0.02
 22                   2.32                           3.41              3.36          -0.05

Mean        0.00      0.00            0.00 (Columns 4 and 5)           0.00           0.00
Standard
Deviation   0.70      1.70            1.62                             1.65           0.14

* Based on 29 persons taking all 17 items


FIGURE 5.8.1

COMPARISON OF COMMON PERSON SCALE WITH THE REFERENCE SCALE
(Reference scale plotted against the common person scale.)
5.9 COMMON ITEM EQUATING WITH THE KCTB

To illustrate common item equating we have divided the 23 KCTB items into three parts: EASY, LINK and HARD. The EASY + LINK form contains eight EASY items and six LINK items to make a 14 item easy test. The LINK + HARD form contains the six common LINK items plus nine HARD items making a 15 item hard test.

Each of these forms was calibrated on separate samples. The EASY + LINK form was calibrated on the 50 lowest scoring persons and the LINK + HARD form was calibrated on the 51 highest scoring persons. These calibrations are given in Table 5.9.1.

The paired calibrations of the six linking items, 11 through 16, are given again in Columns 2 and 3 of Table 5.9.2. Their differences D = dE - dH are given in Column 4. The mean of these differences is 4.11, which is the difficulty difference between the EASY + LINK form and the LINK + HARD form. When this difference of 4.11 is subtracted from D we have the residuals from linking given in Column 5.

TABLE 5.9.1

ITEM CALIBRATIONS OF EASY + LINK AND LINK + HARD FORMS

Item        EASY + LINK             LINK + HARD
Name     Difficulty   Error      Difficulty   Error

  3       -3.80
  4       -2.00
  5       -0.37
  6       -0.37
  7       -2.00
  8       -0.37
  9        0.06
 10        0.20
 11        0.97       .36          -2.24       .49
 12        2.08       .38          -1.83       .44
 13        1.58       .36          -3.22       .73
 14        1.95       .37          -2.80       .61
 15        0.84       .36          -3.90      1.01
 16        1.21       .36          -2.02       .46
 17                                  0.60
 18                                 -0.50
 19                                  0.26
 20                                  1.18
 21                                  1.56
 22                                  1.56
 23                                  2.78
 24                                  4.51
 25                                  4.06

Mean       0.00                      0.00
Standard
Deviation  1.68                      2.64

If these items are providing a usable link, their residuals should distribute around zero with the standard error predicted by the model. The standard errors

S_D = (S_E² + S_H²)^½

of these residuals are given in Column 6 and the standardized residuals

z = (D - 4.11)/S_D

are given in Column 7.
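A minimal sketch of the Table 5.9.2 computations in code (the function name is ours; the inputs are the six linking items and their standard errors from Table 5.9.1):

```python
import math

def link_residuals(dE, dH, sE, sH):
    """Link shift, then the standardized residuals for a set of common items
    calibrated in two forms. sE and sH are the standard errors of the two
    sets of difficulty estimates."""
    K = len(dE)
    D = [e - h for e, h in zip(dE, dH)]            # difficulty differences
    shift = sum(D) / K                             # link shift
    sD = [math.sqrt(a**2 + b**2) for a, b in zip(sE, sH)]
    z = [(d - shift) / s for d, s in zip(D, sD)]   # standardized residuals
    return shift, z

dE = [0.97, 2.08, 1.58, 1.95, 0.84, 1.21]
dH = [-2.24, -1.83, -3.22, -2.80, -3.90, -2.02]
sE = [0.36, 0.38, 0.36, 0.37, 0.36, 0.36]
sH = [0.49, 0.44, 0.73, 0.61, 1.01, 0.46]
shift, z = link_residuals(dE, dH, sE, sH)
print(round(shift, 2), [round(v, 2) for v in z])
# a shift of about 4.11; standardized residuals near -1.5, -0.3, 0.9, 0.9, 0.6, -1.5
```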

Figure 5.9.1 is a plot of the EASY calibrations of these LINK items against their HARD calibrations. The item points are well within the 95% control lines, demonstrating that the shift estimated from this link can be used to connect the two forms.

TABLE 5.9.2

LINK ANALYSIS

                Calculating LINK SHIFT                    Testing LINK FIT
Item        EASY      HARD     Difference       Residual     Standard Error     Standardized Residual
Name         dE        dH      D = dE - dH      D - 4.11     of Residual SD     z = (D - 4.11)/SD

 11         0.97     -2.24        3.21           -0.90           0.61               -1.48
 12         2.08     -1.83        3.91           -0.20           0.58               -0.34
 13         1.58     -3.22        4.80            0.69           0.81                0.85
 14         1.95     -2.80        4.75            0.64           0.71                0.90
 15         0.84     -3.90        4.74            0.63           1.07                0.59
 16         1.21     -2.02        3.23           -0.88           0.58               -1.52

Mean        1.44     -2.67        4.11            0.00                              -0.17
Standard
Deviation   0.52      0.79        0.76            0.76                               1.13

LINK Shift = Σ D_i /6 = 4.11

Standard Error of Residual: S_D = (S_E² + S_H²)^½

Expected mean of z is 0
Expected standard deviation of z is 1

Our next step is to connect E A S Y + L IN K to L IN K + H A R D . We do this by con­


necting both L IN K s and H A R D to E A S Y . Table 5.9.3 shows the method used. In Column
I we have the item name fo r each o f the 23 K C TB items. The item difficulties o f Items 3
through 10 are given in Column 2. Because we will reference all other items to E A S Y, we
record the difficulties fo r Item 3 through 10 directly into Column 6 . For L IN K Items 11
through 16 we have tw o sets o f difficulties. In Column 2 we have difficu lty estimates for
Items 11 through 16 from calibration with the E A S Y items. In Column 3 we have d if­
ficulty estimates fo r these same items obtained from their calibration with the H A R D
items.

T o the L IN K difficulties d H we add the link difficu lty difference o f 4.11. Then we
average the L IN K dE difficulties with the L IN K dH difficulties that were adjusted by the
L IN K shift o f 4.11. The average o f the tw o L IN K estimates (d E + dH + 4.11)/2 fo r Items
I I through 16 is given in Column 5. We enter these in Column 6.

FIGURE 5.9.1

LINK FOR COMMON ITEM EQUATING

[Plot of the six LINK item calibrations on the EASY + LINK form against their calibrations on the LINK + HARD form, with 95% control lines.]

TABLE 5.9.3

EQUATING EASY AND HARD FORMS
BY A COMMON ITEM LINK

   1         2           3           4               5                6          7
        Calibrating Each Form          Shifting to EASY + LINK      Common Item Scale
Item   EASY + LINK  LINK + HARD                                               Centered
Name       dE           dH        dH + 4.11   (dE + dH + 4.11)/2       dc     dc - 2.30

  3      -3.80                                                        -3.80     -6.10
  4      -2.00                                                        -2.00     -4.30
  5      -0.37                                                        -0.37     -2.67
  6      -0.37                                                        -0.37     -2.67
  7      -2.00                                                        -2.00     -4.30
  8      -0.37                                                        -0.37     -2.67
  9       0.06                                                         0.06     -2.24
 10       0.20                                                         0.20     -2.10
 11       0.97        -2.24         1.87            1.42              1.42      -0.92
 12       2.08        -1.83         2.28            2.18              2.18      -0.12
 13       1.58        -3.22         0.89            1.24              1.24      -1.06
 14       1.95        -2.80         1.31            1.63              1.63      -0.67
 15       0.84        -3.90         0.21            0.53              0.53      -1.77
 16       1.21        -2.02         2.09            1.65              1.65      -0.65
 17                    0.60         4.71                              4.71       2.41
 18                   -0.50         3.61                              3.61       1.31
 19                    0.26         4.37                              4.37       2.07
 20                    1.18         5.29                              5.29       2.99
 21                    1.56         5.67                              5.67       3.37
 22                    1.56         5.67                              5.67       3.37
 23                    2.78         6.89                              6.89       4.59
 24                    4.51         8.62                              8.62       6.32
 25                    4.06         8.17                              8.17       5.87

Mean      0.00         0.00         4.11                              2.30       0.00

Standard
Deviation 1.68         2.64         2.64                              3.37       3.37

Finally, in order to place the HARD items on the common scale we add 4.11 to HARD Items 17 through 25 and bring these difficulty estimates over to complete Column 6. We then have in Column 6 a new common item scale with the average of two LINK difficulty estimates and the HARD difficulty estimates all connected to the EASY item difficulty estimates.

The mean of this common item scale in Column 6 is 2.30 so we subtract 2.30 from each item difficulty in Column 6 to center the new scale at 0.00 as shown in Column 7.
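The same bookkeeping can be carried out by machine. The following sketch (Python; ours, not the book's, with the calibrations typed in from Table 5.9.1) builds Columns 6 and 7 of Table 5.9.3.

    # Calibrations from Table 5.9.1 (logits).
    d_easy = {3: -3.80, 4: -2.00, 5: -0.37, 6: -0.37, 7: -2.00, 8: -0.37,
              9: 0.06, 10: 0.20, 11: 0.97, 12: 2.08, 13: 1.58, 14: 1.95,
              15: 0.84, 16: 1.21}
    d_hard = {11: -2.24, 12: -1.83, 13: -3.22, 14: -2.80, 15: -3.90, 16: -2.02,
              17: 0.60, 18: -0.50, 19: 0.26, 20: 1.18, 21: 1.56, 22: 1.56,
              23: 2.78, 24: 4.51, 25: 4.06}

    link_items = sorted(set(d_easy) & set(d_hard))
    shift = sum(d_easy[i] - d_hard[i] for i in link_items) / len(link_items)  # 4.11

    # Column 6 of Table 5.9.3: everything referenced to the EASY + LINK form.
    d_common = {}
    for i in sorted(set(d_easy) | set(d_hard)):
        if i in link_items:
            d_common[i] = (d_easy[i] + d_hard[i] + shift) / 2   # average the two estimates
        elif i in d_easy:
            d_common[i] = d_easy[i]                             # EASY items as calibrated
        else:
            d_common[i] = d_hard[i] + shift                     # HARD items shifted up

    # Column 7: center the common item scale at zero.
    mean_c = sum(d_common.values()) / len(d_common)
    d_centered = {i: round(d - mean_c, 2) for i, d in d_common.items()}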

To assess the adequacy of this common item equating we will compare it to the item difficulties we would have gotten had we not attempted linking but used all 101 person responses to all 23 items. The common item difficulties from Table 5.9.3 are given in Column 2 of Table 5.9.4. Column 3 gives the reference calibrations of all 23 items from all 101 persons, and Column 4 shows the differences between the common item difficulties dc and the reference scale item difficulties dR. The plot of these values given in Figure 5.9.2 shows the items close to the expected identity line.

TABLE 5.9.4

COMPARING COMMON ITEM EQUATING WITH THE
REFERENCE SCALE

   1             2                  3              4
Item     Common Item Scale   Reference Scale   Difference
Name         dc - 2.30             dR

  3           -6.10               -6.20            .10
  4           -4.30               -4.11           -.19
  5           -2.67               -2.58           -.09
  6           -2.67               -2.72            .05
  7           -4.30               -4.34            .04
  8           -2.67               -2.58           -.09
  9           -2.24               -2.06           -.18
 10           -2.10               -2.06           -.04
 11           -0.92               -1.03            .11
 12           -0.12               -0.12            .00
 13           -1.06               -0.85           -.21
 14           -0.67               -0.52           -.15
 15           -1.77               -1.52           -.25
 16           -0.65               -0.77            .12
 17            2.41                1.93            .48
 18            1.31                1.36           -.05
 19            2.07                2.01            .06
 20            2.99                2.88            .11
 21            3.37                3.33            .04
 22            3.37                3.33            .04
 23            4.59                4.52            .07
 24            6.32                6.27            .05
 25            5.87                5.81            .06

Mean           0.00                0.00           0.00

Standard
Deviation      3.37                3.32           0.15

FIGURE 5.9.2

COMPARISON OF COMMON ITEM SCALE
WITH THE REFERENCE SCALE

[Plot of the common item scale (horizontal axis) against the reference scale (vertical axis).]

5.10 CRITERION REFERENCING THE KCT VARIABLE

By locating all 23 KCTB items on a single scale we can make the definition of the KCT variable more explicit. These items which now mark out the variable are constructed out of a few basic components: number of taps, number of reverses and overall distance across blocks. It is the way these underlying components evolve along the variable which documents for us what a measure on the KCT variable means. Figure 5.10.1 gives the difficulty level of the KCTB items together with their number of taps, reverses and distances.

FIGURE 5.10.1

DOCUMENTING THE KCT VARIABLE

[Map of the KCTB items along the logit scale, showing in successive rows the items themselves, their numbers of taps, reverses and distances (each with median difficulty levels marked), and the positions of the sample persons by age.]

Row 1 of Figure 5.10.1 contains the items of KCTB arranged by their calibrations on the variable according to the logit scale given at the bottom of the figure. The number of taps for each item is given in Row 2. Items 1 and 2 are two-tap items passed by all 101 persons in the sample. As we move up the variable, the number of taps goes from two to seven. Below Row 2 we have marked the median difficulty level for each number of taps from two to seven. Row 3 shows the number of reverses in each item and their median difficulty levels. Row 4 shows the distances in blocks tapped for each tapping series and their medians.

The pattern of taps, reverses and distances in Figure 5.10.1 shows how the KCT variable is built out of these basic operations. This provides a substantive, or criterion, reference for the KCT variable. The resulting picture gives us insight into the nature of the variable which reaches beneath the individual items. In particular it shows us how to generate more items at any designated difficulty level.

We can also learn about the KCT variable by seeing how the 101 persons in our sample are distributed along it. In Rows 5 and 6 we show each person's position on the variable by their age in years. This allows us to norm reference the variable with age medians from three to eight years and to give an age distribution of "mature" persons of 9 or more years of age with a mean at 1.3 logits and a standard deviation of 1.9 logits. Thus Figure 5.10.1 becomes a map of the variable which is both criterion and norm referenced.

5.11 ITEM CALIBRATION QUALITY CONTROL

We cannot expect the items in a bank to retain their calibrations indefinitely or to work equally well for every person with whom they may be used. The quality of item calibration must be supervised continuously. This can be done conveniently by a routine examination of the differences between how persons actually respond to particular items and how we expect them to respond given our calibrations of the items and our measurements of the persons. These differences are residuals from expectation. An occasional surprising item residual suggests an anomalous testing situation or a peculiar person. Trends in item residuals, however, may be indicative of item failure. Tendencies for items to run into trouble, to shift difficulty or to be biased for some types of persons can be exposed by a cumulative analysis of item residuals over time, place and person type. Problematic items can then be removed from use or brought up-to-date in difficulty.

The purpose of item quality control is to maintain supervision over item calibration stability against the possible influences of age, sex, education or any other factor which might disturb item functioning. A quality control procedure requires that item usage be accompanied by concomitant educational and demographic information so as to provide a basis for analyzing whether these other variables threaten the stability of item calibration and hence disturb the interpretation of test responses. The discussion which follows builds on the analysis of fit developed in Chapter 4.

To implement item quality control we save from each use of an item:

     xvi   the response 0 or 1 of person v to item i,

     bv    the ability estimate of person v derived from their score on whatever "test" of calibrated items they took, and

     (y)   the vector of demographic information which characterizes person v.



When the two pieces of information xvi and bv are combined with the item's bank difficulty di we can form a standardized residual zvi which will retain all the information in this use of item i which bears on the possibility of a disturbance in its functioning.

In general this estimated residual zvi is

     zvi = (xvi - pvi)/[pvi(1 - pvi)]^½

where

     pvi = exp (bv - di)/[1 + exp (bv - di)]

is the estimated probability of success for person v on item i and hence the estimated expected value of xvi given the model.
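A one-line computation suffices to produce such residuals. The sketch below (Python; ours, not part of the original text) evaluates zvi for any scored response and illustrates it with the most surprising response in the KCTB data, an able person (b = 3.6) failing an easy item (d = -2.1).

    import math

    def standardized_residual(x, b, d):
        """Standardized residual zvi for a response x (0 or 1) by a person
        with measure b to an item with bank difficulty d."""
        p = math.exp(b - d) / (1.0 + math.exp(b - d))   # expected value of x
        return (x - p) / math.sqrt(p * (1.0 - p))

    z = standardized_residual(0, 3.6, -2.1)
    print(z, z ** 2)    # about -17.3 and 299, i.e. z**2 is about exp(b - d)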

Since xvi can only take one of two values, 0 for an incorrect response or 1 for a correct response, the possibilities for zvi and its square zvi² are limited to those given in Table 5.11.1. The improbability of any particular response xvi, as a function of its zvi², is 1/(1 + zvi²). In the KCTB example there are 101 persons taking 23 items. These 23 x 101 item-by-person responses imply 2323 occasions for misfit. However, misfit can only show up when the difference between person ability bv and item difficulty di is large enough so that one of the possible values for the response xvi becomes significantly improbable. For this to happen the difference (bv - di) must be at least three logits. As a result there are only about 500 item-by-person occasions where misfit could occur.

TABLE 5.11.1

STANDARDIZED RESPONSE RESIDUALS

                                  Standardized Residuals

Response       As a normal deviate                          As a chi-square
Value          zvi ~ N(0,1)                                 zvi² ~ χ²(1)

"Incorrect"    zvi = -pvi/[pvi(1 - pvi)]^½
     0             = -[pvi/(1 - pvi)]^½
                   = -exp [(bv - di)/2]                     zvi² = exp (bv - di)

"Correct"      zvi = (1 - pvi)/[pvi(1 - pvi)]^½
     1             = [(1 - pvi)/pvi]^½
                   = exp [(di - bv)/2]                      zvi² = exp (di - bv)


Table 5.11.2 gives a summary of the unexpected responses observed in the KCTB data. Column 1 gives the range of absolute difference between person ability and item difficulty. Column 2 expresses this difference as z² = exp (|b - d|) and Column 3 converts z² to the response improbability [1/(1 + z²)] it implies.

TABLE 5.11.2

SUMMARY OF UNEXPECTED RESPONSES ON KCTB
101 PERSONS BY 23 ITEMS

Ability-
Difficulty                Improbability   Possible   Expected   Observed   Item             Person
Difference     z²         1/(1 + z²)       Count      Count      Count     Names            Names

Over 4.6     Over 99      Under .01         226         2          3       4, 7, 9          49M, 68F, 93F

3.9 - 4.6    49 - 99      .02 - .01         133         5          3       8, 10, 12        79M, 83F, 95M

2.9 - 3.8    19 - 49      .05 - .02         184        20          8       3, 3, 5, 6,      12M, 13M, 27F,
                                                                           7, 11, 18, 19    47M, 49M, 82F,
                                                                                            95M, 10F

We have counted the number of item-by-person interactions which could fall within each row of Table 5.11.2 and multiplied this "possible" count by its improbability to estimate the count we might expect if these data fit the model. This was done by multiplying (226) x .01 ≈ 2, (226 + 133) x .02 ≈ (2 + 5) and (226 + 133 + 184) x .05 ≈ (2 + 5 + 20). The actual counts observed in the data are given in Column 6. Thus when (b - d) is over 4.6 logits we expect about two improbable responses and we observe three. When (b - d) is between 3.9 and 4.6 we expect about five improbable responses and again we observe three. Finally when (b - d) is between 2.9 and 3.8 we expect about twenty improbable responses but observe only eight. These data seem to fit the model rather well.
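The expected counts can be checked with a few lines of arithmetic, sketched below in Python (ours; the counts are read from Table 5.11.2).

    # Cumulative expected counts of improbable responses, as in the text:
    # (226) x .01, (226 + 133) x .02 and (226 + 133 + 184) x .05.
    bands = [
        ("over 4.6",  226, 0.01, 3),   # (|b - d| band, possible, improbability, observed)
        ("3.9 - 4.6", 133, 0.02, 3),
        ("2.9 - 3.8", 184, 0.05, 8),
    ]

    possible_so_far = 0
    observed_so_far = 0
    for label, possible, improbability, observed in bands:
        possible_so_far += possible
        observed_so_far += observed
        expected_so_far = possible_so_far * improbability
        print(f"{label}: expect about {expected_so_far:.0f}, observe {observed_so_far}")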

When we scan the 14 most unexpected item and person responses given in Table 5.11.2, we see that they are well dispersed over items and persons. Only Items 3 and 7 and Persons 49M and 95M appear twice and the sexes are equally represented. We must conclude that no clear sign of systematic misfit has been detected in these data.

Nevertheless, in order to use the KCTB example to show the application of item quality control, we will proceed with a further analysis of the six most unexpected responses. These responses of Persons 49M, 68F, 79M, 83F, 93F and 95M to Items 4, 7, 8, 9, 10 and 12 are given in Table 5.11.3. For each of these unexpected incorrect responses, given by able persons on easy items, we have entered the appropriate (bv - di). We have also given for each item its characteristics on the KCT variable, namely its number of taps, reverses and distance and the demographic characteristics of sex, age and grade for each person.

TABLE 5.11.3

THE SIX MOST UNEXPECTED RESPONSES
ON KCTB
(101 Persons By 23 Items)

Person
Ability                 Item Difficulty di                       Person Characteristics
  bv      -4.3     -4.1     -2.6     -2.1     -2.1     -0.1      Name   Sex   Age    Grade

  0.4   0 (4.7)*     1        1        1        1        1        49     M    16+    12+
  1.4      1     0 (5.5)      1        1        1        1        68     F    16+    12+
  1.9      1         1    0 (4.5)      1        1        1        79     M    16+    12+
  2.4      1         1        1        1    0 (4.5)      1        83     F    16+    12+
  3.6      1         1        1    0 (5.7)      1        1        93     F    16+    12+
  4.3      1         1        1        1        1    0 (4.4)      95     M    16+    12+

Item
Characteristics

Name      #7        #4       #8       #9      #10      #12

Taps       4         3        4        4        4        4
Reverses   0         0        1        2        2        2
Distance   3         3        5        6        5        7

* (b - d) = (0.4) - (-4.3) = 4.7

The difficulty characteristics of the items in reverses and distance show the increase we would expect as the items become more difficult. All six items are on the easy end of the variable. The six persons, on the other hand, are all relatively able adults. This suggests that, if a systematic source of misfit has been detected here, it could only be a slight tendency towards carelessness, or lapses of attention, among some older persons working on items rather too easy for them.

Fit analysis matrices, like Table 5.11.3, which bring together the person and item characteristics of the most unexpected responses, are convenient for supervising the quality of item functioning. These matrices identify and suggest corrections for the systematic sources of item failure shown in the data.

The calculations necessary to evaluate unexpected responses can be accomplished in three ways. The first two are UCON by computer and the hand method explained in Chapter 4. The third way is a crude, but quick, method which often suffices in practical work.

This crude method of fit analysis consists of identifying and calculating only the few largest z²'s observed on an item and then adding to them a 1 for each other person taking that item. This assumes that all of the disturbance observed in that item is due to its outstanding residuals and that the rest of the pattern is more or less as expected.

Table 5.11.4 gives an illustration of this method. There we have taken from Table 5.11.3 just the single largest zvi² observed on each of these items in our KCTB data and added to it a 1 for each other person taking that item, in this case 100. This gives zvi² + 100 = X² as the chi-square for that item and X²/100 as the item mean square.
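As a sketch of this crude calculation (Python; ours, not the book's):

    import math

    def crude_fit(b_minus_d, n_others=100):
        """Crude item fit: keep only the largest squared residual on the item,
        z2 = exp(b - d) for a surprising failure by an able person, and count
        each of the n_others remaining responses as contributing 1."""
        z2 = math.exp(b_minus_d)
        chi_square = z2 + n_others
        mean_square = chi_square / n_others
        return z2, chi_square, mean_square

    # Item 9 of Table 5.11.4: one failure with (b - d) = 5.7 and 100 other persons.
    print(crude_fit(5.7))    # roughly (299, 399, 4.0)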

To see whether this crude method can be useful, we will compare it with UCON and the hand method described in Chapter 4, but now applied to these KCTB data. In those procedures we sum all 101 actual z²'s to make our item fit analysis and then divide this sum of squares by its 100 degrees of freedom to get the mean squares shown in Table 5.11.5.

TABLE 5.11.4

CRUDE FIT ANALYSIS
FOR SIX KCTB ITEMS

              Single                        Crude Fit Statistics
Item    Ability minus Difficulty        Chi-Square        Mean Square
Name    Difference           z²        X² = (z² + 100)      X²/100

  9         5.7             299             399               4.0
  4         5.5             245             345               3.5
  7         4.7             110             210               2.1
 10         4.5              90             190               1.9
  8         4.5              90             190               1.9
 12         4.4              81             181               1.8

TABLE 5.11.5

A COMPARISON OF ITEM QUALITY CONTROL METHODS
APPLIED TO KCTB

Item        UCON          Hand Fit        Crude Fit
Name     Mean Square     Mean Square     Mean Square

  9         3.42            3.16             4.0
  4         2.64            2.56             3.5
  7         1.48            1.40             2.1
 10         1.46            1.20             1.9
  8         1.43            1.12             1.9
 12         1.36            0.98             1.8

The UCON and hand fit methods approximate one another rather closely. Although the crude fit mean squares are somewhat larger in magnitude, their order is identical to the other methods and their values are sufficiently close to get a clear idea concerning the relative fit of these six items. Table 5.11.5 suggests that the crude method can be useful for the quick analysis of item functioning.

5.12 NORM REFERENCING THE KCT VARIABLE

While norms are no more fundamental to the calibration of item banks than are distributions of person heights to the ruling of yardsticks, it is usually useful to know various demographic characteristics of a variable defined by an item bank. Some of these demographic characteristics may even have normative implications under particular circumstances. Because of a shift in emphasis, norming a variable in the Rasch approach takes much less data than norming a test. We need only use enough items to estimate the desired "norming" statistics. Once the variable is normed, then all possible scores from all possible tests drawn from the calibrated bank are automatically norm-referenced through the variable.

Often we are satisfied with a mean and standard deviation for each cell in our normative sampling plan. These two statistics could be estimated from a random sample of 100 or so persons taking a norming test of only two items. Of course, a somewhat longer test of 10 or 15 items will do a better job. Not only will the estimates be better but the extra items will yield standard errors around the norming statistics and thus a test of fit for the plausibility of the data. More than 15 items in a norming test, however, will seldom be necessary. This means that we could norm six different variables simultaneously by allocating 15 items to each of six subtests administered as one 90-item composite test.

We can estimate quick norms from frequency data on bank calibrated items without scoring or measuring the individual persons. This may be useful when trimming sample data is undesirable. If we seek a probability sample from a population, for example, we would rather not distort the sample's status by eliminating some of the persons sampled because they earned zero or perfect scores.

This norming procedure can be accomplished by working directly from the model and the observed number of right answers to each calibrated item.

1. For each sampling cell in the norming study, select from the item bank a suitable set of K calibrated items sufficiently spaced in difficulty di to cover the expected ability dispersion of that particular sampling cell. Note that each sampling cell, in principle, has its own individually tailored norming test.

2. Administer this test of K items to a random sample of N persons from the specified cell.

3. Observe the number of persons si succeeding on each item.

4. Calculate the natural log odds hi of these correct answers si for each item

     hi = ln [si/(N - si)]     i = 1, K                          [5.12.1]

5. Regress these log odds hi on the associated item difficulties di over the K items to obtain the intercept A and slope C of the least squares straight line.

6. Estimate the population mean M and standard deviation SD of that cell's abilities as

     M = -A/C                                                    [5.12.2]

     SD = 1.7 [(1 - C²)/C²]^½                                    [5.12.3]
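The whole procedure amounts to fitting one least squares line. A computational sketch (Python; ours, not the book's, using the KCTB counts and difficulties of Table 5.12.1 below) reproduces the intercept, slope, mean and standard deviation reported in the next few paragraphs.

    import math

    # Counts of correct answers and bank difficulties for KCTB Items 3-25 (Table 5.12.1).
    s = [98, 91, 82, 83, 92, 82, 78, 78, 68, 57, 66, 62, 73, 65,
         30, 37, 29, 20, 16, 16, 8, 2, 3]
    d = [-6.20, -4.11, -2.58, -2.76, -4.34, -2.58, -2.06, -2.06, -1.03, -0.12,
         -0.85, -0.52, -1.51, -0.77, 1.93, 1.36, 2.01, 2.88, 3.33, 3.33,
         4.52, 6.27, 5.81]
    N = 101
    K = len(d)

    # Step 4: natural log odds of success on each item.
    h = [math.log(si / (N - si)) for si in s]

    # Step 5: least squares regression of h on d gives intercept A and slope C.
    d_bar = sum(d) / K
    h_bar = sum(h) / K
    C = (sum(hi * di for hi, di in zip(h, d)) - K * h_bar * d_bar) / \
        (sum(di * di for di in d) - K * d_bar * d_bar)
    A = h_bar - C * d_bar

    # Step 6: convert intercept and slope into the cell mean and standard deviation.
    M = -A / C                                      # Equation 5.12.2
    SD = 1.7 * math.sqrt((1.0 - C * C) / (C * C))   # Equation 5.12.3
    print(A, C, M, SD)    # about 0.07, -0.56, 0.13 and 2.5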

We will apply these procedures to the KCTB sample of 101 persons to see how well they recover the sample mean and standard deviation that we have already estimated from the measurements of each of the 101 persons to be M' = 0.19 and SD' = 2.44.

We have the item difficulties di for Items 3 through 25 and so need only to compute the natural log odds hi of observed correct answers to each of these items. The values of hi are given in Column 3 of Table 5.12.1 with the corresponding item difficulties in Column 4.

Regressing these log odds right answers on the item difficulties over the 23 items gives us an intercept of A = 0.07 and a slope of C = -0.56.

Equation 5.12.2 estimates the sample mean as

     M = -A/C

       = -0.07/-0.56

       = 0.13.

TABLE 5.12.1

KCTB LOG ODDS CORRECT ANSWERS
AND ITEM DIFFICULTIES

   1          2                 3                  4
           Persons           Log Odds             Item
Item      Succeeding          Correct           Difficulty
Name          si        hi = ln[si/(N - si)]        di

  3           98               3.49               -6.20
  4           91               2.21               -4.11
  5           82               1.46               -2.58
  6           83               1.53               -2.76
  7           92               2.32               -4.34
  8           82               1.46               -2.58
  9           78               1.22               -2.06
 10           78               1.22               -2.06
 11           68               0.72               -1.03
 12           57               0.26               -0.12
 13           66               0.63               -0.85
 14           62               0.46               -0.52
 15           73               0.96               -1.51
 16           65               0.59               -0.77
 17           30              -0.86                1.93
 18           37              -0.55                1.36
 19           29              -0.91                2.01
 20           20              -1.40                2.88
 21           16              -1.67                3.33
 22           16              -1.67                3.33
 23            8              -2.45                4.52
 24            2              -3.90                6.27
 25            3              -3.49                5.81

N = 101                                   Mean     0.00

                                      Standard
                                      Deviation    3.32

Equation 5.12.3 estimates the sample standard deviation as

     SD = 1.7 [(1 - C²)/C²]^½

        = 1.7 [(1 - 0.31)/0.31]^½

        = 2.54.

These quick norm regression estimates of 0.13 for the mean and 2.54 for the standard deviation compare satisfactorily with the values of 0.19 and 2.44 computed by measuring each of the 101 persons and then calculating their mean and standard deviation in the usual way.

The plot of the log odds correct answers hi against the item difficulties di in Figure 5.12.1 shows how well these norming data fit the straight line expected by the model.

FIGURE 5.12.1

THE QUICK NORMING METHOD OF ESTIMATING
SAMPLE MEANS AND STANDARD DEVIATIONS
APPLIED TO KCTB

[Plot of log odds correct answers (vertical axis, +5 to -5) against item difficulty (horizontal axis, -6 to +6), with the fitted line of intercept 0.07 and slope -0.56.]
6 DESIGNING TESTS

6.1 INTRODUCTION

In Chapter 5 we have shown how to establish the operational definition of a variable by means of a calibrated bank of items. The next step is to find out how to use these calibrated items to make measures. To do this we must consider two related questions. First, we need to find out how to make the best possible selection of calibrated items from our bank in order to make any particular measurements we have in mind most effective. Second, given such a selection of items and an observed pattern of responses to them, we need to find out how to evaluate the quality of this observation and, if it is valid, how to extract from it the measure we seek, together with its standard error. It is the first question, best test design, which is the major topic of this chapter. Chapter 7 deals with making measures.

6.2 THE MEASUREMENT TARGET

When we plan a measurement, there must be a target person or group of persons about whom we want to know more than we already know. If we care about the quality of our proposed measurements, then we will want to construct our measuring instrument with the specifics of this target in mind. In order to do this systematically we must begin by setting out as clearly as we can what we expect of our target. Where do we suppose it is located on the variable? How uncertain are we of that approximate location? What is the lowest ability we imagine the target could have? What is the highest? How are other possible values distributed in between?

Sometimes we have explicit prior knowledge about our target. We, or others, have measured it before and so we can suggest its probable location and dispersion directly in terms of these prior measures on the variable and their standard errors. Sometimes we can use items calibrated along the variable, some of which we believe are probably just right for the target, some of which are nearly too hard and some of which are nearly too easy. Then we can take from the difficulties of these reference items rough indications of the probable center and boundaries of our target.

One way or another we assemble and clarify our suppositions about our target as well as we can so that we can derive from them the test design which has the best chance of most increasing our knowledge.

Obviously if we know everything we want to know about our target, then we would not have to measure it in the first place. However, no matter how little we know, we always have some idea of where our target is. Being as clear as possible about that prior knowledge is essential for the design of the best possible test.

Graham A. Douglas collaborated in the preparation of parts of this chapter. See Wright and Douglas, 1975a.


A target specification is a statement about where on the variable we suppose the target to be. We express our best guess by specifying the target's supposed center, its supposed dispersion and perhaps its supposed shape or distribution. If we let

     M = our best guess as to target location,

     S = our best guess as to target dispersion,

     D = our best guess as to target distribution,

then we can describe a target G by the expression G(M,S,D) and we can summarize our prior knowledge, and hence our measurement requirements for any target we wish to measure, by guessing, as well as we can, values for the three target parameters M, S and D.

A picture of a target is given in Figure 6.2.1. Guessing the supposed location M of a target is perfectly straightforward. However, guessing the dispersion S and the distribution D forces us to think through the difference between a target which is a single person and one which is a group. For the single person, S can describe the extent of our uncertainty about where that person is located. The larger our uncertainty, the larger S.

If we can specify boundaries within which we feel fairly sure that the person will be found, we can set S so that M±kS defines these boundaries. Then, even if we have no clear idea at all about the distribution D of our uncertainty between these boundaries, we can nevertheless expect that at least (1 - 1/k²) of the possible measures will fall within M±kS.

If we go further and expect that the measures we think possible for the person will pile up near M, then we may even be willing to take a normal distribution as a useful way to describe the shape of our uncertainty. In that case we can expect .95 of the possible measures to fall within M±2S and virtually all of them to fall within M±3S.

We will refer to these two target distributions as the Tchebycheff interval and the normal. We might consider other target distributions, but these two seem to cover all reasonable target shapes rather well. For example, if we feel unhappy about thinking of our target as approximately normal, then it is unlikely that we will have any definite alternative clearly in mind. Thus, the most likely alternative to a normal target is one of unknown distribution, best captured by a Tchebycheff interval. This realization that all possible target shapes can be satisfactorily represented by just two reasonable alternatives is important because it makes a unique solution to the problem of best test design not only possible but even practical.

If the target is a group rather than an individual, then we may take S and D to be our best guess as to the standard deviation and distribution of that group. If we think the group has a more or less normal distribution, then we will take that as our best guess for D. Otherwise we can always fall back on the Tchebycheff interval.

Finally, we must be explicit about how precise we want our measurement to be. After all, this is our motive for measuring. It is just because our present knowledge about our target is too approximate to suit us that we want to know more precisely where our target is and, if it is a group rather than an individual, more precisely about its dispersion. However, whether the target is an individual or a group, our decision about the desired standard error of measurement SEM will be made in terms of individuals, for that, in the end, is what we actually measure.

In the case of a one-person target, we want the SEM to be enough smaller than S to reward our measurement efforts with a useful increase in the precision of our knowledge about where that target person is located. In the case of a group target we want to achieve an improved estimate not only of M, the center of the group, but also of S, its dispersion. The observable variance of measures over the group estimates not only the underlying variance in ability S² but also the measurement error variance SEM². Our ability to see the dispersion of our target against the background of measurement error depends on our ability to distinguish between these two components of variance. Since they enter into the observable variance of estimated measures equally, the smaller SEM² is with respect to S², the more clearly we can identify and estimate S², the component due to target dispersion. Thus, for all targets we seek an SEM considerably smaller than S.

FIGURE 6.2.1

THE PICTURE OF A TARGET

[Target distribution drawn as relative frequency p over the variable, spanning about 4S around its center M.]

SHAPE D        M±2S      M±3S

Interval       .75+      .89+

Normal         .95       .99

6.3 THE MEASURING TEST

A test is a set of suitably calibrated items chosen to go together to form a measuring instrument. The complete specification of a test is the set of all parameters which characterize these items. But when we examine a picture of how a test works to transform observed scores into estimated measures, we see that the operating curve is rather simple and lends itself to specification through just a few test parameters. When the way our items operate fits the Rasch model, then we know that the only item parameters which we need to consider in order to determine the operating characteristics of a test are its item difficulties. When we impose a reasonable fixed distribution on these difficulties, then no matter how many items we use, we can reduce the number of test parameters to only three.

FIGURE 6.3.1

THE OPERATION OF A TEST

[Test operating curve relating relative score f = r/L (vertical axis) to ability measure b (horizontal axis), showing the least observable difference LOD = 1/L and the least measurable difference LMD = (db/df) LOD = Cf/L.]

In Figure 6.3.1 we can see from the shape of the test operating curve that its two outstanding features are its position along the variable, which we will call test height, and the range of abilities over which the test can measure more or less accurately, a characteristic caused primarily by the dispersion of the item difficulties, which we will call test width.

But height and width do not complete the characterization of a test. When we look more closely at the way the test curve transforms observed scores into inferred measures we see that there is a discontinuity in observable scores which is going to determine the smallest increment in ability we can measure with any particular test. This least measurable difference LMD depends on the test's least observable difference LOD. Since the least change possible in a test score is one, the LOD in relative score f = r/L, must be 1/L. In Section 6.5 we will find that the standard error of measurement, or least believable difference, SEM also depends on the number of items in the test. Indeed SEM = LMD^½. So in order to finish characterizing a test we must also specify its length.

From this we see that any test design can be defined more or less completely just by specifying the three test characteristics: height, width and length. If we let

     H = the height of the test on the variable, that is, the average difficulty of its selected items,

     W = the width of the test in item difficulties, that is, the range of its item difficulties, and

     L = the length of the test in number of items,

then we can describe a test design T by the convenient expression T(H,W,L).

In the practical application of best test design, however, we will have to approximate our best design T for a target G from a finite pool of existing items. In order to discriminate in our thinking between the best test design T(H,W,L) and its approximate realization in practice, we will describe an actual test as t(h,w,L) where

     h = the average difficulty of its actual items, and

     w = an estimate of their actual difficulty range.

6.4 THE SHAPE OF A BEST TEST

A best test is one which measures best in the region within which measurements are expected to occur.* Measuring best means measuring most precisely. A best test design T(H,W,L) is one with the smallest error of measurement SEM over the target G(M,S,D) for given length L (or, what is equivalent, with the smallest L for a given value of SEM). "Over the target" implies the minimization of a distribution of possible SEMs. Thus, a position with respect to the most likely target distribution must be taken before the minimization of SEM can proceed.

*Attempts to meet this requirement have been made by Birnbaum (1968, pp. 465-471). Our ideas are consistent with his efforts, but we have taken them to their logical and practical conclusion.

We bring the profusion of possible target shapes under control by focusing on the two extremes, interval and normal. How shall minimization be specified in each case? For a normal target it seems reasonable to maximize average precision, that is, to minimize average SEM, over the whole target.

To decide what to do for an interval target, we need to know how the SEM varies over possible test scores. When we derive an exact form for the precision of measurement, we find that for ordinary tests with less than three logits between adjacent items, precision is a maximum for measurements made at the center of the test and decreases as test and target are increasingly off-center with respect to one another. For tests centered on their targets this means that maximizing precision at the boundaries of an interval target is a good way to maximize precision over the target interval. So for interval targets we will maximize precision at the target boundaries.

When we derive the SEM² from our response model we will discover that it is the reciprocal of the information about ability supplied by each item response averaged over the test. Since the most informative items are those nearest the ability being measured and the least informative are those farthest away, the precision over the target will depend not only on the distribution of the target but also on the shape of the test. Thus, the question of what is a best test also depends on our taking a position with respect to the best distribution of test item difficulties.

What are the reasonable possibilities? If we want to measure a normal target, then a test made up of normally distributed item difficulties ought to produce the best maximization of precision over the target. This is the conclusion implied in Birnbaum's analysis of information maximization (Birnbaum, 1968, p. 467).

However, normal tests are clumsy to compose. Normal order statistics can be used to define a set of item difficulties, but this is tedious. More problematic is the odd conception of measuring implied by an instrument composed of normally distributed measuring elements. A normal test would be like a yardstick with rulings bunched in the middle and spread at the ends. Measuring with such an irregularly ruled yardstick would be awkward. In the long run, even for normal targets, our interest becomes spread out evenly over all the abilities which might be measured by a test. Equally spaced items are the test shape which serves that interest best. That is the way we construct yardsticks. The test design corresponding to an evenly ruled yardstick is the uniform test in which items are evenly spaced from easiest to hardest (Birnbaum, 1968, p. 466).

Two target distributions, normal and interval, and two test shapes, normal and uniform, produce four possible combinations of target and test. Wright and Douglas (1975a) investigated all four combinations rather extensively and found the normal test to work best on the normal target and the uniform test to work best on the interval target. When they compared the normal and uniform tests on normal targets, however, these two test shapes differed so little in their measuring precision as to appear equivalent for all practical purposes. Thus the best all purpose test shape is the uniform test.

6.5 THE PRECISION OF A BEST TEST

Now we turn to the response model formulation of the standard error of measurement SEM so that we can become explicit about which test designs maximize precision by minimizing SEM. We must find out how the test design T(H,W,L) influences SEM and how we can vary the test characteristics of H, W and L in response to a target specification G(M,S,D) in order to minimize SEM over that target.

The response model specifies

     pfi = exp (bf - di)/[1 + exp (bf - di)]                     [6.5.1]

where

     pfi = the probability of a correct response at f and i,

     bf = the ability estimate at relative score f = r/L,

     di = the calibrated difficulty of item i.

The measure bf is estimated from a test of length L with items {di} for i = 1, L through the equation (for details see Sections 1.5 and 3.7)

     f = Σi pfi /L ,     for f = 1/L, ... , (L - 1)/L            [6.5.2]

with asymptotic variance

     1/[Σi pfi (1 - pfi)] = SEMf²                                [6.5.3]

This is the square of the standard error of measurement at relative score f.

We see that SEMf depends on the sum of pfi (1 - pfi) over i. Thus it is a function of bf and all the di. However, fluctuations in p(1 - p) are rather mild for p between 0.2 and 0.8. To expedite insight into the make-up of SEMf we can reformulate it so that the average value of pfi (1 - pfi) over i is one component and test length L is the other.

     SEMf = {1/[Σi pfi (1 - pfi)/L]}^½ (1/L)^½ = (Cf /L)^½       [6.5.4]

in which

     Cf = [Σi pfi (1 - pfi)/L]⁻¹

In this expression we factor test length L out of SEM in order to find a length-free error coefficient Cf.

Resuming our study of the operating curve of a test given in Figure 6.3.1 we see that the least measurable difference in ability LMD is (db/df) LOD. Since the least observable increment in relative score is 1/L, all we need to complete the formulation of the LMD is the derivative of b with respect to f which from Equations 6.5.1 and 6.5.2 is

     db/df = [Σi pfi (1 - pfi)/L]⁻¹                              [6.5.5]

But this is our error coefficient Cf, thus the least measurable difference at relative score f is

     LMDf = Cf /L                                                [6.5.6]

and

     SEMf = LMDf^½ = (Cf /L)^½

With SEMf in this form we note that, as far as test shape is concerned, it is Cf which requires minimization. This will be true whether we use Cmin to minimize SEMf given L or to minimize L given SEMf.
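These quantities are easy to compute directly from a set of item difficulties. The sketch below (Python; ours, not part of the original text) evaluates Cf and SEMf at an ability b for any test, here a 25 item uniform test of width 5 logits centered on the ability being measured.

    import math

    def error_coefficient_and_sem(b, difficulties):
        """Cf and SEMf at ability b for a test with the given item difficulties
        (Equations 6.5.3 and 6.5.4)."""
        L = len(difficulties)
        info = []
        for d in difficulties:
            p = math.exp(b - d) / (1.0 + math.exp(b - d))
            info.append(p * (1.0 - p))              # information in one response
        C = 1.0 / (sum(info) / L)                   # reciprocal of average information
        sem = math.sqrt(C / L)                      # SEMf = (Cf/L)^1/2
        return C, sem

    # A 25 item uniform test of width 5 logits centered at zero:
    test = [-2.5 + 5.0 * (i + 0.5) / 25 for i in range(25)]
    print(error_coefficient_and_sem(0.0, test))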

6.6 THE ERROR COEFFICIENT

Now we need to know more about this error coefficient Cf. The essential ingredient of Cf is the expression pfi (1 - pfi). This is the information Ifi on bf contained in a response to item i with difficulty di (Birnbaum, 1968, p. 460-68). Its average value

     Īf. = Σi Ifi /L = Cf⁻¹

over the items on a test is the average information about bf per item provided by that test. Thus Cf is the reciprocal of average test information. The greater the information obtained by a test the smaller Cf and hence the smaller SEMf and so the greater the precision.

What values can we expect Cf to take? We can approach this question in two ways: in terms of the influence of reasonable values of (bf - di) on pfi and, for uniform tests, in terms of test width W and the boundary probabilities pf1 for i = 1, the easiest item, and pfL for i = L, the hardest item. The probability pfi is defined in Equation 6.5.1.

Beginning with reasonable values of (bf - di), we see that when bf = di and their difference is zero, then pfi = 1/2, pfi (1 - pfi) = 1/4 and Cf = 4, but when (bf - di) = -2 then pfi = 1/8, pfi (1 - pfi) = 1/9 and Cf = 9. (Notice that Cf = 9 when (bf - di) = +2 and pfi = 7/8 also). Since an average can never be greater than its maximum element nor less than its minimum, we can use these figures as bounds for Cf.

     When -2 < (bf - di) < +2

     then 1/8 < pfi < 7/8

     and 4 < Cf < 9.                                             [6.6.1]

Turning to the bounds we can derive for Cf from the test width W and the boundary probabilities pf1 and pfL of a uniform test, we can use an expression for Cf given W derived in Wright and Douglas, 1975a (also Birnbaum, 1968, p. 466).

     CfW = W/(pf1 - pfL)

where

     W = the item difficulty width of a uniform test,

     pf1 = the probability of a correct response by bf to the easiest item on the test, and

     pfL = the probability of a correct response by bf to the hardest item on the test.

When bf is contained within the difficulty boundaries of the test, and W is greater than 4, then 1/2 < (pf1 - pfL) < 1 and CfW must fall between W and 2W, that is

     When d1 < bf < dL

     and W > 4

     then W < CfW < 2W.                                          [6.6.2]

It follows from these considerations that SEM = (C/L)^½ is bounded by

     2/L^½ < SEM < 3/L^½

for any test on which

     -2 < (bf - di) < +2,

and by

     (W/L)^½ < SEM < (2W/L)^½

for uniform tests when

     W > 4 and d1 < bf < dL.

6.7 THE DESIGN OF A BEST TEST

Best test design depends on relating the characteristics of test design T(H,W,L) to the characteristics of target G(M,S,D) so that the SEM is minimized in the region of the variable where the measurements are expected to take place. The relationship between test and target visible in Figure 6.7.1 makes the general principles of best test design obvious. To match test to target we aim the height of the test at the center of the target, widen the test sufficiently to cover target dispersion and lengthen the test until it provides the precision we require.

For best test design on either interval or normal targets we select a set of equivalent items (where W = 0) or a set of uniform items with the W indicated in Table 6.7.1. Table 6.7.1 gives optimal uniform test widths for normal and interval targets. For example, if the target is thought to be approximately normal with presumed standard deviation S = 1.5, the optimum test width W is 4. If, however, the target is more uniform in shape then the optimum width could be as large as 8. Note that for any value of S a smaller W is always indicated when a normal "bunched up" target shape is expected.

Table 6.7.1 also shows the efficiency of a simple rule for relating test width W to target dispersion S. The rule W = 4S comes close to the optimum W for narrow interval targets and for wide normal targets. When we are vague about where our target is we are also vague about its boundaries. That is just the situation where we would be willing to use a normal distribution as the shape of our target uncertainty. When our target is narrow, however, that is the time when we are rather sure of our target boundaries but, perhaps, not so willing to specify our expectations as to its precise distribution within these narrow boundaries. To the extent that interval shapes are natural for narrow targets while normal shapes are inevitable for wide targets, W = 4S is a useful simple rule.

The efficiency of this simple rule for normal and interval targets is given in the final columns of Table 6.7.1. There we see that its efficiency is hardly ever less than 90 per cent. If we cross over from an interval target to a normal target as our expected target dispersion exceeds 1.4, then the efficiency is never less than 95 per cent. This means, for example, that a simple rule test of 20 items is never less precise than an optimum test of 19 items.

Our investigations have shown that given a target M, S and D there exists an optimum test design H and W from which we may generate a unique set of L uniformly distributed item parameters {δi}. However, this design is an idealization and cannot be perfected in practice. Real item banks are finite and each item difficulty is only an estimate of its corresponding parameter and hence inevitably subject to calibration error. We will never be able to select the exact items stipulated by the best test design {δi}. Instead we must attempt to select among the items available, a real set of {di} which comes as close as possible to our ideal design {δi}.

Thus parallel to the design specification T(H,W,L) we must write the test description t(h,w,L) characterizing the actual test {di} which we can construct in practice. This raises the problem of estimating h and w.

The estimated test height h can be determined by the average estimated difficulties of the test items

     h = Σi di /L = d.                                           [6.7.1]

FIGURE 6.7.1

DISTRIBUTION OF A TARGET AND OPERATION OF A TEST

[Target distribution (relative frequency p, spanning about 4S) drawn above the test operating curve relating relative score f = r/L to ability measure b.]

BEST TEST DESIGN

     H = M
     W = 4S
     L = Cf /SEM²
     LMD = Cf /L
     4 < Cf < 9

The estimated test width w can be determined from the range of these estimated difficulties, or perhaps a bit more precisely from an estimate of this range based on the two easiest items d1 and d2, and the two hardest, dL-1 and dL.

     w = [(dL + dL-1 - d2 - d1)/2] [L/(L - 2)]                   [6.7.2]

TABLE 6.7.1

OPTIMUM VALUES OF W FOR BEST UNIFORM TESTS
ON NORMAL AND INTERVAL TARGETS

TARGET                                               SIMPLE
STD. DEV.    NORMAL TARGET      INTERVAL TARGET       RULE*      EFFICIENCY**
             error minimized    error minimized
    S        over N(M,S²)       at (M ± 2S)           W = 4S     Normal   Interval

   .5              0                  0                  2          94        97
   .6              0                  0                  2
   .7              0                  2                  3          90       100
   .8              0                  3                  3
   .9              0                  4                  4
  1.0              0                  5                  4          89        98
  1.1              0                  6                  4
  1.2              1                  6                  5          92        96
  1.3              2                  7                  5
  1.4              3                  7                  6
  1.5              4                  8                  6          96        91
  1.6              5                  9                  6
  1.8              6                 10                  7          98        87
  2.0              8                 11                  8          99        84

*This Simple Rule is conservative for narrow targets and more practical since available items are bound to spread some. It is also close to the normal target optimum for wide targets, which is reasonable in the face of substantial target uncertainty.

**Efficiency = CW /C4S = LW /L4S

where CW = minimum error coefficient for optimum W,
      C4S = error coefficient for W = 4S,
      LW = length of optimum test of width W,
      L4S = length of equally precise test of width 4S.

6.8 THE COMPLETE RULES FOR BEST TEST DESIGN

We are now in a position to give explicit, objective and systematic rules for the design and use of a best possible test. To design test T(H,W,L) for target G(M,S,D):

1. From our hypothesis about M we derive H = M.

2. From our hypothesis about S we derive an optimum W either by consulting Table 6.7.1 or by using the simple rule W = 4S.

3. From our requirements for the measurement precision SEM we seek, we derive L = C/SEM². A fairly accurate value for C can be found in Table 6.8.1 which gives, at various expected relative scores f, the value of C minimized by the W chosen in Step 2. Alternatively C can be approximated by one of the simple rules C = 6S or C = 6 from Equations 6.6.1 or 6.6.2.

4. From these H, W and L we generate the design set of item δi's according to the formula

     δi = H - (W/2)[(L - 2i + 1)/L]     for i = 1, L

Then for test t(h,w,L) from design T(H,W,L)

5. We select items di from our item bank such that they best approximate the set {δi} by minimizing the discrepancy (di - δi).

6. We calculate h = Σi di /L = d.

   and w = [(dL + dL-1 - d2 - d1)/2] [L/(L - 2)]

7. We administer the set of {di} as the test t(h,w,L).
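The following sketch (Python; ours, with a hypothetical target and a hypothetical bank for illustration) runs through Steps 1 to 6 for a target G(M = 0.5, S = 1.0) using the simple rules H = M, W = 4S and C = 6, so that SEM = 0.5 gives L = 24.

    def design_items(H, W, L):
        """Step 4: ideal uniformly spaced difficulties for a test of
        height H, width W and length L."""
        return [H - (W / 2.0) * (L - 2 * i + 1) / L for i in range(1, L + 1)]

    def select_items(deltas, bank):
        """Step 5: for each design difficulty pick the closest available bank
        item (bank maps item name to calibrated difficulty); each item used once."""
        available = dict(bank)
        chosen = {}
        for delta in deltas:
            name = min(available, key=lambda k: abs(available[k] - delta))
            chosen[name] = available.pop(name)
        return chosen

    def describe(difficulties):
        """Step 6: height h and width w of the test actually obtained."""
        d = sorted(difficulties)
        L = len(d)
        h = sum(d) / L
        w = ((d[-1] + d[-2] - d[1] - d[0]) / 2.0) * (L / (L - 2.0))
        return h, w

    # Steps 1-3: H = M = 0.5, W = 4S = 4.0, L = C/SEM^2 = 6/(0.5**2) = 24.
    deltas = design_items(H=0.5, W=4.0, L=24)
    print(describe(deltas))            # (0.5, 4.0): the design recovers its own H and W

    # A hypothetical bank of 25 calibrated items spaced from -3 to +3 logits:
    bank = {f"item{k}": -3.0 + 0.25 * k for k in range(25)}
    chosen = select_items(deltas, bank)
    print(describe(list(chosen.values())))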

TABLE 6.8.1

ERROR COEFFICIENT CfW FOR SELECTED TEST WIDTH W AND
EXPECTED RELATIVE SCORE f FOR UNIFORM TESTS

Expected                      Test Width W
Relative
Score f        0       2       4       6       8      10

  .10        10.9    11.6    13.0    13.7    15.2    16.0

  .20         6.3     6.8     7.3     8.4    10.2    11.6

  .30         4.8     5.3     5.8     7.3     9.0    10.2

  .40         4.0     4.4     5.3     6.8     8.4    10.2

  .50         4.0     4.4     5.3     6.8     8.4    10.2

  .60         4.0     4.4     5.3     6.8     8.4    10.2

  .70         4.8     5.3     5.8     7.3     9.0    10.2

  .80         6.3     6.8     7.3     8.4    10.2    11.6

  .90        10.9    11.6    13.0    13.7    15.2    16.0

CfW = W[1 - exp (-W)] / {[1 - exp (-fW)] [1 - exp (-(1 - f)W)]}
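The formula under the table can also be evaluated directly, which is convenient when a width not tabled is wanted. A sketch (Python; ours):

    import math

    def error_coefficient(f, W):
        """Error coefficient CfW of a uniform test of width W at expected
        relative score f (formula under Table 6.8.1); for W = 0 the limiting
        value 1/[f(1 - f)] is used."""
        if W == 0:
            return 1.0 / (f * (1.0 - f))
        return W * (1.0 - math.exp(-W)) / (
            (1.0 - math.exp(-f * W)) * (1.0 - math.exp(-(1.0 - f) * W)))

    # Approximately the tabled entries, e.g. f = .50, W = 4 gives about 5.3,
    # and the result fixes test length through L = C/SEM^2.
    print(error_coefficient(0.50, 4.0))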


7 MAKING MEASURES

7.1 USING A VARIABLE TO MAKE MEASURES

This chapter is about turning test scores into measures. But before we show how to do this in Sections 7.2 and 7.3, we will review how the test items defining a variable can be used to make measures.

To make a measure we collect and combine a series of observed responses in such a way that they support an inference as to the position of the person on a variable. We summarize these observations into a score based on them and this score is used to imply the measure of the person on the variable. The variable itself, however, is an idea and not a direct experience. Its nature can only be inferred from relevant samples of carefully selected observations.

The purpose of a variable is to provide a basis for comparing persons and generalizing about their relative status. This purpose requires the achievement of objectivity in the variable's definition and in the way measures on it are made. The idea of the variable transcends any particular set of observations and the measure on the variable must transcend the observed responses on which it is based. Making measures with tests requires objectively calibrated test items which provoke the observed item responses and then through their calibrations carry these responses onto the scale of the variable. It is these items that operationally define the variable and bring meaning to the measurement of the person.

Different ways of getting a particular score on a test do not generally arouse different opinions of the abilities of persons taking the test. When two persons earn the same score, we seldom put one person ahead of the other because they answered particular items successfully. This is because we think of each score as resulting from the same exposure to the same items giving each person's ability the same opportunity to express itself. But whenever we are willing to take identical scores to have equivalent meaning and do not care which items are actually answered correctly we are practicing "item-free" measurement. This widespread practice of item-free measurement within a test implies, without further assumption, test-free measurement within a bank of calibrated items.

A calibrated item bank provides a resource from which subsets of items can be selected to form specifically designed tests with optimal characteristics. Scores on these tests, although stemming from different combinations of "correct" responses to different selections of items, can nevertheless be converted through the bank calibrations into comparable measures. Procedures for obtaining comparable measures for individualized tests are given in Sections 7.4 to 7.7.

To validate these measures, however, we must assess the extent to which the persons in question have taken the items in the way we intended them to be taken. The item calibrations in the bank come from occasions on which many persons were found to respond to these items in a particular consistent way. This is the context in which the item calibrations gained their meaning. The meaning these calibrations now convey depends on how the new persons being measured are found to respond to the items. The validity of their measures depends on the presence of acceptable relations between what we actually observe and what we expect to observe according to our measurement model and our item calibrations. Thus before we can accept any measure as valid, we must examine the plausibility of the pattern of responses on which that measure is based. The procedure for accomplishing the analysis of person fit necessary to establish measure validity is given in Sections 7.8 and 7.9. In these sections we show how to detect person misfit and what various kinds of misfit look like.

Whenever misfit is identified, the next step is to deal with the measurement quality control problem this misfit causes. If we can identify the circumstances leading to the misfit, we may be able to extract from the flawed response record a measure which the observed pattern of responses can sustain. We show how to do this in Section 7.10.

7.2 CONVERTING SCORES TO MEASURES BY UCON, PROX AND UFORM

When a person takes a test, the resulting observation of the person is their test score. To see how to get from this test score r to the estimated measure b which it implies we refer to the measurement model,

     P{xi = 1} = πi = exp (β - δi)/[1 + exp (β - δi)]            [7.2.1]

which specifies how item calibration δi and person measure β are implied by the person's observed response xi. The model implies that for each response of a person to an item we "expect" an intermediate "probable" value which is neither xi = 1 for a correct response nor xi = 0 for an incorrect response, but somewhere in between them. This "expected" value is the probability πi given in Equation 7.2.1 that xi = 1, and it works just like our expectation that fair coins fall half the time heads. Since we "expect" a value on each coin toss which is half the time heads and half the time tails, even though what happens can only be one or the other, our expected value for a particular toss is neither 0 nor 1, but half way between at π = ½.

Thus the expected value of response xi is

     E{xi} = πi

the model probability of a correct answer to item i.

Since the test score r = Σi xi is the sum of the item responses, the expected value of r is the sum of their expectations,

     E{r} = Σi E{xi} = Σi πi

If we now substitute in πi the measure br to be estimated for β on the basis of score r and the estimated calibrations {di} for {δi}, we have an estimation equation which relates r and br as follows

     r = Σi exp (br - di)/[1 + exp (br - di)]                    [7.2.2]

From this equation, a person's score r and the calibrations {di} of the items taken, we can determine the measure br which they imply.

One way to solve Equation 7.2.2 is to use the UCON procedure described in Chapter 3. The UCON estimated measure is obtained by performing j = 1, m iterations of

     br^(j+1) = br^j + (r - Σi pri^j)/[Σi pri^j (1 - pri^j)]     [7.2.3]

in which

     pri^j = exp (br^j - di)/[1 + exp (br^j - di)]               [7.2.4]

and the first value of br^j is

     br^0 = ln [r/(L - r)].

This UCON procedure requires 3 or 4 iterations and a convergence criterion for successive values of br^j such as

     |br^(j+1) - br^j| < .01 logits

When the convergence criterion is reached, then the estimated measure is the last value of br, namely

     br = br^(j+1)                                               [7.2.5]

with standard error

     sr = [Σi pri^j (1 - pri^j)]^-½                              [7.2.6]
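A compact rendering of this iteration (Python; ours, not the book's program) is:

    import math

    def ucon_measure(r, difficulties, tol=0.01, max_iter=20):
        """Measure and standard error implied by score r on items with the
        given calibrated difficulties, by iterating Equation 7.2.3."""
        L = len(difficulties)
        b = math.log(r / (L - r))                               # b_r^0
        for _ in range(max_iter):
            p = [math.exp(b - d) / (1.0 + math.exp(b - d)) for d in difficulties]
            step = (r - sum(p)) / sum(q * (1.0 - q) for q in p)
            b += step
            if abs(step) < tol:                                 # convergence criterion
                break
        se = 1.0 / math.sqrt(sum(q * (1.0 - q) for q in p))     # Equation 7.2.6
        return b, se

    # A score of 3 on a five item test with difficulties -2, -1, 0, 1, 2 logits:
    print(ucon_measure(3, [-2.0, -1.0, 0.0, 1.0, 2.0]))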

The UCON procedure responds in detail to the distribution of item difficulties {di} and so estimates a measure br which is completely freed of whatever distribution of item difficulties characterizes the test. When the items happen to be such that their di's approximate a normal distribution di ~ N(H, σd²), however, then the PROX procedure described in Chapter 2 is an excellent approximation to the UCON procedure.

T h e P R O X estimated measure b r can be found w ith ou t iteration as

br = h + [1 + (sd2 /2 .8 9 ) ] * E n [ r / ( L - r ) ] [7.2.7]

in which
L
h= dj/L = d.

estimates test height H and

sd2 = ( 2 d j 2 - Ld.2 ) / ( L —1)


i
estimates the variance o f test item d ifficu lty ad2. The standard error fo r this br is

sr = (1 + s d2 / 2 . 8 9 ) ’/ ’ [ L / r ( L - r ) ] * . [7.2.8]
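As a computational companion to Equations 7.2.7 and 7.2.8, here is a short Python sketch of the PROX conversion (again ours, for illustration only):

    import math

    def prox_measure(r, d):
        # PROX estimate of the measure for score r on the items with
        # calibrations d (Equations 7.2.7 and 7.2.8).
        L = len(d)
        h = sum(d) / L                                          # test height
        sd2 = (sum(di * di for di in d) - L * h * h) / (L - 1)  # calibration variance
        X = (1 + sd2 / 2.89) ** 0.5                             # expansion factor
        b = h + X * math.log(r / (L - r))
        se = X * (L / (r * (L - r))) ** 0.5
        return b, se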

Since it is often the case that the di's of a sample of new items approximate a nor-
mal distribution and since normal samples of persons are typical, PROX is often useful
for calibrating new items. In making measures, however, we can take advantage of already
calibrated items and spread them uniformly di ~ U(H, W) over the range of ability to be
measured. Such a uniform test can be described completely by its height H, width W, and
length L. Its measures can be calculated efficiently by the UFORM procedure described
in Section 7.3.

The UFORM estimated measure bf is

bf = h + w(f - 0.5) + ln(A/B)     [7.2.9]

where A = 1 - exp(-wf)

B = 1 - exp[-w(1 - f)]

and h = Σi di/L = d̄.

estimates test height H,

w = [(dL + dL-1 - d2 - d1)/2] [L/(L - 2)]

estimates test width W, and

f = r/L

is the relative score on the L item test (Wright and Douglas, 1975a, 21-23).

The standard error for this bf is

sf = [(w/L)(C/AB)]^½     [7.2.10]

where A = 1 - exp(-wf)

B = 1 - exp[-w(1 - f)]

C = 1 - exp(-w) .
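The same conversion in Python form, a sketch of Equations 7.2.9 and 7.2.10 under the assumption that the test really is uniform:

    import math

    def uform_measure(r, h, w, L):
        # UFORM estimate for score r on a uniform test of height h,
        # width w and length L (Equations 7.2.9 and 7.2.10).
        f = r / L                                   # relative score
        A = 1 - math.exp(-w * f)
        B = 1 - math.exp(-w * (1 - f))
        C = 1 - math.exp(-w)
        b = h + w * (f - 0.5) + math.log(A / B)
        se = ((w / L) * (C / (A * B))) ** 0.5
        return b, se

This is the calculation that Appendix Tables A and B summarize: xfw = w(f - 0.5) + ln(A/B) and Cfw^½ = [w(C/AB)]^½.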

To illustrate the use of these procedures we have chosen nine persons from our
KCTB sample of 101. Three of these persons are at the preschool level, three are at the
primary level and three are adults.

In Columns 2 through 4 of Table 7.2.1 we give the sex, age and grade of these nine
persons. Column 5 contains their KCTB scores. Their corresponding UCON abilities are
given in Column 6.

TABLE 7.2.1

NINE PERSONS SELECTED FROM KCTB SAMPLE

                 1        2       3        4          5        6
  Ability     Person            Age in   School     KCTB     UCON
  Group       Name      Sex     Years    Grade      Score    Ability

              3M        M        3       Preschool    1       -5.8
  Preschool   6F        F        5       Preschool    3       -3.9
              12M       M        4       Preschool    5       -2.8

              29M       M        6       1           10       -0.9
  Primary     35F       F        9       4           11       -0.5
              69M       M        8       4           15        1.4

              88M       M       17+      12+         18        3.0
  Adult       98F       F       16       11          20        4.3
              101F      F       17+      12+         21        5.2

7.3 MEASURES FROM BEST TESTS BY UFORM

Prior to item calibration our only knowledge of item difficulties comes from our
general concept of the variable which the items are supposed to define. We do not know
the actual distribution of these items along their variable. Once we have calibrated items,
however, as with KCTB, then we have a detailed picture of where these items are located.
As a result we can use specially selected subsets of these calibrated items to expedite
measurement.

These specially designed or "tailored" tests will vary in length, in difficulty level and
in range of ability covered depending on the measurement target. Estimating measures
from such subsets of items can be done efficiently because we can construct the distri-
bution of item difficulties to suit our purpose. In particular, if we want to optimize the
efficiency of our designed tests, we will construct them so that the items are uniformly
spaced in difficulty over their measurement target. This makes the estimation of measures
from scores on these tests entirely manageable by the simple UFORM procedure.

All that is needed to apply UFORM are estimates of the height H, width W and
length L of a test. Then a single conversion table arranged by relative score and test width
provides all the person measures ever needed. A second table, similarly arranged, gives the
coefficients necessary to form the standard errors of these measures. Table 7.3.1 is an
abbreviated table of these relative measures and Table 7.3.2 is an abbreviated table of
their error coefficients. More complete tables for relative measures and their error coeffi-
cients are given in Appendix Tables A and B.

To use Tables 7.3.1 and 7.3.2 (or Tables A and B) we need the approximate width w
of the test and the person's relative score f = r/L. Together they determine the person's
relative ability xfw and its corresponding error coefficient Cfw^½. When we combine this
information with test height h and test length L, we get the measure bfw = h + xfw and
its standard error sfw = Cfw^½ / L^½.

In order to use Tables 7.3.1 and 7.3.2 for a particular test, we need estimates of that
test's basic characteristics H, W and L. Test length L is self-evident. Test height H is
estimated from the average difficulty level of the test's items, namely h = Σi di/L = d̄. The
estimation of test width W, however, can be problematic when an irregular distribution
of item difficulties at the extremes of the test cannot be avoided.

Test width can be estimated in various ways. For example

w1 = (dL - d1) [L/(L - 1)]

w2 = [(dL + dL-1 - d2 - d1)/2] [L/(L - 2)]

w3 = [(dL + dL-1 + dL-2 - d3 - d2 - d1)/3] [L/(L - 3)]

or

ws = 3.5 sd     where sd² = (Σi di² - Ld̄.²)/(L - 1) .

The method we have found best in practice is w2, the one based on the average dif-
ference between the two easiest and the two hardest items. This procedure for estimating
test width is illustrated in Table 7.3.3 where we calculate w for five forms of the KCTB.

TABLE 7.3.1

RELATIVE MEASURE xfw FOR UNIFORM TESTS

  Relative                 Test Width w
  Score
     f          2        4        6        8       10

    .1        -2.3     -2.7     -3.2     -3.8     -4.5
    .2        -1.5     -1.8     -2.2     -2.6     -3.1
    .3        -0.9     -1.1     -1.4     -1.7     -2.1
    .4        -0.4     -0.5     -0.7     -0.8     -1.0
    .5         0.0      0.0      0.0      0.0      0.0
    .6         0.4      0.5      0.7      0.8      1.0
    .7         0.9      1.1      1.4      1.7      2.1
    .8         1.5      1.8      2.2      2.6      3.1
    .9         2.3      2.7      3.2      3.8      4.5

For more detail see Appendix Table A.

Test Length:    L
Relative Score: f = r/L
Test Height:    h = Σi di/L
Test Width:     w = [(dL + dL-1 - d2 - d1)/2] [L/(L - 2)]
Measure:        bf = h + xfw

TABLE 7.3.2

ERROR COEFFICIENT Cfw^½ FOR UNIFORM TESTS

  Relative                 Test Width w
  Score
     f          2        4        6        8       10

    .1         3.4      3.5      3.7      3.8      4.0
    .2         2.6      2.7      2.9      3.2      3.4
    .3         2.3      2.4      2.7      3.0      3.2
    .4         2.1      2.3      2.6      2.9      3.2
    .5         2.1      2.3      2.6      2.9      3.2
    .6         2.1      2.3      2.6      2.9      3.2
    .7         2.3      2.4      2.7      3.0      3.2
    .8         2.6      2.7      2.9      3.2      3.4
    .9         3.4      3.5      3.7      3.8      4.0

For more detail see Appendix Table B.

Standard Error: sfw = Cfw^½ / L^½



TABLE 7.3.3

ESTIMATING TEST WIDTH w FOR FIVE KCT FORMS

  Test        Test      Two Easiest       Two Hardest       Test Width
  Form        Length    Items             Items             Calculated   Used
                                                                w'          w

  KCTB          23      -6.2*   -4.3       5.8    6.3          12.4*
                22      -4.3    -4.1       5.8    6.3          11.3        11

  Preschool      8      -6.2    -4.3      -2.1   -2.1           4.2         4

  Primary       15      -2.7    -2.6       2.0    2.9           5.9         6

  Adult         15      -1.5    -1.0       5.8    6.3           8.4         8

  Pilot          7      -6.2    -4.1       4.5    6.3          14.8        15

w' = [(dL + dL-1 - d2 - d1)/2] [L/(L - 2)]

w  = w' rounded to nearest integer.

* Item 3 at -6.2 is 2 logits below the more or less uniform stream of 22 items from Item 7 at -4.3
through Item 24 at 6.3. UFORM is more accurate with this kind of extreme non-uniformity, when
test width is calculated without the very irregular extreme item.

The first row of Table 7.3.3 concerns the 23 items in the KCTB "item bank." From
these 23 items we have composed three narrow-range test forms focused on three ability
levels: a Preschool Form of 8 items, a Primary Form of 15 items and an Adult Form of
15 items, and also one wide-range Pilot Form of 7 items. The calibrations for the two
hardest and two easiest items for each of these test forms are given in Table 7.3.3. With
these calibrations we can estimate the various test widths using the w2 method to calcu-
late w' as

w' = [(dL + dL-1 - d2 - d1)/2] [L/(L - 2)]     [7.3.1]

and rounding the w' computed to the nearest integer for the value of w used in tables like
7.3.1 and 7.3.2.
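As a worked illustration of Equation 7.3.1 (our own sketch, in Python), the width calculation for the KCTB rows of Table 7.3.3 is:

    def test_width(d1, d2, dLm1, dL, L):
        # Equation 7.3.1: w' from the two easiest (d1, d2) and the
        # two hardest (dLm1, dL) of the L item calibrations.
        return ((dL + dLm1 - d2 - d1) / 2.0) * (L / (L - 2.0))

    # Full KCTB, 23 items, extremes from Table 7.3.3:
    print(round(test_width(-6.2, -4.3, 5.8, 6.3, 23), 1))   # 12.4
    # Dropping the irregular Item 3 (22 items) gives the width actually used:
    print(round(test_width(-4.3, -4.1, 5.8, 6.3, 22), 1))   # 11.3, rounded to 11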

Now we apply the UCON, PROX and UFORM measuring procedures to our sample
of nine persons and compare the results. Table 7.3.4 gives the UCON measures and errors
for the KCTB scores from 1 to 22. Table 7.3.5 gives the nine persons' KCTB scores and
the corresponding ability measures and errors for each of these scores by UCON, PROX
and UFORM.

Person 29M, for example, earned a KCTB score of 10 correct out of 23 items
attempted. His UCON ability and error, looked up in Table 7.3.4, are b = -0.9 and
s = 0.6.

His PROX ability and error calculated from h = 0, X = 2.2 and L = 23 are

b = h + X ln[r/(L - r)]

  = 0 + 2.2 ln[10/13]

  = -0.6

s = X [L/r(L - r)]^½

  = 2.2 [23/10(13)]^½

  = 0.9 .

The value of the expansion factor X comes from the variance of item difficulty sd² = 11.0
as

X = (1 + sd²/2.89)^½

  = (1 + 11/2.89)^½

  = 2.2 .
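These few lines of arithmetic can be checked with a few lines of Python (a sketch of ours, plugging in the h, sd² and L quoted above rather than recomputing them from the calibrations):

    import math

    h, sd2, L, r = 0.0, 11.0, 23, 10
    X = (1 + sd2 / 2.89) ** 0.5                # expansion factor, about 2.2
    b = h + X * math.log(r / (L - r))          # about -0.6
    s = X * (L / (r * (L - r))) ** 0.5         # about 0.9
    print(round(X, 1), round(b, 1), round(s, 1))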

TABLE 7.3.4

UCON ABILITIES AND ERRORS
FOR THE 23 KCTB ITEMS

  Score     Ability    Error
    r          b         s

    1        -5.8       1.2
    2        -4.6       1.0
    3        -3.9       0.8
    4        -3.3       0.7
    5        -2.8       0.7
    6        -2.4       0.6
    7        -2.0       0.6
    8        -1.6       0.6
    9        -1.3       0.6
   10        -0.9       0.6
   11        -0.5       0.6
   12        -0.1       0.7
   13         0.3       0.7
   14         0.8       0.7
   15         1.4       0.7
   16         1.9       0.8
   17         2.4       0.8
   18         3.0       0.8
   19         3.6       0.8
   20         4.3       0.9
   21         5.2       1.0
   22         6.3       1.2

His UFORM ability and error are calculated from his relative score f = r/L = 10/23 = .43
and the values for xfw and Cfw^½ found in Tables A and B of the appendix with h = 0,
w = 11 and L = 23. Thus

b = h + xfw ,     xfw = -0.8 from Table A

  = 0 - 0.8

  = -0.8 ,

and

s = Cfw^½ / L^½ ,     Cfw^½ = 3.3 from Table B

  = 3.3 / 23^½

  = 0.7 .

The last columns of Table 7.3.5 give the difference between PROX or UFORM
and UCON. With the exception of the PROX measure for Person 3M, no difference is
larger than 0.4 logits. All differences are less than half of the standard errors associated
with their ability measures.

Confidence in the use of the UFORM Tables 7.3.1 and 7.3.2 or Appendix Tables A
and B depends on a knowledge of their functioning over a variety of typical test situa-
tions. Wright and Douglas (1975a) investigated their functioning with a simulation study
designed to check on the major threats to the success of these tables in providing useful
measures.

The results of their study are summarized by the bounds given in Table 7.3.6 for the
extent to which a test can depart in practice from a uniform spacing of item difficulties
before measurements based on the assumption of a uniform test become unacceptable.
Table 7.3.6 gives the combinations of H - β, W, and L within which the bias in estimating
β caused by non-uniformity in item difficulty is less than 0.1 logits.

The amount of leeway shown in Table 7.3.6 may seem surprising, since it allows a
random item difficulty of, say, d = 2.0 when uniformity calls for δ = 1.0. But, when h
and w are calculated from a test's actual di's, it is demonstrable that a broad spectrum
of test designs is exceptionally robust with respect to random departures from uniformity
in item difficulty.

Table 7.3.6 shows that as test length increases beyond 30 items, no reasonable
testing situation risks measurement bias large enough to matter. Tests in the neighbor-
hood of 30 items, of width less than 8 logits and which come within 1 logit of their
target β are, for all practical purposes, free from bias caused by random deviations in the
uniformity of item calibrations of magnitude less than 1 logit. Only when tests are as
short as 10 items, wider than 8 logits and more than 2 logits off-target does the measure-
ment bias caused by random non-uniformity of item difficulty exceed 0.2 logits. This
means that UFORM measurement tables, even though they are based on the assumption
of perfectly uniform tests, can be used to transform scores into measures in most prac-
tical situations.

TABLE 7.3.6

PERFORMANCE OF UFORM PROCEDURE
FOR TESTS LESS THAN 8 LOGITS WIDE

  Maximum        Maximum       Minimum        Maximum
  Item Bias      Off-Target    Test Length    Measurement Bias
  |di - δi|      |H - β|           L          BIAS    BIAS/SEM

    1.0             2             10           .2       .4
                    1             30           .1       .3
    0.5             2             10           .1       .2
                    1             30           .1       .2

BIAS = The average measurement bias in 100 replications of a test in which the random departures
from a uniform distribution of item difficulties are bounded by |di - δi|.

7.4 INDIVIDUALIZED TESTING

The need for individualized testing becomes obvious whenever we encounter a situa-
tion in which inappropriate items have been given to a person. The solution to this prob-
lem is to tailor tests to persons. The construction of a bank of calibrated items makes the
efficient implementation of tailored testing simple. The uniformity of measurement
precision near the center of tests of typical height and width shows that we need only
bring the selected items to within a logit of their intended target to achieve "good enough"
tailoring. This can be done in various ways.

Status Tailoring. Information about grade placement or age will often be sufficient
to tailor a school test. Prior knowledge of the approximate grade placement of the target
group or pupil and of the variable's grade norms can be used to determine an appropriate
segment of items. Normative data in a variety of school subjects suggests that typical
within grade standard deviations are about one logit. When this is so, even a rough idea
as to a pupil's within grade quartile provides more than enough information to design a
best test for that pupil.

Performance Tailoring. Where grade or age information are not sufficient, tailor-
ing can be accomplished with a pilot test of 5 to 10 items spread out enough in difficulty
to cover the widest expected target. If the pilot test were set up to be self-scoring, then
pupils could use their number right to guide themselves into a second test specifically
tailored to the ability level implied by their pilot test score.

Self-Tailoring. A third even more individualized scheme may prove practical in
many circumstances. The person to be measured is given a booklet of items presented in
order of uniformly increasing difficulty and asked to find their own best working level.
Testing begins when the person finds items hard enough to interest them but easy enough
to master. Testing continues into more difficult items until the person decides that the
level of difficulty is beyond their ability. The self-tailored test on which this person is
then measured is the continuous segment of items attempted.
[Figure 7.5.1. Item distribution of the three sequential forms (Preschool, Primary and Adult) along the KCT variable, plotted on the logit scale; the graphic itself is not legible in this copy.]

This approach is self-adapting to individual variations in speed, test comfort and
level of productive challenge. The large variety of different test segments which can result
are easy to handle. The sequence number of the easiest and hardest items attempted and
the number of correct responses between them can be read off a self-scoring answer form
and converted into a measure and its standard error merely by looking up these three
statistics in a simple one-page table made to fit with the booklet of items used in testing.

Self-tailored testing corresponds to the use of basal and ceiling levels on individually
administered tests like the Stanford-Binet. The only difference is that, with the self-
tailored test, the segment of items administered is determined by the person taking the
test rather than by an examiner.

7.5 STATUS TAILORING

To illustrate status tailoring we allocated our KCTB items to three sequential forms.
The Preschool Form is aimed at preschool children. The Primary Form is aimed at pri-
mary school children. The Adult Form is for persons beyond primary school. Figure
7.5.1 shows the distribution of items into these three forms.

The Preschool Form is composed of the first 10 items. Only Items 3 through 10
are calibrated because virtually everyone tested so far has gotten Items 1 and 2 correct.
The Primary Form is composed of Items 5, 6 and 8 through 20 to cover the middle range
of the variable. The Adult Form is composed of Items 11 through 25, the hardest items
calibrated, and Items 26, 27 and 28 which are so hard that no one tested so far has gotten
them correct.

Notice that we can include these five "out-of-bound" items in our test forms with-
out impairing our measurements in any way. This is because we can focus our measure-
ments on the portion of the test which is both taken by the person and made up of cali-
brated items while letting extreme items continue to work for us as the conceptual
boundaries of the KCT variable. If eventually we encounter persons who fail Items 1 or
2 or who pass Items 26, 27 or 28, then we will also be able to calibrate these items onto
the KCT variable and use responses to them in our measurements.

The items for each of the three forms and their corresponding item difficulties,
where known, are given in Table 7.5.1. Below the items in each form are that form's
test characteristics: height h, width w and length L.

These three test forms were applied to the nine persons. Table 7.5.2 shows how
each person scored on each of the forms. Persons 3M and 6F could be measured on only
the Preschool Form while persons 98F and 101F could be measured on only the Adult
Form. Person 12M produced a measurable record on the Preschool and Primary Forms.
Persons 69M and 88M produced measurable records on the Primary and Adult Forms.
Persons 29M and 35F produced measurable records on all three forms.

For Person 29M with relative score .75 on the Preschool Form (h = -3.3, w = 4,
L = 8) we look up xfw = 1.4 and Cfw^½ = 2.6 in Tables A and B to find the estimate

b = -3.3 + 1.4 = -1.9

with standard error

s = 2.6 / 8^½ = 0.9 .
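The same estimate can also be obtained directly from the UFORM formulas of Section 7.2 instead of the tables. A short Python check of ours (the score r = 6 is inferred from the relative score .75 on the 8 calibrated Preschool items):

    import math

    h, w, L, r = -3.3, 4, 8, 6
    f = r / L                                  # relative score, 0.75
    A = 1 - math.exp(-w * f)
    B = 1 - math.exp(-w * (1 - f))
    C = 1 - math.exp(-w)
    b = h + w * (f - 0.5) + math.log(A / B)    # about -1.9
    s = ((w / L) * (C / (A * B))) ** 0.5       # about 0.9
    print(round(b, 1), round(s, 1))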

TABLE 7.5.1

TEST STATISTICS OF THREE SEQUENTIAL FORMS
MEASURING THE KCT VARIABLE

  PRESCHOOL FORM         PRIMARY FORM           ADULT FORM
  Item   Item            Item   Item            Item   Item
  Name   Difficulty      Name   Difficulty      Name   Difficulty

   1       *
   2       *
   3      -6.2
   4      -4.1
   5      -2.6             5     -2.6
   6      -2.7             6     -2.7
   7      -4.3
   8      -2.6             8     -2.6
   9      -2.1             9     -2.1
  10      -2.1            10     -2.1
                          11     -1.0            11     -1.0
                          12     -0.1            12     -0.1
                          13     -0.9            13     -0.9
                          14     -0.5            14     -0.5
                          15     -1.5            15     -1.5
                          16     -0.8            16     -0.8
                          17      1.9            17      1.9
                          18      1.4            18      1.4
                          19      2.0            19      2.0
                          20      2.9            20      2.9
                                                 21      3.3
                                                 22      3.3
                                                 23      4.5
                                                 24      6.3
                                                 25      5.8
                                                 26       **
                                                 27       **
                                                 28       **

Test Characteristics of the Calibrated Items

             Preschool     Primary     Adult
             Form          Form        Form
  Height:    h = -3.3      h = -0.6    h = 1.8
  Width:     w = 4         w = 6       w = 8
  Length:    L = 8*        L = 15      L = 15**

* Items 1 and 2 were too easy to calibrate

** Items 26, 27 and 28 were too hard to calibrate

Items 1, 2, 26, 27 and 28 cannot be used for measurement because
their difficulty levels have so far eluded calibration.

For Person 29M's relative score of .47 on the Primary Form (h = -0.6, w = 6, L = 15)
we look up xfw = -0.2 and Cfw^½ = 2.6 to find the estimate

b = -0.6 - 0.2 = -0.8

with standard error

s = 2.6 / 15^½ = 0.7 .

For Person 29M's relative score .27 on the Adult Form (h = 1.8, w = 8, L = 15) we
look up xfw = -2.0 and Cfw^½ = 3.0 to estimate

b = 1.8 - 2.0 = -0.2

with standard error

s = 3.0 / 15^½ = 0.8 .

Even though only one of the forms taken is best focused on a person and so pro-
duces their "best" measure, still we can see in Table 7.5.2 that, in spite of the wide
variation in score on different forms, the measures for a given person are, for the most
part, comparable. Person 29M produces the greatest variation in measures over these
three forms. His three relative scores of .75, .47 and .27 vary widely in response to the
variation in difficulty of the three forms. According to our model his three measures
of -1.9, -0.8 and -0.2 ought to be statistically equivalent, even though they may seem
to vary more than we might like. When their variation is evaluated in the light of their
standard errors of 0.9, 0.7 and 0.8 we see that the lowest estimate of -1.9 on the Pre-
school Form plus one of its standard errors and the highest estimate of -0.2 on the
Adult Form minus one of its standard errors touch at -1.0.

Table 7.5.3 shows for each person their ability measure on the total KCTB test
and their ability measure on each of the three sequential forms. The difference between
each test form and the KCTB is given at the right of the table. When these differences
are compared to the errors associated with them it can be seen that all of the differences
are less than half a standard error except for those of Persons 29M and 98F.

The standard errors for each ability for the KCTB and the three test forms are
given in Table 7.5.4. These values are stable and consistent over forms for the nine persons.

7.6 PERFORMANCE TAILORING

To illustrate performance tailoring we developed a Pilot Form of seven items from
KCTB. Figure 7.6.1 shows the distribution of these seven items. They were selected to
be as uniform as possible over the 15 logit range of the KCT variable. Table 7.6.1 gives
their item difficulties and the Pilot Form test characteristics. Height is centered at 0.0.
The effective width is 15 logits.

To demonstrate performance tailoring with this Pilot Form we will use the per-
formances of our nine persons on the Pilot Form to indicate the sequential form most
appropriate for measuring each of them. Then, we will measure them on the indicated
sequential form and compare their "performance tailored" measure with their measure
based on all 23 KCTB items.

TABLE 7.5.4

COMPARING MEASUREMENT PRECISION
FROM THE SEQUENTIAL FORMS WITH
MEASUREMENT PRECISION FROM THE KCT BANK

                               Sequential Forms
                     KCTB    Preschool   Primary    Adult
                     L=23    L=8         L=15       L=15
  Ability    Person  Error   Error       Error      Error
  Group      Name    ŝ       s1          s2         s3

             3M      1.2     1.1*
  Preschool  6F      0.8     0.8
             12M     0.7     0.8         0.8

             29M     0.6     0.9         0.7        0.8
  Primary    35F     0.6     1.2         0.7        0.8
             69M     0.7                 0.8        0.8

             88M     0.8                 1.1        0.8
  Adult      98F     0.9                            0.8*
             101F    1.0                            0.9*

* Discrepancies from KCTB minimum ŝ are due to UFORM approximation

In Table 7.6.2 we give for each person their name and KCTB ability. Next we
give their performance on the Pilot Form. This includes their pilot score r, relative score
f, ability b1, and error s1. Then we show the target regions (from b1 - s1 to b1 + s1)
implied by each Pilot Form performance. This is followed by the sequential form indi-
cated and the resulting ability measure b2 based on their performance on the indicated
sequential form. Finally, we show the difference (b2 - β̂) between the KCTB measure
β̂ and the sequential form measure b2 together with the KCTB error ŝ.

For example, Person 29M had a KCTB ability of -0.9. Using the Pilot Form we
found an ability of +1.1 with a standard error of 1.5 indicating a target range of -0.4 to
2.6. Since the Adult Form is targeted at 1.8 logits, it is the sequential form indicated for
measuring 29M. On this form he obtained a measure of -0.2 logits.

In Table 7.6.2 we see that for seven of the nine persons the difference between
their measure on the KCTB and their measure on a performance-tailored best sequential
form differs by less than half a standard error. Persons 29M and 98F, however, show dis-
crepancies between the measures implied by the KCTB and the sequential test form
which are of the order of one standard error.

FIGURE 7.6.1

ITEM DISTRIBUTION OF A PILOT FORM FOR
LOCATING PERSONS ON THE KCT VARIABLE

[Seven Pilot Form items spread along the KCT variable, logit scale -6.0 to +6.0; the plotted item positions are not legible in this copy.]

TABLE 7.6.1

TEST STATISTICS OF A PILOT FORM
FOR LOCATING PERSONS ON THE KCT VARIABLE

  PILOT FORM

  Item Name     Item Difficulty

      3             -6.2
      4             -4.1
      9             -2.1
     12             -0.1
     19              2.0
     23              4.5
     24              6.3

  Height:  h = 0.0
  Width:   W = 15
  Length:  L = 7

7.7 SELF-TAILORING

In order to develop an example of self-tailoring we return to the original person
response records of 1's and 0's to see how individualized response patterns might emerge
from a matrix such as Table 2.3.1. We want to select a record for each of our nine persons
that approximates a self-tailored sequence of items. To obtain these individualized seg-
ments we established basal levels, the point at which a particular person might begin
taking items, at three successes prior to the first failure. We set ceiling levels, the point at
which a person might stop taking items, at three successive failures. This produced a
unique self-tailored sequence of items for each of our nine persons.

In Table 7.7.1 we show the response patterns of these self-tailored tests. The first
item Person 29M missed, for example, was Item 6. Thereafter he continued passing and
failing items until he failed Items 17, 18, and 19 successively. This defined a self-tailored
segment for him ranging from Item 3 through Item 19. On these 17 items he had a score
of r = 10.

In Table 7.7.2 we compute a measure for each person based upon their self-tailored
test segment. In order to do this computation we determine for each self-tailored segment
its test characteristics h, w and L. These test characteristics for each person's self-tailored
segment are given on the left of Table 7.7.2.

Thus Person 29M has a score of 10 on his self-tailored segment of 17 items.
Since his segment has a width w = 8 this score of 10 produces a relative ability measure
of 0.7 logits which when adjusted for the height of his segment (h = -1.5) yields an
ability estimate of -0.8 with a standard error of 0.7. This ability estimate is only 0.1
logits away from his KCTB ability estimate of -0.9 with error 0.6. Inspection of the
differences given in Table 7.7.2 between measures on each self-tailored segment and
their corresponding KCTB measures shows that all of the measures obtained by self-
tailoring are close to the ability measures obtained by the KCTB.

In Table 7.7.3 we show, for each type of tailoring, the efficiency in item usage for
each of our nine persons. We see that a considerable number of items can be saved with-
out much diminishing the accuracy of ability estimates.

Person 29M with a 17 item self-tailored segment requires the most items, yet even
this segment is 6 items less than the total 23 KCTB items and there is virtually no loss
of measurement accuracy. Person 3M produces almost as precise an estimate with only
4 self-tailored items as can be obtained for him by using all 23 of them. This saves 19
items. Person 3M, however, is at the extreme low end of the KCT variable. As a result
only the four easiest items are relevant to measure his ability. Were additional easy items
available, we could use them to advantage with Person 3M to improve the precision of
his measure.

The self-tailored procedure always achieves the most efficient item utilization. This
is especially so when making measures at extremes, in this case, beyond ±4 logits on the
KCTB ability scale. However, while appreciating this apparent efficiency, we must also
realize that the items saved are items inappropriate for their target. Our real goal is
to make measurements sufficiently accurate to be useful. Accuracy depends on the
number of items used which are near enough to the person to be measured so that each
item makes an adequate contribution to the estimated measure. This means that we want
items to be within a logit of their target. Once the items are brought this near their target,
all further considerations of accuracy, and hence of efficiency, boil down to the question
of how many of these "tailored" items it is practical for the person to attempt.
[Table 7.7.1. Self-tailored response sequences from KCTB: the item-by-item records of 1's and 0's for the nine persons' self-tailored segments. The body of this table is not legible in this copy.]
[Table 7.7.2. Self-tailored measurements from the response sequences: each person's segment characteristics h, w and L, score, measure and standard error, with the difference from the KCTB measure. The body of this table is not legible in this copy.]
[Table 7.7.3. Measurement efficiencies possible with three types of tailored testing, for each of the nine persons. The body of this table is not legible in this copy.]

7.8 PERSON FIT AND QUALITY CONTROL

During test administration it may appear that an examinee has taken the test as
planned. Nevertheless, it is always necessary to examine the actual pattern of responses
to see if this pattern does in fact correspond to reasonable expectations.

Consider, for example, a test of 10 items administered in increasing order of diffi-
culty. Table 7.8.1 shows five different ways a score of five might be achieved on such a
test. The response patterns of Persons A and B, and even C, seem reasonable. Success
occurs on the easier items to the left and failure occurs on the harder items to the right.
However, the patterns of Persons D and E are quite implausible. How could it happen
that Person D got a score of five by succeeding on the five most difficult items while at
the same time failing the five easiest items? That is so contradictory to our expectations
for a meaningful test record that we cannot take Person D's score of five as the basis for
a valid measure of ability. Person D may be smart and careless or dumb and lucky, but
one thing is certain, Person D does not have the intermediate ability implied by a score
of five.

The response record of Person E also raises questions. If Person E could answer
items in the middle range of difficulty correctly including four of the five hardest items,
why were the three easiest items missed?

The Rasch measurement model leads to a comprehensive yet easily applied pro-
cedure for evaluating the validity of each examinee's record of responses. In this pro-
cedure the person's response record is compared with our expectation of what should
happen according to the response model. The procedure uses this comparison to calculate
a "fit" statistic which indicates the extent to which the person's performance on the test
is in accordance with model expectations.

If xvi is the response of person v with tentative measure bv to item i with bank
calibration di, and if xvi = 0 for an incorrect response or xvi = 1 for a correct one, then
according to our measurement model

zvi² = exp[(2xvi - 1)(di - bv)]     [7.8.1]

is a standard square residual for evaluating the relationship between the observed response
xvi and its model expectations given bv and di. According to expectation this zvi² should
be approximately distributed as chi-square with about (L - 1)/L degrees of freedom
where L is the number of items in the test used to estimate bv. If the set of {zvi²} does
appear to be distributed this way, then we have no internal reason to invalidate bv. But if
not, we must acknowledge a departure in the data from our expectation and we must see
what we can do about it.

Every response xvi in the set of i = 1 to L taken by person v produces its own almost
independent zvi². We can sum this set of L residuals {zvi²} into an approximate chi-
square with about (L - 1) degrees of freedom, and for convenience express this chi-square
as the standardized statistic

tv = [ln(vv) + (vv - 1)] [(L - 1)/8]^½ ~ N(0,1)     [7.8.2]

TABLE 7.8.1

FIVE WAYS TO SCORE FIVE ON A TEN ITEM TEST

                 Items in order of increasing difficulty
            Easiest                                        Hardest
             Item                                           Item
  Person      #1   #2   #3   #4   #5   #6   #7   #8   #9    #10     Score

    A          1    1    1    1    0    1    0    0    0     0        5
    B          1    1    1    0    1    0    1    0    0     0        5
    C          1    1    1    0    0    1    1    0    0     0        5
    D          0    0    0    0    0    1    1    1    1     1        5
    E          0    0    0    1    0    1    1    0    1     1        5

where vv is the mean square

vv = Σi zvi²/(L - 1) .     [7.8.3]

The divisor of 8 in Equation 7.8.2 comes from averaging two opposing standardi-
zations of the mean square v. Thus, if

t1 = (v - 1) [(L - 1)/2]^½ ~ N(0,1)

and

t2 = [ln(v)] [(L - 1)/2]^½ ~ N(0,1)

then

t = (t1 + t2)/2 = [ln(v) + (v - 1)] [(L - 1)/8]^½ ~ N(0,1) .
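Equations 7.8.1 through 7.8.3 are easy to apply by hand or by machine. A minimal Python sketch of ours of the person fit calculation:

    import math

    def person_fit(x, d, b):
        # Fit of the responses x (0 or 1) to items with calibrations d
        # for a person with tentative measure b (Equations 7.8.1 - 7.8.3).
        # Returns the sum of squares, the mean square v and the fit statistic t.
        z2 = [math.exp((2 * xi - 1) * (di - b)) for xi, di in zip(x, d)]
        L = len(x)
        ss = sum(z2)
        v = ss / (L - 1)
        t = (math.log(v) + (v - 1)) * ((L - 1) / 8) ** 0.5
        return ss, v, t

Given a person's segment of responses, the item calibrations and the tentative measure, this computes the same quantities that are summarized below in Tables 7.8.3 and 7.8.4.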

In Table 7.8.2 we work out the person fit analysis for the response patterns of Per-
sons 12M, 35F and 88M. Person 12M has a tentative measure of b = -2.8. For his first
item, d = -6.2, his response is x = 0. These give him a (d - b) difference of

(d - b) = [-6.2 - (-2.8)] = -3.4 ;

since (2x - 1) = -1,

then z² = exp[(2x - 1)(d - b)] = exp[-(-3.4)] = exp(3.4) = 30 .


[Table 7.8.2. Calculating person fit: the response x, the difference (d - b) and the squared residual z² = exp[(2x - 1)(d - b)] for each item in the self-tailored segments of Persons 12M, 35F and 88M. The body of this table is not legible in this copy.]

TABLE 7.8.3

CALCULATING PERSON FIT: RESIDUAL ANALYSIS

  Person    Tentative    Sum of     Degrees of    Mean      Fit
  Name      Ability      Squares    Freedom       Square    Statistic
               b           Σz²        (L-1)          v          t

  12M         -2.8         35.0          8           4.4        4.9*

  35F         -0.3         16.8         14           1.2        0.5

  88M          3.2          7.3         10           0.7       -0.7

* Misfit Signal

z² = exp[(2x - 1)(d - b)]

v = Σ z²/(L - 1)

t = [ln(v) + (v - 1)] [(L - 1)/8]^½

For each other response in Person 12M's tailored segment of 9 items we have given his x,
(d - b) and z². The residual analysis based upon this row of z²'s for Person 12M leads to

Σi zi² = 35 ,

which, for 8 degrees of freedom, gives a mean square of v = 4.4 and an approximate
normal deviate t = 4.9. The residual analyses for these three persons are summarized in
Table 7.8.3.

Notice in Table 7.8.2 that we have used (d - b) rather than the (b - d) used in
Chapter 4. This is because the (d - b) form is convenient for the calculation of z². When-
ever a response is 0, a minus sign is attached to the difference (d - b) which turns it into
(b - d). If, however, we keep this sign change in mind, we can use Table 4.3.3 to deter-
mine the values in Table 7.8.2. If you use Table 4.3.3, however, you will find that the
values in Table 7.8.2 are slightly more exact than the values determined from Table 4.3.3.
The difference is greatest on responses which fit well, but these responses play the small-
est role in misfit analysis. The sum of squares Σz² of 12M based on Table 4.3.3 would be
33 instead of the 35 given in Table 7.8.3. The resulting t would be 4.5 instead of 4.9.

The fit statistic t is distributed more or less normally but with wider tails. In our
practical experience the popular rejection level of about two is unnecessarily conservative.
The general guidelines we currently use for interpreting t as a signal of misfit are:
The general guidelines we currently use fo r interpreting t as a signal o f misfit are:

If t < 3         we accept the measurement of the person as probably valid.

If 3 < t < 5     we make a careful examination of the response pattern in order
                 to identify and consider possible sources of misfit.

If t > 5         we reject the measure as it stands and take whatever steps we can
                 to extract a "corrected" measure from an acceptable segment of
                 the response record, if one exists.

The detailed study of person misfit of course depends on a detailed study of the approxi-
mate normal deviates

zvi = (2xvi - 1) exp[(2xvi - 1)(di - bv)/2]

in the response record in order to track down the possible sources of irregularity.

Since those portions of the Σz² which contribute most to t are the large positive
terms, we can streamline the determination of record validity by forming a quick statistic
focused on the most surprising responses. Table 4.3.3 (also given as Appendix Table C)
shows that the difference between person measure b and item difficulty d must be of
the order of ±2.0 before z² grows larger than 7 or its probability becomes less than 0.12.
To reach a probability for a given response of .05 or less we must relax our standard to a
(d - b) difference of ±3 producing a z² of 20.

If we concentrate our attention on surprising responses for which |d - b| > 3, then
the actual z²'s may be looked up in Table 4.3.3 (or Appendix Table C) and combined
with an average value of 1 for all the remaining items in the response segment to produce
a crude Σz² for which a crude t can be calculated.

For example, over the 9 responses of Person 12M, there is only one surprise. This
is where (d - b) = -3.4 and z² = 30. Combining this value of 30 with eight 1's for the
remaining eight items of the test gives us a crude Σz² = 38 and a crude t = 5.3. This value
for t is not far from the more exact 4.9 we calculated in Table 7.8.2 and leads us to the
same conclusion of a significant misfit.
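The shortcut is easily mechanized. A small Python sketch of ours of this crude t:

    import math

    def crude_t(surprising_z2, L):
        # Sum the z-squared of the surprising responses and count 1
        # for each of the remaining responses in the segment of L items.
        ss = sum(surprising_z2) + (L - len(surprising_z2))
        v = ss / (L - 1)
        return (math.log(v) + (v - 1)) * ((L - 1) / 8) ** 0.5

    # Person 12M: one surprise (z-squared of about 30) in a 9 item segment
    print(round(crude_t([30], 9), 1))     # about 5.3, versus the exact 4.9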

In Table 7.8.4 we summarize the residual analysis for all nine persons. For each per-
son we give the ability measure and standard error from their self-tailored segment of
items. Next we give the sum of squares, degrees of freedom, mean square and fit statistic
for each person's record. For eight cases we find no evidence of misfit and so we take
their measures as plausible. Only the self-tailored segment of Person 12M's record shows
a significant misfit. As we saw in Table 7.8.2, this misfit is due entirely to his incorrect
response on the first and easiest item in his record. The reason for this incorrect response
might be a failure in test taking or a lapse in functioning. In either case we are still inter-
ested in the best possible estimate of Person 12M's ability. The problem of extracting the
best possible measure from a flawed record will be discussed in Section 7.10.

TABLE 7.8.4

FIT ANALYSIS OF THE NINE PERSONS
MEASURED BY SELF-TAILORED TESTS

                     Self-Tailored             Residual Analysis
  Ability   Person   Ability  Error   Sum of    Degrees of   Mean     Fit
  Group     Name                      Squares   Freedom      Square   Statistic
                        b       s       Σz²       (L-1)        v         t

            3M        -5.7     1.3      1.1         3          0.4      -1.0
  Preschool 6F        -3.8     0.9      2.4         5          0.5      -1.0
            12M       -2.8     0.8     35.0         8          4.4       4.9*

            29M       -0.8     0.7     17.6        16          1.1       0.3
  Primary   35F       -0.3     0.7     16.8        14          1.2       0.5
            69M        1.4     0.6     20.0        14          1.4       1.0

            88M        3.2     0.9      7.3        10          0.7      -0.6
  Adult     98F        4.4     0.9      2.1         5          0.4      -1.1
            101F       5.2     1.1      1.7         4          0.4      -1.0

* Misfit Signal

7.9 DIAGNOSING MISFIT

Consider again the 10 item test with items in order of increasing difficulty imagined
for Table 7.8.1. Were we to encounter the pattern produced by Person E, namely

                                                     Score
  0  0  0  1  0  1  1  0  1  1                         5

we would be puzzled and wonder how this person could answer the hard questions cor-
rectly, while getting the first three easiest questions incorrect. Were they "sleeping" on
the easy portion of the test?

On the other hand were we to encounter the response pattern

                                                     Score
  1  0  1  0  0  0  0  1  1  1                         5

our surprise would be as great, but now we might be inclined to explain the irregularity as
the result of lucky "guessing" on the three hardest items.

Both the probabilistic nature of the model and our everyday experience with typical
response patterns leads us to expect patterns which have a center region of mixed correct
and incorrect responses. When we encounter a pattern like

                                                     Score
  1  1  1  1  1  0  0  0  0  0                         5

it therefore strikes us as "too good to be true." This unexpectedly regular pattern is
sometimes produced by persons who work very slowly and carefully, refusing to proceed
to the next item until they have done everything possible to answer the present item cor-
rectly. We will refer to this pattern as "plodding."

Finally, we can also identify a special form of "sleeping" which might better be
called "fumbling" in which the incorrect responses are bunched at the beginning of the
test suggesting that the person had trouble getting started.

To summarize, we identify the following kinds of response patterns (a numerical
illustration of how the fit statistic reacts to each kind follows the list):

                                                     Score
  "normal"            1  1  1  1  0  1  0  0  0  0     5
                      1  1  1  0  1  0  1  0  0  0     5

  "sleeping" or
  "fumbling"          0  0  0  1  0  1  1  0  1  1     5

  "guessing"          1  0  1  0  0  0  0  1  1  1     5

  "plodding"          1  1  1  1  1  0  0  0  0  0     5
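To show numerically how the fit statistic of Section 7.8 separates these patterns, here is a small Python illustration of ours. The ten item difficulties are hypothetical (uniformly spaced from -2.25 to +2.25 logits) and the person is placed at b = 0, the measure a score of five implies on such a test; only the pattern of responses changes from line to line.

    import math

    def fit_t(x, d, b):
        # Fit statistic t of Equations 7.8.1 - 7.8.3
        z2 = [math.exp((2 * xi - 1) * (di - b)) for xi, di in zip(x, d)]
        v = sum(z2) / (len(x) - 1)
        return (math.log(v) + (v - 1)) * ((len(x) - 1) / 8) ** 0.5

    d = [-2.25, -1.75, -1.25, -0.75, -0.25, 0.25, 0.75, 1.25, 1.75, 2.25]
    patterns = {
        "normal":   [1, 1, 1, 1, 0, 1, 0, 0, 0, 0],
        "sleeping": [0, 0, 0, 1, 0, 1, 1, 0, 1, 1],
        "guessing": [1, 0, 1, 0, 0, 0, 0, 1, 1, 1],
        "plodding": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    }
    for name, x in patterns.items():
        print(name, round(fit_t(x, d, 0.0), 1))

With these hypothetical difficulties the disordered patterns earn large positive values (sleeping about 5.2, guessing about 3.7), the normal pattern a small value (about -1.2), and the plodding pattern the most negative value (about -1.6), showing less than the expected amount of random variation.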

In Section 7.8 we identified a misfitting response pattern for Person 12M. Now we
will investigate misfitting records, such as that of Person 12M, to see how the diagnosis of
irregular response patterns might be accomplished. The self-tailored response pattern for
Person 12M, with items in order of increasing difficulty, is

                                                     Score
  0  1  1  1  1  1  0  0  0                            5

The evaluation of this response pattern in Table 7.8.3 shows a significant misfit, t = 4.9.
In Table 7.9.1 we show the response pattern for Person 12M again and add for each re-
sponse the probability p of its occurrence under the model. We also give his response
pattern in terms of z's in addition to the z²'s. When we plot the z's for Person 12M in
Figure 7.9.1 we see what a "sleeping" or "fumbling" response pattern looks like. This
figure displays the segment of items responded to. Each item is spaced horizontally along
the KCT variable according to its difficulty on the logit scale. Its vertical position is
determined by the person's standard residual z produced in response to that item.

The observed response pattern of Person 12M in Figure 7.9.1 shows how the z
statistic indicates misfit. Item 3 has a z = -5.5 while the other items have z's near their
expected value of zero. The effect of Item 3 upon the response pattern of Person 12M
can be highlighted by considering the two alternative patterns given in Table 7.9.1 and
Figure 7.9.1.

In alternative pattern A we retain a score of five by exchanging the correct response
of "1" to Item 8, a relatively hard item, with the incorrect response of "0" to Item 3, the
easiest item attempted. Now we have the pattern

                                                     Score
  1  1  1  1  1  0  0  0  0                            5
[Table 7.9.1. Diagnosing "sleeping": the response x, probability p, residual z and squared residual z² for each item in Person 12M's observed pattern and in the two alternative patterns A and B. The body of this table is not legible in this copy.]
FIGURE 7.9.1

DIAGNOSING "SLEEPING"

[Standard residuals z plotted against item difficulty on the logit scale for Person 12M's observed pattern (b = -2.8, t = 4.9) and for Alternative B (t = 0.1); the plotted points are not legible in this copy.]

TABLE 7.9.2

RESIDUAL ANALYSIS FOR "SLEEPING" PATTERN OF PERSON 12M

  Case                 Sum of      Mean       Fit
  Description          Squares     Square     Statistic
                         Σz²          v           t

  Person 12M             35.0        4.4         4.9*
  (b = -2.8)

  Alternative A           4.6        0.6        -1.0

  Alternative B           8.5        1.1         0.1

* Misfit Signal

z² = exp[(2x - 1)(d - b)]

v = Σ z²/(L - 1)

t = [ln(v) + (v - 1)] [(L - 1)/8]^½

The misfit statistics for these three patterns are summarized in Table 7.9.2. There we see
that Alternative A has a t = -1.0 instead of 12M's t = 4.9.

In Alternative pattern B we exchange the correct response of "1" to Item 5 with the
incorrect response of "0" to Item 15, the hardest item in the segment. This produces the
alternate response pattern

                                                     Score
  1  1  1  1  0  0  0  0  1                            5

Interestingly enough, the misfit for the exchange in pattern B is small, only t = 0.1. This is
because Item 15, with difficulty d = -1.5, is not as hard in relation to Person 12M's
ability of b = -2.8 as Item 3, with difficulty -6.2, is too easy.

In Tables 7.9.3 and 7.9.4 and Figure 7.9.2 we illustrate "sleeping" and "guessing"
response patterns using the observed record of Person 88M. To change his response
pattern to a sleeping pattern we replace his correct responses to two easy items with
incorrect responses and shift these two correct responses to Items 17 and 21, thus keep-
ing the score r = 6. Now we have the response pattern

                                                     Score
  0  0  1  1  1  1  1  1  0  0  0                      6

which is characteristic of sleeping. This pattern earns t = 9.1 in Table 7.9.4.

[Table 7.9.3. "Sleeping" and "guessing" response patterns: the item-by-item x, p, z and z² values for Person 88M's observed record and for the altered "sleeping" and "guessing" versions of it. The body of this table is not legible in this copy.]

TABLE 7.9.4

RESIDUAL ANALYSIS FOR
"SLEEPING" AND "GUESSING" RESPONSE PATTERNS

  Case                    Sum of      Mean       Fit
  Description             Squares     Square     Statistic
                            Σz²          v           t

  Person 88M                 7.3        0.7        -0.7
  (b = 3.2)

  "Sleeping" Pattern        71.6        7.2         9.1*

  "Guessing" Pattern        43.0        4.3         5.3*

* Misfit Signal

z² = exp[(2x - 1)(d - b)]

v = Σ z²/(L - 1)

t = [ln(v) + (v - 1)] [(L - 1)/8]^½

To make a guessing pattern we rearrange responses to form

                                                     Score
  1  1  1  1  0  0  0  0  0  1  1                      6

for which t = 5.3 in Table 7.9.4. Figure 7.9.2 compares the previously acceptable response
pattern of 88M with these alternative unacceptable response patterns characteristic of
sleeping and guessing.

A sleeping pattern particularly characteristic of "fumbling" is illustrated in Tables
7.9.5 and 7.9.6 and Figure 7.9.3 where the acceptable response pattern of Person 29M
has been altered to show incorrect responses on the first four items of his test, namely
Items 3 through 6. While the effect of one incorrect response among these first four
items does not produce significant misfit, as seen in the observed pattern for Person
29M, if we make all four items incorrect to illustrate "fumbling" the misfit becomes a
significant t = 24.1.
expected by the model is missing!
[Table 7.9.5. "Fumbling" and "plodding" response patterns: the item-by-item x, p, z and z² values for Person 29M's observed record and for the altered "fumbling" and "plodding" versions of it. The body of this table is not legible in this copy.]

TABLE 7.9.6

RESIDUAL ANALYSIS FOR
"FUMBLING" AND "PLODDING" RESPONSE PATTERNS

  Case                    Sum of      Mean       Fit
  Description             Squares     Square     Statistic
                            Σz²          v           t

  Person 29M                17.6        1.1         0.3
  (b = -0.9)

  "Fumbling" Pattern       244.7       15.3        24.1*

  "Plodding" Pattern         9.0        0.6        -1.3

* Misfit Signal

z² = exp[(2x - 1)(d - b)]

v = Σ z²/(L - 1)

t = [ln(v) + (v - 1)] [(L - 1)/8]^½


FIGURE 7.9.3

"FUMBLING" AND "PLODDING" RESPONSE PATTERNS

[Plot not legible in this copy; items arranged in sequence order.]

7.10 CORRECTING A MEASURE

When we detect significant misfit in a response record, diagnose the response pattern
and identify possible reasons for its occurrence, it is finally necessary to decide if an im-
proved measure can or should be determined. Whether such a statistically "corrected"
measure is fair for the person or proper in such circumstances cannot be settled by statis-
tics. However, knowing how a measure might be objectively corrected can give us a better
understanding of the possible meaning in a person's performance.

We have identified the implausibility of the response of Person 12M to the first item
in his test segment given in Tables 7.9.1 and 7.9.2. Were we to decide that this particular
response was not typical of Person 12M, we might delete the incorrect response to Item
3 and compute a new ability estimate based on his responses to the remaining eight items.
This new calculation of his ability measure is given in Tables 7.10.1 and 7.10.2. The
corrected measure b' = -2.2 puts Person 12M about 0.6 logits higher on the KCT variable.
Figure 7.10.1 shows the effect of this correction on the fit of Person 12M with t' = -0.8
instead of t = 4.9.

For Person 12M we now have two ability estimates, one at b = -2.8 and one at
b' = -2.2. Which one we decide is the best estimate depends upon how we evaluate the
response of Person 12M to Item 3. If we think that this response is implausible and that
it is very likely that he would get Item 3 correct, were he to try it again, then we might
use the corrected b' = -2.2 as his measure. However, if we think, instead, that Person 12M
got Item 3 incorrect because of a significant lapse in functioning, then we might consider
the b = -2.8 as better reflecting his position on the KCT variable. Clinical experience with
the KCT variable supports the probability that this lapse is indeed an indicator of im-
paired functioning and that his incorrect response to Item 3 could be an important ele-
ment in his evaluation. Consequently, in this case we might well choose the uncorrected
measure of b = -2.8.

In Tables 7.10.3 and 7.10.4 and Figure 7.10.2 we show the correction of a typical
"guessing" pattern. The person's responses to successively more difficult items show four
correct responses followed by five incorrect responses and then by two correct ones! This
response pattern has a significant misfit of t = 5.3. We must ask whether the ability
estimate b = 3.2 is a good indicator of this person's position on the KCT variable. Given
this person's string of five incorrect responses prior to his last two correct ones, we might
compute a new estimate with these last two surprising responses removed from the
record. With this new truncated pattern b' = 1.7 and t' = -1.2. Statistical analysis alone
cannot tell which estimate is more appropriate, but it can detect and arrange the available
information into a concise and objective summary for us to use as part of our evaluation
of the person.

Persons who guess may succeed on difficult items more often than their abilities
would predict, especially on multiple choice items. This makes them appear more able,
especially when many items are too difficult for them, because their frequency of success
does not decrease as item difficulty increases. A similar but opposite effect occurs when
able persons become careless with easy items, making these persons appear less able.

Item responses affected by guessing or carelessness actually reflect the simultaneous
influence of two variables. There is the ability to be measured, and in addition, there is
the tendency to guess or to become careless. The "guessingness" of the item may or may
not be a simple function of its difficulty on the main variable or, if a multiple choice
[Table 7.10.1. Correcting the measure of Person 12M for "sleeping": his responses after applying the "sleeping" correction rule, delete items with d < (b - 2), i.e. d < -2.8, and the recalculation of his measure. The body of this table is not legible in this copy.]

[Table 7.10.2. Residual analysis of a corrected "sleeping" pattern. The body of this table is not legible in this copy.]
FIGURE 7.10.1

CORRECTING THE MEASURE OF PERSON 12M
FOR "SLEEPING"

[Pattern observed for Person 12M (b = -2.8, t = 4.9) and corrected pattern for Person 12M (b' = -2.2, t' = -0.8), residuals plotted on the logit scale; the plotted points are not legible in this copy.]
[Table 7.10.3. Correcting the measure for a "guessing" pattern. The body of this table is not legible in this copy.]
[Table 7.10.4. Residual analysis of a corrected "guessing" pattern. The body of this table is not legible in this copy.]
FIGURE 7.10.2

CORRECTING A "GUESSING" PATTERN

[Plot not legible in this copy.]
188 BEST TEST DESIGN

item, of its distractors. For the person being measured, however, two quite different
variables are involved. One is their ability, the other is their inclination to guess or their
carelessness. The measurement of either variable is threatened by the presence of the other.

In situations where we think that guessing may be influenced by test format as, for
example, when we think a person may guess at random over m multiple-choice alterna-
tives, we could use the guessing probability of 1/m as a threshold below which we sup-
pose guessing to occur. To guard our measures against this kind of guessing we can then
delete all items from a response record which have difficulty greater than b + ln(m - 1),
where b is the person's initial estimated ability. After these deletions we reestimate the
person's ability from the remaining items attempted. If we do this, we are taking the
position that when items are so difficult that a person can do better by guessing than by
trying, then such items should not be used to estimate the person's ability.

In Tables 7.10.5 and 7.10.6 we show a “ fumbling” pattern and its correction. Here
we have an increasingly difficu lt segment o f 17 items and a response pattern beginning
with four incorrect responses follow ed by ten correct responses and then three incorrect
responses. The pattern seems implausible and significant m isfit is identified in t = 23.1.
Some extraneous factor seems to be influencing the first four responses. It could be a
problem o f test administration procedures, or o f the examinee’s test behavior. A cor­
rected response pattern could be form ed by deleting the first four incorrect responses and
considering only the continuous segment o f correct responses and the three incorrect
responses which fo llo w them.

The corrected responses resulting from this change show a "plodding" pattern with
t = -3.0. This pattern produces a considerably higher ability b' = 1.1 than the original
b = -0.9. No final decision can be made on this problem, however, until sufficient clinical
or behavioral information is gathered to clarify the meaning of those first four unex-
pected incorrect responses.

To summarize the statistical aspects of our correction strategy (a sketch of these rules in code follows the list):

a. When the majority of unexpected responses are "incorrect" and t > 3,
   then delete all the "too easy" items di < (bv - 2).

   1. Compute the new ability estimate after the deletion of these "too easy" items.
   2. Make another analysis of fit.

b. When the majority of unexpected responses are "correct" and t > 3,
   then delete all the "too hard" items di > [bv + ln(m - 1)], where m is the number
   of alternatives.

   1. Compute the new ability estimate after the deletion of the "too hard" items.
   2. Make another analysis of fit.
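The fragment below is only a minimal sketch of rules (a) and (b), not a procedure given in the text: the function name, its arguments and the crude way "unexpected" responses are counted (any failure below the person, any success above the person) are our illustrative assumptions standing in for the residual analysis of Chapter 7.

```python
import math

def correct_record(difficulties, responses, b, t, m=None):
    """Sketch of the 'sleeping'/'guessing' correction rules for one response record.

    difficulties : item difficulties di in logits
    responses    : 1 = correct, 0 = incorrect
    b            : the person's initial ability estimate bv in logits
    t            : the standardized fit statistic for the record
    m            : number of multiple-choice alternatives (needed for rule b)
    Returns the (difficulty, response) pairs kept for re-estimation.
    """
    record = list(zip(difficulties, responses))
    if t <= 3:                               # the record fits well enough; keep everything
        return record

    surprising_failures = sum(1 for d, x in record if x == 0 and d < b)
    surprising_successes = sum(1 for d, x in record if x == 1 and d > b)

    if surprising_failures >= surprising_successes:
        # rule (a): drop the "too easy" items, di < bv - 2
        return [(d, x) for d, x in record if d >= b - 2]
    else:
        # rule (b): drop the "too hard" items, di > bv + ln(m - 1)
        return [(d, x) for d, x in record if d <= b + math.log(m - 1)]
```

After either deletion the ability would be re-estimated from the items that remain and the analysis of fit repeated, as steps 1 and 2 of each rule require.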
(Tables 7.10.5 and 7.10.6: a "fumbling" pattern, its correction, and the residual analysis of the corrected "fumbling" pattern)
8 CHOOSING A SCALE

8.1 INTRODUCTION

Logits are the units of measurement we have used thus far. These units flow directly
from the logistic response model which specifies the estimated probability of a correct
response by person v to item i as

pvi = exp(bv - di) / [1 + exp(bv - di)]

where bv is the estimated ability of person v and di is the estimated difficulty of item i.
It follows that the odds for a correct response are

pvi / (1 - pvi) = exp(bv - di)

from which the natural log odds for a correct response becomes

ln [pvi / (1 - pvi)] = bv - di .

These log odds are called "logits" and so differences among items and persons are ini-
tially in logit units.
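As a quick numerical illustration of these three relations (the function names below are our own and are not part of any program described in earlier chapters):

```python
import math

def rasch_probability(b, d):
    """Probability of a correct response: exp(b - d) / [1 + exp(b - d)]."""
    return math.exp(b - d) / (1.0 + math.exp(b - d))

def rasch_odds(b, d):
    """Odds for a correct response: p / (1 - p) = exp(b - d)."""
    return math.exp(b - d)

def logit(p):
    """Natural log odds recovered from a probability p."""
    return math.log(p / (1.0 - p))

# A person 1.1 logits above an item succeeds about three times in four.
p = rasch_probability(1.1, 0.0)
print(round(p, 2), round(rasch_odds(1.1, 0.0), 1), round(logit(p), 1))   # 0.75 3.0 1.1
```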

The choice of a unit is entirely arbitrary, but it is absolutely necessary that some
unit be chosen. While it is possible to continue to use the initial logits as the units of
measurement, this has two disadvantages. Logits involve both negatives and decimals,
numerical characteristics which might make them unnecessarily confusing.

The KCT logit scale, for example, extends from -5.8 to +5.2. At the test lengths
presently available, standard errors of measurement can be as low as 0.6 logits. We could
add a constant such as 10 to do away with the negatives, but we could not avoid deci-
mals by rounding KCT measures in logits to the nearest integer. That rounding would
produce a least noticeable difference of almost two standard errors and so could ob-
literate differences in measures which might be meaningful. Were we to transform the
logit scale by first multiplying each value on the scale by 10 and then adding 100, how-
ever, we would have a new scale of measures from 42 to 152 which would convey the
same information as the initial logit scale but be free from negatives and decimals.

T o create a new scale that is free from the inconvenience o f decimals we must
m ultiply the logits by a “ spacing” factor large enough so that rounding the new units
to the nearest integer does n ot leave behind any useful inform ation. Once this spacing
factor is chosen and the unit o f our new scale is determined, we can then add a “ location”
factor to these new integer units that is large enough so that the lowest possible value
that can occur is greater than zero. The new scale is defined by determining these tw o
factors. Th e multiplicative factor establishes the spacing, or units, o f the scale. The addi­
tive factor establishes the location, or origin, o f the scale.

The choice o f an additive factor which locates all possible values above zero is usu­
ally easy. T h e choice o f a m ultiplicative factor, however, is worth further consideration.
If we want to work in integer units, then we must arrange matters so that any differences
on our new scale smaller than one integer will be meaningless. This requires us to investi-
gate the size of a least meaningful difference.

In addition to determining a least meaningful difference we may also wish to mark


easy to remember points like 50, 100 or 500 on our new scale, either at important sub­
stantive criteria along the variable or at the typical location o f a normative reference
group. It is even possible that we will find it useful to relate our new scale unit directly
to the probabilities fo r success predicted by the response model. We may, fo r example,
want to pinpoint movement through memorable response probabilities like .10, .25, .50,
.75 and .90 with regular increments o f 5 ,1 0 , 20, or 25 along our new scale.

Thus, in addition to removing unnecessary negatives and decimals by adding a con­


stant and establishing a least meaningful unit larger than one, we may also organize our
scale around normative, substantive or response probability considerations.

8.2 FORMULAS FOR MAKING NEW SCALES

In order to be explicit about how a new scale is determined, we will express its
definition as the linear transformation y = a + γx in which x is the logit scale, y is the new
scale, a is the location factor for determining the new scale origin and γ is the spacing fac-
tor for determining the new scale unit. We make this transformation linear because we
want to preserve the interval characteristics of the logits produced by the Rasch model.

Our new measures B and new calibrations D can be expressed in terms of their logit
counterparts b and d as

B = a + γb   for persons                       [8.2.1]

D = a + γd   for items .                       [8.2.2]

The new standard errors of measurement and calibration are

SE(B) = γ SE(b)                                [8.2.3]

SE(D) = γ SE(d)                                [8.2.4]

This shows how the nature of the new scale depends on the values for a and γ chosen to
define it.
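A sketch of Equations 8.2.1 through 8.2.4, assuming the location factor a and spacing factor γ have already been chosen; the 42-to-152 example simply repeats the KCT illustration of Section 8.1.

```python
def rescale(logit_value, a, gamma):
    """Equations 8.2.1 and 8.2.2: new-scale value of a logit measure or calibration."""
    return a + gamma * logit_value

def rescale_error(logit_se, gamma):
    """Equations 8.2.3 and 8.2.4: the standard error is stretched by gamma alone."""
    return gamma * logit_se

# With a = 100 and gamma = 10, KCT logit measures of -5.8 and +5.2 become 42 and 152,
# and a 0.6 logit standard error becomes 6 new-scale units.
print(round(rescale(-5.8, 100, 10)), round(rescale(5.2, 100, 10)))   # 42 152
print(round(rescale_error(0.6, 10), 1))                              # 6.0
```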

In passing let us appreciate again that person ability and item difficu lty mark loca­
tions on one com mon variable. In constructing this variable we necessarily work with the
calibrations o f the items which define it. However, when we use the variable to measure
persons we then w ork with their measures along the variable defined by these items. What
a measure tells about a person is the d ifficu lty level o f the items on which that person is
likely to succeed half the time. In the same way, what a calibration tells about an item is
the ability level o f persons who are likely to succeed on that item half the time. Thus,
were we n ot reserving the terms “ measure” to refer to the location o f persons and “ cali­
bration” to refer to the location o f items, we could as well speak o f item difficulty as the
measure o f the item and o f person ability as the calibration o f the person.

8.3 THE LEAST MEASURABLE DIFFERENCE

We want to free our new scale from decimals, but we do not want to obliterate use-
ful information. As a result, we need to determine the least measurable difference LMD
on our logit scale so that we can choose a spacing factor γ that brings this logit LMD to
at least one integer on our new scale. The nearest any two persons can be in observed
scores, without being the same, is one score apart. This is the least observable difference
LOD. We need to transform this LOD into its corresponding LMD in logits of ability.

Logit ability b comes from score r through the response model expectation

r = Σi exp(br - di)/[1 + exp(br - di)]

As a result LMD must follow LOD at the rate ∂b/∂r by which scores of r produce mea-
sures of b, that is

LMD ≈ (∂b/∂r) LOD

In order to standardize observations with regard to test length L we will general-
ize from raw score r to relative score f = r/L. Then with LOD = 1 in score r and f = r/L we
have LOD = 1/L in relative score f, giving

LMD ≈ (∂b/∂f) LOD = (∂b/∂f)(1/L)               [8.3.1]

The Rasch response model gives us the expected relation between relative score f
and estimated response probability pfi of

f = Σi pfi / L

in which pfi = exp(bf - di)/[1 + exp(bf - di)]

Thus, the rate at which relative score f produces b is

∂b/∂f = [Σi pfi(1 - pfi)/L]⁻¹ = Cfw

which turns out to be the error coefficient Cfw discussed in Chapters 6 and 7.

This coefficient is subscripted by test width w as well as relative score f because, as
we learned in Chapter 6, the exact values of this coefficient depend not only on the
relation between test difficulty level and person ability expressed in relative score f = r/L,
but also on the width in difficulty covered by the test. This gives us a least measurable
difference of

LMD ≈ (∂b/∂f)(1/L) = Cfw/L                     [8.3.2]

The way Cfw, and hence LMD, varies with b is pictured in Figure 8.3.1. As the ability
measure b moves away from test center and/or the test operating curve flattens, the
LMD becomes larger. Fortunately the range of values which Cfw will have in practice is
limited.

When 1/8 < pfi < 7/8,   i = 1, L

then 4 < Cfw < 9

and Cfw = 6

can be used as a convenient single working value for Cfw (see Table 6.8.1 for details).

This gives us as a working definition of the least measurable difference

LMD = 6/L                                      [8.3.3]

and implies a spacing factor

γLMD ≥ L/6 .
FIGURE 8.3.1

DETERMINING THE LEAST MEANINGFUL DIFFERENCE
(relative score f = r/L plotted against measure b)

The LMD approximates the smallest possible meaningful unit since it stems from the
least observable difference. However, from an estimation point of view, we might con-
sider instead that one standard error of measurement SEM is actually the least "believable"
difference. In logits the SEM is related to the LMD as

SEM = (LMD)^½ = Cfw^½ / L^½

which suggests the working value

SEM = 2.5/L^½                                  [8.3.4]

as an alternate basis for determining the spacing factor

γSEM ≥ L^½/2.5

As long as there are more than six items in our test the SEM determines a smaller
γ than the LMD since

L^½/2.5 < L/6 when L > 6.

An SEM-based scale, which might be simpler numerically, however, will also be somewhat
less discriminating in its integer increment than an LMD-based scale. Which choice is
preferable in any particular situation cannot be settled by statistical considerations. The
choice will inevitably depend on the use to which the measures are to be put.
Finally we might consider the least significant difference between independent
measures, whether replications of the same person or comparisons of different persons, as
an upper limit on how crude we could allow our new scale to become. To determine this
least significant difference LSD we take

LSDab = (SEMa² + SEMb²)^½ ≈ (2 SEM²)^½

and arrive at the working value

LSD = 1.4 SEM = 3.5/L^½                        [8.3.5]

which produces a minimum spacing factor

γLSD ≥ L^½/3.5

As long as the number of items in our test is greater than six, the relative magnitudes
of these bases for determining a lower limit for the spacing factor are

LMD < SEM < LSD

and so the spacing factors they determine are ordered

γLMD > γSEM > γLSD
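A sketch of the three working values and the minimum spacing factors they imply, with the convenient working value Cfw = 6 built in as in Equations 8.3.3 through 8.3.5 (the function name is ours):

```python
import math

def spacing_candidates(L):
    """Working LMD, SEM and LSD in logits for a test of L items, each paired with
    the minimum spacing factor (its reciprocal) needed for an integer scale."""
    lmd = 6.0 / L                     # least measurable difference, Equation 8.3.3
    sem = 2.5 / math.sqrt(L)          # standard error of measurement, Equation 8.3.4
    lsd = 3.5 / math.sqrt(L)          # least significant difference, Equation 8.3.5
    return {"LMD": (lmd, 1.0 / lmd),  # gamma_LMD >= L/6
            "SEM": (sem, 1.0 / sem),  # gamma_SEM >= L^(1/2)/2.5
            "LSD": (lsd, 1.0 / lsd)}  # gamma_LSD >= L^(1/2)/3.5

# For the 23-item KCTB this gives differences of 0.26, 0.52 and 0.73 logits and minimum
# spacing factors of about 3.8, 1.9 and 1.4 (roughly the 4, 2 and 1.5 quoted in the text).
for name, (difference, gamma) in spacing_candidates(23).items():
    print(name, round(difference, 2), round(gamma, 1))
```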

Figure 8.3.2 shows the relationships between LMD, SEM and LSD in logits for the
KCTB test of 23 items. The items, their logit values, score equivalents and LOD's at 3 to
4, 12 to 13, 18 to 19 and 20 to 21, along with their corresponding exact LMD's, SEM's
and LSD's, are shown.

We can compare the exact values in Figure 8.3.2 with the approximations of Equa-
tions 8.3.3, 8.3.4 and 8.3.5.

                                               Minimum γ Implied

LMD = 6/L     = 6/23    = 0.26                        4
SEM = 2.5/L^½ = 2.5/4.8 = 0.52                        2
LSD = 3.5/L^½ = 3.5/4.8 = 0.73                        1.5

Because the 13 logit KCTB is unusually wide, these approximations are smaller than the
exact values given in Figure 8.3.2. The minimum LMD spacing factor γ indicated by the
exact values would be about 2 while the approximations could lead to a minimum γ of 4.
Since we would only be in danger of losing information if the approximations led us to a
γ of less than 2, we see that even in this extreme situation the approximations do not
mislead us.

8.4 DEFINING THE SPACING FACTOR

Once we have defined a least meaningful difference in logits, whether it be the least
measurable difference LMD(b) = 6/L to maintain maximum observability or the standard
error of measurement SEM(b) = 2.5/L^½ and its least significant difference LSD(b) = 3.5/L^½
to maintain statistical reliability, we can use this least meaningful difference to establish a
spacing factor which will make all interpretable differences on our new scale greater than
one.

If our aim is to make the least measurable difference in the new scale LMD(B) ≥ 1,
then since γ = LMD(B)/LMD(b), it follows as in Equation 8.3.3 that

γLMD ≥ L/6                                     [8.4.1]
is the spacing factor which guarantees that no observable differences will be obliterated
by rounding to the nearest integer.

FIGURE 8.3.2

LEAST MEANINGFUL LOGIT DIFFERENCES FOR THE KCTB TEST

Were we interested instead in keeping the spacing factor γ as small as possible, in
order to prevent the presentation of scale differences which are statistically unreliable, we
might set γ at 1/SEM(b) or even 1/LSD(b), that is

γSEM = L^½/2.5                                 [8.4.2]

or γLSD = L^½/3.5                              [8.4.3]

Often, however, there will be other considerations which will lead us to allow γ to
become even larger than L/6 in order to reach memorable scale intervals like 5, 10, 20,
25, 50 or 100.

To get a rough idea as to typical useful values of γ, we list in Table 8.4.1 values for
the least meaningful differences which go with various test lengths. In Table 8.4.1 we see
that we would seldom be satisfied with a spacing factor less than 5 and seldom need one
larger than 100. Table 8.4.1 suggests that we could work satisfactorily with

γ = 5 for short classroom tests of 20 or 30 items,

γ = 10 for typical unit tests of 50 to 60 items

and γ = 20 or 25 for longer tests of 120 to 150 items.

Only for tests of unusual length, such as 1,000 item examinations, would we want γ = 100.

TABLE 8.4.1

THE RELATION BETWEEN SPACING FACTORS AND TEST LENGTH

Test        Least Meaningful Difference               Approximate Spacing Factor γ
Length                                                to Reach an Integer Scale
L           LMD 6/L     SEM 2.5/L^½    LSD 3.5/L^½    γLMD      γSEM      γLSD

30          0.20        0.46           0.64           5         2         2
60          0.10        0.32           0.45           10        3         2
120         0.05        0.23           0.32           20        4         3
150         0.04        0.20           0.29           25        5         3
300         0.02        0.14           0.20           50        7         5
600         0.01        0.10           0.14           100       10        7
1200        0.005       0.07           0.10           200       15        10

After we decide on γ we apply it to our person b's and item d's to place them on
our new scale as B's and D's. While the relation between LMD and SEM in logits of
LMD(b) = [SEM(b)]² is easy to remember, their relation in the new scale also involves γ.
Since

LMD(B) = γ LMD(b)

SEM(B) = γ SEM(b)

but LMD(b) = [SEM(b)]²

it follows that in our new scale

LMD(B) = [SEM(B)]²/γ                           [8.4.4]

SEM(B) = [γ LMD(B)]^½                          [8.4.5]

For example, suppose in order to rescale the KCTB shown in Figure 8.3.2 we chose
γ = 5. Then although in logits

LMD(b) = [SEM(b)]²

in our new scale

LMD(B) = [SEM(B)]²/5

and SEM(B) = [5 LMD(B)]^½

Thus while an SEM(b) of 0.75 goes with an LMD(b) of 0.75² = 0.5625, when γ = 5 then

LMD(B) = 5 x 0.5625 = 2.81

but SEM(B) = 5 x 0.75 = 3.75 = (5 x 2.81)^½
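A numerical check of these new-scale relations for the example just given (γ = 5, SEM(b) = 0.75):

```python
gamma = 5.0
sem_b = 0.75                 # standard error of measurement in logits
lmd_b = sem_b ** 2           # LMD(b) = [SEM(b)]^2 = 0.5625 logits

lmd_B = gamma * lmd_b        # 2.81 on the new scale
sem_B = gamma * sem_b        # 3.75 on the new scale

# Equations 8.4.4 and 8.4.5 hold on the new scale:
assert abs(lmd_B - sem_B ** 2 / gamma) < 1e-9
assert abs(sem_B - (gamma * lmd_B) ** 0.5) < 1e-9
print(round(lmd_B, 2), round(sem_B, 2))   # 2.81 3.75
```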

8.5 NORMATIVE SCALING UNITS: NITS

If we want our scale to be based on a normative reference, we can use the observed
logit mean m and logit standard deviation s of the elected norming sample as factors in a
preliminary transformation d' = (d - m)/s and b' = (b - m)/s which puts the norming
group mean at zero and the scale unit at one normative standard deviation.

After this preliminary step, we then choose a spacing factor large enough so that
meaningful differences become greater than one and at a value which pegs the normative
standard deviation at some easy to remember unit such as 10, 20, 50 or even 100. At the
same time we choose the location factor a so that the mean of the norming group is also
easy to recall, for example at 50, 100 or 500.

Thus to create a norm based scale of normative units or NITs, we use for persons

B = a + γ(b - m)/s                             [8.5.1]

and for items

D = a + γ(d - m)/s

We then have on the new NITs scale the norming mean M = a and the norming standard
deviation S = γ.
Using the administration of the KCTB to the 68 persons older than 8 as norming
data, we have m = 1.3 and s = 1.9 in logits. If we now choose a = 50 and γ = 10, we have
a new NITs scale on which

B = 50 + 10(b - 1.3)/1.9                       [8.5.2]
  = 50 - 10(1.3/1.9) + 10b/1.9
  = 43.2 + 5.3b
and
D = 43.2 + 5.3d .

Notice that with this scale definition, the normative mean of m = 1.3 logits becomes

M = 43.2 + 5.3(1.3) = 50 NITs

If we now set b at

m + s = 1.3 + 1.9 = 3.2

then
M + S = 43.2 + 5.3(3.2) = 60 NITs
so that
M + S - M = S = 60 - 50 = 10 NITs

is the normative standard deviation on the new NIT scale.
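A sketch of the NIT transformation 8.5.1 with the norming values just quoted (m = 1.3, s = 1.9, a = 50, γ = 10); the function name is ours:

```python
def to_nits(logit_value, m=1.3, s=1.9, a=50.0, gamma=10.0):
    """Equation 8.5.1: normative rescaling of a logit measure or calibration."""
    return a + gamma * (logit_value - m) / s

# The norming mean lands at 50 NITs and one normative SD above it at 60 NITs.
print(round(to_nits(1.3)), round(to_nits(1.3 + 1.9)))   # 50 60
```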

Figure 8.5.1 shows the distribution of the 68 norming persons. Below their distri-
bution are the ability measures for each score in logits and in NITs, and at the bottom are
the KCTB items which define the variable. In Figure 8.5.1 we see that

30 NITs → m - 2s,   40 NITs → m - s,   50 NITs → m,   60 NITs → m + s   and   70 NITs → m + 2s.

8.6 SUBSTANTIVE SCALING UNITS: SITS

We might instead choose to reference our new scale to substantive considerations
such as a basal and a competency level, an entry and an exit level or some other two-
position mastery hierarchy. To accomplish this we find the difficulty levels d1 and d2
on our logit scale which mark our choice of two criteria positions. Then we transform
these logits to the values D1 and D2 on a new substantive scale or SIT which positions
our criteria at easy to remember locations such as 50, 100 or 200.

If d1 and d2 identify the criteria positions in logits and D1 and D2 represent the
desired easy to remember positions of these criteria on the new scale, then

a = (D1d2 - D2d1)/(d2 - d1)

γ = (D2 - D1)/(d2 - d1)

and so

B = a + γb

D = a + γd

become

B = [(D1d2 - D2d1) + (D2 - D1)b]/(d2 - d1)     [8.6.1]

and

D = [(D1d2 - D2d1) + (D2 - D1)d]/(d2 - d1) .

In order to apply this method to the KCTB example, we will designate a basal level
at the 3-tap median of d1 = -3.4 logits and a competency level at the 5-tap median of
d2 = 1.4 logits. Then we will arrange to report these criteria at D1 = 30 for basal and
D2 = 50 for competency using

a = [30(1.4) - 50(-3.4)]/[1.4 - (-3.4)]
  = (42 + 170)/4.8
  = 44.2

and

γ = (50 - 30)/[1.4 - (-3.4)]
  = 20/4.8
  = 4.2

which defines our new substantive scale of SITs as

B = 44.2 + 4.2b                                [8.6.2]

and

D = 44.2 + 4.2d

This scaling transforms the 3-tap median at d1 = -3.4 logits to D1 = 44.2 + 4.2(-3.4) =
30 SITs and the 5-tap median at d2 = 1.4 logits to D2 = 44.2 + 4.2(1.4) = 50 SITs.
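A sketch of how two criterion positions determine the SIT scale factors of Equation 8.6.1 (the function name is ours):

```python
def criterion_scale(d1, D1, d2, D2):
    """Location and spacing factors that send logit criteria d1, d2 to scale values D1, D2."""
    gamma = (D2 - D1) / (d2 - d1)
    a = (D1 * d2 - D2 * d1) / (d2 - d1)
    return a, gamma

# KCTB example: 3-tap median (-3.4 logits) reported at 30, 5-tap median (1.4 logits) at 50.
a, gamma = criterion_scale(-3.4, 30.0, 1.4, 50.0)
print(round(a, 1), round(gamma, 1))                        # 44.2 4.2
print(round(a + gamma * (-3.4)), round(a + gamma * 1.4))   # 30 50
```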

In Figure 8.6.1 we show the K C T B items and a substantive definition o f the K C T


variable b y marking the scale positions o f each median number o f taps. The ability scores
in logits and in SITs are given b elow this substantive definition.

8.7 RESPONSE PROBABILITY SCALING UNITS: CHIPS

If we are interested in using our test to predict successful performance response
rates, then a useful scale for these response probability units or CHIPs might be one that
identified movement through the response probabilities of .10, .25, .50, .75 and .90 with
easy to remember multiples along the variable like 5, 10, 20 or 25.

From the response model

p = exp(b - d)/[1 + exp(b - d)]

we can determine the differences (b - d) between person ability and item difficulty which
lead to the response probabilities .10, .25, .50, .75 and .90. Solving for (b - d) in logits we
have

(b - d) = ln [p/(1 - p)]

and hence

                              Difference Between Person
Probability                   Ability and Item Difficulty
of Success                    in Logits
p                             b - d

.10                           -2.2
.25                           -1.1
.50                            0.0
.75                            1.1
.90                            2.2
FIGURE 8.6.1

SCALING THE KCTB IN SUBSTANTIVE UNITS: SITS

To determine a new scale in this manner we use

B = a + γ(b - c)                               [8.7.1]

in which

c = either a normative or a substantive choice of location on the logit scale

γ = an appealing multiple of 1/1.1 = 0.91, such as 5, 10, 20 or 25, leading to the
    γ values of 4.55, 9.1, 18.2 or 22.75

and

a = 50, 100 or 500.

For KCTB we could make a normative choice of c = 1.3 logits at the logit mean of
the norming group of 68 persons. We could also set γ = 4.55 giving us a CHIP spacing of
5 and choose a to locate the normative mean at 50. Then our CHIP scale formulation
becomes

B = 50 + 4.55(b - 1.3)                         [8.7.2]
  = 44.09 + 4.55b
  = 44.1 + 4.6b

Notice that when b is located at the mean of the norming group, then

B = 44.1 + 4.6(1.3) = 50 CHIPs.

Our choice of γ = 4.55 produces the following relations between the relative posi-
tions of a person at B and an item at D

                              Difference Between Person
Probability                   Ability and Item Difficulty
of Success                    in CHIPs
p                             B - D

.10                           -10
.25                           - 5
.50                             0
.75                             5
.90                            10

Thus we expect that when any person confronts any item 10 CHIPs below their ability
the probability for a successful response is about .90. At 5 CHIPs below, the predicted
success rate is .75. On the other side, if an item is 5 CHIPs more difficult than the person
is able, we expect the success rate to drop to .25 and, when the person is at a disadvantage
of 10 CHIPs, we expect success only .10 of the time.

Were we to decide on a substantive choice of scale location, we could use the KCT
5-tap median of 1.4 logits as our reference location instead of the norming sample mean
at 1.3 logits. Then our CHIP scale formulation would become

D = 50 + 4.55(d - 1.4)                         [8.7.3]
  = 43.63 + 4.55d
  = 43.6 + 4.6d

and so

B = 43.6 + 4.6b
Now when b is at the 5-tap median of 1.4 logits then

B = 43.6 + 4.6(1.4) = 50 CHIPs .
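A sketch of the CHIP formulation 8.7.1 and the success-rate reading it supports; the helper names are ours, and c = 1.3 with γ = 4.55 are the normative choices made above:

```python
import math

def to_chips(logit_value, c=1.3, a=50.0, gamma=4.55):
    """Equation 8.7.1: CHIP value of a logit measure or calibration."""
    return a + gamma * (logit_value - c)

def success_rate(B, D, gamma=4.55):
    """Predicted success probability for a person at B facing an item at D, both in CHIPs."""
    logit_difference = (B - D) / gamma
    return 1.0 / (1.0 + math.exp(-logit_difference))

# An item 5 CHIPs below the person is passed about .75 of the time,
# one 10 CHIPs above only about .10 of the time.
print(round(success_rate(50, 45), 2), round(success_rate(50, 60), 2))   # 0.75 0.1
```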

Table 8.7.1 brings together the logit, N IT , S IT and CHIP scales for the K CTB test.

TABLE 8.7.1

KCTB SCORES, MEASURES AND ERRORS IN LOGITS, NITS, SITS AND CHIPS

Test      Person Ability                         Measurement Error
Score     Logits    NITs    SITs    CHIPs        Logits    NITs    SITs    CHIPs

22         6.26      76      70      73           1.20       6       5       6
21         5.15      70      66      68           0.99       5       4       5
20         4.31      66      62      64           0.89       5       4       4
19         3.60      62      59      61           0.83       4       3       4
18         2.99      59      57      58           0.78       4       3       4
17         2.42      56      54      55           0.76       4       3       3
16         1.88      53      52      53           0.75       4       3       3
15         1.35      50      50      50           0.74       4       3       3
14         0.84      48      48      48           0.73       4       3       3
13         0.35      45      46      46           0.70       4       3       3
12        -0.10      43      44      44           0.67       4       3       3
11        -0.51      40      42      42           0.65       3       3       3
10        -0.90      38      40      40           0.63       3       3       3
 9        -1.28      36      39      38           0.62       3       3       3
 8        -1.65      34      37      37           0.62       3       3       3
 7        -2.02      32      36      35           0.63       3       3       3
 6        -2.42      30      34      33           0.65       3       3       3
 5        -2.85      28      32      31           0.69       4       3       3
 4        -3.33      26      30      29           0.74       4       3       3
 3        -3.90      23      28      26           0.82       4       3       4
 2        -4.65      19      25      23           0.95       5       4       4
 1        -5.75      13      20      18           1.23       7       5       6

[8.5.2]  NITs   B = 43.2 + 5.3b    SE(B) = 5.3 SE(b)
[8.6.2]  SITs   B = 44.2 + 4.2b    SE(B) = 4.2 SE(b)
[8.7.2]  CHIPs  B = 44.1 + 4.6b    SE(B) = 4.6 SE(b)

8.8 REPORTING FORMS

The use o f the Rasch m odel in test construction can facilitate test interpretation. We
illustrate this with a reporting form developed fo r the K C T variable.

Figure 8.8.1 provides a map o f the K C T variable. This map shows all o f the data
gathered thus far: the K C T B items positioned along the variable by their d ifficu lty levels,
the substantive criteria o f number o f taps, reverses and distance across blocks and the
norm ative inform ation o f median ages fo r children and mean and standard deviation fo r
adults. The map shows the exten t to which the K C T variable has been defined and how
various possible K C T measures relate to substantive and normative considerations. A K C T
report form can be developed from this map.

Figure 8.8.2 is a report form fo r interpreting individual performance on the KCTB.


This form , which could be used fo r a single individual or an entire class, shows the per­
form ance o f Persons 12M and 88M as w ell as a response record identical in score to 88M
but designed to show a “ sleeping” pattern o f several unexpected failures.

N o tice that Person 12M with his score o f 5 is located at -2 .8 logits on the K C T
variable. This puts him halfw ay between 3 and 4 taps substantively and at the 5 year old
median norm atively. Person 88M, however, at 3.0 logits is functioning at 6 taps sub­
stantively and at about one standard deviation above the adult mean normatively.

In many instances it will be useful to detect misfit immediately upon recording a
person's responses. The report form in Figure 8.8.2 is ideal for this purpose. Once we
have recorded the correct or incorrect response to each item at its position on the variable
and also the consequent position of the person on the same variable, misfit can be esti-
mated directly from this completed answer form by means of a Misfit Ruler.

Figure 8.8.3 shows a Misfit Ruler scaled in logits. It is marked to indicate the logit
deviations and a corresponding misfit index y² = (z² - 1) to the left and right of its
center. Notice that the unexpected response deviations of 1, 2, 3 and 4 logits indicate
y²'s of 2, 6, 19 and 54 respectively. By positioning the center of the ruler at the point
on the variable where the person is located and comparing the ruler's markings with the
person's response to each item we can calculate, at a glance, the misfit of the person's
record.

Whenever an unexpected response is observed, namely a "0" for an incorrect re-
sponse in the easy region to the left or a "1" for a correct response in the hard region to
the right, then the corresponding y²'s on the ruler are added to form their sum Q = Σy²
for just the unexpected pieces of the record. This sum Q divided by the square root of the
total number of items L on the test yields the misfit statistic

U = Q/L^½                                      [7.8.1]

which, if the record fits the response model, is distributed approximately

U ~ N(0, 2) .

This easy to calculate statistic can be used to evaluate misfit. When U > 5 the prob-
ability that the record is acceptable has dropped below .01, and it seems reasonable to
question the validity of the record.
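A sketch of the Misfit Ruler arithmetic, assuming item difficulties and the person's position are already in logits; flagging every surprising response this way is our simplification of reading y² values off the printed ruler:

```python
import math

def misfit_ruler_U(difficulties, responses, b):
    """Sum y^2 = exp|b - d| - 1 over unexpected responses and form U = Q / L^(1/2).

    An unexpected response is an incorrect answer ("0") to an item easier than the
    person or a correct answer ("1") to an item harder than the person.
    """
    L = len(difficulties)
    Q = 0.0
    for d, x in zip(difficulties, responses):
        if (x == 0 and d < b) or (x == 1 and d > b):
            Q += math.exp(abs(b - d)) - 1.0    # the y^2 marking on the ruler
    return Q / math.sqrt(L)

# Person 12M's one incorrect response 3.4 logits below his position contributes a y^2
# near 30, so U is about 6, flagging the record for further examination (U > 5).
print(round((math.exp(3.4) - 1.0) / math.sqrt(23), 1))   # 6.0
```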

In a practical application with a batch of records to evaluate, it is most reasonable
to begin with the record for which U is maximum and to see if the source of invalidity
can be identified and dealt with. Subsequent cases can then be handled in order of U
until all useful explanations of the invalidities implied by U > 5 are discovered.

FIGURE 8.8.1

KCT VARIABLE MAP

FIGURE 8.8.2

KCTB TEST INTERPRETATION FORM
(Items 1, 2, 26, 27 and 28 not included on this form)
until all useful explanations o f the invalidities im plied b y U > 5 are discovered.

Figure 8.8.2 shows the test segment of 23 KCTB items in order of increasing dif-
ficulty together with three response records. The response records show the correct and
incorrect responses on the 23 items. To evaluate the fit of Person 12M's record the center
of the Misfit Ruler is placed at the arrow marking his position at -2.8 logits determined
by his score of 5. His incorrect response to Item 3 at -6.2 logits produces a y² of about
30 for Q = 30 and U = 30/23^½ = 6.3. This corresponds to a t = 4.5 which is very close to
the more exact value of t given in Table 7.8.3. Again, we see that this response is too
improbable to be accepted as part of a valid measure of Person 12M.

The Misfit Ruler has also been applied to Person 88M and to the "sleeping" pattern
with the same score of 18. The pattern for Person 88M produces a response record that
yields a Σy² = 2, and U = 0.4. The "sleeping" pattern, however, produces a

Σy² = 46 + 20 + 2 + 3 = 71

and so a

U = 71/23^½ = 14.8 .

These results are summarized in Table 8.8.1

TABLE 8.8.1

QUICK ANALYSIS OF RESPONSE RECORD VALIDITY

                                    Sum of
                                    Unexpected                   Fit
Person                  Score       Responses                    Statistic
                                    Q = Σy²                      U = Q/L^½

12M                       5         30                            6.3*
88M                      18          2                            0.4
"Sleeping" Pattern       18         46 + 20 + 2 + 3 = 71         14.8*

L = 23                                                      *Misfit

FIGURE 8.8.3

MISFIT RULER

Unexpected "0's"              Person's Ability              Unexpected "1's"
Items too easy                                              Items too hard
to get incorrect                                            to get correct

       -4    -3    -2    -1                  Logits
54 42 32 25 19 15 11 8 6 5 3 2 2     y²     2 2 3 5 6 8 11 15 19 25 32 42 54
       .02   .05   .1    .2           P     .2    .1    .05   .02

HOW TO USE THE MISFIT RULER:

1. Position the items on the metric record form corresponding to the ruler metric.
2. Record the person's responses to the items on the record form.
3. Locate the person's ability position on the record form by counting score r and positioning it between the
   rth and the (r + 1)th item locations.
4. Place the center of the Misfit Ruler at the person's ability position.
5. Sum y² for all unexpected responses, "0's" to the left and "1's" to the right, to form Q.
6. Let L equal the total number of items.
7. Calculate the misfit statistic U = Q/L^½.
8. If U > 5 examine the person's record further for sources of invalidity.

P is the improbability of each response.

APPENDICES

TABLE A 211 & 212

TABLE B 213 & 214

TABLE C 215


TABLE A

RELATIVE ABILITY xfw FOR UNIFORM TESTS IN LOGITS

Relative Test Width w Relative


Score Score
f > .5 0 1 2 3 4 5 6 7 8 f < .5 0
.50 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 .50
.51 0.0 0.0 0.0 0.1 0.1 0.1 0.1 0.1 .49
.52 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 .48
.53 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.2 .47
.54 0.2 0.2 0.2 0.2 0.2 0.3 0.3 0.3 .46
.55 0.2 0.2 0.2 0.3 0.3 0.3 0.4 0.4 .45
.56 0.2 0.3 0.3 0.3 0.4 0.4 0.4 0.5 .44
.57 0.3 0.3 0.3 0.4 0.4 0.5 0.5 0.6 .43
.58 0.3 0.3 0.4 0.4 0.5 0.5 0.6 0.7 .42
.59 0.4 0.4 0.4 0.5 0.5 0.6 0.7 0.7 .41
.60 0.4 0.4 0.5 0.5 0.6 0.7 0.7 0.8 .40
.61 0.5 0.5 0.5 0.6 0.7 0.7 0.8 0.9 .39
.62 0.5 0.5 0.6 0.6 0.7 0.8 0.9 1.0 .38
.63 0.5 0.6 0.6 0.7 0.8 0.9 1.0 1.1 .37
.64 0.6 0.6 0.7 0.7 0.8 0.9 1.1 1.2 .36
.65 0.6 0.7 0.7 0.8 0.9 1.0 1.1 1.3 .35
.66 0.7 0.7 0.8 0.9 1.0 1.1 1.2 1.3 .34
.67 0.7 0.8 0.8 0.9 1.0 1.2 1.3 1.4 .33
.68 0.8 0.8 0.9 1.0 1.1 1.2 1.4 1.5 .32
.69 0.8 0.9 0.9 1.0 1.2 1.3 1.4 1.6 .31
.70 0.9 0.9 1.0 1.1 1.2 1.4 1.5 1.7 .30
.71 0.9 1.0 1.0 1.2 1.3 1.4 1.6 1.8 .29
.72 1.0 1.0 1.1 1.2 1.4 1.5 1.7 1.9 .28
.73 1.0 1.1 1.2 1.3 1.4 1.6 1.8 2.0 .27
.74 1.1 1.1 1.2 1.3 1.5 1.7 1.9 2.1 .26
.75 1.1 1.2 1.3 1.4 1.6 1.7 1.9 2.1 .25
.76 1.2 1.2 1.3 1.5 1.6 1.8 2.0 2.2 .24
.77 1.2 1.3 1.4 1.5 1.7 1.9 2.1 2.3 .23
.78 1.3 1.4 1.5 1.6 1.8 2.0 2.2 2.4 .22
.79 1.3 1.4 1.5 1.7 1.9 2.1 2.3 2.5 .21
.80 1.4 1.5 1.6 1.8 1.9 2.2 2.4 2.6 .20
.81 1.5 1.6 1.7 1.8 2.0 2.2 2.5 2.7 .19
.82 1.5 1.6 1.7 1.9 2.1 2.3 2.6 2.8 .18
.83 1.6 1.7 1.8 2.0 2.2 2.4 2.7 2.9 .17
.84 1.7 1.8 1.9 2.1 2.3 2.5 2.8 3.0 .16
.85 1.8 1.8 2.0 2.2 2.4 2.6 2.9 3.2 .15
.86 1.8 1.9 2.1 2.3 2.5 2.7 3.0 3.3 .14
.87 1.9 2.0 2.2 2.4 2.6 2.8 3.1 3.4 .13
.88 2.0 2.1 2.3 2.5 2.7 2.9 3.2 3.5 .12
.89 2.1 2.2 2.4 2.6 2.8 3.1 3.3 3.7 .11
.90 2.2 2.3 2.5 2.7 2.9 3.2 3.5 3.8 .10
.91 2.3 2.4 2.6 2.8 3.1 3.3 3.6 3.9 .09
.92 2.5 2.6 2.7 2.9 3.2 3.5 3.8 4.1 .08
.93 2.6 2.7 2.9 3.1 3.4 3.6 4.0 4.3 .07
.94 2.8 2.9 3.1 3.3 3.5 3.8 4.1 4.5 .06
.95 3.0 3.1 3.3 3.5 3.7 4.0 4.4 4.7 .05
.96 3.2 3.3 3.5 3.7 4.0 4.3 4.6 5.0 .04
.97 3.5 3.6 3.8 4.0 4.3 4.6 5.0 5.3 .03
.98 3.9 4.0 4.2 4.5 4.7 5.1 5.4 5.8 .02
.99 4.6 4.8 4.9 5.2 5.5 5.8 6.1 6.5 .01
f > .50: Measure bf = h + xfw                    f < .50: Measure bf = h - xfw

Test Score: r     Test Length: L     Relative Score: f = r/L

Test Height: h = Σdi/L     Test Width: w = [(dL + dL-1 - d2 - d1)/2] [L/(L - 2)]


TABLE A

RELATIVE ABILITY xfw FOR UNIFORM TESTS IN LOGITS
(Continued)

Relative Test W idth w Relative


Score Score
f > .5 0 8 9 10 11 12 13 14 15 f < .5 0
.50 0 .0 0 .0 0.0 0.0 0.0 0.0 0.0 0.0 .50
.51 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.2 .49
.52 0.2 0.2 0.2 0.2 0.2 0.3 0.3 0.3 .48
.53 0.2 0.3 0 .3 0.3 0.4 0.4 0.4 0.5 .47
.54 0.3 0.4 0.4 0.4 0.5 0.5 0.6 0.6 .46
.55 0.4 0.5 0.5 0.6 0.6 0.7 0.7 0.8 .45
.56 0 .5 0 .6 0.6 0.7 0.7 0.8 0.8 0.9 .44
.57 0 .6 0.6 0.7 0.8 0.8 0.9 1.0 1.1 .43
.58 0.7 0.7 0.8 0.9 1.0 1.0 1.1 1.2 .42
.59 0.7 0 .8 0.9 1.0 1.1 1.2 1.3 1.4 .41
.60 0 .8 0.9 1.0 1.1 1.2 1.3 1.4 1.5 .40
.61 0 .9 1.0 1.1 1.2 1.3 1.4 1.5 1.7 .39
.62 1.0 1.1 1.2 1.3 1.4 1.6 1.7 1.8 .38
.63 1.1 1.2 1.3 1.4 1.6 1.7 1.8 2.0 .37
.64 1.2 1.3 1.4 1.6 1.7 1.8 2.0 2.1 .36
.65 1.3 1.4 1.5 1.7 1.8 2.0 2.1 2.3 .35
.66 1.3 1.5 1.6 1.8 1.9 2.1 2.2 2.4 .34
.67 1.4 1.6 1.7 1.9 2.1 2.2 2.4 2.6 .33
.68 1.5 1.7 1.8 2.0 2.2 2.4 2.5 2.7 .32
.69 1.6 1.8 1.9 2.1 2.3 2.5 2.7 2.9 .31
.7 0 1.7 1.9 2.1 2.2 2.4 2.6 2.8 3.0 .30
.71 1.8 2.0 2.2 2.4 2.6 2.8 3.0 3.2 .29
.72 1.9 2.1 2.3 2.5 2.7 2.9 3.1 3.3 .28
.73 2.0 2.2 2.4 2.6 2.8 3.0 3.2 3.5 .27
.74 2.1 2.3 2.5 2.7 2.9 3.2 3.4 3.6 .26
.75 2.1 2.4 2.6 2.8 3.1 3.3 3.5 3.8 .25
.76 2.2 2.5 2.7 2.9 3.2 3.4 3.7 3.9 .24
.77 2.3 2.6 2.8 3.1 3.3 3.6 3.8 4.1 .23
.78 2.4 2.7 2.9 3.2 3.4 3.7 4 .0 4 .2 .22
.79 2.5 2.8 3.0 3.3 3.6 3.8 4.1 4.4 .21
.80 2.6 2 .9 3.1 3.4 3.7 4.0 4.3 4.6 .20
.81 2.7 3.0 3.3 3.5 3.8 4.1 4.4 4.7 .19
.82 2 .8 3.1 3.4 3.7 4 .0 4.3 4.6 4.9 .18
.83 2.9 3.2 3.5 3.8 4.1 4.4 4.7 5.0 .17
.84 3 .0 3.3 3.6 3.9 4 .2 4 .6 4.9 5.2 .16
.85 3.2 3.4 3.8 4.1 4.4 4.7 5.0 5.4 .15
.86 3.3 3.6 3 .9 4.2 4 .5 4.9 5.2 5.5 .14
.87 3.4 3.7 4 .0 4 .3 4.7 5.0 5.4 5.7 .13
.88 3.5 3.8 4.2 4 .5 4.8 5.2 5.5 5.9 .12
.89 3.7 4 .0 4.3 4 .6 5.0 5.3 5.7 6.1 .11
.90 3 .8 4.1 4.5 4 .8 5.2 5.5 5.9 6.3 .10
.91 3.9 4 .3 4.6 5.0 5.3 5.7 6.1 6.5 .09
.92 4.1 4 .4 4.8 5.2 5.5 5.9 6.3 6.7 .08
.93 4 .3 4 .6 5.0 5.4 5.7 6.1 6.5 6.9 .07
.94 4 .5 4 .8 5.2 5.6 5.9 6.3 6.7 7.1 .06
.95 4.7 5.1 5.4 5.8 6.2 6.6 7.0 7.4 .05
.96 5.0 5.3 5.7 6.1 6.5 6.9 7.3 7.7 .04
.97 5.3 5.7 6.1 6.4 6.8 7.2 7.7 8.1 .03
.98 5.8 6.1 6.5 6.9 7.3 7.7 8.1 8.6 .02
.99 6.5 6.9 7.3 7.7 8.1 8.5 8.9 9.3 .01
f > .50: Measure bf = h + xfw                    f < .50: Measure bf = h - xfw

Test Score: r     Test Length: L     Relative Score: f = r/L

Test Height: h = Σdi/L     Test Width: w = [(dL + dL-1 - d2 - d1)/2] [L/(L - 2)]


TABLE B

ERROR COEFFICIENT Cfw FOR UNIFORM TESTS IN LOGITS

Relative Test Width w Relative


Score Score
f > .5 0 1 2 3 4 5 6 7 8 f < .5 0
.50 2.0 2,1 2.2 2.3 2.4 2.6 2.7 2.9 .50
.51 2.0 2.1 2.2 2.3 2.4 2.6 2.7 2.9 .49
.52 2.0 2.1 2.2 2.3 2.4 2.6 2.7 2.9 .48
.53 2.0 2.1 2.2 2.3 2.4 2.6 2.7 2.9 .47
.54 2.0 2.1 2.2 2.3 2.4 2.6 2.7 2.9 .46
.55 2.0 2.1 2.2 2.3 2.4 2.6 2.7 2.9 .45
.56 2.0 2.1 2.2 2.3 2.4 2.6 2.7 2.9 .44
.57 2.0 2.1 2.2 2.3 2.4 2.6 2.7 2.9 .43
.58 2.0 2.1 2.2 2.3 2.4 2.6 2.7 2.9 .42
.59 2.1 2.1 2.2 2.3 2.5 2.6 2.7 2.9 .41
.60 2.1 2.1 2.2 2.3 2.5 2.6 2.7 2.9 .40
.61 2.1 2.1 2.2 2.3 2.5 2.6 2.8 2.9 .39
.62 2.1 2.1 2.2 2.3 2.5 2.6 2.8 2.9 .38
.63 2.1 2.1 2.2 2.4 2.5 2.6 2.8 2.9 .37
.64 2.1 2.2 2.2 2.4 2.5 2.6 2.8 2.9 .36
.65 2.1 2.2 2.3 2.4 2.5 2.6 2.8 2.9 .35
.66 2.1 2.2 2.3 2.4 2.5 2.6 2.8 2.9 .34
.67 2.1 2.2 2.3 2.4 2.5 2.7 2.8 2.9 .33
.68 2.2 2.2 2.3 2.4 2.5 2.7 2.8 3.0 .32
.69 2.2 2.2 2.3 2.4 2.6 2.7 2.8 3.0 .31
.70 2.2 2.3 2.3 2.4 2.6 2.7 2.8 3.0 .30
.71 2.2 2.3 2.4 2.5 2.6 2.7 2.8 3.0 .29
.72 2.2 2.3 2.4 2.5 2.6 2.7 2.9 3.0 .28
.73 2.3 2.3 2.4 2.5 2.6 2.7 2.9 3.0 .27
.74 2.3 2.3 2.4 2.5 2.6 2.8 2.9 3.0 .26
.75 2.3 2.4 2.5 2.6 2.7 2.8 2.9 3.0 .25
.76 2.4 2.4 2.5 2.6 2.7 2.8 2.9 3.1 .24
.77 2.4 2.4 2.5 2.6 2.7 2.8 3.0 3.1 .23
.78 2.4 2.5 2.6 2.6 2.8 2.9 3.0 3.1 .22
.79 2.5 2.5 2.6 2.7 2.8 2.9 3.0 3.1 .21
.80 2.5 2.6 2.6 2.7 2.8 2.9 3.1 3.2 .20
.81 2.6 2.6 2.7 2.8 2.9 3.0 3.1 3.2 .19
.82 2.6 2.7 2.7 2.8 2.9 3.0 3.1 3.2 .18
.83 2.7 2.7 2.8 2.9 3.0 3.1 3.2 3.3 .17
.84 2.7 2.8 2.9 2.9 3.0 3.1 3.2 3.3 .16
.85 2.8 2.9 2.9 3.0 3.1 3.2 3.3 3.4 .15
.86 2.9 2.9 3.0 3.1 3.2 3.3 3.4 3.4 .14
.87 3.0 3.0 3.1 3.2 3.2 3.3 3.4 3.5 .13
.88 3.1 3.1 3.2 3.3 3.3 3.4 3.5 3.6 .12
.89 3.2 3.2 3.3 3.4 3.4 3.5 3.6 3.7 .11
.90 3.3 3.4 3.4 3.5 3.6 3.7 3.7 3.8 .10
.91 3.5 3.5 3.6 3.7 3.7 3.8 3.9 3.9 .09
.92 3.7 3.7 3.8 3.8 3.9 4.0 4.0 4.1 .08
.93 3.9 4.0 4.0 4.1 4.1 4.2 4.3 4.3 .07
.94 4.2 4.2 4.3 4.3 4.4 4.5 4.5 4.6 .06
.95 4.6 4.6 4.7 4.7 4.8 4.8 4.9 4.9 .05
.96 5.1 5.1 5.2 5.2 5.3 5.3 5.4 5.4 .04
.97 5.9 5.9 5.9 6.0 6.0 6.0 6.1 6.1 .03
.98 7.1 7.2 7.2 7.2 7.3 7.3 7.3 7.4 .02
.99 10.1 10.1 10.1 10.1 10.1 10.2 10.2 10.2 .01

f > .50                                          f < .50

Test Score: r     Test Length: L     Relative Score: f = r/L

Test Height: h = Σdi/L     Test Width: w = [(dL + dL-1 - d2 - d1)/2] [L/(L - 2)]

Standard Error: sfw = Cfw / L^½


TABLE B

ERROR COEFFICIENT Cfw FOR UNIFORM TESTS IN LOGITS
(Continued)

Relative Test Width w Relative


Score
f > .5 0 8 9 10 11 12 13 14 15 f < .5 0
.50 2.9 3.0 3.2 3.3 3.5 3.6 3.7 3.9 .50
.51 2.9 3.0 3.2 3.3 3.5 3.6 3.7 3.9 .49
.52 2.9 3.0 3.2 3.3 3.5 3.6 3.7 3.9 .48
.53 2.9 3.0 3.2 3.3 3.5 3.6 3.7 3.9 .47
.54 2.9 3.0 3.2 3.3 3.5 3.6 3.7 3.9 .46
.55 2.9 3.0 3.2 3.3 3.5 3.6 3.7 3.9 .45
.56 2.9 3.0 3.2 3.3 3.5 3.6 3.7 3.9 .44
.57 2.9 3.0 3.2 3.3 3.5 3.6 3.7 3.9 .43
.58 2.9 3.0 3.2 3.3 3.5 3.6 3.7 3.9 .42
.59 2.9 3.0 3.2 3.3 3.5 3.6 3.7 3.9 .41
.60 2.9 3.0 3.2 3.3 3.5 3.6 3.7 3.9 .40
.61 2.9 3.1 3.2 3.3 3.5 3.6 3.8 3.9 .39
.62 2.9 3.1 3.2 3.3 3.5 3.6 3.8 3.9 .38
.63 2.9 3.1 3.2 3.3 3.5 3.6 3.8 3.9 .37
.64 2.9 3.1 3.2 3.4 3.5 3.6 3.8 3.9 .36
.65 2.9 3.1 3.2 3.4 3.5 3.6 3.8 3.9 .35
.66 2.9 3.1 3.2 3.4 3.5 3.6 3.8 3.9 .34
.67 2.9 3.1 3.2 3.4 3.5 3.6 3.8 3.9 .33
.68 3.0 3.1 3.2 3.4 3.5 3.6 3.8 3.9 .32
.69 3.0 3.1 3.2 3.4 3.5 3.6 3.8 3.9 .31
.70 3.0 3.1 3.2 3.4 3.5 3.6 3.8 3.9 .30
.71 3.0 3.1 3.3 3.4 3.5 3.6 3.8 3.9 .29
.72 3.0 3.1 3.3 3.4 3.5 3.7 3.8 3.9 .28
.73 3.0 3.1 3.3 3.4 3.5 3.7 3.8 3.9 .27
.74 3.0 3.2 3.3 3.4 3.5 3.7 3.8 3.9 .26
.75 3.0 3.2 3.3 3.4 3.6 3.7 3.8 3.9 .25
.76 3.1 3.2 3.3 3.4 3.6 3.7 3.8 3.9 .24
.77 3.1 3.2 3.3 3.5 3.6 3.7 3.8 3.9 .23
.78 3.1 3.2 3.4 3.5 3.6 3.7 3.8 3.9 .22
.79 3.1 3.3 3.4 3.5 3.6 3.7 3.8 4.0 .21
.80 3.2 3.3 3.4 3.5 3.6 3.7 3.9 4.0 .20
.81 3.2 3.3 3.4 3.5 3.7 3.8 3.9 4.0 .19
.82 3.2 3.4 3.5 3 .6 3.7 3.8 3.9 4.0 .18
.83 3.3 3.4 3.5 3.6 3.7 3.8 3.9 4.0 .17
.84 3.3 3.4 3.5 3 .6 3.7 3.9 4.0 4.1 .16
.85 3.4 3.5 3 .6 3.7 3 .8 3.9 4 .0 4.1 .15
.86 3.4 3.5 3.6 3.7 3 .8 3.9 4.0 4.1 .14
.87 3.5 3.6 3.7 3.8 3.9 4.0 4.1 4.2 .13
.88 3.6 3.7 3 .8 3.9 4 .0 4.1 4.1 4.2 .12
.89 3.7 3.8 3.9 4 .0 4 .0 4.1 4.2 4.3 .11
.90 3.8 3.9 4.0 4.1 4.1 4.2 4.3 4.4 .10
.91 3.9 4.0 4.1 4.2 4.3 4.3 4.4 4.5 .09
.92 4.1 4.2 4.3 4.3 4.4 4.5 4.6 4.6 .08
.93 4.3 4.4 4.5 4.5 4.6 4.7 4.7 4.8 .07
.94 4 .6 4.6 4.7 4.8 4 .8 4 .9 5.0 5.0 .06
.95 4.9 5.0 5.0 5.1 5.2 5.2 5.3 5.3 .05
.96 5.4 5.5 5.5 5.6 5.6 5.7 5.7 5.8 .04
.97 6.1 6.2 6.2 6.3 6.3 6.3 6.4 6.4 .03
.98 7.4 7.4 7.4 7.5 7.5 7.5 7.6 7.6 .02
.99 10.2 10.2 10.3 10.3 10.3 10.3 10.4 10.4 .01

f > .50                                          f < .50

Test Score: r     Test Length: L     Relative Score: f = r/L

Test Height: h = Σdi/L     Test Width: w = [(dL + dL-1 - d2 - d1)/2] [L/(L - 2)]

Standard Error: sfw = Cfw / L^½


TABLE C

MISFIT STATISTICS

Difference Between      Squared             Improbability      Relative Efficiency    Number of Items
Person Ability and      Standardized        of the             of the                 Needed To Maintain
Item Difficulty         Residual            Response           Observation            Equal Precision
(b - d)                 z² = exp(b - d)     p = 1/(1 + z²)     I = 400p(1 - p)        L = 1000/I

-0.6, 0.3 1 .50 100 10


0.4, 0.8 2 .33 90 11
0.9, 1.2 3 .25 75 13
1.3, 1.4 4 .20 65 15
1.5, 1.6 5 .17 55 18
1 .7 ,1 .8 6 .14 50 20
1.9, 2.0 7 .12 45 22

2.1 8 .11 40 25
2.2 9 .10 36 28
2.3 10 .09 33 30
2.4 11 .08 31 32
2.5 12 .08 28 36
2.6 14 .07 25 40
2.7 15 .06 23 43
2.8 17 .06 21 48
2.9 18 .05 20 50
3.0 20 .05 18 55 '

3.1 22 .04 16 61
3.2 25 .04 15 66
3.3 27 .04 14 73
3.4 30 .03 12 83
3.5 33 .03 11 91
3.6 37 .03 10 100
3.7 41 .02 9 106
3.8 45 .02 9 117
3.9 50 .02 8 129
4.0 55 .02 7 142

4.1 60 .02 6 156


4.2 67 .02 6 172
4.3 74 .01 5 189
4.4 81 .01 5 209
4.5 90 .01 4 230
4.6 99 .01 4 254
INDEX

A b ility |3 and b ( see Person ability measure) UFORM, 143 - 151, 214, 216
Ad d itive scale factor a ( see Scale additive factor) Expansion factors X and Y , 21 - 22, 30, 40 - 44,
Analysis o f fit (see F it) 50, 62, 148
Extending a variable, 87 - 93
Bank (see Item bank building)
Best test design (see Test design) Fit:
Beta /} (see Person ability measure) analysis, 2 - 4, 23 - 24, 66 - 82
B IC A L , 46 - 54 computer example, 52 - 55, 58 - 59, 80 - 82
control, 4 6 - 4 7 correcting misfit, 181 - 190
output, 46, 48 • 52, 54 crude, 124 - 125
diagnosing misfit, 170 - 180
Calibration 5 and d (see Item d ifficu lty calibra­ hand example, 69 - 79
tion ) item fit, 52 - 55, 58 - 59, 77 - 79, 121 - 125
Chain, 99 link fit, 93 - 96, 98
Chicago probability unit (see Scale C H IP ) loop fit, 100
CH IP (see Scale C H IP ) person fit, 2 - 4 , 7 6 - 77, 121 - 125, 165 - 180,
Com m on item equating, 108 - 1 0 9 ,1 1 2 - 1 1 8 205 - 209
Com m on person equating, 1 0 6 -1 1 2 response fit, 69 - 77, 121 - 125,165 - 180
Computing algorithms: ruler for fit analysis, 208 - 209
B IC A L , 46 - 54 summary o f fit analysis, 79 - 80
P R O X , 61 - 62 table for fit analysis, 73, 216
hand example, 30 - 44 Fumbling (see Response pattern)
com puter example, 46 - 55
U C O N , 62 - 65 Guessing (see Respone pattern)
U F O R M , 143 -151
tables, 146, 212, 214 Identity line, 89, 92 - 95
Correcting a measure, 181 - 190 Individualized testing (see Tailoring)
Connecting tw o tests, 96 - 98 (see Linking test Information I f |, 16 - 17, 73 - 75, 135, 161 - 164
form s) Intensifying a variable, 87 - 94
Control lines fo r identity plots, 94 - 95 Interval distribution o f items or persons,
Criterion referencing, 118 - 121, 199 - 202, 204, 130- 131,133- 134,137,139
206 - 207 Item:
Crude fit (see F it) characteristic curve, 12 - 14, 51 - 53, 58 - 59
difficulty calibration 5 and d, 17 - 22, 25, 30,
Data matrix, 10, 18, 31, 33, 68, 107 - 109 34 - 38, 40 - 42, 54 - 55, 61 - 65
Degrees o f freedom , 23 - 24, 71, 74, 77, 79 discrimination, ix - x
crude fit, 125 index, 52 - 55
item fit, 24, 77, 79 fit, 52 - 55, 58 - 59, 77 - 79,121 - 125
link analysis, 96 p-value, viii, xi - xiii, 25 - 26
person fit, 23, 76 - 77, 79, 165 - 168 point biserial, viii, x, 26
Delta 6 (see Item d ifficu lty calibration) score Sj, 10, 18 - 22, 32 - 35
Design o f best test (see Test design) Item bank building, 9 8 -1 1 8
Diagnosing misfit, 170 - 180 KCT example, 106 - 118
D ifficu lty 5 and d (see Item difficu lty calibra­ chain, 99
tion ) link, 96 - 106
Discrimination (see Item discrimination index) loop, 100
network, 101 - 103
Editing data, 31 - 34, 47 - 49 web, 102-106
E fficien cy, 74 - 75, 139, 161,164 Item calibration quality control, 121 - 125 (see
Equating test form s (see Linking test form s) Item fit)
Error coefficien t C f , 135 - 140, 146, 193 - 194,
214 KCT (see Knox Cube Test)
Estimation methods, ix - x, 15 - 20, 44 - 45 Knox Cube Test KCT, 28 - 29
P R O X , 21 - 22, 28 - 45, 50 - 56, 60 - 62, 143, banking KCTB, 106 - 118
149 -150 criterion referencing, 118 - 121, 206 - 207
U C O N , 56 - 65, 142 - 1 4 3,148 - 150 KCTB, 106 - 121


norm referencing, 120, 126 - 128, 198 - 200, characteristic curve, 1 2 - 1 4


204, 206 • 207 fit, 2 - 4, 76 - 77, 121 - 125, 165 - 180,
response m atrix, 31 - 33, 66 - 69 205 - 209
variable d efin ition , 83 - 91, 119 -1 2 0 , response x vi , 9 - 14, 68 - 77, 165 - 180
206 - 207 score rv, 4 - 10, 18 - 22
converting to measure br, 21 - 22, 27,
Least measurable differen ce L M D , 132, 135, 37 - 40, 43 - 44, 61 - 65, 142 - 151
192 -1 9 8 noniinearity, 7 - 9
Least observable differen ce L O D , 132, 193, relative score f r, 132, 134, 140, 144 - 146,
1 9 4 ,1 9 6 149, 193 - 194
Least significant differen ce LS D , 195 - 196 test dependence, 4 - 6
L in earity, vii, 7 - 9, 15, 25, 2 7 ,1 9 1 - 192 Person measure qu ality control, 165 - 170 (see
Lin k in g test form s, 96 - 106 Person fit )
com m on item , 108 - 109, 112 - 118 Plodding (see Response pattern)
com m on person, 107 - 112 Precision o f measure (see Standard error person
L in k, 96 - 98 ( see Item bank building) measure)
fit, 94 - 96 P robability unit (see Scale C H IP )
L O D ( see Least observable d ifferen ce) P R O X (see N orm al approxim ation estim ation)
Logistic:
distribution, ix - x Quality con trol, 121 - 125, 165 - 170 (see F it)
fu nction, 15, 25, 27, 36 Quick norms, 1 2 6 -1 2 8
ogive scaling fa ctor 1.7, 21 - 22
L o g it, 16 - 17, 25, 27, 30, 34, 36, 191 - 192 Rasch m odel, 9 - 27
L o g odds (see L o g it) R elia b ility o f calibration (see Standard error item
L o o p , 100 (see Item bank building) calibration)
fit, 100 R elia b ility o f measure (see Standard error person
L M D (see Least measurable d ifferen ce) measure)
L S D (see Least significant d iffe re n c e ) R ep ortin g form s, 205 - 209
Residual (see Standardized residual)
Response:
M ap (see V ariable d e fin itio n ) curve, 9 - 1 4
M astery referencing (see Scale C H IP ) fit, 69 - '5 , 121 - 125, 165 - 180
M ean square residual v, 23 - 24, 26, 53, 71 - 74, im probability, 7 1 - 7 4
76 - 8 2 ,1 6 5 - 170 K C T m atrix, 31, 33, 66 - 69
Measure J3 and b (see Person ab ility measure) m odel, 9 - 1 4
Measurem ent target (see T arget o f m easurem ent) pattern, 2 - 4 ,1 7 0 - 180
Measuring test, 131 - 133 fum bling, 171, 176, 178 - 180, 188 - 190
M isfit (see F it) guessing, 171, 174 - 177, 181, 185 - 187
plodding, 171, 176, 178 - 180, 188
N etw o rk , 101 - 103 (see Item bank building) sleeping, 171 - 177, 181 - 184
N I T (see Scale N I T ) Response probability scaling unit (see Scale
N on lin earity o f test scores, 7 - 9 C H IP )
N orm referencing, 120 - 121, 126 - 128, 151,
198 - 200, 204, 206 - 207 Sam ple-free item calibration, vii - xiii, 15, 20,
N orm al approxim ation estim ation P R O X , 25 - 26
21 - 22, 28 - 45, 50 - 56, 60 - 62, 143, Scale, 1 9 1 - 2 0 4
149 - 150 additive fa ctor a, 191 - 192
com pu ter algorithm , 6 1 - 6 2 C H IP, 201 - 204
com pu ter exam ple, 46 - 55 linear, vii, 7 - 9, 25, 27, 191 - 192
hand algorithm , 21 - 22, 34, 38 - 40, 42, 44 L M D , 132, 135, 192 - 198
hand exam ple, 30 - 44 logit, 16 - 17, 25, 27, 30, 34, 36, 191 - 192
hand vs. com puter, 5 5 - 5 6 N IT , 198 - 200, 204
P R O X vs. U C O N , 60 - 61 S IT , 199 - 202, 204
N orm al distribution o f items or persons, 21, spacing factor y, 191 - 198
130 -1 3 1 , 133 - 134, 1 3 7 ,1 3 9 Score (see T est score)
N orm ative scaling unit (see Scale N I T ) S IT (see Scale S IT )
Sleeping (see Response pattern)
O b jectivity , viii - x iii, 15, 141 Spacing factor y ( see Scale spacing factor)
Standardized mean square t, 77 - 80, 165 - 169
Person : Standardized residual z and z 2, 23 - 24, 70 - 80,
ab ility measure 0 and b, 17 - 22, 134 - 136, 121 - 125, 165 - 180, 205 - 209
142 - 151 Standard error:
P R O X , 37 - 39, 43 - 44, 51, 56, 61 - 62, co e ffic ie n t C f, 135 - 140, 146, 193 - 194, 214
1 4 3 ,1 4 8 - 149 id en tify line, 89, 92 - 95
U C O N , 57, 61 - 65, 142 - 143, 147 - 149 item calibration S E (d j), 21 - 22, 25 - 26,
U F O R M , 143 - 147, 149 - 151, 212 61 - 65, 143 - 1 4 6 ,1 9 2
222 INDEX

link, 96 - 98 Test score r, 2 - 10, 18 - 20, 27


loop, 100 converting to measure br, 142 - 164
person measure SE (bv ), Sv, SEM and S, P R O X , 21 - 22, 27, 37 - 40, 43 - 44,
21 - 22, 27, 61 - 65, 132 - 136, 140, 192, 61 - 6 2 ,1 4 3
194 - 198 U CO N, 62 - 65, 142 - 143
P R O X item calibration, 2 1 - 2 2 U F O R M , 143 - 151, 212
P R O X person measure, 21 - 22 Traditional test statistics, 24 - 27
spacing factor, 195
Substantive scaling unit ( see Scale S IT ) U C O N (see Unconditional maximum likelihood
estim ation)
Tailoring, 151 - 164 Unconditional maximum likelihood estimation
performance, 156 - 160, 164 UCON:
self, 161 - 164 com puter example, 5 6 - 6 1
status, 153 - 156, 164 computing algorithm, 62 - 65
Target o f measurement, 129 -131 U C O N vs. P R O X , 60 - 61
dispersion S, 129 - 131, 133 - 134, 137 - 140 U F O R M (see U niform approximation estimation)
distribution D, 130 - 131, 134 - 139 U niform approxim ation estimation U FO R M ,
location M, 130 - 131, 137 - 139 143 - 151,2 1 2 , 214
Test design, 131 - 140
distribution o f items or persons, 130 - 139 V alidity o f calibration (see Item fit )
height H and h, 132 - 133, 137 - 140 V alidity o f measurement (see Person fit)
length L, 132 - 133, 136 - 140 Variable definition, 1 - 4, 98 - 106
operating curve, 132 - 133, 138 KCT, 83 - 91, 119 - 120, 206 - 207
shape, 132 - 140
width W and w , 132 - 1 3 3 ,1 3 6 - 140 Web, 102 - 106 (see Item bank building)
Test-free person measurement, vii - xiii, 15, 20, com plete, 103 - 104
27,141 incom plete, 104 - 106
NOTATION

for Persons v = 1, N                                    for Items i = 1, L

βv        ability parameter of person v                 δi        difficulty parameter of item i
bv        statistic estimating βv                       di        statistic estimating δi
SE(bv)    standard error of statistic bv                SE(di)    standard error of statistic di
rv        observed test score of person v               si        observed sample score of item i
br        ability estimated for score r                 pi        sample p-value of item i
nr        number of persons with score r
yv        test score logit of person v                  xi        sample score logit of item i
yr        logit of test score r
y.        sample mean of person logits                  x.        test mean of item logits
V         sample variance of person logits              U         test variance of item logits
X         person logit expansion factor                 Y         item logit expansion factor
          to adjust for test width                                to adjust for sample spread

xvi                  response of person v to item i

P{xvi | βv, δi}      probability of response xvi given βv and δi
πvi                  probability of a correct response, i.e. xvi = 1
pvi                  estimate of πvi based on bv and di
pri                  estimate of πvi for score r based on br and di
Ivi                  information in xvi about person v and item i
zvi                  standardized residual of xvi from its estimated expectation

vv        mean square residual for person v             vi        mean square residual for item i
fv        degrees of freedom in vv                      fi        degrees of freedom in vi
tv        standardized mean square vv                   ti        standardized mean square vi

Exceptions to this notation occur when locally convenient, particularly with "s",
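
Read as a computation, the residual and mean square symbols above fit together. The short sketch below is our own illustration, not a formula quoted from the text: it assumes the usual Rasch forms z = (x - p)/[p(1 - p)]^½ and I = p(1 - p), and it takes the degrees of freedom f as given.

    import math

    def standardized_residual(x, p):
        # z: observed response x (0 or 1) minus its modeled expectation p,
        # scaled by the standard deviation of that response, [p(1 - p)]**0.5
        return (x - p) / math.sqrt(p * (1.0 - p))

    def information(p):
        # I: information in a single response whose success probability is p
        return p * (1.0 - p)

    def mean_square(residuals, f):
        # v: average of squared standardized residuals over f degrees of freedom
        return sum(z * z for z in residuals) / f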
NOTATION
(continued)

for Sample of N persons                                 for Test of L items

M         mean person ability                           H         mean item difficulty
m         estimate of M                                 h         estimate of H
σ         standard deviation of person ability          ω         standard deviation of item difficulty
s         estimate of σ                                 W         item difficulty range
                                                        w         estimate of W

e                        Napierian or natural log base, e = 2.71828...

e^(βv - δi) = exp (βv - δi)     base e raised to the exponent (βv - δi)

 M
 Σ (yj)                  continued sum of yj over j = 1, M
j=1

 M
 Π (yj)                  continued product of yj over j = 1, M
j=1

ℓn(y)                    natural log of y

E{y}                     expected value of y

V{y}                     variance of y

1.7                      coefficient which brings the logistic
2.89 = 1.7²              cumulative distribution ogive to within 0.01
8.35 = 2.89² = 1.7⁴      of the normal cumulative distribution ogive
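
As a quick numerical check on those constants (ours, not the book's), a few lines of Python confirm that the logistic ogive scaled by 1.7 tracks the normal ogive to within about 0.01 everywhere:

    import math

    def normal_ogive(x):
        # standard normal cumulative distribution, via the error function
        return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def scaled_logistic_ogive(x):
        # logistic cumulative distribution with the 1.7 scaling coefficient
        return 1.0 / (1.0 + math.exp(-1.7 * x))

    # largest gap between the two ogives over a fine grid of logit values
    worst = max(abs(scaled_logistic_ogive(t) - normal_ogive(t))
                for t in (i / 100.0 for i in range(-400, 401)))
    print(worst)   # a value just under 0.01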

RASCH MODEL

For a correct response xvi = 1

P{xvi = 1 | βv, δi} = exp (βv - δi) / [1 + exp (βv - δi)] .

For an incorrect response xvi = 0

P{xvi = 0 | βv, δi} = 1 / [1 + exp (βv - δi)] .

For either response xvi = 1 or 0

P{xvi | βv, δi} = exp [xvi (βv - δi)] / [1 + exp (βv - δi)] .
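
For readers who want to try the model numerically, here is a minimal sketch of the three expressions above in Python; the function names are ours, not part of the book or of BICAL.

    import math

    def p_correct(beta, delta):
        # P{x = 1 | beta, delta} = exp(beta - delta) / [1 + exp(beta - delta)]
        return math.exp(beta - delta) / (1.0 + math.exp(beta - delta))

    def p_incorrect(beta, delta):
        # P{x = 0 | beta, delta} = 1 / [1 + exp(beta - delta)]
        return 1.0 / (1.0 + math.exp(beta - delta))

    def p_response(x, beta, delta):
        # P{x | beta, delta} = exp[x(beta - delta)] / [1 + exp(beta - delta)], x = 1 or 0
        return math.exp(x * (beta - delta)) / (1.0 + math.exp(beta - delta))

    # A person one logit above an item's difficulty succeeds about 73% of the time.
    print(round(p_correct(1.0, 0.0), 2))     # 0.73
    print(round(p_incorrect(1.0, 0.0), 2))   # 0.27

The two probabilities for any person-item pair sum to one, which is what lets the third expression cover both outcomes in a single formula.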
