Best Test Design
Benjamin D. Wright
Mark H. Stone
University of Chicago
MESA PRESS
Chicago
1979
MESA PRESS
5835 Kimbark Ave.
Chicago, IL
60637
This book owes more to Georg Rasch than words can convey. The many months of painstaking and inspired tutoring with which he introduced me to the opportunities and necessities of measurement during 1960 are the foundation on which Mark Stone and I have built. I am also indebted to Jane Loevinger for her common sense during the 1965 Midwest Psychological Association Symposium on "Sample-free Probability Models in Psychological Measurement," to Darrell Bock for encouraging me to talk about "Sample-free Test Construction and Test-free Ability Measurement" at the March, 1967 meeting of the Psychometric Society and to Benjamin Bloom for insisting on "Sample-free Test Calibration and Person Measurement" at the 1967 ETS Invitational Conference on Testing Problems.
Benjamin D. Wright
The long hard work of mathematical analysis, computer programming and empirical verification which makes Rasch measurement not only practical but even easy to use was done with friends in the MESA program at The University of Chicago. This great adventure began in 1964 with Bruce Choppin and Nargis Panchapakesan and continues with David Andrich, Graham Douglas, Ronald Mead, Robert Draba, Susan Bell and Geoffrey Masters. We dedicate our book to them.
Benjamin D. Wright
Mark H. Stone
The University of Chicago
March 30, 1979
FOREWORD
Behind the practical procedures of Rasch measurement are their reasons. The methodological issues that motivate and govern Rasch measurement are developed in Chapters 1, 5 and 6. To those of you who like to begin with theory, we recommend reading these chapters before working the problems in Chapters 2 and 4. Finally, if Rasch measurement is entirely new to you, you might want to begin with Section 0.3 of this Foreword, which is an introduction to the topic given at the October 28, 1967 ETS Invitational Conference on Testing Problems (Wright, 1968).
Section 0.2 reviews the motivation and history of the ideas that culminated in Rasch's Probabilistic Models for Some Intelligence and Attainment Tests (1960). The references cited there and elsewhere in this book are focused on work that (1) bears directly on the discussion, (2) is in English and (3) is either readily available in a university library or can be supplied by us.
0.2 MOTIVATION AND HISTORY
Fifty years ago Thorndike complained that contemporary intelligence tests failed to specify "how far it is proper to add, subtract, multiply, divide, and compute ratios with the measures obtained." (Thorndike, 1926, 1). A good measurement of ability would be one "on which zero will represent just not any of the ability in question, and 1, 2, 3, 4, and so on will represent amounts increasing by a constant difference." (Thorndike, 1926, 4). Thorndike had the courage to complain because he believed he had worked out a solution to the problem for his own intelligence test. So did Thurstone (1925).
Thurstone's method was to transform the proportion in an age group passing any item into a unit normal deviate and to use these values as the basis for scaling. Common scale values for different age groups were obtained by assuming a linear relationship between the different scale values of items shared by two or more test forms, using the different group means and standard deviations as the parameters for a transformation onto a common scale. Thurstone redid a piece of Thorndike's work to show that his method was better (Thurstone, 1927). His "absolute scale" (1925, 1927) yields a more or less interval scale, but one which is quite dependent on the ability distribution of the sample used. In addition to item homogeneity, the Thurstone method requires the assumption that ability is normally distributed within age groups and that there exist
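Thurstone's deviate-and-rescale procedure can be sketched in a few lines. The proportions and the sign convention below are hypothetical illustrations, not figures from any of the works cited; the point is only the mechanics: convert each group's proportions passing into unit normal deviates, then map one group's deviates onto the other's scale with a linear transformation built from the shared items' means and standard deviations.

```python
from statistics import NormalDist, mean, stdev

def item_deviate(p_passing):
    # Thurstone-style difficulty: the unit normal deviate above which
    # p_passing of the group falls (sign convention assumed here)
    return NormalDist().inv_cdf(1.0 - p_passing)

# made-up proportions passing four shared items in two age groups
p_young = [0.20, 0.35, 0.55, 0.70]
p_old = [0.50, 0.65, 0.80, 0.90]

z_young = [item_deviate(p) for p in p_young]
z_old = [item_deviate(p) for p in p_old]

# assume the shared items' deviates are linearly related across groups;
# use their means and standard deviations as transformation parameters
a = stdev(z_young) / stdev(z_old)
b = mean(z_young) - a * mean(z_old)
z_old_on_young_scale = [a * z + b for z in z_old]
```

After the transformation the two groups' values for the shared items lie on one nominal common scale; how well individual items then agree from sample to sample is exactly the dependence on the ability distribution noted above.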
Thurstone used the 1925 version of his method for the rest of his life, but the majority of test calibrators have relied on the simpler techniques of percentile ranks and standard scores. The inadequacies of these methods were clarified by Loevinger's 1947 analysis of the construction and evaluation of tests of ability (Loevinger, 1947).

Loevinger showed that test homogeneity and scale monotonicity were essential criteria for adequate measurement. In addition, "An acceptable method of scaling must result in a derived scale which is independent of the original scale and of the original group tested." (Loevinger, 1947, 46). Summing up the test calibration situation in 1947, Loevinger says, "No system of scaling has been proved adequate by the criteria proposed here, though these criteria correspond to the claims made by Thurstone's system." (Loevinger, 1947, 43). As for reliabilities based on correlations, "Until an adequate system of scaling is found, the correlation between tests of abilities, even between two tests of the same ability, will be accidental to an unknown degree." (Loevinger, 1947, 46).
In 1950 Gulliksen concluded his Theory of Mental Tests with the observation that

Relatively little experimental or theoretical work has been done on the effect of group changes on item parameters. If we assume that a given item requires a certain ability, the proportion of a group answering that item correctly will increase and decrease as the ability level of the group changes. . . . As yet there has been no systematic theoretical treatment of measures of item difficulty directed particularly toward determining the nature of their variation with respect to changes in group ability. Neither has the experimental work on item analysis been directed toward determining the relative invariance of item parameters with systematic changes in the ability level of the group tested (Gulliksen, 1950, 392-393).
At the 1953 ETS Invitational Conference on Testing Problems, Tucker suggested that, "An ideal test may be conceived as one for which the information transmitted by each of the possible scaled scores represents a location on some unitary continuum so that uniform differences between scaled scores correspond to uniform differences between test performances for all score levels" (Tucker, 1953, 27). He also proposed the comparison of groups differing in ability as a strong method for evaluating test homogeneity (Tucker, 1953, 25). But the other participants in the conference belittled his proposals as impractical and idealistic.
Most of the test scales now in use derive their systems of units from data taken from actual test administrations, and thus are dependent on the performance of the groups tested. When so constructed, the scale has meaning only so long as the group is well defined and has meaning, and bears a resemblance in some fashion to the groups or individuals who later take the test for the particular purposes of selection, guidance, or group evaluation. However, if it is found that the sampling for the development of a test scale has not been adequate, or that the group on which the test has been scaled has outlived its usefulness, possibly because of changes in the defined population or because of changes in educational emphases, then the scale itself comes into question. This is a serious matter. A test which is to have continued usefulness must have a scale which does not change with the times, which will permit acquaintance and familiarity with the system of units, and which will permit an accumulation of data for historical comparisons (Angoff, 1960, 815).
These better methods have their roots in the 19th century psychophysical models of Weber and Fechner. They are based on simple models for what it seems reasonable to suppose happens when a person responds to a test item. Two statistical distributions have been used to model the probabilistic aspect of this event. The normal distribution appears as a basis for mental measurement in Thurstone's Law of Comparative Judgment in the 1920's. The use of the normal ogive as an item response model seems to have been initiated by Lawley and Finney in the 1940's. Lord made the normal ogive the cornerstone of his approach to item analysis until about 1967, when, under Birnbaum's influence, he switched to a logistic response model (Lord, 1968).
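The two response models are numerically close cousins. As a sketch (ours, not from any of the works cited), comparing the normal ogive Φ(x) with a logistic curve rescaled by the conventional factor 1.7 shows the two never disagree by more than about one percent:

```python
import math

def normal_ogive(x):
    # Phi(x): success probability under the normal ogive model
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def scaled_logistic(x, scale=1.7):
    # logistic curve with the usual 1.7 factor aligning it to Phi
    return 1.0 / (1.0 + math.exp(-scale * x))

# scan the interval [-4, 4] for the largest disagreement
grid = [i / 100.0 for i in range(-400, 401)]
max_gap = max(abs(normal_ogive(x) - scaled_logistic(x)) for x in grid)
print(f"largest disagreement on [-4, 4]: {max_gap:.4f}")
```

This closeness is part of why the choice between the two curves came to be settled by statistical convenience rather than by fit to data.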
The inevitable resolution of this debate has been implicit ever since Fisher's invention of sufficient estimation in the 1920's and Neyman and Scott's work on the consistency of conditional estimators in the 1940's. Rasch (1968), Andersen (1973, 1977) and Barndorff-Nielsen (1978) each prove decisively that only item difficulty can actually be estimated consistently and sufficiently from the right/wrong item response data available for item analysis. These proofs make it clear that the dichotomous response data available for item analysis can only support the estimation of item difficulty and that attempts to estimate any other individual item parameters are necessarily doomed.

These proofs have practical implications. Anyone who actually examines the inner workings of the various computer programs advertised to estimate item discriminations and tries to apply them to actual data will find that the resulting estimates are highly sample dependent. If attempts are made in these computer programs to iterate to an apparent convergence, this "convergence" can only be "reached" by interfering arbitrarily with the inevitable tendency of at least one of the item discrimination estimates to diverge to infinity. In most programs this insurmountable problem is sidestepped either by not iterating at all or by preventing any particular discrimination estimate from exceeding some entirely arbitrary ceiling such as 2.0.
As far as we can tell, it was the Danish mathematician Georg Rasch who first understood the possibilities for truly objective measurement which reside in the simple logistic response model. Apparently it was also Rasch who first applied the logistic function to the actual analysis of mental test data for the practical purpose of constructing tests. Rasch began his work on psychological measurement in 1945 when he standardized a group intelligence test for the Danish Department of Defense. It was in carrying out that item analysis that he first "became aware of the problem of defining the difficulty of an item independently of the population and the ability of an individual independently of which items he has actually solved." (Rasch, 1960, viii). By 1952 he had laid down the basic foundations for a new psychometrics and worked out two probability models for the analysis of oral reading tests. In 1953 he reanalyzed the intelligence test data and developed the essentials of a logistic probability model for item analysis.
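In the notation used later in this book (β for person ability, δ for item difficulty), Rasch's logistic model for a right/wrong response can be sketched as:

```python
import math

def rasch_probability(beta, delta):
    """Probability that a person of ability beta succeeds on an item of
    difficulty delta: exp(beta - delta) / (1 + exp(beta - delta))."""
    return 1.0 / (1.0 + math.exp(delta - beta))

# when ability equals difficulty, success is a coin toss
print(rasch_probability(0.0, 0.0))  # 0.5
# each extra logit of ability multiplies the odds of success by e
p0, p1 = rasch_probability(0.0, 0.0), rasch_probability(1.0, 0.0)
print((p1 / (1 - p1)) / (p0 / (1 - p0)))
```

The only item parameter is δ; there is no discrimination or guessing parameter, and that restraint is what makes the sufficiency and consistency results described above attainable.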
Rasch first published his concern about the problem of sample dependent estimates in his 1953 article on simultaneous factor analysis in several populations (Rasch, 1953). But his work on item analysis was unknown in this country until the spring of 1960 when he visited Chicago for three months, gave a paper at the Berkeley Symposium on Mathematical Statistics (Rasch, 1961), and published Probabilistic Models for Some Intelligence and Attainment Tests (Rasch, 1960).
In her 1965 review of person and population as psychometric concepts Loevinger wrote,
Ever since I was old enough to argue with my pals over who had the best IQ (I say "best" because some thought 100 was perfect and 60 was passing), I have been puzzled by mental measurement. We were mixed up about the scale. IQ units were unlike any of those measures of height, weight, and wealth with which we were learning to build a science of life. Even that noble achievement, 100 percent, was ambiguous. One hundred might signify the welcome news that we were smart. Or it might mean the test was easy. Sometimes we prayed for easier tests to make us smarter.

Later I learned one way a test score could more or less be used. If I were willing to accept as a whole the set of items making up a standardized test, I could get a relative measure of ability. If my performance put me at the eightieth percentile among college men, I would know where I stood. Or would I? The same score would also put me at the eighty-fifth percentile among college women, at the ninetieth percentile among high school seniors, and above the ninety-ninth percentile among high school juniors. My ability depended not only on which items I took but on who I was and the company I kept!
I hope I am reminding you of some problems which afflict present practice in mental measurement. The scales on which ability is measured are uncomfortably slippery. They have no regular unit. Their meaning and estimated quality depend upon the specific set of items actually standardized and the particular ability distribution of the children who happened to appear in the standardizing sample.

If all of a specified set of items have been tried by a child you wish to measure, then you can obtain his percentile position among whatever groups of children were used to standardize the test. But how do you interpret this measure beyond the confines of that set of items and those groups of children? Change the children and you have a new yardstick. Change the items and you have a new yardstick again. Each collection of items measures an ability of its own. Each measure depends for its meaning on its own family of test takers. How can we make objective mental measurements and build a science of mental development when we work with rubber yardsticks?
When a man says he is at the ninetieth percentile in math ability, we need to know in what group and on what test before we can make any sense of his statement. But when he says he is five feet eleven inches tall, do we ask to see his yardstick? We know yardsticks differ in color, temperature, composition, weight, even size. Yet we assume they share a scale of length in a manner sufficiently independent of these secondary characteristics to give a measurement of five feet eleven inches objective meaning. We expect that another man of the same height will measure about the same five feet eleven even on a different yardstick. I may be at a different ability percentile in every group I compare myself with. But I am the same 175 pounds in all of them.
The guiding star toward which models for mental measurement should aim is this kind of objectivity. Otherwise how can we ever achieve a quantitative grasp of mental abilities or ever construct a science of mental development? The calibration of test-item difficulty must be independent of the particular persons used for the calibration. The measurement of person ability must be independent of the particular test items used for measuring.
When we compare one item with another in order to calibrate a test, it should not matter whose responses to these items we use for the comparison. Our method for test calibration should give us the same results regardless of whom we try the test on. This is the only way we will ever be able to construct tests which have uniform meaning regardless of whom we choose to measure with them.

When we expose persons to a selection of test items in order to measure their ability, it should not matter which selection of items we use or which items they complete. We should be able to compare persons, to arrive at statistically equivalent measurements of ability, whatever selection of items happens to have been used, even when they have been measured with entirely different tests.
Exhortations about objectivity and sarcasm at the expense of present practices are easy. But can anything be done about the problem? Is there a better way? In the old way of doing things, we calibrate a test item by observing how many persons in a standard sample succeed on that item. The traditional item "difficulty" is the proportion of correct responses in some standardizing sample. Item quality is judged from the correlation between these item responses and test scores. Person ability is a percentile standing in the same "standard" sample. Obviously this approach leans very heavily on assumptions concerning the appropriateness of the standardizing sample of persons.
But this simple model has surprising consequences. When measurement is governed by this model, it is possible to take into account whatever abilities the persons in the calibration sample happen to demonstrate and to free the estimation of item difficulty from the particulars of these abilities. The scores persons obtain on the test can be used to remove the influence of their abilities from the estimation of item difficulty. The result is a sample-free item calibration.

The same thing can happen when we measure persons. The scores items receive in whatever sample happens to provide their calibrations can be used to remove the influence of item difficulty from the estimation of person ability. The result is a test-free person measurement.¹
¹Adapted from Proceedings of the 1967 Invitational Conference on Testing Problems. Copyright © 1968 by Educational Testing Service. All rights reserved. Reprinted by permission.
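The claim of sample-free calibration can be illustrated with a small computed example (our sketch, using the book's later β/δ notation). Under the Rasch model, centered log-odds calibrations formed from each item's expected proportion correct agree closely across two samples of quite different ability, even though the raw proportions themselves differ greatly:

```python
import math

def p_correct(beta, delta):
    # Rasch model: success probability depends only on beta - delta
    return 1.0 / (1.0 + math.exp(delta - beta))

def calibrate(abilities, difficulties):
    """Centered log-odds calibration of each item from its expected
    proportion correct in the sample (a rough, PROX-like sketch)."""
    logits = []
    for delta in difficulties:
        p = sum(p_correct(b, delta) for b in abilities) / len(abilities)
        logits.append(math.log((1.0 - p) / p))
    center = sum(logits) / len(logits)
    return [d - center for d in logits]

true_deltas = [-1.0, 0.0, 1.0, 2.0]    # four items, easy to hard
less_able = [-1.0, 0.0, 1.0]           # one sample of persons
more_able = [1.0, 2.0, 3.0]            # a sample two logits abler

est_low = calibrate(less_able, true_deltas)
est_high = calibrate(more_able, true_deltas)
for lo, hi in zip(est_low, est_high):
    print(f"{lo:+.2f}  {hi:+.2f}")
```

The two ability distributions here happen to have the same spread; when the spreads differ an expansion factor is also needed, which the PROX procedure of Chapter 2 supplies.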
CONTENTS

FOREWORD vii

2 ITEM CALIBRATION BY HAND 28
2.1 Introduction 28
2.2 The Knox Cube Test 28
2.3 The Data for Item Analysis 29
2.4 Calibrating Items and Measuring Persons 30
2.5 Discussion 44

3 ITEM CALIBRATION BY COMPUTER 46
3.1 Introduction 46
3.2 BICAL Output for a PROX Analysis of the Knox Cube Test Data 46
3.3 Comparing PROX by Hand with PROX by Computer 55
3.4 Analyzing KCT with the UCON Procedure 56
3.5 Comparing UCON to PROX with the KCT Data 60
3.6 A Computing Algorithm for PROX 61
3.7 The Unconditional Procedure UCON 62

4 THE ANALYSIS OF FIT 66
4.1 Introduction 66
4.2 The KCT Response Matrix 66
4.3 The Analysis of Fit by Hand 69
4.4 Misfitting Person Records 76
4.5 Misfitting Item Records 77
4.6 Brief Summary of the Analysis of Fit 79
4.7 Computer Analysis of Fit 80

5 CONSTRUCTING A VARIABLE 83
5.1 Generalizing the Definition of a Variable 83
5.2 Defining the KCT Variable 83
5.3 Intensifying and Extending the KCT Variable 87
INDEX 220
1 THE MEASUREMENT MODEL
This book is about how to make and use mental tests. In order to do this successfully we must have a method for turning observations of test performance into measures of mental ability. The idea of a measure requires an idea of a variable on which the measure is located. If the variable is visualized as a line, then the measure can be pictured as a point on that line. This relationship between a measure and its variable is pictured in Figure 1.1.1.
FIGURE 1.1.1
A MEASURE ON A VARIABLE
[figure: the variable drawn as a line, with the measure marked as a point on it]
When we test a person, our purpose is to estimate their location on the line implied by the test. Before we can do this we must construct a test that defines a line. We must also have a way to turn the person's test performance into a location on that line. This book shows how to use test items to define lines and how to use responses to these items to position persons on these lines.
In order for a test to define a variable of mental ability, the items out of which the test is made must share a line of inquiry. This common line and its direction towards increasing ability can be pictured as an arrow with high ability to the right and low ability to the left. The meaning of this arrow is given by the test items which define it. If we use the symbols δ₁, δ₂ . . . δᵢ . . . , to represent the difficulty levels of items, then each δᵢ marks the location of an item on the line. The δ's are the calibrations of the items along the variable and these calibrated items are the operational definition of what the variable measures. Hard items which challenge the most able persons define the high, or right, end of the line. Easy items which even the least able persons can usually do successfully define the low, or left, end of the line. Figure 1.1.2 shows a variable defined by four items spread across its length.
FIGURE 1.1.2
DEFINING A VARIABLE
[figure: four item calibrations from easiest to hardest along the line of the variable, with the person measure located among them]
A variable begins as a general idea of what we want to measure. This general idea is given substance by writing test items aimed at eliciting signs of the intended variable in the behavior of the persons. These test items become the operational definition of the variable. The intuition of the test builder and the careful construction of promising test items, however, are not enough. We must also gather evidence that a variable is in fact realized by the test items. We must give the items to suitable persons and analyze the resulting response patterns to see if the items fit together in such a way that responses to them define a variable.
In order to locate a person on this variable we must test them with some of the items which define the variable and then determine whether their responses add up to a position on the line. If we use the symbol β to represent the ability level of the person, then β marks their location on the line.
The person measure β shown in Figure 1.1.2 locates this person above the three easiest items and below the hardest one. Were this person to take a test made up of these four items, their most probable test score would be three and we would expect them to get the three easiest items correct and the fourth, hardest item, incorrect. This observation is more important than it might seem because it is the basis of all our methods for estimating person measures from test scores. When we want to know where a person is located on a variable, we obtain their responses to some of the items which define the variable. The only reasonable place to estimate their location from these data is in the region where their responses shift from mostly correct on easier items to mostly incorrect on harder ones.
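For a perfectly ordered response pattern this placement rule is easy to state in code. The function below is our illustration, not a procedure from this book: it places the person midway between the hardest item answered correctly and the easiest item answered incorrectly.

```python
def locate_person(difficulties, responses):
    """Place a person between their hardest success and easiest failure.
    difficulties: item calibrations; responses: 1 = correct, 0 = incorrect.
    Returns None for a perfect or zero score (no finite location)."""
    passed = [d for d, r in zip(difficulties, responses) if r == 1]
    failed = [d for d, r in zip(difficulties, responses) if r == 0]
    if not passed or not failed:
        return None
    return (max(passed) + min(failed)) / 2.0

deltas = [-2.0, -1.0, 0.0, 1.0]  # four calibrated items, easy to hard
print(locate_person(deltas, [1, 1, 1, 0]))  # 0.5: above the third item,
                                            # below the fourth
```

With real, imperfect patterns the shift from correct to incorrect is gradual rather than abrupt, and the estimation procedures of Chapters 2 and 3 locate the person from their score instead of from a single crossing point.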
Before we can estimate a person's measure from their score, however, we must examine their pattern of responses. We must see if their pattern is consistent with how we expect their items to elicit responses. When the items with which a person is tested have been calibrated along a variable from easy to hard, then we expect the person's response pattern to be more or less consistent with the difficulty order of these items along the variable. We expect the person to succeed on items that ought to be easy for them and to fail on items that ought to be hard for them.
Figure 1.1.3 shows two response patterns to the same ten item test. The ten items are located along the variable at their levels of difficulty. Each pattern of responses is recorded above the line of the variable. The 1's represent correct answers. The 0's represent incorrect answers. Both patterns produce a score of six.

In Pattern A the six easiest items are correct and the four hardest ones incorrect. It seems inconceivable to locate this person anywhere except in the region above δ₆, the most difficult item they get correct, but below δ₇, the least of the even more difficult items they get incorrect.
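One simple way to quantify how far a pattern departs from this ideal (our sketch; Chapter 4 develops proper fit statistics) is to count the item pairs that contradict the difficulty order: a failure on an easier item together with a success on a harder one. The Pattern B below is hypothetical, chosen only to share Pattern A's score of six:

```python
def score_and_disorder(responses):
    """responses ordered easiest item first. Returns the raw score and
    the count of pairs where an easier item is failed while a harder
    one is passed (zero for a perfectly ordered pattern)."""
    score = sum(responses)
    disorder = sum(
        1
        for i in range(len(responses))
        for j in range(i + 1, len(responses))
        if responses[i] == 0 and responses[j] == 1
    )
    return score, disorder

pattern_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # consistent with the ordering
pattern_b = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # same score of six, inverted
print(score_and_disorder(pattern_a))  # (6, 0)
print(score_and_disorder(pattern_b))  # (6, 24)
```

The score alone cannot tell these two persons apart; only the pattern behind the score reveals that the second record contradicts the variable.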
The Pattern B example is an important one because it shows us that even when we have constructed items that can define a valid variable we still have also to validate every person's response pattern before proceeding to use their score as a basis for estimating their measure. When item calibrations have been validated by enough suitable persons, then most of the response patterns we encounter among suitable persons will approximate Pattern A. However, the possibility of occurrences verging on Pattern B forces us to examine and validate routinely the response pattern of every person tested before we can presume to estimate a measure from their test score.
Four steps must be taken to use a test to measure a person. First, we must work out a clear idea of the variable we intend to make measures on. Second, we must construct items which are believable realizations of this idea and which can elicit signs of it in the behavior of the persons we want to measure. Third, we must demonstrate that these items when taken by suitable persons can lead to results that are consistent with our intentions. Finally, before we can use any person's score as a basis for their measure, we must determine whether or not their particular pattern of responses is, in fact, consistent with our expectations.
A test score is intended to locate a person on the variable defined by the test items taken. Nearly everyone who uses test scores supposes that the person's location on the variable is satisfactorily determined either by the score itself, or by some linear function of the score such as a percent correct or a norm-based scale value. It is taken for granted that the score, or its scale equivalent, tells us something about the person tested that goes beyond the moment or materials of the testing. It is also taken for granted that scores are suitable for use in the arithmetic necessary to study growth and compare groups. But do scores actually have the properties necessary to make it reasonable to use them in these ways?
In order for a particular score to have meaning it must come from a response pattern which is consistent with items that define a variable. But even the demonstration of item validity and response validity does not guarantee that the score will be useful. In order to generalize about the person beyond their score, in order to discover what their score implies, we must also take into account and adjust for the particulars of the test items used. How, then, does a person's test score depend on the characteristics of the items in the test they take?
FIGURE 1.2.1
[figure: one person's ability β marked on the variable against five eight-item tests (Very Easy, Very Hard, Narrow Hard, Narrow Easy, Wide Easy), with the expected score on each test: 8, 0, 1, 7 and 5]
Figure 1.2.1 shows what can happen when one person at a particular ability level takes five different tests all of which measure on the same variable but which differ in the level and spread of their item difficulties from easy to hard and narrow to wide. The difficulties of the eight items in each test are marked on the line of the variable. In order to see each test separately we have redrawn the line of the variable five times, once for each test.

The person's ability measure is also marked on each line so that we can see how this person stands with respect to each test. While each test has a different position on the variable depending on the difficulties of its items, this person's position, of course, is the same on each line. Figure 1.2.1 also shows the scores we would expect this person most often to get on these five tests.
The first, Very Easy Test, has items so easy for this person that we expect a test score of eight. The second, Very Hard Test, has such hard items that we expect a score of zero. The third, Narrow Hard Test, has seven of its items above the person's ability and one below. In this situation the score we would expect most often to see would be a one. The fourth, Narrow Easy Test, has seven of its items below the person's ability and so we expect a score of seven. Finally the fifth, Wide Easy Test, has five items which should be easy for them. Even though this test is centered at the same position on the variable as the Narrow Easy Test just above it in Figure 1.2.1 and so has the same average difficulty level, nevertheless, because of its greater width in item difficulty, we expect only a score of five.
For one person we have five expected scores: zero, one, five, seven and eight! Although we know the person's ability does not change, the five different scores, as they stand, suggest five different abilities. Test scores obviously depend as much on the item characteristics of the test as on the ability of the person taking the test.
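The pattern of five expected scores can be reproduced, approximately, with a short computation. Under a logistic response model the expected score is simply the sum of the person's success probabilities over the items; the item difficulties below are our own illustrative choices, not the values behind Figure 1.2.1:

```python
import math

def expected_score(beta, difficulties):
    # expected score = sum of the person's success probabilities
    return sum(1.0 / (1.0 + math.exp(d - beta)) for d in difficulties)

beta = 0.0  # the one person's ability
tests = {   # five hypothetical eight-item tests on the same variable
    "very easy": [-6.0 + 0.4 * i for i in range(8)],
    "very hard": [3.0 + 0.4 * i for i in range(8)],
    "narrow hard": [-2.0] + [2.0 + 0.2 * i for i in range(7)],
    "narrow easy": [-4.0 + 0.2 * i for i in range(7)] + [2.0],
    "wide easy": [-5.0 + 1.0 * i for i in range(8)],
}
for name, deltas in tests.items():
    print(f"{name:12s} expected score = {expected_score(beta, deltas):.1f}")
```

Rounded, these come out near 8, 0, 1, 7 and 5, reproducing the pattern of Figure 1.2.1 for one fixed ability.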
If the meaning of a test score depends on the characteristics of the test items, however, then before we can determine a person's ability from their test score we must "adjust" their score for the effects of the particular test items from which that particular score comes. This adjustment must be able to turn test-bound scores into measures of person ability which are test-free.
Unfortunately, with test scores like zero, in which there is no instance of success, and the eight of our example, in which there is no instance of failure, there is no satisfactory way to settle on a finite measure for the person. All we can do in those situations is to observe that the person who scored all incorrect or all correct is substantially below or above the operating level of the test they have taken. If we wish to estimate a finite measure for such a person, then we will have to find a test for them which is more appropriate to their level of ability.
The dependence of test scores on item difficulty is a problem with which most test users are familiar. Almost everyone realizes that fifty percent correct on an easy test does not mean as much as fifty percent correct on a hard test. Some test users even realize that seventy-five percent correct on a narrow test does not imply as much ability as seventy-five percent correct on a wide test. But there is another problem in the use of test scores which is often overlooked.
In the statistical use of test scores, floor and ceiling effects are occasionally recognized. But they are almost never adjusted for. These boundary effects cause any fixed difference of score points to vary in meaning over the score range of the test. The distance on the variable a particular difference in score points implies is not the same from one end of the test to the other. A difference of five score points, for example, implies a larger change in ability at the ends of a test than in the middle.
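A quick way to see this (our illustration, using the log-odds transformation of scores developed later in the book) is to compare what the same five-point gain is worth, in logits, at the middle and near the top of a hypothetical 80-item test:

```python
import math

def score_logit(r, test_length):
    # the log-odds of a score r on a test of L items: ln(r / (L - r));
    # undefined at r = 0 and r = L, where no finite measure exists
    return math.log(r / (test_length - r))

L = 80
middle_gain = score_logit(45, L) - score_logit(40, L)
top_gain = score_logit(75, L) - score_logit(70, L)
print(f"40 -> 45: {middle_gain:.2f} logits")
print(f"70 -> 75: {top_gain:.2f} logits")
```

The same five score points are worth roughly three times as much near the ceiling as in the middle, which is exactly the boundary effect described above.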
Figure 1.2.2 illustrates this problem with test scores. We show two persons with measures, βA and βB, who are a fixed distance apart on the same variable. Both persons are administered five different tests all measuring on this variable. The persons' locations and hence their measurable difference on the variable remain the same from test to test, but their most probable scores vary widely. This is because the five tests differ in their item difficulty level, spread and spacing. Let's see how the resulting expected scores reflect the fixed difference between these two persons.
Test I is composed of eight items all of which fall well between Person A and Person B. We expect Person A to get none of these items correct for a score of zero while we expect Person B to get all eight items correct for a score of eight. On this test their abilities will usually appear to be eight score points apart. That is as far apart in ability as it is possible to be on this test.
Test II is composed of eight items all of which are well below both persons. We expect both persons to get scores of eight because this test is too easy for both of them. Now their expected difference in test scores is zero and their abilities will usually appear to be the same!

Test III is composed of eight very hard items. Now we expect both persons to get scores of zero because this test is too hard for them. Once again their expected score difference is zero and their abilities will usually appear to be the same.

Test I was successful in separating Persons A and B. Tests II and III failed because they were too far off target. Perhaps it is only necessary to center a test properly in order to observe the difference between two persons.
8 BEST TEST DESIGN
[Figure 1.2.2: Five tests (I through V) administered to Persons A and B, whose measures βA and βB are a fixed distance apart on the variable. The figure shows each test's item positions together with the persons' expected scores and expected score differences.]
THE M EASUREM ENT M O DEL 9
Test IV is centered between Person A and Person B but its items are so spread out that there is a wide gap in its middle into which Person A and Person B both fall. The result is that both persons can be expected to achieve scores of four because four items are too easy and four items are too hard for both of them. Even for this test which is more or less centered on their positions, their expected score difference is zero and their abilities will still usually appear to be the same.
Test V, at last, is both wide and fairly well centered on Persons A and B. It contains two items which fall between their positions and therefore separate them. We expect Person A to get the four easiest items correct for a most probable score of four. As for Person B, however, we expect them not only to get the same four items correct but also the next two harder ones because these two items are also below Person B's ability level. Thus on Test V the expected difference in scores between Persons A and B becomes two. On this test their abilities will usually appear to be somewhat, but not extremely, different.
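The reasoning above can be checked with a small sketch. This is our illustration, not the book's: the item difficulties are invented, and a person is counted as expected to pass exactly those items that lie below their ability, the deterministic rule the text argues with.

```python
# Expected scores for two persons on five hypothetical tests in the spirit of
# Figure 1.2.2. All difficulties below are illustrative assumptions.

def expected_score(ability, item_difficulties):
    """Count the items the person is expected to get correct."""
    return sum(1 for d in item_difficulties if d < ability)

beta_A, beta_B = -1.0, 1.0   # two persons a fixed distance apart

tests = {
    "I":   [-0.7, -0.5, -0.3, -0.1, 0.1, 0.3, 0.5, 0.7],       # all between A and B
    "II":  [-4.0, -3.8, -3.6, -3.4, -3.2, -3.0, -2.8, -2.6],   # far too easy
    "III": [2.6, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0],           # far too hard
    "IV":  [-4.0, -3.5, -3.0, -2.5, 2.5, 3.0, 3.5, 4.0],       # wide gap in middle
    "V":   [-3.0, -2.0, -1.5, -1.2, -0.5, 0.5, 1.5, 3.0],      # wide and centered
}

for name, items in tests.items():
    a = expected_score(beta_A, items)
    b = expected_score(beta_B, items)
    print(f"Test {name}: A={a}, B={b}, difference={b - a}")
```

Running it reproduces the pattern in the text: a difference of eight on Test I, zero on Tests II, III and IV, and two on Test V.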
What can we infer about the differences in ability between Persons A and B from scores like these? Persons A and B will tend to appear equally able on Tests II, III, and IV, somewhat different on Test V and as different as possible on Test I. If differences between the test scores of the same two persons can be made to vary so widely merely by changing the difficulties of the items in the test, then how can we use differences in test scores to study ability differences on a variable?
The answer is, we can't. Not as they stand. In order to use test scores, which are not linear in the variable they imply, to analyze differences we must find a way to transform the test scores into measures which approximate linearity.
In this section we have illustrated two serious problems with test scores. The first illustration shows how test scores are test-bound and how we have to adjust them for the characteristics of their test items before we can use the scores as a basis for measurement. The second illustration shows how test scores do not mark locations on their variable in a linear way and how we need to transform test scores into measures that are linear before we can use them to study growth or to compare groups.
The discussions in Sections 1.1 and 1.2 establish our need for 1) valid items which can be demonstrated to define a variable, 2) valid response patterns which can be used to locate persons on this variable, 3) test-free measures that can be used to characterize persons in a general way and 4) linear measures that can be used to study growth and compare groups. Now we must build a method that comes to grips with these requirements.
The responses of individual persons to individual items are the raw data with which we begin. The method we develop must take these data and make from them item calibrations and person measures with the properties we require. Figure 1.3.1 shows a very simple data matrix containing the responses of eight persons to a five item test. The five items are named at the top of the matrix. The eight persons are named at the left. The response of each person to each item is indicated by "1" for a correct response and "0" for an incorrect response. Notice that the responses in Figure 1.3.1 have been summed across the items and entered on the right side of the matrix as person scores and down the persons and entered at the bottom of the matrix as item scores.
FIGURE 1.3.1
A DATA MATRIX OF OBSERVED RESPONSES

                       Item
Person Name      1   2   3   4   5    Person Score
a                1   0   0   0   0    1
b                0   1   0   0   0    1
c                1   1   0   0   0    2
d                1   0   1   0   0    2
e                1   1   1   0   0    3
f                1   1   0   1   0    3
g                1   1   1   1   0    4
h                1   1   1   0   1    4
Item Score       7   6   4   2   1    20
Figure 1.3.1 shows what the basic data look like. But before we can put these data to work we must answer a fundamental question. Where do we think these data come from? What are these item and person scores supposed to tell us about items and persons? How do we think these patterns of 1's and 0's are produced? In order to figure out how to use these data we must set up a reasonable model for what we suppose happens when a person attempts to answer an item.
We would like a person v's ability βv, that is their location on the variable, to govern how far along the variable we can expect them to produce correct responses to items. Indeed that is the only situation in which we can use item difficulties and a person's responses to them as the basis for measuring the person.
Of course we can think of other factors which might affect a person's responses. If items are multiple-choice, some guessing is bound to occur and persons differ in how much guessing they are willing to engage in. The possibilities of disturbing influences which interfere with the clear expression and hence the unambiguous observation of ability are endless. But, if it is really the person's ability that we hope to measure, then it is up to us to arrange things so that it is the person's ability which dominates their responses.
We would also like item i's difficulty δi, that is its location on the variable, to determine how far along the variable we can expect correct responses to that item to occur. As with persons, we can think up item characteristics, such as discrimination and vulnerability to guessing, which might modify persons' responses to them. Some psychometricians attempt to estimate these additional item characteristics even though there are good reasons to expect that all such attempts must, in principle, fail. But, again, it hardly seems reasonable not to do our best to arrange things so that it is an item's difficulty which dominates how persons of various abilities respond to that item. In any case, the fact is that whenever we use unweighted scores as our test results we are assuming that, for all practical purposes, it is item difficulties, and person abilities, that dominate person responses.
[Figure 1.3.2: Person ability βv acting on item difficulty δi to produce the observed response xvi.]
These considerations lead us to set up a response model that is the simplest representation possible. Figure 1.3.2 diagrams person v with ability βv acting on item i with difficulty δi to produce the response xvi. These are the essential elements we will take into account when we try to explain the data in Figure 1.3.1. Figure 1.3.2 proposes that the response xvi which occurs when person v takes item i can be thought of as governed by the person's ability βv and the item's difficulty δi and nothing else.
Our next step is to decide how we want person ability βv and item difficulty δi to interact in order to produce xvi. What is a reasonable and useful way to set up a mathematical relation between βv and δi? Since we require that βv and δi represent locations along one common variable which they share, it is their difference (βv − δi) which is the most convenient and natural formulation of their relation.
Identifying the difference (βv − δi), however, does not finish our work because we must also decide how we want this difference to govern the value of the response xvi. Even when a person is more able than an item is difficult, so that their βv is greater than the item's δi, it will occasionally happen that this person nevertheless fails to give a correct answer to that relatively easy item so that the resulting value of xvi is "0". It will also happen occasionally that a person of moderate ability nevertheless succeeds on a very difficult item. Obviously it is going to be awkward to force a deterministic relationship onto the way (βv − δi) governs the value of response xvi. A better way to deal with this problem is to acknowledge that the way the difference (βv − δi) influences the response xvi can only be probabilistic and to set up our response model accordingly.
Figure 1.3.3 shows how it would be most reasonable to have the difference (βv − δi) affect the probability of a correct response. When βv is larger than δi, so that the ability level of person v is greater than the difficulty level of item i and their difference (βv − δi) is greater than zero, then we want the probability of a correct answer to be greater than one half. When, on the other hand, the ability level of person v is less than the difficulty level of item i, so that their difference (βv − δi) is less than zero, then we want the probability of a correct answer to be less than one half. Finally, when the levels of person ability and item difficulty are the same so that their difference (βv − δi) is zero, then the only probability that seems reasonable to assign to a correct (or to an incorrect) answer is exactly one half.
The curve in Figure 1.3.4 summarizes the implications of Figure 1.3.3 for all reasonable relationships between probabilities of correct responses and differences between person ability and item difficulty. This curve specifies the conditions our response model must fulfill. The differences (βv − δi) could arise in two ways. They could arise from a variety of person abilities reacting to a single item or they could arise from a variety of item difficulties testing the ability of one person. When the curve is drawn with ability β as its variable so that it describes an item, it is called an item characteristic curve (ICC) because it shows the way the item elicits responses from persons of every ability. When the curve is drawn with difficulty δ as its variable so that it describes how a person responds to a variety of items, we can call it a person characteristic curve (PCC).
FIGURE 1.3.3

1. When βv > δi, then (βv − δi) > 0 and P{xvi = 1} > 1/2.

2. When βv < δi, then (βv − δi) < 0 and P{xvi = 1} < 1/2.

3. When βv = δi, then (βv − δi) = 0 and P{xvi = 1} = 1/2.
1.4 THE RASCH MODEL

A simple way to meet these conditions is to use the difference (βv − δi) as the exponent of the natural base e, that is, to work with exp(βv − δi). This exponential expression varies between zero and plus infinity and we can bring it into the interval between zero and one by forming the ratio

exp(βv − δi) / [1 + exp(βv − δi)].

This formulation has a shape which follows the ogive in Figure 1.3.4 quite well. It can be used to specify the probability of a successful response as

P{xvi = 1 | βv, δi} = exp(βv − δi) / [1 + exp(βv − δi)].   [1.4.1]
Any mathematical form which describes an ogive of the shape in Figure 1.3.4 could provide a solution to the linearity problem by transforming scores which are restricted between 0 and 100 percent into "measures" which run from minus infinity to plus infinity.
Any ogive and any formulation, however, will not do. In fact, only the formulation of Equation 1.4.1, the Rasch model, allows us to estimate βv and δi independently of one another in such a way that the estimates of βv are freed from the effects of the δi's and the estimates di are freed from the effects of the βv's.
The resulting conditional maximum likelihood estimators are consistent, efficient, and sufficient (Andersen, 1970, 1971, 1972a, 1973, 1977; Haberman, 1977). Simple approximations for these conditional maximum likelihood estimators which are accurate enough for almost all practical purposes are described in Wright and Panchapakesan (1969), Wright and Douglas (1975a, 1975b, 1977a, 1977b) and Wright and Mead (1976). These procedures have been useful in a wide variety of applications (Connolly, Nachtman and Pritchett, 1971; Woodcock, 1974; Willmott and Fowles, 1974; Rentz and Bashaw, 1975, 1977; Andrich, 1975; Mead, 1975; Wright and Mead, 1977; Cornish and Wines, 1977; Draba, 1978; Elliott, Murray and Pearson, 1977).
πvi = exp(βv − δi) / [1 + exp(βv − δi)]

Ivi = πvi(1 − πvi)
We can see in Equation 1.4.1 that when person v is smarter than item i is difficult, then βv is more than δi, their difference is positive and the probability of success on item i is greater than one half. The more the person's ability surpasses the item's difficulty, the greater this positive difference and the nearer the probability of success comes to one. But when the item is too hard for the person, then βv is less than δi, their difference is negative and the person's probability of success is less than one half. The more the item overwhelms the person, the greater this negative difference becomes and the nearer the probability of success comes to zero.
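This behavior of Equation 1.4.1 can be confirmed directly. A minimal sketch, with illustrative abilities and difficulties:

```python
import math

# The Rasch response probability of Equation 1.4.1, checked against the three
# conditions of Figure 1.3.3. The specific values of beta and delta are
# illustrative assumptions.

def rasch_p(beta, delta):
    """Probability of a correct response for ability beta on difficulty delta."""
    return math.exp(beta - delta) / (1 + math.exp(beta - delta))

print(rasch_p(2.0, 1.0))    # beta > delta: greater than one half
print(rasch_p(1.0, 2.0))    # beta < delta: less than one half
print(rasch_p(1.5, 1.5))    # beta = delta: exactly one half
```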
The mathematical units for βv and δi defined by this model are called "logits." A person's ability in logits is their natural log odds for succeeding on items of the kind chosen to define the "zero" point on the scale. And an item's difficulty in logits is its natural log odds for eliciting failure from persons with "zero" ability.
Table 1.4.1 gives examples of various person abilities and item difficulties in logits, their differences (βv − δi) and the success probabilities which result. The first six rows illustrate various person abilities and their success probabilities when provoked by items of zero difficulty. The last six rows give examples of various item difficulties and the probabilities of success on them by persons with zero ability.
The origin and scale of the logits used in Table 1.4.1 are arbitrary. We can add any constant to all abilities and all difficulties without changing the difference (βv − δi). This means that we can place the zero point on the scale so that negative difficulties and abilities do not occur. We can also introduce any scaling factor we find convenient including one large enough to eliminate any need for decimal fractions. Chapter 8 investigates these possibilities in detail.
The last column of Table 1.4.1 gives the relative information Ivi = πvi(1 − πvi) available in a response observed at each (βv − δi). When item difficulty δi is within a logit of person ability βv, the information about either δi or βv in one observation is greater than .20. But when item difficulty is more than two logits off target, the information is less than .11, and for |βv − δi| > 3 it is less than .05. The implications for efficient calibration sampling and best test design are that responses in the |βv − δi| < 1 region are worth more than twice as much for calibrating items or measuring persons as those outside of |βv − δi| > 2 and more than four times as much as those outside of |βv − δi| > 3.
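These information values follow directly from Ivi = πvi(1 − πvi). A quick numerical check (our sketch, not the book's):

```python
import math

# Information available in a single response at various distances between
# person ability and item difficulty, reproducing the thresholds in the text.

def info(beta, delta):
    """I = pi(1 - pi) for the Rasch success probability pi."""
    p = math.exp(beta - delta) / (1 + math.exp(beta - delta))
    return p * (1 - p)

print(round(info(0, 0), 3))   # on target: 0.25, the maximum
print(round(info(1, 0), 3))   # one logit off: just under .20
print(round(info(2, 0), 3))   # two logits off: under .11
print(round(info(3, 0), 3))   # three logits off: under .05
```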
1.5 USING THE RASCH MODEL FOR CALIBRATING AND MEASURING

We have established the need for an explicit approach to measurement and shown how measurement problems can be addressed with a model for what happens when a person takes an item. Now we are ready to work through the mathematics of this model in order to find out how we can use the model to calibrate items and measure persons. The model specifies the probability of person v with ability βv giving response xvi to item i with difficulty δi as

P{xvi | βv, δi} = exp[xvi(βv − δi)] / [1 + exp(βv − δi)]   [1.5.1]

in which xvi = 1 for a correct response and xvi = 0 for an incorrect one. When we insert each of these values of xvi into Equation 1.5.1 we find that it breaks down into the complementary expressions

P{xvi = 1 | βv, δi} = exp(βv − δi) / [1 + exp(βv − δi)]

and

P{xvi = 0 | βv, δi} = 1 / [1 + exp(βv − δi)].
[Figure 1.5.1: The N × L matrix of responses xvi, with person scores rv = Σi xvi (i = 1, L) as row margins and item scores si = Σv xvi (v = 1, N) as column margins.]
What happens when we analyze these data as though they were governed by the model of Equation 1.5.1? According to that model the only systematic influences on the production of the xvi's are the N person abilities (βv) and the L item difficulties (δi). As a result, apart from these parameters, the xvi's are modeled to be quite independent of one another. This means that the probability of the whole data matrix ((xvi)), given the model and its parameters (βv) and (δi), can be expressed as the product of the probabilities of each separate xvi given by Equation 1.5.1 continued over all v = 1, N and all i = 1, L:

P{((xvi)) | (βv), (δi)} = Πv Πi {exp[xvi(βv − δi)] / [1 + exp(βv − δi)]}.   [1.5.4]
When we move the continued product operators Πv and Πi in the numerator of Equation 1.5.4 into the exponential expression,

Πv Πi exp[xvi(βv − δi)] = exp[Σv Σi xvi(βv − δi)].

Then, since

Σv Σi xvi βv = Σv rv βv

and

Σv Σi xvi δi = Σi si δi,

the model probability of the data matrix becomes

P{((xvi)) | (βv), (δi)} = exp[Σv rv βv − Σi si δi] / Πv Πi [1 + exp(βv − δi)].   [1.5.5]
Equation 1.5.5 is important because it shows that in order to estimate the parameters (βv) and (δi), we need only the marginal sums of the data matrix, (rv) and (si). This is because that is the only way the data ((xvi)) appear in Equation 1.5.5. Thus the person scores (rv) and item scores (si) contain all the modelled information about person measures and item calibrations.
Finally, the numerator of Equation 1.5.5 can be factored into two parts so that the model probability of the data matrix becomes
P{((xvi)) | (βv), (δi)} = [exp(Σv rv βv)] [exp(−Σi si δi)] / Πv Πi [1 + exp(βv − δi)]   [1.5.6]
Equation 1.5.6 is important because it shows that the person and item parameters can be estimated independently of one another. The separation of (Σv rv βv) and (Σi si δi) in Equation 1.5.6 makes it possible to condition either set of parameters out of Equation 1.5.6 when estimating the other set. This means, in the language of statistics, that the scores (rv) and (si) are sufficient for estimating the person measures and the item calibrations.
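This sufficiency can be demonstrated numerically: two different data matrices with identical person and item scores receive exactly the same probability under the model. A sketch, with invented parameter values:

```python
import math

# Log of Equation 1.5.4 for a whole data matrix. Two matrices that share the
# same margins (r_v) and (s_i) must receive the same probability, since the
# data enter the likelihood only through those margins.

def log_prob(matrix, betas, deltas):
    lp = 0.0
    for v, row in enumerate(matrix):
        for i, x in enumerate(row):
            d = betas[v] - deltas[i]
            lp += x * d - math.log(1 + math.exp(d))
    return lp

betas = [0.5, -0.5]     # illustrative person abilities
deltas = [-1.0, 1.0]    # illustrative item difficulties

m1 = [[1, 0],
      [0, 1]]
m2 = [[0, 1],
      [1, 0]]   # same person scores (1, 1) and item scores (1, 1) as m1

print(abs(log_prob(m1, betas, deltas) - log_prob(m2, betas, deltas)) < 1e-12)
```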
Because of this we can use the person scores (rv) to remove the person parameters (βv) from Equation 1.5.6 when calibrating items. This frees the item calibrations from the modelled characteristics of the persons and in this way produces sample-free item calibrations. As for measuring persons, we could use the item scores (si) to remove the item parameters (δi) from Equation 1.5.6. When we come to person measurement, however, we will find it more convenient to work directly from the estimated item calibrations (di).
There are several ways that Equation 1.5.6 can be used to estimate values for βv and δi. The ideal way is to use the sufficient statistics for persons (rv) to condition person parameters (βv) out of the equation. This leaves a conditional likelihood involving only the item parameters (δi) and they can be estimated from this conditional likelihood (Fischer and Scheiblechner, 1970; Andersen, 1972a, 1972b; Wright and Douglas, 1975b, 1977b; Allerup and Sorber, 1977; Gustafsson, 1977).
But this ideal method is impractical and unnecessary. Computing times are excessive. Round-off errors limit application to tests of fifty items at most. And, in any case, results are numerically equivalent to those of quicker and more robust methods. A convenient and practical alternative is to use Equation 1.5.6 as it stands. To learn more about this unconditional estimation of item parameters see Wright and Panchapakesan (1969), Wright and Douglas (1975b, 1977a, 1977b) and Chapter 3, Section 3.4 of this book.
Even this unconditional method, however, is often unnecessarily detailed and costly for practical work. If the persons we use to calibrate items are not too unsymmetrically distributed in ability and not too far off target, so that the impact of their ability distribution can be more or less summarized by its mean and variance, then we can use a very simple and workable method for estimating item difficulties. This method, called PROX, was first suggested by Leslie Cohen in 1973 (see Wright and Douglas, 1977a; Wright, 1977).
1.6 A SIMPLE USEFUL ESTIMATION PROCEDURE

Three methods of parameter estimation will be used in this book. The general unconditional method called UCON requires a computer and a computer program such as BICAL (Wright and Mead, 1976). UCON is discussed and illustrated in Chapter 3. A second method called UFORM, which can be done by hand with the help of the simple tables given in Appendix C, is discussed and applied in Chapter 7. The third method, PROX, is completely manageable by hand. In addition the simplicity of PROX helps us to see how the Rasch model works to solve measurement problems. The derivations of the UFORM and PROX equations are given in Wright and Douglas (1975a, 1975b).
PROX assumes that person abilities (βv) are more or less normally distributed with mean M and standard deviation σ and that item difficulties (δi) are also more or less normally distributed with average difficulty H and difficulty standard deviation ω.
If

βv ~ N(M, σ²)

and

δi ~ N(H, ω²),

then for any person v with person score rv on a test of L items it follows that

bv = H + X ln[rv / (L − rv)]   [1.6.1]

and for any item i with item score si in a sample of N persons it follows that

di = M + Y ln[(N − si) / si]   [1.6.2]

in which the expansion factors X and Y are

X = (1 + ω²/2.89)^½   [1.6.3]

and

Y = (1 + σ²/2.89)^½.   [1.6.4]

The value 2.89 = 1.7² comes from the scaling factor 1.7 which brings the logistic ogive into approximate coincidence with the normal ogive. This is because the logistic ogive for values of 1.7z is never more than one percent different from the normal ogive for values of z.
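The one percent claim is easy to verify numerically (our sketch, not the book's):

```python
import math

# Compare the logistic ogive at 1.7z with the normal ogive at z over a wide
# range of z. The maximum discrepancy stays just under one percent.

def logistic(x):
    return 1 / (1 + math.exp(-x))

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

zs = [i / 100 for i in range(-400, 401)]
max_diff = max(abs(logistic(1.7 * z) - normal_cdf(z)) for z in zs)
print(max_diff < 0.01)
```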
This estimation method can be applied directly to the observed scores. The person score logit is

yv = ln[rv / (L − rv)]   [1.6.8]

and the sample score logit of item i is

xi = ln[(N − si) / si],

with logit mean x̄ = Σi xi / L and logit variance

U = (Σi xi² − L x̄²) / (L − 1).   [1.6.11]

To complete this estimation, we set the test center at zero so that H = 0. Then

di = M + Y xi = Y(xi − x̄)   [1.6.13]

and

bv = H + X yv = X yv.   [1.6.14]

Finally the estimated person sample mean becomes

M ≈ −Y x̄.   [1.6.17]
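The PROX equations above can be sketched in a few lines. In this sketch H = 0 and the variances σ² and ω² are supplied directly rather than extracted from the observed logits, a simplification of the full procedure; all numerical values are illustrative:

```python
import math

# PROX estimates in the form of Equations 1.6.1-1.6.4, with supplied
# (assumed) values for the ability variance sigma2 and difficulty
# variance omega2.

def prox_item_difficulty(s, N, M, sigma2):
    """d_i = M + Y * ln[(N - s_i)/s_i], with Y = (1 + sigma2/2.89)^(1/2)."""
    Y = math.sqrt(1 + sigma2 / 2.89)
    return M + Y * math.log((N - s) / s)

def prox_person_ability(r, L, H, omega2):
    """b_v = H + X * ln[r_v/(L - r_v)], with X = (1 + omega2/2.89)^(1/2)."""
    X = math.sqrt(1 + omega2 / 2.89)
    return H + X * math.log(r / (L - r))

# Illustrative case: a person with 30 right out of 40 items on a test
# centered at H = 0 with difficulty variance omega2 = 1.
b = prox_person_ability(30, 40, 0.0, 1.0)
print(round(b, 2))
```

Note that an item answered correctly by exactly half the sample gets difficulty M, and a person with half the items right gets ability H, as the logit of even odds is zero.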
Once we have estimated bv and di we can use them to obtain the difference between what the model predicts and the data we have actually observed. These residuals from the model are calculated by estimating the model expectation at each xvi from bv and di and subtracting this expectation from the xvi which was observed. The model expectation for xvi is

E{xvi} = πvi

with model variance

V{xvi} = πvi(1 − πvi),

so that the standardized residual of xvi is

zvi = (xvi − pvi) / [pvi(1 − pvi)]^½   [1.6.19]

where

pvi = exp(bv − di) / [1 + exp(bv − di)]   [1.6.20]

is the estimated probability of a correct answer. If the data fit the model this standardized residual ought to be distributed more or less normally with mean zero and variance one, so that its square zvi² is distributed approximately as chi-square with one degree of freedom. We can use statistics built from these squared residuals as guidelines for evaluating the extent to which any particular set of data can be managed by our measurement model.
We can calculate the sum of squared residuals zvi² for each person. According to the model this sum of squared normal deviates should approximate a chi-square distribution with about (L − 1) degrees of freedom. This gives us a chi-square statistic

Cv = Σi zvi²   [1.6.21]

with fv = L − 1 degrees of freedom, and a mean square

vv = Cv / fv ~ F(fv, ∞)   [1.6.23]

which approximates an F-distribution when the person's responses fit the model.
The sum of squared residuals for each item can be used in the same way to evaluate item fit. For items

Ci = Σv zvi²

with

fi = N − 1

and

vi = Ci / fi ~ F(fi, ∞).

Finally, since xvi can only equal one or zero, we can use the definition of pvi given in Equation 1.6.20 to calculate zvi and zvi² directly as

zvi = (2xvi − 1) exp[(2xvi − 1)(di − bv)/2]

and

zvi² = exp[(2xvi − 1)(di − bv)].

This relation can also be worked backwards. If we already have a zvi² and wish to calculate the probability of the observed response xvi to which it refers in order to decide whether or not that response is too improbable to believe, then we can use

P{xvi} = 1 / (1 + zvi²).   [1.6.29]

In contrast with the pvi of Equation 1.6.20 which is the estimated probability of a correct answer, the probability of Equation 1.6.29 applies to xvi whatever value it takes, whether xvi = 1 for a correct answer or xvi = 0 for an incorrect one.
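This backwards relation can be verified numerically. In this sketch the values of b and d are invented:

```python
import math

# Check the identity P{x} = 1/(1 + z^2): the model probability of the observed
# response, correct or incorrect, is recovered from its squared standardized
# residual.

def p_correct(b, d):
    return math.exp(b - d) / (1 + math.exp(b - d))

def z_residual(x, b, d):
    """Standardized residual (x - p)/sqrt(p(1 - p)) of a dichotomous response."""
    p = p_correct(b, d)
    return (x - p) / math.sqrt(p * (1 - p))

b, d = 1.0, -0.5   # illustrative ability and difficulty
for x in (0, 1):
    z = z_residual(x, b, d)
    prob_of_x = p_correct(b, d) if x == 1 else 1 - p_correct(b, d)
    print(abs(prob_of_x - 1 / (1 + z * z)) < 1e-12)
```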
Sections 1.1, 1.2 and 1.3 discuss the purpose of tests, the use of test scores and the problems of generality and linearity in making measures. Sections 1.4, 1.5 and 1.6 describe a simple and practical solution to these measurement problems. Because the mathematics are new it might seem that using the Rasch model will take us far away from the traditional item statistics with which we are familiar. This is not so.
Applying the Rasch model in test development gives us new versions of the old statistics. These new statistics contain all of the old familiar information, but in a form which solves most of the measurement problems that have always beset traditional test construction. To show this we will examine the three most common traditional item and person statistics and see how closely they relate to their corresponding Rasch measurement statistics.
The most familiar traditional item statistic is the item "p-value." This is the proportion of persons in a specified sample who get that item correct. The PROX estimation equation (1.6.2) gives us a convenient way to formulate the relationship between the traditional item p-value and Rasch item difficulty. If the p-value for item i is expressed as

pi = si / N

in which si is the number of persons in the sample of N persons who answered item i correctly, then the PROX estimated Rasch item difficulty is

di = M + Y ln[(1 − pi) / pi].   [1.7.1]

Equation 1.7.1 shows that the Rasch item difficulty di is in a one-to-one relation with the item p-value represented by pi. It also shows that this one-to-one relation is curvilinear and involves the ability mean M and variance σ² of the calibrating sample. The logit

ln[(1 − pi) / pi]

transforms the item p-value, which is not linear in the implied variable, into a new value which is. This new logit value expresses the item difficulty on an equal interval scale and makes the subsequent correction of the item's p-value for the ability mean M and variance σ² of the calibrating sample easy to accomplish.
This correction is made by scaling the logit to remove the effects of sample variance σ² and translating this scaled logit to remove the effects of sample mean M. The resulting Rasch item difficulties are not only on an equal interval scale but they are also freed of the observed ability mean and variance of the calibrating sample. Just as the item p-value pi has a binomial standard error of

SE(pi) = [pi(1 − pi) / N]^½   [1.7.2]

so the PROX item difficulty di has its own closely related standard error of

SE(di) = Y / [N pi(1 − pi)]^½.   [1.7.3]
But there are two important differences between Equations 1.7.2 and 1.7.3. Unlike the p-value standard error in Equation 1.7.2, the Rasch standard error in Equation 1.7.3 is corrected for the ability variance σ² of the calibrating sample. The second difference between these two formulations is more subtle, but even more important.
The traditional item p-value standard errors in Equation 1.7.2 are maximum in the middle at p-values near one-half and zero at the extremes at p-values of zero or one. This makes it appear that we know the most about an item, that is have the smallest standard error for its p-value, when, in fact, we actually know the least. This is because the item p-value is focused on the calibrating sample as well as on the item. As the sample goes off target for the item, the item p-value nears zero or one and its standard error nears zero. This assures us that the item p-value for this particular sample is extreme but it
tells us nothing else about the item. Thus even though our knowledge of the item's p-value is increasing, our information concerning the actual difficulty of the item is decreasing. When item p-values are zero or one, the calibrating sample which was intended to tell us how that item works is shown to be too able or too unable to interact with the item. We know exactly in which direction to look for the item difficulty, but we have no information as to where in that direction it might be.
In contrast, the Rasch standard error for di varies in a more reasonable manner. The expression pi(1 − pi), which goes to zero as pi goes to zero or one, appears in the denominator of Equation 1.7.3 instead of in the numerator, as it does in Equation 1.7.2. Therefore, the Rasch standard error is smallest at pi = .5, where the sample is centered on the item and thus gives us the most information about how that item functions. At the extremes, however, where we have the least information, the Rasch standard error goes to infinity, reminding us that we have learned almost nothing about that item from this sample.
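The contrast between the two standard errors can be sketched as follows, using the PROX form with an assumed sample ability variance:

```python
import math

# Binomial standard error of an item p-value versus the PROX difficulty
# standard error of the form Y/[N p(1 - p)]^(1/2). As p moves toward the
# extremes the first shrinks toward zero while the second grows without bound.

def se_pvalue(p, N):
    return math.sqrt(p * (1 - p) / N)

def se_prox_difficulty(p, N, sigma2):
    Y = math.sqrt(1 + sigma2 / 2.89)   # expansion factor for ability variance
    return Y / math.sqrt(N * p * (1 - p))

N, sigma2 = 100, 1.0   # illustrative sample size and ability variance
for p in (0.5, 0.9, 0.99):
    print(p, round(se_pvalue(p, N), 3), round(se_prox_difficulty(p, N, sigma2), 3))
```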
The second most widely used traditional item statistic is the point biserial correlation between the sampled persons' dichotomous responses to an item and their total test scores. The item point-biserial has two characteristics which interfere with its usefulness as an index of how well an item fits with the set of items in which it appears. First, there is no clear basis for determining what magnitude of item point-biserial establishes item acceptability. Rejecting the statistical hypothesis that an item point-biserial is zero does not produce a satisfactory statistical criterion for validating an item. The second interfering characteristic is that the magnitude of the point-biserial is substantially influenced by the score distribution of the calibrating sample. A given item's point-biserial is largest when the persons in the sample are spread out in scores and centered on that item. Conversely, as the variance in person scores decreases or the sample level moves away from the item level, so that the p-value approaches zero or one, the point-biserial decreases to zero regardless of the quality of the item.
The Rasch statistic that corresponds in meaning to the item point-biserial is the item's mean square residual given in Equation 1.6.26. This mean square residual is not only sensitive to items which fail to correlate with the test score, but also to item point-biserials which are unexpectedly large. This happens, for example, when an additional and unmodelled variable produces a local interaction between a unique feature of the item in question and a corresponding idiosyncrasy among some members of the calibrating sample.
In contrast with the point-biserial, the Rasch item mean square residual has a useful statistical reference distribution. The reference value for testing the statistical hypothesis that an item belongs in the test is a mean square of one with a standard error of (2/f)^½ for f degrees of freedom. Thus the extent to which an observed mean square exceeds the expected value of one can be tested for its statistical significance at whatever significance level is considered useful.
The Rasch item mean square is also very nearly indifferent to the ability distribution of the calibrating sample. This provides a test of item fit which is focused on just those sample and item characteristics which remain when the modelled values for item difficulty and person abilities are removed.
The most familiar traditional person statistic is the test score, the number of correct answers the person earns on the test taken. Once again we can use the PROX estimation procedure to show the connection between the traditional test score and Rasch person ability. Using the PROX estimation equation (1.6.1) we have

bv = H + X ln[rv / (L − rv)]   [1.7.4]

in which

X = (1 + ω²/2.89)^½.

As with the item p-values we see the logit function transforming the person scores, which are not linear in the variable they imply, into an approximately linear metric. We also see this logit being scaled for test width, which is represented in Equation 1.7.4 by the item difficulty variance ω², and then being shifted to adjust for test difficulty level H so that the resulting estimated person ability is freed from the local effects of the test and becomes a test-free measure.
While traditional test practices almost always emphasize the analysis of item validity, hardly any attention is ever given to the validity of the pattern of responses leading to a person score. As far as we know no one calculates a person point-biserial coefficient in order to determine the relationship between the responses that person gives to each item and the supposedly relevant item p-values. This would be a reasonable way to apply the traditional point-biserial correlation coefficient to the supervision of person score validity. The Rasch approach to person score validity is outlined in Equations 1.6.19 through 1.6.23 and discussed and illustrated at length in Chapters 4 and 7.
There are other connections that can be made between traditional test statistics and Rasch statistics. We could review here the various ways that traditional test reliability and validity, norm referencing, criterion referencing, form equating and mastery testing are handled in Rasch measurement. But each of these topics deserves a thorough discussion and that, in fact, is the purpose of the chapters which follow. Our next step now is to see how the PROX estimation procedure works to solve a simple problem in test construction.
2 ITEM CALIBRATION BY HAND

2.1 INTRODUCTION
This chapter describes and illustrates in detail an extremely simple procedure for the Rasch calibration of test items. The procedure, called PROX, approximates the results obtained by more elaborate and hence more accurate procedures extremely well. It achieves the basic aims of Rasch item analysis, namely linearization of the latent scale and adjustment for the local effects of sample ability distribution. The assumption which makes PROX simple is that the effects on item calibration of sample ability distribution can be adequately accounted for by just a mean and standard deviation. This assumption makes PROX so simple that it can easily be applied by hand.
The data for illustrating PROX come from the administration of the 18-item Knox Cube Test, a subtest of the Arthur Point Scale (Arthur, 1947), to 35 students in Grades 2 to 7. Our analysis of these data shows how Rasch item analysis can be useful for managing not only the construction of national item banks but also the smallest imaginable measurement problem, i.e., one short test given to one roomful of examinees.
Using student correct/incorrect responses to each item of the test, we work out in detail each step of the procedure for PROX item analysis. Then, Chapter 3 reviews comparable computer analyses of the same data by both the PROX procedure and the more accurate UCON procedure used in most computer programs for Rasch item analysis. These detailed steps offer a systematic illustration of the item analysis procedure with which to compare and by which to understand computer outputs. They also demonstrate the ease of hand computations using PROX (PROX is derived and described at length in Wright and Douglas, 1975b, 1976, 1977a; Cohen, 1976, and Wright, 1977). Finally, they illustrate the empirical development of a latent trait or variable. Each step moves from the observed data toward the inferred variable, from the confines of the observed test-bound scores to the reaches of the inferred test-free measurements.
While the Arthur Point Scale covers a variety of mental tasks, the Knox Cube Test implies a single latent trait. Success on this subtest requires the application of visual attention and short-term memory to a simple sequencing task. It appears to be free from school-related tasks and hence to be an indicator of nonverbal intellectual capacity.
ITEM CALIBRATION BY HAND 29
The Knox Cube Test uses five one-inch cubes. Four of the cubes are fixed two inches apart on a board, and the fifth cube is used to tap a series on the other four. The four attached cubes will be referred to, from left to right, as "1," "2," "3" and "4" to avoid confusion when specifying any particular series to be tapped. In the original version of the test used for this example, there are 18 such series going from the two-step sequences (1-4) and (2-3) to the seven-step sequence (4-1-3-4-2-1-4). Usually, a subject is administered this test twice with another subtest from the battery intervening. However, we need use only the first administration for our analysis.
The 18 series are given in Figure 2.2.1. These are the 18 "items" of the test. Note that Items 1 and 2 require a two-step sequence; Items 3 through 6, a three-step sequence; Items 7 through 10, a four-step sequence; Items 11 through 13, a five-step sequence; Items 14 through 17, a six-step sequence; and Item 18, a seven-step sequence.
FIGURE 2.2.1
1 1 4
2 2 3
3 1 2 4
4 1 3 4
5 2 1 4
6 3 4 1
7 1 4 3 2
8 1 4 2 3
9 1 3 2 4
10 2 4 3 1
11 1 3 1 2 4
12 1 3 2 4 3
13 1 4 3 2 4
14 1 4 2 3 4 1
15 1 3 2 4 1 3
16 1 4 2 3 1 4
17 1 4 3 1 2 4
18 4 1 3 4 2 1
2.3 THE DATA FOR ITEM ANALYSIS
Student scores, the number of correct responses achieved by each student, are given at the end of each row in the last column on the right. Item scores, the total number of correct responses to each item, are given at the bottom of each column.

Inspection of Table 2.3.1 shows that the order of administration is very close to the order of difficulty. Items 1, 2 and 3 are answered correctly by all students. A second, slightly greater, level of difficulty is observed in Items 4 through 9. Then Items 10 and 11 show a sharp increase in difficulty. Items 12 through 17 are answered correctly by only a few students, and no student succeeds on Item 18. Only 12 students score successfully at least once on Items 12 through 17, and only five of these students do one or more of the six-tap items successfully.
The general plan for accomplishing item analysis begins with editing the data in Table 2.3.1 to remove persons and items for which no definite estimates of ability or difficulty can be made, i.e., those with all correct or all incorrect responses. This means that Person 35 and Items 1, 2, 3 and 18 must be set aside, leaving 34 persons and 14 items for analysis. Then the remaining information about persons and items is summarized into a distribution of person scores and a distribution of item scores.
The item scores and person scores are then converted to logits. For "item difficulty" the logit increases with the proportion of incorrect responses; for "person ability" it increases with the proportion of correct responses. The mean and variance for each distribution of logits are then computed, and the mean item logit is used to center the item logits at zero. This choice of origin for the new scale is inevitably arbitrary but must be made. Basing it on items rather than on persons, and placing it in the center of the current items, is natural and convenient.
The item logit and person logit variances are used to calculate two expansion factors, one for items and one for persons. These factors are used to calculate the final sample-free item difficulties and test-free person abilities. They are needed because the apparent relative difficulties of the items depend upon how dispersed in ability the sample of persons is. The more dispersed the persons, the more similar in difficulty will the items appear. This is also true for apparent ability. The more dispersed the test in item difficulty, the more similar in ability will the persons appear. These effects of sample spread and test width must be removed from the estimates of item difficulty and person ability, if these estimates are to be made sample-free and test-free.
Finally, the standard errors of these estimates are calculated. The standard errors are needed to assess the precision of the estimates. They depend on the same expansion factors plus the extent to which the item difficulty is centered among the person abilities and the person ability is centered among the item difficulties. The more that items or persons are centered on target, the more precise are their estimates and hence the smaller their standard errors.
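The whole plan above, from initial logits through expansion factors to final estimates, can be condensed into a short sketch. This is our own illustration, not the book's worked procedure, and it assumes the response matrix has already been edited so that no person or item has an extreme score:

```python
import math

def prox(responses):
    """Minimal sketch of the PROX plan: responses is a list of person
    response lists, 1 = correct and 0 = incorrect, already edited.
    Returns (item difficulties, person abilities) in logits.
    """
    N, L = len(responses), len(responses[0])
    item_scores = [sum(person[i] for person in responses) for i in range(L)]
    person_scores = [sum(person) for person in responses]

    # Initial logits: "difficulty" grows with the proportion incorrect,
    # "ability" with the proportion correct.
    d = [math.log((N - s) / s) for s in item_scores]
    b = [math.log(r / (L - r)) for r in person_scores]

    # Center the item logits at zero: the arbitrary but necessary origin.
    mean_d = sum(d) / L
    d = [x - mean_d for x in d]

    # Logit variances feed the two expansion factors.
    U = sum(x * x for x in d) / (L - 1)
    mean_b = sum(b) / N
    V = sum((x - mean_b) ** 2 for x in b) / (N - 1)
    denom = 1.0 - U * V / 8.35
    Y = math.sqrt((1.0 + V / 2.89) / denom)  # expands item difficulties
    X = math.sqrt((1.0 + U / 2.89) / denom)  # expands person abilities

    # Final sample-free difficulties and test-free abilities.
    return [Y * x for x in d], [X * x for x in b]
```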
TABLE 2.3.1

ORIGINAL RESPONSES OF 35 PERSONS TO THE 18 ITEMS ON THE KNOX CUBE TEST

[The person-by-item response matrix is not legible in this copy.]
32 BEST TEST DESIGN
The data matrix in Table 2.3.1 has been rearranged in Table 2.4.1 so that person
scores are ordered from low to high with their respective proportions given in the
right-most column, and item scores are ordered from high to low with their proportions
given in the bottom row.
The data matrix o f person-by-item responses in Table 2.4.1 has also been edited by
removing all items that were answered correctly by everyone or no one, and by removing
all persons who had perfect scores or who had not answered any items correctly.
The boundary lines drawn in Table 2.3.1 show the items and persons removed by the editing process. Items 1, 2 and 3 were removed because they were answered correctly by everyone. Removing these three items then brought about the removal of Person 35 because this person had only these three items correct and hence none correct after these items were removed. Item 18 was removed because no person answered this item correctly.
Editing a data matrix may require several such cycles because removing items can necessitate removing persons and vice versa. For example, had there been a person who had succeeded on all but Item 18, then removal of Item 18 would have left this person with a perfect score on the remaining items and so that person would also have had to be removed.
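Such editing cycles are easy to mechanize. The sketch below is a hypothetical helper of our own, not part of the book's procedures; it loops until no further removals occur:

```python
def edit_matrix(data):
    """Iteratively remove extreme persons and items from a response
    matrix (1 = correct, 0 = incorrect).  Returns the indices of the
    persons and items that survive the editing cycles.
    """
    persons = list(range(len(data)))
    items = list(range(len(data[0])))
    changed = True
    while changed:
        changed = False
        # Drop items everyone got right or everyone got wrong.
        for i in items[:]:
            s = sum(data[p][i] for p in persons)
            if s == 0 or s == len(persons):
                items.remove(i)
                changed = True
        # Drop persons with zero or perfect scores on the remaining items.
        for p in persons[:]:
            r = sum(data[p][i] for i in items)
            if r == 0 or r == len(items):
                persons.remove(p)
                changed = True
    return persons, items
```

Note that removing the all-correct and all-incorrect items can itself create new zero or perfect person scores, which is why the loop runs until nothing more changes.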
Why were some items and some persons removed? When no one in a sample of persons gets an item correct, that shows that the item is too difficult for this sample of persons. However, no further information is available as to just how much too difficult it actually is. When everyone gets an item correct, that shows that the item is too easy for these persons, but again, no further information is available as to exactly how much too easy the item actually is. To make a definite estimate for a very easy item we must find at least one measurable person who gets it incorrect, and for a very hard item, at least one measurable person who gets it correct. That is, we must "bracket" the item between persons at least one of whom is more and at least one of whom is less able than the item is difficult. Of course, only one person below a very easy item or above a very hard one does not give a very precise estimate of that item's difficulty.
Thus, we have insufficient data in our example to evaluate the extreme Items 1, 2, 3 and 18. We know that Items 1, 2 and 3 appear very easy and that Item 18 appears to be very hard for these persons, but we do not have enough information to specify definite estimates of the difficulties of these four items.
As for extreme persons, do persons with a zero score know nothing? Are scores of 100% indicative of persons who "know it all" or have they only answered easy questions? To make a definite estimate for a person, we must bracket that person between items that are both easier and harder than the person is able.
The boundary scores of zero and 100%, whether for items or for persons, represent incomplete information. They tell us in which direction to look for an estimate of the person's ability or the item's difficulty, but they do not tell us how far to go in that direction. For sufficient information to make a definite estimate of where the person or
the item is on the latent variable, we must find some items too easy and some items too hard for these persons, and some persons too smart and others too dumb for these items, so that each item and person is bracketed by observations. Then we can make an estimate of where they are on the variable.
From the edited data matrix in Table 2.4.1, we build a grouped distribution of the 10 different item scores and their logits incorrect, and compute the mean and variance of the distribution of these item logits over the test of 14 items. This is done in Table 2.4.2.
*These values come from Table 2.4.3 where ln[.06/.94] = −2.75. Were these calculations made with Sj and N as in ln[(N − Sj)/Sj], then ln[2/32] = −2.77. The difference between −2.75 and −2.77 is due to the rounding in Sj/N = 32/34 = 0.941176... ≈ 0.94.
TABLE 2.4.3

LOGITS FOR PROPORTIONS

[A lookup table converting proportions P to logits ln[P/(1 − P)]; its entries are not legible in this copy.]
[Table 2.4.2, the grouped distribution of item scores with their logits incorrect and the computation of the mean and variance of the item logits, is not legible in this copy.]
The mean and variance for the item logits in Column 6 are then computed from the values in Columns 7 and 8 and given beneath these columns:

    x̄. = Σ fj xj / L

    U = (Σ fj xj² − L x̄.²) / (L − 1)

in which the sums run over the G item score groups.
In this example, the mean and variance have been computed from the values in Columns 7 and 8. Hand calibration can be facilitated even further by a short-cut expression for a standard deviation proposed by Mason and Odeh (1968). To do this, sum the item logits in Column 9 (or Column 6) for the top and bottom sixth of the items ordered by difficulty, and take the square of twice the difference of these sums divided by one less than the number of items.
b. The item logits incorrect for the top three items in Column 9 are each 3.29, which times 2.33 is 3.29 × 2.33 = 7.67.

c. The item logit incorrect for the bottom item is −2.94 and for the next two items is −2.50. So, taking the lowest item, −2.94, plus 7/3 − 3/3 = 4/3 of the next two items, gives (−2.94) + (−2.50 × 1.33) = −6.27.

e. Twice this amount divided by the number of items minus one and squared becomes the variance estimate [2(13.94)/13]² = 4.6.
This short-cut value of 4.6 is somewhat smaller than 5.7, but the number of items is small and the distribution is flatter than the normal distribution assumed by the short-cut.
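The short-cut can be sketched in a few lines. For simplicity this version of ours uses a whole number of values per sixth rather than the book's fractional interpolation, so its answers will differ slightly from the hand-worked 4.6:

```python
import math

def shortcut_sd(logits):
    """Mason-Odeh short-cut standard deviation: sum the top and bottom
    sixth of the sorted values and take twice their difference over
    (n - 1).  Uses a whole number of values per sixth, unlike the
    book's fractional weighting.
    """
    values = sorted(logits)
    n = len(values)
    k = max(1, round(n / 6))
    low = sum(values[:k])
    high = sum(values[-k:])
    return 2.0 * (high - low) / (n - 1)
```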
Completion of the steps in Table 2.4.2 provides initial values for item difficulties in preparation for the adjustment which will compensate for the effect of sample spread.
In Table 2.4.4, we take identical steps with a grouped distribution of person scores in order to obtain the distribution of person score logits and hence initial values for the abilities that go with each possible score on the test.
The mean and variance for the distribution of score logits over persons are given at the base of Table 2.4.4, as is the short-cut estimate of the variance.
Note that because we are interested not only in the scores observed in this sample but also in the measurements implied by any possible score which might be observed on this test of 14 items, unobserved scores of 1, 12 and 13 have been added to Table 2.4.4, together with the initial measures for these scores. The measurement model specifies what measures are equivalent to these scores even when no persons in the sample actually earn them.
To summarize the procedure thus far (now letting each item define its own item score group for notational simplicity, so that the item index i now runs from 1 to 14 items instead of from 1 to 10 item score groups):
Now we are ready to adjust the initial calibrations and measures in Tables 2.4.2 and 2.4.4 for the local effects of the person ability distribution of the sample and the item difficulty distribution of the test.
a. The person ability expansion factor due to test width is

    X = [(1 + U/2.89) / (1 − UV/8.35)]^½
      = [(1 + 5.72/2.89) / (1 − (5.72)(0.46)/8.35)]^½
      = [2.98/0.68]^½ = 2.09

or, with the short-cut values,

    X' = [(1 + U'/2.9) / (1 − U'V'/8.4)]^½
       = [(1 + 4.6/2.9) / (1 − (4.6)(0.5)/8.4)]^½
       = [2.59/0.73]^½ = 1.9

b. The item difficulty expansion factor due to sample spread is

    Y = [(1 + V/2.89) / (1 − UV/8.35)]^½
      = [(1 + 0.46/2.89) / (1 − (5.72)(0.46)/8.35)]^½
      = [1.16/0.68]^½ = 1.31

or, with the short-cut values,

    Y' = [(1 + V'/2.9) / (1 − U'V'/8.4)]^½
       = [(1 + 0.5/2.9) / (1 − (4.6)(0.5)/8.4)]^½
       = [1.17/0.73]^½ = 1.3
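The two expansion factors can be computed directly from the logit variances. The function below is our own sketch of these formulas:

```python
import math

def expansion_factors(U, V):
    """PROX expansion factors from the item logit variance U and the
    person logit variance V.  The constants are 2.89 = 1.7**2 and
    8.35, approximately 1.7**4.
    """
    denom = 1.0 - U * V / 8.35
    X = math.sqrt((1.0 + U / 2.89) / denom)  # person factor (test width)
    Y = math.sqrt((1.0 + V / 2.89) / denom)  # item factor (sample spread)
    return X, Y

# With the example's U = 5.72 and V = 0.46 this gives X near 2.09 and
# Y near 1.31 (exact agreement depends on intermediate rounding).
X, Y = expansion_factors(5.72, 0.46)
```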
In Table 2.4.5 we obtain the final corrected item calibrations and their standard errors from the sample spread expansion factor Y.
TABLE 2.4.6

  1      2       3        4          5
  r     b°r      X    br = Xb°r   SE(br)

  6    −0.28   2.09     −0.59      1.13

L = 14

SE(br) = X[L/r(L − r)]^½
In Table 2.4.6 we obtain the final corrected person measures and their standard
errors from the test width expansion factor X.
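The corrections in Tables 2.4.5 and 2.4.6 reduce to one line each. The sketch below, with names of our own choosing, reproduces (to rounding) the person measure row shown for score 6:

```python
import math

def person_measure(r, L, X):
    """Final PROX person measure and its standard error for score r on
    an L-item test, using the test width expansion factor X:
       b_r  = X * ln[r/(L - r)]
       SE_r = X * sqrt(L / (r(L - r)))
    """
    b = X * math.log(r / (L - r))
    se = X * math.sqrt(L / (r * (L - r)))
    return b, se

b, se = person_measure(6, 14, 2.09)
# b is about -0.60 (the table's -0.59 reflects its rounded initial
# logit of -0.28) and se is 1.13, matching column 5.
```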
2.5 DISCUSSION
The PROX item analysis procedure has been carefully described not only because it accomplishes item calibration and hence person measurement but also because it embodies in a logical and straightforward manner the simplest possible analysis of the interaction between items and persons. The decisive idea on which this analysis is based is that the probability of success is dominated by the person's ability and the item's difficulty. A more able person is supposed always to have a greater chance of success on any item than is a less able person. Any particular person is supposed always to have a better chance of success on an easy item than on a difficult one. To the extent this is the case, the probability of any person's success on any item can be specified as the consequence of the difference between the person's position on a single variable and the item's position on that same variable. That is the Rasch model for item analysis and test construction and, indeed, the fundamental model implicit in the item analysis of all those who work with unweighted scores (Andersen, 1977).
All the information observed about a person's position on the variable, e.g., his ability, is assumed to be expressed in his responses to the set of items he takes, as summarized in the unweighted count of the number of items he gets correct. For item difficulty, the information observed is assumed to be completely contained in the unweighted count of persons in the sample who responded correctly to that item.
3 ITEM CALIBRATION BY COMPUTER

3.1 INTRODUCTION
In this chapter we display and describe computer output for Rasch item calibration using the estimation procedures PROX and UCON (Wright and Panchapakesan, 1969; Wright and Douglas, 1975b, 1977a, 1977b). The Knox Cube Test data analyzed by hand in Chapter 2 are used for illustration. The estimation procedures are performed by the computer program BICAL (Wright and Mead, 1976). The accuracy and utility of the hand calibration described in Chapter 2 are evaluated by comparing the "hand" estimates with those produced by a computer analysis of the same data. Knowing the steps by which these procedures can be applied by hand should facilitate understanding and using the computer output.
We will move through the computer output step by step in order to bring out its organization and use. BICAL produces the output given in Tables 3.2.1 to 3.2.9. A comparison of the item and person statistics from PROX by hand with those from PROX by computer is given in Tables 3.3.1 through 3.3.5.
The first page of the output, in Table 3.2.1, recaps the control specifications necessary to apply the 1976 version of BICAL to this calibration job (for details consult the manual that goes with your BICAL program). At the top we begin with the job title, the various control parameters, the record (or card) columns read, the test scoring key and a copy of the first person record read. Finally, the output reports that 18 items and 34 persons went into this analysis.
TABLE 3.2.1

KNOX CUBE TEST

CONTROL PARAMETERS
NITEM  NGROP  MINSC  MAXSC  LREC  KCAB  SCORE
  18     10      1     17     21     1      0

COLUMNS SELECTED
1 2 3 4
1* ......................
111111111111111111

KEY
111111111111111111

FIRST SUBJECT
001111111100000000000
ITEM CALIBRATION BY COMPUTER 47
Table 3.2.2 gives each item's response frequencies for each response value. This table can accommodate up to five response values as specified by the user. An "unknown" value column records the count of all other values encountered. The final column is for the key. The key marks the value specified as correct when the data is still to be scored. As Table 3.2.2 shows, the KCT data was entered in scored form. The appropriate key, therefore, is the vector of '1's shown in Tables 3.2.1 and 3.2.2. Each item is identified on the left by its sequence number in the original order of test items as read into BICAL. A four-character item name can also be used to identify test items. For the KCT we have named the items by the number of taps required.
Table 3.2.2 enables us to examine the observed responses for obvious disturbances to our test plan and will often suggest possible explanations for gross misfits. The distribution of responses over multiple-choice distractors, for example, can reveal the undue influence of particular distractors. The effects of insufficient time show up in the piling up of responses in the UNKN column toward the end of the test. The effects of widespread inexperience in test taking show up in the pile-up of UNKN responses in the first one or two items of the test.
We see again in Table 3.2.2 what we already learned from Table 2.3.1, namely that the first three items are answered correctly by all 34 persons, that Item 18 was not answered correctly by anyone and that there is a rapid shift from largely correct responses to largely incorrect responses between Items 9 and 11. Since ITEM NAME gives the number of taps in the series, we see that this shift occurs when the task moves from a series of four taps up to five taps.
TABLE 3.2.2
1 2 0 34 0 0 0 0 1
2 2 0 34 0 0 0 0 1
3 3 0 34 0 0 0 0 1
4 3 2 32 0 0 0 0 1
5 3 3 31 0 0 0 0 1
6 3 4 30 0 0 0 0 1
7 4 3 31 0 0 0 0 1
8 4 7 27 0 0 0 0 1
9 4 4 30 0 0 0 0 1
10 4 10 24 0 0 0 0 1
11 5 22 12 0 0 0 0 1
12 5 28 6 0 0 0 0 1
13 5 27 7 0 0 0 0 1
14 6 31 3 0 0 0 0 1
15 6 33 1 0 0 0 0 1
16 6 33 1 0 0 0 0 1
17 6 33 1 0 0 0 0 1
18 6 34 0 0 0 0 0 1
Table 3.2.3 reports the editing process. It summarizes the work of the editing routine which successively removes person records with zero or perfect scores and items correctly answered by all persons or not answered correctly by any persons, until all such persons or items are detected and set aside. The editing process determines the final matrix of item-by-person responses that is analyzed.
Table 3.2.3 shows that initially there were no persons with perfect or zero scores, and that 18 items entered the run, with no person scoring below 1 or above 17, leaving 34 persons for calibration (the 35th person appearing in Table 2.3.1 had already been removed from the data deck by hand). Items 1, 2 and 3 are then removed by the editing process because they were answered correctly by all subjects, and Item 18 is removed because no one answered it correctly. After this editing the calibration sample still consists of 34 subjects, but now only the 14 items which can be calibrated remain, with the same minimum score of 1 and a new maximum score of 13.
Table 3.2.4 shows the distribution of persons over the KCT scores. The histogram is scaled according to the scale factor printed below the graph. The distribution of person scores gives a picture of how this sample responded to these items. It shows how well the items were targeted on the persons and how relevant the persons selected were for this calibration. For the best calibration, persons should be more or less evenly distributed over a range of scores, around and above the center of the test. In our sample we see a symmetrical distribution around a modal score of 7.
TABLE 3.2.3

THE EDITING PROCESS
KNOX CUBE TEST

NUMBER OF ZERO SCORES        0
NUMBER OF PERFECT SCORES     0
NUMBER OF ITEMS SELECTED    18
NUMBER OF ITEMS NAMED       18
SUBJECTS BELOW  1            0
SUBJECTS ABOVE 17            0
SUBJECTS REMAINING          34
TOTAL SUBJECTS              34

REJECTED ITEMS
ITEM     ITEM    ANSWERED
NUMBER   NAME    CORRECTLY
   1       2        34      HIGH SCORE
   2       2        34      HIGH SCORE
   3       3        34      HIGH SCORE
  18       6         0      LOW SCORE

SUBJECTS DELETED   =  0
SUBJECTS REMAINING = 34
ITEMS DELETED      =  4
ITEMS REMAINING    = 14
MINIMUM SCORE      =  1
MAXIMUM SCORE      = 13
TABLE 3.2.4

SCORE DISTRIBUTION OF ABILITY

SCORE  COUNT  PROPORTION
   1     0      0.00
   2     1      0.03    X
   3     2      0.06    XX
   4     2      0.06    XX
   5     2      0.06    XX
   6     3      0.09    XXX
   7    12      0.36    XXXXXXXXXXXX
   8     5      0.15    XXXXX
   9     4      0.12    XXXX
  10     1      0.03    X
  11     2      0.06    XX
  12     0      0.00
  13     0      0.00
  14     0      0.00

EACH X = 2.94 PERCENT
Table 3.2.5 shows the distribution of item easiness. The scale is given below the graph. Items 4 through 9 are seen to be fairly easy, with Item 8 the most difficult among them. Item 10 is slightly more difficult. Item 11 is much more difficult, followed by Items 12 and 13, more difficult still. Finally, Items 15, 16 and 17 are so difficult that only one person answered these items correctly.
TABLE 3.2.5

ITEM DISTRIBUTION OF EASINESS

ITEM  COUNT  PROPORTION
   4    32      0.94
   5    31      0.91
   6    30      0.88
   7    31      0.91
   8    27      0.79
   9    30      0.88
  10    24      0.71
  11    12      0.35
  12     6      0.18
  13     7      0.21
  14     3      0.09
  15     1      0.03
  16     1      0.03
  17     1      0.03

[histogram bars omitted]
Table 3.2.6 gives the estimation information. At the top are the PROX difficulty and ability expansion factors. Notice that these values are identical to those we obtained by hand in Chapter 2. Within the table, the first four columns give the item sequence number, item name, item difficulty and standard error.
TABLE 3.2.6

CALIBRATION BY PROX

ITEM     ITEM    ITEM        STANDARD
NUMBER   NAME    DIFFICULTY  ERROR
   4       3      -3.865      0.833
   5       3      -3.294      0.691
   6       3      -2.876      0.608
   7       4      -3.294      0.691
   8       4      -2.007      0.485
   9       4      -2.876      0.608
  10       4      -1.388      0.430
  11       5       0.547      0.410
  12       5       1.767      0.514
  13       5       1.518      0.485
  14       6       2.805      0.691
  15       6       4.321      1.160
  16       6       4.321      1.160
  17       6       4.321      1.160
Table 3.2.7 gives the logit ability measure and its standard error for each score on the KCT and the number of persons in the sample obtaining each score. For each raw score we can see the sample frequency at that score and the ability and standard error implied by that score. The sample ability mean and standard deviation are given at the bottom of the table.
TABLE 3.2.7

COMPLETE SCORE EQUIVALENCE TABLE

SCORE  COUNT  ABILITY  STANDARD ERROR
  13     0      5.40       1.51
  12     0      3.77       1.11
  11     2      2.73       0.94
  10     1      1.93       0.86
   9     4      1.24       0.81
   8     5      0.61       0.78
   7    12      0.00       0.78
   6     3     -0.61       0.78
   5     2     -1.24       0.81
   4     2     -1.93       0.86
   3     2     -2.73       0.94
   2     1     -3.77       1.11
   1     0     -5.40       1.51

MEAN ABILITY = -0.06
SD OF ABILITY = 1.14
Table 3.2.8 provides item characteristic curves and fit statistics. The tests of fit include a division of the calibration sample into ability subgroups by score level. Three groups have been made out of the KCT sample: the 10 persons with scores from 1 to 6, the 12 persons at score 7 and the 12 persons with scores from 8 to 13. Control over group size, and hence over the number of groups used, is exercised through the control parameter NGROP. An evaluation of item difficulty invariance over these ability groups is made by comparing for each item its difficulty estimates over the different groups. The tests of fit are thus sample-dependent. However, if the difficulty estimates they use pass these tests, then those estimates are sample-free as far as that sample is concerned. Of course, successful item fit in one sample does not guarantee fit in another. However, as the ability groups within a given sample are arranged by scores, we do obtain information about the stability of item difficulties over various abilities and therefore can see whether our items are displaying sufficient invariance over these particular ability groups to qualify the items for use as instruments of objective measurement.
TABLE 3.2.8

ITEM CHARACTERISTIC CURVES AND ANALYSIS OF FIT

[This rotated table, giving for each item the proportion of correct answers in each ability group, the departures of those proportions from the model's expectations, and the fit mean squares, is not legible in this copy.]
In the "Item Characteristic Curve" panel of Table 3.2.8 we have the proportion of correct answers given by each ability group to each item. The score range and mean ability for each group are given at the bottom of each column. We expect these ICCs to increase as we move from left to right, from less able to more able score groups, and for the most part we see that in Table 3.2.8 they do. However, Item 7 does show a rather implausible pattern. A greater proportion of persons get it correct in the lowest score group than in the middle one!
In the m iddle panel o f Table 3.2.8 w e have the differences in ICC proportions fo r
each ability group between those observed and those predicted by the Rasch measurement
m odel. Here we can see where the largest proportional departures occur and in which
direction they go. Again, Item 7 is out o f line with the other items, especially fo r the
low est ability group.
In the "Analysis of Fit" panel of Table 3.2.8 we have a series of fit mean squares. These fit statistics are mean square standardized residuals for item-by-person responses averaged over persons, and partitioned into two components, one between ability groups and the other within ability groups. These mean squares increase in magnitude away from a reference value of 1 as the observed ICC departs from the expected ICC, i.e., when too many high-ability persons fail an easy item or too many low-ability persons succeed on a difficult one. The statistical significance of large values can be judged by comparing the observed mean squares with their expected value of 1 in terms of the expected standard errors given at the bottom of the table.
The "total" mean square evaluates the general agreement between the variable defined by the item and the variable defined by all other items over the whole sample. Only Item 7 is significantly out of line, with an observed mean square of 1.73, more than three times its expected standard error of 0.24 above its expected value of 1.
The discrimination index shown in the next to last column of Table 3.2.8 describes the linear trend of departures from the model across ability groups expressed around a model value of 1. When this index is near 1, then the observed and expected ICCs are close together over the reference points defined by the ability grouping.
When the index is substantially less than 1, then the observed ICC is flatter than expected and the particular item is failing to differentiate among abilities as well as the other items do. This condition, of course, tends to go with a lower point biserial correlation between item response and total test score. However, the discrimination index is less influenced in its magnitude than the point biserial by how central the item is to the sample or how dispersed in ability the sample is.
When the index is substantially greater than 1, then the item gives the appearance of differentiating abilities more distinctly than the average items in the test. The cause of this unusual "discrimination" must then be investigated. It is almost always found to be caused by a local interaction between a secondary characteristic of the item and a
[TABLE 3.2.9
Item name, difficulty, discrimination index and total fit mean square, in SERIAL ORDER, DIFFICULTY ORDER and FIT ORDER panels; the body of this table is not recoverable.]
Table 3.2.9 summarizes the item calibration information in three useful arrangements. We have there for each item its name, difficulty, discrimination index and total fit mean square listed first by serial order, second by difficulty order, and third by fit order. While in the KCT example we have only a few items to deal with, on longer tests the convenient reordering of these items by difficulty and by fit helps us to find misfitting items and to grasp the pattern of misfit, if there is one. In our example we see again that the item with the greatest misfit, Item 7, is identified for us at the bottom of the third panel of Table 3.2.9.
3.3 COMPARING PROX BY HAND WITH PROX BY COMPUTER
Now we can compare the PROX estimation results for item difficulties and person measures obtained by hand with those produced by computer. The data on PROX by hand for item difficulties and person measures come from Tables 2.4.5 and 2.4.6 in Chapter 2. The data on PROX by computer come from Tables 3.2.6 and 3.2.7. These data have been compiled into Tables 3.3.1 and 3.3.2.
In Table 3.3.1, each item is listed with its calibration by hand and by computer. The standard error for each item as computed by hand and by computer is also given. The results from PROX by hand and PROX by computer are virtually the same.
[TABLE 3.3.1: CALIBRATION and STANDARD ERROR, by hand and by computer; the body of this table is not recoverable.]
The ability measures and their standard errors are given in Table 3.3.2. Again, the differences between the two methods are minimal. Only the standard errors of measurement show a difference of any magnitude. This difference is due to the use of a more accurate but also more laborious formula in PROX by computer. Thus, with the mild exception of the standard errors of measurement, the very simple PROX by hand and PROX by computer produce virtually the same results.
[TABLE 3.3.2: MEASURE and STANDARD ERROR for each score, by hand and by computer; the body of this table is not recoverable.]
Table 3.4.1 gives the test items with their UCON calibrations and standard errors. UCON uses PROX item difficulties as its point of departure and these are given in the far right column of the table. Table 3.4.2 gives the UCON ability measure associated with each score and the standard error for each measure. The larger standard errors at scores 7 and 8, for abilities between ±2 logits, are caused by the bimodal distribution of item difficulties shown in Table 3.4.1. Six of the 14 items have difficulties below -3.2 logits, while another six have difficulties greater than +1.8 logits. This leaves only two items to function in the 5 logit range between -3.2 and +1.8, and the standard errors of measurement in that region are accordingly higher.
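To see numerically how the bimodal difficulty distribution inflates the mid-range standard errors, the relation can be sketched in Python. This fragment is ours, not part of the original text: it applies the usual Rasch standard error, the reciprocal square root of the information sum p(1 - p) accumulated over the items, to the UCON difficulties of Table 3.4.1 (the function names are assumptions of the sketch).

```python
import math

# UCON item difficulties from Table 3.4.1, in logits
DIFFICULTIES = [-4.186, -3.648, -3.220, -3.648, -2.241, -3.220, -1.498,
                0.760, 2.135, 1.861, 3.214, 4.564, 4.564, 4.564]

def p_correct(b, d):
    """Rasch probability of success for a person of ability b on an item of difficulty d."""
    return 1.0 / (1.0 + math.exp(d - b))

def standard_error(b, difficulties=DIFFICULTIES):
    """Standard error of a measure b: the reciprocal square root of the
    test information sum p(1 - p) accumulated over the items."""
    info = sum(p_correct(b, d) * (1.0 - p_correct(b, d)) for d in difficulties)
    return 1.0 / math.sqrt(info)
```

Near b = 0, in the thinly covered middle of the test, the sketch gives a standard error of about 1.07 logits; at b = -3, inside the cluster of easy items, it falls to about 0.81, the pattern visible in Table 3.4.2.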
TABLE 3.4.1

DIFFICULTY EXPANSION FACTOR 1.31
ABILITY EXPANSION FACTOR 2.10
NUMBER OF ITERATIONS = 7

SEQUENCE   ITEM   ITEM         STANDARD   LAST DIFF   PROX
NUMBER     NAME   DIFFICULTY   ERROR      CHANGE      DIFF
   4        3      -4.186       0.816      -0.025     -3.865
   5        3      -3.648       0.709      -0.023     -3.294
   6        3      -3.220       0.647      -0.021     -2.876
   7        4      -3.648       0.709      -0.023     -3.294
   8        4      -2.241       0.547      -0.015     -2.007
   9        4      -3.220       0.647      -0.021     -2.876
  10        4      -1.498       0.489      -0.009     -1.388
  11        5       0.760       0.456       0.006      0.547
  12        5       2.135       0.556       0.015      1.767
  13        5       1.861       0.529       0.014      1.518
  14        6       3.214       0.705       0.022      2.805
  15        6       4.564       1.076       0.027      4.321
  16        6       4.564       1.076       0.027      4.321
  17        6       4.564       1.076       0.027      4.321

ROOT MEAN SQUARE = 0.022
TABLE 3.4.2
MEASUREMENT BY UCON
COMPLETE SCORE EQUIVALENCE TABLE

SCORE   COUNT   ABILITY   STANDARD ERROR
  13      0       5.09        1.14
  12      0       4.11        0.95
  11      2       3.31        0.92
  10      1       2.53        0.93
   9      4       1.71        0.96
   8      5       0.81        1.03
   7     12      -0.22        1.07
   6      3      -1.19        0.97
   5      2      -1.96        0.86
   4      2      -2.61        0.81
   3      2      -3.21        0.81
   2      1      -3.86        0.88
   1      0      -4.73        1.10

MEAN ABILITY = -0.16
SD OF ABILITY = 1.45
[TABLE 3.4.3
ITEM CHARACTERISTIC CURVES AND ANALYSIS OF FIT (UCON)
The body of this table did not survive reproduction; only its panel titles are recoverable.]
Table 3.4.3 gives the observed Item Characteristic Curve shown in Table 3.2.8, the departures of this ICC from the model ICC as expected by UCON estimates and the fit mean squares resulting from the UCON analysis. Table 3.4.4 summarizes the UCON calibration in the same form as Table 3.2.9 summarizes the PROX calibration.
[TABLE 3.4.4
SEQ NUM, ITEM NAME, ITEM DIFF, DISC INDX, FIT MN SQ and POINT BISER for each item, in SERIAL ORDER, DIFFICULTY ORDER and FIT ORDER panels; the body of this table is not recoverable.]
Table 3.5.1 gives the calibrations and standard errors for the KCT data produced by the UCON and PROX methods. The calibration differences between UCON and PROX run about ±.3 logits. The difference between their standard errors is at most ±.1 logits.
TABLE 3.5.1

             CALIBRATION                   STANDARD ERROR
Item    UCON   PROX   Difference      UCON   PROX   Difference
  4    -4.2   -3.9      -0.3          0.8    0.8       0.0
  5    -3.6   -3.3      -0.3          0.7    0.7       0.0
  6    -3.2   -2.9      -0.3          0.6    0.6       0.0
  7    -3.6   -3.3      -0.3          0.7    0.7       0.0
  8    -2.2   -2.0      -0.2          0.6    0.5       0.1
  9    -3.2   -2.9      -0.3          0.6    0.6       0.0
 10    -1.5   -1.4      -0.1          0.5    0.4       0.1
 11    +0.8   +0.5       0.3          0.5    0.4       0.1
 12    +2.1   +1.8       0.3          0.6    0.5       0.1
 13    +1.9   +1.5       0.4          0.5    0.5       0.0
 14    +3.2   +2.8       0.4          0.7    0.7       0.0
 15    +4.6   +4.3       0.3          1.1    1.2      -0.1
 16    +4.6   +4.3       0.3          1.1    1.2      -0.1
 17    +4.6   +4.3       0.3          1.1    1.2      -0.1
Table 3.5.2 gives the person measures and their standard errors for UCON and PROX. There the differences between UCON and PROX methods run as much as ±.7 logits for the measures.
We see that using the more accurate UCON procedure, which takes into account the particular distributions of item difficulties and person abilities, does make a tangible difference for the KCT data. As we have seen, these KCT items have a distinctly bimodal distribution not well handled by the PROX procedure. These differences between PROX and UCON, however, are never as much as a standard error, and hence could not be considered statistically significant.
TABLE 3.5.2

             MEASURE                       STANDARD ERROR
Score   UCON   PROX   Difference      UCON   PROX   Difference
  1    -4.7   -5.4       0.7          1.1    1.5      -0.4
  2    -3.9   -3.8      -0.1          0.9    1.1      -0.2
  3    -3.2   -2.7      -0.5          0.8    0.9      -0.1
  4    -2.6   -1.9      -0.7          0.8    0.9      -0.1
  5    -2.0   -1.2      -0.8          0.9    0.8       0.1
  6    -1.2   -0.6      -0.6          1.0    0.8       0.2
  7    -0.2    0.0      -0.2          1.0    0.8       0.2
  8     0.8    0.6       0.2          1.0    0.8       0.2
  9     1.7    1.2       0.5          1.0    0.8       0.2
 10     2.5    1.9       0.6          0.9    0.9       0.0
 11     3.3    2.7       0.6          0.9    0.9       0.0
 12     4.1    3.8       0.3          1.0    1.1      -0.1
 13     5.1    5.4      -0.3          1.1    1.5      -0.4
3.6 A COMPUTING ALGORITHM FOR PROX

x_i = ln [(N - s_i)/s_i]   [3.6.1]

x̄ = Σ_i x_i / L   [3.6.2]

y_r = ln [r/(L - r)]   [3.6.3]

ȳ = Σ_r n_r y_r / N,  summed over r = 1, L-1   [3.6.4]

D = Σ_i (x_i - x̄)² / 2.89(L - 1)   [3.6.5]

B = Σ_r n_r (y_r - ȳ)² / 2.89(N - 1)   [3.6.6]

G = BD   [3.6.7]

X = [(1 + D) / (1 - G)]^½   [3.6.8]

Y = [(1 + B) / (1 - G)]^½   [3.6.9]

d_i = Y(x_i - x̄),  for i = 1, L   [3.6.10]

SE(d_i) = Y[N/s_i(N - s_i)]^½   [3.6.11]

b_r = X y_r,  for r = 1, L-1   [3.6.12]

SE(b_r) = X[L/r(L - r)]^½   [3.6.13]
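These steps translate directly into code. The following Python sketch is ours (function and variable names are not from the text); it takes the item scores s_i, the counts n_r of persons at each score r, and the sample size N as input, and forms the item logits x_i = ln[(N - s_i)/s_i] as above.

```python
import math

def prox(item_scores, score_counts, num_persons):
    """PROX estimation sketch following equations 3.6.1-3.6.13.
    item_scores: list of s_i, the number of persons correct on each item
    score_counts: dict {r: n_r}, persons at each score r = 1..L-1
    num_persons: N
    Returns item difficulties with SEs and score measures with SEs."""
    L, N = len(item_scores), num_persons
    # item logits and their mean (3.6.1, 3.6.2)
    x = [math.log((N - s) / s) for s in item_scores]
    x_bar = sum(x) / L
    # score logits (3.6.3) and their sample mean (3.6.4)
    y = {r: math.log(r / (L - r)) for r in score_counts}
    y_bar = sum(n * y[r] for r, n in score_counts.items()) / N
    # scaled variances (3.6.5, 3.6.6) and their product (3.6.7)
    D = sum((xi - x_bar) ** 2 for xi in x) / (2.89 * (L - 1))
    B = sum(n * (y[r] - y_bar) ** 2 for r, n in score_counts.items()) / (2.89 * (N - 1))
    G = B * D
    # expansion factors (3.6.8, 3.6.9)
    X = math.sqrt((1 + D) / (1 - G))
    Y = math.sqrt((1 + B) / (1 - G))
    # item difficulties and their standard errors (3.6.10, 3.6.11)
    d = [Y * (xi - x_bar) for xi in x]
    se_d = [Y * math.sqrt(N / (s * (N - s))) for s in item_scores]
    # score measures and their standard errors (3.6.12, 3.6.13)
    b = {r: X * y[r] for r in y}
    se_b = {r: X * math.sqrt(L / (r * (L - r))) for r in y}
    return d, se_d, b, se_b
```

The returned difficulties are centered at zero, and the measures increase with score, as the equations require.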
The Rasch model for binary observations defines the probability of a response x_vi to item i by person v as

π_vi = exp [x_vi(β_v - δ_i)] / [1 + exp (β_v - δ_i)]   [3.7.1]

where x_vi = 1 if correct, 0 otherwise.

The likelihood Λ of the data matrix ((x_vi)) is the continued product of Equation [3.7.1] over all values of v and i, where L is the number of items and N is the number of persons with test scores between 0 and L, since scores of 0 and L lead to infinite ability estimates.

Λ = exp [Σ_v Σ_i x_vi(β_v - δ_i)] / Π_v Π_i [1 + exp (β_v - δ_i)]   [3.7.2]
Let

Σ_i x_vi = r_v be the score of person v

and Σ_v x_vi = s_i be the score of item i.

Then the log likelihood is

λ = ln Λ = Σ_v r_v β_v - Σ_i s_i δ_i - Σ_v Σ_i ln [1 + exp (β_v - δ_i)].   [3.7.3]

The reduction of the data matrix ((x_vi)) to its margins (r_v) and (s_i) and the separation of r_v β_v and s_i δ_i in Equation 3.7.3 establish the sufficiency of r_v for estimating β_v and of s_i for estimating δ_i, as well as the objectivity of these estimates.

With the side condition Σ_i δ_i = 0 to restrain the indeterminacy of origin in the response parameters, the first and second partial derivatives of λ with respect to β_v and δ_i become
∂λ/∂β_v = r_v - Σ_i π_vi,   v = 1, N   [3.7.4]

∂²λ/∂β_v² = -Σ_i π_vi(1 - π_vi)   [3.7.5]

and

∂λ/∂δ_i = -s_i + Σ_v π_vi,   i = 1, L   [3.7.6]

∂²λ/∂δ_i² = -Σ_v π_vi(1 - π_vi)   [3.7.7]

in which π_vi = exp (β_v - δ_i) / [1 + exp (β_v - δ_i)].

We replace the parameters (β_v, δ_i) by their estimates (b_r, d_i) and write the estimated probability that a person with a score r will succeed on item i as

p_ri = exp (b_r - d_i) / [1 + exp (b_r - d_i)].   [3.7.8]

Then Σ_v π_vi becomes Σ_r n_r p_ri, as far as estimates are concerned.

The estimation begins from the initial values

b_r^(0) = ln [r/(L - r)],   r = 1, L-1   [3.7.9]

d_i^(0) = x_i - Σ_i x_i / L,   i = 1, L   [3.7.10]

and improves each d_i by Newton's method applied to Equation 3.7.6,

d_i^(j+1) = d_i^(j) - [s_i - Σ_r n_r p_ri^(j)] / [Σ_r n_r p_ri^(j)(1 - p_ri^(j))],   i = 1, L   [3.7.11]

in which the (p_ri^(j)) are based on the current set of (d_i) and the set of (b_r) given by the previous cycle.
5. Using this improved set of (d_i), apply Newton's method to Equation 3.7.4 to improve each b_r according to

b_r^(m+1) = b_r^(m) + [r - Σ_i p_ri^(m)] / [Σ_i p_ri^(m)(1 - p_ri^(m))],   r = 1, L - 1   [3.7.13]
6. Repeat steps (3) through (5) until successive estimates of the whole set of (d_i) become stable.

7. Use the reciprocals of the negative square roots defined in Equation 3.7.7 as asymptotic estimates of the standard errors of the difficulty estimates,

SE(d_i) = [Σ_r n_r p_ri(1 - p_ri)]^(-½),   i = 1, L   [3.7.16]
Andersen (1973) has shown that the presence of the ability parameters (β_v) in the likelihood equation of this unconditional approach leads to biased estimates of item difficulties (δ_i). Simulations undertaken to test UCON in 1966 indicated that multiplying the centered item difficulty estimates by the coefficient [(L - 1)/L] compensates for most of this bias. (For a discussion and evaluation of the unbiasing coefficient [(L - 1)/L] see Wright and Douglas, 1975b or 1977a.)
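The whole UCON cycle can be sketched in Python. This is our own simplified illustration, not the original program: it works from the marginal item scores s_i and score counts n_r, starts from the initial values of Equations 3.7.9 and 3.7.10, alternates the Newton steps of Equations 3.7.11 and 3.7.13, recenters the difficulties each cycle, and finishes with the standard errors of Equation 3.7.16 and the (L - 1)/L unbiasing coefficient. Names and the convergence tolerance are assumptions of the sketch.

```python
import math

def ucon(item_scores, score_counts, max_iter=50, tol=0.01):
    """Unconditional (UCON) maximum likelihood sketch, equations 3.7.9-3.7.16.
    item_scores: list of s_i; score_counts: dict {r: n_r} for r = 1..L-1."""
    L, N = len(item_scores), sum(score_counts.values())

    def p(br, di):
        # estimated probability of success (3.7.8)
        return 1.0 / (1.0 + math.exp(di - br))

    # initial estimates (3.7.9, 3.7.10)
    b = {r: math.log(r / (L - r)) for r in score_counts}
    x = [math.log((N - s) / s) for s in item_scores]
    x_bar = sum(x) / L
    d = [xi - x_bar for xi in x]

    for _ in range(max_iter):
        # Newton step for each item difficulty (3.7.11), then recenter
        new_d = []
        for i, si in enumerate(item_scores):
            expected = sum(n * p(b[r], d[i]) for r, n in score_counts.items())
            variance = sum(n * p(b[r], d[i]) * (1 - p(b[r], d[i]))
                           for r, n in score_counts.items())
            new_d.append(d[i] - (si - expected) / variance)
        mean_d = sum(new_d) / L
        new_d = [di - mean_d for di in new_d]
        # Newton step for each score measure (3.7.13)
        for r in b:
            expected = sum(p(b[r], di) for di in new_d)
            variance = sum(p(b[r], di) * (1 - p(b[r], di)) for di in new_d)
            b[r] += (r - expected) / variance
        shift = max(abs(nd - od) for nd, od in zip(new_d, d))
        d = new_d
        if shift < tol:
            break

    # asymptotic standard errors (3.7.16), then the (L-1)/L bias correction
    se_d = [1.0 / math.sqrt(sum(n * p(b[r], di) * (1 - p(b[r], di))
                                for r, n in score_counts.items())) for di in d]
    d = [di * (L - 1) / L for di in d]
    return d, se_d, b
```

On well-conditioned data the cycle settles within a handful of iterations, returning centered difficulties ordered opposite to the item scores and measures increasing with score.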
4 THE ANALYSIS OF FIT
4.1 INTRODUCTION

4.2 THE KCT RESPONSE MATRIX
We begin the study of fit analysis by returning to the item-by-person data matrix of the KCT given in Table 2.4.1. In this table we have the edited and ordered responses of 34 persons to 14 KCT items. The editing process removed items answered correctly by everyone or no one, and persons answering correctly all or none of the items. The remaining persons and items have been arranged in order of increasing item and person score.
This item-by-person matrix of 1's and 0's is the complete record of usable person responses to the items of the test. By inspection we see that the increasing difficulty of the KCT items has divided the matrix roughly into two triangles: a lower left triangle dominated by correct responses signified by 1's and an upper right triangle dominated by incorrect responses signified by 0's.
This is the pattern we expect. As items get harder, going from left to right in Table 2.4.1, any particular person's string of successes should gradually peter out and end in a string of failures on the items much too hard for that person. Similarly, when we examine the pattern of responses for any item by proceeding from the bottom of Table 2.4.1 up
that item's column over persons of decreasing ability, we expect the string of successes at the bottom to peter out into failures as the persons become too low in ability to succeed on this item.
From our calibration of the KCT items we have estimates of the item difficulties (d_i) and of the abilities (b_r) which go with the possible scores (r) on this test. In Table 4.2.1 we show the matrix of responses from Table 2.4.1 to which we have added, from our calibration, the item difficulties (d_i) across the bottom and the abilities (b_r) associated with each score down the right column. The item difficulties and score abilities in Table 4.2.1 are those estimated with PROX by hand from Chapter 2.
Notice how Table 4.2.1 is arranged into six sections in order to bring out the pattern of responses. The 14 items are partitioned into the 7 easier and the 7 harder. The 34 persons are partitioned into the 10 scoring below seven, the 12 scoring exactly seven and the 12 scoring above seven. In the lower left section there are only 1's. Every higher ability person got every easier item correct. In the upper right section there are only 0's. Every lower ability person got every harder item incorrect. But in the other four sections there is a pattern of 1's and 0's that must be analyzed.
When we examine the pattern of responses in these data for unexpected "corrects" and "incorrects," we find that Table 4.2.1 shows several exceptions to a pattern of all 1's followed by all 0's. Of course, we do not expect every single person to fail for the first time at a particular point and then always to continue to do so on all harder items. We expect to find a run of successes followed by alternating successes and failures leading finally to a run of failures as the items finally become too difficult. However, some of the exceptions in Table 4.2.1 seem to exceed even this expectation. To facilitate their examination we have circled those responses which seem most unexpected given the overall pattern.
The expected pattern is the one we see in the records of Persons 12 or 23. Here each record shows a string of 1's with a few adjacent and alternating 1's and 0's, followed by a string of 0's.
Person 11 failed Item 4 but passed Items 5 through 9 before failing all the remaining items.
Person 13 passed Items 4 and 5, missed Items 6 and 7, passed Items 8 through 12 and then missed the remaining ones.
[TABLE 4.2.1
The KCT response matrix: 34 persons by 14 items, with item difficulties (d_i) across the bottom, score abilities (b_r) down the right column, and the most unexpected responses circled. The body of this table did not survive reproduction.]
There are a few other records that might also be examined such as Persons 3 and 12, but as we "eyeball" this small matrix, we can see that the other records are less exceptional.
4.3 THE ANALYSIS OF FIT BY HAND
In order to focus our application of these ideas, we have taken from Table 4.2.1 the responses of the six persons with the most implausible patterns to the seven items on which their implausible responses occur. These selected responses comprise Table 4.3.1. With this table we can more easily study the outstanding unexpected "correct" or "incorrect" responses.
To begin with, we can tabulate the number of unexpected responses for each person and item in Table 4.3.1 to arrive at a simple count with which to describe what is occurring. We see that Persons 13 and 29 make the worst showing with three unexpected responses each. However, this simple count does not tell us how to weigh and hence how to judge the degree of unexpectedness in these responses.
TABLE 4.3.1

                          ITEM                      NUMBER OF
                                                    UNEXPECTED   PERSON
PERSON     4     5     7     6     8    12    14    RESPONSES    ABILITY
  11      (0)   1     1     1     1     0     0         1         -1.2
  12       1    1     1    (0)   (0)    0     0         2         -1.2
  17       1   (0)    1     1     1     0     0         1         -0.6
   3       1    1     1     1     1    (1)    0         1          0.0
  13       1    1    (0)   (0)    1    (1)    0         3          0.0
  29       1    1    (0)    1    (0)    0    (1)        3          0.0

Number of
Unexpected
Responses  1    1     2     2     2     2     1        11

Item
Difficulty -3.9 -3.3 -3.3 -2.9 -2.0  1.7  2.8

(Parenthesized entries are the circled, unexpected responses.)
Thus we can use p_vi as an estimate of the expected value of instances of x_vi.
To estimate this standard residual z_vi, we subtract from the observed x_vi its estimated expected value p_vi and standardize this residual difference by the divisor

[p_vi(1 - p_vi)]^½
which is the estimated binomial standard deviation of such observations. To the extent that our data approximate the model, we expect this estimated residual z_vi to be distributed more or less normally with a mean of about 0 and a variance of about 1.
Thus, as a rough but useful criterion for the fit of the data to the model, we can examine the extent to which these standard residuals approximate a normal distribution, i.e.,

z_vi ~ N(0, 1)

or their squares approximate a one degree of freedom chi-square distribution, i.e.,

z²_vi ~ χ²₁

The reference values of 0 for the mean and 1 for the standard deviation and the reference distributions of N(0, 1) and χ²₁ help us to see if the estimated standard residuals deviate significantly from their model expectations. This examination of residuals will suggest whether we can proceed to use these items to make measurements, or whether we must do further work on the items and the testing situation to bring them into line with reasonable expectations. It will also indicate when particular persons have failed to respond to the test in a plausible manner.

z_x = (x - p) / [p(1 - p)]^½
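This standardized residual is simple to compute. The Python fragment below is our own illustration (names are not from the text); it also exploits the fact that, under the model, z² reduces to exp(b - d) for an unexpected incorrect response and exp(d - b) for an unexpected correct one.

```python
import math

def rasch_p(b, d):
    """Model probability of a correct response for ability b and difficulty d."""
    return 1.0 / (1.0 + math.exp(d - b))

def squared_standardized_residual(x, b, d):
    """z^2 = [(x - p) / sqrt(p(1 - p))]^2 for an observed response x in {0, 1}.
    Algebraically this equals exp(b - d) when x = 0 and exp(d - b) when x = 1."""
    p = rasch_p(b, d)
    z = (x - p) / math.sqrt(p * (1.0 - p))
    return z * z
```

For Person 11 failing Item 4 (b = -1.2, d = -3.9), the fragment returns z² ≈ 15, the value entered for that response in Table 4.3.4.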
TABLE 4.3.2

                          ITEM                       PERSON
PERSON     4     5     7     6     8    12    14     ABILITY
  11      2.7                                          -1.2
  12                        1.7   0.8                  -1.2
  17            2.7                                    -0.6
   3                                    1.7             0.0
  13                  3.3   2.9         1.7             0.0
  29                  3.3         2.0          2.8      0.0

Item
Difficulty -3.9  -3.3  -3.3  -2.9  -2.0   1.7   2.8

(Where "1" was expected and "0" observed, the entry is (b - d); where "0" was expected and "1" observed, the entry is (d - b).)
TABLE 4.3.3
MISFIT STATISTICS

DIFFERENCE BETWEEN   SQUARED        IMPROBABILITY   RELATIVE          NUMBER OF ITEMS
PERSON ABILITY AND   STANDARDIZED   OF THE          EFFICIENCY OF     NEEDED TO MAINTAIN
ITEM DIFFICULTY*     RESIDUAL       RESPONSE        THE OBSERVATION   EQUAL PRECISION
                     z²                             I                 1000/I

-0.6, 0.4     1    .50   100    10
 0.5, 0.9     2    .33    90    11
 1.0, 1.2     3    .25    75    13
 1.3, 1.5     4    .20    65    15
 1.6, 1.7     5    .17    55    18
 1.8, 1.8     6    .14    50    20
 1.9, 2.0     7    .12    45    22
 2.1          8    .11    40    25
 2.2          9    .10    36    28
 2.3         10    .09    33    30
 2.4         11    .08    31    32
 2.5         12    .08    28    36
 2.6         13    .07    25    40
 2.7         15    .06    23    43
 2.8         16    .06    21    48
 2.9         18    .05    20    50
 3.0         20    .05    18    55
 3.1         22    .04    16    61
 3.2         25    .04    15    66
 3.3         27    .04    14    73
 3.4         30    .03    12    83
 3.5         33    .03    11    91
 4.6         99    .01     4   254

*For incorrect responses (x = 0) the entry is (b - d); for correct responses (x = 1) the entry is (d - b).
TABLE 4.3.4

                          ITEM                       PERSON
                                                     MISFIT
PERSON     4     5     7     6     8    12    14     TOTAL
  11      15                                           15
  12                        6     2                     8
  17            15                                     15
   3                                    6               6
  13                  27   18           6              51
  29                  27          7          17        51

Item
Misfit    15    15    54   24     9    12    17       146
Total
We can locate the difference +2.7 for the (b - d) of Person 11 on Item 4 in the first column of Table 4.3.3 and read the corresponding z² in Column 2 as 15. This value and all of the other values for the differences in Table 4.3.2 have been recorded in Table 4.3.4, which now contains all the z² for every instance of unexpectedness that we have observed for the six persons and seven items. In the margins of Table 4.3.4 are the sums of these z² for each person and item. These sums indicate how unexpected the person or item pattern of responses is.
of information provided by the observation as a percentage of the maximum information that one observation at (b - d) = 0, i.e., right on target, could provide. The percent information in an observation can be used to judge the value of any particular item for measuring a person. This can be done by considering how much information would be lost by removing that item from the test. Thus, the I of 23% for Person 11 on Item 4 gives us an indication of how much we gain by including Item 4 in the measurement of Person 11 or of how much we would lose were we to remove Item 4.
The way the idea of information or efficiency enters into judging the value of an observation is through its bearing on the precision of measurement. Measurement precision depends on the number of items in the record and on the relevance of each item to the particular person. We can simplify the evaluation of each item's contribution to our knowledge of the person by calculating what percent of a best possible item the item in question contributes. That is what the values of I in Column 4 provide.
When the item and person are close to one another, i.e., on target, then the item contributes more to the measure of the person than when the item and person are far apart. The greater the difference between item and person, the greater the number of items needed to obtain a measure of comparable precision and, as a result, the less efficient each item.
For example, it requires five 20% items to provide as much information about a person as could be provided by one 100% item. Thus, when (b - d) is about 3.0, it takes four to five times as many items to provide as much information as could be had from items that fell within one logit of the person, i.e., in the |b - d| < 1 region.
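The columns of Table 4.3.3 follow directly from the model. The Python sketch below is ours (function names are not from the text); it assumes the improbability of the unexpected response is 1/[1 + exp(diff)] at a gap of diff = |b - d| logits, that the efficiency I is the information p(1 - p) expressed as a percentage of its on-target maximum of 0.25, and that the last column of the table is 1000/I, the number of items needed to match the precision of ten on-target items.

```python
import math

def response_improbability(diff):
    """Probability of the unexpected response at a gap of diff = |b - d| logits."""
    return 1.0 / (1.0 + math.exp(diff))

def relative_efficiency(diff):
    """Information p(1 - p) as a percentage of the on-target maximum of 0.25."""
    p = 1.0 / (1.0 + math.exp(-diff))
    return 100.0 * p * (1.0 - p) / 0.25

def items_needed(diff):
    """Items needed to match the precision of ten on-target items: 1000 / I."""
    return 1000.0 / relative_efficiency(diff)
```

At diff = 2.7, for instance, the sketch reproduces the tabled row: improbability about .06, efficiency about 23%, and roughly 43 such items needed in place of ten on-target ones.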
To facilitate the use of Table 4.3.3, it has been arranged in four sections:
Upon examining the rows of Table 4.3.4 for high z² values in person records, we find that the highest accumulated values are for Persons 13 and 29. These are the two persons whose test behavior is most questionable, and so we will examine their records in more detail.
4.4 MISFITTING PERSON RECORDS

Table 4.4.1 displays the response vectors for Persons 13 and 29 over all 14 items. For each person we show their responses of 0 or 1, the concomitant (b - d) or (d - b) differences, depending upon whether the response is 0 for incorrect or 1 for correct, and the consequent value of z². The sums of the row of z² for Person 13 and Person 29 are, coincidentally, 53. According to the model, these accumulated z²'s ought to follow a chi-square distribution with 1 degree of freedom for each z² minus the degree of freedom necessary to estimate the person measure b.
Further, any sum of z²'s, when divided by its degrees of freedom f, should follow a mean square v = Σz²/f distribution which can conveniently be evaluated as the t statistic

t = [ln(v) + v - 1] [f/8]^½.

For Person 13 we have

v₁₃ = Σ z²_i / (14 - 1) = 53/13 = 4.1,  summed over the 14 items,

for which

t₁₃ = [ln(v₁₃) + v₁₃ - 1] [13/8]^½ = [1.4 + 4.1 - 1] [1.3] = 5.8,

which is a rather improbable value for t, if this person's performance fits the model.
For Person 29 we observe the same results and the same t statistic. With such significant misfit it would seem reasonable to diagnose these two records as unsuitable data sources either for the measurement of these two persons or for the calibration of these items.
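These two statistics are easy to script. The Python sketch below is ours (names are not from the text); it reproduces the mean square v = Σz²/f and the approximate t = [ln(v) + v - 1][f/8]^½ used above.

```python
import math

def misfit_mean_square(sum_z2, dof):
    """v = sum of squared standardized residuals divided by its degrees of freedom f."""
    return sum_z2 / dof

def misfit_t(v, dof):
    """Approximate t statistic for a mean square v on f degrees of freedom:
    t = [ln(v) + v - 1] * sqrt(f / 8)."""
    return (math.log(v) + v - 1.0) * math.sqrt(dof / 8.0)
```

For Person 13 (Σz² = 53 over f = 13) the sketch gives v ≈ 4.1 and t ≈ 5.7, and for Item 7 below (Σz² = 57 over f = 33) v ≈ 1.7 and t ≈ 2.6, close to the rounded hand values.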
4.5 MISFITTING ITEM RECORDS

We can also see in Table 4.3.4 that Items 7 and 6 show the greatest misfit among items, especially Item 7 with an accumulated z² of 54. In Table 4.5.1 we analyze the complete data vectors of these two items, showing for each person's response of 0 or 1 the associated (b - d) or (d - b) with their respective z².
For Item 7

v₇ = Σ z²_v / (34 - 1) = 57/33 = 1.7,  summed over the 34 persons,

for which

t₇ = [ln(v₇) + v₇ - 1] [33/8]^½ = [0.5 + 1.7 - 1] [2.0] = 2.4,

which is also a somewhat improbable value for t, if this item fits the model.
TABLE 4.5.1
COMPLETE FIT ANALYSIS FOR ITEMS 7 AND 6

                     ITEM 7 (d = -3.3)      ITEM 6 (d = -2.9)
PERSON      b        x    diff    z²        x    diff    z²
  25      -3.8       0    -0.5     1        1    +0.9     3
   4      -2.8       1    -0.5     1        0    +0.1     1
  33      -2.8       1    -0.5     1        0    +0.1     1
  27      -1.9       1    -1.4     0        1    -1.0     0
  11      -1.2       1    -2.1     0        1    -1.7     0
  12      -1.2       1    -2.1     0        0    +1.7     6
  17      -0.6       1    -2.7     0        1    -2.3     0
  19      -0.6       1    -2.7     0        1    -2.3     0
  30      -0.6       1    -2.7     0        1    -2.3     0
   2       0.0       1    -3.3     0        1    -2.9     0
   3       0.0       1    -3.3     0        1    -2.9     0
   5       0.0       1    -3.3     0        1    -2.9     0
   6       0.0       1    -3.3     0        1    -2.9     0
   8       0.0       1    -3.3     0        1    -2.9     0
   9       0.0       1    -3.3     0        1    -2.9     0
 (13)      0.0       0    +3.3    27        0    +2.9    18
  16       0.0       1    -3.3     0        1    -2.9     0
  26       0.0       1    -3.3     0        1    -2.9     0
  28       0.0       1    -3.3     0        1    -2.9     0
 (29)      0.0       0    +3.3    27        1    -2.9     0
  31       0.0       1    -3.3     0        1    -2.9     0
  10      +0.6       1    -3.9     0        1    -3.5     0
  18      +0.6       1    -3.9     0        1    -3.5     0
  14      +0.6       1    -3.9     0        1    -3.5     0
  32      +0.6       1    -3.9     0        1    -3.5     0
  20      +0.6       1    -3.9     0        1    -3.5     0
  21      +1.2       1    -4.5     0        1    -4.1     0
  22      +1.2       1    -4.5     0        1    -4.1     0
  23      +1.2       1    -4.5     0        1    -4.1     0
  34      +1.2       1    -4.5     0        1    -4.1     0
  15      +1.9       1    -5.2     0        1    -4.8     0
   7      +2.8       1    -6.1     0        1    -5.7     0
  24      +2.8       1    -6.1     0        1    -5.7     0

SUM OF SQUARES               57                          29

(Parenthesized persons 13 and 29 are the circled, misfitting records; diff is (b - d) for x = 0 and (d - b) for x = 1.)
For Item 6

v₆ = Σ z²_v / (34 - 1) = 29/33 = 0.9,  summed over the 34 persons,

for which

t₆ = [ln(v₆) + v₆ - 1] [33/8]^½ = [-0.1 + 0.9 - 1] [2.0] = -0.4,

an acceptable value for t, if this item fits the model.
We find that the mean square for Item 7 is significant but that the mean square for Item 6 is not. However, when we examine Table 4.5.1 again, we see that it is the two significantly misfitting persons 13 and 29 who contribute most to the misfit values for these two items. Now we have the opportunity of improving the fit of the data to the model, either by removing Item 7 and observing what happens then or by removing Persons 13 and 29.
4.6 BRIEF SUMMARY OF THE ANALYSIS OF FIT

x_vi = 0 if "incorrect" and
x_vi = 1 if "correct."

To evaluate the overall fit of person v, we sum his vector of standard square residuals (z²_vi) over the test of i = 1, L items, and calculate his person misfit statistic as

v_v = Σ_i z²_vi / (L - 1),   v = 1, N   [4.6.1]

To evaluate the fit of Item i, we sum the item's vector of standard square residuals (z²_vi) over the sample of v = 1, N persons, and calculate the item misfit statistic as

v_i = Σ_v z²_vi / (N - 1),   i = 1, L   [4.6.2]
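Both summaries are row and column averages of the matrix of squared standardized residuals. A minimal Python sketch (our own names):

```python
def person_and_item_misfits(z2):
    """Given a matrix z2[v][i] of squared standardized residuals for N persons
    by L items, return the person mean squares (4.6.1) and item mean squares (4.6.2)."""
    n_persons, n_items = len(z2), len(z2[0])
    person_v = [sum(row) / (n_items - 1) for row in z2]
    item_v = [sum(z2[v][i] for v in range(n_persons)) / (n_persons - 1)
              for i in range(n_items)]
    return person_v, item_v
```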
4.7 COMPUTER ANALYSIS OF FIT

In the analysis of fit done by hand we saw that certain person records and items had residuals evaluated as significant. Having shown the procedures for the analysis of fit by hand we turn to computer analysis and return to our calibration of the KCT with 18 items and 34 persons. In the calibration of the KCT we see from the fit mean square, given in the left panel of Table 4.7.1, that Item 7 produces the greatest misfit with a value of 1.98, not far from the 1.7 found in our hand computation. From our analysis of person misfit we know that Persons 13 and 29 greatly contributed to this misfit in Item 7.
Without this information at the time of our calibration, however, we might have considered the possible deletion of Item 7 because of its high fit mean square. With this much lack of fit for Item 7 we might have chosen to recalibrate with Item 7 removed. This has been done and the results are given in the middle panel of Table 4.7.1. Now we see that Item 6 has acquired a misfit of 2.73 even though previously, when we calibrated all 14 items, Item 6 had a fit mean square of only 0.90. This change in the status of Item 6 is troublesome. We do not seem to be focusing in on a set of suitable items. Nevertheless we go one step further and recalibrate once more, this time removing both Item 7 and Item 6. The results are in the right panel of Table 4.7.1. Alas, now we find that Item 8 has become a misfit. These attempts to find a properly fitting set of items appear doomed.
[TABLE 4.7.1
ANALYSIS OF FIT WITH UCON: ITEM DELETIONS
Three panels: L = 14, N = 34; L = 13, N = 34; L = 12, N = 34. Only fragments of the body were recoverable.]
T H E A N A LY S IS OF F IT 81
TABLE 4.7.2

                   UCON      UCON
PERSON   SCORE   ABILITY   MISFIT
            r        b         v
  25        2     -4.4       0.5
   4        3     -3.7       0.4
  33        3     -3.7       0.9
   1        4     -3.1       0.3
  27        4     -3.1       0.3
  11        5     -2.3       0.8
  12        5     -2.3       0.5
  17        6     -1.4       1.0
  19        6     -1.4       0.2
  30        6     -1.4       0.2
   2        7     -0.3       0.1
   3        7     -0.3       1.4
   5        7     -0.3       0.1
   6        7     -0.3       0.1
   8        7     -0.3       0.1
   9        7     -0.3       0.1
 (13)       7     -0.3       5.7   (Hand PROX = 4.1)
  16        7     -0.3       0.6
  26        7     -0.3       0.1
  28        7     -0.3       0.6
 (29)       7     -0.3       6.6   (Hand PROX = 4.1)
  31        7     -0.3       0.1
  10        8     +1.0       0.2
  14        8     +1.0       0.2
  18        8     +1.0       0.4
  20        8     +1.0       0.4
  32        8     +1.0       0.2
  21        9     +2.0       0.2
  22        9     +2.0       0.2
  23        9     +2.0       0.7
  34        9     +2.0       0.7
  15       10     +3.0       0.2
   7       11     +3.9       0.4
  24       11     +3.9       0.9

Mean                         0.7

(Parenthesized persons 13 and 29 are the circled, misfitting records.)
[TABLE 4.7.3: ANALYSIS OF FIT WITH UCON: PERSON DELETIONS (L = 14, N = 32). With Persons 13 and 29 deleted, the item fit statistics run from 0.10 to 1.03, and Items 6, 7 and 8 no longer stand out as misfitting.]
It seems clear that it was the test records of these two unpredictable persons which caused Item 7 and then Item 6 to seem to misfit. Thus, we learn that successive deletions of items without analyzing person fit can lead us to believe that items are misfitting when, in fact, it is the response records of a few irregular persons which are causing the trouble. While the very small sample size used in our example exaggerates the impact of the two irregular persons, even large samples do not completely obliterate the contaminating influence of irregular person records, and in a large sample such flawed records may be harder to spot and so remain unknown unless explicit tests of person fit are routinely made.
5 CONSTRUCTING A VARIABLE

5.1 GENERALIZING THE DEFINITION OF A VARIABLE
In Chapters 2, 3 and 4 we have shown how to expose and evaluate the observed relationship between intended measuring instruments, the test items, and the objects they are intended to measure, the persons. This prepares us for the present chapter which is concerned with how to define a variable.

If, however, when we compare these two estimates by a standard error or two, they overlap substantially, then we cannot assume that the two values differ and as a result no direction for a variable has been defined. Instead the items define a point without direction.
Figure 5.1.1 illustrates this. In Example 1 we have Items A and B separated from each other by several standard errors. Even with two items we begin to see a direction to the variable, at least as defined by these two items. In the second example, however, we find the two items so close to each other that, considering their standard errors, they are not separable. We have found a point. But no direction has been established and so no variable has as yet been implied.
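The comparison behind these two examples can be sketched numerically. The fragment below is not from the original text; it is a minimal illustration, assuming two item calibrations d1 and d2 with standard errors s1 and s2, of when two items are separable enough to define a direction:

```python
import math

def separation(d1, s1, d2, s2, crit=2.0):
    """Return (z, separated): the standardized difference between two
    item calibrations and whether it exceeds `crit` standard errors."""
    se_diff = math.sqrt(s1**2 + s2**2)   # standard error of the difference d1 - d2
    z = (d1 - d2) / se_diff
    return z, abs(z) >= crit

# Example 1: items several standard errors apart -> a direction is defined.
z1, sep1 = separation(-2.0, 0.4, 1.0, 0.4)
# Example 2: items closer than their errors can distinguish -> only a point.
z2, sep2 = separation(0.1, 0.4, -0.1, 0.4)
```

The criterion of two standard errors is the conventional rough cut used throughout this chapter; any similar value would serve for a first look.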
As an example of variable definition, we will continue our study of the KCT data to see how well the KCT items succeed in defining a variable and just what that variable seems to be.
5.2 DEFINING THE KCT VARIABLE

The items of the KCT form a tapping series that grows in length by increasing the number of taps and grows in complexity by the distance between adjacent taps and the number of reverses in direction of movement.
84 BEST TEST DESIGN
Figure 5.2.1 lists the 18 items comprising the original KCT. Each item is described by its numerical name, tapping series and tapping order pattern.

Table 5.2.1 focuses on those 14 KCT items that were calibrated in Chapters 2 and 3. Items 1, 2 and 3 are not included because they were too easy for the 34 persons in that sample and Item 18 is not included because it was too hard. Table 5.2.1 gives the item names, tapping series, item difficulties and their standard errors. The difficulty range of these 14 items is from -4.2 logits to +4.6 logits.
TABLE 5.2.1
CALIBRATION OF THE KCT VARIABLE
WITH ITEMS IN ORDER OF DIFFICULTY

ITEM    TAPPING         ITEM          STANDARD
NAME    SERIES          CALIBRATION   ERROR
4                       -4.2          0.8
5       2-1-4           -3.6          0.7
7                       -3.6          0.7
6                       -3.2          0.6
9                       -3.2          0.6
8                       -2.2          0.6
10                      -1.5          0.5
11      1-3-1-2-4        0.8          0.5
13                       1.9          0.5
12                       2.1          0.6
14                       3.2          0.7
15                       4.6          1.1
16      1-4-2-3-1-4      4.6          1.1
17                       4.6          1.1

Mean                     0.0

[Only the tapping series of Items 5, 11 and 16 are legible in the source.]
We see that most of the persons in this sample fall in the center of the test. But that is just where we have a large gap in test items. We have discovered something important and useful to us, namely that our test instrument is weakest at the mode of our sample. It becomes clear that, if we want to discriminate among the majority of persons found in the middle range of the KCT, then we must construct some additional middle range items which will be more appropriate to middle range abilities.
5.3 INTENSIFYING AND EXTENDING THE KCT VARIABLE

With these considerations in mind, further development of the KCT variable was undertaken. All 18 items from the original KCT were retained, and ten new items were added. The original KCT was from Form II of the Arthur Point Scale. We examined Form I and found three items not used in Form II (Arthur, 1943). To these three items we added seven more. Five items were designed to fill the middle range gap, four items were designed to extend the KCT variable upward and one of the Form I items was expected to fit near old Items 5, 6 and 7. The tapping series for these additional items and their intended locations on the KCT variable are shown in Figures 5.3.1 and 5.3.2.

Figure 5.3.1 shows the one item from Form I and the five new items designed to fill the gap between the old KCT Items 10 and 11. The four items designed to extend the KCT in the region of Item 18 are shown in Figure 5.3.2. The result is a new test form, KCTB, which contains all 18 old items and, in addition, 10 new items. This new instrument of 28 items was administered to a sample of 101 persons and Items 2 through 25 were calibrated. Item 1 was still too easy and Items 26, 27 and 28 were still too hard to be calibrated.
Column 6 in Table 5.3.1 gives these new KCTB calibrations. The rest of Table 5.3.1 shows the relationship between the old KCT and the new KCTB calibrations. Column 1 names the 14 old KCT items. Column 2 shows their original calibrations from Table 3.4.4. Notice in Column 6 that we have now obtained calibrations on old KCT Items 2, 3 and 18, three of the original items which remained uncalibrated in our first study with 34 persons.

Column 3 of Table 5.3.1 applies the necessary adjustment to bring the old KCT calibrations into line with their new calibrations on the new KCTB. This is done by shifting the calibrations in Column 2 by the constant 0.4 which is the mean position of the old KCT items in the new KCTB calibrations. This causes Column 3 and Column 5 to have the same mean of 0.4.
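This kind of mean alignment can be sketched in a couple of lines. The code below is only an illustration with made-up numbers, not the KCT data: it shifts a set of old calibrations so that their mean matches the mean of the same items' calibrations in a new test.

```python
def align(old, new_for_old):
    """Shift old calibrations so their mean equals the mean of the same
    items' calibrations in a new test (both lists in the same item order).
    Returns the shift constant and the shifted calibrations."""
    shift = sum(new_for_old) / len(new_for_old) - sum(old) / len(old)
    return shift, [d + shift for d in old]

# Toy values: three common items calibrated in an old and a new test.
shift, shifted = align([-1.0, 0.0, 1.0], [-0.4, 0.5, 1.1])
```

Because each calibration is centered at zero within its own test, the shift is just the mean of the common items' new calibrations, exactly as with the 0.4 logit constant in Table 5.3.1.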
In Table 5.3.1 we see that the new KCTB Items 12 through 16 fall more or less where expected, if somewhat on the easy side. KCTB Item 25 along with KCT Item 18 extend the reach of the KCT variable 2 logits further upwards, but we have found no one who succeeds on KCTB Items 26, 27 and 28.

Figure 5.3.3 compares the difficulties of those items which appeared in both the KCT and KCTB calibrations. Each of the 14 items is located in Figure 5.3.3 by its pair of difficulty estimates. If the items fit the measurement model, then we expect these independent estimates of their difficulties to be statistically equivalent.

Thus the extent to which the 14 points fall along the identity line tests the invariance of these 14 items' difficulties. As Figure 5.3.3 shows, the 14 points all lie well within the 95% quality control lines. This is the pattern that the model says they must approximate in order to be useful as instruments of measurement.
[Figures 5.3.1 and 5.3.2: tapping series and intended locations of the additional items. In the renumbering, old KCT Items 1-6 become new KCTB Items 1-6 and old Items 7-9 become new Items 8-10.]
CONSTRUCTING A V A R IA B L E 91
TABLE 5.3.1
CALIBRATION OF KCTB

        Old KCT                          New KCTB
(1)     (2)          (3)          (4)    (5)          (6)
Item    Original     Shifted*     Item   Old Item     All Item
Name    Calibration  Calibration  Name   Calibration  Calibration
2                                  2                  -6.0
3                                  3                  -5.6
4       -4.2         -3.8          4     -3.8         -3.8
5       -3.6         -3.2          5     -2.3         -2.3
6       -3.2         -2.8          6     -2.5         -2.5
                                   7                  -4.0
7       -3.6         -3.2          8     -2.3         -2.3
8       -2.2         -1.8          9     -1.8         -1.8
9       -3.2         -2.8         10     -1.8         -1.8
10      -1.5         -1.1         11     -0.8         -0.8
                                  12                   0.1
                                  13                  -0.6
                                  14                  -0.3
                                  15                  -1.3
                                  16                  -0.5
11       0.8          1.2         17      2.2          2.2
12       2.1          2.5         18      1.6          1.6
13       1.9          2.3         19      2.2          2.2
14       3.2          3.6         20      3.1          3.1
15       4.6          5.0         21      3.6          3.6
16       4.6          5.0         22      3.6          3.6
17       4.6          5.0         23      4.7          4.7
18                                24                   6.5
                                  25                   6.0

*The Chapter 3 calibrations of the 14 old KCT items in Column 2 have been shifted along the variable by 0.4 logits so that the mean of these Chapter 3 calibrations equals their mean calibration in the new KCTB calibrations. This new mean was calculated from Column 5.
[Figure 5.3.3: Plot of item calibrations, KCT versus KCTB, with 95% quality control lines about the identity line.]

[Figure 5.4.1: Construction of the 95% quality control boundaries around the identity line for a plot of paired item calibrations.]
Figure 5.3.3 contains a pair of 95% quality control lines which help us see the extent to which the 14 item points conform to our model expectation of item difficulty invariance. In plots which are used to evaluate the invariance of item difficulty and hence the quality of items, these 95% lines make it easy to see how satisfactorily the item points in the plot follow the expected identity line.
Figure 5.4.1 shows how such lines are drawn. Each plot compares a series of paired item calibrations. Each item has a difficulty d_i and a standard error s_i from each of two independent calibrations in which the item appeared. Thus for each item i we have (d_i1, s_i1) and (d_i2, s_i2). Since each pair of calibrations applies to one item, we expect the two difficulties d_i1 and d_i2, after a single translation necessary to establish an origin common to both sets of items, to estimate the same difficulty δ_i. We also expect the error of these estimates to be estimated by s_i1 and s_i2.

This gives us a statistic for testing the extent to which the two d_i's estimate the same δ_i, namely

    t_i = (d_i1 - d_i2) / (s_i1² + s_i2²)^½

in which (s_i1² + s_i2²)^½ estimates the expected standard error of the difference between the two independent estimates d_i1 and d_i2 of the one parameter δ_i. We can introduce this test for the quality of each item point into the plot by drawing quality control boundaries at about two of these standard errors away from the identity line on each side.

Since the standard unit of difference error parallel to either axis of the plot is (s_i1² + s_i2²)^½, the standard unit of difference error perpendicular to the identity line is

    [(s_i1² + s_i2²)/2]^½ .

Two of these error units perpendicular to the identity line in each direction yield a pair of approximately 95% quality control lines. The perpendicular distance D_i between these quality control lines and the identity line thus becomes

    D_i = 2[(s_i1² + s_i2²)/2]^½ .

When s_i1 and s_i2 are sufficiently similar so that the mean of their squares is approximately the same as the square of their mean, that is

    (s_i1² + s_i2²)/2 ≈ [(s_i1 + s_i2)/2]² ,

then the distance D_i from the identity line to a 95% confidence boundary can be approximated by

    D_i ≈ s_i1 + s_i2 .

Thus for the i = 1, K items for which paired calibrations are available the distances (s_i1 + s_i2) perpendicular to the identity line drawn through each item point can be used to locate 95% confidence lines for evaluating the overall stability of the item calibrations shown in the plot.
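As a sketch (not the authors' program), the approximation can be applied point by point: an item lies inside the 95% control lines when its perpendicular distance from the identity line is no more than s_i1 + s_i2.

```python
import math

def control_check(d1, s1, d2, s2):
    """Check one item point against the ~95% quality control lines.
    Returns the point's perpendicular distance from the identity line,
    the control-line distance D = s1 + s2 (about two perpendicular
    error units), and whether the point falls inside the lines."""
    distance = abs(d1 - d2) / math.sqrt(2.0)  # perpendicular distance to identity line
    boundary = s1 + s2                        # approximate 95% control distance
    return distance, boundary, distance <= boundary

# An item calibrated at -1.9 (SE 0.4) in one test and -1.6 (SE 0.5) in the other:
dist, bound, inside = control_check(-1.9, 0.4, -1.6, 0.5)
```

Here the point sits well inside the lines; a pair such as (-2.0, 0.2) versus (0.0, 0.2) would fall far outside them.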
In contrast, a Rasch approach could do the same job with each person taking only one test of 60 items. To accomplish this a third 60-item test C is made up of 30 items from each of the original tests A and B. Then each of these three tests is given to a sample of 400 persons as depicted in the lower part of Figure 5.5.1. Now each person takes only one test, but all 120 items are calibrated together through the two 30-item links connecting the three tests. The testing burden on each person is one-half of that required by the equal-percentile plan.
In Rasch equating the separate calibrations of each test produce a pair of independent item difficulties for each linking item. According to the model, the estimates in each pair are statistically equivalent except for a single constant of translation common to all pairs in the link. If two tests, A and B, are joined by a common link of K items, each test is given to its own sample of N persons, and d_iA and d_iB are the estimated difficulties of item i in each test with standard errors of about 2.5/N^½, then the single constant necessary to translate all item difficulties in the calibration of Test B onto the scale of Test A is

    G_AB = Σ_i (d_iA - d_iB) / K     [5.5.1]

The quality of each linking item can then be tested with the statistic

    z_i² = (d_iA - d_iB - G_AB)² N / 12.5

which according to the model should be approximately chi-square with one degree of freedom.
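This computation can be sketched in a few lines. The function and data below are hypothetical, an illustration of equation 5.5.1 and the per-item fit statistics rather than the authors' procedure; the 12.5/N variance follows from the 2.5/N^½ standard errors of the two independent difficulty estimates.

```python
def link_translation(dA, dB, N):
    """Estimate the translation G_AB from the paired linking-item
    difficulties (equation 5.5.1), plus a chi-square-like fit statistic
    per item, assuming each difficulty has standard error 2.5/sqrt(N)."""
    K = len(dA)
    G = sum(a - b for a, b in zip(dA, dB)) / K
    # var(d_iA - d_iB) = 6.25/N + 6.25/N = 12.5/N
    chisq = [(a - b - G) ** 2 * N / 12.5 for a, b in zip(dA, dB)]
    return G, chisq

# Hypothetical 5-item link between Tests A and B, 200 persons per test:
dA = [-1.2, -0.4, 0.1, 0.6, 1.3]
dB = [-2.1, -1.5, -0.9, -0.3, 0.2]
G, chisq = link_translation(dA, dB, 200)   # G comes to 1.0 logit
```

Each chisq value can be compared against a chi-square distribution with one degree of freedom to flag a suspect linking item.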
[Figure 5.5.1: Traditional equal-percentile equating versus Rasch common item equating designs.]
In using these chi-square statistics to judge link quality we must not forget how they are affected by sample size. When N exceeds 500 these chi-squares can detect link flaws too small to make any tangible difference in G_AB. When calibration samples are large the root mean square misfit is more useful. This statistic can be used to estimate the logit increase in calibration error caused by link flaws.

In deciding how to act on evaluations of link fit, we must also keep in mind that random uncertainty in item difficulty of less than .3 logits has no practical bearing on person measurement (Wright and Douglas, 1975a, 35-39). Because of the way sample size enters into the calculation of item difficulty and hence into the evaluation of link quality, we can deduce that samples of 200 persons and links of 10 good items will always be more than enough to supervise link validity at better than .3 logits. In practice we have found that we can construct useful item banks with sample units as small as 100 persons.
5.6 BUILDING ITEM BANKS

As we establish and extend the definition of a variable by the addition of new items we have the beginning of an item bank. With careful planning we can introduce additional items systematically and in this way build up a bank of calibrated items useful for an increasing variety of measurement applications. As the number of items increases, the problems of managing such a bank multiply. There is not only the question of how best to select and combine items and persons, but of how to manage effectively the consequent collection of calibrated items. Rasch measurement provides a specific well-defined approach to managing item banking.

The basic structure necessary to calibrate many items onto a single variable is the common item link in which one set of linking test items is shared by and so connects together two otherwise different tests. An easy and a hard test could be linked by a common set of items as pictured in Figure 5.6.1. In this example the linking items are the "hard" items in the EASY test but the "easy" items in the HARD test.
With two or more test links we can build a chain of the kind shown in Figure 5.6.2. The representation in Figure 5.6.2, however, is awkward. The linking structure can be conveyed equally well by the simpler scheme in Figure 5.6.3 which emphasizes the links and facilitates diagramming more complicated structures.

As the number and difficulty range of the items introduced into an item bank grows beyond the test-taking capacity of any one person, the chain of items must be parceled into test forms of manageable length and difficulty range. In Figure 5.6.3 each circle indicates a test sufficiently narrow in range of item difficulties to be manageable by a suitably chosen sample of persons. Each line connecting a circle represents a link of common items shared by the two tests it joins. Tests increase in difficulty horizontally along the variable and are comparable in difficulty vertically.
[Figure 5.6.2: A chain of two links, Link AB and Link BC.]

[Figure 5.6.3: A chain of two links (simplified), with tests arranged from easy to hard along the variable.]
Three links can be constructed to form a loop as in Figure 5.6.4. This loop is an important linking structure because it yields an additional test of link coherence. If the three links in a loop are consistent, then the sum of their three link translations should estimate zero,

    G_AB + G_BC + G_CA ≈ 0 .

Notice that G_AB means the shift from Test A to Test B as we go around the loop clockwise, so that G_CA means the shift from Test C back to Test A. Estimating zero "statistically" means that the sum of these shifts should come to within a standard error or two of zero. The standard error of the sum G_AB + G_BC + G_CA will be about

    3.5 [1/(N_AB K_AB) + 1/(N_BC K_BC) + 1/(N_CA K_CA)]^½

in which the N's are the calibration sample sizes and the K's are the number of items in each link.
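A loop check of this kind can be sketched as follows. This is an illustration under the formula just given, with hypothetical translations, not the authors' code:

```python
import math

def loop_check(G_AB, G_BC, G_CA, links):
    """Test loop coherence: the three translations should sum to about
    zero within 3.5*sqrt(sum of 1/(N*K)) over the three links.
    `links` is a list of (N, K) pairs for the AB, BC and CA links."""
    total = G_AB + G_BC + G_CA
    se = 3.5 * math.sqrt(sum(1.0 / (N * K) for N, K in links))
    return total, se, abs(total) <= 2 * se   # within a standard error or two

# Hypothetical loop: each link calibrated on 200 persons with 10 common items.
total, se, coherent = loop_check(1.02, 0.55, -1.49, [(200, 10)] * 3)
```

With these numbers the loop sum is 0.08 logits against a standard error of about 0.14, so the loop would be judged coherent.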
With four or more tests we can construct a network of loops. For example, a sequence of increasingly difficult tests could be commonly calibrated by a series of connecting links as shown in Figure 5.6.5. These ten tests mark out seven levels of difficulty from Tests A through D. This network could connect ten 60-item tests by means of nineteen 10-item links to cover 600 - 190 = 410 items. If 200 persons were used for each test, then 410 items could be evaluated for possible calibration together from the responses of only 2,000 persons. Even 1,000 persons, at 100 per test, would provide a substantial purchase on the possibilities for building an item bank out of the best of the 410 items.

The building blocks of a test network are the loops of three tests each. If a loop fits the Rasch model, then its three translations should sum to within a standard error or two of zero. Thus the success of the network at linking item calibrations can be evaluated from the magnitudes and directions of these loop sums. Shaky regions can be identified and steps taken to avoid or improve them.
The implementation of test networks can lead to banks of commonly calibrated items far larger in number and far more dispersed in difficulty than any single person can handle. The resulting banks, because of the calibration of their items onto one common variable, can provide the item resources for a prolific family of useful tests, long or short, easy or hard, widely spaced in item difficulty or narrowly focused, all automatically equated in the measures they imply.

These methods for building item banks can be applied to existing tests, if they have been carefully constructed. Suppose we have two non-overlapping, sequential series of tests A1, A2, A3, A4 and B1, B2, B3, B4 which we want to equate by Rasch methods. All eight tests can be equated by connecting them with a new series of intermediate tests X, Y and Z made up entirely from items common to both series as shown in Figure 5.6.6. Were the A and B series of tests in Figure 5.6.6 still in the planning stage, they could also be linked directly by embedding common items in each test according to the pattern shown in Figure 5.6.7.
Since coherence is a vital concern in the building of an item bank, we are especially interested in linking structures which maximize statistical control over the joint coherence of all item calibrations. Networks which maximize the number of links among test forms so that each form is linked to as many other forms as possible do this. In the extreme, this leads to a web in which every individual item in a form links that form to another different form.
[Figure 5.6.6: Test Series A and B connected along the variable by intermediate tests X, Y and Z.]

[Figure 5.6.7: Test Series A and B linked directly by common items embedded in each test.]
To illustrate we take a very small banking problem where we use 10 items per form in a web in which each of these 10 items also appears in one of 10 other different forms. The complete set of 10 + 1 = 11 forms constitutes a web woven out of 11 x 10/2 = 55 individual linking items. Every one of the 11 forms is woven to every other form. The pattern looks like the picture in Figure 5.6.8.

We will call this bank building design a "complete" web because every form is woven to every other form. In the design of useful webs, however, there are three constraints which affect their construction. These are the total number of items we want to calibrate into the bank, the maximum number of items which we can combine into a single form and the extent to which the bank we have in mind reaches out in difficulty beyond the capacity of any one person.
The testing situation and the capacity of the persons taking the test forms will limit the number of items we can put into a single form. It will usually happen, however, that we want to calibrate many more items than we can use up in a complete web like the one illustrated in Figure 5.6.8. There are two possibilities for including extra items. The simplest, but not the best statistically, is to design a "nuclear" complete web which uses up some portion of the items we can include in a single form. We then fill out the required form length with additional "tag" items. These tag items are calibrated into the
bank along with the link items in their form. Unlike the link items, however, which always appear in two forms, the tag items appear in only one form and so give no help with linking forms together into one commonly calibrated bank.

[Figure 5.6.8: A complete web for parallel forms. Eleven forms A through K with 10 items per form; each form shares one linking item with each of the other 10 forms, so that (11 x 10)/2 = 55 items weave every form to every other form.]
The incomplete web in Figure 5.6.9 is suitable for linking a set of parallel test forms. When the reach of the bank goes beyond the capacity of any one person, however, neither of the webs in Figures 5.6.8 and 5.6.9 will suffice, because we will be unable to combine items from the easy and hard ends of the bank into the same forms. The triangle of linking items in the upper right corners of Figures 5.6.8 and 5.6.9 will not be functional and will have to be deleted. In order to maintain the balance of linking along the variable we will have to do something at each end of the web to fill out the easiest and hardest forms so that the extremes are as tightly linked as the center. Figure 5.6.10 shows how this can be done systematically for a set of 21 sequential forms. We still have 10 items per form but now only adjacent forms are linked together. There are no common items connecting the easiest forms directly with the hardest forms. But over the range of the variable the forms near to one another in difficulty level are woven together with the maximum number of item links.
Each linking item in the webs shown in Figures 5.6.8, 5.6.9 and 5.6.10 could in fact refer to a cluster of two or more items which appear together in each of the two forms they link. Sometimes the design or printing format of items forces them into clusters. This happens typically in reading comprehension tests where clusters of items are attached to reading passages. It also occurs naturally on math and information retrieval tests where clusters of items refer to common graphs. Clustering, of course, increases the item length of each form by a factor equal to the cluster size.

The row means of the link matrix calibrate the forms onto one common variable. Once form difficulties are obtained they need only be added to the item difficulties within forms to bring all items onto the common variable shared by the forms.
[Figure 5.6.9: An incomplete web for parallel forms. Twenty-one forms A through U with 10 items per form; each form is linked only to nearby forms.]

Formulation: N = ML/2
where N = number of items (or links) in the bank
      M = number of forms, i.e., 2N/L
      L = number of items (or links) per form
          (must be even)
[Figure 5.6.10: An incomplete web for sequential forms. Twenty-one forms A through U running from easy to hard with 10 items per form; only forms adjacent in difficulty are linked, and the easiest and hardest forms are filled out so that the extremes are as tightly linked as the center, 21 x 10/2 + 3 = 108 items in all.]

Formulation: N = ML/2 + K
where N = number of items (or links) in the bank
      M = number of forms, i.e., 2(N - K)/L
      L = number of items (or links) per form
          (must be even)
      K = L/4, if L/2 is even
        = (L + 2)/4, if L/2 is odd
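The two web formulations can be collected into one small helper. This is only a sketch, with a function name of our own choosing:

```python
def web_items(M, L, sequential=False):
    """Number of items (links) in a web of M forms with L items per form.
    Complete or parallel web: N = M*L/2.
    Sequential web: N = M*L/2 + K, with K = L/4 if L/2 is even,
    else (L + 2)/4."""
    if L % 2:
        raise ValueError("L must be even")
    N = M * L // 2
    if sequential:
        K = L // 4 if (L // 2) % 2 == 0 else (L + 2) // 4
        N += K
    return N

# The complete web of Figure 5.6.8: 11 forms, 10 items per form.
n_complete = web_items(11, 10)             # 55 items
# The sequential web of Figure 5.6.10: 21 forms, 10 items per form.
n_sequential = web_items(21, 10, True)     # 108 items
```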
The incomplete webs in Figures 5.6.9 and 5.6.10 require us to estimate row means from a matrix with missing data. The skew symmetry of link matrices helps the solution to this problem which can be done satisfactorily by iteration or regression.
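One simple iterative scheme of the kind suggested can be sketched as follows. It is only an illustration, not the authors' procedure: `links` holds observed link translations t_fg estimating the difficulty difference D_f - D_g between forms f and g, the skew-symmetric mirror entries are filled in automatically, and each form is repeatedly re-estimated from whatever links it has.

```python
def form_difficulties(links, n_forms, iters=50):
    """Estimate centered form difficulties from an incomplete link matrix.
    `links` maps (f, g) -> t_fg, an estimate of D_f - D_g; only one of
    (f, g) and (g, f) need be given (skew symmetry supplies the other).
    With complete data this reproduces the row means of the link matrix."""
    t = dict(links)
    for (f, g), v in links.items():
        t.setdefault((g, f), -v)          # skew-symmetric mirror
    D = [0.0] * n_forms
    for _ in range(iters):
        new = []
        for f in range(n_forms):
            obs = [v + D[g] for (ff, g), v in t.items() if ff == f]
            new.append(sum(obs) / len(obs) if obs else D[f])
        mean = sum(new) / n_forms
        D = [d - mean for d in new]       # recenter each pass
    return D

# Three forms in a chain: only links AB and BC observed, no AC link.
links = {(0, 1): -1.0, (1, 2): -1.0}      # form 0 easiest, form 2 hardest
D = form_difficulties(links, 3)           # converges to about [-1, 0, 1]
```

A regression (least squares) solution of the same equations is the usual alternative when the web is large.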
5.7 BANKING THE KCTB DATA

The KCTB is a short test so it was practical to ask all 101 persons to attempt all 23 items giving us the response matrix illustrated in Figure 5.7.1. However, most item banking projects involve the calibration of hundreds of items given to thousands of examinees. It is then impossible to ask every person to take every item. Fortunately building an item bank does not require such an undertaking. As we saw in Section 5.6, items can be joined together by a network of links. In general, two types of form equating are possible, common persons and common items.
One way to link separate forms is to administer them both to the same sample of persons. We illustrate "common person" equating with our KCTB data by defining two non-overlapping sequential tests, EASY and HARD, and finding everyone who produced measurable responses simultaneously in both tests. This is an attempt at the vertical equating of an easy and a hard test and we can expect persons with usable scores on both tests to be scarce. With our KCTB example there are only 29 such persons out of 101. The picture of this common person equating in Figure 5.7.2 shows the core of 29 persons from the total sample linking two non-overlapping parts of the KCTB, a 9-item EASY test and an 8-item HARD test.
[Figure 5.7.1: Common persons and common items with KCTB. Items 3 through 25 divide into 9 easy items (5-11, 13, 15) and 8 hard items (12, 14, 16-20, 22); 29 of the 101 persons produce measurable responses on both parts and form the common person core.]
A better way to equate forms is by using common items. This approach to KCTB is shown in Figure 5.7.3. There we show eight easy items connected to nine hard items by a six item link producing a 14 item EASY + LINK form taken by the 50 lowest scoring persons and a 15 item LINK + HARD form taken by the 51 highest scoring persons.
5.8 COMMON PERSON EQUATING WITH THE KCTB

The measurements of these 29 persons on each form constitute the common person data for linking the EASY and HARD forms together. It is the difference in the two ability means which estimates the shift required to bring the EASY and HARD forms onto a common scale. From the ability statistics for the 29 persons on each form we proceed as follows:

1. Use the observed difference in sample mean ability 1.49 - (-0.57) = 2.06 as the estimated difficulty difference between the two forms.

2. Apportion this difference over the nine EASY items and the eight HARD items so that the average difficulty of all 17 items becomes zero.

3. Bring the two forms onto a common scale by subtracting 0.97 from each EASY form item difficulty and adding 1.09 to each HARD form item difficulty.
These computations are displayed in Table 5.8.1. Column 1 gives the KCTB item name for the 17 items used in the EASY and HARD forms. Column 2 gives the separate item calibrations for the EASY form. Column 3 gives the separate calibrations for the HARD form. Because these separate calibrations are each centered within their own form Columns 2 and 3 each sum to zero.
Figure 5.8.1 compares the common person scale and the reference scale. The small differences between the two scales show that the common person technique can produce results equivalent to a combined calibration of both tests.
TABLE 5.8.1
EQUATING EASY AND HARD FORMS
USING COMMON PERSONS

Item   EASY     HARD     Common         Reference   Difference
Name   Form     Form     Person Scale   Scale
5      0.03              -0.94          -1.04       -0.10
6      0.03              -0.94          -1.04       -0.10
7     -0.94              -1.91          -2.05       -0.14
8      0.03              -0.94          -1.04       -0.10
9      0.24              -0.73          -0.82       -0.09
10     0.43              -0.54          -0.62       -0.08
11     1.36               0.39           0.35       -0.04
12             -1.44     -0.32          -0.10        0.22
13    -0.22              -1.19          -1.30       -0.11
14             -1.25     -0.16           0.05        0.21
15    -0.94              -1.91          -2.05       -0.14
16             -2.66     -1.57          -1.30        0.27
17             -0.12      0.97           1.10        0.13
18              0.65      1.74           1.81        0.07
19              0.65      1.74           1.81        0.07
20              1.83      2.92           2.90       -0.02
22              2.32      3.41           3.36       -0.05
[Figure 5.8.1: The common person scale plotted against the reference scale.]
5.9 COMMON ITEM EQUATING WITH THE KCTB

To illustrate common item equating we have divided the 23 KCTB items into three parts: EASY, LINK and HARD. The EASY + LINK form contains eight EASY items and six LINK items to make a 14 item easy test. The LINK + HARD form contains the six common LINK items plus nine HARD items making a 15 item hard test.

The paired calibrations of the six linking items, 11 through 16, are given again in Columns 2 and 3 of Table 5.9.2. Their differences D = dE - dH are given in Column 4. The mean of these differences is 4.11 which is the difficulty difference between the EASY + LINK form and the LINK + HARD form. When this difference of 4.11 is subtracted from D we have the residuals from linking given in Column 5.
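These link computations can be sketched as follows. The code is an illustration, not the authors' program; the difficulties and standard errors are the Column 2 and 3 values for the six linking items.

```python
import math

def link_analysis(dE, dH, sE, sH):
    """Compute the link shift (mean of D = dE - dH over the linking items)
    and each item's standardized residual z = (D - shift)/S_D, where
    S_D = sqrt(sE^2 + sH^2)."""
    D = [e - h for e, h in zip(dE, dH)]
    shift = sum(D) / len(D)
    z = [(d - shift) / math.sqrt(se**2 + sh**2)
         for d, se, sh in zip(D, sE, sH)]
    return shift, z

# The six KCTB linking items, 11 through 16, from the separate
# EASY + LINK and LINK + HARD calibrations:
dE = [0.97, 2.08, 1.58, 1.95, 0.84, 1.21]
dH = [-2.24, -1.83, -3.22, -2.80, -3.90, -2.02]
sE = [0.36, 0.38, 0.36, 0.37, 0.36, 0.36]
sH = [0.49, 0.44, 0.73, 0.61, 1.01, 0.46]
shift, z = link_analysis(dE, dH, sE, sH)   # shift comes to about 4.11
```

All six standardized residuals come out well inside two standard errors, so the link is usable.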
Item    EASY + LINK            LINK + HARD
Name    Difficulty   Error     Difficulty   Error
3       -3.80
4       -2.00
5       -0.37
6       -0.37
7       -2.00
8       -0.37
9        0.06
10       0.20
11       0.97        .36       -2.24        .49
12       2.08        .38       -1.83        .44
13       1.58        .36       -3.22        .73
14       1.95        .37       -2.80        .61
15       0.84        .36       -3.90       1.01
16       1.21        .36       -2.02        .46
17                              0.60
18                             -0.50
19                              0.26
20                              1.18
21                              1.56
22                              1.56
23                              2.78
24                              4.51
25                              4.06
If these items are providing a usable link, their residuals should distribute around zero with the standard error predicted by the model. The standard error of each difference is

    S_D = (s_E² + s_H²)^½

so that each residual can be standardized as

    z = (D - 4.11) / S_D .

TABLE 5.9.2
LINK ANALYSIS

        Calculating LINK SHIFT             Testing LINK FIT
Item                     Difference        Residual    Standard Error    Standardized
Name    dE       dH      D = dE - dH       D - 4.11    of Residual S_D   Residual z
11      0.97    -2.24    3.21              -0.90       0.61              -1.48
12      2.08    -1.83    3.91              -0.20       0.58              -0.34
13      1.58    -3.22    4.80               0.69       0.81               0.85
14      1.95    -2.80    4.75               0.64       0.71               0.90
15      0.84    -3.90    4.74               0.63       1.07               0.59
16      1.21    -2.02    3.23              -0.88       0.58              -1.51

LINK Shift = Σ D_i / 6 = 4.11
Expected mean of z is 0
To the LINK difficulties dH we add the link difficulty difference of 4.11. Then we average the LINK dE difficulties with the LINK dH difficulties that were adjusted by the LINK shift of 4.11. The average of the two LINK estimates (dE + dH + 4.11)/2 for Items 11 through 16 is given in Column 5. We enter these in Column 6.
[Figure 5.9.1: The form design for common item equating.]
TABLE 5.9.3
EQUATING EASY AND HARD FORMS
BY A COMMON ITEM LINK

Item    EASY+LINK   LINK+HARD   Shifted     Link Average         Common   Centered
Name    dE          dH          dH + 4.11   (dE + dH + 4.11)/2   Scale    Scale
3       -3.80                                                    -3.80    -6.10
4       -2.00                                                    -2.00    -4.30
5       -0.37                                                    -0.37    -2.67
6       -0.37                                                    -0.37    -2.67
7       -2.00                                                    -2.00    -4.30
8       -0.37                                                    -0.37    -2.67
9        0.06                                                     0.06    -2.24
10       0.20                                                     0.20    -2.10
11       0.97       -2.24        1.87        1.42                 1.42    -0.88
12       2.08       -1.83        2.28        2.18                 2.18    -0.12
13       1.58       -3.22        0.89        1.24                 1.24    -1.06
14       1.95       -2.80        1.31        1.63                 1.63    -0.67
15       0.84       -3.90        0.21        0.53                 0.53    -1.77
16       1.21       -2.02        2.09        1.65                 1.65    -0.65
17                   0.60        4.71                             4.71     2.41
18                  -0.50        3.61                             3.61     1.31
19                   0.26        4.37                             4.37     2.07
20                   1.18        5.29                             5.29     2.99
21                   1.56        5.67                             5.67     3.37
22                   1.56        5.67                             5.67     3.37
23                   2.78        6.89                             6.89     4.59
24                   4.51        8.62                             8.62     6.32
25                   4.06        8.17                             8.17     5.87

Standard
Deviation   1.68    2.64        2.64                              3.37     3.37
Finally, in order to place the HARD items on the common scale, we add 4.11 to HARD Items 17 through 25 and bring these difficulty estimates over to complete Column 6. We then have in Column 6 a new common item scale with the average of two LINK difficulty estimates and the HARD difficulty estimates all connected to the EASY item difficulty estimates.
The mean of this common item scale in Column 6 is 2.30 so we subtract 2.30 from each item difficulty in Column 6 to center the new scale at 0.00 as shown in Column 7.

To assess the adequacy of this common item equating we will compare it to the item difficulties we would have gotten had we not attempted linking but used all 101 person responses to all 23 items. The common item difficulties from Table 5.9.3 are given in Column 2 of Table 5.9.4. Column 3 gives the reference calibrations of all 23 items from all 101 persons, and Column 4 shows the differences between the common item difficulties d_C and the reference scale item difficulties d_R. The plot of these values given in Figure 5.9.2 shows the items close to the expected identity line.
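The construction of the common, centered scale can be sketched as follows (modern Python notation; list names are ours). The link items take the average of their two shifted estimates, the HARD-only items are moved up by the LINK shift of 4.11, and the whole scale is then centered at zero.

```python
# EASY-form difficulties, Items 3-10
easy_only = [-3.80, -2.00, -0.37, -0.37, -2.00, -0.37, 0.06, 0.20]
# Link items 11-16, calibrated on both forms
link_easy = [0.97, 2.08, 1.58, 1.95, 0.84, 1.21]
link_hard = [-2.24, -1.83, -3.22, -2.80, -3.90, -2.02]
# HARD-form difficulties, Items 17-25
hard_only = [0.60, -0.50, 0.26, 1.18, 1.56, 1.56, 2.78, 4.51, 4.06]

shift = 4.11  # LINK shift from Table 5.9.2

# Column 6: EASY items as-is, link items averaged, HARD items shifted
common = (easy_only
          + [(de + dh + shift) / 2 for de, dh in zip(link_easy, link_hard)]
          + [dh + shift for dh in hard_only])

# Column 7: center the common scale at zero
mean = sum(common) / len(common)   # about 2.30
centered = [d - mean for d in common]
```

The mean of the 23 common-scale difficulties reproduces the 2.30 quoted in the text.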
TABLE 5.9.4
COMPARING COMMON ITEM EQUATING WITH THE
REFERENCE SCALE

   1         2           3            4
 Item     Common     Reference    Difference
 Name      Scale       Scale
            d_C         d_R       d_C - d_R

Standard
Deviation   3.37        3.32         0.15
FIGURE 5.9.2
[Plot of the COMMON ITEM SCALE against the REFERENCE SCALE]
By locating all 23 KCTB items on a single scale we can make the definition of the KCT variable more explicit. These items which now mark out the variable are constructed out of a few basic components: number of taps, number of reverses and overall distance across blocks. It is the way these underlying components evolve along the variable which documents for us what a measure on the KCT variable means. Figure 5.10.1 gives the difficulty level of the KCTB items together with their number of taps, reverses and distances.
FIGURE 5.10.1
DOCUMENTING THE KCT VARIABLE
[Map of the KCTB items along the variable showing, for each item, its number of taps, reverses and distance, together with the positions of the 101 sample persons and their age medians]
The pattern of taps, reverses and distances in Figure 5.10.1 shows how the KCT variable is built out of these basic operations. This provides a substantive, or criterion, reference for the KCT variable. The resulting picture gives us insight into the nature of the variable which reaches beneath the individual items. In particular it shows us how to generate more items at any designated difficulty level.

We can also learn about the KCT variable by seeing how the 101 persons in our sample are distributed along it. In Rows 5 and 6 we show each person's position on the variable by their age in years. This allows us to norm reference the variable with age medians from three to eight years and to give an age distribution of "mature" persons of 9 or more years of age with a mean at 1.3 logits and a standard deviation of 1.9 logits. Thus Figure 5.10.1 becomes a map of the variable which is both criterion and norm referenced.
5.11 ITEM CALIBRATION QUALITY CONTROL
We cannot expect the items in a bank to retain their calibrations indefinitely or to work equally well for every person with whom they may be used. The quality of item calibration must be supervised continuously. This can be done conveniently by a routine examination of the differences between how persons actually respond to particular items and how we expect them to respond given our calibrations of the items and our measurements of the persons. These differences are residuals from expectation. An occasional surprising item residual suggests an anomalous testing situation or a peculiar person. Trends in item residuals, however, may be indicative of item failure. Tendencies for items to run into trouble, to shift difficulty or to be biased for some types of persons can be exposed by a cumulative analysis of item residuals over time, place and person type. Problematic items can then be removed from use or brought up-to-date in difficulty.

The purpose of item quality control is to maintain supervision over item calibration stability against the possible influences of age, sex, education or any other factor which might disturb item functioning. A quality control procedure requires that item usage be accompanied by concomitant educational and demographic information so as to provide a basis for analyzing whether these other variables threaten the stability of item calibration and hence disturb the interpretation of test responses. The discussion which follows builds on the analysis of fit developed in Chapter 4.
To implement item quality control we save from each use of an item i:

1. the observed response x_vi of each person v to the item, and
2. the ability estimate b_v of each such person.

When the two pieces of information x_vi and b_v are combined with the item's bank difficulty d_i we can form a standardized residual z_vi which will retain all the information in this use of item i which bears on the possibility of a disturbance in its functioning,

z_vi = (x_vi - p_vi) / [p_vi (1 - p_vi)]^1/2

where

p_vi = exp(b_v - d_i) / [1 + exp(b_v - d_i)]

is the estimated probability of success for person v on item i and hence the estimated expected value of x_vi given the model.
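The standardized residual developed in Chapter 4 can be sketched in a few lines of modern Python (the function name is ours). For a wrong answer the squared residual reduces algebraically to exp(b - d), which is exactly the z^2 approximation used in Table 5.11.2 below.

```python
import math

def standardized_residual(x, b, d):
    """z_vi for a response x (0 or 1) by a person measured at b
    to an item banked at difficulty d, under the Rasch model."""
    p = math.exp(b - d) / (1 + math.exp(b - d))  # expected value of x
    return (x - p) / math.sqrt(p * (1 - p))

# An unexpected failure: an able person (b = 0.4) misses a very easy
# item (d = -4.3), the first row of Table 5.11.3
z = standardized_residual(0, 0.4, -4.3)
```

For x = 0 the squared residual is p/(1 - p) = exp(b - d), here exp(4.7), about 110.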
Standardized Residuals
Table 5.11.2 gives a summary of the unexpected responses observed in the KCTB data. Column 1 gives the range of absolute difference between person ability and item difficulty. Column 2 expresses this difference as z^2 = exp(|b - d|) and Column 3 converts z^2 to the response improbability 1/(1 + z^2) it implies.
TABLE 5.11.2

Ability-
Difficulty              Improbability   Possible   Expected   Observed    Item     Person
Difference     z^2       1/(1 + z^2)     Count      Count      Count     Names     Names
We have counted the number of item-by-person interactions which could fall within each row of Table 5.11.2 and multiplied this "possible" count by its improbability to estimate the count we might expect if these data fit the model. This was done by multiplying (226) x .01 ≈ 2, (226 + 133) x .02 ≈ (2 + 5) and (226 + 133 + 184) x .05 ≈ (2 + 5 + 20). The actual counts observed in the data are given in Column 6. Thus when (b - d) is over 4.6 logits we expect about two improbable responses and we observe three. When (b - d) is between 3.9 and 4.6 we expect about five improbable responses and again we observe three. Finally when (b - d) is between 2.9 and 3.8 we expect about twenty improbable responses but observe only eight. These data seem to fit the model rather well.
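The expected-count arithmetic above can be reproduced directly from the model (a sketch in modern Python; the band thresholds and possible counts are those quoted in the text).

```python
import math

# (|b - d| threshold, possible count) for the three bands of Table 5.11.2
bands = [(4.6, 226), (3.9, 133), (2.9, 184)]

expected = []
total_possible = 0
for bd, possible in bands:
    improb = 1 / (1 + math.exp(bd))     # improbability of the unexpected response
    total_possible += possible
    # cumulative expectation, as in (226) x .01, (226 + 133) x .02, ...
    expected.append(total_possible * round(improb, 2))
```

Rounding the cumulative expectations gives 2, 7 and 27, matching the 2, (2 + 5) and (2 + 5 + 20) of the text.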
When we scan the 14 most unexpected item and person responses given in Table 5.11.2, we see that they are well dispersed over items and persons. Only Items 3 and 7 and Persons 49M and 95M appear twice and the sexes are equally represented. We must conclude that no clear sign of systematic misfit has been detected in these data.
TABLE 5.11.3
THE SIX MOST UNEXPECTED RESPONSES
ON KCTB
(101 Persons By 23 Items)
Person                                                        Person
Ability    Item Responses (0 = unexpected failure,            Characteristics
   b        with b - d in parentheses)

  0.4      0 (4.7)*  1   1   1   1   1                        49 M   16+   12+
  1.4      1   0 (5.5)   1   1   1   1                        68 F   16+   12+
  1.9      1   1   0 (4.5)   1   1   1                        79 M   16+   12+
  2.4      1   1   1   1   0 (4.5)   1                        83 F   16+   12+
  3.6      1   1   1   0 (5.7)   1   1                        93 F   16+   12+
  4.3      1   1   1   1   1   0 (4.4)                        95 M   16+   12+

*(b - d) = (0.4) - (-4.3) = 4.7
The difficulty characteristics of the items in reverses and distance show the increase we would expect as the items become more difficult. All six items are on the easy end of the variable. The six persons, on the other hand, are all relatively able adults. This suggests that, if a systematic source of misfit has been detected here, it could only be a slight tendency towards carelessness, or lapses of attention, among some older persons working on items rather too easy for them.
Fit analysis matrices, like Table 5.11.3, which bring together the person and item characteristics of the most unexpected responses, are convenient for supervising the quality of item functioning. These matrices identify and suggest corrections for the systematic sources of item failure shown in the data.

This crude method of fit analysis consists of identifying and calculating only the few largest z^2's observed on an item and then adding to them a 1 for each other person taking that item. This assumes that all of the disturbance observed in that item is due to its outstanding residuals and that the rest of the pattern is more or less as expected.
Table 5.11.4 gives an illustration of this method. There we have taken from Table 5.11.3 just the single largest z_vi^2 observed in our KCTB data and added to it a 1 for each other person taking that item, in this case 100. This gives z_vi^2 + 100 = X^2 as the chi-square for that item and v = X^2/100 as the item mean square.

To see whether this crude method can be useful, we will compare it with UCON and the hand method described in Chapter 4, but now applied to these KCTB data. In those procedures we sum all 101 actual z^2's to make our item fit analysis and then divide this sum of squares by its 100 degrees of freedom to get the mean squares shown in Table 5.11.5.
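The crude mean square described above takes one line to compute (a sketch in modern Python; the function name is ours, and 101 persons is the KCTB sample size).

```python
import math

def crude_mean_square(largest_z2, n_persons=101):
    """Crude item fit: the largest squared standardized residual
    plus a 1 for each of the other (n_persons - 1) persons, divided
    by the (n_persons - 1) degrees of freedom."""
    chi_square = largest_z2 + (n_persons - 1)
    return chi_square / (n_persons - 1)
```

With no surprising residual at all the mean square is one, as the model expects; the most surprising KCTB response, with z^2 near exp(5.7), gives a mean square near four.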
TABLE 5.11.4
CRUDE FIT ANALYSIS FOR SIX KCTB ITEMS

TABLE 5.11.5
A COMPARISON OF ITEM QUALITY CONTROL METHODS
APPLIED TO KCTB
The UCON and hand fit methods approximate one another rather closely. Although the crude fit mean squares are somewhat larger in magnitude, their order is identical to the other methods and their values are sufficiently close to get a clear idea concerning the relative fit of these six items. Table 5.11.5 suggests that the crude method can be useful for the quick analysis of item functioning.
5.12 QUICK NORMS

While norms are no more fundamental to the calibration of item banks than are distributions of person heights to the ruling of yardsticks, it is usually useful to know various demographic characteristics of a variable defined by an item bank. Some of these demographic characteristics may even have normative implications under particular circumstances. Because of a shift in emphasis, norming a variable in the Rasch approach takes much less data than norming a test. We need only use enough items to estimate the desired "norming" statistics. Once the variable is normed, then all possible scores from all possible tests drawn from the calibrated bank are automatically norm-referenced through the variable.
Often we are satisfied with a mean and standard deviation for each cell in our normative sampling plan. These two statistics could be estimated from a random sample of 100 or so persons taking a norming test of only two items. Of course, a somewhat longer test of 10 or 15 items will do a better job. Not only will the estimates be better but the extra items will yield standard errors around the norming statistics and thus a test of fit for the plausibility of the data. More than 15 items in a norming test, however, will seldom be necessary. This means that we could norm six different variables simultaneously by allocating 15 items to each of six subtests administered as one 90-item composite test.
We can estimate quick norms from frequency data on bank calibrated items without scoring or measuring the individual persons. This may be useful when trimming sample data is undesirable. If we seek a probability sample from a population, for example, we would rather not distort the sample's status by eliminating some of the persons sampled because they earned zero or perfect scores.

This norming procedure can be accomplished by working directly from the model and the observed number of right answers to each calibrated item.
1. For each sampling cell in the norming study, select from the item bank a suitable set of K calibrated items sufficiently spaced in difficulty d_i to cover the expected ability dispersion of that particular sampling cell. Note that each sampling cell, in principle, has its own individually tailored norming test.

2. Administer this test of K items to a random sample of N persons from the specified cell.

3. Observe the number of persons s_i succeeding on each item.

4. Calculate the natural log odds h_i of these correct answers s_i for each item

   h_i = ln[s_i / (N - s_i)]

5. Regress these log odds h_i on the associated item difficulties d_i over the K items to obtain the intercept A and slope C of the least squares straight line.

6. Estimate the population mean M and standard deviation SD of that cell's abilities as

   M = -A/C   [5.12.2]

   SD = 1.7[(1 - C^2)/C^2]^1/2
We will apply these procedures to the KCTB sample of 101 persons to see how well they recover the sample mean and standard deviation that we have already estimated from the measurements of each of the 101 persons to be M' = 0.19 and SD' = 2.44.
We have the item difficulties d_i for Items 3 through 25 and so need only to compute the natural log odds h_i of observed correct answers to each of these items. The values of h_i are given in Column 3 of Table 5.12.1 with the corresponding item difficulties in Column 4.

Regressing these log odds right answers on the item difficulties over the 23 items gives us an intercept of A = 0.07 and a slope of C = -0.56.

   M = -A/C
     = -0.07 / -0.56
     = 0.13.
TABLE 5.12.1
KCTB LOG ODDS CORRECT ANSWERS
AND ITEM DIFFICULTIES

   1        2          3          4
 Item    Number    Log Odds   Difficulty
 Name    Correct      h_i        d_i
           s_i

   3       98        3.49      -6.20
   4       91        2.21      -4.11
   5       82        1.46      -2.58
   6       83        1.53      -2.76
   7       92        2.32      -4.34
   8       82        1.46      -2.58
   9       78        1.22      -2.06
  10       78        1.22      -2.06
  11       68        0.72      -1.03
  12       57        0.26      -0.12
  13       66        0.63      -0.85
  14       62        0.46      -0.52
  15       73        0.96      -1.51
  16       65        0.59      -0.77
  17       30       -0.86       1.93
  18       37       -0.55       1.36
  19       29       -0.91       2.01
  20       20       -1.40       2.88
  21       16       -1.67       3.33
  22       16       -1.67       3.33
  23        8       -2.45       4.52
  24        2       -3.90       6.27
  25        3       -3.49       5.81
   SD = 1.7[(1 - C^2)/C^2]^1/2
      = 2.54.

These quick norm regression estimates of 0.13 for the mean and 2.54 for the standard deviation compare satisfactorily with the values of 0.19 and 2.44 computed by measuring each of the 101 persons and then calculating their mean and standard deviation in the usual way.
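The whole quick-norm computation can be run end to end from the counts and difficulties of Table 5.12.1 (a sketch in modern Python; variable names are ours):

```python
import math

# (number right s_i out of N = 101, bank difficulty d_i), Items 3-25
data = [(98, -6.20), (91, -4.11), (82, -2.58), (83, -2.76), (92, -4.34),
        (82, -2.58), (78, -2.06), (78, -2.06), (68, -1.03), (57, -0.12),
        (66, -0.85), (62, -0.52), (73, -1.51), (65, -0.77), (30, 1.93),
        (37, 1.36), (29, 2.01), (20, 2.88), (16, 3.33), (16, 3.33),
        (8, 4.52), (2, 6.27), (3, 5.81)]
N = 101

# Step 4: natural log odds of a correct answer on each item
h = [math.log(s / (N - s)) for s, _ in data]
d = [di for _, di in data]

# Step 5: least-squares regression of h on d
K = len(data)
db, hb = sum(d) / K, sum(h) / K
C = (sum((x - db) * (y - hb) for x, y in zip(d, h))
     / sum((x - db) ** 2 for x in d))
A = hb - C * db

# Step 6: quick norm estimates of the cell mean and standard deviation
M = -A / C
SD = 1.7 * math.sqrt((1 - C * C) / (C * C))
```

The regression recovers an intercept near 0.07 and a slope near -0.56, and hence a mean near 0.13 and a standard deviation near 2.5, in line with the worked example.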
The plot of the log odds correct answers h_i against the item difficulties d_i in Figure 5.12.1 shows how well these norming data fit the straight line expected by the model.
FIGURE 5.12.1
[Plot of LOG ODDS CORRECT ANSWERS against ITEM DIFFICULTY; Intercept = 0.07, Slope = -0.56]
6 DESIGNING TESTS

6.1 INTRODUCTION
Sometimes we have explicit prior knowledge about our target. We, or others, have measured it before and so we can suggest its probable location and dispersion directly in terms of these prior measures on the variable and their standard errors. Sometimes we can use items calibrated along the variable, some of which we believe are probably just right for the target, some of which are nearly too hard and some of which are nearly too easy. Then we can take from the difficulties of these reference items rough indications of the probable center and boundaries of our target.

One way or another we assemble and clarify our suppositions about our target as well as we can so that we can derive from them the test design which has the best chance of most increasing our knowledge.

Obviously if we knew everything we wanted to know about our target, then we would not have to measure it in the first place. However, no matter how little we know, we always have some idea of where our target is. Being as clear as possible about that prior knowledge is essential for the design of the best possible test.
Graham A. Douglas collaborated in the preparation o f parts o f this chapter. See Wright and Douglas,
1975a.
then we can describe a target G by the expression G(M,S,D) and we can summarize our prior knowledge, and hence our measurement requirements for any target we wish to measure, by guessing, as well as we can, values for the three target parameters M, S and D.

If we can specify boundaries within which we feel fairly sure that the person will be found, we can set S so that M ± kS defines these boundaries. Then, even if we have no clear idea at all about the distribution D of our uncertainty between these boundaries, we can nevertheless expect that at least (1 - 1/k^2) of the possible measures will fall within M ± kS.

If we go further and expect that the measures we think possible for the person will pile up near M, then we may even be willing to take a normal distribution as a useful way to describe the shape of our uncertainty. In that case we can expect .95 of the possible measures to fall within M ± 2S and virtually all of them to fall within M ± 3S.
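The two coverage claims above are easy to check numerically (a sketch in modern Python; the function names are ours). The first is the Tchebycheff inequality, which holds for any distribution; the second is the normal coverage.

```python
import math

def tchebycheff_coverage(k):
    """Guaranteed minimum coverage of M +/- kS for any target distribution."""
    return 1 - 1 / k**2

def normal_coverage(k):
    """Coverage of M +/- kS when the target is normal."""
    return math.erf(k / math.sqrt(2))
```

At k = 2 any target holds at least .75 of the possible measures, while a normal target holds about .95; at k = 3 the normal target holds virtually all of them.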
We will refer to these two target distributions as the Tchebycheff interval and the normal. We might consider other target distributions, but these two seem to cover all reasonable target shapes rather well. For example, if we feel unhappy about thinking of our target as approximately normal, then it is unlikely that we will have any definite alternative clearly in mind. Thus, the most likely alternative to a normal target is one of unknown distribution, best captured by a Tchebycheff interval. This realization that all possible target shapes can be satisfactorily represented by just two reasonable alternatives is important because it makes a unique solution to the problem of best test design not only possible but even practical.

If the target is a group rather than an individual, then we may take S and D to be our best guess as to the standard deviation and distribution of that group. If we think the group has a more or less normal distribution, then we will take that as our best guess for D. Otherwise we can always fall back on the Tchebycheff interval.
Finally, we must be explicit about how precise we want our measurement to be. After all, this is our motive for measuring. It is just because our present knowledge about our target is too approximate to suit us that we want to know more precisely where our target is and, if it is a group rather than an individual, more precisely about its dispersion. However, whether the target is an individual or a group, our decision about the desired standard error of measurement SEM will be made in terms of individuals, for that, in the end, is what we actually measure.

In the case of a one-person target, we want the SEM to be enough smaller than S to reward our measurement efforts with a useful increase in the precision of our knowledge about where that target person is located. In the case of a group target we want to achieve an improved estimate not only of M, the center of the group, but also of S, its dispersion. The observable variance of measures over the group estimates not only the underlying variance in ability S^2 but also the measurement error variance SEM^2. Our ability to see the dispersion of our target against the background of measurement error depends on our ability to distinguish between these two components of variance. Since they enter into the observable variance of estimated measures equally, the smaller SEM^2 is with respect to S^2, the more clearly we can identify and estimate S^2, the component due to target dispersion. Thus, for all targets we seek an SEM considerably smaller than S.
6.3 THE MEASURING TEST

FIGURE 6.3.1
[Test operating curve: relative score f = r/L against ability measure b, showing the least measurable difference LMD = (db/df)LOD and the least observable difference LOD = 1/L]
In Figure 6.3.1 we can see from the shape of the test operating curve that its two outstanding features are its position along the variable, which we will call test height, and the range of abilities over which the test can measure more or less accurately, a characteristic caused primarily by the dispersion of the item difficulties, which we will call test width.

But height and width do not complete the characterization of a test. When we look more closely at the way the test curve transforms observed scores into inferred measures we see that there is a discontinuity in observable scores which is going to determine the smallest increment in ability we can measure with any particular test. This least measurable difference LMD depends on the test's least observable difference LOD. Since the least change possible in a test score is one, the LOD in relative score f = r/L must be 1/L. In Section 6.5 we will find that the standard error of measurement, or least believable difference, SEM also depends on the number of items in the test. Indeed SEM = LMD^1/2. So in order to finish characterizing a test we must also specify its length.
From this we see that any test design can be defined more or less completely just by specifying the three test characteristics: height, width and length. If we let H, W and L stand for these three characteristics, we can write a test design as T(H,W,L).

In the practical application of best test design, however, we will have to approximate our best design T for a target G from a finite pool of existing items. In order to discriminate in our thinking between the best test design T(H,W,L) and its approximate realization in practice, we will describe an actual test as t(h,w,L), where h and w are the height and width estimated from the difficulties of the actual items.
6.4 THE SHAPE OF A BEST TEST

A best test is one which measures best in the region within which measurements are expected to occur.* Measuring best means measuring most precisely. A best test design T(H,W,L) is one with the smallest error of measurement SEM over the target G(M,S,D) for given length L (or, what is equivalent, with the smallest L for a given value of SEM). "Over the target" implies the minimization of a distribution of possible SEMs. Thus, a position with respect to the most likely target distribution must be taken before the minimization of SEM can proceed.

We bring the profusion of possible target shapes under control by focusing on the two extremes, interval and normal. How shall minimization be specified in each case? For a normal target it seems reasonable to maximize average precision, that is, to minimize average SEM, over the whole target.

When we derive the SEM^2 from our response model we will discover that it is the reciprocal of the information about ability supplied by each item response averaged over the test. Since the most informative items are those nearest the ability being measured

*Attempts to meet this requirement have been made by Birnbaum (1968, pp. 465-471). Our ideas are consistent with his efforts, but we have taken them to their logical and practical conclusion.
and the least informative are those farthest away, the precision over the target will depend not only on the distribution of the target but also on the shape of the test. Thus, the question of what is a best test also depends on our taking a position with respect to the best distribution of test item difficulties.

What are the reasonable possibilities? If we want to measure a normal target, then a test made up of normally distributed item difficulties ought to produce the best maximization of precision over the target. This is the conclusion implied in Birnbaum's analysis of information maximization (Birnbaum, 1968, p. 467).

However, normal tests are clumsy to compose. Normal order statistics can be used to define a set of item difficulties, but this is tedious. More problematic is the odd conception of measuring implied by an instrument composed of normally distributed measuring elements. A normal test would be like a yardstick with rulings bunched in the middle and spread at the ends. Measuring with such an irregularly ruled yardstick would be awkward. In the long run, even for normal targets, our interest becomes spread out evenly over all the abilities which might be measured by a test. Equally spaced items are the test shape which serves that interest best. That is the way we construct yardsticks. The test design corresponding to an evenly ruled yardstick is the uniform test in which items are evenly spaced from easiest to hardest (Birnbaum, 1968, p. 466).

Two target distributions, normal and interval, and two test shapes, normal and uniform, produce four possible combinations of target and test. Wright and Douglas (1975a) investigated all four combinations rather extensively and found the normal test to work best on the normal target and the uniform test to work best on the interval target. When they compared the normal and uniform tests on normal targets, however, these two test shapes differed so little in their measuring precision as to appear equivalent for all practical purposes. Thus the best all purpose test shape is the uniform test.
6.5 THE STANDARD ERROR OF MEASUREMENT

The probability of success for a person at ability b_f on an item at difficulty d_i is

p_fi = exp(b_f - d_i) / [1 + exp(b_f - d_i)]   [6.5.1]

The measure b_f is estimated from a test of length L with items {d_i} for i = 1, L through the equation (for details see Sections 1.5 and 3.7)

f = Σ_i p_fi / L,   for f = 1/L, ..., (L - 1)/L   [6.5.2]

with error variance

SEM_f^2 = 1 / Σ_i p_fi (1 - p_fi)   [6.5.3]
We see that SEM_f depends on the sum of p_fi (1 - p_fi) over i. Thus it is a function of b_f and all the d_i. However, fluctuations in p(1 - p) are rather mild for p between 0.2 and 0.8. To expedite insight into the make-up of SEM_f we can reformulate it so that the average value of p_fi (1 - p_fi) over i is one component and test length L is the other,

SEM_f^2 = C_f / L   [6.5.4]

in which

C_f = [Σ_i p_fi (1 - p_fi) / L]^-1

In this expression we factor test length L out of SEM in order to find a length-free error coefficient C_f.
Resuming our study of the operating curve of a test given in Figure 6.3.1, we see that the least measurable difference in ability LMD is (db/df)LOD. Since the least observable increment in relative score is 1/L, all we need to complete the formulation of the LMD is the derivative of b with respect to f, which from Equations 6.5.1 and 6.5.2 is

db/df = [Σ_i p_fi (1 - p_fi) / L]^-1   [6.5.5]

But this is our error coefficient C_f, thus the least measurable difference at relative score f is

LMD_f = C_f / L   [6.5.6]
and

SEM_f = (C_f / L)^1/2

With SEM_f in this form we note that, as far as test shape is concerned, it is C_f which requires minimization. This will be true whether we use C_min to minimize SEM_f given L or to minimize L given SEM_f.
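Equations 6.5.1 through 6.5.6 can be put together in a short sketch (modern Python notation; function names are ours):

```python
import math

def p_success(b, d):
    # Equation 6.5.1: Rasch probability of success
    return math.exp(b - d) / (1 + math.exp(b - d))

def error_coefficient(b, difficulties):
    # Equation 6.5.4: C_f = [sum p(1 - p) / L]^-1
    L = len(difficulties)
    info = sum(p_success(b, d) * (1 - p_success(b, d)) for d in difficulties)
    return L / info

def sem(b, difficulties):
    # SEM_f = (C_f / L)^1/2, so SEM_f^2 = LMD_f = C_f / L
    L = len(difficulties)
    return math.sqrt(error_coefficient(b, difficulties) / L)
```

For instance, a 25-item test with every item right at the person's ability has C_f = 4 and SEM = (4/25)^1/2 = 0.4 logits.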
6.6 THE ERROR COEFFICIENT

Now we need to know more about this error coefficient C_f. The essential ingredient of C_f is the expression p_fi (1 - p_fi). This is the information I_fi on b_f contained in a response to item i with difficulty d_i (Birnbaum, 1968, pp. 460-68). Its average value over the L items of the test is the reciprocal of C_f.
What values can we expect C_f to take? We can approach this question in two ways: in terms of the influence of reasonable values of (b_f - d_i) on p_fi and, for uniform tests, in terms of test width W and the boundary probabilities p_f1 for i = 1, the easiest item, and p_fL for i = L, the hardest item. The probability p_fi is defined in Equation 6.5.1.

Beginning with reasonable values of (b_f - d_i), we see that when b_f = d_i and their difference is zero, then p_fi = 1/2, p_fi (1 - p_fi) = 1/4 and C_f = 4, but when (b_f - d_i) = -2 then p_fi = 1/8, p_fi (1 - p_fi) = 1/9 and C_f = 9. (Notice that C_f = 9 when (b_f - d_i) = +2 and p_fi = 7/8 also.) Since an average can never be greater than its maximum element nor less than its minimum, we can use these figures as bounds for C_f.
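The figures above can be verified directly (a sketch in modern Python; the function name is ours). Note that the book's value of 9 at |b - d| = 2 uses the rounded p = 1/8; the exact logistic value is a little above 9.5.

```python
import math

def cf_single(bd):
    """Error coefficient when every item sits at distance bd = b - d
    from the person: C = 1 / [p(1 - p)]."""
    p = math.exp(bd) / (1 + math.exp(bd))
    return 1 / (p * (1 - p))

# At b = d: p = 1/2 exactly, so C_f = 4 exactly.
# At |b - d| = 2: p is near 7/8 and C_f is near the book's rounded 9.
```

The coefficient is symmetric in the sign of (b - d), which is why the +2 and -2 cases agree.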
Turning to the bounds we can derive for C_f from the test width W and the boundary probabilities p_f1 and p_fL of a uniform test, we can use an expression for C_f given W derived in Wright and Douglas, 1975a (also Birnbaum, 1968, p. 466),

C_fW = W / (p_f1 - p_fL)

where p_f1 and p_fL are the probabilities of success on the easiest and hardest items in the test.

When b_f is contained within the difficulty boundaries of the test, and W is greater than 4, then 1/2 < (p_f1 - p_fL) < 1, so C_fW must fall between W and 2W, that is

W < C_fW < 2W,   for W > 4

and SEM_f is bounded by

(W/L)^1/2 < SEM_f < (2W/L)^1/2
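These bounds can be checked numerically for uniform tests of various widths (a sketch in modern Python; function names and the 50-item test length are ours):

```python
import math

def p(b, d):
    return math.exp(b - d) / (1 + math.exp(b - d))

def cf_uniform(b, W, L=50):
    """Error coefficient for a uniform test of width W centred on b."""
    step = W / (L - 1)
    ds = [b - W / 2 + i * step for i in range(L)]
    info = sum(p(b, d) * (1 - p(b, d)) for d in ds)
    return L / info

# For W > 4 and the person inside the test, C_f stays between W and 2W,
# so SEM = (C_f / L)^1/2 stays between (W/L)^1/2 and (2W/L)^1/2.
```

Running the check for widths of 5, 6, 8 and 10 logits keeps C_f comfortably inside the (W, 2W) band.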
6.7 THE DESIGN OF A BEST TEST

For best test design on either interval or normal targets we select a set of equivalent items (where W = 0) or a set of uniform items with the W indicated in Table 6.7.1. Table 6.7.1 gives optimal uniform test widths for normal and interval targets. For example, if the target is thought to be approximately normal with presumed standard deviation S = 1.5, the optimum test width W is 4. If, however, the target is more uniform in shape then the optimum width could be as large as 8. Note that for any value of S a smaller W is always indicated when a normal "bunched up" target shape is expected.

Table 6.7.1 also shows the efficiency of a simple rule for relating test width W to target dispersion S. The rule W = 4S comes close to the optimum W for narrow interval targets and for wide normal targets. When we are vague about where our target is we are also vague about its boundaries. That is just the situation where we would be willing to use a normal distribution as the shape of our target uncertainty. When our target is narrow, however, that is the time when we are rather sure of our target boundaries but, perhaps, not so willing to specify our expectations as to its precise distribution within these narrow boundaries. To the extent that interval shapes are natural for narrow targets while normal shapes are inevitable for wide targets, W = 4S is a useful simple rule.

The efficiency of this simple rule for normal and interval targets is given in the final columns of Table 6.7.1. There we see that its efficiency is hardly ever less than 90 per cent. If we cross over from an interval target to a normal target as our expected target dispersion exceeds 1.4, then the efficiency is never less than 95 per cent. This means, for example, that a simple rule test of 20 items is never less precise than an optimum test of 19 items.
Our investigations have shown that given a target M, S and D there exists an optimum test design H and W from which we may generate a unique set of L uniformly distributed item parameters {δ_i}. However, this design is an idealization and cannot be perfected in practice. Real item banks are finite and each item difficulty is only an estimate of its corresponding parameter and hence inevitably subject to calibration error. We will never be able to select the exact items stipulated by the best test design {δ_i}. Instead we must attempt to select among the items available a real set of {d_i} which comes as close as possible to our ideal design {δ_i}.

Thus parallel to the design specification T(H,W,L) we must write the test description t(h,w,L) characterizing the actual test {d_i} which we can construct in practice. This raises the problem of estimating h and w.

The estimated test height h can be determined by the average estimated difficulties of the test items

h = Σ_i d_i / L = d̄   [6.7.1]
FIGURE 6.7.1
[Target distribution (relative frequency p over width 4S) and test operating curve (relative score f = r/L against ability measure b)]
The estimated test width w can be determined from the range of these estimated difficulties, or perhaps a bit more precisely from an estimate of this range based on the two easiest items d_1 and d_2, and the two hardest, d_(L-1) and d_L,

w = [(d_L + d_(L-1) - d_2 - d_1) / 2] [L / (L - 2)]   [6.7.2]
                                TABLE 6.7.1

 TARGET                                                SIMPLE
STD. DEV.   NORMAL TARGET       INTERVAL TARGET         RULE*      EFFICIENCY**
            error minimized     error minimized
    S       over N(M, S^2)        at (M +/- 2S)        W = 4S    Normal  Interval

   .5              0                    0                 2        94       97
   .6              0                    0                 2
   .7              0                    2                 3        90      100
   .8              0                    3                 3
   .9              0                    4                 4
  1.0              0                    5                 4        89       98
  1.1              0                    6                 4
  1.2              1                    6                 5        92       96
  1.3              2                    7                 5
  1.4              3                    7                 6
  1.5              4                    8                 6        96       91
  1.6              5                    9                 6
  1.8              6                   10                 7        98       87
  2.0              8                   11                 8        99       84

 *This Simple Rule is conservative for narrow targets and more practical since
  available items are bound to spread some. It is also close to the normal target
  optimum for wide targets, which is reasonable in the face of substantial target
  uncertainty.
5. We select items d_i from our item bank such that they best approximate the set {delta_i} by minimizing the discrepancies (d_i - delta_i).

6. We calculate h = SUM_i d_i / L = d-bar.
TABLE 6.8.1
7.1 USING A VARIABLE TO MAKE MEASURES
This chapter is about turning test scores into measures. But before we show how to do this in Sections 7.2 and 7.3, we will review how the test items defining a variable can be used to make measures.

To make a measure we collect and combine a series of observed responses in such a way that they support an inference as to the position of the person on a variable. We summarize these observations into a score, and this score is used to imply the measure of the person on the variable. The variable itself, however, is an idea and not a direct experience. Its nature can only be inferred from relevant samples of carefully selected observations.
A calibrated item bank provides a resource from which subsets of items can be selected to form specifically designed tests with optimal characteristics. Scores on these tests, although stemming from different combinations of "correct" responses to different selections of items, can nevertheless be converted through the bank calibrations into comparable measures. Procedures for obtaining comparable measures for individualized tests are given in Sections 7.4 to 7.7.
To validate these measures, however, we must assess the extent to which the persons in question have taken the items in the way we intended them to be taken. The item calibrations in the bank come from occasions on which many persons were found to respond to these items in a particular consistent way. This is the context in which the item calibrations gained their meaning. The meaning these calibrations now convey depends on how the new persons being measured are found to respond to the items. The validity of their measures depends on the presence of acceptable relations between what we actually observe and what we expect to observe according to our measurement model and our item calibrations. Thus before we can accept any measure as valid, we must examine the plausibility of the pattern of responses on which that measure is based. The procedure for accomplishing the analysis of person fit necessary to establish measure validity is given in Sections 7.8 and 7.9. In these sections we show how to detect person misfit and what various kinds of misfit look like.
Whenever misfit is identified, the next step is to deal with the measurement quality control problem this misfit causes. If we can identify the circumstances leading to the misfit, we may be able to extract from the flawed response record a measure which the observed pattern of responses can sustain. We show how to do this in Section 7.10.
When a person takes a test, the resulting observation of the person is their test score. To see how to get from this test score r to the estimated measure b which it implies, we refer to the measurement model

    pi_i = exp(beta - delta_i) / [1 + exp(beta - delta_i)]                [7.2.1]

which specifies how item calibration delta_i and person measure beta are implied by the person's observed response x_i. The model implies that for each response of a person to an item we "expect" an intermediate "probable" value which is neither x_i = 1 for a correct response nor x_i = 0 for an incorrect response, but somewhere in between them. This "expected" value is the probability pi_i given in Equation 7.2.1 that x_i = 1, and it works just like our expectation that fair coins fall half the time heads. Since we "expect" a value on each coin toss which is half the time heads and half the time tails, even though what happens can only be one or the other, our expected value for a particular toss is neither 0 nor 1, but halfway between at pi = 1/2.

The expected value of each response is

    E{x_i} = pi_i ,

the model probability of a correct answer to item i.
Since the test score r = SUM_i x_i is the sum of the item responses, the expected value of r is the sum of their expectations,

    E{r} = SUM_i E{x_i} = SUM_i pi_i .
If we now substitute for pi_i the measure b_r to be estimated for beta on the basis of score r and the estimated calibrations {d_i} for {delta_i}, we have an estimation equation which relates r and b_r as follows

    r = SUM_i exp(b_r - d_i) / [1 + exp(b_r - d_i)]                [7.2.2]

From this equation, a person's score r and the calibrations {d_i} of the items taken, we can determine the measure b_r which they imply.
One way to solve Equation 7.2.2 is to use the UCON procedure described in Chapter 3. The UCON estimated measure is obtained by performing j = 1, m iterations of

    b_r^(j+1) = b_r^j + [r - SUM_i pi_i^j] / SUM_i pi_i^j (1 - pi_i^j),
        pi_i^j = exp(b_r^j - d_i) / [1 + exp(b_r^j - d_i)]                [7.2.3]

in which

    b_r^0 = ln [r / (L - r)] .

When the convergence criterion is reached, then the estimated measure is the last value of b_r, namely

    b_r = b_r^(j+1)                [7.2.5]
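For a single person, Equation 7.2.2 can be solved for b_r by Newton-Raphson iteration from the starting value ln[r/(L-r)]. A minimal sketch in Python (our own illustration, not the full UCON program, which estimates all persons and items jointly):

```python
import math

def measure_from_score(r, difficulties, tol=1e-6, max_iter=50):
    """Solve r = sum_i exp(b - d_i)/(1 + exp(b - d_i)) for b  [7.2.2]."""
    L = len(difficulties)
    b = math.log(r / (L - r))                 # starting value: ln[r/(L-r)]
    for _ in range(max_iter):
        p = [1 / (1 + math.exp(d - b)) for d in difficulties]
        info = sum(q * (1 - q) for q in p)    # model variance of the score
        step = (r - sum(p)) / info            # Newton-Raphson correction
        b += step
        if abs(step) < tol:
            break
    return b, 1 / math.sqrt(info)             # measure and its standard error

# on a symmetric test a central score should give a central measure
b, s = measure_from_score(2, [-1.5, -0.5, 0.5, 1.5])
```

For the symmetric four-item test shown, a score of 2 converges to b = 0 immediately, as it should.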
A quicker alternative is the PROX procedure, which estimates the measure directly as

    b_r = h + [1 + (s_d^2 / 2.89)]^(1/2) ln [r / (L - r)]                [7.2.7]

in which

    h = SUM_i d_i / L = d-bar

and has standard error

    s_r = (1 + s_d^2 / 2.89)^(1/2) [L / r(L - r)]^(1/2) .                [7.2.8]
Since it is often the case that the d_i's of a sample of new items approximate a normal distribution and since normal samples of persons are typical, PROX is often useful for calibrating new items. In making measures, however, we can take advantage of already calibrated items and spread them uniformly, d_i ~ U(H, W), over the range of ability to be measured. Such a uniform test can be described completely by its height H, width W, and length L. Its measures can be calculated efficiently by the UFORM procedure described in Section 7.3.
    b_f = h + w(f - 0.5) + ln (A/B)                [7.2.9]

where

    A = 1 - exp(-wf)
    B = 1 - exp[-w(1 - f)]
    h = SUM_i d_i / L = d-bar

and

    f = r/L

is the relative score on the L item test (Wright and Douglas, 1975a, 21-23). The standard error is

    s_f = [(w/L)(C/AB)]^(1/2)                [7.2.10]

where

    B = 1 - exp[-w(1 - f)]
    C = 1 - exp(-w) .
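Equations 7.2.9 and 7.2.10 can also be evaluated directly instead of through the tables. A sketch in Python (the function name is ours):

```python
import math

def uform(r, h, w, L):
    """Measure and standard error from score r on a uniform test of
    height h, width w and length L  [7.2.9, 7.2.10]."""
    f = r / L                          # relative score
    A = 1 - math.exp(-w * f)
    B = 1 - math.exp(-w * (1 - f))
    C = 1 - math.exp(-w)
    b = h + w * (f - 0.5) + math.log(A / B)
    s = math.sqrt((w / L) * C / (A * B))
    return b, s

# the Section 7.3 example: h = 0, w = 11, L = 23, score r = 10
b, s = uform(10, 0, 11, 23)
```

For the Section 7.3 example this gives b near -0.7, within rounding of the tabled x_fw of -0.8 (the tables round w and x), and reproduces the tabled standard error of 0.7.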
To illustrate the use of these procedures we have chosen nine persons from our KCTB sample of 101. Three of these persons are at the preschool level, three are at the primary level and three are adults.

In Columns 2 through 4 of Table 7.2.1 we give the sex, age and grade of these nine persons. Column 5 contains their KCTB scores. Their corresponding UCON abilities are given in Column 6.
                              TABLE 7.2.1

              (1)           (2)    (3)       (4)       (5)      (6)
 Ability    Person                Age in   School     KCTB     UCON
 Group      Name            Sex   Years    Grade      Score   Ability

            3M               M      3     Preschool     1      -5.8
 Preschool  6F               F      5     Preschool     3      -3.9
            12M              M      4     Preschool     5      -2.8

            29M              M      6         1        10      -0.9
 Primary    35F              F      9         4        11      -0.5
            69M              M      8         4        15       1.4

            88M              M     17+       12+       18       3.0
 Adult      98F              F     16        11        20       4.3
            101F             F     17+       12+       21       5.2
MAKING MEASURES
Prior to item calibration our only knowledge of item difficulties comes from our general concept of the variable which the items are supposed to define. We do not know the actual distribution of these items along their variable. Once we have calibrated items, however, as with the KCTB, then we have a detailed picture of where these items are located. As a result we can use specially selected subsets of these calibrated items to expedite measurement.

These specially designed or "tailored" tests will vary in length, in difficulty level and in range of ability covered depending on the measurement target. Estimating measures from such subsets of items can be done efficiently because we can construct the distribution of item difficulties to suit our purpose. In particular, if we want to optimize the efficiency of our designed tests, we will construct them so that the items are uniformly spaced in difficulty over their measurement target. This makes the estimation of measures from scores on these tests entirely manageable by the simple UFORM procedure.
To use Tables 7.3.1 and 7.3.2 (or Tables A and B) we need the approximate width w of the test and the person's relative score f = r/L. Together they determine the person's relative ability x_fw and its corresponding error coefficient C_fw. When we combine this information with test height h and test length L, we get the measure b_fw = h + x_fw and its standard error s_fw = C_fw / L^(1/2).

In order to use Tables 7.3.1 and 7.3.2 for a particular test, we need estimates of that test's basic characteristics H, W and L. Test length L is self-evident. Test height H is estimated from the average difficulty level of the test's items, namely h = SUM_i d_i / L = d-bar. The estimation of test width W, however, can be problematic when an irregular distribution of item difficulties at the extremes of the test cannot be avoided.
    w_1 = (d_L - d_1) [L / (L - 1)]

or

    w_2 = [(d_L + d_(L-1) - d_2 - d_1) / 2] [L / (L - 2)]

The method we have found best in practice is w_2, the one based on the average difference between the two easiest and the two hardest items. This procedure for estimating test width is illustrated in Table 7.3.3 where we calculate w for five forms of the KCTB.
                TABLE 7.3.1

    Test Length:    L
    Relative Score: f = r/L
    Test Height:    h = SUM_i d_i / L
    Test Width:     w = [(d_L + d_(L-1) - d_2 - d_1) / 2] [L / (L - 2)]
    Measure:        b_f = h + x_fw

*Item 3 at -6.2 is 2 logits below the more or less uniform stream of 22 items from Item 7 at -4.3 through Item 24 at 6.3. UFORM is more accurate with this kind of extreme non-uniformity when test width is calculated without the very irregular extreme item.
The first row of Table 7.3.3 concerns the 23 items in the KCTB "item bank." From these 23 items we have composed three narrow-range test forms focused on three ability levels: a Preschool Form of 8 items, a Primary Form of 15 items and an Adult Form of 15 items, and also one wide-range Pilot Form of 7 items. The calibrations for the two hardest and two easiest items for each of these test forms are given in Table 7.3.3. With these calibrations we can estimate the various test widths, using the w_2 method to calculate w' and rounding the w' computed to the nearest integer for the value of w used in tables like 7.3.1 and 7.3.2.
    b = h + X ln [r / (L - r)]
      = 0 + 2.2 ln [10/13]
      = -0.6

    s = X [L / r(L - r)]^(1/2)
      = 2.2 [23 / 10(13)]^(1/2)
      = 0.9

The value of the expansion factor X comes from the variance of item difficulty s_d^2 = 11.0 as

    X = (1 + s_d^2 / 2.89)^(1/2)
      = (1 + 11 / 2.89)^(1/2)
      = 2.2 .
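This PROX computation is easily reproduced. A sketch in Python of Equations 7.2.7 and 7.2.8, with the function name ours:

```python
import math

def prox(r, L, h, sd2):
    """PROX measure and error from score r on an L item test of
    height h and item difficulty variance sd2  [7.2.7, 7.2.8]."""
    X = math.sqrt(1 + sd2 / 2.89)          # expansion factor
    b = h + X * math.log(r / (L - r))
    s = X * math.sqrt(L / (r * (L - r)))
    return b, s

# the worked example: r = 10 on the 23 KCTB items, h = 0, sd2 = 11.0
b, s = prox(10, 23, 0.0, 11.0)
print(round(b, 1), round(s, 1))            # -0.6 0.9
```

The rounded results match the worked values b = -0.6 and s = 0.9 above.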
            TABLE 7.3.4

    UCON ABILITIES AND ERRORS
     FOR THE 23 KCTB ITEMS

    Score    Ability    Error
      r         b         s

      1       -5.8       1.2
      2       -4.6       1.0
      3       -3.9       0.8
      4       -3.3       0.7
      5       -2.8       0.7
      6       -2.4       0.6
      7       -2.0       0.6
      8       -1.6       0.6
      9       -1.3       0.6
     10       -0.9       0.6
     11       -0.5       0.6
     12       -0.1       0.7
     13        0.3       0.7
     14        0.8       0.7
     15        1.4       0.7
     16        1.9       0.8
     17        2.4       0.8
     18        3.0       0.8
     19        3.6       0.8
     20        4.3       0.9
     21        5.2       1.0
     22        6.3       1.2
His UFORM ability and error are calculated from his relative score f = r/L = 10/23 = .43 and the values for x_fw and C_fw found in Tables A and B of the appendix with h = 0, w = 11 and L = 23. Thus

    b = h + x_fw
      = 0 - 0.8
      = -0.8

and

    s = C_fw / L^(1/2)
      = 3.3 / 23^(1/2)
      = 0.7 .
Confidence in the use of the UFORM Tables 7.3.1 and 7.3.2 or Appendix Tables A and B depends on a knowledge of their functioning over a variety of typical test situations. Wright and Douglas (1975a) investigated their functioning with a simulation study designed to check on the major threats to the success of these tables in providing useful measures.

The results of their study are summarized by the bounds given in Table 7.3.6 for the extent to which a test can depart in practice from a uniform spacing of item difficulties before measurements based on the assumption of a uniform test become unacceptable. Table 7.3.6 gives the combinations of H - beta, W and L within which the bias in estimating beta caused by non-uniformity in item difficulty is less than 0.1 logits.

The amount of leeway shown in Table 7.3.6 may seem surprising, since it allows a random item difficulty of, say, d = 2.0 when uniformity calls for delta = 1.0. But, when h and w are calculated from a test's actual d_i, it is demonstrable that a broad spectrum of test designs is exceptionally robust with respect to random departures from uniformity in item difficulty.
Table 7.3.6 shows that as test length increases beyond 30 items, no reasonable testing situation risks measurement bias large enough to matter. Tests in the neighborhood of 30 items, of width less than 8 logits and which come within 1 logit of their target beta are, for all practical purposes, free from bias caused by random deviations in the uniformity of item calibrations of magnitude less than 1 logit. Only when tests are as short as 10 items, wider than 8 logits and more than 2 logits off-target does the measurement bias caused by random non-uniformity of item difficulty exceed 0.2 logits. This means that UFORM measurement tables, even though they are based on the assumption of perfectly uniform tests, can be used to transform scores into measures in most practical situations.
    1.0        2       10      .2      .4
               1       30      .1      .3
    0.5        2       10      .1      .2
               1       30      .1      .2

BIAS = the average measurement bias in 100 replications of a test in which the random departures from a uniform distribution of item difficulties are bounded by |d_i - delta_i|.
7.4 INDIVIDUALIZED TESTING

Status Tailoring. Information about grade placement or age will often be sufficient to tailor a school test. Prior knowledge of the approximate grade placement of the target group or pupil and of the variable's grade norms can be used to determine an appropriate segment of items. Normative data in a variety of school subjects suggest that typical within-grade standard deviations are about one logit. When this is so, even a rough idea as to a pupil's within-grade quartile provides more than enough information to design a best test for that pupil.

Performance Tailoring. Where grade or age information is not sufficient, tailoring can be accomplished with a pilot test of 5 to 10 items spread out enough in difficulty to cover the widest expected target. If the pilot test were set up to be self-scoring, then pupils could use their number right to guide themselves into a second test specifically tailored to the ability level implied by their pilot test score.
This approach is self-adapting to individual variations in speed, test comfort and level of productive challenge. The large variety of different test segments which can result are easy to handle. The sequence number of the easiest and hardest items attempted and the number of correct responses between them can be read off a self-scoring answer form and converted into a measure and its standard error merely by looking up these three statistics in a simple one-page table made to fit with the booklet of items used in testing.

Self-tailored testing corresponds to the use of basal and ceiling levels on individually administered tests like the Stanford-Binet. The only difference is that, with the self-tailored test, the segment of items administered is determined by the person taking the test rather than by an examiner.
The Preschool Form is composed of the first 10 items. Only Items 3 through 10 are calibrated because virtually everyone tested so far has gotten Items 1 and 2 correct. The Primary Form is composed of Items 5, 6 and 8 through 20 to cover the middle range of the variable. The Adult Form is composed of Items 11 through 25, the hardest items calibrated, and Items 26, 27 and 28, which are so hard that no one tested so far has gotten them correct.

Notice that we can include these five "out-of-bound" items in our test forms without impairing our measurements in any way. This is because we can focus our measurements on the portion of the test which is both taken by the person and made up of calibrated items while letting extreme items continue to work for us as the conceptual boundaries of the KCT variable. If eventually we encounter persons who fail Items 1 or 2 or who pass Items 26, 27 or 28, then we will also be able to calibrate these items onto the KCT variable and use responses to them in our measurements.
The items for each of the three forms and their corresponding item difficulties, where known, are given in Table 7.5.1. Below the items in each form are that form's test characteristics: height h, width w and length L.

These three test forms were applied to the nine persons. Table 7.5.2 shows how each person scored on each of the forms. Persons 3M and 6F could be measured on only the Preschool Form while Persons 98F and 101F could be measured on only the Adult Form. Person 12M produced a measurable record on the Preschool and Primary Forms. Persons 69M and 88M produced measurable records on the Primary and Adult Forms. Persons 29M and 35F produced measurable records on all three forms.
    s = 2.6 / 8^(1/2) = 0.9
                                TABLE 7.5.1

      Preschool Form          Primary Form            Adult Form
    Item   Difficulty       Item   Difficulty       Item   Difficulty

      1        *
      2        *
      3      -6.2
      4      -4.1
      5      -2.6              5      -2.6
      6      -2.7              6      -2.7
      7      -4.3
      8      -2.6              8      -2.6
      9      -2.1              9      -2.1
     10      -2.1             10      -2.1
                              11      -1.0            11      -1.0
                              12      -0.1            12      -0.1
                              13      -0.9            13      -0.9
                              14      -0.5            14      -0.5
                              15      -1.5            15      -1.5
                              16      -0.8            16      -0.8
                              17       1.9            17       1.9
                              18       1.4            18       1.4
                              19       2.0            19       2.0
                              20       2.9            20       2.9
                                                      21       3.3
                                                      22       3.3
                                                      23       4.5
                                                      24       6.3
                                                      25       5.8
                                                      26       **
                                                      27       **
                                                      28       **
For Person 29M's relative score of .47 on the Primary Form (h = -0.6, w = 6, L = 15) we look up x_fw = -0.2 and C_fw = 2.6 to find the estimate

    b = -0.6 - 0.2 = -0.8
    s = 2.6 / 15^(1/2) = 0.7 .

For Person 29M's relative score of .27 on the Adult Form (h = 1.8, w = 8, L = 15) we look up x_fw = -2.0 and C_fw = 3.0 to estimate

    b = 1.8 - 2.0 = -0.2
    s = 3.0 / 15^(1/2) = 0.8 .
Even though only one of the forms taken is best focused on a person and so produces their "best" measure, still we can see in Table 7.5.2 that, in spite of the wide variation in score on different forms, the measures for a given person are, for the most part, comparable. Person 29M produces the greatest variation in measures over these three forms. His three relative scores of .75, .47 and .27 vary widely in response to the variation in difficulty of the three forms. According to our model his three measures of -1.9, -0.8 and -0.2 ought to be statistically equivalent, even though they may seem to vary more than we might like. When their variation is evaluated in the light of their standard errors of 0.9, 0.7 and 0.8 we see that the lowest estimate of -1.9 on the Preschool Form plus one of its standard errors and the highest estimate of -0.2 on the Adult Form minus one of its standard errors touch at -1.0.
Table 7.5.3 shows for each person their ability measure on the total KCTB test and their ability measure on each of the three sequential forms. The difference between each test form and the KCTB is given at the right of the table. When these differences are compared to the errors associated with them it can be seen that all of the differences are less than half a standard error except for those of Persons 29M and 98F.

The standard errors for each ability for the KCTB and the three test forms are given in Table 7.5.4. These values are stable and consistent over forms for the nine persons.
7.6 PERFORMANCE TAILORING

To demonstrate performance tailoring with this Pilot Form we will use the performances of our nine persons on the Pilot Form to indicate the sequential form most appropriate for measuring each of them. Then we will measure them on the indicated sequential form and compare their "performance tailored" measure with their measure based on all 23 KCTB items.
                                 Sequential Forms

                      KCTB     Preschool    Primary     Adult
                      L=23       L=8         L=15       L=15
 Ability    Person    Error      Error       Error      Error
 Group      Name        s         s1          s2         s3

            3M         1.2        1.1*
 Preschool  6F         0.8        0.8
            12M        0.7        0.8         0.8
For example, Person 29M had a KCTB ability of -0.9. Using the Pilot Form we found an ability of +1.1 with a standard error of 1.5, indicating a target range of -0.4 to 2.6. Since the Adult Form is targeted at 1.8 logits, it is the sequential form indicated for measuring 29M. On this form he obtained a measure of -0.2 logits.

In Table 7.6.2 we see that for seven of the nine persons the difference between their measure on the KCTB and their measure on a performance-tailored best sequential form is less than half a standard error. Persons 29M and 98F, however, show discrepancies between the measures implied by the KCTB and the sequential test form which are of the order of one standard error.
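The routing step in this example is simple enough to automate: take the pilot estimate and choose the sequential form targeted nearest to it. A sketch in Python; the Primary and Adult heights of -0.6 and 1.8 are given in the text, while the Preschool height of about -3.3 is our own estimate from its Table 7.5.1 calibrations:

```python
def choose_form(pilot_ability, form_heights):
    # pick the sequential form whose height h lies closest to the pilot estimate
    return min(form_heights, key=lambda name: abs(form_heights[name] - pilot_ability))

# form heights h; the Preschool value is assumed, not quoted from the text
heights = {"Preschool": -3.3, "Primary": -0.6, "Adult": 1.8}
print(choose_form(1.1, heights))   # Adult, as indicated for Person 29M
```

Person 29M's pilot ability of +1.1 is routed to the Adult Form, matching the choice made in the text.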
FIGURE 7.6.1

[Figure: the seven Pilot Form items located along the KCT variable, from -6.0 to 6.0 logits]
            TABLE 7.6.1
            PILOT FORM

        Item    Difficulty
          3       -6.2
          4       -4.1
          9       -2.1
         12       -0.1
         19        2.0
         23        4.5
         24        6.3

        Height:  h = 0.0
        Width:   W = 15
        Length:  L = 7
7.7 SELF-TAILORING

In Table 7.7.1 we show the response patterns of these self-tailored tests. The first item Person 29M missed, for example, was Item 6. Thereafter he continued passing and failing items until he failed Items 17, 18 and 19 successively. This defined a self-tailored segment for him ranging from Item 3 through Item 19. On these 17 items he had a score of r = 10.
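The segmenting rule implied by this example, keep everything from the first item attempted and stop after three successive failures, can be sketched as follows. The function and the three-failure ceiling are our reading of the Person 29M example, not a formal definition from the text:

```python
def self_tailored_segment(responses, stop_failures=3):
    """responses: list of (item_number, 0/1) in order of administration.
    Returns the item numbers in the segment and the score on it."""
    segment, run = [], 0
    for item, x in responses:
        segment.append(item)
        run = run + 1 if x == 0 else 0
        if run == stop_failures:          # ceiling reached: stop testing
            break
    score = sum(x for item, x in responses[:len(segment)])
    return segment, score

# hypothetical record: three straight failures at Items 6-8 end the segment
record = [(1, 1), (2, 1), (3, 0), (4, 1), (5, 1), (6, 0), (7, 0), (8, 0), (9, 1)]
seg, score = self_tailored_segment(record)   # items 1-8, score 4
```

In the hypothetical record above, the segment runs from Item 1 through Item 8 and the score on it is 4; Item 9 is never reached.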
In Table 7.7.2 we compute a measure for each person based upon their self-tailored test segment. In order to do this computation we determine for each self-tailored segment its test characteristics h, w and L. These test characteristics for each person's self-tailored segment are given on the left of Table 7.7.2.

Thus Person 29M has a score of 10 on his self-tailored segment of 17 items. Since his segment has a width w = 8, this score of 10 produces a relative ability measure of 0.7 logits which, when adjusted for the height of his segment (h = -1.5), yields an ability estimate of -0.8 with a standard error of 0.7. This ability estimate is only 0.1 logits away from his KCTB ability estimate of -0.9 with error 0.6. Inspection of the differences given in Table 7.7.2 between measures on each self-tailored segment and their corresponding KCTB measures shows that all of the measures obtained by self-tailoring are close to the ability measures obtained by the KCTB.
In Table 7.7.3 we show, for each type of tailoring, the efficiency in item usage for each of our nine persons. We see that a considerable number of items can be saved without much diminishing the accuracy of ability estimates.

Person 29M with a 17 item self-tailored segment requires the most items, yet even this segment is 6 items less than the total 23 KCTB items and there is virtually no loss of measurement accuracy. Person 3M produces almost as precise an estimate with only 4 self-tailored items as can be obtained for him by using all 23 of them. This saves 19 items. Person 3M, however, is at the extreme low end of the KCT variable. As a result only the four easiest items are relevant to measure his ability. Were additional easy items available, we could use them to advantage with Person 3M to improve the precision of his measure.

The self-tailored procedure always achieves the most efficient item utilization. This is especially so when making measures at extremes, in this case beyond +/- 4 logits on the KCTB ability scale. However, while appreciating this apparent efficiency, we must also realize that the items saved are items inappropriate for their target. Our real goal is to make measurements sufficiently accurate to be useful. Accuracy depends on the number of items used which are near enough to the person to be measured so that each item makes an adequate contribution to the estimated measure. This means that we want items to be within a logit of their target. Once the items are brought this near their target, all further considerations of accuracy, and hence of efficiency, boil down to the question of how many of these "tailored" items it is practical for the person to attempt.
[TABLE 7.7.1: SELF-TAILORED RESPONSE SEQUENCES FROM THE KCTB - contents not recoverable]
[TABLE 7.7.2: MEASUREMENTS FROM SELF-TAILORED RESPONSE SEQUENCES - contents not recoverable]
[TABLE 7.7.3: MEASUREMENT EFFICIENCIES POSSIBLE WITH THREE TYPES OF TAILORED TESTING - contents not recoverable]
7.8 PERSON FIT AND QUALITY CONTROL

During test administration it may appear that an examinee has taken the test as planned. Nevertheless, it is always necessary to examine the actual pattern of responses to see if this pattern does in fact correspond to reasonable expectations.
The quantity

    z_vi^2 = exp [(2x_vi - 1)(d_i - b_v)]

is a standard squared residual for evaluating the relationship between the observed response x_vi and its model expectation given b_v and d_i. According to expectation this z_vi^2 should be approximately distributed as chi-square with about (L - 1)/L degrees of freedom, where L is the number of items in the test used to estimate b_v. If the set of {z_vi^2} does appear to be distributed this way, then we have no internal reason to invalidate b_v. But if not, we must acknowledge a departure in the data from our expectation and we must see what we can do about it.

Every response x_vi in the set of i = 1 to L taken by person v produces its own almost independent z_vi^2. We can sum this set of L residuals {z_vi^2} into an approximate chi-square with about (L - 1) degrees of freedom, and for convenience express this chi-square as the standardized statistic
                    TABLE 7.8.1

    Person      Response Pattern        Score
      A       1 1 1 1 0 1 0 0 0 0         5
      B       1 1 1 0 1 0 1 0 0 0         5
      C       1 1 1 0 0 1 1 0 0 0         5
      D       0 0 0 0 0 1 1 1 1 1         5
      E       0 0 0 1 0 1 1 0 1 1         5
    v_v = SUM_i z_vi^2 / (L - 1)                [7.8.3]

and then

    t_v = [ln v_v + v_v - 1] [(L - 1)/8]^(1/2) ~ N(0,1) .
In Table 7.8.2 we work out the person fit analysis for the response patterns of Persons 12M, 35F and 88M. Person 12M has a tentative measure of b = -2.8. For his first item, d = -6.2, his response is x = 0. These give him a (d - b) difference of

    (d - b) = [-6.2 - (-2.8)] = -3.4

and, since (2x - 1) = -1,

    z^2 = exp [-(-3.4)] = 30 .
[TABLE 7.8.2: CALCULATING FIT for the response records of Persons 12M, 35F and 88M - contents not recoverable]
*Misfit Signal

    z^2 = exp [(2x - 1)(d - b)]
    v = SUM z^2 / (L - 1)
For each other response in Person 12M's tailored segment of 9 items we have given his x, (d - b) and z^2. The residual analysis based upon this row of z^2's for Person 12M leads to

    SUM_i z_i^2 = 35,    v = 35/8 = 4.4    and    t = 4.9 .
Notice in Table 7.8.2 that we have used (d - b) rather than the (b - d) used in Chapter 4. This is because the (d - b) form is convenient for the calculation of z^2. Whenever a response is 0, a minus sign is attached to the difference (d - b) which turns it into (b - d). If, however, we keep this sign change in mind, we can use Table 4.3.3 to determine the values in Table 7.8.2. If you use Table 4.3.3, however, you will find that the values in Table 7.8.2 are slightly more exact than the values determined from Table 4.3.3. The difference is greatest on responses which fit well, but these responses play the smallest role in misfit analysis. The sum of squares SUM z^2 of 12M based on Table 4.3.3 would be 33 instead of the 35 given in Table 7.8.3. The resulting t would be 4.5 instead of 4.9.
The fit statistic t is distributed more or less normally but with wider tails. In our practical experience the popular rejection level of about two is unnecessarily conservative. The general guidelines we currently use for interpreting t as a signal of misfit are:
If t > 5 we reject the measure as it stands and take whatever steps we can to extract a "corrected" measure from an acceptable segment of the response record, if one exists.

The detailed study of person misfit of course depends on a detailed study of the approximate normal deviates

    z_vi = (2x_vi - 1) exp [(2x_vi - 1)(d_i - b_v) / 2]

in the response record in order to track down the possible sources of irregularity.
Since those portions of the SUM z^2 which contribute most to t are the large positive terms, we can streamline the determination of record validity by forming a quick statistic focused on the most surprising responses. Table 4.3.3 (also given as Appendix Table C) shows that the difference between person measure b and item difficulty d must be of the order of +/- 2.0 before z^2 grows larger than 7 or its probability becomes less than 0.12. To reach a probability for a given response of .05 or less we must relax our standard to a (d - b) difference of +/- 3, producing a z^2 of 20.
For example, over the 9 responses of Person 12M, there is only one surprise. This is where (d - b) = -3.4 and z^2 = 30. Combining this value of 30 with eight 1's for the remaining eight items of the test gives us a crude SUM z^2 = 38 and a crude t = 5.3. This value for t is not far from the more exact 4.9 we calculated in Table 7.8.2 and leads us to the same conclusion of a significant misfit.
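Both the full and the crude fit analyses can be written out in a few lines. The sketch below uses the residual definitions of Section 7.8 and the log form of the t statistic that reproduces the values quoted in the text (4.9 exact, 5.3 crude for Person 12M); treat the exact functional form as our reconstruction:

```python
import math

def t_from_mean_square(v, L):
    # t = [ln v + v - 1] [(L - 1)/8]^(1/2), approximately N(0,1)
    return (math.log(v) + v - 1) * math.sqrt((L - 1) / 8)

def person_fit(responses, difficulties, b):
    # z^2 = exp[(2x - 1)(d - b)] for each response; v = sum z^2 / (L - 1)
    z2 = [math.exp((2 * x - 1) * (d - b)) for x, d in zip(responses, difficulties)]
    L = len(responses)
    v = sum(z2) / (L - 1)
    return v, t_from_mean_square(v, L)

# Person 12M's crude check: one surprise of z^2 = 30 plus eight 1's
crude_v = 38 / 8
print(round(t_from_mean_square(crude_v, 9), 1))   # 5.3
```

With the exact sum of 35 the same function gives 4.9, matching the Table 7.8.2 calculation.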
In Table 7.8.4 we summarize the residual analysis for all nine persons. For each person we give the ability measure and standard error from their self-tailored segment of items. Next we give the sum of squares, degrees of freedom, mean square and fit statistic for each person's record. For eight cases we find no evidence of misfit and so we take their measures as plausible. Only the self-tailored segment of Person 12M's record shows a significant misfit. As we saw in Table 7.8.2, this misfit is due entirely to his incorrect response on the first and easiest item in his record. The reason for this incorrect response might be a failure in test taking or a lapse in functioning. In either case we are still interested in the best possible estimate of Person 12M's ability. The problem of extracting the best possible measure from a flawed record will be discussed in Section 7.10.
170 BEST TEST DESIGN

7.9 DIAGNOSING MISFIT
Consider again the 10 item test with items in order of increasing difficulty imagined for Table 7.8.1. Were we to encounter the pattern produced by Person E, namely

                          Score
0 0 0 1 0 1 1 0 1 1         5

we would be puzzled and wonder how this person could answer the hard questions correctly, while getting the first three easiest questions incorrect. Were they "sleeping" on the easy portion of the test? Were we to encounter instead the pattern

                          Score
1 0 1 0 0 0 0 1 1 1         5

our surprise would be as great, but now we might be inclined to explain the irregularity as the result of lucky "guessing" on the three hardest items.

Both the probabilistic nature of the model and our everyday experience with typical response patterns lead us to expect patterns which have a center region of mixed correct and incorrect responses. When we encounter a pattern like

                          Score
1 1 1 1 1 0 0 0 0 0         5
Finally, we can also identify a special form of "sleeping" which might better be called "fumbling," in which the incorrect responses are bunched at the beginning of the test, suggesting that the person had trouble getting started.

                          Score
1 1 1 1 0 1 0 0 0 0         5

                                              Score
"normal"            1 1 1 0 1 0 1 0 0 0         5
"sleeping" or
"fumbling"          0 0 0 1 0 1 1 0 1 1         5
"guessing"          1 0 1 0 0 0 0 1 1 1         5
"plodding"          1 1 1 1 1 0 0 0 0 0         5
                        Score
0 1 1 1 1 1 0 0 0         5

The evaluation of this response pattern in Table 7.8.3 shows a significant misfit, t = 4.9. In Table 7.9.1 we show the response pattern for Person 12M again and add for each response the probability p of its occurrence under the model. We also give his response pattern in terms of z's in addition to the z²'s. When we plot the z's for Person 12M in Figure 7.9.1 we see what a "sleeping" or "fumbling" response pattern looks like. This figure displays the segment of items responded to. Each item is spaced horizontally along the KCT variable according to its difficulty on the logit scale. Its vertical position is determined by the person's standard residual z produced in response to that item.

The observed response pattern of Person 12M in Figure 7.9.1 shows how the z statistic indicates misfit. Item 3 has a z = -5.5 while the other items have z's near their expected value of zero. The effect of Item 3 upon the response pattern of Person 12M can be highlighted by considering the two alternative patterns given in Table 7.9.1 and Figure 7.9.1.

In alternative pattern A we retain a score of five by exchanging the correct response of "1" to Item 8, a relatively hard item, with the incorrect response of "0" to Item 3, the easiest item attempted. Now we have the pattern

                        Score
1 1 1 1 1 0 0 0 0         5
[Table 7.9.1: Diagnosing "sleeping" — the observed responses of Person 12M and alternative patterns A and B, with each response's model probability p, standard residual z, and squared residual z².]
[Figure 7.9.1: Diagnosing "sleeping" — standard residuals z plotted against item difficulty on the logit scale for Person 12M (b = -2.8, t = 4.9) and Alternative B (t = 0.1).]
[Table 7.9.2: Misfit statistics for the three patterns — mean square v = Σz²/(L - 1); * marks a misfit signal.]
The misfit statistics for these three patterns are summarized in Table 7.9.2. There we see that Alternative A has a t = -1.0 instead of 12M's t = 4.9. In alternative pattern B we instead exchange the incorrect response to Item 3 with the correct response to Item 15, giving the pattern

                        Score
1 1 1 1 0 0 0 0 1         5

Interestingly enough, the misfit for the exchange in pattern B is small, only t = 0.1. This is because Item 15, with difficulty d = -1.5, is not as hard in relation to Person 12M's ability of b = -2.8 as Item 3, with difficulty -6.2, is too easy.
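The residual arithmetic behind this comparison is direct. For an incorrect response z² = exp(b - d), so the same wrong answer is wildly surprising on the very easy Item 3 but almost unremarkable on Item 15 (a sketch; the function name is ours):

```python
import math

b = -2.8   # Person 12M's ability in logits

def z2_incorrect(d, b):
    """Squared residual for an incorrect response: z^2 = exp(b - d)."""
    return math.exp(b - d)

print(round(z2_incorrect(-6.2, b), 1))   # Item 3: far too easy, z^2 near 30
print(round(z2_incorrect(-1.5, b), 2))   # Item 15: hard for 12M, z^2 small
```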
In Tables 7.9.3 and 7.9.4 and Figure 7.9.2 we illustrate "sleeping" and "guessing" response patterns using the observed record of Person 88M. To change his response pattern to a sleeping pattern we replace his correct responses to two easy items with incorrect responses and shift these two correct responses to Items 17 and 21, thus keeping the score r = 6. Now we have the response pattern

                          Score
0 0 1 1 1 1 1 1 0 0 0       6
[Table 7.9.3: "Sleeping" and "guessing" response patterns for Person 88M — for each response x, the model probability p = 1/(1 + z²), the standard residual z = (2x - 1) exp[(2x - 1)(d - b)/2], and z² = exp[(2x - 1)(d - b)].]
[Table 7.9.4: Misfit statistics for Person 88M's observed and altered patterns — mean square v = Σz²/(L - 1); * marks a misfit signal.]
                          Score
1 1 1 1 0 0 0 0 0 1 1       6

for which t = 5.3 in Table 7.9.4. Figure 7.9.2 compares the previously acceptable response pattern of 88M with these alternative unacceptable response patterns characteristic of sleeping and guessing.

The second pattern illustrated in Figure 7.9.3 is "plodding." In this pattern the person gets every item correct as far as they go and all remaining items incorrect. This can be due to a test-taking style governed by slow and deliberate working habits. While sleeping, guessing and fumbling are indicated by positive values of t, plodding, on the other hand, produces a negative t. The negative value indicates that the observed response pattern fits even better than we expect. It indicates that even the random variability expected by the model is missing!
[Table 7.9.5: "Fumbling" and "plodding" response patterns — responses, probabilities and residuals for the illustrative records.]
[Table 7.9.6: Misfit statistics for the "fumbling" and "plodding" patterns — mean square v = Σz²/(L - 1); * marks a misfit signal.]
When we detect significant misfit in a response record, diagnose the response pattern and identify possible reasons for its occurrence, it is finally necessary to decide if an improved measure can or should be determined. Whether such a statistically "corrected" measure is fair for the person or proper in such circumstances cannot be settled by statistics. However, knowing how a measure might be objectively corrected can give us a better understanding of the possible meaning in a person's performance.

We have identified the implausibility of the response of Person 12M to the first item in his test segment given in Tables 7.9.1 and 7.9.2. Were we to decide that this particular response was not typical of Person 12M, we might delete the incorrect response to Item 3 and compute a new ability estimate based on his responses to the remaining eight items. This new calculation of his ability measure is given in Tables 7.10.1 and 7.10.2. The corrected measure b′ = -2.2 puts Person 12M about 0.6 logits higher on the KCT variable. Figure 7.10.1 shows the effect of this correction on the fit of Person 12M with t′ = -0.8 instead of t = 4.9.
In Tables 7.10.3 and 7.10.4 and Figure 7.10.2 we show the correction of a typical "guessing" pattern. The person's responses to successively more difficult items show four correct responses followed by five incorrect responses and then by two correct ones! This response pattern has a significant misfit of t = 5.3. We must ask whether the ability estimate b = 3.2 is a good indicator of this person's position on the KCT variable. Given this person's string of five incorrect responses prior to his last two correct ones, we might compute a new estimate with these last two surprising responses removed from the record. With this new truncated pattern b′ = 1.7 and t′ = -1.2. Statistical analysis alone cannot tell which estimate is more appropriate, but it can detect and arrange the available information into a concise and objective summary for us to use as part of our evaluation of the person.

Persons who guess may succeed on difficult items more often than their abilities would predict, especially on multiple choice items. This makes them appear more able, especially when many items are too difficult for them, because their frequency of success does not decrease as item difficulty increases. A similar but opposite effect occurs when able persons become careless with easy items, making these persons appear less able.
[Table 7.10.1: Correcting the measure of Person 12M for "sleeping" — the "sleeping" correction rule deletes the surprising incorrect response to the easiest item and re-estimates ability from the remaining items.]
[Table 7.10.2: Residual analysis of the corrected "sleeping" pattern — mean square v = Σz²/(L - 1).]
[Figure 7.10.1: Observed pattern for Person 12M (b = -2.8, t = 4.9) and corrected pattern (b′ = -2.2, t′ = -0.8) on the logit scale.]
[Tables 7.10.3 and 7.10.4 and Figure 7.10.2: Correction and residual analysis of a "guessing" pattern.]
item, of its distractors. For the person being measured, however, two quite different variables are involved. One is their ability, the other is their inclination to guess or their carelessness. The measurement of either variable is threatened by the presence of the other.

In situations where we think that guessing may be influenced by test format as, for example, when we think a person may guess at random over m multiple-choice alternatives, we could use the guessing probability of 1/m as a threshold below which we suppose guessing to occur. To guard our measures against this kind of guessing we can then delete all items from a response record which have difficulty greater than b + ln(m - 1), where b is the person's initial estimated ability. After these deletions we reestimate the person's ability from the remaining items attempted. If we do this, we are taking the position that when items are so difficult that a person can do better by guessing than by trying, then such items should not be used to estimate the person's ability.
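The threshold d = b + ln(m - 1) is where the model success probability falls to 1/m. A sketch of the deletion step (the function name and the five-item record are hypothetical; re-estimating b from the kept items is assumed to happen elsewhere):

```python
import math

def guard_against_guessing(difficulties, responses, b, m):
    """Set aside items so difficult that random guessing over m choices
    beats trying: drop any item with difficulty d > b + ln(m - 1),
    where b is the person's initial ability estimate."""
    threshold = b + math.log(m - 1)
    kept = [(d, x) for d, x in zip(difficulties, responses) if d <= threshold]
    return kept, threshold

# Hypothetical 5-choice test: items above b + ln(4) are set aside.
difficulties = [-1.0, 0.0, 1.0, 2.0, 3.0]
responses = [1, 1, 0, 1, 1]
kept, cut = guard_against_guessing(difficulties, responses, b=0.5, m=5)
print(round(cut, 2))   # threshold at 0.5 + ln(4) = 1.89 logits
print(len(kept))       # 3 items remain at or below the threshold
```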
In Tables 7.10.5 and 7.10.6 we show a "fumbling" pattern and its correction. Here we have an increasingly difficult segment of 17 items and a response pattern beginning with four incorrect responses followed by ten correct responses and then three incorrect responses. The pattern seems implausible and significant misfit is identified in t = 23.1. Some extraneous factor seems to be influencing the first four responses. It could be a problem of test administration procedures, or of the examinee's test behavior. A corrected response pattern could be formed by deleting the first four incorrect responses and considering only the continuous segment of correct responses and the three incorrect responses which follow them.

The corrected responses resulting from this change show a "plodding" pattern with t = -3.0. This pattern produces a considerably higher ability b′ = 1.1 than the original b = -0.9. No final decision can be made on this problem, however, until sufficient clinical or behavioral information is gathered to clarify the meaning of those first four unexpected incorrect responses.
[Table 7.10.5: Correcting a "fumbling" pattern — the 17-item segment with its observed and corrected response patterns, probabilities and residuals.]
[Table 7.10.6: Residual analysis of the corrected "fumbling" pattern — mean square v = Σz²/(L - 1).]
8 CHOOSING A SCALE
8.1 INTRODUCTION
Logits are the units of measurement we have used thus far. These units flow directly from the logistic response model which specifies the estimated probability of a correct response by person v to item i as

    pvi = exp(bv - di) / [1 + exp(bv - di)]

where bv is the estimated ability of person v and di is the estimated difficulty of item i. It follows that the odds for a correct response are

    pvi / (1 - pvi) = exp(bv - di)   so that   log[pvi / (1 - pvi)] = bv - di .

These log odds are called "logits" and so differences among items and persons are initially in logit units.
The KCT logit scale, for example, extends from -5.8 to +5.2. At the test lengths presently available, standard errors of measurement can be as low as 0.6 logits. We could add a constant such as 10 to do away with the negatives, but we could not avoid decimals by rounding KCT measures in logits to the nearest integer. That rounding would produce a least noticeable difference of almost two standard errors and so could obliterate differences in measures which might be meaningful. Were we to transform the logit scale by first multiplying each value on the scale by 10 and then adding 100, however, we would have a new scale of measures from 42 to 152 which would convey the same information as the initial logit scale but be free from negatives and decimals.
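The transformation just described can be sketched in a line of Python (the function name is ours):

```python
def rescale(logit, spacing=10, location=100):
    """Linear rescaling y = location + spacing * logit, rounded to an
    integer so the reported measure has no negatives or decimals."""
    return round(location + spacing * logit)

# The KCT logit scale runs from -5.8 to +5.2:
print(rescale(-5.8))   # 42
print(rescale(5.2))    # 152
```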
To create a new scale that is free from the inconvenience of decimals we must multiply the logits by a "spacing" factor large enough so that rounding the new units to the nearest integer does not leave behind any useful information. Once this spacing factor is chosen and the unit of our new scale is determined, we can then add a "location" factor to these new integer units that is large enough so that the lowest possible value that can occur is greater than zero. The new scale is defined by determining these two factors. The multiplicative factor establishes the spacing, or units, of the scale. The additive factor establishes the location, or origin, of the scale.

The choice of an additive factor which locates all possible values above zero is usually easy. The choice of a multiplicative factor, however, is worth further consideration. If we want to work in integer units, then we must arrange matters so that any differences
on our new scale smaller than one integer will be meaningless. This requires us to investigate the size of a least meaningful difference.
In order to be explicit about how a new scale is determined, we will express its definition as the linear transformation y = α + γx in which x is the logit scale, y is the new scale, α is the location factor for determining the new scale origin and γ is the spacing factor for determining the new scale unit. We make this transformation linear because we want to preserve the interval characteristics of the logits produced by the Rasch model. Our new measures B and new calibrations D can be expressed in terms of their logit counterparts b and d as

    B = α + γb                [8.2.1]
    D = α + γd                [8.2.2]

    SE(B) = γ SE(b)           [8.2.3]
    SE(D) = γ SE(d)           [8.2.4]

This shows how the nature of the new scale depends on the values for α and γ chosen to define it.
In passing let us appreciate again that person ability and item difficulty mark locations on one common variable. In constructing this variable we necessarily work with the calibrations of the items which define it. However, when we use the variable to measure persons we then work with their measures along the variable defined by these items. What a measure tells about a person is the difficulty level of the items on which that person is likely to succeed half the time. In the same way, what a calibration tells about an item is the ability level of persons who are likely to succeed on that item half the time. Thus, were we not reserving the terms "measure" to refer to the location of persons and "calibration" to refer to the location of items, we could as well speak of item difficulty as the measure of the item and of person ability as the calibration of the person.

We want to free our new scale from decimals, but we do not want to obliterate useful information. As a result, we need to determine the least measurable difference LMD on our logit scale so that we can choose a spacing factor γ that brings this logit LMD to
at least one integer on our new scale. The nearest any two persons can be in observed scores, without being the same, is one score apart. This is the least observable difference LOD. We need to transform this LOD into its corresponding LMD in logits of ability. As a result LMD must follow LOD at the rate ∂b/∂r by which scores of r produce measures of b, that is

    LMD ≅ (∂b/∂r) LOD .

The Rasch response model gives us the expected relation between relative score f and estimated response probability pfi of

    f = Σi pfi / L

in which pfi = exp(bf - di) / [1 + exp(bf - di)]. The rate at which the measure b follows the relative score f is then

    gf = ∂b/∂f = [Σi pfi (1 - pfi) / L]⁻¹ = Cfw .
This coefficient is subscripted by test width w as well as relative score f because, as we learned in Chapter 6, the exact values of this coefficient depend not only on the relation between test difficulty level and person ability expressed in relative score f = r/L, but also on the width in difficulty covered by the test. This gives us a least measurable difference of

    LMD ≅ gf (1/L) = Cfw / L .        [8.3.2]

The way Cfw and hence LMD varies with b is pictured in Figure 8.3.1. As the ability measure b moves away from test center and/or the test operating curve flattens, the LMD becomes larger. Fortunately the range of values which Cfw will have in practice is limited. For the tests we are likely to use,

    4 < Cfw < 9

and Cfw = 6 can be used as a convenient single working value for Cfw (see Table 6.8.1 for details). Then

    LMD = 6/L                         [8.3.3]

and the implied minimum spacing factor is γ_LMD ≥ L/6 .
194 BEST TEST DESIGN
FIGURE 8.3.1
R E L A T IV E
SCORE
f = r/L
MEASURE
b
The LMD approximates the smallest possible meaningful unit since it stems from the least observable difference. However, from an estimation point of view, we might consider instead that one standard error of measurement SEM is actually the least "believable" difference. In logits the SEM is related to the LMD as

    SEM = 2.5/L^½ ≅ LMD^½             [8.3.4]

As long as there are more than six items in our test the SEM determines a smaller γ than the LMD since

    γ_SEM ≥ L^½/2.5 < L/6   when L > 6 .

An SEM-based scale, which might be simpler numerically, however, will also be somewhat less discriminating in its integer increment than an LMD-based scale. Which choice is preferable in any particular situation cannot be settled by statistical considerations. The choice will inevitably depend on the use to which the measures are to be put.
Similarly, we might require a full least significant difference between two measures,

    LSD = 3.5/L^½                     [8.3.5]

implying γ_LSD ≥ L^½/3.5 .

As long as the number of items in our test is greater than six, the relative magnitudes of these bases for determining a lower limit for the spacing factor are

    γ_LMD > γ_SEM > γ_LSD .

Figure 8.3.2 shows the relationships between LMD, SEM and LSD in logits for the KCTB test of 23 items. The items, their logit values, score equivalents and LOD's at 3 to 4, 12 to 13, 18 to 19 and 20 to 21, along with their corresponding exact LMD's, SEM's and LSD's, are shown.
We can compare the exact values in Figure 8.3.2 with the approximations of Equations 8.3.3, 8.3.4 and 8.3.5.

                                         Minimum γ Implied
    LMD = 6/L     = 6/23    = 0.26             4
    SEM = 2.5/L^½ = 2.5/4.8 = 0.52             2
    LSD = 3.5/L^½ = 3.5/4.8 = 0.73             1.5

Because the 13 logit KCTB is unusually wide, these approximations are smaller than the exact values given in Figure 8.3.2. The minimum LMD spacing factor γ indicated by the exact values would be about 2 while the approximations could lead to a minimum γ of 4. Since we would only be in danger of losing information if the approximations led us to a γ of less than 2, we see that even in this extreme situation the approximations do not mislead us.
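The approximations for the 23-item KCTB can be checked directly (a sketch; the text rounds the reciprocals up to the minimum γ's of 4, 2 and 1.5):

```python
import math

L = 23   # items in the KCTB

lmd = 6 / L                   # least measurable difference, Eq. 8.3.3
sem = 2.5 / math.sqrt(L)      # standard error of measurement, Eq. 8.3.4
lsd = 3.5 / math.sqrt(L)      # least significant difference, Eq. 8.3.5

# Each value in logits, with the reciprocal that bounds the spacing factor:
print(round(lmd, 2), round(1 / lmd, 1))   # 0.26 and 3.8 (minimum gamma ~ 4)
print(round(sem, 2), round(1 / sem, 1))   # 0.52 and 1.9 (minimum gamma ~ 2)
print(round(lsd, 2), round(1 / lsd, 1))   # 0.73 and 1.4 (minimum gamma ~ 1.5)
```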
8.4 DEFINING THE SPACING FACTOR

Once we have defined a least meaningful difference in logits, whether it be the least measurable difference LMD(b) = 6/L to maintain maximum observability or the standard error of measurement SEM(b) = 2.5/L^½ and its least significant difference LSD(b) = 3.5/L^½ to maintain statistical reliability, we can use this least meaningful difference to establish a spacing factor which will make all interpretable differences on our new scale greater than one.

If our aim is to make the least measurable difference in the new scale LMD(B) > 1, then since γ = LMD(B)/LMD(b), it follows as in Equation 8.3.3 that

    γ_LMD ≥ L/6                       [8.4.1]
is the spacing factor which guarantees that no observable differences will be obliterated by rounding to the nearest integer.

Often, however, there will be other considerations which will lead us to allow γ to become even larger than L/6 in order to reach memorable scale intervals like 5, 10, 20, 25, 50 or 100.

To get a rough idea as to typical useful values of γ, we list in Table 8.4.1 values for the least meaningful differences which go with various test lengths. In Table 8.4.1 we see that we would seldom be satisfied with a spacing factor less than 5 and seldom need one larger than 100. Table 8.4.1 suggests that we could work satisfactorily with spacing factors in this range. Only for tests of unusual length, such as 1,000 item examinations, would we want γ = 100.
[Table 8.4.1: Least meaningful differences LMD(B) = γ LMD(b) for various test lengths and spacing factors.]
For example, suppose in order to rescale the KCTB shown in Figure 8.3.2 we chose γ = 5. Then, although in logits

    LMD(b) = [SEM(b)]² ,

on the new scale

    LMD(B) = [SEM(B)]² / 5 .

Thus while an SEM(b) of 0.75 = 0.57^½ goes with an LMD(b) of 0.57, when γ = 5 then

    LMD(B) = 5 × 0.57 = 2.85 .
If we want our scale to be based on a normative reference, we can use the observed logit mean m and logit standard deviation s of the selected norming sample as factors in a preliminary transformation d′ = (d - m)/s and b′ = (b - m)/s which puts the norming group mean at zero and the scale unit at one normative standard deviation.

After this preliminary step, we then choose a spacing factor large enough so that meaningful differences become greater than one and at a value which pegs the normative standard deviation at some easy to remember unit such as 10, 20, 50 or even 100. At the same time we choose the location factor α so that the mean of the norming group is also easy to recall, for example at 50, 100 or 500.

Thus to create a norm based scale of normative units or NITs, we use for persons

    B = α + γ(b - m)/s

and for items

    D = α + γ(d - m)/s .

We then have on the new NITs scale the norming mean M = α and the norming standard deviation S = γ.
For the KCTB norming sample, with mean m = 1.3 logits and standard deviation s = 1.9 logits, choosing α = 50 and γ = 10 gives

    B = 50 + 10(b - 1.3)/1.9          [8.5.2]
      = 50 - 10(1.3/1.9) + 10b/1.9
      = 43.2 + 5.3b

and

    D = 43.2 + 5.3d .

Notice that with this scale definition, the normative mean of m = 1.3 logits becomes

    M = 43.2 + 5.3(1.3) = 50 NITs .

If we now set b at

    b = m + s = 1.3 + 1.9 = 3.2

then

    M + S = 43.2 + 5.3(3.2) = 60 NITs

so that

    M + S - M = S = 60 - 50 = 10 NITs .

Figure 8.5.1 shows the distribution of the 68 norming persons. Below their distribution are the ability measures for each score in logits and in NITs, and at the bottom are the KCTB items which define the variable. In Figure 8.5.1 we see that 30 NITs → m - 2s, 40 NITs → m - s, 50 NITs → m, 60 NITs → m + s and 70 NITs → m + 2s.
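The NIT transformation is a one-line function (a sketch; the function name is ours, with the norming values m = 1.3 and s = 1.9 from the text):

```python
def to_nits(b, m=1.3, s=1.9, location=50, spacing=10):
    """Norm-based scale B = location + spacing*(b - m)/s, so the norming
    mean lands at 50 NITs and one normative SD spans 10 NITs."""
    return location + spacing * (b - m) / s

print(round(to_nits(1.3)))          # the norming mean m -> 50 NITs
print(round(to_nits(1.3 + 1.9)))    # m + s -> 60 NITs
print(round(to_nits(1.3 - 3.8)))    # m - 2s -> 30 NITs
```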
If d₁ and d₂ identify the criterion positions in logits and D₁ and D₂ represent the desired easy-to-remember positions of these criteria on the new scale, then

    α = (D₁d₂ - D₂d₁)/(d₂ - d₁)
    γ = (D₂ - D₁)/(d₂ - d₁)

and so

    B = α + γb
    D = α + γd

become

    B = [(D₁d₂ - D₂d₁) + (D₂ - D₁)b]/(d₂ - d₁)    [8.6.1]

and

    D = [(D₁d₂ - D₂d₁) + (D₂ - D₁)d]/(d₂ - d₁) .
In order to apply this method to the KCTB example, we will designate a basal level at the 3-tap median of d₁ = -3.4 logits and a competency level at the 5-tap median of d₂ = 1.4 logits. Then we will arrange to report these criteria at D₁ = 30 for basal and D₂ = 50 for competency using

    α = [30(1.4) - 50(-3.4)]/[1.4 - (-3.4)]
      = (42 + 170)/4.8
      = 44.2

and

    γ = (50 - 30)/[1.4 - (-3.4)]
      = 20/4.8
      = 4.2

so that

    B = 44.2 + 4.2b                   [8.6.2]

and

    D = 44.2 + 4.2d .

This scaling transforms the 3-tap median at d₁ = -3.4 logits to D₁ = 44.2 + 4.2(-3.4) = 30 SITs and the 5-tap median at d₂ = 1.4 logits to D₂ = 44.2 + 4.2(1.4) = 50 SITs.
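Equations 8.6.1 reduce to two lines of code (a sketch; the function name is ours):

```python
def criterion_scale(d1, D1, d2, D2):
    """Location and spacing that send the logit criteria d1, d2 to the
    chosen reporting values D1, D2 (Equations 8.6.1)."""
    gamma = (D2 - D1) / (d2 - d1)
    alpha = (D1 * d2 - D2 * d1) / (d2 - d1)
    return alpha, gamma

# KCTB: 3-tap median -3.4 logits -> 30 SITs, 5-tap median 1.4 -> 50 SITs
alpha, gamma = criterion_scale(-3.4, 30, 1.4, 50)
print(round(alpha, 1), round(gamma, 2))   # 44.2 and 4.17 (about 4.2)
```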
Given the model probability of a correct response, we can determine the differences (b - d) between person ability and item difficulty which lead to the response probabilities .10, .25, .50, .75 and .90. Solving for (b - d) in logits we have

    Probability of Success    (b - d) in Logits
          .10                      -2.2
          .25                      -1.1
          .50                       0.0
          .75                       1.1
          .90                       2.2
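These entries come from inverting the response model: (b - d) = ln[p/(1 - p)]. A sketch (the function name is ours):

```python
import math

def logit_difference(p):
    """Solve p = exp(b - d)/[1 + exp(b - d)] for the ability-difficulty
    difference: b - d = ln[p/(1 - p)]."""
    return math.log(p / (1 - p))

for p in (0.10, 0.25, 0.50, 0.75, 0.90):
    print(p, round(logit_difference(p), 1))   # -2.2, -1.1, 0.0, 1.1, 2.2
```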
The CHIP scale uses

    B = α + γ(b - c)                  [8.7.1]

in which c is either a normative or a substantive choice of location on the logit scale and α = 50, 100 or 500. Centering at the KCTB norming mean of 1.3 logits with γ = 4.55 gives

    B = 50 + 4.55(b - 1.3)            [8.7.2]
      = 44.09 + 4.55b
      = 44.1 + 4.6b .
Notice that when b is located at the mean of the norming group, then B = 50 + 4.55(1.3 - 1.3) = 50 CHIPs.
Our choice of γ = 4.55 produces the following relations between the relative positions of a person at B and an item at D:

    Probability of Success    (B - D) in CHIPs
          .10                     -10
          .25                      -5
          .50                       0
          .75                       5
          .90                      10
Thus we expect that when any person confronts any item 10 CHIPs below their ability the probability of a successful response is about .90. At 5 CHIPs below, the predicted success rate is .75. On the other side, if an item is 5 CHIPs more difficult than the person is able, we expect the success rate to drop to .25 and, when the person is at a disadvantage of 10 CHIPs, we expect success only .10 of the time.
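With γ = 4.55 CHIPs per logit, the success probabilities above follow from the response model applied to the rescaled difference (a sketch; the function name is ours):

```python
import math

def success_rate(B, D, spacing=4.55):
    """Probability of success when person B and item D are reported in
    CHIPs (4.55 CHIPs per logit): p = 1/(1 + exp(-(B - D)/spacing))."""
    return 1.0 / (1.0 + math.exp(-(B - D) / spacing))

for gap in (-10, -5, 0, 5, 10):
    print(gap, round(success_rate(50 + gap, 50), 2))   # .10 .25 .50 .75 .90
```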
For items we center at the 5-tap median of 1.4 logits, so that

    D = 50 + 4.55(d - 1.4)            [8.7.3]
      = 43.63 + 4.55d
      = 43.6 + 4.6d

and so

    B = 43.6 + 4.6b .

When a person's ability b is at the 5-tap competency level of 1.4 logits, then

    B = 43.6 + 4.6(1.4) = 50 CHIPs .

Table 8.7.1 brings together the logit, NIT, SIT and CHIP scales for the KCTB test.
TABLE 8.7.1

Score   Logits   NITs   SITs   CHIPs   SEM(Logits)   SEM(NITs)   SEM(SITs)   SEM(CHIPs)
  22     6.26     76     70     73        1.20           6           5           6
  21     5.15     70     66     68        0.99           5           4           5
  20     4.31     66     62     64        0.89           5           4           4
  19     3.60     62     59     61        0.83           4           3           4
  18     2.99     59     57     58        0.78           4           3           4
  17     2.42     56     54     55        0.76           4           3           3
  16     1.88     53     52     53        0.75           4           3           3
  15     1.35     50     50     50        0.74           4           3           3
  14     0.84     48     48     48        0.73           4           3           3
  13     0.35     45     46     46        0.70           4           3           3
  12    -0.10     43     44     44        0.67           4           3           3
  11    -0.51     40     42     42        0.65           3           3           3
  10    -0.90     38     40     40        0.63           3           3           3
   9    -1.28     36     39     38        0.62           3           3           3
   8    -1.65     34     37     37        0.62           3           3           3
   7    -2.02     32     36     35        0.63           3           3           3
   6    -2.42     30     34     33        0.65           3           3           3
   5    -2.85     28     32     31        0.69           4           3           3
   4    -3.33     26     30     29        0.74           4           3           3
   3    -3.90     23     28     26        0.82           4           3           4
   2    -4.65     19     25     23        0.95           5           4           4
   1    -5.75     13     20     18        1.23           7           5           6
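Any row of Table 8.7.1 can be reproduced from the three scale equations (a sketch; the function names are ours):

```python
def nits(b):  return 43.2 + 5.3 * b   # Equation 8.5.2 simplified
def sits(b):  return 44.2 + 4.2 * b   # Equation 8.6.2
def chips(b): return 43.6 + 4.6 * b   # B = 43.6 + 4.6b from Section 8.7

# Score 15 on the KCTB is 1.35 logits; all three scales center near 50:
b = 1.35
print(round(nits(b)), round(sits(b)), round(chips(b)))   # 50 50 50
```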
The use of the Rasch model in test construction can facilitate test interpretation. We illustrate this with a reporting form developed for the KCT variable.

Figure 8.8.1 provides a map of the KCT variable. This map shows all of the data gathered thus far: the KCTB items positioned along the variable by their difficulty levels, the substantive criteria of number of taps, reverses and distance across blocks, and the normative information of median ages for children and mean and standard deviation for adults. The map shows the extent to which the KCT variable has been defined and how various possible KCT measures relate to substantive and normative considerations. A KCT report form can be developed from this map.

Notice that Person 12M with his score of 5 is located at -2.8 logits on the KCT variable. This puts him halfway between 3 and 4 taps substantively and at the 5 year old median normatively. Person 88M, however, at 3.0 logits is functioning at 6 taps substantively and at about one standard deviation above the adult mean normatively.
Figure 8.8.3 shows a Misfit Ruler scaled in logits. It is marked to indicate the logit
deviations and a corresponding misfit index y² = (z² − 1) to the left and right of its
center. Notice that the unexpected response deviations of 1, 2, 3 and 4 logits indicate
y²'s of 2, 6, 19 and 54 respectively. By positioning the center of the ruler at the point
on the variable where the person is located and comparing the ruler's markings with the
person's response to each item we can calculate, at a glance, the misfit of the person's
record.
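The ruler's markings follow directly from the model: an unexpected response to an item |b − d| logits away from the person has squared standardized residual z² = exp(|b − d|), so the misfit index is y² = exp(|b − d|) − 1. A quick Python check (our illustration, not part of the book) recovers the printed values:

```python
import math

# Misfit index y^2 = z^2 - 1 = exp(|b - d|) - 1 for an unexpected response
# at deviations of 1, 2, 3 and 4 logits from the person's position.
ruler_marks = [round(math.exp(k) - 1) for k in (1, 2, 3, 4)]
print(ruler_marks)  # [2, 6, 19, 54], the markings quoted in the text
```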
U = Q/√L   [7.8.1]

with

U ~ N(0,2) .
This easy to calculate statistic can be used to evaluate misfit. When U > 5 the
probability that the record is acceptable has dropped below .01, and it seems reasonable to
question the validity of the record.

can be identified and dealt with. Subsequent cases can then be handled in order of U
until all useful explanations of the invalidities implied by U > 5 are discovered.
Figure 8.8.2 shows the test segment of 23 KCTB items in order of increasing
difficulty together with three response records. The response records show the correct and
incorrect responses on the 23 items. To evaluate the fit of Person 12M's record the center
of the Misfit Ruler is placed at the arrow marking his position at −2.8 logits determined
by his score of 5. His incorrect response to Item 3 at −6.2 logits produces a y² of about
30 for Q = 30 and U = 30/√23 = 6.3. This corresponds to a t = 4.5 which is very close to
the more exact value of t given in Table 7.8.3. Again, we see that this response is too
improbable to be accepted as part of a valid measure of Person 12M.
The Misfit Ruler has also been applied to Person 88M and to the "Sleeping" pattern
with the same score of 18. The pattern for Person 88M produces a response record that
yields a Σy² = 2, and U = 0.4. The "Sleeping" pattern, however, produces a

Σy² = 46 + 20 + 2 + 3 = 71

and so a

U = 71/√23 = 14.8 .
TABLE 8.8.1
QUICK ANALYSIS OF RESPONSE RECORD VALIDITY

                                Sum of Unexpected        Fit
Person                 Score    Responses Q = Σy²        Statistic U = Q/√L
12M                      5      30                        6.3*
88M                     18      2                         0.4
"Sleeping" Pattern      18      46 + 20 + 2 + 3 = 71     14.8*

L = 23                                                   *Misfit
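The quick analysis above is easy to script. The sketch below is ours, not the book's program: it uses the standard Rasch identity z² = exp((b − d)(1 − 2x)) for a response x scored 1 or 0, which reduces to the z² = exp(b − d) used on the Misfit Ruler when the response is incorrect:

```python
import math

def y2(b, d, x):
    """Misfit index y^2 = z^2 - 1 for one response, where the squared
    standardized Rasch residual is z^2 = exp((b - d)(1 - 2x)) and x is
    the response scored 1 (correct) or 0 (incorrect)."""
    return math.exp((b - d) * (1 - 2 * x)) - 1.0

def fit_statistic(q, n_items):
    """U = Q / sqrt(L); under the model U ~ N(0, 2)."""
    return q / math.sqrt(n_items)

# Person 12M at b = -2.8 fails Item 3 at d = -6.2: y^2 is about 29,
# "about 30" in the text, dominating his Q.
print(round(y2(-2.8, -6.2, 0)))              # about 29
print(round(fit_statistic(30, 23), 1))       # 6.3, Person 12M
print(round(fit_statistic(71, 23), 1))       # 14.8, the "Sleeping" pattern
```

Both U values agree with Table 8.8.1 and flag the records as misfitting against the U > 5 criterion.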
FIGURE 8.8.3
MISFIT RULER

[Ruler scaled in logits from −4 to +4 about its center, marked to each side
with the misfit index y² = 2, 2, 3, 5, 6, 8, 11, 15, 19, 25, 32, 42, 54 and
with improbability levels P = .2, .1, .05, .02.]

HOW TO USE THE MISFIT RULER:
APPENDICES
TABLE A
RELATIVE ABILITY x_fw FOR UNIFORM TESTS IN LOGITS
(Continued: f > .50 | f < .50)
TABLE B
ERROR COEFFICIENT C_f FOR UNIFORM TESTS IN LOGITS
(Continued: f > .50 | f < .50)
TABLE C
MISFIT STATISTICS

(b − d)          Difference between person ability and item difficulty
z² = exp(b − d)  Squared standardized residual
p = 1/(1 + z²)   Improbability of the response
I = 400p(1 − p)  Relative efficiency of the observation
L = 1000/I       Number of items needed to maintain equal precision

(b − d)    z²    p     I     L
2.1 8 .11 40 25
2.2 9 .10 36 28
2.3 10 .09 33 30
2.4 11 .08 31 32
2.5 12 .08 28 36
2.6 14 .07 25 40
2.7 15 .06 23 43
2.8 17 .06 21 48
2.9 18 .05 20 50
3.0 20 .05 18 55
3.1 22 .04 16 61
3.2 25 .04 15 66
3.3 27 .04 14 73
3.4 30 .03 12 83
3.5 33 .03 11 91
3.6 37 .03 10 100
3.7 41 .02 9 106
3.8 45 .02 9 117
3.9 50 .02 8 129
4.0 55 .02 7 142
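Every column of Table C follows from the formulas in its header, so any row can be regenerated. A short Python sketch (our illustration; the function name is hypothetical) recomputes a row from the difference b − d:

```python
import math

def misfit_row(bd):
    """Recompute one row of Table C from the difference bd = b - d."""
    z2 = math.exp(bd)            # squared standardized residual
    p = 1.0 / (1.0 + z2)         # improbability of the response
    eff = 400.0 * p * (1.0 - p)  # relative efficiency of the observation
    L = 1000.0 / eff             # items needed to maintain equal precision
    return round(z2), round(p, 2), round(eff), round(L)

print(misfit_row(3.0))  # (20, 0.05, 18, 55), matching the table row for 3.0
```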
REFERENCES
Douglas, Graham A. Test design strategies for the Rasch psychometric model. Doctoral
dissertation, University of Chicago, 1974.

Draba, R. E. The Rasch model and legal criteria of a "reasonable" classification. Doctoral
dissertation, University of Chicago, 1978.

Elliott, C. D., Murray, D. C., and Pearson, L. S. The British Ability Scales. Slough, England:
National Foundation for Educational Research, 1977.

Fischer, G. H. and Scheiblechner, H. H. Two simple methods for asymptotically unbiased
estimation in Rasch's model with two categories of answers. Research Bulletin No. 1,
Psychological Institute, University of Vienna, 1970.

Gulliksen, H. Theory of Mental Tests. New York: John Wiley & Sons, 1950.

Gustafsson, J. E. The Rasch model for dichotomous items: Theory, applications and a
computer program. Report No. 63. Institute of Education, University of Göteborg, 1977.

Haberman, S. Maximum likelihood estimates in exponential response models. The
Annals of Statistics, 1977, 5, 815-841.

Loevinger, J. A systematic approach to the construction and evaluation of tests of ability.
Psychological Monographs, 1947, 61.

Loevinger, J. Person and population as psychometric concepts. Psychological Review,
1965, 72, 143-155.

Lord, F. M. An analysis of the Verbal Scholastic Aptitude Test using Birnbaum's three-
parameter logistic model. Educational and Psychological Measurement, 1968, 28,
989-1020.

Mason, G. P. and Odeh, R. E. A short-cut formula for standard deviation. Journal of
Educational Measurement, 1968, 5, 319-320.

Mead, R. J. Analysis of fit to the Rasch model. Doctoral dissertation, University of
Chicago, 1975.

Panchapakesan, N. The simple logistic model and mental measurement. Doctoral
dissertation, University of Chicago, 1969.

Rasch, G. On simultaneous factor analysis in several populations. In Uppsala Symposium
on Psychological Factor Analysis. Stockholm: Almquist and Wiksells, 1953, 65-71.

Rasch, G. Probabilistic Models for Some Intelligence and Attainment Tests. Copenhagen:
Danmarks Paedagogiske Institut, 1960 (To be reprinted by University of Chicago
Press, 1980).

Rasch, G. On general laws and the meaning of measurement in psychology. In Proceedings
of the Fourth Berkeley Symposium on Mathematical Statistics and Probability.
Berkeley: University of California Press, 1961, 4, 321-333.

Rasch, G. An individualistic approach to item analysis. In P. F. Lazarsfeld and N. W.
Henry (Eds.), Readings in Mathematical Social Science. Chicago: Science Research
Associates, 1966a, 89-108.

Rasch, G. An item analysis which takes individual differences into account. British
Journal of Mathematical and Statistical Psychology, 1966b, 19, 49-57.

Rasch, G. An informal report on the present state of a theory of objectivity in comparisons.
In L. J. van der Kamp and C. A. J. Vlek (Eds.), Proceedings of the NUFFIC
International Summer Session in Science at "Het Oude Hof." Leiden, 1967.

Rasch, G. A mathematical theory of objectivity and its consequences for model construction.
In Report from European Meeting on Statistics, Econometrics and Management
Sciences, Amsterdam, 1968.

Rentz, R. R. and Bashaw, W. L. Equating Reading Tests with the Rasch Model. Athens,
Georgia: Educational Resource Laboratory, 1975.

Rentz, R. R. and Bashaw, W. L. The national reference scale for reading: An application
of the Rasch model. Journal of Educational Measurement, 1977, 14, 161-180.
Ability β and b (see Person ability measure)
Additive scale factor a (see Scale additive factor)
Analysis of fit (see Fit)
Bank (see Item bank building)
Best test design (see Test design)
Beta β (see Person ability measure)
BICAL, 46 - 54
    control, 46 - 47
    output, 46, 48 - 52, 54
Calibration δ and d (see Item difficulty calibration)
Chain, 99
Chicago probability unit (see Scale CHIP)
CHIP (see Scale CHIP)
Common item equating, 108 - 109, 112 - 118
Common person equating, 106 - 112
Computing algorithms:
    BICAL, 46 - 54
    PROX, 61 - 62
        hand example, 30 - 44
        computer example, 46 - 55
    UCON, 62 - 65
    UFORM, 143 - 151
    tables, 146, 212, 214
Connecting two tests, 96 - 98 (see Linking test forms)
Control lines for identity plots, 94 - 95
Correcting a measure, 181 - 190
Criterion referencing, 118 - 121, 199 - 202, 204, 206 - 207
Crude fit (see Fit)
Data matrix, 10, 18, 31, 33, 68, 107 - 109
Degrees of freedom, 23 - 24, 71, 74, 77, 79
    crude fit, 125
    item fit, 24, 77, 79
    link analysis, 96
    person fit, 23, 76 - 77, 79, 165 - 168
Delta δ (see Item difficulty calibration)
Design of best test (see Test design)
Diagnosing misfit, 170 - 180
Difficulty δ and d (see Item difficulty calibration)
Discrimination (see Item discrimination index)
Editing data, 31 - 34, 47 - 49
Efficiency, 74 - 75, 139, 161, 164
Equating test forms (see Linking test forms)
Error coefficient C_f, 135 - 140, 146, 193 - 194, 214
Estimation methods, ix - x, 15 - 20, 44 - 45
    PROX, 21 - 22, 28 - 45, 50 - 56, 60 - 62, 143, 149 - 150
    UCON, 56 - 65, 142 - 143, 148 - 150
    UFORM, 143 - 151, 214, 216
Expansion factors X and Y, 21 - 22, 30, 40 - 44, 50, 62, 148
Extending a variable, 87 - 93
Fit:
    analysis, 2 - 4, 23 - 24, 66 - 82
    computer example, 52 - 55, 58 - 59, 80 - 82
    correcting misfit, 181 - 190
    crude, 124 - 125
    diagnosing misfit, 170 - 180
    hand example, 69 - 79
    item fit, 52 - 55, 58 - 59, 77 - 79, 121 - 125
    link fit, 93 - 96, 98
    loop fit, 100
    person fit, 2 - 4, 76 - 77, 121 - 125, 165 - 180, 205 - 209
    response fit, 69 - 77, 121 - 125, 165 - 180
    ruler for fit analysis, 208 - 209
    summary of fit analysis, 79 - 80
    table for fit analysis, 73, 216
Fumbling (see Response pattern)
Guessing (see Response pattern)
Identity line, 89, 92 - 95
Individualized testing (see Tailoring)
Information I, 16 - 17, 73 - 75, 135, 161 - 164
Intensifying a variable, 87 - 94
Interval distribution of items or persons, 130 - 131, 133 - 134, 137, 139
Item:
    characteristic curve, 12 - 14, 51 - 53, 58 - 59
    difficulty calibration δ and d, 17 - 22, 25, 30, 34 - 38, 40 - 42, 54 - 55, 61 - 65
    discrimination, ix - x
        index, 52 - 55
    fit, 52 - 55, 58 - 59, 77 - 79, 121 - 125
    p-value, viii, xi - xiii, 25 - 26
    point biserial, viii, x, 26
    score s_i, 10, 18 - 22, 32 - 35
Item bank building, 98 - 118
    KCT example, 106 - 118
    chain, 99
    link, 96 - 106
    loop, 100
    network, 101 - 103
    web, 102 - 106
Item calibration quality control, 121 - 125 (see Item fit)
KCT (see Knox Cube Test)
Knox Cube Test KCT, 28 - 29
    banking KCTB, 106 - 118
    criterion referencing, 118 - 121, 206 - 207
    KCTB, 106 - 121
NOTATION
(continued)

v_v    mean square residual for person v
v_i    mean square residual for item i
f_v    degrees of freedom in v_v
f_i    degrees of freedom in v_i
t_v    standardized mean square of v_v
t_i    standardized mean square of v_i

Exceptions to this notation occur when locally convenient, particularly with "s",

Σ_{j=1}^{M} (y_j)    continued sum of y_j over j = 1, M
Π_{j=1}^{M} (y_j)    continued product of y_j over j = 1, M
E{y}    expected value of y
V{y}    variance of y
RASCH M O D E L