Stages of Test Development

1. Test Conceptualization
- begins with asking relevant questions.
Preliminary questions:
1. What is the test designed to measure?
2. What is the objective of the test?
3. Is there a need for this test?
4. Who will use this test?
5. Who will take this test?
6. What content will the test cover?
7. How will the test be administered?
8. What is the ideal format of the test?
9. Should more than one form of the test be developed?
10. What special training will be required of test users for administering or interpreting the test?
11. What types of responses will be required of test takers?
12. Who benefits from an administration of this test?
13. Is there any potential for harm as the result of an administration of this test?
14. How will meaning be attributed to scores on this test? (norm-referenced versus criterion-referenced, as discussed in previous modules)
Best next step: Pilot Work (preliminary research to learn more about the construct)

2. Test Construction
Scaling – setting rules for assigning numbers in measurement
Types of Scales: Nominal-Ordinal-Interval-Ratio
Scaling Methods:
Rating scale – a grouping of words or statements on which the test taker indicates agreement or identification; each response is then assigned a number
Common in psychology: the Likert scale. With a summative scale such as the Likert scale, your measure may be unidimensional (one score represents the construct) or multidimensional (several scores or factors represent the construct); see the scoring sketch after this list
Paired comparisons – the test taker compares pairs of stimuli and selects the one he/she identifies with
Categorical scaling – stimuli are placed into two or more categories based on the test taker's preference, according to the rules that have been set
Guttman scale/scalogram analysis – items range from weaker to stronger expressions of the attribute, such that a test taker's endorsement of the most extreme expression also implies endorsement of the less extreme ones
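
To make summative (Likert) scoring concrete, here is a minimal Python sketch; the four responses, the 5-point format, and the reverse-keyed item are hypothetical examples, not a prescribed procedure.

```python
# Minimal sketch of summative (Likert) scoring; the items, the 5-point
# format, and the reverse-keyed item below are hypothetical examples.

# One respondent's answers to a 4-item scale
# (1 = strongly disagree ... 5 = strongly agree).
responses = [4, 5, 2, 4]

# The third item is worded in the opposite direction, so it is
# reverse-scored before summing (on a 1-5 scale, reversed = 6 - raw).
reverse_keyed = {2}  # zero-based indices of reverse-keyed items

scored = [(6 - r) if i in reverse_keyed else r for i, r in enumerate(responses)]

# A unidimensional measure yields one summative score for the construct;
# a multidimensional measure would sum each factor's items separately.
total = sum(scored)
print(scored, total)  # [4, 5, 4, 4] 17
```
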
Writing Items
1. Item pool – the collection of candidate items for the test (rule of thumb: the pool should contain roughly twice the number of items you intend to have in the final version)
2. Item format – form, plan, structure, arrangement, and layout of individual test items
2.1. Selected-response format – the test taker selects a response from a set of alternatives
2.1.1. Multiple choice – three elements: a stem, a correct option, and other incorrect options (distractors)
2.1.2. Matching item – the test taker matches premises in the first column to responses in the second column
2.1.3. Binary-choice item – the test taker selects one of two choices (e.g., true/false, agree/disagree)
2.2. Constructed-response format – requires the test taker to supply or create the correct answer, not merely select it
2.2.1. Completion item (short answer or essay) – supply a word or phrase to complete a sentence, or compose a longer written response
3. Writing items for computer administration
Item bank – a collection of test items available for use in constructing tests (a minimal storage sketch follows)
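
As an illustration of the vocabulary above, the sketch below shows one hypothetical way to represent a multiple-choice item (stem, correct option, distractors) inside an item bank; the class and field names are our own, not a standard API.

```python
from dataclasses import dataclass, field

# Hypothetical representation of a multiple-choice item and an item bank.
# Field names and the sample item are illustrative only.

@dataclass
class MultipleChoiceItem:
    stem: str          # the question or incomplete statement
    key: str           # the correct option
    distractors: list  # the incorrect options

@dataclass
class ItemBank:
    items: list = field(default_factory=list)  # the collection of test items

    def add(self, item: MultipleChoiceItem) -> None:
        self.items.append(item)

bank = ItemBank()
bank.add(MultipleChoiceItem(
    stem="A Guttman scale orders items from...",
    key="weaker to stronger expressions of the attribute",
    distractors=["nominal to ratio level", "shortest to longest"],
))
print(len(bank.items))  # 1
```
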

3. Test Tryout
Gather participants to try out your draft test. There are several rules of thumb for the number of test takers at the tryout stage. One suggestion is that there should be between 5 and 10 test takers per item; this means that if you have 20 items, you may have to gather 100 to 200 test takers for your initial test tryout. Another suggestion is to gather at least 50 test takers per factor. Hence, if your test provides just one score (unidimensional), you may gather 50 participants; if it provides two scores, gather 100 participants, and so on. (Both rules are restated in the sketch below.)
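
The sketch below simply restates the two rules of thumb; the function names are illustrative.

```python
# Restating the two tryout sample-size rules of thumb from the notes;
# the function names are just illustrations.

def per_item_rule(n_items):
    """5 to 10 test takers per item: returns (low, high)."""
    return 5 * n_items, 10 * n_items

def per_factor_rule(n_factors):
    """At least 50 test takers per factor/score."""
    return 50 * n_factors

print(per_item_rule(20))   # (100, 200) for a 20-item test
print(per_factor_rule(2))  # 100 for a two-factor test
```
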

4. Item Analysis
1. Item difficulty index (or item endorsement index) – the proportion of the total number of test takers who answered the item correctly or who endorsed the item
2. Item reliability index – indicates the internal consistency of the test
2.1. Exploratory factor analysis
2.2. Confirmatory factor analysis
2.3. Inter-item consistency – Cronbach's alpha
3. Item validity index – indicates the degree to which an item measures what the test aims/purports to measure
4. Item discrimination index – indicates how adequately an item separates or discriminates between high scorers and low scorers on the entire test; it is the difference between the proportion of high scorers answering the item correctly (or endorsing it) and the proportion of low scorers doing so (see the computation sketch after this list)
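
Here is a minimal sketch of how three of these indices might be computed on a toy matrix of 0/1 (incorrect/correct) responses; the data and the median split used for the discrimination index are illustrative choices (a classic alternative contrasts the top and bottom 27% of scorers).

```python
# Toy item-analysis sketch: difficulty (p), discrimination (d), and
# Cronbach's alpha on a small 0/1 (incorrect/correct) response matrix.
# The data and the median split used for d are illustrative choices.

scores = [  # rows = test takers, columns = items
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
n_people, n_items = len(scores), len(scores[0])
totals = [sum(row) for row in scores]

# 1. Item difficulty index: proportion answering each item correctly.
p = [sum(row[j] for row in scores) / n_people for j in range(n_items)]

# 4. Item discrimination index: p among high scorers minus p among low
# scorers (here a simple median split on total score).
order = sorted(range(n_people), key=lambda i: totals[i])
half = n_people // 2
low, high = order[:half], order[-half:]
d = [sum(scores[i][j] for i in high) / half - sum(scores[i][j] for i in low) / half
     for j in range(n_items)]

# 2.3. Cronbach's alpha: (k / (k - 1)) * (1 - sum of item variances /
# variance of total scores), using population variances.
def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

item_vars = [var([row[j] for row in scores]) for j in range(n_items)]
alpha = (n_items / (n_items - 1)) * (1 - sum(item_vars) / var(totals))

print([round(x, 2) for x in p])  # [0.67, 0.67, 0.5, 0.33]
print([round(x, 2) for x in d])  # [0.67, 0.0, 1.0, 0.67]
print(round(alpha, 2))           # 0.66
```
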
5. Test Revision
Some items may be removed and others rewritten. Existing tests are regularly revised in response to needed improvements identified through experience and insight in using the test. Note that your test is an ongoing draft: it may go through the five stages of test development again at any point when needed.

Classical Test Theory versus Item Response Theory


Reference: Streiner, D. L. (2010). Measure for measure: New developments in measurement
and item response theory. The Canadian Journal of Psychiatry, 55(3), 180-186.
https://journals.sagepub.com/doi/pdf/10.1177/070674371005500310
CTT
In practical terms, CTT assumes that the more items there are in a scale, the less random error will be associated with the total score. Hence, tests developed under CTT tend to be valid and reliable but relatively long.
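
One standard way to quantify this length-reliability link is the Spearman-Brown prophecy formula, rho_new = k * rho / (1 + (k - 1) * rho); the sketch below is our illustration (not taken from Streiner), with a hypothetical starting reliability.

```python
# CTT link between test length and reliability via the Spearman-Brown
# prophecy formula; the starting reliability of 0.60 is hypothetical.

def spearman_brown(rho, k):
    """Projected reliability when a test is lengthened by factor k
    with comparable items."""
    return k * rho / (1 + (k - 1) * rho)

rho = 0.60
for k in (1, 2, 3, 4):
    print(k, round(spearman_brown(rho, k), 2))
# 1 0.6, 2 0.75, 3 0.82, 4 0.86 -> more items, less random error
```
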
IRT
IRT assumes that the scale is unidimensional (that is, it measures only one trait or attribute) and that, at any given level of the trait, the probability of endorsing one item is unrelated to the probability of endorsing any other item (a property called local independence). Because items can then be treated independently, a relatively short test can be achieved.
Nowadays, revisions of standardized tests typically employ IRT. However, because of IRT's more complex methods and computation, most developers still employ CTT in test construction. It is ideal to become familiar with how CTT works first and then progress to IRT.
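
To make IRT's item-level focus concrete, here is a sketch of a two-parameter logistic (2PL) item characteristic curve, P(theta) = 1 / (1 + exp(-a * (theta - b))); the model choice and the parameter values are hypothetical illustrations.

```python
import math

# Sketch of a two-parameter logistic (2PL) IRT item: the probability of
# answering correctly (or endorsing the item) at trait level theta.
# The parameter values below are hypothetical.

def p_correct(theta, a, b):
    """a = discrimination; b = difficulty (trait level where P = 0.5)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# An item of moderate difficulty (b = 0) and good discrimination (a = 1.5):
for theta in (-2, -1, 0, 1, 2):
    print(theta, round(p_correct(theta, a=1.5, b=0.0), 2))
# P rises with theta: 0.05, 0.18, 0.5, 0.82, 0.95
```
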
