
By Hui Bian Office for Faculty Excellence

Email: [email protected] Phone: 328-5428 Location: 2307 Old Cafeteria Complex (east campus)

A new instrument is needed when reliable and valid instruments are not available to measure a particular construct of interest.

You should know

The reliability of the outcomes depends on the soundness of the measures. Validity is the ultimate goal of all instrument construction.

Step 1: Determine what you want to measure
Step 2: Generate an item pool
Step 3: Determine the format for items
Step 4: Expert review of the initial item pool
Step 5: Add social desirability items
Step 6: Pilot testing and item analysis
Step 7: Administer the instrument to a larger sample
Step 8: Evaluate the items
Step 9: Revise the instrument

DeVellis (2003); Fishman & Galguera (2003); Pett, Lackey, & Sullivan (2003)
5

Standards for Educational and Psychological Testing (1999), developed jointly by:

American Educational Research Association (AERA)
American Psychological Association (APA)
National Council on Measurement in Education (NCME)

Reliability: the consistency or stability of the scores measured by the instrument over time.
Measurement error: the more error, the less reliable the instrument.
Systematic error: recurs consistently on repeated measures with the same instrument; reflects problems with the underlying construct (measuring a different construct, which affects validity).
Random error: inconsistent and not predictable; sources include environmental factors and administration variations.
7

Internal consistency

Homogeneity of items within a scale
Items share a common cause (the latent variable)
Higher inter-item correlations suggest that the items are all measuring the same thing.

Measures of internal consistency

Cronbach's alpha (a computational sketch follows below)
Kuder-Richardson formula 20 (KR-20) for dichotomous items
Reliability analysis in SPSS (Cronbach's alpha): data can be dichotomous, ordinal, or interval, but should be coded numerically.
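The following is a minimal sketch of how Cronbach's alpha could be computed from a respondents-by-items score matrix, using the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The slides use SPSS; this Python/numpy version and the sample data are illustrative only.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items matrix of numeric scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total score
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Illustrative responses: 5 respondents x 3 four-point Likert items
scores = [[1, 2, 1],
          [2, 2, 3],
          [3, 4, 3],
          [2, 3, 2],
          [4, 4, 4]]
print(round(cronbach_alpha(scores), 3))
```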

Split-half reliability

Compare the first half of the items to the second half
Compare the odd-numbered items with the even-numbered items (see the sketch below)
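A hedged sketch of the odd-even split-half approach: sum the odd-numbered and even-numbered items separately, correlate the two half-scores, and apply the Spearman-Brown correction (2r / (1 + r)) to estimate full-length reliability. The Spearman-Brown step and the names are my additions, not part of the slides.

```python
import numpy as np

def split_half_reliability(items):
    """Odd-even split-half reliability with the Spearman-Brown correction."""
    items = np.asarray(items, dtype=float)
    odd = items[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
    even = items[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r = np.corrcoef(odd, even)[0, 1]    # correlation between the two halves
    return 2 * r / (1 + r)              # Spearman-Brown corrected estimate
```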

Test-retest reliability (temporal stability)


Administer the same set of items to subjects on two separate occasions.

10

Strength of correlation

.00-.29: weak
.30-.49: low
.50-.69: moderate
.70-.89: strong
.90-1.00: very strong

Pett, Lackey, & Sullivan (2003)
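As an illustration of both test-retest reliability and the correlation-strength guidelines above, the sketch below correlates total scores from two administrations and labels the result with the Pett, Lackey, & Sullivan (2003) cutoffs. The data are invented for demonstration.

```python
import numpy as np

def correlation_strength(r):
    """Label |r| using the Pett, Lackey, & Sullivan (2003) cutoffs above."""
    r = abs(r)
    if r < 0.30:
        return "weak"
    if r < 0.50:
        return "low"
    if r < 0.70:
        return "moderate"
    if r < 0.90:
        return "strong"
    return "very strong"

# Illustrative total scores from two administrations of the same instrument
time1 = np.array([12, 15, 9, 20, 17, 11])
time2 = np.array([13, 14, 10, 19, 18, 12])
r = np.corrcoef(time1, time2)[0, 1]   # test-retest (temporal stability) estimate
print(round(r, 2), correlation_strength(r))
```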

11

Definition: the instrument truly measures what it is supposed to measure.
Validation is the process of developing a valid instrument and assembling validity evidence to support the statement that the instrument is valid.
Validation is an ongoing process, and validity evolves during this process.
12

Evidence based on test content
Test content refers to the themes, wording, and format of the items, as well as the guidelines and procedures for administration.

Evidence based on response processes
Consider the target subjects: for example, whether the item format favors one subgroup over another. In other words, something irrelevant to the construct may be differentially influencing the performance of different subgroups.

13

Evidence based on internal structure


The degree to which the relationships among instrument items and components conform to the construct on which the proposed relationships are based.

Evidence based on relationships to other variables


Relationships of test scores to variables external to the test.
14

It is critical to establish accurate and comprehensive content for an instrument. Selection of content is based on sound theories and empirical evidence or previous research. A content analysis is recommended.
Content analysis is the process of analyzing the structure and content of the instrument. It has two stages: a development stage and an appraisal stage.
15

Instrument specification

Content of the instrument
Number of items
The item formats
The desired psychometric properties of the items
Item and section arrangement (layout)
Time needed to complete the survey
Directions to the subjects
Procedure for administering the survey
16

Content evaluation (Guion, 1977)

The content domain must have a generally accepted meaning.
The content domain must be defined unambiguously.
The content domain must be relevant to the purpose of measurement.
Qualified judges must agree that the domain has been adequately sampled.
The response content must be reliably observed and evaluated.
17

Content evaluation

Clarity of statements
Relevance
Coherence
Representativeness

18

Documentation of the item development procedure
Item analysis: item performance
Item difficulty
Item discrimination
Item reliability

19

The scale is required to relate to a criterion or gold standard. Collect data using both the newly developed instrument and the criterion measure.

20

In order to demonstrate construct validity, developers should provide evidence that the test measures what it is supposed to measure. Construct validation requires the compilation of multiple sources of evidence.
Content validity
Item performance
Criterion-related validity
21

Validity studies should address both the internal structure of the test and external relations of the test to other variables.
Internal structure: subdomains or subconstructs
External relations: relationships between test measures and other constructs or variables

22

Construct-irrelevant variance

A form of systematic error
May increase or decrease test scores

y = t + e1 + e2, where y is the observed score, t is the true score, e1 is random error (which affects reliability), and e2 is systematic error (which affects validity).
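A small simulation of the y = t + e1 + e2 model may make the distinction concrete: random error adds noise around the true scores (lowering reliability), while systematic error shifts every observed score by a constant bias (distorting what is measured). The distributions and the bias of 4 points are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
t = rng.normal(50, 10, n)     # true scores
e1 = rng.normal(0, 5, n)      # random error: unpredictable, affects reliability
e2 = 4.0                      # systematic error: constant bias, affects validity
y = t + e1 + e2               # observed scores

print("mean shift due to systematic error:", round(float(y.mean() - t.mean()), 2))
print("correlation of observed with true scores:", round(float(np.corrcoef(y, t)[0, 1]), 2))
```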

23

Construct underrepresentation
It concerns fidelity: whether the instrument covers all dimensions of the content being studied.
Item formats may play a role in construct underrepresentation, for example, the relationship between gender and certain types of item format.

24

What will the instrument measure?
Will the instrument measure the construct broadly or specifically (for example, self-efficacy in general or self-efficacy for avoiding drinking)?
Do all the items tap the same construct or different ones?
Use sound theories as a guide.
These decisions are related to content validity issues.
25

Item generation is also related to content validity.
Choose items that truly reflect the underlying construct.
Borrow or modify items from existing instruments that are already valid and reliable.
Allow redundancy: include more items at this stage than in the final scale. A 10-item scale might evolve from a 40-item pool.

26

Writing new items


Wording: clear and inoffensive
Avoid lengthy items
Consideration of reading difficulty level
Avoid items that convey two or more ideas
Be careful of positively and negatively worded items

27

Items include two parts: a stem and a series of response options.

Number of response options
The scale should discriminate differences in the underlying attribute.
Consider respondents' ability to discriminate meaningfully between options.
Examples: "some" vs. "few"; "somewhat" vs. "not very"
28

Number of response options


Equivocation: offering a neutral response option

Types of response format


Likert scale
Binary options
Selected-response format (multiple-choice format)

29

Components of the instrument
Format (font, font size)
Layout (how many pages)
Instructions to the subjects
Wording of the items
Response options
Number of items
30

The purpose of expert review is to maximize content validity.
The panel of experts consists of people who are knowledgeable in the content area.
Item evaluation:
How relevant is each item to what you intend to measure?
Each item's clarity and conciseness
Missing content
The final decision to accept or reject expert recommendations is the developer's responsibility.
31

Social desirability is the tendency of subjects to respond to test items in a way that presents themselves in socially acceptable terms in order to gain the approval of others.
Individual items can be influenced by social desirability.
A 10-item measure by Strahan and Gerbasi (1972) can be used to assess it.
32

Do those selected items cover the subject completely?
How many items should there be?
How many subjects do we need to pilot test this instrument?

33

Also called tryout or field test Purposes


Identify, remove, or revise bad items
Uncover any problems related to item content and format
Check the data collection procedure
Prevent researchers from spending valuable resources on a study that uses an instrument that is not valid or reliable
Determine the amount of time needed to complete the instrument

34

Sample size: about one tenth the size of the sample for the major study.
People who participate in the pilot test cannot be in the final study.

35

Item analysis is about item performance.
It addresses reliability and validity concerns at the item level.
It serves as a means of detecting flawed items.
It helps select items to be included in the test or identify items that need to be revised.
Item response theory can be used to evaluate items.
Item selection needs to consider content, process, and item format in addition to item statistics.
36

Item response theory (IRT)


Focus on individual items and their characteristics.
Reliability is enhanced not by redundancy but by identifying better items.
IRT items are designed to tap different degrees or levels of the attribute.
The goal of IRT is to establish item characteristics independent of who completes them.
37

IRT concentrates on two aspects of an item's performance:

Item difficulty: how hard the item is.
Item discrimination: its capacity to discriminate among respondents at different levels of the attribute.
A less discriminating item has a larger region of ambiguity.
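The slides do not name a specific IRT model; one common choice is the two-parameter logistic (2PL) model, in which the difficulty parameter b shifts the item characteristic curve along the ability scale and the discrimination parameter a controls its steepness. A minimal sketch, with parameter values chosen only for illustration:

```python
import numpy as np

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: P(correct response | ability theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)                  # range of ability levels
easy_sharp = icc_2pl(theta, a=2.0, b=-1.0)     # easier, highly discriminating item
hard_flat = icc_2pl(theta, a=0.5, b=1.0)       # harder, less discriminating item
print(np.round(easy_sharp, 2))
print(np.round(hard_flat, 2))
```

The flatter curve of the low-discrimination item corresponds to the larger "region of ambiguity" mentioned above.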

38

Knowing the difficulty of the items helps avoid making a test too hard or too easy.

The optimal distribution of item difficulty is approximately normal.

For a dichotomous item (correct/wrong), difficulty is the rate of wrong answers: if 90 students out of 100 answer correctly, item difficulty = 10%.

For items with more than two categories, the item mean is used (see the table on the next slide).
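Following the slide's definitions, difficulty for a dichotomous item is the rate of wrong answers, and for items with more than two categories the item mean serves the same role. A small sketch (function names are my own):

```python
import numpy as np

def difficulty_dichotomous(responses):
    """Rate of wrong answers for a dichotomous item (1 = correct, 0 = wrong)."""
    return 1.0 - np.asarray(responses, dtype=float).mean()

def difficulty_polytomous(responses):
    """Item mean on the response scale (e.g., a 1-4 Likert item)."""
    return np.asarray(responses, dtype=float).mean()

# 90 of 100 students answer correctly -> difficulty = 0.10
print(difficulty_dichotomous([1] * 90 + [0] * 10))
```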


39

Item      Mean
a59_9     2.04
a59_10    1.77
a59_11    1.93
a59_12    1.95
a59_13    1.60
a59_14    1.58
a59_15    1.61
a59_16    1.87
a59_17    2.75
a59_30    1.63

Four-point scale: 1 = Strongly agree, 2 = Agree, 3 = Disagree, 4 = Strongly disagree.
Less difficult: means closer to 1 (Strongly agree). More difficult: means closer to 4 (Strongly disagree).
40

[Figure: distribution of item difficulty (item means) on a four-point scale]


41

Instrument reliability if an item is deleted

If deleting an item increases the overall reliability, that item is a poor item.
This statistic can be obtained from Reliability Analysis in SPSS.
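SPSS reports this as "Cronbach's Alpha if Item Deleted". The sketch below reproduces the idea by recomputing alpha with each item left out in turn; it repeats the small alpha function from the earlier internal-consistency sketch so that it runs on its own.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items matrix of numeric scores."""
    k = items.shape[1]
    return (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                            / items.sum(axis=1).var(ddof=1))

def alpha_if_item_deleted(items):
    """Alpha recomputed with each item removed; it rises when a poor item is dropped."""
    items = np.asarray(items, dtype=float)
    return [cronbach_alpha(np.delete(items, j, axis=1))   # drop column j
            for j in range(items.shape[1])]
```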

42

43

Item validity
Correlation of each item's responses with the total test score minus the score for the item in question.

44

Item validity

A bell-shaped distribution of these correlations, with its mean as high as possible, is desirable.
A higher correlation for an item means that people with higher total scores also tend to get higher scores on that item.
Items with low correlations need further examination.
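A hedged sketch of the corrected item-total correlation described above: each item is correlated with the total score computed from all the other items. Names are illustrative.

```python
import numpy as np

def corrected_item_total(items):
    """Correlation of each item with the total score excluding that item."""
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    return [float(np.corrcoef(items[:, j], total - items[:, j])[0, 1])
            for j in range(items.shape[1])]
```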

45

Cronbach's alpha guidelines, DeVellis (2003):

Below .60: unacceptable
Between .60 and .65: undesirable
Between .65 and .70: minimally acceptable
Between .70 and .80: respectable
Between .80 and .90: very good
Above .90: consider shortening the scale

46

Sample size: no golden rules

10-15 subjects per item (a rule-of-thumb sketch follows below)
300 cases is adequate
50: very poor
100: poor
200: fair
300: good
500: very good
1,000 or more: excellent
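The rules of thumb above can be turned into a small planning helper; the function names and the mapping simply restate the guidelines listed on the slide and are not an established formula.

```python
def required_sample(n_items, subjects_per_item=10):
    """Rule of thumb from the slide: 10-15 subjects per item."""
    return n_items * subjects_per_item

def sample_adequacy(n):
    """Adequacy labels listed on the slide."""
    if n >= 1000:
        return "excellent"
    if n >= 500:
        return "very good"
    if n >= 300:
        return "good"
    if n >= 200:
        return "fair"
    if n >= 100:
        return "poor"
    return "very poor"

print(required_sample(40), sample_adequacy(400))   # e.g., a 40-item pool
```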
47

Administration threats to validity


Construct underrepresentation
Construct-irrelevant variance

Efforts to avoid those threats


Standardization
Administrator training

48

Item analysis: item performance
Factor analysis


Purposes
Determine how many latent variables underlie a set of items.
Label the identified factors to help understand the meaning of the underlying latent variables.

49

Factor analysis
Exploratory factor analysis: to explore the structure of a construct.
Confirmatory factor analysis: to confirm the structure obtained from exploratory factor analysis.
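A minimal exploratory-factor-analysis sketch using scikit-learn's FactorAnalysis; the slides themselves use SPSS, so the library, the simulated data, and the choice of two factors are assumptions made only to show the shape of the workflow. Confirmatory factor analysis would normally be done in a dedicated SEM package instead.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulated respondents-by-items data driven by two hypothetical latent variables
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
loadings = rng.uniform(0.5, 1.0, size=(2, 8))
X = latent @ loadings + rng.normal(scale=0.5, size=(200, 8))

efa = FactorAnalysis(n_components=2, random_state=0).fit(X)
print(np.round(efa.components_, 2))   # loadings of the 8 items on the 2 factors
```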

50

Effects of dropping items


Reliability
Construct underrepresentation
Construct-irrelevant variance

51

DeVellis, R. F. (2003). Scale development: Theory and application (2nd ed.). Thousand Oaks, CA: Sage Publications, Inc.
Downing, S. M., & Haladyna, T. M. (2006). Handbook of test development. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.
Fishman, J. A., & Galguera, T. (2003). Introduction to test construction in the social and behavioral sciences: A practical guide. Lanham, MD: Rowman & Littlefield Publishers, Inc.
Pett, M. A., Lackey, N. R., & Sullivan, J. J. (2003). Making sense of factor analysis: The use of factor analysis for instrument development in health care research. Thousand Oaks, CA: Sage Publications, Inc.

52

53
