
Concept of Evaluation

According to Kizlik (2011), evaluation is the most complex and the least understood of these terms.
Hopkins and Antes (1990) defined evaluation as a continuous inspection of all available
information in order to form a valid judgment of students' learning and/or the
effectiveness of an education program.
Evaluation is based on two philosophies. The first, traditional philosophy holds that the
ability to learn is randomly distributed in the general population. This gave birth to
norm-referenced measurement of intellectual abilities. In norm-referenced
measurement, an individual's score is interpreted by comparing it to those
of a defined group, often called the normative group. The comparison is relative
rather than absolute.
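To make the comparison concrete, the logic of norm-referenced interpretation can be sketched in a few lines of code. This is an illustrative sketch only; the function names and scores below are hypothetical, not taken from any cited source. A raw score is located relative to the normative group as a z-score or a percentile rank.

from statistics import mean, stdev

def z_score(raw_score, norm_group):
    # Position of the score relative to the normative group,
    # in standard-deviation units (a relative, not absolute, standing).
    return (raw_score - mean(norm_group)) / stdev(norm_group)

def percentile_rank(raw_score, norm_group):
    # Percentage of the normative group scoring below the raw score.
    return 100 * sum(s < raw_score for s in norm_group) / len(norm_group)

norm_group = [45, 52, 58, 60, 63, 67, 70, 74, 78, 85]   # hypothetical scores
print(round(z_score(72, norm_group), 2))    # positive = above the group mean
print(percentile_rank(72, norm_group))      # 70.0 = above 70% of the group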

The new philosophy of measurement is based on democratic values and
gives importance to the environment. It is based on the universalisation of
education. It assumes that if education is to be universal, the responsibility of
the teacher is to help as many students as possible to learn. Thus we see that
the two philosophies of evaluation are based on different concepts of human
potentialities and their development. One believes that human abilities are not
evenly distributed in the population and that the achievement of individual learners
differs greatly, whereas the other believes that all learners can attain mastery of
the learning task irrespective of individual differences among them.

William Wiersma and Stephen G. Jurs (1990) remark that evaluation is a
process that includes measurement and possibly testing, but it also contains the
notion of value judgement. If a teacher administers a test to a class and computes
the percentage of correct responses, measurement and testing have
taken place. The scores must then be interpreted, which may mean
converting them to values such as As, Bs, Cs and so on, or judging them to be
excellent, good, fair or poor. This process is called evaluation.

The central idea in evaluation is "value." When we evaluate a variable, we
are basically judging its worthiness, appropriateness and goodness. In the teaching-learning
process, teachers make evaluations of students that are usually done in the
context of comparisons between what was intended (learning, progress,
behaviour) and what was obtained.

Evaluation is a much more comprehensive term than measurement and
assessment. It includes both quantitative and qualitative descriptions of students'
performance.

Concept of Measurement

According to Carol Ann Tomlinson, "Assessment is today's means of modifying
tomorrow's instruction." The term "educational measurement" refers to any
device for testing, scaling, and appraising the outcomes of the educational process. It
includes the administration and scoring of tests, validation and standardization, and the
application of statistical techniques in the interpretation of test results.

Measurement is the process of quantifying the degree to which someone
or something possesses a given trait, i.e. a quality, characteristic or feature.
Measurement permits more objective description of traits and facilitates
comparisons. Thus, instead of saying that Aslam is underweight for his age and
height, we can say that Aslam is 18 years old, 5 feet 8 inches tall, and weighs only 85
pounds. Further, instead of saying that Aslam is more intelligent than Ali, we can
say that Aslam has a measured IQ of 125 and Ali has a measured IQ of 88. In
each case, the numerical statement is more precise and more objective than the
corresponding verbal statement.

Classroom Assessment

Hamidi (2010) developed a framework to answer the why, what, who, how and
when of assessment. This is helpful in understanding the true nature of this concept.

Why to Assess: Teachers have clear goals for instruction, and they assess
to ensure that these goals have been or are being met. If objectives are the
destination and instruction is the path to it, then assessment is a tool to keep the
efforts on track and to ensure that the path is right.

What to Assess: Teachers cannot assess whatever they themselves like. In
classroom assessment, teachers are supposed to assess students' current
abilities in a given skill or task.

Who to Assess: Teachers should treat students as 'real learners', not as course
or unit coverers. They should also recognize that some are quick at learning and
some are slow at it. Therefore, classroom assessment calls for a prior realistic
appraisal of the performance of individuals.

How to Assess: Teachers employ different instruments, formal or informal, to
assess their students. Brown and Hudson (1998) reported that teachers use three
sorts of assessment methods: selected-response assessments, constructed-response
assessments, and personal-response assessments. They can adjust
the assessment types to what they are going to assess.

When to Assess: There is strong agreement among educationists that assessment
is interwoven with instruction. Teachers continue to assess students' learning
throughout the process of teaching. They particularly do formal assessments
when they are going to make instructional decisions at the formative and
summative levels, even if those decisions are small. For example, they assess
when there is a change in the content or when there is a shift in pedagogy.

Types of Assessment
Based upon the functions that it performs, assessment is generally divided into
three types: assessment for learning, assessment of learning and assessment as
learning.
a. Assessment for Learning (Formative Assessment)
Assessment for learning is a continuous, ongoing assessment that allows teachers to
monitor students on a day-to-day basis and to modify their teaching based on what the
students need to be successful. This assessment provides students with the timely,
specific feedback that they need to enhance their learning.

The role of assessment for learning in the instructional process can best be understood
with the help of the following diagram.

b. Assessment of Learning (Summative Assessment)

Summative assessment, or assessment of learning, is used to evaluate students'
achievement at some point in time, generally at the end of a course. The purpose of this
assessment is to help the teacher, students and parents know how well a student has
completed the learning task.

c. Assessment as Learning
Assessment as learning means using assessment to develop and support students'
metacognitive skills. This form of assessment is crucial in helping students become lifelong
learners. Students develop a sense of efficacy and critical thinking when they use teacher,
peer and self-assessment feedback to make adjustments, improvements and changes to what
they understand.

Characteristics of Classroom Assessment


1. Effective assessment of student learning begins with educational goals.

2. Assessment works best when it has clear, explicitly stated purposes.

3. Assessment works best when it is ongoing, not episodic.

4. Assessment is effective when representatives from across the
educational community are involved.

5. Assessment makes a difference when it begins with issues of use
and illuminates questions that people really care about.

6. Through effective assessment, educators meet their responsibilities to
students and to the public.

Differences among Measurement, Assessment and Evaluation


The terms measurement, assessment, and evaluation are usually confused
with each other, and most people use them interchangeably. Yet each of these
terms has a specific meaning sharply distinguished from the others.

Measurement: In general, the term measurement is used to determine the
attributes or dimensions of an object. For example, we measure an object to
know how big, tall or heavy it is. In an educational perspective, measurement
refers to the process of obtaining a numerical description of a student's
progress towards a pre-determined goal. This process provides
information about how much a student has learnt. Measurement provides a
quantitative description of the student's performance (for example, Rafaih
solved 23 arithmetic problems out of 40), but it does not include the
qualitative aspect (for example, Rafaih's work was neat).

Testing: A test is an instrument or a systematic procedure to measure a particular
characteristic. For example, a test of mathematics will measure the level of the
learners' knowledge of this particular subject or field.

Assessment: Assessment is a broad term that includes testing. For example,
a teacher may assess the knowledge of English language through a test and
assess the language proficiency of the students through some other
instrument, for example an oral quiz or a presentation. Based upon this view, we
can say that every test is an assessment, but not every assessment is a test.

The term 'assessment' is derived from the Latin word 'assidere', which means
'to sit beside'. In contrast to testing, the tone of the term assessment is non-threatening,
indicating a partnership based on mutual trust and
understanding. This emphasizes that there should be a positive rather than a
negative association between assessment and the process of teaching and
learning in schools. In the broadest sense, assessment is concerned with
children's progress and achievement.

In a comprehensive and specific way, classroom assessment may be defined as:

the process of gathering, recording, interpreting, using and communicating
information about a child's progress and achievement during the development of
knowledge, concepts, skills and attitudes. (NCCA, 2004)
In short, we can say that assessment entails much more than testing. It is an ongoing
process that includes many formal and informal activities designed to monitor and
improve teaching and learning.
Evaluation: According to Kizlik (2011), evaluation is the most complex
and the least understood of these terms. Hopkins and Antes (1990) defined
evaluation as a continuous inspection of all available information in
order to form a valid judgment of students' learning and/or the
effectiveness of an education program.
The central idea in evaluation is "value." When we evaluate a
variable, we are basically judging its worthiness, appropriateness and
goodness. Evaluation is always done against a standard, objective or
criterion. In the teaching-learning process, teachers make evaluations
of students that are usually done in the context of comparisons between
what was intended (learning, progress, behaviour) and what was
obtained.
Evaluation is a much more comprehensive term than measurement
and assessment. It includes both quantitative and qualitative
descriptions of students' performance, and it always provides a value
judgment regarding the desirability of the performance, for example,
very good, good, etc.

Approaches to Evaluation
1. Formative Evaluation

According to Gronlund, "Formative evaluation is used to monitor learning
progress during instruction and to provide continuous feedback to both pupil and
teacher concerning learning successes and failures. Feedback to pupils
reinforces successful learning and identifies the learning errors that need
correction. Feedback to the teacher provides information for modifying instruction
and for prescribing group and individual remedial work."

A short-term objective of formative evaluation may be to help students pass the
end-of-year promotional examination or, in the long term, the school certificate
examination.

Formative evaluation attempts:

i) to identify the content (i.e. knowledge or skills) which has not been
mastered by the student;

ii) to appraise the level of cognitive abilities such as memorization,
classification, comparison, analysis, explanation, quantification, application
and so on;

iii) to specify the relationships between content and levels of cognitive abilities
in the instructional context.

Characteristics of Formative Evaluation:

1. It relatively focuses on molecular analysis.

2. It is interested in the broader experiences of the programme users.

3. It tends to ignore the local effects of a particular programme.

4. It seeks to identify influential variables.

2. Summative Evaluation
Summative evaluation is primarily concerned with the purposes, progress and
outcomes of the teaching-learning process. It attempts as far as possible to
determine to what extent the broad objectives of a programme have been
achieved. It is based on the following assumptions:

1. that the programme's objectives are achievable;

2. that the teaching-learning process has been conducted efficiently;

3. that the teacher-student-material interactions have been conducive to
learning;

4. that there is uniformity in classroom conditions for all learners.

Unlike formative evaluation, which is guidance-oriented, summative
evaluation is judgmental in nature. Promotion examinations, the first school leaving
certificate examination and the public examinations belong to this form of evaluation.
Students' performance in such examinations determines to a large extent their job
career or prospects of further education.

Broad Differences between Formative and Summative Evaluation

Characteristic    Formative                          Summative
Purpose           To monitor and promote the         To check the final
                  progress of the student by         status of students
                  getting feedback
Content focus     Narrow scope                       General/broad scope
Methods           Daily assignments,                 Tests, projects
                  observations
Frequency         Daily                              Weekly, quarterly, etc.

Selection Type Items (Objective Type)

There are four types of test items in the selection category of tests which are in common
use today. They are multiple-choice, matching, true-false, and completion items.
1. Multiple Choice Questions
According to Gronlund, the multiple choice question is probably the most popular as well
as the most widely applicable and effective type of objective test item. It can be used
effectively for any level of course outcome. It consists of two parts: the stem, which
states the problem, and a list of three to five alternatives, one of which is the correct
(key) answer while the others are distracters (incorrect options that draw the less
knowledgeable pupil away from the correct response). Multiple choice questions
consist of three obligatory parts:
1. The question ("body of the question")
2. The correct answer ("the key of the question")
3. Several incorrect alternatives (the so-called "distracters")
and one optional part (especially valuable in self-assessment):
4. A feedback comment on the student's answer.
The stem may be stated as a direct question or as an incomplete statement. For
example:
Direct question
Which is the capital city of Pakistan? (Stem)
A. Paris. --------------------------------------- (Distracter)
B. Lisbon. -------------------------------------- (Distracter)
C. Islamabad. ---------------------------------- (Key)
D. Rome. --------------------------------------- (Distracter)

Incomplete Statement
The capital city of Pakistan is
A. Paris.
B. Lisbon.
C. Islamabad.
D. Rome.
Students can generally respond to these types of questions quite quickly. As a result,
they are often used to test students' knowledge of a broad range of content. Creating
these questions can be time consuming because it is often difficult to generate several
plausible distracters. However, they can be marked very quickly.

Multiple choice questions are good for:

 Application, synthesis, analysis, and evaluation levels
RULES FOR WRITING MULTIPLE-CHOICE QUESTIONS
There are several rules we can follow to improve the quality of this type of written
examination.
1. Examine only the Important Facts!
Make sure that every question examines only the important knowledge. Avoid detailed
questions - each question has to be relevant for the previously set instructional goals of
the course.
2. Use Simple Language!
Use simple language, taking care of spelling and grammar. Spelling and grammar
mistakes (unless you are testing spelling or grammar) only confuse students.
3. Make the Questions Brief and Clear!
Clear the text of the body of the question from all superfluous words and irrelevant
content. It helps students to understand exactly what is expected of them.
4. Form the Questions Correctly!
Be careful that the formulation of the question does not (indirectly) hide the key to the
correct answer. Students will be able to recognize it easily and find the right answer
because of the word combination, grammar, etc., and not because of their real
knowledge.
5. Take into Consideration the Independence of Questions!
Be careful not to repeat content and terms related to the same theme, since the answer
to one question can become the key to solve another.
6. Offer Uniform Answers!
All offered answers should be uniform, clear and realistic. For example, an implausible
answer, or uneven text length among the different answers, can point to the
right answer. Such a question does not test real knowledge. The position of the key
should be random. If the answers are numbers, they should be listed in ascending
order.
7. Avoid Asking Negative Questions!
If you use negative questions, negation must be emphasized by using CAPITAL letters,
e.g. "Which of the following IS NOT correct..." or "All of the following statements are
true, EXCEPT...".
8. Avoid Distracters in the Form of "All the Answers are Correct" or "None of
the Answers is Correct"!
Teachers use these statements most frequently when they run out of ideas for
distracters. Students, knowing what is behind such questions, are rarely misled by them.
Therefore, if you do use such statements, sometimes use them as the key answer.
Furthermore, if a student recognizes that there are two correct answers (out of 5
options), they will be able to conclude that the key answer is the statement "all the
answers are correct" without knowing the accuracy of the other distracters.
9. Distracters must be Significantly Different from the Right Answer (key)!
Distracters which only slightly differ from the key answer are bad distracters. Good or
strong distracters are statements which themselves seem correct, but are not the
correct answer to a particular question.

10. Offer an Appropriate Number of Distracters!

The greater the number of distracters, the smaller the possibility that a student could
guess the right answer (key). In higher education tests, questions with 5 answers are
used most often (1 key + 4 distracters). That means a student has a 20% chance of
guessing the right answer.

Advantages:
Multiple-choice test items are not a panacea. They have advantages and disadvantages
just as any other type of test item. Teachers need to be aware of these
characteristics in order to use multiple-choice items effectively.
Versatility
Multiple-choice test items are appropriate for use in many different subject-matter areas,
and can be used to measure a great variety of educational objectives. They are
adaptable to various levels of learning outcomes, from simple recall of knowledge to
more complex levels, such as the student’s ability to:
• Analyze phenomena
• Apply principles to new situations
• Comprehend concepts and principles
• Discriminate between fact and opinion
• Interpret cause-and-effect relationships
• Interpret charts and graphs
• Judge the relevance of information
• Make inferences from given data
• Solve problems
Validity
A student is able to answer many multiple-choice items in a relatively short time. This
feature enables the teacher using multiple-choice items to test a broader sample of
course content in a given amount of testing time. Consequently, the test scores will
likely be more representative of the students' overall achievement in the course.
Reliability
Well-written multiple-choice test items compare favourably with other test item types on
the issue of reliability. They are less susceptible to guessing than are true-false test
items, and therefore capable of producing more reliable scores.
Efficiency
Multiple-choice items are amenable to rapid scoring, which is often done by scoring
machines. This expedites the reporting of test results to the student so that any
follow-up clarification of instruction may be done before the course has proceeded much
further. Essay questions, on the other hand, must be graded manually, one at a time.
Overall multiple choice tests are:
 Very effective
 Versatile at all levels
 Minimum of writing for student
 Guessing reduced
 Can cover broad range of content

Disadvantages
Versatility
Since the student selects a response from a list of alternatives rather than supplying or
constructing a response, multiple-choice test items are not adaptable to measuring
certain learning outcomes, such as the student’s ability to:
• Articulate explanations
• Display thought processes
• Furnish information
• Organize personal thoughts.
• Produce original ideas
• Provide examples
Such learning outcomes are better measured by short answer or essay questions, or by
performance tests.
Reliability
Although they are less susceptible to guessing than are true-false test items, multiple-choice
items are still affected to a certain extent. This guessing factor reduces the
reliability of multiple-choice item scores somewhat, but increasing the number of items
on the test offsets this reduction in reliability.
Difficulty of Construction
Good multiple-choice test items are generally more difficult and time-consuming to write
than other types of test items.

2. True/False Questions
A true-false test item requires the student to determine whether a statement is true or
false. The chief disadvantage of this type is the opportunity for successful guessing.
It is also known as a "binary-choice" item because there are only two options to select from.
These types of items are most effective for assessing knowledge, comprehension, and
application outcomes as defined in the cognitive domain of Bloom's Taxonomy of
educational objectives.
Example
Directions: Circle the correct response to the following statements.
1. Allama Iqbal is the founder of Pakistan. T/F
Good for:
 Knowledge level content
 Evaluating student understanding of popular misconceptions
 Concepts with two logical responses
Advantages:
 Easily assess verbal knowledge
 Easy to construct for the teacher
 Easy to score for the examiner
 Helpful for poor students
 Can test large amounts of content
Disadvantages:
 It is difficult to discriminate between students who know the material
and students who don't.
 A large number of items is needed for high reliability.
 Fifty percent guessing factor.
 Assesses lower order thinking skills.
 Poor representation of students' learning achievement.
Tips for Writing Good True/False items:
 Avoid double negatives.
 Avoid long/complex sentences.
 Use only one central idea in each item.
 Don't emphasize the trivial.
 Use exact quantitative language
 Don't lift items straight from the book.
 Make more false than true (60/40). (Students are more likely to answer
true.)
 The desired method of marking true or false should be clearly explained
before students begin the test.
 Construct statements that are definitely true or definitely false, without
additional qualifications. If opinion is used, attribute it to some source.
Avoid the following:
a. verbal clues, absolutes, and complex sentences;
b. broad general statements that are usually not true or false without
further qualifications;
c. terms denoting indefinite degree (e.g., large, long time, or regularly)
or absolutes (e.g., never, only, or always);
d. placing items in a systematic order (e.g., TTFF, TFTF, and so on);
e. taking statements directly from the text and presenting them out of
context.

3. Matching items
The matching items consist of two parallel columns. The column on the left contains the
questions to be answered, termed premises; the column on the right, the answers,
termed responses. The student is asked to associate each premise with a response to
form a matching pair.
For example;

Column "A" Capital City          Column "B" Country

Islamabad                        Iran
Tehran                           Spain
Istanbul                         Portugal
Madrid                           Pakistan
Jeddah                           Turkey

Matching test items are used to test a student's ability to recognize relationships and to
make associations between terms, parts, words, phrases, clauses, or symbols in one
column with related alternatives in another column.
Good for:
 Knowledge level
 Some comprehension level, if appropriately constructed
Advantages:
The chief advantage of matching exercises is that a good deal of factual information can
be tested in minimal time, making the tests compact and efficient. They are especially
well suited to who, what, when and where types of subject matter. Further, students
frequently find the tests fun to take because they have puzzle qualities to them.
 Maximum coverage at knowledge level in a minimum amount of
space/prep time
 Valuable in content areas that have a lot of facts
Disadvantages:
The principal difficulty with matching exercises is that teachers often find that the
subject matter is insufficient in quantity or not well suited for matching terms. An
exercise should be confined to homogeneous items containing one type of subject
matter (for instance, authors-novels; inventions-inventors; major events-dates;
terms-definitions; rules-examples; and the like).
 Time consuming for students
 Not good for higher levels of learning
Tips for Writing Good Matching items:
Here are some suggestions for writing matching items:
 Keep both the list of descriptions and the list of options fairly short and
homogeneous.
 The list of descriptions on the left side should contain the longer
phrases or statements, whereas the options on the right side should
consist of short phrases, words or symbols.
 Each description in the list should be numbered (each is an item), and
the list of options should be identified by letter.
 Include more options than descriptions. If the option list is longer than
the description list, it is harder for students to eliminate options. If the
option list is shorter, some options must be used more than once.
Always include some options that do not match any of the descriptions,
or some that match more than one, or both.
 Use 15 items or fewer.
 Use items in response column more than once (reduces the effects of
guessing).
 Put all items on a single page.

4. Completion Items
Like true-false items, completion items are relatively easy to write. They are also
known as "gap-fillers." They are most effective for assessing knowledge and comprehension
learning outcomes but can be written for higher level outcomes, e.g.:
The capital city of Pakistan is ________.
Suggestions for Writing Completion or Supply Items
Here are our suggestions for writing completion or supply items:
I. If at all possible, items should require a single-word answer or a
brief and definite statement. Avoid statements that are so indefinite
that they may be logically answered by several terms.
a. Poor item:
World War II ended in ________.
b. Better item:
World War II ended in the year ________.
II. Be sure the question or statement poses a problem to the
examinee. A direct question is often more desirable than an
incomplete statement because it provides more structure.
III. Be sure the answer that the student is required to produce is
factually correct.
IV. Omit only key words; don't eliminate so many elements that the
sense of the content is impaired.
a. Poor item:
The ________ type of test item is usually more ________ than the ________ type.
b. Better item:
The supply type of test item is usually graded less objectively than the ________ type.

Supply Type Items

The instructor is able to determine the students' level of generalized knowledge
of a subject through the use of supply-type questions. There are four types of test items
in the supply-type category. Commonly these are completion items, short answers,
restricted response and extended response (the essay type comprises the restricted and
extended responses).

Short Answer
The student supplies a response to a question that might consist of a single word or
phrase. Short answer items are most effective for assessing knowledge and comprehension
learning outcomes but can be written for higher level outcomes. They are of two types:
 Simple direct questions
Who was the first president of Pakistan?
 Completion items
The name of the first president of Pakistan is ________.
Good for:
 Application, synthesis, analysis, and evaluation levels
Advantages:
Gronlund (1995) writes that short-answer items have a number of advantages.
 They reduce the likelihood that a student will guess the correct answer
 They are relatively easy for a teacher to construct.
 They are adapted to mathematics, the sciences, and foreign languages
where specific types of knowledge are to be tested (The formula for
ordinary table salt is--------).
 They are consistent with the Socratic question and answer format
frequently employed in the elementary grades in teaching basic skills.
Disadvantages:
 May overemphasize memorization of facts
 Take care - questions may have more than one correct answer
 Scoring is laborious
Tips for Writing Good Short Answer Items:
 When using definitions: supply the term, not the definition, for a better
judge of student knowledge.
 For numbers, indicate the degree of precision/units expected.
 Use direct questions, not incomplete statements.
 If you do use incomplete statements, don't use more than two blanks
within an item.
 Arrange blanks to make scoring easy.
 Try to phrase the question so that only one answer is possible.

Essay
Essay questions are supply or constructed response type questions and can be the best
way to measure the students' higher order thinking skills, such as applying, organizing,
synthesizing, integrating, evaluating, or projecting while at the same time providing a
measure of writing skills. The student has to formulate and write a response, which may
be detailed and lengthy. The accuracy and quality of the response are judged by the
teacher.
Essay questions provide a complex prompt that requires written responses, which can
vary in length from a couple of paragraphs to many pages. Like short answer questions,
they provide students with an opportunity to explain their understanding and
demonstrate creativity, but make it hard for students to arrive at an acceptable answer
by bluffing. They can be constructed reasonably quickly and easily but marking these
questions can be time-consuming and grade agreement can be difficult.
Essay questions differ from short answer questions in that the essay questions are less
structured. This openness allows students to demonstrate that they can integrate the
course material in creative ways. As a result, essays are a favoured approach to test
higher levels of cognition including analysis, synthesis and evaluation. However, the
requirement that the students provide most of the structure increases the amount of
work required to respond effectively. Students often take longer to compose a
five-paragraph essay than they would to compose a paragraph answer to a short
answer question.
There are two major categories of essay questions: short response (also referred to as
restricted or brief) and extended response.
A. Restricted Response Essay Items
An essay item that poses a specific problem for which a student must recall proper
information, organize it in a suitable manner, derive a defensible conclusion, and
express it within the limits of the posed problem, or within a page or time limit, is called
a restricted response essay item. The statement of the problem specifies response
limitations that guide the student in responding and provide evaluation criteria for
scoring.
Example 1:
List the major similarities and differences in the lives of people living in Islamabad and
Faisalabad.
When Should Restricted Response Essay Items be used?
Restricted response essay items are usually used to:
 Analyze relationships
 Compare and contrast positions
 State necessary assumptions
 Identify appropriate conclusions
 Explain cause-effect relationships
 Organize data to support a viewpoint
B. Extended Response Essay Type Items
An essay type item that allows the student to determine the length and complexity of the
response is called an extended-response essay item. This type of essay is most useful
at the synthesis or evaluation levels of the cognitive domain. Extended response items
are used when we are interested in determining whether students can organize,
integrate, express, and evaluate information, ideas, or pieces of knowledge.

Example:
Identify as many different ways to generate electricity in Pakistan as you can. Give
advantages and disadvantages of each. Your response will be graded on its accuracy,
comprehensiveness and practicality. Your response should be 8-10 pages in length, and
it will be evaluated according to the RUBRIC (scoring criteria) already provided.
Overall, essay type items (both restricted response and extended response) are
good for:
 Application, synthesis and evaluation levels
Advantages:
 Students less likely to guess
 Easy to construct
 Stimulates more study
 Allows students to demonstrate ability to organize knowledge, express
opinions, show originality.
Disadvantages:
 Can limit amount of material tested, therefore has decreased validity.
 Subjective, potentially unreliable scoring.
 Time consuming to score.
Tips for Writing Good Essay Items:
 Provide reasonable time limits for thinking and writing.
 Avoid giving students a choice of questions. (You won't get a good
idea of the breadth of student achievement when they answer only
some of the questions.)
 Give a definitive task to the student: compare, analyze, evaluate, etc.
 Use a checklist point system to score, with a model answer: write an
outline and determine how many points to assign to each part.
 Score one question at a time across all papers.

Characteristics of a Good Test

1. Reliability
Reliability refers to the consistency with which a test measures. For example, if the
same test is given to two classes and is marked by different teachers and it still
produces similar results, it may be considered reliable. We can clarify the meaning of
consistency in a test with an illustration. It is observed that when an individual
measured the diameter of a steel ball several times with a pair of calipers, he did not
get exactly the same result every time. The extent of such variations is a measure of
the consistency, or the lack of it, in this measuring situation. Some degree of
inconsistency is present in all measurement procedures.
Definitions of Reliability:
The more general definition of reliability is: the degree to which a score is stable
and consistent when measured at different times (test-retest reliability), in different
ways (parallel forms and alternate forms), or with different items within the same scale
(internal consistency).

Types of Reliability
There are six general classes of reliability estimates, each of which estimates reliability
in a different way. They are:
i) Inter-Rater or Inter-Observer Reliability
To assess the degree to which different raters/observers give consistent estimates of
the same phenomenon. That is, if two teachers mark the same test and the results are
similar, this indicates inter-rater or inter-observer reliability.
ii) Test-Retest Reliability:
When the same test is administered twice and the results of both administrations are
similar, this constitutes test-retest reliability.
iii) Parallel-Form Reliability:
To assess the consistency of the results of two tests constructed in the same way from
the same content domain. Here the test designer tries to develop two tests of a similar
kind; if, after administration, the results are similar, this indicates parallel-form
reliability.

iv) Internal Consistency Reliability:

To assess the consistency of results across items within a test; it is the correlation of
the individual item scores with the score on the entire test.
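Internal consistency is commonly summarized by Cronbach's Alpha, which is mentioned below in connection with the Spearman-Brown formula. The following is a minimal illustrative sketch (the function name and item scores are made up, not from the text): alpha is computed as (k/(k-1)) * (1 - sum of item variances / variance of total scores).

from statistics import pvariance

def cronbach_alpha(scores):
    # scores: one row per student, one column per item (illustrative data).
    k = len(scores[0])                                  # number of items
    item_variances = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

scores = [
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]
print(round(cronbach_alpha(scores), 2))   # closer to 1 = more consistent items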

v) Split-Half Reliability:

To assess the consistency of results by comparing two halves of a single test; these
halves may be the even and odd items of the single test.
In split-half reliability we randomly divide all items that claim to measure the
same content into two sets. We administer the entire instrument to a sample of
students and calculate the total score for each randomly divided half. The split-half
reliability estimate is simply the correlation between these two total scores.
The formula used to estimate the reliability of the full test is the Spearman-Brown
prophecy formula, given below.
Predicted reliability = kr / (1 + (k-1)r)
where:
 k: Factor by which the length of the test is changed. For example, if original test
is 10 questions and new test is 15 questions, k = 15/10 = 1.5.
 r: Reliability of the original test. We typically use Cronbach’s Alpha for this, which
is a value that ranges from 0 to 1 with higher values indicating higher reliability.
Suppose a company uses a 15-item test to assess employee satisfaction and the test is
known to have a reliability of 0.74. If the test is doubled in length to 30 items (k = 2),
the predicted reliability is:
Predicted reliability = kr / (1 + (k-1)r)
Predicted reliability = 2*0.74 / (1 + (2-1)*0.74)
Predicted reliability = 0.85
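The split-half procedure and the Spearman-Brown correction can be sketched together in a few lines of code. This is an illustrative sketch only (hypothetical function name, made-up item scores); k = 2 here because correcting a half-test correlation up to full length amounts to doubling the test.

from statistics import correlation   # requires Python 3.10+

def split_half_reliability(scores):
    # scores: one row of item scores per student (illustrative data below).
    odd_half  = [sum(row[0::2]) for row in scores]   # totals on 1st, 3rd, ... items
    even_half = [sum(row[1::2]) for row in scores]   # totals on 2nd, 4th, ... items
    r = correlation(odd_half, even_half)             # reliability of a half test
    return (2 * r) / (1 + r)                         # Spearman-Brown, k = 2

scores = [
    [1, 0, 1, 1, 0, 1],
    [1, 1, 1, 0, 1, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
print(round(split_half_reliability(scores), 2))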

vi) Kuder-Richardson Reliability:

Estimates of the internal consistency of a test are commonly calculated using the
Kuder-Richardson formulas. The Kuder-Richardson Formula 20, often abbreviated
KR-20, is used to measure the internal consistency reliability of a test in which each
question has only two answers: right or wrong. The Kuder-Richardson Formula 20 is as
follows:
KR-20 = (k / (k-1)) * (1 – Σpjqj / σ2)
where:
 k: Total number of questions
 pj: Proportion of individuals who answered question j correctly
 qj: Proportion of individuals who answered question j incorrectly
 σ2: Variance of scores for all individuals who took the test
The value of KR-20 ranges from 0 to 1, with higher values indicating higher reliability.
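A minimal sketch of the KR-20 computation, assuming dichotomously scored items; the function name and data are hypothetical, for illustration only.

from statistics import pvariance

def kr20(scores):
    # scores: rows = students; items scored 1 (right) or 0 (wrong).
    k = len(scores[0])                                   # number of questions
    n = len(scores)                                      # number of students
    p = [sum(row[j] for row in scores) / n for j in range(k)]
    sum_pq = sum(pj * (1 - pj) for pj in p)              # sum of pj * qj
    variance = pvariance([sum(row) for row in scores])   # variance of totals
    return (k / (k - 1)) * (1 - sum_pq / variance)

scores = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
]
print(round(kr20(scores), 2))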
The simpler KR-21 formula assumes that all items are of approximately equal difficulty
and requires only the test length, the mean and the variance of the total scores:
KR-21 = (k / (k-1)) * (1 – M(k – M) / (k * σ2))
where:
 k: Total number of questions
 M: Mean of the total scores
 σ2: Variance of the total scores
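A companion sketch for KR-21 (again with a hypothetical function name and illustrative totals), needing only the total scores and the test length:

from statistics import mean, pvariance

def kr21(total_scores, k):
    # total_scores: each student's total on a k-item right/wrong test.
    m = mean(total_scores)
    variance = pvariance(total_scores)
    return (k / (k - 1)) * (1 - (m * (k - m)) / (k * variance))

totals = [4, 2, 5, 1, 4]          # illustrative totals on a 5-item test
print(round(kr21(totals, k=5), 2))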

Factors Affecting Reliability

Some of the factors that directly or indirectly affect test reliability are given below.

1. Test Length:
As a rule, adding more homogeneous questions to a test will increase the test's
reliability.
2. Method Used to Estimate Reliability:
The reliability coefficient is an estimate that can change depending on the method used
to calculate it. The method chosen to estimate the reliability should fit the way in which
the test will be used.

3. Heterogeneity of Scores
Heterogeneity refers to the differences among the scores obtained from a class.
Increasing the heterogeneity of the examinee sample increases variability (individual
differences), and thus reliability increases.

4. Difficulty
A test that is too difficult or too easy reduces reliability (e.g., very few test-takers
answer correctly, or nearly all do). A moderate level of difficulty increases test
reliability.

Validity
The validity of an assessment tool is the degree to which it measures what it is
designed to measure. For example, if a test is designed to measure the skill of adding
three-digit numbers in mathematics but the problems are presented in difficult language
that is not appropriate to the ability level of the students, it may not measure the
three-digit addition skill and consequently will not be a valid test. Many measurement
experts have defined this term; some of the definitions are given below.
According to the Business Dictionary, "Validity is the degree to which an instrument,
selection process, statistical technique, or test measures what it is supposed to
measure."
Overall we can say that, in terms of assessment, validity is the extent to which a test
measures what it claims to measure. It is vital for a test to be valid in order for the
results to be accurately applied and interpreted.

Validity versus Reliability


A test can be reliable but may not be valid. If test scores are to be used to make
accurate inferences about an examinee's ability, they must be both reliable and valid.
Reliability is a prerequisite for validity and refers to the ability of a test to measure a
particular trait or skill consistently. In simple words, the same test administered to the
same students should yield the same scores. However, tests can be highly
reliable and still not be valid for a particular purpose. Consider the example of a
thermometer with a systematic error that reads five degrees too high. When
repeated readings are taken under the same conditions, the thermometer will
yield consistent (reliable) measurements, but the inference about the temperature is
faulty.
This analogy makes it clear that determining the reliability of a test is an important first
step, but not the defining step, in determining the validity of a test.

Methods of Measuring Validity


1. Content Validity
With respect to educational achievement tests, a test is considered content valid when
the proportion of the material covered in the test approximates the proportion of material
covered in the course.
There are different types of content validity; the major types are face validity and
curricular validity.
2. Construct Validity
Construct validity is a test’s ability to measure factors which are relevant to the field of
study. Construct validity is thus an assessment of the quality of an instrument or
experimental design.
For Example - Integrity is a construct; it cannot be directly observed, yet it is useful for
understanding, describing, and predicting human behaviour.
3. Criterion Validity
It compares the test with other measures or outcomes (the criteria) already held to be
valid. For example, employee selection tests are often validated against measures of
job performance (the criterion), and IQ tests are often validated against measures of
academic performance (the criterion).
4. Concurrent Validity
Concurrent validity refers to the degree to which scores taken at one point
correlate with other measures (test, observation or interview) of the same construct
measured at the same time.
For example:
To assess the validity of a diagnostic screening test, the predictor (X) is the
test and the criterion (Y) is the clinical diagnosis. When the correlation is large, this
means that the predictor is useful as a diagnostic tool.
5. Predictive Validity
Predictive validity indicates how well the test predicts some future behaviour of the
examinee.
For example, a political poll intends to measure future voting intent. College entry tests
should have a high predictive validity with regard to final exam results. When the two
sets of scores are correlated, the resulting coefficient is called the predictive validity
coefficient.
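A minimal sketch of computing a predictive validity coefficient as described above; the scores are illustrative only, and statistics.correlation requires Python 3.10 or later.

from statistics import correlation

entry_test = [55, 62, 70, 48, 81, 66, 74]   # predictor (X): entry-test scores
final_exam = [58, 60, 75, 50, 85, 64, 70]   # criterion (Y): later exam results

# Pearson correlation between predictor and criterion is the
# predictive validity coefficient; closer to 1 = better prediction.
print(round(correlation(entry_test, final_exam), 2))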

Factors Affecting Validity


Some of the factors that may affect test validity are discussed below.
1. Instructions:
Unclear instructions that do not clearly indicate to pupils how to respond to the items
will tend to reduce validity.
2. Difficult Language Structure:
The language of the test should be simple, considering the grade for which the test is meant.

3. Inappropriate Level of Difficulty:

In norm-referenced tests, items that are too easy or too difficult will not provide reliable
discriminations among pupils and will therefore lower validity.
4. Ambiguity in Items Statements:
Ambiguity sometimes confuses the better pupils more than it does the poor pupils,
causing the items to discriminate in a negative direction.
5. Length of the Test:
If a test is too short to provide a representative sample of the performance we are
interested in, its validity will suffer accordingly. Similarly, an overly lengthy test is also
a threat to the validity evidence of the test.
6. Identifiable Pattern of Answers:
Placing correct answers in some systematic pattern will enable pupils to guess the
answers to some items more easily, and this will lower validity.
Relationship between Validity and Reliability
Reliability and validity are two different standards used to gauge the usefulness of a
test. Though different, they work together. It would not be beneficial to design a test with
good reliability that did not measure what it was intended to measure. Reliability is a
necessary requirement for validity: you have to have good reliability in
order to have validity. Reliability actually puts a cap or limit on validity, and if a test is not
reliable, it cannot be valid. Establishing good reliability is only the first part of
establishing validity. Having good reliability does not mean we have good validity; it just
means we are measuring something consistently. In short, we can say that reliability
means nothing if the test lacks validity.

Usability of Assessment Tools


Another important feature of a good assessment tool (classroom test) is its usability.
Usability refers to the extent to which a test can be used by students and teachers to
achieve specified goals in an effective and efficient manner. It also refers to the facilities
available to test developers regarding both the administration and the scoring procedures
of a test. As far as administration is concerned, test developers should be attentive to the
possibility of giving a test under reasonably acceptable conditions. For example,
suppose a team of experts decides to give a listening comprehension test to large
groups of examinees. In this case, test developers should make sure that facilities
such as audio equipment and/or suitable acoustic rooms are available. Otherwise, no
matter how reliable and valid the test may be, it will not be practical.
A good classroom test should be “teacher-friendly”. A teacher should be able to
develop, administer and mark it within the available time and with available resources.
Classroom tests are only valuable to students when they are returned promptly and
when the feedback from assessment is understood by the student. In this way, students
can benefit from the test-taking process. The issues regarding usability of the test
include cost of test development and maintenance, time (for development and test
length), resources (everything from computer access, copying facilities, AV equipment
to storage space), ease of marking, availability of suitable/trained markers and
administrative logistics.
The following are two very important aspects that contribute towards the usability of the
test.

Transparency
In simple words, transparency is a process which requires teachers to maintain
objectivity and honesty in developing, administering, marking and reporting test
results. Transparency refers to the availability of clear, accurate information to students
about testing. It makes students part of the testing process.
Security
Most teachers feel that security is an issue only in large-scale, high-stakes testing.
However, security is part of both reliability and validity. If a teacher invests time and
energy in developing good tests that accurately reflect the course outcomes, then it is
desirable to be able to recycle the tests or similar materials. This is especially important
if analyses show that the items, distracters and test sections are valid and
discriminating. In some parts of the world, cultural attitudes towards “collaborative test-
taking” are a threat to test security and thus to reliability and validity. As a result, there is
a trade-off between letting tests into the public domain and giving students adequate
information about tests.

Objectivity
The objectivity of a test refers to the degree to which equally competent scorers
obtain the same results. Most standardized tests of aptitude and achievement are high
in objectivity. The test items are of the objective type (e.g., multiple choice),
and the resulting scores are not influenced by the scorer's judgment or opinion. In fact,
such tests are usually constructed so that they can be accurately scored by trained clerks
and scoring machines. When such highly objective procedures are used, the reliability of
the test results is not affected by the scoring procedures.
For classroom tests constructed by teachers, objectivity may play an important role
in obtaining reliable measures of achievement. In essay testing and various
observational procedures, the results depend to a large extent on the person doing the
scoring. Different persons get different results, and even the same person may get
different results at different times. Such inconsistency in scoring has an adverse effect
on the reliability of the measures obtained, for the test scores now reflect the opinions
and biases of the scorer as well as the differences among pupils in the characteristic
being measured.
The solution is not to use only objective tests and to abandon all subjective methods of
evaluation, as this would have an adverse effect on validity, and as we noted earlier,
validity is the most important quality of evaluation results. A better solution is to select
the evaluation procedure most appropriate for the behavior being evaluated and
then to make the evaluation procedure as objective as possible. In the use of essay
tests, for example, objectivity can be increased by careful phrasing of the questions.
