Information Retrieval

Chapter 5:
Retrieval Evaluation
IR Evaluation

• Once an IR system has been designed, its performance and accuracy must be measured (evaluated).

• According to Singhal (2001), there are two main things to measure in an IR system: its effectiveness and its efficiency.
• Effectiveness: the quality of being able to bring about the intended effect
– How capable is the system of retrieving relevant documents from the collection?
– It is about user satisfaction
• Efficiency: the ratio of the output to the input of any system
– Skillfulness in avoiding wasted time and effort
– It is about time and space


 To measure ad hoc (informal) information retrieval

effectiveness in the standard way, we need a test collection
consisting of three things:
1. A document collection
2. A test suite (set) of information needs, expressible as queries
3. A set of relevance judgments, standardly a binary assessment of either relevant or non relevant for each query-document pair
Document collection

• Specific questions that might be considered when gathering

documents include:

1. How many items should be gathered?

2. What items should be sampled to create the document collection?
3. What about copyright constraints?
Example (N=128)

The standard approach to information retrieval system

evaluation revolves around the notion of relevant and
non relevant documents.
With respect to a user information need, a document in
the test collection is given a binary classification as
either relevant or non relevant.
This decision is referred to as the gold standard or
ground truth judgment of relevance.
A document is relevant if it addresses the stated
information need, not because it just happens to
contain all the words in the query.

Types of Evaluation Strategies

•System-centered studies
– Given documents, queries, and relevance judgments
• Try several variations of the system
• Measure which system returns the “best” hit list

•User-centered studies
– Given several users, and at least two retrieval systems
• Have each user try the same task on both systems
• Measure which system works “best” for the users’ information needs
Performance measures (Recall, Precision, etc.)

• The two most frequent and basic measures for

information retrieval effectiveness are :
1. Precision and
2. Recall.

Precision (P) is the fraction of retrieved documents that are relevant.
– The ability to retrieve top-ranked documents that are mostly relevant.
– Precision is the percentage of retrieved documents that are relevant to the query (i.e., the number of relevant documents retrieved divided by the total number of documents retrieved).
Precision Formula
Precision = relevant docs retrieved / retrieved docs
Recall (R) is the fraction of relevant documents that are retrieved.
– The ability of the search to find all of the relevant items in the corpus.
– Recall is the percentage of relevant documents retrieved from the database in response to the user’s query.
Recall Formula
Recall = relevant docs retrieved / relevant docs in the collection

• When do precision and recall reach 100% (that is, a value of 1)? How can such a value be justified?
An IR system returns 8 relevant documents, and 10
non relevant documents. There are a total of 20
relevant documents in the collection.
a. What is the precision of the system on this search?
b. What is its recall?
c. What is the F-measure?
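The exercise above can be worked through numerically. The sketch below is one possible worked solution; the variable names are illustrative, and the F-measure is assumed to be the balanced one, F = 2PR / (P + R).

```python
# Counts taken from the exercise above.
relevant_retrieved = 8        # relevant documents the system returned
nonrelevant_retrieved = 10    # non relevant documents the system returned
relevant_in_collection = 20   # all relevant documents in the collection

retrieved = relevant_retrieved + nonrelevant_retrieved   # 18 documents returned

precision = relevant_retrieved / retrieved               # 8 / 18
recall = relevant_retrieved / relevant_in_collection     # 8 / 20
# Balanced F-measure: harmonic mean of precision and recall (an assumption here).
f_measure = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3f}")  # precision = 0.444
print(f"recall    = {recall:.3f}")     # recall    = 0.400
print(f"F-measure = {f_measure:.3f}")  # F-measure = 0.421
```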
R-Precision

Precision at the R-th position in the ranking of results for a query, where R is the total number of relevant documents for that query.

It requires a set of known relevant documents, from which we calculate the precision of the top R documents returned.
– Calculate precision after R documents are seen
– Can be averaged over all queries
Example 2:
• Given a query q, for which the relevant documents are d1,
d6, d10, d15, d22, d26, an IR system retrieves the following
ranking: d6, d2, d11, d3, d10, d1, d14, d15, d7, d23.
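• Compute the precision and recall for this ranking at each retrieved document.
• Relevant documents are found at positions 1, 5, 6, and 8, where the precision is 1.0, 0.40, 0.50, and 0.50. The sum is divided by 6, the total number of relevant documents, because the two relevant documents that were never retrieved (d22 and d26) contribute a precision of 0: average precision = (1.0+0.40+0.50+0.50)/6 = 0.40.
• The R-precision is the precision at position 6, which is 3/6 = 0.50.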
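The figures in Example 2 can be checked with a short script. This is a minimal sketch rather than a reference implementation: the function names are illustrative, and average precision is assumed to divide by the total number of relevant documents, as in the example.

```python
# Data from Example 2.
relevant = {"d1", "d6", "d10", "d15", "d22", "d26"}
ranking = ["d6", "d2", "d11", "d3", "d10", "d1", "d14", "d15", "d7", "d23"]

def precision_recall_at_each_rank(ranking, relevant):
    """Return (rank, doc, is_relevant, precision, recall) for every rank."""
    rows, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        is_rel = doc in relevant
        hits += is_rel
        rows.append((rank, doc, is_rel, hits / rank, hits / len(relevant)))
    return rows

def average_precision(ranking, relevant):
    """Sum precision at each relevant hit, divided by all relevant documents."""
    rows = precision_recall_at_each_rank(ranking, relevant)
    return sum(p for _, _, is_rel, p, _ in rows if is_rel) / len(relevant)

def r_precision(ranking, relevant):
    """Precision after R documents are seen, where R = number of relevant docs."""
    r = len(relevant)
    return sum(doc in relevant for doc in ranking[:r]) / r

for rank, doc, is_rel, p, rec in precision_recall_at_each_rank(ranking, relevant):
    marker = "R" if is_rel else "N"
    print(f"{rank:2d} {doc:4s} {marker}  P={p:.2f}  R={rec:.2f}")

print(f"average precision = {average_precision(ranking, relevant):.2f}")  # 0.40
print(f"R-precision       = {r_precision(ranking, relevant):.2f}")        # 0.50
```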
Problems with both precision and recall
 Number of irrelevant documents in the collection is not
taken into account.
 Recall is undefined when there is no relevant document
in the collection.
 Precision is undefined when no document is retrieved.
Other measures
• Noise = retrieved irrelevant docs / retrieved docs
• Silence/Miss = non-retrieved relevant docs / relevant docs

Noise = 1 – Precision; Silence = 1 – Recall


• A single measure that trades off precision versus recall is the F measure, which is the weighted harmonic mean of precision and recall:
F_β = (1 + β²)PR / (β²P + R)
• With β = 1, recall and precision are weighted equally, and the measure reduces to the harmonic mean of recall and precision:
F = 2PR / (P + R)
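To see how the weighting changes the result, the sketch below implements the standard weighted form F_β = (1 + β²)PR / (β²P + R), which reduces to 2PR / (P + R) when β = 1. The function name, the sample values of P and R, and the convention of returning 0 when both inputs are 0 are assumptions made for illustration.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta > 1 weights recall more heavily; beta < 1 weights precision more.
    Returns 0.0 when both inputs are 0 (a convention assumed here to
    avoid dividing by zero).
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.5, 0.25  # hypothetical values, chosen only for illustration
print(round(f_measure(p, r), 3))        # 0.333  (beta = 1, balanced)
print(round(f_measure(p, r, 2.0), 3))   # 0.278  (recall emphasized)
print(round(f_measure(p, r, 0.5), 3))   # 0.417  (precision emphasized)
```

Because recall is the lower of the two values here, emphasizing recall pulls the score down and emphasizing precision pulls it up.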
• The following list of Rs and Ns represents relevant (R) and
non relevant (N) returned documents in a ranked list of 20
documents retrieved in response to a query from a collection
of 10,000 documents. The top of the ranked list (the document
the system thinks is most likely to be relevant) is on the left of
the list. This list shows 6 relevant documents. Assume that
there are 8 relevant documents in total in the collection.
• Calculate the following:

a) What is the precision of the system on the top 20?

b) What is recall?
c) What is p@10?
d) What is the F-measure on the top 20?
e) Assume that these 20 documents are the complete result set
of the system. What is the MAP for the query?
f) Noise
g) Silence
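The measures asked for above can all be computed from a string of R and N judgments. The ranked list itself is not reproduced in the text, so the `example` string below is a hypothetical ranking chosen only to match the stated counts (20 documents retrieved, 6 of them relevant, 8 relevant in the collection). Precision, recall, F-measure, noise, and silence depend only on those counts; P@10 and average precision depend on the order and would change with the real list. For a single query, MAP is simply the average precision of that query. The function and key names are illustrative.

```python
def evaluate_ranked_list(judgments, total_relevant, cutoff=10):
    """Evaluate a non-empty ranked list given as a string of 'R'/'N' marks."""
    flags = [mark == "R" for mark in judgments]
    retrieved, hits = len(flags), sum(flags)
    precision = hits / retrieved
    recall = hits / total_relevant
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0

    # Average precision: precision at each relevant hit, divided by all
    # relevant documents (unretrieved ones contribute 0).
    seen, precision_sum = 0, 0.0
    for rank, is_rel in enumerate(flags, start=1):
        if is_rel:
            seen += 1
            precision_sum += seen / rank

    return {
        "precision": precision,
        "recall": recall,
        "p_at_k": sum(flags[:cutoff]) / cutoff,
        "f_measure": f1,
        "average_precision": precision_sum / total_relevant,
        "noise": 1 - precision,
        "silence": 1 - recall,
    }

# Hypothetical ranking (most relevant guess on the left), NOT the original list.
example = "RRNNNNNNRNRNNNRNNNNR"
for name, value in evaluate_ranked_list(example, total_relevant=8).items():
    print(f"{name:18s} {value:.3f}")
```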
Difficulties in Evaluating IR System

• IR systems essentially facilitate communication between a user and document collections
• Relevance is a measure of the effectiveness of that communication
– Effectiveness is related to the relevancy of the retrieved items
– Relevance relates to a problem, an information need, a query, and a document or surrogate

• Relevance judgments are made by
– The user who posed the retrieval problem
– An external judge
– Are the relevance judgments made by the user and by an external judge the same?

• Relevance judgment is usually:
– Subjective: Depends upon a specific user’s judgment.
– Situational: Relates to the user’s current needs.
– Cognitive: Depends on human perception and behavior.
– Dynamic: Changes over time.
Information Retrieval

Chapter 6:
Query Languages and Operations
• Information is the main value of the Information Society.

• Depending on the particular application scenario and on the

type of information that has to be managed and searched,
different techniques need to be devised.

• The dictionary definition of query is a set of instructions passed

to a database to retrieve particular data.

• A query is the formulation of a user information need.

• A query is composed of keywords, and the documents containing those keywords are searched for. Keyword queries are popular because they are intuitive, easy to express, and allow fast ranking.
Query language (QL) refers to any computer programming language
that requests and retrieves data from database and information
systems by sending queries.

• Query Languages: A source language consisting of procedural

operators that invoke functions to be executed.
Keyword-based queries

• Queries are combinations of words.
• The document collection is searched for documents that contain these words.
• Word queries are intuitive, easy to express, and provide fast ranking.

Popular keyword-based queries are:
1. Single-word queries:
• A query is a single word.
• Simplest form of query.
• All documents that include this word are retrieved.
• Documents may be ranked by the frequency of this word in the document.

2. Phrase queries:
A query is a sequence of words treated as a single unit, also called a “literal string” or “exact phrase” query. The phrase is usually surrounded by quotation marks.

All documents that include this phrase are retrieved. Usually, separators (commas, colons, etc.) and “trivial words” (e.g., “a”, “the”, or “of”) in the phrase are ignored.
In effect, this query is for a set of words that must appear in sequence. It allows users to specify a context and thus gain precision.

Example: “United States of America”.
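Phrase matching as described above can be sketched in a few lines. This is a simplified illustration, not how a production engine works: it scans the document text directly instead of using a positional index, and the stop-word set contains only the three "trivial words" named in the text. All names are illustrative.

```python
import re

STOP_WORDS = {"a", "the", "of"}  # the "trivial words" named in the text

def content_words(text):
    """Lowercase the text, drop separators, and drop trivial words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def matches_phrase(document, phrase):
    """True if the phrase's content words appear in sequence in the document."""
    doc = content_words(document)
    target = content_words(phrase)
    n = len(target)
    return n > 0 and any(doc[i:i + n] == target for i in range(len(doc) - n + 1))

phrase = "United States of America"
print(matches_phrase("He moved to the United States of America in 1990.", phrase))  # True
print(matches_phrase("America and the United Nations praised the states.", phrase))  # False
```

The second document contains every query word but not in sequence, so it is rejected. That is the gain in precision over a plain multiple-word query.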

3. Multiple-word queries:

A query is a set of words (or phrases).

Two interpretations:
• A document is retrieved if it includes any of the query words.
• A document is retrieved if it includes each of the query words.
Documents may be ranked by the number of query words they contain: a document containing n query words is ranked higher than a document containing n-1 query words.

Documents containing all the query words are ranked at the top.
Documents containing only one query word are ranked at the bottom.
Frequency counts may still be used to break ties among documents that contain the same query words.
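The ranking rules for single-word and multiple-word queries can be sketched as follows. The tiny collection, the `search` function, and its `require_all` flag are hypothetical names introduced for illustration. Ties on the number of matched query words are broken by total frequency, as one reading of the rule above.

```python
from collections import Counter

# Hypothetical four-document collection.
docs = {
    "d1": "information retrieval evaluates retrieval systems",
    "d2": "query languages for information systems",
    "d3": "retrieval of information with a query",
    "d4": "cooking recipes",
}

def search(query, docs, require_all=False):
    """Rank documents by number of distinct query words, then by frequency.

    require_all=False: retrieve documents containing any query word.
    require_all=True:  retrieve only documents containing every query word.
    """
    terms = list(dict.fromkeys(query.lower().split()))  # distinct, in order
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        matched = [t for t in terms if counts[t] > 0]
        if not matched or (require_all and len(matched) < len(terms)):
            continue
        frequency = sum(counts[t] for t in matched)
        scored.append((len(matched), frequency, doc_id))
    # More matched words first; frequency breaks ties; doc id keeps order stable.
    scored.sort(key=lambda s: (-s[0], -s[1], s[2]))
    return [doc_id for _, _, doc_id in scored]

print(search("retrieval", docs))                                       # ['d1', 'd3']
print(search("information retrieval query", docs))                     # ['d3', 'd1', 'd2']
print(search("information retrieval query", docs, require_all=True))   # ['d3']
```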
4. Proximity queries:
Restrict the distance within a document between two search words.
Important for large documents in which the two search words may appear in different contexts.

Proximity specifications limit the acceptable occurrences and hence increase the precision of the search.
General format: Word1 within m units of Word2. The unit may be a character, word, paragraph, etc.
Examples:
• nuclear within 0 paragraphs of cleanup
Finds documents that discuss “nuclear” and “cleanup” in the same paragraph.

• united within 5 words of american
Finds documents in which “united” appears within five words of “american”.
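The two example queries can be sketched with word and paragraph units. The distance definitions are assumptions made for this illustration: "within m words" is read as a difference of at most m token positions, and "within m paragraphs" as a difference of at most m paragraph indexes, with paragraphs separated by blank lines. Function names are illustrative.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def within_words(text, word1, word2, m):
    """True if some occurrence of word1 is at most m positions from word2."""
    tokens = tokenize(text)
    pos1 = [i for i, t in enumerate(tokens) if t == word1.lower()]
    pos2 = [i for i, t in enumerate(tokens) if t == word2.lower()]
    return any(abs(i - j) <= m for i in pos1 for j in pos2)

def within_paragraphs(text, word1, word2, m):
    """True if the words occur in paragraphs at most m apart (0 = same one)."""
    paragraphs = [tokenize(p) for p in re.split(r"\n\s*\n", text)]
    idx1 = [i for i, p in enumerate(paragraphs) if word1.lower() in p]
    idx2 = [i for i, p in enumerate(paragraphs) if word2.lower() in p]
    return any(abs(i - j) <= m for i in idx1 for j in idx2)

sentence = "The United Nations and American officials met."
print(within_words(sentence, "united", "american", 5))  # True
print(within_words(sentence, "united", "american", 2))  # False

report = "Nuclear policy is debated.\n\nThe cleanup effort is costly."
print(within_paragraphs(report, "nuclear", "cleanup", 0))  # False
print(within_paragraphs(report, "nuclear", "cleanup", 1))  # True
```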

Structural queries

• So far, we assumed documents that are entirely free of structure.
• Structured documents would allow more powerful queries.
• Queries could combine text queries with structural queries: queries that relate to the structure of the document.
• Mixing contents and structure in queries:
– Contents: words, phrases, or patterns
– Structural constraints: containment, proximity, or other restrictions on structural elements
• Example: Retrieve documents that contain a page in which the phrase “terrorist attack” appears in the text and a photo whose caption contains the phrase “World Trade Center”.
• The corresponding query could be: same page (“terrorist attack”, photo (caption (“World Trade Center”))).

• Three main structures:
– Fixed structure
– Hypertext structure
– Hierarchical structure
Fixed structure
• The document is divided into a fixed set of fields, much like a filled form.
• Fields may be associated with types, such as date.
• Each field has text, and fields cannot nest or overlap.
• Queries (multiple-word, Boolean, proximity, patterns, etc.) are targeted at particular fields.
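A field-targeted query over fixed-structure documents might look like the sketch below. The records, field names, and the `field_has_word` helper are hypothetical. The point is only that each condition addresses one named field and that a typed field (here a date) supports comparisons.

```python
from datetime import date

# Hypothetical records: every document has the same fixed, non-nested fields.
documents = [
    {"id": 1, "title": "Evaluation of retrieval systems", "author": "A. Author",
     "published": date(2001, 5, 1), "body": "precision and recall are measured"},
    {"id": 2, "title": "Query languages", "author": "B. Writer",
     "published": date(1998, 3, 15), "body": "retrieval with keyword queries"},
    {"id": 3, "title": "Cooking basics", "author": "A. Author",
     "published": date(2005, 7, 9), "body": "recipes and techniques"},
]

def field_has_word(doc, field, word):
    """True if the named text field contains the word (case-insensitive)."""
    return word.lower() in doc[field].lower().split()

# Query: title contains "retrieval" AND published on or after 2000-01-01.
hits = [d["id"] for d in documents
        if field_has_word(d, "title", "retrieval")
        and d["published"] >= date(2000, 1, 1)]
print(hits)  # [1]
```

Document 2 mentions "retrieval" only in its body, so a query targeted at the title field does not retrieve it.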
Hypertext structure
Hierarchical structure

• Intermediate model between fixed structure and hypertext.
• The “anarchic” hypertext network is restricted to a hierarchical structure.
• The model allows recursive decomposition of documents.
• Queries may combine regular text queries, which are targeted at particular areas (the target area is defined by a “path expression”), and queries on the structure itself; for example, “retrieve documents with at least 5 sections”.
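Both kinds of hierarchical query can be sketched over a small document tree. The tree layout, the tag names, and the helpers `node`, `select`, and `count_tag` are all hypothetical. A path expression is simplified here to a list of tags followed from the root.

```python
def node(tag, text="", children=None):
    """Build one element of a (hypothetical) hierarchical document."""
    return {"tag": tag, "text": text, "children": children or []}

book = node("book", children=[
    node("chapter", "Retrieval evaluation", [
        node("section", "precision and recall"),
        node("section", "test collections"),
    ]),
    node("chapter", "Query languages", [
        node("section", "keyword queries"),
        node("section", "structural queries"),
        node("section", "relevance feedback"),
    ]),
])

def select(root, path):
    """Follow a path expression (list of tags) down from the root."""
    nodes = [root]
    for tag in path:
        nodes = [c for n in nodes for c in n["children"] if c["tag"] == tag]
    return nodes

def count_tag(root, tag):
    """Count elements with the given tag anywhere in the tree (recursive)."""
    return (root["tag"] == tag) + sum(count_tag(c, tag) for c in root["children"])

# Text query targeted at the area chapter/section.
matches = [s["text"] for s in select(book, ["chapter", "section"])
           if "queries" in s["text"].split()]
print(matches)                           # ['keyword queries', 'structural queries']

# Query on the structure itself: at least 5 sections?
print(count_tag(book, "section") >= 5)   # True
```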
Relevance feedback

• After initial retrieval results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents.
• The system uses this feedback to reformulate the query and produces new results based on the reformulated query. This allows a more interactive, multi-pass process.
• The idea of relevance feedback (RF) is to involve the user in the retrieval process so as to improve the final result set.
• In particular, the user gives feedback on the relevance of documents in an initial set of results.
The basic procedure is:
1. The user issues a (short, simple) query.
2. The system returns an initial set of retrieval results.
3. The user marks some returned documents as relevant or non relevant.
4. The system computes a better representation of the information need based on the user feedback.
5. The system displays a revised set of retrieval results.


Chapter 5 and 6
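As a closing illustration of the relevance feedback procedure in Chapter 6, the step in which the system computes a better representation of the information need is often done with the Rocchio formula. The text does not name a method, so Rocchio is an assumption here. The sketch represents the query and the documents as term-weight dictionaries. The weights alpha, beta, and gamma, their default values, and the sample data are illustrative choices.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant documents and away from non relevant ones.

    query: dict term -> weight; relevant / nonrelevant: lists of such dicts.
    Terms whose new weight is not positive are dropped.
    """
    terms = set(query)
    for doc in relevant + nonrelevant:
        terms |= set(doc)

    new_query = {}
    for term in terms:
        weight = alpha * query.get(term, 0.0)
        if relevant:
            weight += beta * sum(d.get(term, 0.0) for d in relevant) / len(relevant)
        if nonrelevant:
            weight -= gamma * sum(d.get(term, 0.0) for d in nonrelevant) / len(nonrelevant)
        if weight > 0:
            new_query[term] = weight
    return new_query

# Hypothetical feedback round: the user wants the car, not the animal.
query = {"jaguar": 1.0}
marked_relevant = [{"jaguar": 1.0, "car": 1.0}]
marked_nonrelevant = [{"jaguar": 1.0, "cat": 1.0}]

revised = rocchio(query, marked_relevant, marked_nonrelevant)
print(sorted(revised.items()))  # 'car' is added; 'cat' gets a negative weight and is dropped
```

The revised query would then be run again to produce the revised result set, and the loop can repeat for further passes.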

