Unit 1 - Modern Information Retrieval - WWW - Rgpvnotes.in
UNIT – 1
Syllabus
Introduction: Information versus data retrieval, the retrieval process, taxonomy of Information Retrieval
Models.
Figure-1 Interaction of the user with the retrieval system through distinct tasks.
Classic information retrieval systems normally allow information or data retrieval. Hypertext systems are
usually tuned for providing quick browsing. Modern digital library and Web interfaces might attempt to
combine these tasks to provide improved retrieval capabilities. However, the combination of retrieval and
browsing is not yet a well-established approach and is not the dominant paradigm.
Figure 1 illustrates the interaction of the user through the different tasks we identify. Information and data
retrieval are usually provided by most modern information retrieval systems (such as Web interfaces).
Further, such systems might also provide some (still limited) form of browsing. While combining
information and data retrieval with browsing is not yet a common practice, it might become so in the
future.
Both retrieval and browsing are, in the language of the World Wide Web, 'pulling' actions. That is, the user
requests the information in an interactive manner. An alternative is to do retrieval in an automatic and
permanent fashion using software agents which push the information towards the user. For instance,
information useful to a user could be extracted periodically from a news service. In this case, we say that
the IR system is executing a particular retrieval task which consists of filtering relevant information for later
inspection by the user.
Figure 2: Logical view of a document: from full text to a set of index terms.
The full text is clearly the most complete logical view of a document but its usage usually implies higher
computational costs. A small set of categories (generated by a human specialist) provides the most concise
logical view of a document but its usage might lead to retrieval of poor quality. Several intermediate logical
views (of a document) might be adopted by an information retrieval system as illustrated in Figure 2.
Besides adopting any of the intermediate representations, the retrieval system might also recognize the
internal structure normally present in a document (e.g., chapters, sections, subsections, etc.). This
information on the structure of the document might be quite useful and is required by structured text
retrieval models.
As illustrated in Figure 2, we view the issue of logically representing a document as a continuum in which
the logical view of a document might shift (smoothly) from a full text representation to a higher level
representation specified by a human subject.
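The shift from a full text view to a set of index terms can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stopword list, the function names full_text_view and index_term_view, and the sample document are all hypothetical, and a real system would typically add further text operations.

```python
import re

# Hypothetical stopword list, kept tiny for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "is", "in", "by"}

def full_text_view(text):
    """Most complete logical view: every word of the document, in order."""
    return re.findall(r"[a-z0-9]+", text.lower())

def index_term_view(text):
    """More concise logical view: the distinct non-stopword terms, sorted."""
    return sorted(set(full_text_view(text)) - STOPWORDS)

# Sample document (an assumption, not taken from the notes).
doc = "The retrieval of information is the task of an information retrieval system"

print(full_text_view(doc))   # all 12 words, duplicates and stopwords included
print(index_term_view(doc))  # a much smaller set of index terms
```

The full text view keeps every word and is therefore more costly to store and search, while the index term view is far smaller but discards word order and repetition.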
volumes of data. Different index structures might be used, but the most popular one is the inverted file as
indicated in Figure 3. The resources (time and storage space) spent on defining the text database and
building the index are amortized by querying the retrieval system many times.
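A minimal sketch of an inverted file is given here to make the idea concrete. It assumes a toy collection held in memory; the names build_inverted_index and search_all, the sample documents, and the choice of AND semantics for multi-term queries are illustrative assumptions, not part of the notes.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search_all(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Toy text database (an assumption for illustration).
docs = {
    1: "information retrieval systems",
    2: "data retrieval and databases",
    3: "browsing hypertext systems",
}

# The index is built once; its cost is amortized over many queries.
index = build_inverted_index(docs)
print(search_all(index, "retrieval systems"))
```

Each query only touches the posting lists of its own terms instead of scanning every document, which is why the one-time cost of building the index pays off when the system is queried many times.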
Table 1: Retrieval models most frequently associated with distinct combinations of a document logical
view and user task.
A related but distinct task is one in which the queries remain relatively static while new documents come
into the system (and leave). This is the case with the stock market and with news wire services. This
operational mode has been termed filtering.
In a filtering task, a user profile describing the user's preferences is constructed. Such a profile is then
compared to the incoming documents in an attempt to determine those which might be of interest to this
particular user.
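One simple way to picture the comparison between a profile and the incoming documents is sketched below. It assumes the profile is just a set of keywords and that a document is of interest when it shares at least a minimum number of terms with the profile; the names tokenize and matches_profile, the min_overlap threshold, and the sample headlines are hypothetical. Real filtering systems may use weighted profiles and more elaborate matching.

```python
import re

def tokenize(text):
    """Return the set of lowercase words in a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches_profile(profile_terms, document, min_overlap=2):
    """True if the document shares at least min_overlap terms with the profile."""
    return len(profile_terms & tokenize(document)) >= min_overlap

# Hypothetical user profile: a static set of terms describing the user's interests.
profile = {"stock", "market", "shares", "earnings"}

# Hypothetical stream of incoming documents.
incoming = [
    "Stock market closes higher as earnings beat forecasts",
    "Local team wins the championship",
    "Shares slide after weak earnings report",
]

# Keep only the documents that might be of interest to this user.
selected = [doc for doc in incoming if matches_profile(profile, doc)]
print(selected)
```

Here the profile plays the role of a long-lived query, and each arriving document is tested against it once, which is the reverse of the usual setting where a fixed collection is tested against changing queries.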
A variation of this procedure is to rank the filtered documents and show this ranking to the user. The
motivation is that the user can examine fewer documents by assuming that those at the top of the
ranking are more likely to be relevant. This variation of filtering is called routing, but it is not
popular.
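Routing can be sketched as filtering followed by sorting on a score. The example assumes a keyword-set profile and uses the number of shared terms as the score; the names overlap_score and route, the min_score cutoff, and the sample documents are hypothetical choices made only for illustration.

```python
import re

def overlap_score(profile_terms, document):
    """Score a document by the number of profile terms it contains."""
    tokens = set(re.findall(r"[a-z0-9]+", document.lower()))
    return len(profile_terms & tokens)

def route(profile_terms, documents, min_score=1):
    """Drop documents scoring below min_score, then rank the rest by score."""
    scored = [(overlap_score(profile_terms, d), d) for d in documents]
    kept = [(score, d) for score, d in scored if score >= min_score]
    # sorted() is stable, so documents with equal scores keep arrival order.
    ranked = sorted(kept, key=lambda pair: pair[0], reverse=True)
    return [d for _, d in ranked]

# Hypothetical profile and incoming documents.
profile = {"stock", "market", "shares", "earnings"}
documents = [
    "Shares slide after weak earnings report",
    "Local team wins the championship",
    "Stock market closes higher as earnings beat forecasts",
]

for doc in route(profile, documents):
    print(doc)
```

The user can then read from the top of the list and stop early, on the assumption that lower-ranked documents are less likely to be relevant.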