
July 13, 1998

CMPT-354-97.2 Lecture Notes

Chapter 11

Indexing & Hashing


1. Many queries reference only a small proportion of the records in a file. For example, finding all records at
the Perryridge branch only returns records where bname = "Perryridge".
2. We should be able to locate these records directly, rather than having to read every record and check its
branch name. We then need extra file structuring.

11.1 Basic Concepts


1. An index for a file works like a catalogue in a library. Cards in alphabetic order tell us where to find books
by a particular author.
2. In real-world databases, indices like this might be too large to be efficient. We'll look at more sophisticated
indexing techniques.
3. There are two kinds of indices.
• Ordered indices: indices are based on a sorted ordering of the values.
• Hash indices: indices are based on the values being distributed uniformly across a range of buckets. The
bucket to which a value is assigned is determined by a function, called a hash function.
4. We will consider several indexing techniques. No one technique is the best. Each technique is best suited for
a particular database application.
5. Methods will be evaluated on:
(a) Access Types: types of access that are supported efficiently, e.g., value-based search or range search.
(b) Access Time: time to find a particular data item or set of items.
(c) Insertion Time: time taken to insert a new data item (includes time to find the right place to insert).
(d) Deletion Time: time to delete an item (includes time taken to find the item, as well as to update the
index structure).
(e) Space Overhead: additional space occupied by an index structure.
6. We may have more than one index or hash function for a file. (The library may have card catalogues by
author, subject or title.)
7. The attribute or set of attributes used to look up records in a file is called the search key (not to be confused
with primary key, etc.).
bname        account no.   balance
Brighton     217           750
Downtown     101           500
Downtown     110           600
Mianus       215           700
Perryridge   102           400
Perryridge   201           900
Perryridge   218           700
Redwood      222           700
Round Hill   305           350

Figure 11.1: Sequential file for deposit records.

11.2 Ordered Indices


1. In order to allow fast random access, an index structure may be used.
2. A file may have several indices on different search keys.
3. If the file containing the records is sequentially ordered, the index whose search key specifies the sequential
order of the file is the primary index, or clustering index. Note: The search key of a primary index is
usually the primary key, but it is not necessarily so.
4. Indices whose search key specifies an order different from the sequential order of the file are called
secondary indices, or nonclustering indices.

11.2.1 Primary Index


1. Index-sequential files: Files are ordered sequentially on some search key, and a primary index is associated
with them.

Dense and Sparse Indices


1. There are two types of ordered indices:

Dense Index:
• An index record appears for every search key value in the file.
• This record contains the search key value and a pointer to the actual record.
Sparse Index:
• Index records are created only for some of the records.
• To locate a record, we find the index record with the largest search key value less than or equal to the
search key value we are looking for.
• We start at the record pointed to by the index record, and proceed along the pointers in the file (that
is, sequentially) until we find the desired record.
2. Figures 11.2 and 11.3 show dense and sparse indices for the deposit file.
3. Notice how we would find records for the Perryridge branch using both methods. (Do it!)
4. Dense indices are faster in general, but sparse indices require less space and impose less maintenance for
insertions and deletions. (Why?)
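The two lookup procedures can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the file is an in-memory list in bname order, a "pointer" is a list position, and the names `FILE`, `dense`, `sparse`, `scan_from`, `dense_lookup` and `sparse_lookup` are invented for the sketch. The sparse index holds the entries Brighton, Mianus and Redwood.

```python
from bisect import bisect_right

# Deposit file stored sequentially in bname order.
# Each record is (bname, account number, balance).
FILE = [
    ("Brighton", 217, 750),
    ("Downtown", 101, 500),
    ("Downtown", 110, 600),
    ("Mianus", 215, 700),
    ("Perryridge", 102, 400),
    ("Perryridge", 201, 900),
    ("Perryridge", 218, 700),
    ("Redwood", 222, 700),
    ("Round Hill", 305, 350),
]

# Dense index: one entry per distinct search key value, pointing at the
# first record with that value.
dense = {}
for pos, rec in enumerate(FILE):
    dense.setdefault(rec[0], pos)

# Sparse index: entries for only some key values, kept in sorted order.
sparse = [("Brighton", 0), ("Mianus", 3), ("Redwood", 7)]


def scan_from(pos, key):
    """Read the file sequentially from pos, collecting records with bname == key."""
    found = []
    while pos < len(FILE) and FILE[pos][0] <= key:
        if FILE[pos][0] == key:
            found.append(FILE[pos])
        pos += 1
    return found


def dense_lookup(key):
    # The dense index points straight at the first matching record.
    return scan_from(dense[key], key) if key in dense else []


def sparse_lookup(key):
    keys = [k for k, _ in sparse]
    i = bisect_right(keys, key) - 1  # largest index key <= search key
    if i < 0:
        return []  # key sorts before every index entry
    return scan_from(sparse[i][1], key)
```

For Perryridge, the dense lookup starts at the first Perryridge record, while the sparse lookup starts at the Mianus entry and reads one extra record before reaching the same three records.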


Dense index        Deposit file
entry              bname        account no.   cname      balance
Brighton     ->    Brighton     217           Green      750
Downtown     ->    Downtown     101           Johnson    500
                   Downtown     110           Peterson   600
Mianus       ->    Mianus       215           Smith      700
Perryridge   ->    Perryridge   102           Hayes      400
                   Perryridge   201           Williams   900
                   Perryridge   218           Lyle       700
Redwood      ->    Redwood      222           Lindsay    700
Round Hill   ->    Round Hill   305           Turner     350

Figure 11.2: Dense index.

Sparse index       Deposit file
entry              bname        account no.   cname      balance
Brighton     ->    Brighton     217           Green      750
                   Downtown     101           Johnson    500
                   Downtown     110           Peterson   600
Mianus       ->    Mianus       215           Smith      700
                   Perryridge   102           Hayes      400
                   Perryridge   201           Williams   900
                   Perryridge   218           Lyle       700
Redwood      ->    Redwood      222           Lindsay    700
                   Round Hill   305           Turner     350

Figure 11.3: Sparse index.

Figure 11.4: Two-level sparse index. (The outer index points to the blocks of the inner index: index block 0, index block 1, ...; the inner index in turn points to data block 0, data block 1, ....)


5. A good compromise: to have a sparse index with one entry per block.
Why is this good?
• The biggest cost is in bringing a block into main memory.
• We are guaranteed to have the correct block with this method, unless the record is on an overflow block
(actually could be several blocks).
• Index size is still small.

Multi-Level Indices
1. Even with a sparse index, index size may still grow too large. For 100,000 records, 10 per block, at one index
record per block, that's 10,000 index records! Even if we can fit 100 index records per block, this is 100
blocks.
2. If the index is too large to be kept in main memory, a search results in several disk reads.
• If there are no overflow blocks in the index, we can use binary search.
• This will read as many as $1 + \lfloor \log_2 b \rfloor$ blocks for an index of $b$ blocks (as many as 7 for our 100 blocks).
• If the index has overflow blocks, then sequential search is typically used, reading all $b$ index blocks.
3. Solution: Construct a sparse index on the index (Figure 11.4).
4. Use binary search on the outer index. Scan the index block found until the correct index record is found. Use the index
record as before: scan the block pointed to for the desired record.
5. For very large files, additional levels of indexing may be required.
6. Indices must be updated at all levels when insertions or deletions require it.
7. Frequently, each level of index corresponds to a unit of physical storage (e.g. indices at the level of track,
cylinder and disk).
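The block counts quoted above can be reproduced with a short calculation. The helper names `index_blocks` and `binary_search_reads` are invented for this sketch; the figures (100,000 records, 10 records per block, 100 index records per block) are the ones used in the notes.

```python
import math


def index_blocks(records, records_per_block, entries_per_index_block):
    """Blocks needed by a sparse index with one entry per data block."""
    data_blocks = math.ceil(records / records_per_block)
    return math.ceil(data_blocks / entries_per_index_block)


def binary_search_reads(b):
    """Worst-case number of block reads for binary search over b index blocks."""
    return 1 + math.floor(math.log2(b))


inner = index_blocks(100_000, 10, 100)  # blocks in the (inner) sparse index
reads = binary_search_reads(inner)      # block reads without an outer index
outer = math.ceil(inner / 100)          # blocks in a sparse index on the inner index
```

With these numbers the inner index occupies 100 blocks, binary search over it reads up to 7 of them, and an outer index with one entry per inner block fits in a single block.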


Figure 11.5: Sparse secondary index on cname. (Index entries Green, Lindsay and Smith each point to a bucket of pointers into the deposit file, which is stored in bname order.)

Index Update
Regardless of what form of index is used, every index must be updated whenever a record is either inserted into
or deleted from the file.
1. Deletion:
• Find (look up) the record.
• If it is the last record with a particular search key value, delete that search key value from the index.
• For dense indices, this is like deleting a record in a file.
• For sparse indices, delete a key value by replacing the key value's entry in the index by the next search key value.
If that value already has an index entry, delete the entry instead.
2. Insertion:
• Find the place to insert.
• Dense index: insert the search key value if not present.
• Sparse index: no change unless a new block is created. (In this case, the first search key value appearing
in the new block is inserted into the index.)

11.2.2 Secondary Indices


1. If the search key of a secondary index is not a candidate key, it is not enough to point to just the first record
with each search-key value because the remaining records with the same search-key value could be anywhere
in the file. Therefore, a secondary index must contain pointers to all the records.
2. We can use an extra level of indirection to implement secondary indices on search keys that are not candidate
keys. A pointer does not point directly to the file but to a bucket that contains pointers to the file.
• See Figure 11.5 on secondary key cname.
• To perform a lookup on Peterson, we must read all three records pointed to by entries in bucket 2.
• Only one entry points to a Peterson record, but three records need to be read.
• As the file is not ordered physically by cname, this may take 3 block accesses.
3. Secondary indices must be dense, with an index entry for every search-key value, and a pointer to every
record in the file.
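The Peterson lookup can be traced with a small sketch. It assumes the buckets each hold three pointers in cname order, which is inferred from the remark that three records are read for Peterson. The names `FILE`, `buckets`, `index` and `lookup` are illustrative only, and pointers are list positions.

```python
from bisect import bisect_right

# Deposit file in bname order: (bname, account number, cname, balance).
FILE = [
    ("Brighton", 217, "Green", 750),
    ("Downtown", 101, "Johnson", 500),
    ("Downtown", 110, "Peterson", 600),
    ("Mianus", 215, "Smith", 700),
    ("Perryridge", 102, "Hayes", 400),
    ("Perryridge", 201, "Williams", 900),
    ("Perryridge", 218, "Lyle", 700),
    ("Redwood", 222, "Lindsay", 700),
    ("Round Hill", 305, "Turner", 350),
]

# Pointers to every record, ordered by cname and grouped three per bucket
# (the bucket size is an assumption).
by_cname = sorted(range(len(FILE)), key=lambda pos: FILE[pos][2])
buckets = [by_cname[i:i + 3] for i in range(0, len(by_cname), 3)]

# One index entry per bucket: the cname of the first record it points to.
index = [FILE[bucket[0]][2] for bucket in buckets]


def lookup(cname):
    """Return (matching records, number of records read)."""
    i = bisect_right(index, cname) - 1
    if i < 0:
        return [], 0
    hits, reads = [], 0
    for pos in buckets[i]:
        reads += 1  # each pointer may lead to a different block of the file
        if FILE[pos][2] == cname:
            hits.append(FILE[pos])
    return hits, reads
```

Here `lookup("Peterson")` follows all three pointers of the second bucket and keeps the single Peterson record.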


4. Secondary indices improve the performance of queries on non-primary keys.


5. They also impose serious overhead on database modification: whenever a file is updated, every index must
be updated.
6. The designer must decide whether to use secondary indices or not.

11.3 B+-Tree Index Files


1. The primary disadvantage of index-sequential file organization is that performance degrades as the file grows.
This can be remedied by costly re-organizations.
2. The B+-tree file structure maintains its efficiency despite frequent insertions and deletions. It imposes some
acceptable update and space overheads.
3. A B+-tree index is a balanced tree in which every path from the root to a leaf is of the same length.
4. Each nonleaf node in the tree must have between $\lceil n/2 \rceil$ and $n$ children, where $n$ is fixed for a particular tree.

11.3.1 Structure of a B+-Tree


1. A B+-tree index is a multilevel index but is structured differently from that of multi-level index sequential
files.
2. A typical node (Figure 11.6) contains up to $n - 1$ search key values $K_1, K_2, \ldots, K_{n-1}$, and $n$ pointers
$P_1, P_2, \ldots, P_n$. Search key values in a node are kept in sorted order.

| P_1 | K_1 | P_2 | ... | P_{n-1} | K_{n-1} | P_n |

Figure 11.6: Typical node of a B+-tree.
3. For leaf nodes, $P_i$ ($i = 1, \ldots, n - 1$) points to either a file record with search key value $K_i$, or a bucket of
pointers to records with that search key value. The bucket structure is used if the search key is not a primary key,
and the file is not sorted in search key order.
Pointer $P_n$ (the nth pointer in the leaf node) is used to chain leaf nodes together in linear order (search key
order). This allows efficient sequential processing of the file.
The ranges of values in the leaves do not overlap.
4. Non-leaf nodes form a multilevel index on leaf nodes.
A non-leaf node may hold up to $n$ pointers and must hold at least $\lceil n/2 \rceil$ pointers. The number of pointers in a node
is called the fan-out of the node.
Consider a node containing $m$ pointers. Pointer $P_i$ ($i = 2, \ldots, m - 1$) points to a subtree containing search key
values $\geq K_{i-1}$ and $< K_i$. Pointer $P_m$ points to a subtree containing search key values $\geq K_{m-1}$. Pointer $P_1$
points to a subtree containing search key values $< K_1$.
5. Figures 11.7 (textbook Fig. 11.8) and textbook Fig. 11.9 show B+-trees for the deposit file with n=3 and
n=5.

Root:                         [Perryridge]
Non-leaf:          [Mianus]                  [Redwood]
Leaves:   [Brighton, Downtown] -> [Mianus] -> [Perryridge] -> [Redwood, Round Hill]

Figure 11.7: B+-tree for deposit file with n = 3.

11.3.2 Queries on B+-Trees


1. Suppose we want to find all records with a search key value of $k$.
• Examine the root node and find the smallest search key value $K_i > k$.
• Follow pointer $P_i$ to another node.
• In particular, if $k < K_1$, follow pointer $P_1$.
• If no such $K_i$ exists (that is, $k \geq K_{m-1}$ in a node with $m$ pointers), follow the last pointer $P_m$.
• Continue down through non-leaf nodes, looking for the smallest search key value $> k$ and following the
corresponding pointer.
• Eventually we arrive at a leaf node, where a pointer will point to the desired record or bucket.
2. In processing a query, we traverse a path from the root to a leaf node. If there are $K$ search key values in
the file, this path is no longer than $\lceil \log_{\lceil n/2 \rceil}(K) \rceil$.
This means that the path is not long, even in large files. For a 4 KB disk block with a search-key size of
12 bytes and a disk pointer of 8 bytes, $n$ is around 200. Even if $n = 100$, a look-up among 1 million search-key values
requires at most $\lceil \log_{50}(1{,}000{,}000) \rceil = 4$ nodes to be accessed. Since the root is usually in the buffer, it typically
takes only 3 or fewer disk reads.
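The descent and the path-length estimate can be sketched as follows. This is a minimal in-memory illustration, not a disk-based implementation. The `Node` class and the helpers `find_leaf`, `max_fanout` and `max_path_length` are names invented here, and the sample tree is the n = 3 tree of Figure 11.7.

```python
import math
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Node:
    keys: list
    children: list = field(default_factory=list)  # empty for a leaf node


def find_leaf(root, k):
    """Walk from the root to the leaf that would hold key k.

    Returns the leaf and the number of nodes visited on the way.
    """
    node, visited = root, 1
    while node.children:
        # bisect_right gives the position of the smallest key > k, or
        # len(keys) if there is none; that is the index of the pointer to follow.
        node = node.children[bisect_right(node.keys, k)]
        visited += 1
    return node, visited


leaves = [
    Node(["Brighton", "Downtown"]),
    Node(["Mianus"]),
    Node(["Perryridge"]),
    Node(["Redwood", "Round Hill"]),
]
root = Node(["Perryridge"], [Node(["Mianus"], leaves[:2]), Node(["Redwood"], leaves[2:])])


def max_fanout(block_bytes, key_bytes, pointer_bytes):
    """Largest n such that n pointers and n - 1 keys fit in one block."""
    return (block_bytes + key_bytes) // (key_bytes + pointer_bytes)


def max_path_length(num_keys, n):
    """Upper bound on root-to-leaf path length: ceil(log base ceil(n/2) of K)."""
    return math.ceil(math.log(num_keys, math.ceil(n / 2)))
```

For a 4096-byte block with 12-byte keys and 8-byte pointers, `max_fanout` gives 205, in line with "around 200", and `max_path_length(1_000_000, 100)` gives 4.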

11.3.3 Updates on B+ -Trees

1. Insertions and Deletions:
Insertion and deletion are more complicated, as they may require splitting or combining nodes to keep the
tree balanced. If splitting or combining are not required, insertion works as follows:
• Find the leaf node where the search key value should appear.
• If the value is present, add the new record to the bucket.
• If the value is not present, insert the value in the leaf node (so that search keys are still in order),
then create a new bucket and insert the new record.
If splitting or combining are not required, deletion works as follows:
• Find the record to be deleted, and remove it from the bucket.
• If the bucket is now empty, remove the search key value from the leaf node.
2. Insertions Causing Splitting:
When insertion causes a leaf node to be too large, we split that node. In Figure 11.7 (textbook Fig. 11.8), assume we wish to
insert a record with a bname value of "Clearview".
• There is no room for it in the leaf node where it should appear.
• We now have $n$ values (the $n - 1$ search key values plus the new one we wish to insert).
• We put the first $\lceil n/2 \rceil$ values in the existing node, and the remainder into a new node.
• Figure 11.10 shows the result.
• The new node must be inserted into the B+-tree.
• We also need to update search key values for the parent (or higher) nodes of the split leaf node (except
if the new node is the leftmost one).
• Order must be preserved among the search key values in each node.
• If the parent was already full, it will have to be split.
• When a non-leaf node is split, the children are divided among the two new nodes.
• In the worst case, splits may be required all the way up to the root. (If the root is split, the tree becomes
one level deeper.)
• Note: when we start a B+-tree, we begin with a single node that is both the root and a single leaf.
When it gets full and another insertion occurs, we split it into two leaf nodes, requiring a new root.
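The leaf-level part of a splitting insertion can be sketched as below. Only the leaf is handled; inserting the new node into the parent, and any further splits, are left out. The function name `insert_into_leaf` is invented here, and returning the smallest key of the new node as the value handed to the parent is an assumption of the sketch.

```python
import math
from bisect import insort


def insert_into_leaf(leaf_keys, new_key, n):
    """Insert new_key into a sorted leaf that may hold at most n - 1 keys.

    Returns (left_keys, right_keys, separator). When no split is needed,
    right_keys and separator are None.
    """
    keys = list(leaf_keys)
    insort(keys, new_key)            # keep the search keys in order
    if len(keys) <= n - 1:
        return keys, None, None      # the key fits, no split
    half = math.ceil(n / 2)          # first ceil(n/2) values stay in the old node
    left, right = keys[:half], keys[half:]
    return left, right, right[0]     # right[0] is passed up to the parent
```

With n = 3, inserting "Clearview" into the leaf holding Brighton and Downtown leaves Brighton and Clearview in the existing node and moves Downtown to the new one.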

3. Deletions Causing Combining:


Deleting records may cause tree nodes to contain too few pointers. Then we must combine nodes.
• If we wish to delete "Downtown" from the B+-tree of Figure 11.11, this occurs.
• In this case, the leaf node is empty and must be deleted.
• If we wish to delete "Perryridge" from the B+-tree of Figure 11.11, the parent is left with only one
pointer, and must be coalesced with a sibling node.
• Sometimes higher-level nodes must also be coalesced.
• If the root becomes empty as a result, the tree is one level less deep (Figure 11.13).
• Sometimes the pointers must be redistributed to keep the tree balanced.
• Deleting "Perryridge" from Figure 11.11 produces Figure 11.14.
4. To summarize:
• Insertion and deletion are complicated, but require relatively few operations.
• The number of operations required for insertion and deletion is proportional to the logarithm of the number of
search keys.
• B+-trees are fast as index structures for databases.

11.3.4 B+ -Tree File Organization


1. The B+-tree structure is used not only as an index but also as an organizer for records in a file.
2. In a B+-tree file organization, the leaf nodes of the tree store records instead of storing pointers to records,
as shown in Fig. 11.17.
3. Since records are usually larger than pointers, the maximum number of records that can be stored in a leaf
node is less than the maximum number of pointers in a nonleaf node.
4. However, the leaf nodes are still required to be at least half full.
5. Insertion and deletion in a B+-tree file organization are handled in the same way as in a B+-tree
index.
6. When a B+-tree is used for file organization, space utilization is particularly important. We can improve the
space utilization by involving more sibling nodes in redistribution during splits and merges.
7. In general, if $m$ nodes are involved in redistribution, each node can be guaranteed to contain at least
$\lceil (m - 1)n/m \rceil$ entries. However, the cost of update becomes higher as more siblings are involved in redistribution.

(a) Leaf node:     | P_1 | K_1 | P_2 | ... | P_{n-1} | K_{n-1} | P_n |
(b) Nonleaf node:  | P_1 | B_1 | K_1 | P_2 | B_2 | K_2 | ... | P_{n-1} | B_{n-1} | K_{n-1} | P_n |

Figure 11.8: Leaf and nonleaf node of a B-tree.

11.4 B-Tree Index Files


1. B-tree indices are similar to B+-tree indices.
• The difference is that a B-tree eliminates the redundant storage of search key values.
• In the B+-tree of Figure 11.11, some search key values appear twice.
• A corresponding B-tree of Figure 11.18 allows search key values to appear only once.
• Thus we can store the index in less space.
2. Advantages:
• Lack of redundant storage (but only marginally different).
• Some searches are faster (the key may be in a non-leaf node).
3. Disadvantages:
• Leaf and non-leaf nodes are of different sizes (complicates storage).
• Deletion may occur in a non-leaf node (more complicated).
Generally, the structural simplicity of the B+-tree is preferred.

11.5 Static Hashing


1. Index schemes force us to traverse an index structure. Hashing avoids this.

11.5.1 Hash File Organization

1. Hashing involves computing the address of a data item by computing a function on the search key value.
2. A hash function $h$ is a function from the set of all search key values $K$ to the set of all bucket addresses $B$.
• We choose a number of buckets to correspond to the number of search key values we will have stored in
the database.
• To perform a lookup on a search key value $K_i$, we compute $h(K_i)$, and search the bucket with that
address.
• If two search keys $K_i$ and $K_j$ map to the same address, because $h(K_i) = h(K_j)$, then the bucket at the
address obtained will contain records with both search key values.
• In this case we will have to check the search key value of every record in the bucket to get the ones we
want.
• Insertion and deletion are simple.


Hash Functions
1. A good hash function gives an average-case lookup that is a small constant, independent of the number of
search keys.
2. We hope records are distributed uniformly among the buckets.
3. The worst hash function maps all keys to the same bucket.
4. The best hash function maps all keys to distinct addresses.
5. Ideally, distribution of keys to addresses is uniform and random.
6. Suppose we have 26 buckets, and map names beginning with the ith letter of the alphabet to the ith bucket.
• Problem: this does not give a uniform distribution.
• Many more names will be mapped to "A" than to "X".
• Typical hash functions perform some operation on the internal binary machine representations of characters in a key.
• For example, compute the sum, modulo the number of buckets, of the binary representations of the characters of the
search key.
• See Figure 11.18, using this method for 10 buckets (assuming the ith character in the alphabet is
represented by integer i).
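The letter-sum hash described in the last two bullets is easy to try out. In this sketch the function names `h` and `build` are invented, and ignoring case and non-letter characters (such as the space in "Round Hill") is an assumption.

```python
def h(key, buckets=10):
    """Sum of letter positions (a = 1, ..., z = 26) modulo the number of buckets.

    Case is ignored and characters outside a-z are skipped (an assumption).
    """
    total = sum(ord(c) - ord("a") + 1 for c in key.lower() if "a" <= c <= "z")
    return total % buckets


def build(names, buckets=10):
    """Distribute names over a list of buckets using h."""
    table = [[] for _ in range(buckets)]
    for name in names:
        table[h(name, buckets)].append(name)
    return table


table = build(["Brighton", "Downtown", "Mianus", "Perryridge", "Redwood", "Round Hill"])
```

Under these assumptions Brighton and Round Hill both land in bucket 3, so a lookup for either must check the search key of every record in that bucket.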

Handling of bucket overflows


1. Open hashing occurs where records are stored in different buckets. Compute the hash function and search
the corresponding bucket to find a record.
2. Closed hashing occurs where all records are stored in one bucket. The hash function computes addresses within
that bucket. (Deletions are difficult.) Not used much in database applications.
3. Drawback to our approach: The hash function must be chosen at implementation time.
• The number of buckets is fixed, but the database may grow.
• If the number is too large, we waste space.
• If the number is too small, we get too many "collisions", resulting in records of many search key values
being in the same bucket.
• Choosing the number to be twice the number of search key values in the file gives a good space/performance
tradeoff.

11.5.2 Hash Indices


1. A hash index organizes the search keys with their associated pointers into a hash file structure.
2. We apply a hash function on a search key to identify a bucket, and store the key and its associated pointers
in the bucket (or in overflow buckets).
3. Strictly speaking, hash indices are only secondary index structures, since if a file itself is organized using
hashing, there is no need for a separate hash index structure on it.

11.6 Dynamic Hashing


1. As the database grows over time, we have three options:
• Choose a hash function based on the current file size. Performance degrades as the file grows.
• Choose a hash function based on the anticipated file size. Space is wasted initially.
• Periodically re-organize the hash structure as the file grows. This requires selecting a new hash function, recomputing
all addresses and generating new bucket assignments. It is costly, and shuts down the database.

Figure 11.9: General extendable hash structure. (A bucket address table, indexed by an $i$-bit hash prefix with entries such as 00, 01, 10, 11, points to buckets 1, 2, 3, ...; bucket $j$ stores its own prefix length $i_j$.)

2. Some hashing techniques allow the hash function to be modified dynamically to accommodate the growth or
shrinking of the database. These are called dynamic hash functions.
• Extendable hashing is one form of dynamic hashing.
• Extendable hashing splits and coalesces buckets as database size changes.
• This imposes some performance overhead, but space efficiency is maintained.
• As reorganization is on one bucket at a time, overhead is acceptably low.
3. How does it work?
• We choose a hash function that is uniform and random and that generates values over a relatively large
range.
• The range is $b$-bit binary integers (typically $b = 32$).
• $2^{32}$ is over 4 billion, so we don't generate that many buckets!
• Instead we create buckets on demand, and do not use all $b$ bits of the hash initially.
• At any point we use $i$ bits where $0 \leq i \leq b$.
• The $i$ bits are used as an offset into a table of bucket addresses.
• The value of $i$ grows and shrinks with the database.
• Figure 11.9 (textbook 11.19) shows an extendable hash structure.
• Note that the $i$ appearing over the bucket address table tells how many bits are required to determine
the correct bucket.
• It may be the case that several entries point to the same bucket.
• All such entries will have a common hash prefix, but the length of this prefix may be less than $i$.
• So we give each bucket an integer giving the length of the common hash prefix.
• This is shown in Figure 11.9 (textbook 11.19) as $i_j$.
• The number of bucket address table entries pointing to bucket $j$ is then $2^{(i - i_j)}$.
4. To find the bucket containing search key value $K_l$:
• Compute $h(K_l)$.

• Take the first $i$ high order bits of $h(K_l)$.
• Look at the corresponding table entry for this $i$-bit string.
• Follow the bucket pointer in the table entry.
5. We now look at insertions in an extendable hashing scheme.
• Follow the same procedure as for lookup, ending up in some bucket $j$.
• If there is room in the bucket, insert the information and insert the record in the file.
• If the bucket is full, we must split the bucket, and redistribute the records.
• If the bucket is split we may need to increase the number of bits we use in the hash.
6. Two cases exist:
  1. If $i = i_j$, then only one entry in the bucket address table points to bucket $j$.
  • Then we need to increase the size of the bucket address table so that we can include pointers to the two
  buckets that result from splitting bucket $j$.
  • We increment $i$ by one, thus considering more of the hash, and doubling the size of the bucket address
  table.
  • Each entry is replaced by two entries, each containing the original value.
  • Now two entries in the bucket address table point to bucket $j$.
  • We allocate a new bucket $z$, and set the second pointer to point to $z$.
  • Set $i_j$ and $i_z$ to $i$.
  • Rehash all records in bucket $j$, which are put in either $j$ or $z$.
  • Now insert the new record.
  • It is remotely possible, but unlikely, that the new hash will still put all of the records in one bucket.
  • If so, split again and increment $i$ again.
  2. If $i > i_j$, then more than one entry in the bucket address table points to bucket $j$.
  • Then we can split bucket $j$ without increasing the size of the bucket address table (why?).
  • Note that all entries that point to bucket $j$ correspond to hash prefixes that have the same value on the
  leftmost $i_j$ bits.
  • We allocate a new bucket $z$, and set $i_j$ and $i_z$ to the original $i_j$ value plus 1.
  • Now adjust the entries in the bucket address table that previously pointed to bucket $j$.
  • Leave the first half pointing to bucket $j$, and make the rest point to bucket $z$.
  • Rehash each record in bucket $j$ as before.
  • Reattempt the new insert.
7. Note that in both cases we only need to rehash records in bucket $j$.
8. Deletion of records is similar. Buckets may have to be coalesced, and the bucket address table may have to be
halved.
9. Insertion is illustrated for the example deposit file of Figure 11.20.
• 32-bit hash values on bname are shown in Figure 11.21.
• An initial empty hash structure is shown in Figure 11.22.
• We insert records one by one.
• We (unrealistically) assume that a bucket can only hold 2 records, in order to illustrate both situations
described.


• As we insert the Perryridge and Round Hill records, this first bucket becomes full.
• When we insert the next record (Downtown), we must split the bucket.
• Since $i = i_0$, we need to increase the number of bits we use from the hash.
• We now use 1 bit, allowing us $2^1 = 2$ buckets.
• This makes us double the size of the bucket address table to two entries.
• We split the bucket, placing the records whose search key hash begins with 1 in the new bucket, and
those with a 0 in the old bucket (Figure 11.23).
• Next we attempt to insert the Redwood record, and find it hashes to 1.
• That bucket is full, and $i = i_1$.
• So we must split that bucket, increasing the number of bits we must use to 2.
• This necessitates doubling the bucket address table again to four entries (Figure 11.24).
• We rehash the entries in the old bucket.
• We continue on for the deposit records of Figure 11.20, obtaining the extendable hash structure of Figure
11.25.

10. Advantages:
• Extendable hashing provides performance that does not degrade as the file grows.
• Minimal space overhead: no buckets need be reserved for future use. The bucket address table only contains
one pointer for each hash value of the current prefix length.
11. Disadvantages:
• Extra level of indirection in the bucket address table.
• Added complexity.
12. Summary: A highly attractive technique, provided we accept added complexity.
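The lookup and insertion procedure above, including both split cases, can be sketched as a small in-memory structure. This is an illustration under stated assumptions, not the textbook's implementation: buckets hold plain keys rather than records, `zlib.crc32` stands in for the 32-bit hash function (so the hash values differ from those of Figure 11.21), and deletion with coalescing is omitted. The names `ExtendableHash`, `Bucket` and `crc_hash` are invented here.

```python
import zlib

HASH_BITS = 32  # b: number of bits produced by the hash function


def crc_hash(key):
    """32-bit hash of a string (CRC-32 is a stand-in, not the notes' function)."""
    return zlib.crc32(key.encode("utf-8"))


class Bucket:
    def __init__(self, depth):
        self.depth = depth  # i_j: length of the hash prefix shared by all keys here
        self.keys = []


class ExtendableHash:
    def __init__(self, capacity=2, hash_fn=crc_hash):
        self.capacity = capacity   # keys per bucket
        self.hash_fn = hash_fn
        self.i = 0                 # number of hash bits currently used
        self.table = [Bucket(0)]   # bucket address table with 2**i entries

    def _slot(self, key):
        # The first i high-order bits of the hash select the table entry.
        return self.hash_fn(key) >> (HASH_BITS - self.i)

    def lookup(self, key):
        return key in self.table[self._slot(key)].keys

    def insert(self, key):
        bucket = self.table[self._slot(key)]
        while len(bucket.keys) >= self.capacity:
            if bucket.depth == HASH_BITS:
                raise OverflowError("bucket full and no hash bits left")
            self._split(bucket)
            bucket = self.table[self._slot(key)]
        bucket.keys.append(key)

    def _split(self, bucket):
        if bucket.depth == self.i:
            # Case 1 (i = i_j): double the table; each entry becomes two copies.
            self.table = [b for b in self.table for _ in range(2)]
            self.i += 1
        # Now i > i_j, so at least two entries point to the bucket (case 2).
        bucket.depth += 1
        new_bucket = Bucket(bucket.depth)
        slots = [s for s, b in enumerate(self.table) if b is bucket]
        for s in slots[len(slots) // 2:]:
            self.table[s] = new_bucket  # second half of the entries -> new bucket
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:              # rehash only the keys of the split bucket
            self.table[self._slot(k)].keys.append(k)


eh = ExtendableHash(capacity=2)
names = ["Brighton", "Downtown", "Mianus", "Perryridge", "Redwood", "Round Hill"]
for name in names:
    eh.insert(name)
```

After any sequence of insertions the table has $2^i$ entries and exactly $2^{(i - i_j)}$ of them point to bucket $j$, which is the invariant stated in item 3.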

11.7 Comparison of Indexing and Hashing


1. To make a wise choice between the methods seen, the database designer must consider the following issues:
• Is the cost of periodic re-organization of the index or hash structure acceptable?
• What is the relative frequency of insertion and deletion?
• Is it desirable to optimize average access time at the expense of increasing worst-case access time?
• What types of queries are users likely to pose?
2. The last issue is critical to the choice between indexing and hashing. If most queries are of the form
    select A1, A2, ..., An
    from r
    where Ai = c
then to process this query the system will perform a lookup on an index or hash structure for attribute Ai
with value c.
3. For these sorts of queries a hashing scheme is preferable.
• Index lookup takes time proportional to the log of the number of values in r for Ai.
• A hash structure provides an average lookup time that is a small constant (independent of database size).
4. However, the worst case favors indexing:
• Hash worst case gives time proportional to the number of values in r for Ai.
• Index worst case is still the log of the number of values in r.
5. Index methods are preferable where a range of values is specified in the query, e.g.
    select A1, A2, ..., An
    from r
    where Ai <= c2 and Ai >= c1
This query finds records with Ai values in the range from c1 to c2.
6. Using an index structure, we can find the bucket for value c1, and then follow the pointer chain to read
the next buckets in alphabetic (or numeric) order until we find c2.
• If we have a hash structure instead of an index, we can find a bucket for c1 easily, but it is not easy to
find the "next bucket".
• A good hash function assigns values randomly to buckets.
• Also, each bucket may be assigned many search key values, so we cannot chain them together.
• To support range queries using a hash structure, we need a hash function that preserves order.
• For example, if K1 and K2 are search key values and K1 < K2 then h(K1) < h(K2).
• Such a function would ensure that buckets are in key order.
• Order-preserving hash functions that also provide randomness and uniformity are extremely difficult to
find.
• Thus most systems use indexing in preference to hashing unless it is known in advance that range queries
will be infrequent.
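The difference for range queries can be seen in a small sketch. A sorted list stands in for the ordered index and a dictionary of buckets for the hash structure. The balance values are those of the deposit file, the modulo-7 hash is an arbitrary choice, and all names are illustrative.

```python
from bisect import bisect_left, bisect_right

balances = [750, 500, 600, 700, 400, 900, 700, 700, 350]

# Ordered index: the key values in sorted order.
ordered = sorted(balances)


def range_via_index(c1, c2):
    """Locate c1, then read forward in key order until passing c2."""
    lo = bisect_left(ordered, c1)
    hi = bisect_right(ordered, c2)
    return ordered[lo:hi]


# Hash structure: neighbouring key values land in unrelated buckets.
buckets = {}
for value in balances:
    buckets.setdefault(value % 7, []).append(value)


def range_via_hash(c1, c2):
    """There is no 'next bucket', so every bucket has to be examined."""
    return sorted(v for bucket in buckets.values() for v in bucket if c1 <= v <= c2)
```

Both functions return the same values, but the indexed version touches only the entries inside the range, while the hashed version has to scan all buckets.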

11.8 Index Definition in SQL


1. Some SQL implementations include data definition commands to create and drop indices. The IBM SAA-SQL commands are:
• An index is created by
    create index <index-name>
    on r (<attribute-list>)
• The attribute list is the list of attributes in relation r that form the search key for the index.
• To create an index on bname for the branch relation:
    create index b-index
    on branch (bname)
• If the search key is a candidate key, we add the word unique to the definition:
    create unique index b-index
    on branch (bname)
• If bname is not a candidate key, an error message will appear.
• If the index creation succeeds, any attempt to insert a tuple violating this requirement will fail.
• The unique keyword is redundant if primary keys have been defined with integrity constraints already.
2. To remove an index, the command is
    drop index <index-name>
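The effect of these commands can be tried with Python's built-in `sqlite3` module. SQLite is not the IBM SAA-SQL dialect, so this only illustrates the same statements in a different system. The one-column `branch` table is a stand-in, and the index is named `b_index` because an unquoted hyphen is not valid in an SQLite identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table branch (bname text)")  # one-column stand-in relation
conn.executemany(
    "insert into branch values (?)", [("Perryridge",), ("Brighton",)]
)

# bname is a candidate key for the current rows, so the unique index is accepted.
conn.execute("create unique index b_index on branch (bname)")

# An insert that would duplicate a bname value now violates the index.
try:
    conn.execute("insert into branch values (?)", ("Perryridge",))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

index_query = "select name from sqlite_master where type = 'index' and tbl_name = 'branch'"
index_names = [row[0] for row in conn.execute(index_query)]

conn.execute("drop index b_index")
remaining = [row[0] for row in conn.execute(index_query)]
```

The duplicate insert fails with an integrity error while the index exists, and after `drop index` the table has no indices left.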

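Figure 11.10: Grid structure for deposit file. (One axis of the grid is indexed by bname values such as Brighton, Downtown, Mianus and Perryridge, the other by the cname values Green, Hayes, Johnson, Lyle, Peterson, Smith, Turner and Williams; each cell holds pointers to records of the deposit file.)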

11.9 Multiple-Key Access


1. For some queries, it is advantageous to use multiple indices if they exist.
2. If there are two indices on deposit, one on bname and one on balance, then suppose we have a query like

    select balance
    from deposit
    where bname = "Perryridge" and balance = 1000

3. There are 3 possible strategies to process this query:
• Use the index on bname to find all records pertaining to the Perryridge branch. Examine them to see if
balance = 1000.
• Use the index on balance to find all records with a balance of 1000. Examine them to see if bname =
"Perryridge".
• Use the index on bname to find pointers to records pertaining to the Perryridge branch. Use the index on balance
to find pointers to records with a balance of 1000. Take the intersection of these two sets of pointers.
4. The third strategy takes advantage of the existence of multiple indices. This may still not work well if
• There are a large number of Perryridge records AND
• There are a large number of records with a balance of 1000 AND
• Only a small number of records pertain to both Perryridge and 1000.
5. To speed up multiple search key queries, special structures can be maintained.
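The third strategy, intersecting two sets of pointers, can be sketched as follows. The example queries balance = 700 rather than 1000 because no record in the sample deposit file has a balance of 1000. Pointers are list positions, and `build_index` is a name invented for the sketch.

```python
from collections import defaultdict

# Deposit records: (bname, account number, balance).
DEPOSIT = [
    ("Brighton", 217, 750),
    ("Downtown", 101, 500),
    ("Downtown", 110, 600),
    ("Mianus", 215, 700),
    ("Perryridge", 102, 400),
    ("Perryridge", 201, 900),
    ("Perryridge", 218, 700),
    ("Redwood", 222, 700),
    ("Round Hill", 305, 350),
]


def build_index(records, field):
    """Map each value of the given field to the set of record positions."""
    index = defaultdict(set)
    for pos, rec in enumerate(records):
        index[rec[field]].add(pos)
    return index


bname_index = build_index(DEPOSIT, 0)
balance_index = build_index(DEPOSIT, 2)

# Intersect the two pointer sets, then fetch only the surviving records.
pointers = bname_index["Perryridge"] & balance_index[700]
result = [DEPOSIT[pos] for pos in sorted(pointers)]
```

Each index returns three pointers here, but only one record is fetched from the file. With many Perryridge records and many records of the chosen balance, both pointer sets would be large even when the intersection is small, which is the weakness noted in item 4.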

11.9.1 Grid File


1. A grid structure for queries on two search keys is a 2-dimensional grid, or array, indexed by values for the
search keys. Figure 11.10 (textbook 11.31) shows part of a grid structure for the deposit file.
2. A particular entry in the array contains pointers to all records with the specified search key values.
• No special computations need to be done.
• Only the right records are accessed.
• Can also be used for single search key queries (one column or row).
• Easy to extend to queries on n search keys: construct an n-dimensional array.

• Significant improvement in processing time for multiple-key queries.
• Imposes space overhead.
• Performance overhead on insertion and deletion.

11.9.2 Partitioned Hashing


1. An alternative approach to multiple-key queries.
• To construct a structure for queries on deposit involving bname and cname, we construct a hash structure
for the key (cname, bname).
• We split this hash function into two parts, one for each part of the key.
• The first part depends only on the cname value.
• The second part depends only on the bname value.
• Figure 11.32 shows a sample partitioned hash function.
• Note that pairs with the same cname or bname value will have 3 bits the same in the appropriate
position.
• To find the balance in all of Williams' accounts at the Perryridge branch, we compute h(Williams,
Perryridge) and access the hash structure.
2. The same hash structure can be used to answer a query on one of the search keys:
• Compute part of the partitioned hash.
• Access the hash structure and scan the buckets for which that part of the hash coincides.
• The text doesn't say so, but the hash structure must have some grid-like form imposed on it to enable
searching the structure based on only some part of the hash.
3. Partitioned hashing can also be extended to n-key search.
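A partitioned hash with 3 bits per key component can be sketched as below. The component hash (`zlib.crc32` reduced to 3 bits) is an arbitrary stand-in and does not reproduce the values of the textbook figure. The names `part`, `h`, `insert`, `lookup_both` and `lookup_cname` are invented for the sketch.

```python
import zlib

PART_BITS = 3  # bits contributed by each component of the key


def part(value):
    """3-bit hash of one key component (CRC-32 is an arbitrary stand-in)."""
    return zlib.crc32(value.encode("utf-8")) & (2 ** PART_BITS - 1)


def h(cname, bname):
    """Partitioned hash: high 3 bits from cname, low 3 bits from bname."""
    return (part(cname) << PART_BITS) | part(bname)


buckets = {}


def insert(cname, bname, balance):
    buckets.setdefault(h(cname, bname), []).append((cname, bname, balance))


def lookup_both(cname, bname):
    """Both keys known: a single bucket is examined."""
    bucket = buckets.get(h(cname, bname), [])
    return [r for r in bucket if r[0] == cname and r[1] == bname]


def lookup_cname(cname):
    """Only cname known: scan every bucket whose high 3 bits match."""
    prefix = part(cname) << PART_BITS
    found = []
    for low in range(2 ** PART_BITS):
        found += [r for r in buckets.get(prefix | low, []) if r[0] == cname]
    return found


insert("Williams", "Perryridge", 900)
insert("Hayes", "Perryridge", 400)
insert("Green", "Brighton", 750)
```

A query on both keys examines one bucket, while a query on cname alone examines the 8 buckets that share the cname part of the hash, which is the grid-like organization mentioned in item 2.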
