
What is Hashing?

• Hashing is the process of mapping a large amount of data to a smaller table with the help of a
hashing function.
• Hashing is also known as a Hashing Algorithm or Message Digest Function.
• It is a technique to convert a range of key values into a range of indexes of an array.
• It is used to facilitate faster searching when compared with linear or binary
search.
• Hashing allows us to update and retrieve any data entry in constant time O(1).
• Constant time O(1) means the operation does not depend on the size of the data.
• Hashing is used with a database to enable items to be retrieved more quickly.
• It is used in the generation and verification of digital signatures.

What is Hash Function?

• A fixed process that converts a key to a hash key is known as a Hash Function.


• This function takes a key and maps it to a value of a certain length, which is called a Hash
value or Hash.
• The hash value represents the original string of characters, but it is normally smaller than the original.
• In digital signatures, the sender computes the hash value of the message, signs it, and sends both the
hash value and the signature to the receiver. The receiver uses the same hash function to generate the
hash value and then compares it with the one received with the message.
• If the hash values are the same, the message was transmitted without errors.
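
As a hedged illustration of this verification step (the text does not name a concrete hash function, so SHA-256 from Python's hashlib is assumed here, and the message content is made up for the example):

```python
import hashlib

def digest(message: bytes) -> str:
    # Compute a fixed-length hash value for an arbitrary-length message.
    return hashlib.sha256(message).hexdigest()

# Sender computes the hash value and sends it along with the message.
message = b"transfer 100 units to account 42"
sent_hash = digest(message)

# Receiver recomputes the hash over the received message and compares.
received_message = message          # assume the message arrived unmodified
if digest(received_message) == sent_hash:
    print("hash values match: message received without errors")
else:
    print("hash mismatch: message was altered or corrupted")
```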

What is Hash Table?

• A hash table or hash map is a data structure used to store key-value pairs.
• It is a collection of items stored in a way that makes it easy to find them later.
• It uses a hash function to compute an index into an array of buckets or slots, from which the desired
value can be found.
• It is an array of lists, where each list is known as a bucket.
• It contains values based on keys.
• In Java, the Hashtable class implements the Map interface and extends the Dictionary class.
• The Java Hashtable is synchronized and contains only unique keys.

• The above figure shows a hash table of size n = 10. Each position of the hash table is
called a slot. There are n slots in the table, named {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}: slot 0, slot 1,
slot 2 and so on. Initially the hash table contains no items, so every slot is empty.
• As we know, the mapping between an item and the slot where that item belongs in the hash table is called
the hash function. The hash function takes any item in the collection and returns an integer in the
range of slot names, between 0 and n-1.
• Suppose we have the integer items {26, 70, 18, 31, 54, 93}. One common method of determining a hash
key is the division method of hashing, with the formula:

Hash Key = Key Value % Number of Slots in the Table

• The division method (or remainder method) takes an item, divides it by the table size, and returns the
remainder as its hash value.

Data Item    Value % No. of Slots    Hash Value
26           26 % 10 = 6             6
70           70 % 10 = 0             0
18           18 % 10 = 8             8
31           31 % 10 = 1             1
54           54 % 10 = 4             4
93           93 % 10 = 3             3
• After computing the hash values, we can insert each item into the hash table at its designated
position, as shown in the above figure. In the hash table, 6 of the 10 slots are occupied; the ratio of
items to table size is referred to as the load factor and denoted λ = No. of items / table size. Here, λ = 6/10.
• It is easy to search for an item using the hash function: compute the slot name for the item and
then check the hash table to see if it is present.
• A constant amount of time O(1) is required to compute the hash value and to index the hash table at
that location.
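
The table above can be reproduced with a few lines of Python; this is only a minimal sketch, and the names hash_key and table are illustrative:

```python
def hash_key(value: int, table_size: int) -> int:
    # Division (remainder) method: key value modulo number of slots.
    return value % table_size

table_size = 10
table = [None] * table_size          # 10 empty slots, named 0..9

for item in [26, 70, 18, 31, 54, 93]:
    table[hash_key(item, table_size)] = item

load_factor = sum(slot is not None for slot in table) / table_size
print(table)         # [70, 31, None, 93, 54, None, 26, None, 18, None]
print(load_factor)   # 0.6
```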

Linear Probing

• Take the above example: if we insert the next item, 40, into our collection, it would have a hash value of 0
(40 % 10 = 0). But 70 also has a hash value of 0, so this becomes a problem. This problem is called
a Collision or Clash. Collisions create a problem for the hashing technique.
• Linear probing is used for resolving collisions in hash tables, data structures for maintaining a
collection of key-value pairs.
• Linear probing was invented by Gene Amdahl, Elaine M. McGraw and Arthur Samuel in 1954 and
analyzed by Donald Knuth in 1963.
• It is a component of the open addressing scheme for using a hash table to solve the dictionary problem.
• The simplest method is called Linear Probing. The formula to compute the next probe position is:

P = (P + 1) % table_size
For example,

If we insert the next item, 40, into our collection, it would have a hash value of 0 (40 % 10 = 0). But 70 also
has a hash value of 0, so this becomes a problem.

Linear probing solves this problem:

P = H(40) = 40 % 10 = 0
Position 0 is occupied by 70, so we look elsewhere for a position to store 40.

Using Linear Probing:


P = (P + 1) % table_size = (0 + 1) % 10 = 1
But position 1 is occupied by 31, so we look elsewhere for a position to store 40.

Using linear probing, we try the next position: (1 + 1) % 10 = 2


Position 2 is empty, so 40 is inserted there.
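
A minimal Python sketch of linear probing insertion, assuming the division method as the home-slot hash and None for empty slots (the function name linear_probe_insert is illustrative):

```python
def linear_probe_insert(table, key):
    size = len(table)
    p = key % size                   # home slot from the division method
    # Probe (p + 1) % size, (p + 2) % size, ... until an empty slot is found.
    for _ in range(size):
        if table[p] is None:
            table[p] = key
            return p
        p = (p + 1) % size
    raise RuntimeError("hash table is full")

table = [70, 31, None, 93, 54, None, 26, None, 18, None]   # table from the example above
print(linear_probe_insert(table, 40))   # slots 0 and 1 are occupied, so 40 goes to slot 2
```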

Collision in Hashing-

In hashing,
• Hash function is used to compute the hash value for a key.
• Hash value is then used as an index to store the key in the hash table.
• Hash function may return the same hash value for two or more keys.

When the hash value of a key maps to an already occupied bucket of the hash table,
it is called a Collision.

Collision Resolution Techniques-


Collision resolution techniques are the techniques used for resolving or handling collisions.

Collision resolution techniques are classified as-

1. Separate Chaining
2. Open Addressing

First, we will discuss separate chaining.

Separate Chaining-

To handle the collision,


• This technique creates a linked list at the slot for which the collision occurs.
• The new key is then inserted into that linked list.
• These linked lists attached to the slots look like chains.
• That is why this technique is called separate chaining.
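
A minimal Python sketch of a separate-chaining table, using Python lists to stand in for the linked lists (the class name ChainedHashTable and the hash function key % size are illustrative choices):

```python
class ChainedHashTable:
    """Hash table where each bucket holds a chain (list) of keys."""

    def __init__(self, size):
        self.size = size
        self.buckets = [[] for _ in range(size)]    # one chain per bucket

    def insert(self, key):
        # Colliding keys are simply appended to the chain of their bucket.
        self.buckets[key % self.size].append(key)

    def search(self, key):
        # Sequential search inside the chain of the key's bucket.
        return key in self.buckets[key % self.size]

    def delete(self, key):
        bucket = self.buckets[key % self.size]
        if key in bucket:
            bucket.remove(key)
```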

Time Complexity-

For Searching-

• In the worst case, all the keys might map to the same bucket of the hash table.
• In such a case, all the keys will be present in a single linked list.
• A sequential search will then have to be performed on that linked list.
• So, the time taken for searching in the worst case is O(n).

For Deletion-

• In the worst case, the key might have to be searched first and then deleted.
• In the worst case, the time taken for searching is O(n).
• So, the time taken for deletion in the worst case is O(n).

Load Factor (α)-

Load factor (α) is defined as-

α = Number of keys stored in the hash table / Number of buckets in the hash table

If the load factor (α) remains constant, then the time complexity of Insert, Search and Delete is Θ(1).

PRACTICE PROBLEM BASED ON SEPARATE CHAINING-
Problem-

Using the hash function ‘key mod 7’, insert the following sequence of keys in the hash table-
50, 700, 76, 85, 92, 73 and 101

Use separate chaining technique for collision resolution.

Solution-

The given sequence of keys will be inserted in the hash table as-

Step-01:

• Draw an empty hash table.


• For the given hash function, the possible range of hash values is [0, 6].
• So, draw an empty hash table consisting of 7 buckets as-

Step-02:
• Insert the given keys in the hash table one by one.
• The first key to be inserted in the hash table = 50.
• Bucket of the hash table to which key 50 maps = 50 mod 7 = 1.
• So, key 50 will be inserted in bucket-1 of the hash table as-

Step-03:

• The next key to be inserted in the hash table = 700.


• Bucket of the hash table to which key 700 maps = 700 mod 7 = 0.
• So, key 700 will be inserted in bucket-0 of the hash table as-
Step-04:

• The next key to be inserted in the hash table = 76.


• Bucket of the hash table to which key 76 maps = 76 mod 7 = 6.
• So, key 76 will be inserted in bucket-6 of the hash table as-

Step-05:
• The next key to be inserted in the hash table = 85.
• Bucket of the hash table to which key 85 maps = 85 mod 7 = 1.
• Since bucket-1 is already occupied, a collision occurs.
• Separate chaining handles the collision by creating a linked list at bucket-1.
• So, key 85 will be inserted in bucket-1 of the hash table as-

Step-06:

• The next key to be inserted in the hash table = 92.


• Bucket of the hash table to which key 92 maps = 92 mod 7 = 1.
• Since bucket-1 is already occupied, a collision occurs.
• Separate chaining handles the collision by creating a linked list at bucket-1.
• So, key 92 will be inserted in bucket-1 of the hash table as-
Step-07:

• The next key to be inserted in the hash table = 73.


• Bucket of the hash table to which key 73 maps = 73 mod 7 = 3.
• So, key 73 will be inserted in bucket-3 of the hash table as-

Step-08:
• The next key to be inserted in the hash table = 101.
• Bucket of the hash table to which key 101 maps = 101 mod 7 = 3.
• Since bucket-3 is already occupied, a collision occurs.
• Separate chaining handles the collision by creating a linked list at bucket-3.
• So, key 101 will be inserted in bucket-3 of the hash table as-
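
The buckets derived in Steps 02-08 can be reproduced with a short Python sketch that mirrors the ChainedHashTable idea above (plain lists stand in for the chains):

```python
buckets = [[] for _ in range(7)]            # hash function: key mod 7
for key in [50, 700, 76, 85, 92, 73, 101]:
    buckets[key % 7].append(key)            # colliding keys are chained in the same bucket

for index, chain in enumerate(buckets):
    print(index, chain)
# 0 [700]
# 1 [50, 85, 92]
# 2 []
# 3 [73, 101]
# 4 []
# 5 []
# 6 [76]
```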

Open Addressing-

In open addressing,
• Unlike separate chaining, all the keys are stored inside the hash table.
• No key is stored outside the hash table.

Techniques used for open addressing are-


• Linear Probing
• Quadratic Probing
• Double Hashing

Operations in Open Addressing-

Let us discuss how operations are performed in open addressing-


Insert Operation-

• Hash function is used to compute the hash value for a key to be inserted.
• Hash value is then used as an index to store the key in the hash table.

In case of collision,
• Probing is performed until an empty bucket is found.
• Once an empty bucket is found, the key is inserted.
• Probing is performed in accordance with the technique used for open addressing.

Search Operation-

To search any particular key,


• Its hash value is obtained using the hash function.
• Using the hash value, that bucket of the hash table is checked.
• If the required key is found there, the search is successful.
• Otherwise, the subsequent buckets are checked until the required key or an empty bucket is found.
• An empty bucket indicates that the key is not present in the hash table.

Delete Operation-

• The key is first searched and then deleted.


• After deleting the key, that particular bucket is marked as “deleted”.

NOTE-

• During insertion, the buckets marked as “deleted” are treated like any other empty bucket.
• During searching, the search is not terminated on encountering the bucket marked as “deleted”.
• The search terminates only after the required key or an empty bucket is found.
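
A minimal Python sketch of these three operations, assuming linear probing as the probe sequence and a sentinel object as the "deleted" marker (the function names and the sentinel are illustrative choices):

```python
DELETED = object()                  # sentinel marking a bucket as "deleted"

def oa_insert(table, key):
    size = len(table)
    i = key % size
    for _ in range(size):
        # During insertion, "deleted" buckets are treated like empty buckets.
        if table[i] is None or table[i] is DELETED:
            table[i] = key
            return i
        i = (i + 1) % size
    raise RuntimeError("hash table is full")

def oa_search(table, key):
    size = len(table)
    i = key % size
    for _ in range(size):
        if table[i] is None:        # a truly empty bucket: the key is not present
            return None
        if table[i] == key:         # required key found
            return i
        i = (i + 1) % size          # a "deleted" marker does not stop the search
    return None

def oa_delete(table, key):
    i = oa_search(table, key)
    if i is not None:
        table[i] = DELETED          # mark the bucket instead of emptying it
```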

Open Addressing Techniques-

Techniques used for open addressing are-


1. Linear Probing-

In linear probing,
• When collision occurs, we linearly probe for the next bucket.
• We keep probing until an empty bucket is found.

Advantage-

• It is easy to compute.

Disadvantage-

• The main problem with linear probing is clustering.


• Many consecutive elements form groups (clusters).
• It then takes longer to search for an element or to find an empty bucket.

Time Complexity-

The worst-case time to search for an element with linear probing is O(table size).

This is because-
• Even if there is only one element present and all other elements have been deleted,
• the "deleted" markers present in the hash table force the search to scan the entire table.

2. Quadratic Probing-

In quadratic probing,
• When a collision occurs, we probe the i²-th bucket in the i-th iteration.
• We keep probing until an empty bucket is found.
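
A minimal Python sketch of quadratic probing insertion (note that, unlike linear probing, the quadratic probe sequence is not guaranteed to visit every bucket, so this sketch may fail to place a key even when the table is not completely full):

```python
def quadratic_probe_insert(table, key):
    size = len(table)
    home = key % size
    for i in range(size):
        slot = (home + i * i) % size   # probe the i^2-th bucket in the i-th iteration
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no empty bucket found along the probe sequence")
```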

3. Double Hashing-
In double hashing,
• We use another hash function hash2(x) and look for the (i * hash2(x))-th bucket in the i-th iteration.
• It requires more computation time, as two hash functions need to be computed.

Comparison of Open Addressing Techniques-

                              Linear Probing    Quadratic Probing       Double Hashing
Primary Clustering            Yes               No                      No
Secondary Clustering          Yes               Yes                     No
Number of Probe Sequences
(m = size of table)           m                 m                       m²
Cache Performance             Best              Lies between the two    Poor

Conclusions-

• Linear Probing has the best cache performance but suffers from clustering.
• Quadratic probing lies between the two in terms of cache performance and clustering.
• Double hashing has poor cache performance but no clustering.

Load Factor (α)-

Load factor (α) is defined as-

α = Number of keys stored in the hash table / Number of buckets in the hash table

In open addressing, the value of the load factor always lies between 0 and 1.

This is because-
• In open addressing, all the keys are stored inside the hash table.
• So, the size of the table is always greater than or equal to the number of keys stored in the table.

PRACTICE PROBLEM BASED ON OPEN ADDRESSING-

Problem-

Using the hash function ‘key mod 7’, insert the following sequence of keys in the hash table-
50, 700, 76, 85, 92, 73 and 101

Use linear probing technique for collision resolution.

Solution-

The given sequence of keys will be inserted in the hash table as-

Step-01:

• Draw an empty hash table.


• For the given hash function, the possible range of hash values is [0, 6].
• So, draw an empty hash table consisting of 7 buckets as-
Step-02:

• Insert the given keys in the hash table one by one.


• The first key to be inserted in the hash table = 50.
• Bucket of the hash table to which key 50 maps = 50 mod 7 = 1.
• So, key 50 will be inserted in bucket-1 of the hash table as-

Step-03:
• The next key to be inserted in the hash table = 700.
• Bucket of the hash table to which key 700 maps = 700 mod 7 = 0.
• So, key 700 will be inserted in bucket-0 of the hash table as-

Step-04:

• The next key to be inserted in the hash table = 76.


• Bucket of the hash table to which key 76 maps = 76 mod 7 = 6.
• So, key 76 will be inserted in bucket-6 of the hash table as-
Step-05:

• The next key to be inserted in the hash table = 85.


• Bucket of the hash table to which key 85 maps = 85 mod 7 = 1.
• Since bucket-1 is already occupied, a collision occurs.
• To handle the collision, the linear probing technique keeps probing linearly until an empty bucket is
found.
• The first empty bucket is bucket-2.
• So, key 85 will be inserted in bucket-2 of the hash table as-

Step-06:

• The next key to be inserted in the hash table = 92.


• Bucket of the hash table to which key 92 maps = 92 mod 7 = 1.
• Since bucket-1 is already occupied, a collision occurs.
• To handle the collision, the linear probing technique keeps probing linearly until an empty bucket is
found.
• The first empty bucket is bucket-3.
• So, key 92 will be inserted in bucket-3 of the hash table as-
Step-07:

• The next key to be inserted in the hash table = 73.


• Bucket of the hash table to which key 73 maps = 73 mod 7 = 3.
• Since bucket-3 is already occupied, a collision occurs.
• To handle the collision, the linear probing technique keeps probing linearly until an empty bucket is
found.
• The first empty bucket is bucket-4.
• So, key 73 will be inserted in bucket-4 of the hash table as-
Step-08:

• The next key to be inserted in the hash table = 101.


• Bucket of the hash table to which key 101 maps = 101 mod 7 = 3.
• Since bucket-3 is already occupied, a collision occurs.
• To handle the collision, the linear probing technique keeps probing linearly until an empty bucket is
found.
• The first empty bucket is bucket-5.
• So, key 101 will be inserted in bucket-5 of the hash table as-
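
The final table derived in Steps 02-08 can be reproduced with a short, self-contained Python sketch (a stripped-down version of the linear probing insertion shown earlier, which assumes the table is not full):

```python
def linear_probe_insert(table, key):
    # Probe linearly from the home slot until an empty bucket is found.
    i = key % len(table)
    while table[i] is not None:
        i = (i + 1) % len(table)
    table[i] = key

table = [None] * 7                          # hash function: key mod 7
for key in [50, 700, 76, 85, 92, 73, 101]:
    linear_probe_insert(table, key)

print(table)                                # [700, 50, 85, 92, 73, 101, 76]
```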

Definition of Perfect Hashing


Perfect hashing is defined as a model of hashing in which any set of n elements can be stored in
a hash table of linear size and can have lookups performed in constant worst-case time. It was
introduced by Fredman, Komlós and Szemerédi (1984) and has therefore been
nicknamed "FKS hashing".

Definition of Static Hashing


Static hashing refers to a form of the hashing problem in which lookups are performed on a
finalized dictionary set (that is, all objects in the dictionary are final and do not change).

Application
Since static hashing requires that the database, its objects and its references remain the same, its
applications are limited. Databases containing information that changes only rarely are also eligible,
as a full rehash of the whole database would be required only on rare occasions. Examples of this
hashing scheme include sets of words and definitions of specific languages, sets of significant data
for an organization's personnel, etc.

Implementation
In the static case, we are given a set with a total of p entries, each associated with a unique
key, ahead of time. Fredman, Komlós and Szemerédi select a first-level hash table with s =
2(p-1) buckets. To construct it, the p entries are separated into q buckets by the top-level hash
function, where q = 2(p-1). Then, for each bucket with r entries, a second-level table is allocated
with r² slots, and its hash function is chosen at random from a universal hash function family so that
it is collision-free, and it is stored alongside the hash table. If the randomly chosen hash function
creates a table with collisions, a new hash function is chosen at random until a collision-free table
can be guaranteed. Finally, with the collision-free hash function, the r entries are hashed into the
second-level table.
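
A rough Python sketch of this construction, under some stated assumptions: the universal family is taken to be h(x) = ((a·x + b) mod P) mod m with a fixed prime P (the text does not pin one down), the helper names are illustrative, and the top-level retry that a full FKS implementation uses to keep the total second-level space linear is omitted for brevity:

```python
import random

P = 10**9 + 7                              # a prime larger than any key value (assumption)

def random_universal_hash(m):
    # Draw h(x) = ((a*x + b) mod P) mod m from a universal family.
    a = random.randrange(1, P)
    b = random.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

def all_placed(bucket, h, slots):
    # Try to place every key of the bucket without collision.
    for k in bucket:
        j = h(k)
        if slots[j] is not None:
            return False
        slots[j] = k
    return True

def build_static_table(keys):
    p = len(keys)
    s = 2 * (p - 1) if p > 1 else 1        # first-level size 2(p-1), as in the text
    top = random_universal_hash(s)
    buckets = [[] for _ in range(s)]
    for k in keys:
        buckets[top(k)].append(k)

    second = []
    for bucket in buckets:
        r = len(bucket)
        size = r * r                       # second-level table with r^2 slots
        while True:                        # retry until a collision-free function is found
            h = random_universal_hash(size) if size else None
            slots = [None] * size
            if all_placed(bucket, h, slots):
                second.append((h, slots))
                break
    return top, second

def lookup(structure, key):
    top, second = structure
    h, slots = second[top(key)]
    return bool(slots) and slots[h(key)] == key

keys = [50, 700, 76, 85, 92, 73, 101]      # any static key set
structure = build_static_table(keys)
print(all(lookup(structure, k) for k in keys))   # True
print(lookup(structure, 40))                     # False
```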

Definition of Dynamic Perfect Hashing
Dynamic perfect hashing is defined as a programming method for resolving collisions in a hash
table data structure.

Application
While more memory-intensive than its hash table counterparts, this method is ideal for situations
where fast queries, insertions, and deletions must be performed on a large set of elements.

Implementation
Dietzfelbinger et al. describe a dynamic dictionary algorithm in which, as a set of m items is
incrementally added to the dictionary, membership queries always take constant time (O(1)
worst-case), the total storage required is O(m) (linear), and insertions and deletions take O(1)
expected amortized time. In the dynamic case, when a key is inserted into the hash table and its
entry in the respective sub-table is already occupied, a collision occurs and the sub-table is rebuilt
based on its new total entry count and a newly chosen random hash function. Because the load factor
of the second-level tables is kept low, rebuilding is infrequent, and the amortized expected cost of
insertions as well as deletions is O(1).
Additionally, in the dynamic case the ultimate size of the top-level table or of any of the sub-tables
is not known in advance. One technique for maintaining the expected O(m) space of the table is to
trigger a full rebuild when a sufficient number of insertions and deletions have occurred. As long
as a full rehash is performed whenever the total number of insertions or deletions exceeds the number
of elements at the time of the last construction, the amortized expected cost of insertion and
deletion remains O(1).
Perfect Hashing – How it Works
Perfect hashing is used when the keys stored in the hash table are expected to be static. In this
case perfect hashing guarantees excellent average-case as well as worst-case performance.
This is how real dictionaries work: most of the time you just need to read from a
dictionary.

Some application of perfect hashing includes:

• data storage on a CD ROM


• set of reserved words in a programming language

This hashing technique is called perfect hashing because it takes constant time, O(1)
to search for an item in the table.

1. Overview of Perfect Hashing


2. Choosing a hash function
3. Estimating expected collision
4. Analysis of Perfect Hashing

1. Overview of Perfect Hashing


Perfect hashing is implemented using two hash tables, one at each level. Each of the
tables uses universal hashing. The first level is the same as hashing with chaining: the
n elements are hashed into m slots in the hash table. This is done using a hash
function selected from a universal family of hash functions.

The second level uses a second hash table (instead of the linked list used in chaining).
Elements that hash to the same slot j in the first hash table are stored in a second hash
table, known as the secondary hash table. The hash function hj is carefully chosen so
that there are no collisions in the secondary table.
To ensure there are no collisions in the secondary hash table Sj, we make the
size mj of the secondary table equal to the square of the number of keys hashing into
slot j in the first table. That is:

mj = nj²

The hash functions for the primary hash table are carefully chosen so that the
expected total amount of space used is limited to O(n).

2. Choosing Hash Functions


The first-level hash function is chosen from the universal hash family ℋp,m, where p
is a prime number larger than any key value. The keys that hash into slot j of the
first hash table are placed into a secondary hash table Sj using a hash function
hj, which is chosen from another universal family ℋp,mj. The size of the
secondary hash table is mj.

The objective is to achieve a situation where there are no collisions in the secondary
hash table, or at least no expected collisions. Let us examine the expected number of
colliding elements.

3. Estimating Expected Collisions


Theorem: Assume n keys are stored in a hash table of size m = n² using a hash
function h which is chosen randomly from a universal family of hash functions. Then
the probability that there is any collision is less than 1/2.

Proof: There are (n choose 2) = n(n-1)/2 pairs of keys that may collide. Each pair collides with a
probability of 1/m if h is chosen randomly from a universal family ℋ of hash
functions. We use a random variable X to count the number of collisions.

Let E(X) = expected number of collisions. Then

E(X) = (n choose 2) · 1/m = (n(n-1)/2) · (1/n²) < 1/2

By Markov's inequality, the probability of at least one collision is at most E(X), which is less
than 1/2. It then follows that a hash function h chosen randomly from the family ℋ has
a high likelihood (probability greater than 1/2) of having no collisions.

4. Analysis of Perfect Hashing


If the number of keys n is small, we can get away with using a hash table of size m =
n² and choosing a random hash function from ℋ. But if n is large, m = n² would be
too expensive. Therefore, we resort to two-level hashing:

In the first level, a hash function h is used to hash the n keys into m slots, where m = n.
The number of keys hashing to slot j is then nj.

In the second level, we use a secondary hash table of size mj = nj², which ensures that
no collision occurs (given that the table is static).

DOUBLE HASHING
In this section we will see what the Double Hashing technique in the open addressing scheme is. There is
an ordinary hash function h′(x) : U → {0, 1, . . ., m – 1}. In the open addressing scheme, the actual
hash function h(x, i) starts from the ordinary hash function h′(x); when that slot is not empty, a
second hash function is used to determine the step size for finding a free slot to insert into.
h1(x) = x mod m
h2(x) = x mod m′
h(x, i) = (h1(x) + i · h2(x)) mod m
The value of i = 0, 1, . . ., m – 1. We start from i = 0 and increase it until we find a free
slot. Initially, when i = 0, h(x, i) is the same as h1(x).

Example
Suppose we have a table of size 20 (m = 20). We want to insert some elements using double hashing.
The elements are {96, 48, 63, 29, 87, 77, 48, 65, 69, 94, 61}.

h1(x) = x mod 20
h2(x) = x mod 13

For each element x, the probe sequence is h(x, i) = (h1(x) + i · h2(x)) mod 20 for i = 0, 1, 2, ...,
until an empty slot in the hash table is found.
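
A rough Python sketch of these insertions; note that the text's h2(x) = x mod 13 can evaluate to 0 (for example for key 65), which would never advance the probe, so the sketch falls back to a step of 1 in that case (an assumption not stated in the text):

```python
m, m_prime = 20, 13
table = [None] * m

def double_hash_insert(key):
    h1, h2 = key % m, key % m_prime
    step = h2 if h2 != 0 else 1          # assumption: avoid a zero step
    for i in range(m):
        slot = (h1 + i * step) % m       # h(x, i) = (h1(x) + i*h2(x)) mod m
        if table[slot] is None:
            table[slot] = key
            return slot
    return None                          # no free slot reached along this probe sequence

for x in [96, 48, 63, 29, 87, 77, 48, 65, 69, 94, 61]:
    print(x, "->", double_hash_insert(x))
# 96->16, 48->8, 63->3, 29->9, 87->7, 77->17, 48->6, 65->5, 69->13, 94->14, 61->1
```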
