Lecture 15
• PCY Pass 1:
(a) Count occurrences of all items.
(b) For each basket, consisting of items {i1, ..., ik}, hash each pair of its items to a bucket of the hash table, and
increment the count of that bucket by 1.
(c) At the end of the pass, determine L1, the items with counts at least minsup.
(d) Also at the end, determine the frequent buckets: those with counts at least minsup.
• Key point: a pair (i, j) cannot be frequent unless it hashes to a frequent bucket, so pairs that hash to
other buckets need not be candidates in C2.
(e) Replace the hash table by a bitmap, with one bit per bucket: 1 if the bucket was frequent (count
>= minsup), 0 if not.
PCY Algorithm ---Pass 1
FOR (each basket) {
    FOR (each item in the basket)
        add 1 to the item's count;
    FOR (each pair of items in the basket) {
        hash the pair to a bucket;
        add 1 to the count for that bucket;
    }
}
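The pass-1 loop above can be sketched in runnable form. This is a minimal illustration, not a full implementation: the basket list reuses the example later in the lecture, while the hash-table size (NUM_BUCKETS = 11) and the use of Python's built-in hash are arbitrary, assumed choices.

```python
from collections import defaultdict
from itertools import combinations

# Example baskets (same as the milk/coke/pepsi/beer/juice example below).
baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "p", "b", "j"}, {"c", "b", "j"}, {"p", "b"}]
NUM_BUCKETS = 11   # assumed hash-table size; in practice as large as memory allows
minsup = 3

item_counts = defaultdict(int)
bucket_counts = [0] * NUM_BUCKETS

for basket in baskets:
    for item in basket:
        item_counts[item] += 1                  # count individual items
    for pair in combinations(sorted(basket), 2):
        bucket = hash(pair) % NUM_BUCKETS       # hash the pair to a bucket
        bucket_counts[bucket] += 1              # count the bucket, not the pair

# L1 = the frequent items.
L1 = {i for i, c in item_counts.items() if c >= minsup}
```

Note that only O(NUM_BUCKETS) integers are kept for pairs, no matter how many distinct pairs occur; which bucket a given pair lands in depends on the (salted) built-in hash, but the counting logic does not.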
PCY Algorithm ---Between Passes
(a) Replace the bucket counts by a bit-vector:
• 0 means the bucket did not have a count of at least minsup; 1 means it did.
(b) Note that a (say) 32-bit integer count is replaced by 1 bit, so the bit-vector requires little second-pass space.
(c) Also, decide which items are frequent and list them for the second pass.
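A minimal sketch of this compression step, using a Python integer as the bit-vector; the bucket counts here are hypothetical stand-ins for pass-1 output.

```python
minsup = 3
bucket_counts = [5, 0, 3, 1, 7, 2]   # hypothetical pass-1 bucket counts

# Compress each (say 32-bit) count down to a single bit:
# bit b is 1 iff bucket b reached minsup.
bitmap = 0
for b, count in enumerate(bucket_counts):
    if count >= minsup:
        bitmap |= 1 << b

def bucket_is_frequent(b):
    """Test bit b of the bit-vector."""
    return (bitmap >> b) & 1 == 1
```

On pass 2 only `bitmap` is kept in memory; the full `bucket_counts` array is discarded.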
• PCY Pass 2:
(a) Main memory holds a list of all the frequent items, i.e., L1.
(b) Main memory also holds the bitmap summarizing the results of the hashing from pass 1.
• Key point: the buckets must use 16 or 32 bits for a count, but these are compressed to 1 bit each. Thus,
even if the hash table occupied almost the entire main memory on pass 1, its bitmap occupies no
more than 1/16 of main memory on pass 2.
(c) Finally, main memory also holds a table with all the candidate pairs and their counts. A pair (i, j) can
be a candidate in C2 only if all of the following are true:
• i is in L1.
• j is in L1.
• (i, j) hashes to a frequent bucket.
It is the last condition that distinguishes PCY from straight A-Priori and reduces the memory
requirements in pass 2.
(d) During pass 2, we consider each basket and each pair of its items, making the test outlined above. If
a pair meets all three conditions, add to its count in memory, or create an entry for it if one does not yet
exist.
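The three-condition test can be sketched as follows. L1, the frequent-bucket bitmap, and the bucket hash are assumed to come from pass 1; the tiny basket list and the all-True bucket array here are hypothetical stand-ins so the sketch runs on its own.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical pass-1 results, for illustration only.
baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "d"}]
L1 = {"a", "b", "c"}                      # frequent items from pass 1
NUM_BUCKETS = 5
def bucket(pair): return hash(pair) % NUM_BUCKETS
frequent_bucket = [True] * NUM_BUCKETS    # pretend every bucket passed minsup

pair_counts = defaultdict(int)
for basket in baskets:
    # Conditions 1 and 2: both members of the pair must be in L1.
    for i, j in combinations(sorted(basket & L1), 2):
        # Condition 3: the pair must hash to a frequent bucket.
        if frequent_bucket[bucket((i, j))]:
            pair_counts[(i, j)] += 1
```

In this toy run the item d is filtered out by the L1 test, so no pair involving d is ever counted; with a realistic bitmap, the bucket test would prune further pairs.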
PCY Algorithm ---Pass 2
• Count all pairs {i, j} that meet the conditions:
1. Both i and j are frequent items.
2. The pair {i, j} hashes to a bucket whose bit in the bit-vector is 1.
• Notice these conditions are necessary for the pair to have a chance of being frequent.
When does PCY beat A-Priori?
When there are too many pairs of items from L1 to fit a table of candidate pairs and their counts in main
memory, yet the number of frequent buckets in the PCY algorithm is sufficiently small that it reduces
the size of C2 below what can fit in memory (even with 1/16 of memory given over to the bitmap).
When will most of the buckets be infrequent in PCY?
When there are a few frequent pairs, but most pairs are so infrequent that even when the counts of all the
pairs that hash to a given bucket are added, they are still unlikely to sum to minsup or more.
Example:
• Items = {milk, coke, pepsi, beer, juice}.
• minsup = 3 baskets.
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, p, b, j}
B7 = {c, b, j} B8 = {b, p}
• Frequent itemsets: {m}, {c}, {b}, {p}, {j}, {m, b}, {m, p}, {b, p}.
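The frequent itemsets listed above can be verified by direct counting of every item and pair (a brute-force check like A-Priori's second pass, not PCY itself):

```python
from collections import Counter
from itertools import combinations

baskets = [{"m", "c", "b"}, {"m", "p", "j"}, {"m", "b"}, {"c", "j"},
           {"m", "p", "b"}, {"m", "p", "b", "j"}, {"c", "b", "j"}, {"p", "b"}]
minsup = 3

# Count every item and every pair directly.
item_counts = Counter(i for b in baskets for i in b)
pair_counts = Counter(p for b in baskets for p in combinations(sorted(b), 2))

frequent_items = {i for i, c in item_counts.items() if c >= minsup}
frequent_pairs = {p for p, c in pair_counts.items() if c >= minsup}
```

The check confirms the list: all five items are frequent, and exactly the pairs {m, b}, {m, p}, and {b, p} reach minsup, while, e.g., {c, j} occurs in only 2 baskets.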
Matrix Representation
• Columns = items.
• Rows = baskets.
• Entry (r, c) = 1 if item c is in basket r; = 0 if not.
• Assume the matrix is almost all 0's.
In Matrix Form
                  m c p b j
B1 = {m,c,b}      1 1 0 1 0
B2 = {m,p,j}      1 0 1 0 1
B3 = {m,b}        1 0 0 1 0
B4 = {c,j}        0 1 0 0 1
B5 = {m,p,b}      1 0 1 1 0
B6 = {m,p,b,j}    1 0 1 1 1
B7 = {c,b,j}      0 1 0 1 1
B8 = {p,b}        0 0 1 1 0
Similarity of Columns
• Think of a column as the set of rows in which it has a 1.
• The similarity of columns C1 and C2, sim(C1, C2), is the ratio of the size of the intersection of C1
and C2 to the size of their union (the Jaccard similarity).
• Our goal of finding correlated columns becomes that of finding similar columns.
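A minimal sketch of this similarity, using the m and b columns of the matrix above, each written as the set of baskets (row numbers) in which the column has a 1:

```python
# Columns as sets of row numbers (baskets B1..B8 containing the item).
col_m = {1, 2, 3, 5, 6}          # baskets containing m
col_b = {1, 3, 5, 6, 7, 8}       # baskets containing b

def sim(c1, c2):
    """Jaccard similarity: |intersection| / |union|."""
    return len(c1 & c2) / len(c1 | c2)
```

Here the columns intersect in 4 rows and their union has 7 rows, so sim(col_m, col_b) = 4/7.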