- (25 pts.)
Imagine there are 100 baskets, numbered 1,2,...,100, and 100 items,
similarly numbered.
Item i is in basket j if and only if i divides
j evenly.
For example, basket 24 is the set of items {1,2,3,4,6,8,12,24}.
Answer the following:
- a)
-
What is the support for the set of items {3,4}?
- b)
-
If the support threshold is 10, what are all the frequent items?
- c)
-
What triple of items has the highest support? (Think carefully --- your
first instinct may not be right.)
- d)
-
Are there any association rules that have 100% confidence, but do not
have item 1 on the right?
Justify your answer.
- (25 pts.)
Suppose items are numbered 0,1,...,n-1, where n is the
number of items.
We mentioned in our discussion of the a-priori algorithm that it was
possible to create an array of counts for pairs, so that given pair
{i,j}, it is possible to calculate easily the array entry for
this pair.
This technique allows us to devote only 4 bytes per pair in main memory
to a count for that pair (assuming 4 bytes/integer is sufficient).
Give a method for assigning array elements to pairs, such that one can
figure out the index of the element for a pair, from the two item
numbers (integers) in the pair. Note that a pair must be associated
with the same array element regardless of the order in which the pair's
two integers are presented, and the array must not have wasted space;
i.e., each array element is assigned to a unique pair.
- (25 pts.)
Suppose we wish to find frequent pairs of items using the a-priori algorithm
under the following assumptions:
- s, the support threshold, is 10,000.
- There are one million items, which are represented by the integers
0,1,...,999999.
- There are 25,000 frequent items, that is, items that occur 10,000
times or more.
- There are one million pairs that occur 10,000 times or more.
- There are one billion pairs that occur exactly once.
Half of these pairs consist of two frequent items, the other half each
have at least one nonfrequent item.
- No other pairs occur at all.
- Integers are always represented by 4 bytes.
Estimate the minimum amount of main memory needed to run the a-priori
algorithm successfully.
Show your reasoning.
You should begin with an estimate of the memory needed for the first
pass.
Then, consider how counts for pairs should be organized in the second
pass.
Do you want to renumber the frequent items and use the technique of
Problem 2?
If so, explain how the translation of numbers is stored, and how much
space it takes.
Or would you prefer to maintain a count for only those pairs of frequent
items that actually occur in the data?
If so, explain how you will do so, and how much space it takes.
- (25 pts.)
Repeat Question (3) for the PCY algorithm.
You may assume that when you hash pairs on the first pass:
- Frequent pairs each hash to a different bucket.
- Infrequent pairs divide exactly evenly among buckets, including
those that contain a frequent pair.
- Each infrequent pair (in particular, a pair that consists of two
frequent items) is equally likely to belong to any bucket.
You should explain your reasoning.
For example:
- a)
-
How many buckets will you use on the first pass, and how much space do
they take?
- b)
-
How many candidate pairs will there be on the second pass?
- c)
-
How do you store and lookup counts of candidate pairs on the second
pass?
- d)
-
How much space is needed to maintain those counts?
- e)
-
What other space is needed on the second pass?
Note: you do not have to answer the above as if they were five parts to
a question.
However, you should think about each of these issues, and others, as part of your
overall response to the question.
Also, observe that the more buckets you use on the first pass, the more
space you need, but the fewer candidates you have on the second pass.
The minimum space utilzation will occur when the space needs on the two
passes are equal.