INSTRUCTIONS: Write the answers in the designated spaces. Your answers must be well organized and well written. Write your name on all pages. Use PENCIL ONLY.

 
QUESTIONS

  • Assume the data provided below is discretized into categorical values; the categories of each feature have the same significance. The distance between data points Pi and Pj is d(i,j) = 1 − (m/N), where m and N denote the number of matches and the number of features, respectively. The features are Age (A), Salary (S), Number of movies watched/month (N), and Size of family (F).

 

  • Obtain the contingency tables and calculate the dissimilarity DisSim(i,j) and the similarity Sim(i,j) between data points Pi and Pj.

        A  S  N  F
    P1  1  2  2  2
    P2  3  1  3  1
    P3  2  2  2  2

 

Contingency table for P1 vs P3:

DisSim(1,3) = ______
Sim(1,3) = ______

 

Contingency table for P2 vs P3:

DisSim(2,3) = ______
Sim(2,3) = ______

 

Contingency table for P1 vs P2:

DisSim(1,2) = ______
Sim(1,2) = ______
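The matching-based measures above can be checked with a minimal Python sketch of the definition d(i,j) = 1 − m/N; it assumes, consistent with that definition, that Sim(i,j) = m/N.

```python
# Simple matching over categorical features: m = number of matching
# features, N = number of features; DisSim = 1 - m/N, Sim = m/N.
points = {
    "P1": (1, 2, 2, 2),  # feature order: A, S, N, F
    "P2": (3, 1, 3, 1),
    "P3": (2, 2, 2, 2),
}

def dissim(p, q):
    m = sum(a == b for a, b in zip(p, q))  # number of matches
    return 1 - m / len(p)

for i, j in [("P1", "P3"), ("P2", "P3"), ("P1", "P2")]:
    d = dissim(points[i], points[j])
    print(f"DisSim({i},{j}) = {d:.2f}, Sim({i},{j}) = {1 - d:.2f}")
```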

 

 
 
 
 
 
 
 
 

  • Calculate the proximity Prx(Pi, Pj) between data points based on the supremum distance measure for the same data set, assuming the values are numeric (not categorical). Which data points are the most similar? Why? Show your work.
Prx(Pi, Pj) P1 P2 P3
P1      
P2      
P3      
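Treating the points as numeric vectors, the supremum (L∞) distance is the maximum absolute per-feature difference; a short sketch filling the Prx matrix:

```python
# Supremum (L-infinity / Chebyshev) distance over the same data points,
# now treated as numeric vectors.
points = {
    "P1": (1, 2, 2, 2),
    "P2": (3, 1, 3, 1),
    "P3": (2, 2, 2, 2),
}

def supremum(p, q):
    return max(abs(a - b) for a, b in zip(p, q))

names = list(points)
for i in names:
    print(i, [supremum(points[i], points[j]) for j in names])
```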

 
 

  • Assume that each data point {P1, P2, P3} represents a transaction. Convert these transactions, which are in the horizontal data format, into the vertical data format. For an “attribute=value”, such as “A=1”, use the format “A1” in the vertical data format. Add columns as needed.

 

ITEM                  
TID_SET                  
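The horizontal-to-vertical conversion asked for above can be sketched as follows, with item names following the “A1” convention from the question:

```python
# Vertical data format: map each item "AttrValue" to the set of TIDs
# (transaction IDs) that contain it.
from collections import defaultdict

transactions = {  # horizontal format: TID -> attribute values
    "P1": {"A": 1, "S": 2, "N": 2, "F": 2},
    "P2": {"A": 3, "S": 1, "N": 3, "F": 1},
    "P3": {"A": 2, "S": 2, "N": 2, "F": 2},
}

vertical = defaultdict(set)
for tid, attrs in transactions.items():
    for attr, val in attrs.items():
        vertical[f"{attr}{val}"].add(tid)  # "A=1" becomes item "A1"

for item in sorted(vertical):
    print(item, sorted(vertical[item]))
```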

 
 
 
 
 
 
 
 
 
 

  • Given samples below,
  • Find clusters using the K-medoids algorithm (medoid = point closest to the mean). The initial seed points are P1 for cluster-1 (“o”) and P10 for cluster-2 (“”). Use Manhattan distance. Use ceiling for decimal values. In case of equal distance, keep the point in its current cluster. Show your calculations only for the first iteration. Plot the points with their clusters (use a curvy line) and mark the centroid point with “*” at each iteration; use the charts provided. Start clustering on the first chart below.

 

       P1  P2  P3  P4  P5  P6  P7  P8  P9  P10
    X   5  15  20  25  30  30  35  40  40  60
    Y  30  30  20  40  35  50  25  30  45  50
[Chart: grid for plotting points P1–P10 and their clusters]
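The first-iteration assignment can be checked with a short sketch. Only the assignment step is shown; for these seeds no point is equidistant from both medoids, so the keep-in-current-cluster tie rule never fires.

```python
# K-medoids, first assignment step: each point goes to the nearer of the
# two seed medoids (P1 and P10) under Manhattan distance.
X = [5, 15, 20, 25, 30, 30, 35, 40, 40, 60]
Y = [30, 30, 20, 40, 35, 50, 25, 30, 45, 50]
pts = {f"P{i + 1}": (X[i], Y[i]) for i in range(10)}

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

seeds = {"cluster-1": pts["P1"], "cluster-2": pts["P10"]}
clusters = {name: min(seeds, key=lambda s: manhattan(p, seeds[s]))
            for name, p in pts.items()}
for name, c in clusters.items():
    print(name, c)
```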

 
 

  • Find clusters using the hierarchical algorithm. Use Complete Link (MAX) as the inter-cluster proximity measure and Manhattan distance. Use ceiling for decimal values. Show the clusters with a curvy line at each iteration on the charts below; name them with “K” for cluster designation, such as “K1” for cluster-1. Calculate the proximity matrix values for each iteration and enter them into the table below; when a cluster is in equal proximity to others, merge with the one of larger size. Start clustering on the first chart below.
[Chart: grid for plotting points P1–P10 and their clusters]

 

[Tables for the proximity matrix MAX(i,j), one per iteration]
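The agglomeration can be sketched as below. The exam's tie rule (merge with the larger cluster) is not implemented; `min()` simply takes the first pair found, so merge order may differ on ties, though the merge distances match.

```python
# Agglomerative clustering with Complete Link (MAX) proximity and
# Manhattan distance over the same ten points.
X = [5, 15, 20, 25, 30, 30, 35, 40, 40, 60]
Y = [30, 30, 20, 40, 35, 50, 25, 30, 45, 50]
pts = [(X[i], Y[i]) for i in range(10)]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def max_link(c1, c2):  # Complete Link: farthest pair across the two clusters
    return max(manhattan(pts[i], pts[j]) for i in c1 for j in c2)

clusters = [[i] for i in range(10)]
merges = []
while len(clusters) > 1:
    pairs = [(a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))]
    a, b = min(pairs, key=lambda ab: max_link(clusters[ab[0]], clusters[ab[1]]))
    d = max_link(clusters[a], clusters[b])
    merges.append((clusters[a], clusters[b], d))
    clusters = ([c for k, c in enumerate(clusters) if k not in (a, b)]
                + [clusters[a] + clusters[b]])

for left, right, d in merges:
    print([f"P{i+1}" for i in left], "+", [f"P{i+1}" for i in right], "at", d)
```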

 
 

3)  Assume that you have a data set represented by three (3) attributes A, B, C. The value categories for each attribute are V(A) = {a1, a2}, V(B) = {b1, b2, b3}, and V(C) = {c1, c2}; note that an item (attribute=value), e.g. “A=1”, is represented by “A1”. The distribution of the transactions is given in Table-1.
 

  • Fill in the NOT-shaded empty cells in Table-1. Show how to calculate each value.

 
Table-1: The distribution of the transactions

A    Count(ai)        P(ai)       B    Count(ai,bj)        P(ai,bj)        C    Count(ai,bj,ck)    P(ai,bj,ck)
a1   Count(a1)=____   P(a1)=____  b1   Count(a1,b1)=____   P(a1,b1)=____   c1   ____               0.05
a1                                b1                                       c2   ____               0.10
a1                                b2   Count(a1,b2)=____   P(a1,b2)=____   c1   ____               0.20
a1                                b2                                       c2   ____               0.05
a1                                b3   Count(a1,b3)=____   P(a1,b3)=____   c1   ____               0.30
a1                                b3                                       c2   ____               0.10
a2   Count(a2)=____   P(a2)=____  b1   Count(a2,b1)=____   P(a2,b1)=____   c1   ____               0.02
a2                                b1                                       c2   ____               0.03
a2                                b2   Count(a2,b2)=____   P(a2,b2)=____   c1   ____               0.01
a2                                b2                                       c2   ____               0.04
a2                                b3   Count(a2,b3)=____   P(a2,b3)=____   c1   ____               0.05
a2                                b3                                       c2   ____               0.05
TOTAL                                                                          200                1.00
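The non-shaded cells follow from Count(·) = P(·) × 200 (the TOTAL row gives 200 transactions), with the marginal probabilities obtained by summing the joint P(ai,bj,ck) cells. A sketch:

```python
# Counts from probabilities: with N = 200 transactions, Count = P * N;
# marginals are sums over the joint P(ai, bj, ck) from Table-1.
N = 200
joint = {
    ("a1", "b1", "c1"): 0.05, ("a1", "b1", "c2"): 0.10,
    ("a1", "b2", "c1"): 0.20, ("a1", "b2", "c2"): 0.05,
    ("a1", "b3", "c1"): 0.30, ("a1", "b3", "c2"): 0.10,
    ("a2", "b1", "c1"): 0.02, ("a2", "b1", "c2"): 0.03,
    ("a2", "b2", "c1"): 0.01, ("a2", "b2", "c2"): 0.04,
    ("a2", "b3", "c1"): 0.05, ("a2", "b3", "c2"): 0.05,
}

def P(a=None, b=None):
    """Marginal probability: sum joint cells that match a and/or b."""
    return sum(p for (ai, bj, _), p in joint.items()
               if (a is None or ai == a) and (b is None or bj == b))

for a in ("a1", "a2"):
    print(f"Count({a}) = {round(P(a) * N)}, P({a}) = {P(a):.2f}")
    for b in ("b1", "b2", "b3"):
        print(f"  Count({a},{b}) = {round(P(a, b) * N)}, "
              f"P({a},{b}) = {P(a, b):.2f}")
```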

 
 
 
 
 
 

  • Calculate the support of each item. Then provide the results in the table below. Add column(s) as you need.
ITEM                  
Support(ITEM)                  
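Each item's support can be read off the joint distribution by summing every cell that contains the item; a sketch, assuming support is expressed as a fraction of the 200 transactions:

```python
# Support of each single item: sum the joint probabilities of every
# Table-1 cell that contains the item.
joint = {
    ("a1", "b1", "c1"): 0.05, ("a1", "b1", "c2"): 0.10,
    ("a1", "b2", "c1"): 0.20, ("a1", "b2", "c2"): 0.05,
    ("a1", "b3", "c1"): 0.30, ("a1", "b3", "c2"): 0.10,
    ("a2", "b1", "c1"): 0.02, ("a2", "b1", "c2"): 0.03,
    ("a2", "b2", "c1"): 0.01, ("a2", "b2", "c2"): 0.04,
    ("a2", "b3", "c1"): 0.05, ("a2", "b3", "c2"): 0.05,
}

support = {}
for key, p in joint.items():
    for item in key:
        name = item.upper()  # "a1" -> item name "A1"
        support[name] = support.get(name, 0.0) + p

for item in sorted(support):
    print(item, round(support[item], 2))
```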

 
 
 
 

  • Given a minimum support of 23%, use the closure property to find all frequent itemsets. Draw a lattice tree to show the generated combinations; use the lattice-tree format given. The nodes must include the item and the corresponding support value S(*).

 
 

Item-1  S(Item-1)
    Item-2  S(Item-1, Item-2)
    Item-K  S(Item-1, Item-K)

Item-2  S(Item-2)
    Item-3  S(Item-2, Item-3)
    Item-K  S(Item-2, Item-K)
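The level-wise generation can be sketched as below: (k+1)-candidates are built only as unions of frequent k-itemsets, per the closure property. The subset-based pruning step is omitted for brevity; on this small data set the support filter alone gives the same result.

```python
# Apriori-style frequent-itemset generation at min support 23%.
joint = {
    ("a1", "b1", "c1"): 0.05, ("a1", "b1", "c2"): 0.10,
    ("a1", "b2", "c1"): 0.20, ("a1", "b2", "c2"): 0.05,
    ("a1", "b3", "c1"): 0.30, ("a1", "b3", "c2"): 0.10,
    ("a2", "b1", "c1"): 0.02, ("a2", "b1", "c2"): 0.03,
    ("a2", "b2", "c1"): 0.01, ("a2", "b2", "c2"): 0.04,
    ("a2", "b3", "c1"): 0.05, ("a2", "b3", "c2"): 0.05,
}

def support(itemset):
    return sum(p for key, p in joint.items()
               if set(itemset) <= {k.upper() for k in key})

MIN_SUP = 0.23
items = ["A1", "A2", "B1", "B2", "B3", "C1", "C2"]
level = [frozenset([i]) for i in items if support([i]) >= MIN_SUP]
frequent = list(level)
while level:
    # join step: unions of frequent k-itemsets that form (k+1)-itemsets
    candidates = {a | b for a in level for b in level
                  if len(a | b) == len(a) + 1}
    level = [c for c in candidates if support(c) >= MIN_SUP]
    frequent += level

for f in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(f), f"S = {support(f):.2f}")
```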

 
 
 
 
 
 
 
 

3.4)   Assume S = {A2, B3, C2} is a frequent 3-itemset. Find the frequent 2-itemsets derived from the set S.
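By the downward-closure (Apriori) property, every subset of a frequent itemset is itself frequent, so the frequent 2-itemsets are simply the 2-element subsets of S; a one-line sketch:

```python
# Downward closure: all 2-element subsets of a frequent 3-itemset
# are themselves frequent.
from itertools import combinations

S = {"A2", "B3", "C2"}
two_itemsets = [set(c) for c in combinations(sorted(S), 2)]
print(two_itemsets)
```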

 
 
 

  • What would be the largest minimum-support value, considering rules with 2- and 3-itemsets only?

 
 
 

  • Given the rule “C1 → A2, B3”,
    • What is the local frequency of observing the consequent given the antecedent is observed?

 
 
 
 

  • Compare this local frequency against the global frequency of the consequent. What do you think about the degree of association between the antecedent and the consequent?

 
 
 

  • Discuss the interestingness of the rule based on:
3.6.1.      Lift(C1 → A2, B3)

 
 
 
3.6.2.      Lift(C1 → NOT {A2, B3})
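Both quantities in 3.6 can be computed from Table-1: the “local frequency” of the consequent given the antecedent is the rule's confidence, and lift compares the joint support against independence. A sketch:

```python
# Confidence and lift of the rule C1 -> A2, B3 from the Table-1 joint
# distribution, plus the lift of C1 -> NOT {A2, B3}.
joint = {
    ("a1", "b1", "c1"): 0.05, ("a1", "b1", "c2"): 0.10,
    ("a1", "b2", "c1"): 0.20, ("a1", "b2", "c2"): 0.05,
    ("a1", "b3", "c1"): 0.30, ("a1", "b3", "c2"): 0.10,
    ("a2", "b1", "c1"): 0.02, ("a2", "b1", "c2"): 0.03,
    ("a2", "b2", "c1"): 0.01, ("a2", "b2", "c2"): 0.04,
    ("a2", "b3", "c1"): 0.05, ("a2", "b3", "c2"): 0.05,
}

def support(items):
    return sum(p for key, p in joint.items()
               if items <= {k.upper() for k in key})

p_c1 = support({"C1"})            # P(antecedent)
p_a2b3 = support({"A2", "B3"})    # P(consequent): its global frequency
p_both = support({"C1", "A2", "B3"})

confidence = p_both / p_c1        # local frequency of the consequent
lift = p_both / (p_c1 * p_a2b3)   # lift < 1 indicates negative association
lift_not = (p_c1 - p_both) / (p_c1 * (1 - p_a2b3))  # Lift(C1 -> NOT {A2,B3})
print(f"conf = {confidence:.3f}, lift = {lift:.3f}, lift_not = {lift_not:.3f}")
```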
 
 
ANSWERS