Tutorial on High Performance Data Mining

icon

30

pages

icon

English

icon

Documents

Écrit par

Publié par

Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres

icon

30

pages

icon

English

icon

Documents

Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres

High Performance Data MiningChapter 4: Association RulesVipin KumarArmy High Performance Computing Research CenterDepartment of Computer ScienceUniversity of Minnesota http://www.cs.umn.edu/~kumar© R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 1Chapter 4: Algorithms for Association Rules DiscoveryOutline Serial Association Rule Discovery– Definition and Complexity.– Apriori Algorithm. Parallel Algorithms– Need– Count Distribution, Data Distribution– Intelligent Data Distribution, Hybrid Distribution– Experimental Results© R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 2Association Rule Discovery: Support and ConfidenceTID Items1 Bread, Milk2 Beer, Diaper, Bread, Eggs3 Beer, Coke, Diaper, Milk4 Beer, Bread, Diaper, Milk5 Coke, Bread, Diaper, MilkExample:Association Rule:X ys,{Diaper, Milk} Beers, (X y) (Diaper, Milk, Beer) 2Support:s (s P(X, y))s 0.4| T |Total Number of Transactions 5 (X y) (Diaper,Milk,Beer)Confidence: ( P(y | X)) 0.66 (X) | (Diaper,Milk) |© R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 3Handling Exponential Complexity Given n transactions and m different items:m1O(m2 )– number of possible association rules: mO(nm2 )– ...
Voir icon arrow

Publié par

Langue

English

iHhgP reofmrance Data Mi
Chapter 4: Association Rules
Vipin Kumar
nin
Army High Performance Computing Research Center Department of Computer Science University of Minnesota
http://www.cs.umn.edu/~kumar
g
© R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 1
Chapter 4: Algorithms for Association Rules Discovery
Outline Serial Association Rule Discovery Definition and Complexity. Apriori Algorithm. Parallel Algorithms Need Count Distribution, Data Distribution Intelligent Data Distribution, Hybrid Distribution Experimental Results
© R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 2
Association Rule Discovery: Support and Confidence
TID 1 2 3 4 5
Items Bread, Milk Beer, Dia er, Bread, E s Beer, Coke, Dia er, Milk Beer, Bread, Dia er, Milk Coke, Bread, Dia er, Milk
Association Rule:Xs,y Support:s|X(T(|)ysP(X, y)) Confidence:((XX)|(y)P(y | X))
Example: {Diaper, Milk}s,Beer s 0.4 2(Diaper, Milk, Beer)    Total Number of Transactions 5 (Diaper, Milk, Beer)0.66  |(Diaper, Milk)
© R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 3
Handling Exponential Complexity
iG nevrantacnsontians :smte intrefeifmdd number of possible association rules:O(m2m1) computation complexity:O(nm2m)
Systematic search for all patterns, based on support constraint[Agarwal & Srikant]: If {A,B} has support at least, then both A and B have support at least If either A or B has support less than, then {A,B} has support less than. Use patterns ofn-1 items to find patterns ofnitems.
© R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 4
Apriori Principle
Collect single item counts. Find large items. adpnrci gr ieF str st,h  cpoauni>d altaeeamn d=i of items. Find candidate triplets, count them => large tripletsof items, And so on... diuG gninirPlpic uentfreqf a teo ussbre y:evE itemset has to be frequent.  candidates.Used for pruning many
© R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 5
Illustrating Apriori Principle
Item Bread Coke Milk Beer Diaper Eggs
CountItems (1-itemsets) 4 2 4 3 4 1
Minimum Support = 3
Itemset Bread,Milk Bread,Beer {Bread,Diaper} {Milk,Beer} Milk,Dia er Beer,Dia er
If every subset is considered, 6C1+6C2+6C3= 41 With support-based pruning, 6 + 6 + 2 = 14   
Count 3 2 3 2 3 3
Pairs (2-itemsets)
Triplets (3-itemsets)
ItemsetCount {Bread,Milk,Diaper} 3 {Milk,Diaper,Beer} 2
© R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 6
Apriori Algorithm
F1= {frequent 1-item sets}; k = 2; while( Fk-1is not empty ) { Ck= Apriori_generate( Fk-1); for all transactions t in T { Subset( Ck, t ); } Fk= { c in Cks.t. c.count >= minimum_support};
} Answer = union of all sets Fk;
© R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 7
Association Rule Discovery: Apriori generate _
Apriori_generate( F(k-1) ) { join Fk-1with Fk-1 such that,  c1= (i1, i2, .. , ik-1) and c2= (j1, j2, .. , jk-1) join together if  ip= jpfor 1 <= p <= k-1, and then new candidate, c, has a form  c = (i1,i2,..,ik-1, jk-1). c is then added to ahash-treestructure.
}
© R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 8
Counting Candidates
Frequent Itemsets are found by counting candidates. Simple way: Search for each candidate in each transaction. Expensive!!!
Transactions
N
M
Candidates
© R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 9
Association Rule Discovery: Hash tree for fast access.
Hash Function
1,4,7
2,5,8
3,6,9
1 4 5
1 2 4 4 5 7
1 2 5 4 5 8
Candidate Hash Tree
1 3 6
1 5 9
2 3 4 5 6 7
3 4 5
3 5 6 3 5 7 6 8 9
3 6 7 3 6 8
© R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 10
Association Rule Discovery: Subset Operation
1 + 2 3 5 6
1 4 5
1 2 3 5 6transaction
2 3 4 5 6 7
1 3 6 3 4 5
2 + 3 5 6
3 5 6 3 5 7 6 8 9
3 + 5 6
3 6 7 3 6 8
Hash Function
1,4,7 2,5,8
3,6,9
1 2 4 1 2 5 1 5 9  4 5 7 4 5 8 © R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 11
Voir icon more
Alternate Text