Tutorial on High Performance Data Mining

pages

English

Documents

Écrit par
Mjoshi

Publié par
Thuwyer

Lire

Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres

pages

English

Documents

Lire

Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres

Publié par

Thuwyer

Nombre de lectures

Langue

English

High Performance Data MiningChapter 4: Association RulesVipin KumarArmy High Performance Computing Research CenterDepartment of Computer ScienceUniversity of Minnesota http://www.cs.umn.edu/~kumar© R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 1Chapter 4: Algorithms for Association Rules DiscoveryOutline Serial Association Rule Discovery– Definition and Complexity.– Apriori Algorithm. Parallel Algorithms– Need– Count Distribution, Data Distribution– Intelligent Data Distribution, Hybrid Distribution– Experimental Results© R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 2Association Rule Discovery: Support and ConfidenceTID Items1 Bread, Milk2 Beer, Diaper, Bread, Eggs3 Beer, Coke, Diaper, Milk4 Beer, Bread, Diaper, Milk5 Coke, Bread, Diaper, MilkExample:Association Rule:X ys,{Diaper, Milk} Beers, (X y) (Diaper, Milk, Beer) 2Support:s (s P(X, y))s 0.4| T |Total Number of Transactions 5 (X y) (Diaper,Milk,Beer)Confidence: ( P(y | X)) 0.66 (X) | (Diaper,Milk) |© R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 3Handling Exponential Complexity Given n transactions and m different items:m1O(m2 )– number of possible association rules: mO(nm2 )– ...

Voir

Publié par

Thuwyer

Langue

English

iHhgP reofmrance Data Mi

Chapter 4: Association Rules

Vipin Kumar

nin

Army High Performance Computing Research Center Department of Computer Science University of Minnesota

http://www.cs.umn.edu/~kumar

Chapter 4: Algorithms for Association Rules Discovery

Outline Serial Association Rule Discovery Definition and Complexity. Apriori Algorithm. Parallel Algorithms Need Count Distribution, Data Distribution Intelligent Data Distribution, Hybrid Distribution Experimental Results

Association Rule Discovery: Support and Confidence

TID 1 2 3 4 5

Items Bread, Milk Beer, Dia er, Bread, E s Beer, Coke, Dia er, Milk Beer, Bread, Dia er, Milk Coke, Bread, Dia er, Milk

Association Rule:Xs,y Support:s|X(T(|)ysP(X, y)) Confidence:((XX)|(y)P(y | X))

Example: {Diaper, Milk}s,Beer s 0.4 2(Diaper, Milk, Beer)    Total Number of Transactions 5 (Diaper, Milk, Beer)0.66  |(Diaper, Milk)

Handling Exponential Complexity

iG nevrantacnsontians :smte intrefeifmdd number of possible association rules:O(m2m1) computation complexity:O(nm2m)

Systematic search for all patterns, based on support constraint[Agarwal & Srikant]: If {A,B} has support at least, then both A and B have support at least If either A or B has support less than, then {A,B} has support less than. Use patterns ofn-1 items to find patterns ofnitems.

Apriori Principle

Collect single item counts. Find large items. adpnrci gr ieF str st,h cpoauni>d altaeeamn d=i of items. Find candidate triplets, count them => large tripletsof items, And so on... diuG gninirPlpic uentfreqf a teo ussbre y:evE itemset has to be frequent.  candidates.Used for pruning many

Illustrating Apriori Principle

Item Bread Coke Milk Beer Diaper Eggs

CountItems (1-itemsets) 4 2 4 3 4 1

Minimum Support = 3

Itemset Bread,Milk Bread,Beer {Bread,Diaper} {Milk,Beer} Milk,Dia er Beer,Dia er

If every subset is considered, 6C1+6C2+6C3= 41 With support-based pruning, 6 + 6 + 2 = 14

Count 3 2 3 2 3 3

Pairs (2-itemsets)

Triplets (3-itemsets)

ItemsetCount {Bread,Milk,Diaper} 3 {Milk,Diaper,Beer} 2

Apriori Algorithm

F1= {frequent 1-item sets}; k = 2; while( Fk-1is not empty ) { Ck= Apriori_generate( Fk-1); for all transactions t in T { Subset( Ck, t ); } Fk= { c in Cks.t. c.count >= minimum_support};

} Answer = union of all sets Fk;

Association Rule Discovery: Apriori generate _

Apriori_generate( F(k-1) ) { join Fk-1with Fk-1 such that, c1= (i1, i2, .. , ik-1) and c2= (j1, j2, .. , jk-1) join together if ip= jpfor 1 <= p <= k-1, and then new candidate, c, has a form c = (i1,i2,..,ik-1, jk-1). c is then added to ahash-treestructure.

}

Counting Candidates

Frequent Itemsets are found by counting candidates. Simple way: Search for each candidate in each transaction. Expensive!!!

Transactions

Candidates

Association Rule Discovery: Hash tree for fast access.

Hash Function

1,4,7

2,5,8

3,6,9

1 4 5

1 2 4 4 5 7

1 2 5 4 5 8

Candidate Hash Tree

1 3 6

1 5 9

2 3 4 5 6 7

3 4 5

3 5 6 3 5 7 6 8 9

3 6 7 3 6 8

Association Rule Discovery: Subset Operation

1 + 2 3 5 6

1 4 5

1 2 3 5 6transaction

2 3 4 5 6 7

1 3 6 3 4 5

2 + 3 5 6

3 5 6 3 5 7 6 8 9

3 + 5 6

3 6 7 3 6 8

Hash Function

1,4,7 2,5,8

3,6,9

1 2 4 1 2 5 1 5 9 4 5 7 4 5 8 © R. Grossman, C. Kamath, V. Kumar Data Mining for Scientific and Engineering Applications Ch 4/ 11

Voir