icdm-tutorial

icon

60

pages

icon

English

icon

Documents

Écrit par

Publié par

Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres

icon

60

pages

icon

English

icon

Documents

Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres

Christian BöhmUniversity for Health Informatics and TechnologyPowerful Database Primitivesto Support High Performance Data MiningTutorial, IEEE Int. Conf. on Data Mining, Dec/09/2002Motivation2120Christian BöhmƒƒƒƒƒƒƒƒƒHigh Performance Data Mining Marketing Fraud Detection CRM Online Scoring OLAPFast decisions require knowledge just in time3120Previous Approaches to Fast Data MiningSamplingApproximations (grid) Loss of qualityDimensionality reduct.Expensive & complexParallelismAll approaches combinable with DB primitivesKDD appl. get parallelism for free4120Christian Böhm Christian BöhmFeature Based Similarity5120Simple Similarity Queries• Specify query object and- Find similar objects – range query- Find the k most similar objects – nearest neighbor q.6120Christian Böhm Christian BöhmÎÎMultidimensional Index Structure (R-Tree)Directory Page: Data Page: Rectangle , Address1 1Point : x , x , x , ...1 11 12 132 2Point : x , x , x , ...2 21 22 23Rectangle , Address3 3Point : x , x , x , ...3 31 32 334 47120Similarity – Range Queries• Given: Query point qMaximum distance ε• Formal definition:• Cardinality of the result set isdifficult to control:ε too small no results8 ε too large complete DB120Christian Böhm Christian BöhmIndex Based Processing of Range Queries9120Similarity – Nearest Neighbor Queries• Given: Query point q• Formal definition:• Ties must be handled:- Result ...
Voir icon arrow

Publié par

Langue

English

Christian Böhm University for Health Informatics and Technology
Powerful Database Primitives to Support High Performance Data Mining
Tutorial, IEEE Int. Conf. on Data Mining, Dec/09/2002
Motivation
3 120
4 120
High Performance Data Mining
ƒMarketing ƒFraud Detection ƒCRM ƒOnline Scoring ƒOLAP
Fast decisions require knowledge just in time
Previous Approaches to Fast Data Mining
ƒSampling ƒ of quality LossApproximations (grid) ƒDimensionality reduct. ƒParallelism Expensive & complex
All approaches combinable with DB primitives KDD appl. get parallelism for free
5 120
6 120
Feature Based Similarity
Simple Similarity Queries
Specify query object and -Find similar objects  range query -Find thekmost similar objects  nearest neighbor q.
7 120
8 120
Multidimensional Index Structure (R-Tree)
Directory Page: Rectangle1, Address1 Rectangle2, Address2 Rectangle3, Address3 Rectangle4, Address4
Similarity  Range Queries
Given: Query pointq Maximum distanceε Formal definition:
Cardinality of the result set is difficult to control: εtoo smallÎno results εtoo largeÎcomplete DB
9 120
10 120
Index Based Processing of Range Queries
Similarity  Nearest Neighbor Queries
Given: Query pointq
Formal definition:
Ties must be handled: -Result set enlargement -Non-determinism (dont care)
11 120
12 120
Index Based Processing of NN Queries
k-Nearest Neighbor Search and Ranking
k-nearest neighbor query: -Do not only search only for one nearest neighbor butk -Stop distance is the distance of thekth(last) candidate point -
Ranking-query: -Incremental version ofk-nearest neighbor search -First call ofeFNhct(txe)returns first neighbor -Second call oftcFe)t(exhNreturns second neighbor... -Typically only few results are fetchedÆDont generate all!
13 120
14 120
Advanced Applications: Duplicates
Duplicate detection -E.g. Astronomical catalogue matching
C1
C2
Similarity queries for large number of query obj
Advanced Applications: Data Mining
Density based clustering (DBSCAN)
15 120
16 120
What is a Similarity Join?
Given two setsR, Sof points Find all pairs of points according to similarity
R
S
Various exact definitions for the similarity join
Organization of the Tutorial
Motivation Defining the Similarity Join Applications of the Similarity Join Similarity Join Algorithms Conclusion & Future Potential
18 120
Defining the Similarity Join
What Is a Similarity Join?
Intuitive notion: 3 properties of the similarity join The similarity join is ajoinin the relational sense Two setsRandSare combined into one such that the new set contains pairs of points that fulfill a join condition
metricobjects
Vectoror rather than ordinary tuples of any type The join condition involvessimilarity
19 120
20 120
What Is a Similarity Join?
Similarity Join
Distance Range Join NN-based Approaches
Closest Pair Query
Distance Range Join (ε-Join)
Intuitition:Given parameterε All pairs of points where distance≤ ε
Formal Definition:
k-NN Join
In SQL-like notation: SELECT*FROMR, SWHERE||R.objS.obj||≤ ε
21 120
22 120
Distance Range Join (ε-Join)
 Most widespread and best evaluated join  Often also calledthesimilarity join
Distance Range Join (ε-Join)
The distance rangeselfjoin
is of particular importance for data mining (clustering) and robust similarity search Change definition to exclude trivial results
Voir icon more
Alternate Text