tutorial-pkdd06

icon

60

pages

icon

English

icon

Documents

Écrit par

Publié par

Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres

icon

60

pages

icon

English

icon

Documents

Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres

Randomization based Privacy Preserving Data MiningXintao WuDepartment of Computer ScienceUniversity of North Carolina at CharlotteECML/PKDD06, BerlinPrivacy Case• Nydia Velázquez (1982)Three weeks after Nydia Velázquez won the New York Democratic Party's nomination to serve in the U.S. House of Representatives, somebody at St. Claire Hospital in New York faxed Velázquez's medical records to the New York Post. The records detailed the care that Velázquez had received at the hospital after a suicide attempt--an attempt that had happened several years before the election. Database Nation: The Death of Privacy in the 21st Century, SimsonGarfinkel, Jan 2000, 1-56592-653-6ECML/PKDD06 Tutorial 2•1Source: http://www.privacyinternational.org/issues/foia/foia-laws.jpgECML/PKDD06 Tutorial 3Source: http://www.privacyinternational.org/survey/dpmap.jpgECML/PKDD06 Tutorial 4•2uuuuuuNational Laws• USA n HIPAA for health carePassed August 21, 96lowest bar and the States are welcome to enact more stringent ruless California State Bill 1386n Grann-Leach-Bliley Act of 1999 for financial institutionsn COPPA for childern’s online privacyn etc.• Canadan PIPEDA 2000Personal Information Protection and Electronic Documents ActEffective from Jan 2004• European Union (Directive 94/46/EC)n Passed by European Parliament Oct ...
Voir icon arrow

Publié par

Langue

English

Randomization based Privacy Preserving
Data Mining
Xintao Wu
Department of Computer Science
University of North Carolina at Charlotte
ECML/PKDD06, Berlin
Privacy Case
• Nydia Velázquez (1982)
Three weeks after Nydia Velázquez won the New
York Democratic Party's nomination to serve in the
U.S. House of Representatives, somebody at St.
Claire Hospital in New York faxed Velázquez's
medical records to the New York Post. The records
detailed the care that Velázquez had received at
the hospital after a suicide attempt--an attempt
that had happened several years before the
election.
Database Nation: The Death of Privacy in the 21st Century, Simson
Garfinkel, Jan 2000, 1-56592-653-6
ECML/PKDD06 Tutorial
2
•1Source: http://www.privacyinternational.org/issues/foia/foia-laws.jpg
ECML/PKDD06 Tutorial
3
Source: http://www.privacyinternational.org/survey/dpmap.jpg
ECML/PKDD06 Tutorial
4
•2u
u
u
u
u
u
National Laws
• USA
n HIPAA for health care
Passed August 21, 96
lowest bar and the States are welcome to enact more stringent rules
s California State Bill 1386
n Grann-Leach-Bliley Act of 1999 for financial institutions
n COPPA for childern’s online privacy
n etc.
• Canada
n PIPEDA 2000
Personal Information Protection and Electronic Documents Act
Effective from Jan 2004
• European Union (Directive 94/46/EC)
n Passed by European Parliament Oct 95 and Effective from Oct 98.
n Provides guidelines for member state legislation
n Forbids sharing data with states that do not protect privacy
ECML/PKDD06 Tutorial
5
Mining vs. Privacy
• Data mining
n The goal of data mining is summary results (e.g., classification,
cluster, association rules etc.) from the data (distribution)
• Individual Privacy
n Individual values in database must not be disclosed, or at least no
close estimation can be got by attackers
n Contractual limitations: privacy policies, corporate agreements
• Privacy Preserving Data Mining (PPDM)
n How to transform data such that
we can build a good data mining model (data utility)
while preserving privacy at the record level (privacy)?
ECML/PKDD06 Tutorial
6
•3Two Approaches
• Distributed • Randomization
n Suitable for multi-party n Perturb data to protect privacy
platforms of individual records.
n Secure multi-party computation n Preserve intrinsic distributions
necessary for modeling.
n Tolerated disclosure:
computationally private n Tolerated disclosure:
statistically private
• Our focus• See some other excellent
tutorials
ECML/PKDD06 Tutorial
7
Other Tutorials on PPDM
• Privacy in data system, Rakesh Agrawal, PODS03
• Privacy preserving data mining, Chris Clifton, PKDD02,
KDD03
• Preserving privacy in database systems, Johann-Chrostoph
Freytag, WAIM06
• Models and methods for privacy preserving data publishing
and analysis, Johannes Gehrke, ICDM05, ICDE06, KDD06
• Cryptographic techniques in privacy preserving data mining,
Helger Lipmaa, PKDD06
ECML/PKDD06 Tutorial
8
•4u
u
u
u
Scope
ssn name zip race … age Sex Bal income … IntP
1 28223 Asian … 20 M 10k 85k … 2k
2 28223 Asian … 30 F 15k 70k … 18k
3 28262 Black … 20 M 50k 120k … 35k
4 28261 White … 26 M 45k 23k … 134k
. . . … . . . . … .
N 28223 Asian … 20 M 80k 110k … 15k
Part II:
focus on Random Response Part I: focus
.
ECML/PKDD06 Tutorial
9
Outline (Part I)
Randomization based PPDM
n Additive noise
n Rotation
n General Linear Transformation
n Condensation or modeling based
Attacking Method
n Additive noise
IQR (from distribution)
Spectral Filtering, PCA, SVD
n Rotation and General Linear Transformation
ICA
A-priori Knowledge Based Attack
ECML/PKDD06 Tutorial
10
•5Additive Noise Randomization Example
Bal income … IntP
1 10k 85k … 2k
2 15k 70k … 18k
3 50k 120k … 35k
4 45k 23k … 134k
. . . … .
N 80k 110k … 15k
10 85 2 7.334 3.759 0.09917.334 88.759 2.099
15 70 18 4.199 7.537 7.93919.199 77.537 25.939 = +
50 120 35 9.199 8.447 3.67859.199 128.447 38.678
45 23 134 6.208 7.313 1.93951.208 30.313 135.939
80 110 15 9.048 5.692 6.31889.048 115.692 21.318
= +Y X E
ECML/PKDD06 Tutorial
11
Assumption of Additive Noise
• In this tutorial, we assume the additive noise E is
independent with the original data X.
• If E is correlated with X, it may significantly affect data
utility although it may better preserve privacy.
n See Huang, Du and Chen SIGMOD05
ECML/PKDD06 Tutorial
12
•6Additive Randomization (Z=X+Y)
• R.Agrawal and R.Srikant SIGMOD 00
Alice’s
age 30 | 70K | ... 50 | 40K | ... ...
Add random
Randomizer Randomizer
number to
Age
65 | 20K | ... 25 | 60K | ... ...
30
becomes
65
Reconstruct Reconstruct(30+35)
Distribution Distribution ...
of Age of Salary
Classification
Model
Algorithm
ECML/PKDD06 Tutorial
13
Reconstruction Problem
• Original values x , x , ..., x1 2 n
n from probability distribution X (unknown)
• To hide these values, we use y , y , ..., y1 2 n
n from probability distribution Y
• Given
n x +y , x +y , ..., x +y1 1 2 2 n n
n the probability distribution of Y
Estimate the probability distribution of X.
ECML/PKDD06 Tutorial
14
•7Intuition (Reconstruct single point)
• Use Bayes' rule for density functions
V10 90
Age
Original distribution for Age
Probabilistic estimate of original value of V
ECML/PKDD06 Tutorial
15
Intuition (Reconstruct single point)
• Use Bayes' rule for density functions
V10 90
Age
Original Distribution for Age
Probabilistic estimate of original value of V
ECML/PKDD06 Tutorial
16
•8-
¥
-
¥
-
¥
¥
-
-
-
Reconstruct the Distribution
• Combine estimates of where point came from for all the
points:
n Gives estimate of original distribution.
10 90
Age
n j1 f ((x + y ) a) f (a)Y i i Xf =X ∑
jn i=1 f ((x + y ) a) f (a)Y i i X∫
ECML/PKDD06 Tutorial
17
Distribution Reconstruction Alg.
• Bootstrapping Algorithm
0
f = uniform distributionX
j = 0
repeat
n j1 f ((x + y ) a) f (a)j+1 Y i i X f (a) =
X ∑
jn i=1 f ((x + y ) a) f (a)Y i i X∫
j = j +1
until (stopping criterion met)
• Converges to maximum likelihood estimate
n Agrawal and Aggarwal PODS 01
• Extension to muti-variate case
n Domingo-Ferrer et al. PSD04
ECML/PKDD06 Tutorial
18
•9Works well
1200
1000
800
Original
600 Randomized
Reconstructed
400
200
0
20 60
Age
ECML/PKDD06 Tutorial
19
More Example
ECML/PKDD06 Tutorial
20
•10
Num ber of P eople

Voir icon more
Alternate Text