Learning with sparsity-inducing norms
Francis Bach
INRIA - École Normale Supérieure
MLSS 2008 - Île de Ré, 2008

Supervised learning and regularization
• Data: x_i ∈ X, y_i ∈ Y, i = 1, ..., n

• Minimize with respect to function f ∈ F:

    Σ_{i=1}^n ℓ(y_i, f(x_i)) + (λ/2) ‖f‖²

    Error on data  +  Regularization
    Loss & function space?      Norm?
• Two issues:
– Loss
– Function space / norm
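As a concrete illustration of the objective above, here is a minimal NumPy sketch (not part of the original slides) that evaluates the regularized empirical risk for a linear model f(x) = w⊤x with the quadratic loss and the squared-norm penalty; the toy data and the name regularized_risk are illustrative assumptions.

    import numpy as np

    def regularized_risk(w, X, y, lam):
        """Empirical risk with squared loss plus (lam/2) * ||w||^2 (illustrative)."""
        residuals = y - X @ w                     # f(x_i) = w^T x_i for each row x_i
        data_term = 0.5 * np.sum(residuals ** 2)  # sum_i l(y_i, f(x_i)) with quadratic loss
        penalty = 0.5 * lam * np.dot(w, w)        # (lambda/2) * ||w||^2
        return data_term + penalty

    # Toy example: n = 5 observations, p = 2 covariates
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 2))
    y = rng.normal(size=5)
    print(regularized_risk(np.zeros(2), X, y, lam=0.1))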
Usual losses [SS01, STC04]

• Regression: y ∈ R, prediction ŷ = f(x)
– quadratic cost ℓ(y, f(x)) = (1/2)(y − f(x))²

• Classification: y ∈ {−1, 1}, prediction ŷ = sign(f(x))
– loss of the form ℓ(y, f(x)) = ℓ(y f(x))
– “True” cost: ℓ(y f(x)) = 1_{y f(x) < 0}
– Usual convex costs:
[Figure: 0–1, hinge, square, and logistic losses plotted as functions of the margin y f(x)]
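For reference, a minimal sketch (not part of the original slides) of these losses written as functions of the margin u = y f(x); the function names are illustrative.

    import numpy as np

    # Losses as functions of the margin u = y * f(x), as in the figure above.
    def zero_one_loss(u):
        return (u < 0).astype(float)          # "true" cost: 1 if misclassified, else 0

    def hinge_loss(u):
        return np.maximum(0.0, 1.0 - u)       # convex surrogate

    def square_loss(u):
        return 0.5 * (1.0 - u) ** 2           # quadratic cost, using y^2 = 1

    def logistic_loss(u):
        return np.log1p(np.exp(-u))           # log(1 + exp(-u))

    u = np.linspace(-3, 4, 8)
    print(hinge_loss(u))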
Regularizations

• Main goal: control the “capacity” of the learning problem
• Two main lines of work:

1. Use Hilbertian (RKHS) norms
– Non-parametric supervised learning and kernel methods
– Well-developed theory [SS01, STC04, Wah90]

2. Use “sparsity-inducing” norms
– Main example: ℓ₁-norm ‖w‖₁ = Σ_{i=1}^p |w_i|
– Perform model selection as well as regularization
– Often used heuristically

• Goal of the course: understand how and when to use sparsity-inducing norms

Why ℓ₁-norms lead to sparsity?
• Example 1: quadratic problem in 1D, i.e., min_{x ∈ R} (1/2)x² − xy + λ|x|

• Piecewise quadratic function with a kink at zero
– Derivative at 0⁺: g₊ = λ − y, and at 0⁻: g₋ = −λ − y
– x = 0 is the solution iff g₊ ≥ 0 and g₋ ≤ 0 (i.e., |y| ≤ λ)
– x ≥ 0 is the solution iff g₊ ≤ 0 (i.e., y ≥ λ) ⇒ x* = y − λ
– x ≤ 0 is the solution iff g₋ ≥ 0 (i.e., y ≤ −λ) ⇒ x* = y + λ

• Solution x* = sign(y)(|y| − λ)₊ = soft thresholding
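The closed form above is easy to verify numerically; below is a minimal sketch (illustrative, not from the slides) of the scalar soft-thresholding operator together with a brute-force check on a grid.

    import numpy as np

    def soft_threshold(y, lam):
        """Minimizer of (1/2) x^2 - x*y + lam*|x| over x in R: sign(y) * (|y| - lam)_+."""
        return np.sign(y) * max(abs(y) - lam, 0.0)

    # Brute-force sanity check of the closed form on a fine grid.
    y, lam = 0.8, 0.5
    grid = np.linspace(-3, 3, 200001)
    objective = 0.5 * grid**2 - grid * y + lam * np.abs(grid)
    print(soft_threshold(y, lam))          # 0.3
    print(grid[np.argmin(objective)])      # approximately 0.3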
Why ℓ₁-norms lead to sparsity?

• Example 2: isotropic quadratic problem

    min_{x ∈ R^p} Σ_{i=1}^p [(1/2)x_i² − x_i y_i] + λ‖x‖₁  =  min_{x ∈ R^p} (1/2) x⊤x − x⊤y + λ‖x‖₁

• Solution: x_i* = sign(y_i)(|y_i| − λ)₊

• Decoupled soft thresholding
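Because the problem decouples across coordinates, the vector solution is the scalar operator applied componentwise; a small NumPy sketch (illustrative) shows how coordinates with |y_i| ≤ λ are set exactly to zero.

    import numpy as np

    def soft_threshold_vec(y, lam):
        """Componentwise soft thresholding: sign(y_i) * (|y_i| - lam)_+ for each i."""
        return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

    y = np.array([2.0, 0.3, -1.5, 0.05, -0.4])
    print(soft_threshold_vec(y, lam=0.5))
    # [ 1.5  0.  -1.   0.  -0. ]  -> small coordinates become exactly zero (sparsity)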
Why ℓ₁-norms lead to sparsity?

• Example 3: general quadratic problem
– Coupled soft thresholding

• Geometric interpretation
– NB: penalizing is “equivalent” to constraining

Course Outline
1. ℓ₁-norm regularization
• Review of nonsmooth optimization problems and algorithms
• Algorithms for the Lasso (generic or dedicated)
• Examples

2. Extensions
• Group Lasso and multiple kernel learning (MKL) + case study
• Sparse methods for matrices
• Sparse PCA

3. Theory - Consistency of pattern selection
• Low and high dimensional setting
• Links with compressed sensing

ℓ₁-norm regularization
• Data: covariates x_i ∈ R^p, responses y_i ∈ Y, i = 1, ..., n, given in vector y ∈ R^n and matrix X ∈ R^{n×p}

• Minimize with respect to loadings/weights w ∈ R^p:

    Σ_{i=1}^n ℓ(y_i, w⊤x_i) + λ‖w‖₁

    Error on data  +  Regularization

• Including a constant term b?

• Assumptions on loss:
– convex and differentiable in the second variable
– NB: with the square loss ⇒ basis pursuit (signal processing) [CDS01], Lasso (statistics/machine learning) [Tib96]
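To make the objective concrete, here is a minimal sketch of an ISTA-style proximal-gradient loop under the squared-loss assumption (an illustrative generic first-order method, not necessarily one of the dedicated algorithms covered later); it reuses the soft-thresholding operator from the earlier examples, and all names and toy data are assumptions.

    import numpy as np

    def soft_threshold(v, t):
        """Componentwise soft thresholding: sign(v_i) * (|v_i| - t)_+."""
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def lasso_ista(X, y, lam, n_iter=500):
        """Minimize (1/2) * ||y - X w||^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
        n, p = X.shape
        w = np.zeros(p)
        step = 1.0 / np.linalg.norm(X, ord=2) ** 2   # 1/L, L = largest eigenvalue of X^T X
        for _ in range(n_iter):
            grad = X.T @ (X @ w - y)                 # gradient of the smooth part
            w = soft_threshold(w - step * grad, step * lam)
        return w

    # Toy data: sparse ground truth, many irrelevant covariates.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 20))
    w_true = np.zeros(20); w_true[:3] = [2.0, -1.0, 0.5]
    y = X @ w_true + 0.1 * rng.normal(size=50)
    print(lasso_ista(X, y, lam=5.0).round(2))        # most coefficients end up exactly zero

One common way to handle the constant term b raised above is to leave it unpenalized, e.g., by centering y and the columns of X before running such a loop.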
A review of nonsmooth convex analysis and optimization

• Analysis: optimality conditions
• Optimization: algorithms
– First order methods
– Second order methods
• Books: Boyd & Vandenberghe [BV03], Bonnans et al. [BGLS03], Nocedal & Wright [NW06], Borwein & Lewis [BL00]