Advantages of SVMs - 1
A principled approach to classification, regression and novelty detection
Good generalization capabilities
Hypothesis has an explicit dependence on data, via support vectors – hence,
can readily interpret model
4Advantages of SVMs - 2
Learning involves optimization of a convex function (no local minima as in
neural nets)
Only a few parameters are required to tune the learning machine (unlike lots of
weights and learning parameters, hidden layers, hidden units, etc as in neural
nets)
77 trang |
Chia sẻ: thanhle95 | Lượt xem: 552 | Lượt tải: 1
Bạn đang xem trước 20 trang tài liệu Bài giảng Máy học nâng cao - Chương 10: Support vector machine - Trịnh Tấn Đạt, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
Trịnh Tấn Đạt
Khoa CNTT – Đại Học Sài Gòn
Email: trinhtandat@sgu.edu.vn
Website: https://sites.google.com/site/ttdat88/
Contents
Introduction
Review of Linear Algebra
Classifiers & Classifier Margin
Linear SVMs: Optimization Problem
Hard Vs Soft Margin Classification
Non-linear SVMs
Introduction
Competitive with other classification methods
Relatively easy to learn
Kernel methods give an opportunity to extend the idea to
Regression
Density estimation
Kernel PCA
Etc.
3
Advantages of SVMs - 1
A principled approach to classification, regression and novelty detection
Good generalization capabilities
Hypothesis has an explicit dependence on data, via support vectors – hence,
can readily interpret model
4
Advantages of SVMs - 2
Learning involves optimization of a convex function (no local minima as in
neural nets)
Only a few parameters are required to tune the learning machine (unlike lots of
weights and learning parameters, hidden layers, hidden units, etc as in neural
nets)
5
Prerequsites
Vectors, matrices, dot products
Equation of a straight line in vector notation
Familiarity with
Perceptron is useful
Mathematical programming will be useful
Vector spaces will be an added benefit
The more comfortable you are with Linear Algebra, the easier this material will
be
6
What is a Vector ?
Think of a vector as a directed line segment in
N-dimensions! (has “length” and
“direction”)
Basic idea: convert geometry in higher
dimensions into algebra!
Once you define a “nice” basis along each
dimension: x-, y-, z-axis
Vector becomes a 1 x N matrix!
v = [a b c]T
Geometry starts to become linear algebra on
vectors like v!
=
c
b
a
v
x
y
v
7
Vector Addition: A+B
),(),(),( 22112121 yxyxyyxx ++=+=+ wv
A
B
A
B
C
A+B = C
(use the head-to-tail method to
combine vectors)
A+B
8
Scalar Product: av
),(),( 2121 axaxxxaa ==v
v
av
Change only the length (“scaling”), but keep direction fixed.
Sneak peek: matrix operation (Av) can change length,
direction and also dimensionality!
9
Vectors: Magnitude (Length) and Phase (direction)
x
y
||v||
Alternate representations:
Polar coords: (||v||, )
Complex numbers: ||v||ej
“phase”
,1 If
1
2
),,
2
,
1
(
=
=
=
=
v
n
i
i
xv
n
xxxv T
(Magnitude or “2-norm”)
(unit vector => pure direction)
10
a unit vector
Inner (dot) Product: v.w or wTv
v
w
22112121 .),).(,(. yxyxyyxxwv +==
The inner product is a SCALAR!
cos||||||||),).(,(. 2121 wvyyxxwv ==
wvwv ⊥= 0.
If vectors v, w are “columns”, then dot product is wTv
11
Projections w/ Orthogonal Basis
Get the component of the vector on each axis:
dot-product with unit vector on each axis!
Aside: this is what Fourier transform does!
Projects a function onto a infinite number of orthonormal basis functions: (ej or ej2n), and
adds the results up (to get an equivalent “representation” in the “frequency” domain).
12
Projection: Using Inner Products -1
p = a (aTx)
||a|| = aTa = 1
13
Projection: Using Inner Products -2
Note: the “error vector” e = b-p
is orthogonal (perpendicular) to p.
i.e. Inner product: (b-p)Tp = 0
p = a (aTb)/ (aTa)
14
Review of Linear Algebra - 1
Consider
w1x1+ w2x2 + b = 0 = w
Tx + b = w.x + b
In the x1x2-coordinate system, this is the equation of a straight
line
Proof: Rewrite this as
x2 = (w1/w2)x1 + (1/w2) b = 0
Compare with y = m x + c
This is the equation of a straight line with slope m = (w1/w2) and
intercept c = (1/w2)
15
Review of Liner Algebra - 2
1. w.x = 0 is the eqn of a st line through origin
2. w. x + b = 0 is the eqn of any straight line
3. w. x + b = +1 is the eqn of a straight line parallel to (2)
on the positive side of Eqn (1) at a distance 1
4. w. x + b = -1 is the eqn of a straight line parallel to (2)
on the negative side of Eqn (1) at a distance 1
16
Define a Binary Classifier
▪ Define f as a classifier
▪ f = f (w, x, b) = sign (w.x + b)
▪ If f = +1, x belongs to Class 1
▪ If f = - 1, x belongs to Class 2
▪ We call f a linear classifier because
w.x + b = 0 is a straight line.
This line is called the class boundary
17
Linear Classifiers
f x
yest
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
How would you classify
this data?
w x + b<0
w x + b>0
18
Linear Classifiers
f x
yest
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
How would you classify
this data?
19
Linear Classifiers
f x
yest
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
How would you classify
this data?
20
Linear Classifiers
f x
yest
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
Any of these would be
fine..
..but which is best?
21
Linear Classifiers
f x
yest
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
How would you classify
this data?
Misclassified
to +1 class
22
Classifier Margin
f x
yest
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
Define the margin of a
linear classifier as the
width that the boundary
could be increased by
before hitting a
datapoint.
23
Maximum Margin
f x
yest
denotes +1
denotes -1
f(x,w,b) = sign(w x + b)
The maximum margin
linear classifier is the
linear classifier with the
maximum margin.
This is the simplest kind
of SVM (Called an LSVM)
Linear SVM
Support Vectors are
those datapoints that
the margin pushes up
against
1. Maximizing the margin is good according to intuition
and PAC theory
2. Implies that only support vectors are important;
other training examples are ignorable.
3. Empirically it works very very we l.
24
Significance of Maximum Margin - 1
From the perspective of statistical learning theory, the motivation for
considering the Binary Classifier SVM’s comes from theoretical bounds on
generalization error
These bounds have two important features
25
Significance of Maximum Margin - 2
The upper bound on the generalization error does not depend upon the
dimensionality of the space
The bound is minimized by maximizing the margin
26
27
SVMs: Three Main Ideas
1. Define an optimal hyperplane for a linearly separable case:
1. One that maximizes the margin
2. Solve the optimization problem
2. Extend the definition to non-linearly separable cases:
1. Have a penalty term for misclassifications
3. Map data to a high dimensional space where it is easier to
classify with linear decision surfaces:
1. reformulate problem so that data is mapped implicitly to this space
28
Setting Up the Optimization Problem
Var1
Var2kbxw −=+
kbxw =+
0=+ bxw
kk
w
The hyperplanes
when k=1 are
canonical planes
An Observation
The vector w is perpendicular
to the Plus plane. Why?
Why choose wx+b = +1 and
wx+b = -1 as the planes defining
margin boundaries?
Let u and v be two vectors in the
Plus plane. Then what is
w.(u-v)?
Because sign (wx+b) has TWO
degrees of freedom and what
matters in their ratio
29
Linear SVM Mathematically
What we know:
w . x+ + b = +1
w . x- + b = -1
w . (x+-x-) = 2
x -
x+ M=Margin Width
30
Also
x+ = x- + λ w
|x+ - x-|= M
Width of the Margin
What we know:
31
M = || x+ - x- ||= || lw ||
=l ||w ||= l w.w
=
2 w.w
w.w
=
2
w.w
w. x+ + b = +1
w. x- + b = -1
x+ = x- +lw
|| x+ - x- ||=M
l =
2
w.w
Learning the Maximum Margin Classifier
Given a guess of w and b we can compute
whether all data points are in the correct half-planes
the width of the margin
Now we just need to write a program to search the space
of w’s and b’s to find the widest margin that matches all
the data points. How?
Gradient descent? Matrix Inversion? EM? Newton’s
Method?
32
Learning via Quadratic Programming
QP is a well-studied class of optimization algorithms to
maximize a quadratic function of some real-valued variables
subject to linear constraints.
33
Linear SVM Mathematically
◼ Goal: 1) Correctly classify all training data
if yi = +1
if yi = -1
for all i
2) Maximize the Margin
same as minimize
◼ We can formulate a Quadratic Optimization Problem and solve for w and b
◼ Minimize
subject to
M =
2
||w ||
F(w) =
1
2
wTw
1+ bwxi
wxi +b £-1
1)( + bwxy ii
1)( + bwxy ii
i
1
2
wTw
34
Solving the Constrained Minimization
Classical method is to minimize the associated un-constrained problem using
Lagrange multipliers. That is minimize
This is done by finding the saddle points:
35
=
−+−=
N
i
iii bxwywwbwL
1
1)).((.
2
1
),(
¶L
¶b
= 0 gives aiå yi = 0
..contd
36
¶L
¶w
= 0 gives w = aiyixiå
Which when substituted back in L tells us that we should maximize the
functional:
Subject to alphas greater than or equal to 0
W (a)= ai
i=1
N
å -
1
2
aia jyiy j (xi - x j )
i, j=1
N
å
...contd
Subject to the constraints
and
37
ai ³ 0
aiyi
i=1
N
å = 0
Decision Surface
The decision surface then is defined by
where z is a test vector
38
D(z)= sign aiyi(xi - z)+b
i
N
å
æ
è
ç
ö
ø
÷
Solving the Optimization Problem
◼ Need to optimize a quadratic function subject to linear constraints.
◼ Quadratic optimization problems are a well-known class of mathematical
programming problems, and many (rather intricate) algorithms exist for solving
them.
◼ The solution involves constructing a dual problem where a Lagrange multiplier
αi is associated with every constraint in the primary problem:
Find w and b such that
Φ(w) =½ wTw is minimized;
and for all {(xi ,yi)}: yi (w
Txi + b) ≥ 1
Find α1αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxiTxj is maximized and
(1) Σαiyi = 0
(2) αi ≥ 0 for all αi
39
The Optimization Problem Solution
◼ The solution has the form:
◼ Each non-zero αi indicates that corresponding xi is a support vector.
◼ Then the classifying function will have the form:
◼ Notice that it relies on an inner product between the test point x and the
support vectors xi – we will return to this later.
◼ Also keep in mind that solving the optimization problem involved computing
the inner products xi
Txj between all pairs of training points.
w =Σαiyixi b= yk- wTxk for any xk such that αk 0
f(x) = ΣαiyixiTx + b
40
Dataset with noise
◼ Hard Margin: So far we required
◼ all data points be classified correctly
◼ Allowed NO training errors
◼ What if the training set is noisy?
- Solution 1: use very powerful kernels
denotes +1
denotes -1
OVERFITTING!
41
Slack variables ξi can be added to allow misclassification of
difficult or noisy examples.
Soft Margin Classification
What should our quadratic
optimization criterion be?
Minimize
=
+
R
k
kεC
1
.
2
1
ww
42
Hard Margin Vs Soft Margin
◼ The old formulation:
◼ The new formulation incorporating slack variables:
◼ Parameter C can be viewed as a way to control overfitting.
◼ This is “constrained optimization”
Find w and b such that
Φ(w) =½ wTw is minimized and for all {(xi ,yi)}
yi (w
Txi + b) ≥ 1
Find w and b such that
Φ(w) =½ wTw + CΣξi is minimized and for all {(xi ,yi)}
yi (w
Txi + b) ≥ 1- ξi and ξi ≥ 0 for all i
43
Linear SVMs: Summary
◼ The classifier is a separating hyperplane.
◼ Most “important” training points are support vectors; they define the
hyperplane.
◼ Quadratic optimization algorithms can identify which training points xi are
support vectors with non-zero Lagrangian multipliers αi.
◼ Both in the dual formulation of the problem and in the solution training points
appear only inside dot productas:
Find α1αN such that
Q(α) =Σαi - ½ΣΣαiαjyiyjxi
Txj is maximized and
(1) Σαiyi = 0
(2) 0 ≤ αi ≤ C for all αi
f(x) = ΣαiyixiTx + b
44
Why Go to Dual Formulation?
The vector w could be infinite-dimensional and poses problems
computationally
There are only as many Lagrange variables, “alphas”, as there are training
instances
As a bonus, it turns out that the “alphas” are non-zero only for the support
vectors (far fewer in number than the training data)
45
Non-linear SVM
47
Disadvantages of Linear Decision Surfaces
Var1
Var2
48
Advantages of Non-Linear Surfaces
Var1
Var2
49
Linear Classifiers in High-Dimensional Spaces
Var1
Var2 Constructed Feature 1
Find function (x) to map to a
different space
Constructed Feature 2
Non-linear SVMs
◼ Datasets that are linearly separable with some noise work out great:
◼ But what are we going to do if the dataset is just too hard?
◼ How about mapping data to a higher-dimensional space:
0 x
0 x
0 x
x2
50
The last figure can be thought of also as a nonlinear basis function in two
dimensions
That is, we used the basis z = (x,x2)
51
Non-linear SVMs: Feature spaces
◼ General idea: the original input space can always be mapped to some higher-
dimensional feature space where the training set is separable:
Φ: x → φ(x)
52
What is the Mapping Function?
The idea is to achieve linear separation by mapping the data into a higher
dimensional space
Let us call Φ the function that achieves this mapping.
What is this Φ?
53
54
Let us recall the formula we used earlier -
Linear SVM lecture:
The dual formulation
Dual Formula – 1 (Linear SVM)
55
).(,
2
1
1 11
lklkkl
R
k
R
l
kllk
R
k
k xxyyQQ =−
= ==
subject to
Notice the dot product
00
1
=
=
k
R
k
kk ykforC
Dual Formula – 2 (Linear SVM soft margin)
56
w = ak
k=1
R
å ykxk
b = yK (1-xK )- xK . wK
where
K = argmaxkak
Now classify, using f (x,w,b)= sign(w. x+b)
For the Non-linear Case....
Let us replace the inner product (xi . xj) by
Φ(xi). Φ(xj) to define the operations in the new higher
dimensional space
If there is a “kernel function” K such that
K(xi, xj) = Φ(xi). Φ(xj) = Φ(xi)
TΦ(xj)
then we do not need to know Φ explicitly.
This strategy is preferred because the alternative of working
with Φ is expensive, computationally.
57
Dual in New (Feature) Space
58
max ak
k=1
R
å -
1
2
akalQkl
l=1
R
å
k=1
R
å where Qkl = ykyl (F(xk ).F(xl ))
s.t. 0 £ak £C, "k and akyk
k=1
R
å = 0
Now define
w = ak
k s.t.ak>0
å ykF(xk ); b = yK (1-xK )- xK .wK
where K = argmaxk (ak )K = argmaxk (ak )
Classify with f (x,w,b)= sign(w.F(x)+b)
Computational Burden
Because we’re working in a higher-dimension space (and potentially even an
infinite-dimensional space), calculating φ(xi)
T φ(xj) may be intractable.
We must do R2/2 dot products to evaluate Q
Each dot product requires m2/2 additions and multiplications
The whole thing costs R2m2/4.
Too high!!! Or, does it? Really?
59
How to Avoid the Burden?
Instead, we have the kernel trick, which tells us that
K(xi, xj) = (1 + xi . xj)
2 = φ(xi) . φ(xj)
for the given φ. Thus, we can simplify our
calculations.
Re-writing the dual in terms of the kernel,
60
maxa³0 ai -
1
2
aia jyi y jK(xi, x j )
i, j
åå
é
ë
ê
ê
ù
û
ú
ú
Decision Function
To classify a novel instance x once you’ve learned the optimal αi parameters,
all you have to do is calculate
(by setting and using the kernel trick).
61
f (x)= sign(w. x+b)= sign( aiyiK(xi, x)å +b)
w = aiå yif(xi )
A Note
Note that αi is only non-zero for instances φ(xi) on or
near the boundary - those are called the support vectors
since they alone specify the decision boundary.
We can toss out the other data points once training is
complete. Thus, we only sum over the xi which constitute
the support vectors.
62
Consider a Φ. Φas shown below
63
F(a).F(b) =
1
2a1
2am
a1
2
a2
2
am
2
2a1a2
2a1a3
2a1am
2a2a3
2a2am
2am-1am
é
ë
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ù
û
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
.
1
2b1
2bm
b1
2
b2
2
bm
2
2b1b2
2b1b3
2b1bm
2b2b3
2b2bm
2bm-1bm
é
ë
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ê
ù
û
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
ú
Collecting terms in the dot product
First term = 1 +
Next m terms =
Next m terms =
Rest =
Therefore
64
F(a).F(b) =1+2 ai
i=1
m
å bi + ai
2
i=1
m
å bi
2 + 2aia jbi
j=i+1
m
å
i=1
m
å bj
2ai
i=1
m
å bi
2ai
i=1
m
å bi
ai
2
i=1
m
å bi
2
2aia jbi
j=i+1
m
å
i=1
m
å bj
Out of Curiosity
65
(1+ a. b)2 = (a. b)2 + 2(a. b)+1
= aibi
i=1
m
å
æ
è
ç
ö
ø
÷
2
+2 aibi
i=1
m
å
æ
è
ç
ö
ø
÷+1
= aibi
j=1
m
å
i=1
m
å a jbj +2 aibi
i=1
m
å
æ
è
ç
ö
ø
÷+1
= (aibi )
2
i=1
m
å +2 aibi
j=i+1
m
å
i=1
m
å a jbj + 2 aibi
i=1
m
å
æ
è
ç
ö
ø
÷+1
Both are Same
Comparing term by term, we see
Φ.Φ = (1 + a.b)2
But computing the right side is lot more efficient, O(m) (m
additions and multiplications)
Let us call (1 + a.b)2 = K(a,b) = Kernel
66
Φ in “Kernel Trick” Example
2-dimensional vectors x = [x1 x2];
Let K(xi,xj)=(1 + xi
Txj)
2
,
Need to show that K(xi,xj)= φ(xi)
Tφ(xj):
K(xi,xj) = (1 + xi
Txj)
2
= 1+ xi1
2xj1
2 + 2 xi1xj1 xi2xj2+ xi2
2xj2
2 + 2xi1xj1 + 2xi2xj2
= [1 xi1
2 √2 xi1xi2 xi2
2 √2xi1 √2xi2]
T
[1 xj1
2 √2 xj1xj2 xj2
2 √2xj1 √2xj2]
= φ(xi)
Tφ(xj),
where φ(x) = [1 x1
2 √2 x1x2 x2
2 √2x1 √2x2]
67
Other Kernels
Beyond polynomials there are other high dimensional basis
functions that can be made practical by finding the right kernel
function
68
Examples of Kernel Functions
◼ Linear: K(xi,xj)= xi
Txj
◼ Polynomial of power p: K(xi,xj)= (1+ xi
Txj)
p
◼ Gaussian (radial-basis function network):
◼ Sigmoid: K(xi,xj)= tanh(β0xi
Txj + β1)
)
2
exp(),(
2
2
ji
ji
xx
xx
−
−=K
69
The function we end up optimizing is
70
ak
k=1
R
å -
1
2
akalQkl
l=1
R
å
k=1
R
å where Qkl = ykylK(xk, xl )
s.t. 0 £ak £C, "k and akyk
k=1
R
å = 0
Multi-class classification
Multi-class classification
One versus all classification
Multi-class SVM
Multi-class SVM
SVM Software
Python: scikit-learn module
LibSVM (C++)
SVMLight (C)
Torch (C++)
Weka (Java)
75
Research
One-class SVM (unsupervised learning): outlier detection
Weibull-calibrated SVM (W-SVM) / PI -SVM: open set recognition
Homework
CIFAR-10 image recognition using SVM
The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with
6000 images per class. * There are 50000 training images and 10000 test images.
These are the classes in the dataset: airplane, automobile, bird, cat, deer, dog, frog,
horse, ship, truck
Hint : https://github.com/wikiabhi/Cifar-10
https://github.com/mok232/CIFAR-10-Image-Classification