Lecture Notes on Advanced Machine Learning - Chapter 10: Support Vector Machine - Trịnh Tấn Đạt


Trịnh Tấn Đạt
Faculty of Information Technology, Saigon University (Đại Học Sài Gòn)
Email: trinhtandat@sgu.edu.vn
Website: https://sites.google.com/site/ttdat88/

Contents
- Introduction
- Review of Linear Algebra
- Classifiers & Classifier Margin
- Linear SVMs: Optimization Problem
- Hard vs Soft Margin Classification
- Non-linear SVMs

Introduction
- Competitive with other classification methods
- Relatively easy to learn
- Kernel methods give an opportunity to extend the idea to regression, density estimation, kernel PCA, etc.

Advantages of SVMs - 1
- A principled approach to classification, regression and novelty detection
- Good generalization capabilities
- The hypothesis has an explicit dependence on the data, via the support vectors - hence the model can be readily interpreted

Advantages of SVMs - 2
- Learning involves optimization of a convex function (no local minima, unlike neural nets)
- Only a few parameters are required to tune the learning machine (unlike the many weights, learning rates, hidden layers and hidden units of neural nets)

Prerequisites
- Vectors, matrices, dot products
- Equation of a straight line in vector notation
- Familiarity with the perceptron is useful; mathematical programming will be useful; vector spaces will be an added benefit
- The more comfortable you are with linear algebra, the easier this material will be

What is a Vector?
- Think of a vector as a directed line segment in N dimensions: it has a "length" and a "direction"
- Basic idea: convert geometry in higher dimensions into algebra!
- Once you define a "nice" basis along each dimension (x-, y-, z-axis), a vector becomes an N x 1 column matrix: v = [a b c]^T

Vector Addition: A + B
- v + w = (x1, y1) + (x2, y2) = (x1 + x2, y1 + y2)
- Geometrically: use the head-to-tail method to combine vectors, A + B = C

Scalar Product: av
- a v = a(x1, x2) = (ax1, ax2)
- Scaling changes only the length, keeping the direction fixed
- Sneak peek: a matrix operation (Av) can change length, direction and also dimensionality!

Vectors: Magnitude (Length) and Phase (Direction)
- v = (x1, x2, ..., xn)^T, with magnitude ||v|| = sqrt(Σ xi²) (the "2-norm")
- If ||v|| = 1, v is a unit vector (pure direction)
- Alternate representations: polar coordinates (||v||, θ); complex numbers ||v|| e^{jθ}, where θ is the "phase"

Inner (Dot) Product: v·w or w^T v
- v·w = (x1, y1)·(x2, y2) = x1 x2 + y1 y2 — the inner product is a SCALAR
- v·w = ||v|| ||w|| cos θ
- If v·w = 0, then v ⊥ w
- If the vectors v, w are "columns", then the dot product is w^T v

Projections with an Orthogonal Basis
- Get the component of the vector on each axis: dot product with the unit vector along that axis
- Aside: this is what the Fourier transform does! It projects a function onto an infinite number of orthonormal basis functions (e^{jω} or e^{j2πn}) and adds the results up, to get an equivalent "representation" in the "frequency" domain

Projection: Using Inner Products - 1
- p = a (a^T x), where a is a unit vector, i.e. ||a|| = a^T a = 1

Projection: Using Inner Products - 2
- In general, p = a (a^T b) / (a^T a)
- Note: the "error vector" e = b − p is orthogonal (perpendicular) to p, i.e. the inner product (b − p)^T p = 0

Review of Linear Algebra - 1
- Consider w1 x1 + w2 x2 + b = 0, i.e. w^T x + b = w·x + b = 0
- In the x1 x2-coordinate system, this is the equation of a straight line
- Proof: rewrite it as x2 = −(w1/w2) x1 − b/w2 and compare with y = m x + c; this is the equation of a straight line with slope m = −(w1/w2) and intercept c = −b/w2
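As a quick sanity check of the projection formula and the straight-line interpretation above, here is a minimal NumPy sketch; the vectors and weights are arbitrary illustrative values, not taken from the slides:

```python
import numpy as np

# Projection of b onto the direction of a: p = a (a.T b) / (a.T a)
a = np.array([3.0, 4.0])
b = np.array([2.0, 1.0])
p = a * (a @ b) / (a @ a)
e = b - p                      # "error vector"
print(np.isclose(e @ p, 0.0))  # True: e is orthogonal to p

# The set w.x + b = 0 is a straight line: x2 = -(w1/w2) x1 - b/w2
w, bias = np.array([1.0, 2.0]), -4.0
slope, intercept = -w[0] / w[1], -bias / w[1]
x1 = 3.0
x2 = slope * x1 + intercept
print(np.isclose(w @ np.array([x1, x2]) + bias, 0.0))  # True: the point lies on the line
```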
Review of Linear Algebra - 2
1. w·x = 0 is the equation of a straight line through the origin
2. w·x + b = 0 is the equation of any straight line
3. w·x + b = +1 is the equation of a straight line parallel to (2), on its positive side, at a distance 1/||w||
4. w·x + b = −1 is the equation of a straight line parallel to (2), on its negative side, at a distance 1/||w||

Define a Binary Classifier
- Define f as a classifier: f = f(w, x, b) = sign(w·x + b)
- If f = +1, x belongs to Class 1; if f = −1, x belongs to Class 2
- We call f a linear classifier because w·x + b = 0 is a straight line. This line is called the class boundary.

Linear Classifiers
[Figures: scatter plots of points labelled +1 and −1, classified by f(x, w, b) = sign(w·x + b); points with w·x + b > 0 fall on one side of the line, points with w·x + b < 0 on the other.]
- How would you classify this data? Any of the separating lines shown would be fine... but which is best?
- A poorly placed line misclassifies points (e.g. assigns them to the +1 class by mistake).

Classifier Margin
- Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.

Maximum Margin
- The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM (called an LSVM, a linear SVM).
- Support vectors are those datapoints that the margin pushes up against.
1. Maximizing the margin is good according to intuition and PAC theory
2. It implies that only the support vectors are important; the other training examples are ignorable
3. Empirically it works very, very well

Significance of Maximum Margin - 1
- From the perspective of statistical learning theory, the motivation for this binary classifier comes from theoretical bounds on the generalization error
- These bounds have two important features

Significance of Maximum Margin - 2
- The upper bound on the generalization error does not depend on the dimensionality of the space
- The bound is minimized by maximizing the margin

SVMs: Three Main Ideas
1. Define an optimal hyperplane for the linearly separable case: one that maximizes the margin; then solve the optimization problem
2. Extend the definition to non-linearly separable cases: add a penalty term for misclassifications
3. Map the data to a high-dimensional space where it is easier to classify with linear decision surfaces: reformulate the problem so that the data is mapped implicitly to this space

Setting Up the Optimization Problem
[Figure: two classes separated by the hyperplane w·x + b = 0, with the parallel hyperplanes w·x + b = +k and w·x + b = −k bounding the margin.]
- The hyperplanes obtained when k = 1 are called the canonical planes

An Observation
- The vector w is perpendicular to the Plus plane. Why? Let u and v be two vectors in the Plus plane; then what is w·(u − v)?
- Why choose w·x + b = +1 and w·x + b = −1 as the planes defining the margin boundaries? Because sign(w·x + b) has TWO degrees of freedom (the scale of w and of b), and what matters is their ratio.

Linear SVM Mathematically
What we know:
- w·x⁺ + b = +1
- w·x⁻ + b = −1
- w·(x⁺ − x⁻) = 2
- Also x⁺ = x⁻ + λw, and ||x⁺ − x⁻|| = M, the margin width

Width of the Margin
From what we know above:
M = ||x⁺ − x⁻|| = ||λw|| = λ||w|| = λ√(w·w), and λ = 2/(w·w), so M = 2/√(w·w) = 2/||w||
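To make the margin formula concrete, here is a tiny numeric check; the weight vector, bias and starting point are arbitrary illustrative choices:

```python
import numpy as np

# For the canonical planes w.x + b = +1 and w.x + b = -1, the margin width is M = 2 / ||w||.
w = np.array([3.0, 4.0])          # ||w|| = 5, so M should be 0.4
b = -2.0

x_minus = np.array([0.0, 0.25])   # a point on the minus plane: w.x_minus + b = -1
lam = 2.0 / (w @ w)               # lambda = 2 / (w.w)
x_plus = x_minus + lam * w        # x+ = x- + lambda*w lands on the plus plane

print(np.isclose(w @ x_plus + b, 1.0))                    # True
print(np.isclose(np.linalg.norm(x_plus - x_minus),        # margin width M
                 2.0 / np.linalg.norm(w)))                # True: M = 2/||w|| = 0.4
```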
Learning the Maximum Margin Classifier
- Given a guess of w and b we can compute (a) whether all data points lie in the correct half-planes and (b) the width of the margin
- Now we just need to write a program to search the space of w's and b's to find the widest margin that matches all the data points. How? Gradient descent? Matrix inversion? EM? Newton's method?

Learning via Quadratic Programming
- QP is a well-studied class of optimization algorithms that maximize a quadratic function of some real-valued variables subject to linear constraints.

Linear SVM Mathematically
Goal:
1) Correctly classify all training data: w·xi + b ≥ +1 if yi = +1, and w·xi + b ≤ −1 if yi = −1, i.e. yi(w·xi + b) ≥ 1 for all i
2) Maximize the margin M = 2/||w||, which is the same as minimizing ½ w^T w
We can therefore formulate a quadratic optimization problem and solve for w and b:
Minimize Φ(w) = ½ w^T w subject to yi(w·xi + b) ≥ 1 for all i

Solving the Constrained Minimization
- The classical method is to minimize the associated unconstrained problem using Lagrange multipliers, that is, minimize
  L(w, b) = ½ w·w − Σi αi [yi(w·xi + b) − 1]
- This is done by finding the saddle points:
  ∂L/∂b = 0 gives Σi αi yi = 0
  ∂L/∂w = 0 gives w = Σi αi yi xi
- Substituting these back into L tells us that we should maximize the functional
  W(α) = Σi αi − ½ Σi,j αi αj yi yj (xi·xj)
  subject to the constraints αi ≥ 0 and Σi αi yi = 0

Decision Surface
- The decision surface is then defined by D(z) = sign(Σi αi yi (xi·z) + b), where z is a test vector

Solving the Optimization Problem
- We need to optimize a quadratic function subject to linear constraints.
- Quadratic optimization problems are a well-known class of mathematical programming problems, and many (rather intricate) algorithms exist for solving them.
- The solution involves constructing a dual problem in which a Lagrange multiplier αi is associated with every constraint in the primal problem:
  Primal: find w and b such that Φ(w) = ½ w^T w is minimized and, for all {(xi, yi)}, yi(w^T xi + b) ≥ 1
  Dual: find α1…αN such that Q(α) = Σ αi − ½ ΣΣ αi αj yi yj xi^T xj is maximized subject to (1) Σ αi yi = 0 and (2) αi ≥ 0 for all αi

The Optimization Problem Solution
- The solution has the form w = Σ αi yi xi and b = yk − w^T xk, for any xk such that αk ≠ 0
- Each non-zero αi indicates that the corresponding xi is a support vector.
- The classifying function then has the form f(x) = Σ αi yi xi^T x + b
- Notice that it relies on an inner product between the test point x and the support vectors xi — we will return to this later.
- Also keep in mind that solving the optimization problem involved computing the inner products xi^T xj between all pairs of training points.

Dataset with Noise
- Hard margin: so far we required all data points to be classified correctly — NO training errors were allowed
- What if the training set is noisy? Solution 1: use very powerful kernels — but this leads to OVERFITTING!

Soft Margin Classification
- Slack variables ξi can be added to allow misclassification of difficult or noisy examples.
- What should our quadratic optimization criterion be? Minimize ½ w·w + C Σk ξk
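The whole pipeline above (the QP solution, the support vectors, the soft-margin penalty C) is implemented in the scikit-learn package listed at the end of this chapter. A minimal sketch on a synthetic dataset; the dataset and parameter values are illustrative, not from the slides:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two roughly separable blobs; the 0/1 labels are mapped internally to y in {-1, +1}.
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=1.0)

# C is the soft-margin penalty: a small C allows more slack and a wider margin.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_.shape)   # only a few training points end up as support vectors
# dual_coef_ holds alpha_i * y_i for the support vectors, so w = sum_i alpha_i y_i x_i:
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))    # True: matches the primal weight vector
print(clf.intercept_)               # the bias term b
```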
Hard Margin vs Soft Margin
- The old formulation: find w and b such that Φ(w) = ½ w^T w is minimized and, for all {(xi, yi)}, yi(w^T xi + b) ≥ 1
- The new formulation, incorporating slack variables: find w and b such that Φ(w) = ½ w^T w + C Σ ξi is minimized and, for all {(xi, yi)}, yi(w^T xi + b) ≥ 1 − ξi with ξi ≥ 0 for all i
- The parameter C can be viewed as a way to control overfitting.
- This is "constrained optimization".

Linear SVMs: Summary
- The classifier is a separating hyperplane.
- The most "important" training points are the support vectors; they define the hyperplane.
- Quadratic optimization algorithms can identify which training points xi are support vectors, i.e. have non-zero Lagrange multipliers αi.
- Both in the dual formulation of the problem and in the solution, the training points appear only inside dot products:
  Find α1…αN such that Q(α) = Σ αi − ½ ΣΣ αi αj yi yj xi^T xj is maximized subject to (1) Σ αi yi = 0 and (2) 0 ≤ αi ≤ C for all αi
  f(x) = Σ αi yi xi^T x + b

Why Go to the Dual Formulation?
- The vector w could be infinite-dimensional, which poses computational problems
- There are only as many Lagrange variables ("alphas") as there are training instances
- As a bonus, it turns out that the alphas are non-zero only for the support vectors (far fewer in number than the training data)

Non-linear SVM

Disadvantages of Linear Decision Surfaces / Advantages of Non-Linear Surfaces
[Figures: a dataset that no straight line can separate, but that a curved (non-linear) decision surface separates cleanly.]

Linear Classifiers in High-Dimensional Spaces
[Figure: the same data replotted in terms of constructed features 1 and 2, where a linear boundary suffices.]
- Find a function Φ(x) to map the data to a different space

Non-linear SVMs
- Datasets that are linearly separable with some noise work out great.
- But what are we going to do if the dataset is just too hard?
- How about mapping the data to a higher-dimensional space, e.g. from x on a line to (x, x²) in the plane?
- This last mapping can be thought of as a nonlinear basis function in two dimensions: we used the basis z = (x, x²)

Non-linear SVMs: Feature Spaces
- General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)

What is the Mapping Function?
- The idea is to achieve linear separation by mapping the data into a higher-dimensional space
- Let us call Φ the function that achieves this mapping. What is this Φ?
- Let us first recall the dual formulation used earlier, in the linear SVM part of the lecture.

Dual Formula - 1 (Linear SVM)
- Maximize Σk αk − ½ Σk Σl αk αl Qkl, where Qkl = yk yl (xk·xl)
- Subject to 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
- Notice the dot product xk·xl

Dual Formula - 2 (Linear SVM, Soft Margin)
- w = Σk αk yk xk
- b = yK (1 − ξK) − xK·w, where K = argmaxk αk
- Now classify using f(x, w, b) = sign(w·x + b)

For the Non-linear Case…
- Replace the inner product (xi·xj) by Φ(xi)·Φ(xj) to define the operations in the new, higher-dimensional space
- If there is a "kernel function" K such that K(xi, xj) = Φ(xi)·Φ(xj) = Φ(xi)^T Φ(xj), then we do not need to know Φ explicitly.
- This strategy is preferred because the alternative, working with Φ directly, is computationally expensive.
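A quick numeric check of this identity, using the quadratic kernel K(a, b) = (1 + a·b)² and its explicit 2-D feature map worked out later in this chapter; the test vectors are arbitrary:

```python
import numpy as np

def phi(x):
    # Explicit feature map for the quadratic kernel in 2-D (see the worked example below):
    # phi(x) = [1, x1^2, sqrt(2) x1 x2, x2^2, sqrt(2) x1, sqrt(2) x2]
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

def K(a, b):
    # Kernel evaluation: no explicit mapping needed, only O(m) work.
    return (1.0 + a @ b) ** 2

a = np.array([0.5, -1.0])
b = np.array([2.0, 3.0])
print(np.isclose(K(a, b), phi(a) @ phi(b)))  # True: the same value, computed two ways
```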
Dual in the New (Feature) Space
- Maximize Σk αk − ½ Σk Σl αk αl Qkl, where now Qkl = yk yl (Φ(xk)·Φ(xl))
- Subject to 0 ≤ αk ≤ C for all k, and Σk αk yk = 0
- Now define w = Σ{k: αk > 0} αk yk Φ(xk), and b = yK (1 − ξK) − xK·w, where K = argmaxk(αk)
- Classify with f(x, w, b) = sign(w·Φ(x) + b)

Computational Burden
- Because we are working in a higher-dimensional space (potentially even an infinite-dimensional space), calculating Φ(xi)^T Φ(xj) may be intractable.
- We must do R²/2 dot products to evaluate Q
- Each dot product requires m²/2 additions and multiplications
- The whole thing costs R²m²/4. Too high!!! Or does it? Really?

How to Avoid the Burden?
- Instead, we have the kernel trick, which tells us that K(xi, xj) = (1 + xi·xj)² = φ(xi)·φ(xj) for the given φ. Thus we can simplify our calculations.
- Rewriting the dual in terms of the kernel:
  max{α ≥ 0} [Σi αi − ½ Σi,j αi αj yi yj K(xi, xj)]

Decision Function
- To classify a novel instance x once you have learned the optimal αi parameters, all you have to do is calculate
  f(x) = sign(w·x + b) = sign(Σi αi yi K(xi, x) + b)
  by setting w = Σi αi yi φ(xi) and using the kernel trick.

A Note
- αi is non-zero only for instances φ(xi) on or near the boundary — those are called the support vectors, since they alone specify the decision boundary.
- We can toss out the other data points once training is complete; thus we only sum over the xi that constitute the support vectors.

Consider a Φ·Φ as Shown Below
Take Φ(a) = [1, √2 a1, …, √2 am, a1², a2², …, am², √2 a1 a2, √2 a1 a3, …, √2 a1 am, √2 a2 a3, …, √2 a2 am, …, √2 a(m−1) am]^T, and likewise Φ(b), and form the dot product Φ(a)·Φ(b).

Collecting Terms in the Dot Product
- First term = 1
- Next m terms = 2 Σi ai bi
- Next m terms = Σi ai² bi²
- Rest = 2 Σi Σ{j>i} ai aj bi bj
- Therefore Φ(a)·Φ(b) = 1 + 2 Σi ai bi + Σi ai² bi² + 2 Σi Σ{j>i} ai aj bi bj

Out of Curiosity
(1 + a·b)² = (a·b)² + 2(a·b) + 1
           = (Σi ai bi)² + 2 (Σi ai bi) + 1
           = Σi Σj ai bi aj bj + 2 (Σi ai bi) + 1
           = Σi (ai bi)² + 2 Σi Σ{j>i} ai bi aj bj + 2 (Σi ai bi) + 1

Both Are the Same
- Comparing term by term, we see Φ(a)·Φ(b) = (1 + a·b)²
- But computing the right-hand side is a lot more efficient: O(m) (m additions and multiplications)
- Let us call (1 + a·b)² = K(a, b) = the kernel

Φ in the "Kernel Trick": Example
- 2-dimensional vectors x = [x1 x2]; let K(xi, xj) = (1 + xi^T xj)². We need to show that K(xi, xj) = φ(xi)^T φ(xj):
  K(xi, xj) = (1 + xi^T xj)² = 1 + xi1² xj1² + 2 xi1 xj1 xi2 xj2 + xi2² xj2² + 2 xi1 xj1 + 2 xi2 xj2
            = [1, xi1², √2 xi1 xi2, xi2², √2 xi1, √2 xi2]^T [1, xj1², √2 xj1 xj2, xj2², √2 xj1, √2 xj2]
            = φ(xi)^T φ(xj), where φ(x) = [1, x1², √2 x1 x2, x2², √2 x1, √2 x2]

Other Kernels
- Beyond polynomials, there are other high-dimensional basis functions that can be made practical by finding the right kernel function

Examples of Kernel Functions
- Linear: K(xi, xj) = xi^T xj
- Polynomial of power p: K(xi, xj) = (1 + xi^T xj)^p
- Gaussian (radial-basis function network): K(xi, xj) = exp(−||xi − xj||² / (2σ²))
- Sigmoid: K(xi, xj) = tanh(β0 xi^T xj + β1)

The Function We End Up Optimizing
- Maximize Σk αk − ½ Σk Σl αk αl Qkl, where Qkl = yk yl K(xk, xl)
- Subject to 0 ≤ αk ≤ C for all k, and Σk αk yk = 0

Multi-class Classification
[Figures: multi-class decision regions built from binary SVMs.]
- One-versus-all classification
- Multi-class SVM

SVM Software
- Python: scikit-learn module
- LibSVM (C++)
- SVMLight (C)
- Torch (C++)
- Weka (Java)

Research
- One-class SVM (unsupervised learning): outlier detection
- Weibull-calibrated SVM (W-SVM) / PI-SVM: open set recognition
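For reference, the kernels listed above map onto scikit-learn's SVC constructor roughly as follows. This is a hedged sketch of the correspondence: the parameter names follow the scikit-learn API, while the numeric values are placeholders to be tuned.

```python
from sklearn.svm import SVC, OneClassSVM

# Linear: K(xi, xj) = xi.xj
linear_svm = SVC(kernel="linear", C=1.0)

# Polynomial of power p: K(xi, xj) = (1 + xi.xj)^p  ->  degree=p, coef0=1, gamma=1
poly_svm = SVC(kernel="poly", degree=3, coef0=1.0, gamma=1.0, C=1.0)

# Gaussian RBF: K(xi, xj) = exp(-||xi - xj||^2 / (2*sigma^2))  ->  gamma = 1/(2*sigma^2)
sigma = 1.0
rbf_svm = SVC(kernel="rbf", gamma=1.0 / (2 * sigma**2), C=1.0)

# Sigmoid: K(xi, xj) = tanh(beta0 * xi.xj + beta1)  ->  gamma=beta0, coef0=beta1
sigmoid_svm = SVC(kernel="sigmoid", gamma=0.1, coef0=0.0, C=1.0)

# One-class SVM for outlier detection (the "Research" item above)
outlier_detector = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05)
```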
Homework
- CIFAR-10 image recognition using SVM
- The CIFAR-10 dataset consists of 60000 32x32 color images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.
- These are the classes in the dataset: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck
- Hint: https://github.com/wikiabhi/Cifar-10 and https://github.com/mok232/CIFAR-10-Image-Classification
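As a starting point for the homework (not the reference solution), here is a minimal sketch. It assumes the CIFAR-10 loader from tensorflow.keras is available (any CIFAR-10 loader works), uses raw-pixel features, and trains a linear SVM on a subset just to get a baseline running:

```python
import numpy as np
from tensorflow.keras.datasets import cifar10   # assumed loader; any CIFAR-10 loader works
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Load CIFAR-10: 50000 training and 10000 test images of shape 32x32x3.
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

# Flatten each image into a 3072-dimensional vector and scale pixels to [0, 1].
x_train = x_train.reshape(len(x_train), -1).astype(np.float32) / 255.0
x_test = x_test.reshape(len(x_test), -1).astype(np.float32) / 255.0
y_train, y_test = y_train.ravel(), y_test.ravel()

# Train on a subset to keep the run time manageable; LinearSVC handles the
# 10 classes with a one-versus-all scheme, as discussed above.
n = 10000
clf = LinearSVC(C=1.0, max_iter=2000)
clf.fit(x_train[:n], y_train[:n])

print("test accuracy:", accuracy_score(y_test, clf.predict(x_test)))
```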