Bài giảng Máy học nâng cao - Chương 9: Dimensionality reduction and feature selection - Trịnh Tấn Đạt

Introduction  Dimensionality reduction is one of the most popular techniques to remove noisy (i.e., irrelevant) and redundant features.  Dimensionality reduction techniques: feature extraction v.s feature selection  feature extraction: given N features (set X), extract M new features (set Y) by linear or nonlinear combination of all the N features (i.e. PCA, LDA)  feature selection: choose a best subset of highly discriminant features of size M from the available N features (i.e. Information Gain, ReliefF, Fisher Score)

81 trang | Chia sẻ: thanhle95 | Lượt xem: 742 | Lượt tải: 1

Bạn đang xem trước 20 trang tài liệu Bài giảng Máy học nâng cao - Chương 9: Dimensionality reduction and feature selection - Trịnh Tấn Đạt, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên

Trịnh Tấn Đạt Khoa CNTT – Đại Học Sài Gòn Email: trinhtandat@sgu.edu.vn Website: https://sites.google.com/site/ttdat88/ Contents  Introduction: dimensionality reduction and feature selection  Dimensionality Reduction  Principal Component Analysis (PCA)  Fisher’s linear discriminant analysis (LDA)  Example: Eigenface  Feature Selection  Homework Introduction  High-dimensional data often contain redundant features  reduce the accuracy of data classification algorithms  slow down the classification process  be a problem in storage and retrieval  hard to interpret (visualize)  Why we need dimensionality reduction???  To avoid “curse of dimensionality”  To reduce feature measurement cost  To reduce computational cost https://en.wikipedia.org/wiki/Curse_of_dimensionality Introduction  Dimensionality reduction is one of the most popular techniques to remove noisy (i.e., irrelevant) and redundant features.  Dimensionality reduction techniques: feature extraction v.s feature selection  feature extraction: given N features (set X), extract M new features (set Y) by linear or non- linear combination of all the N features (i.e. PCA, LDA)  feature selection: choose a best subset of highly discriminant features of size M from the available N features (i.e. Information Gain, ReliefF, Fisher Score) Dimensionality Reduction Principal component analysis (PCA) ❖ Variance v.s. Covariance  Variance : phương sai của một biến ngẫu nhiên là thước đo sự phân tán thống kê của biến đó, nó hàm ý các giá trị của biến đó thường ở cách giá trị kỳ vọng bao xa.  Covariance: hiệp phương sai là độ đo sự biến thiên cùng nhau của hai biến ngẫu nhiên (phân biệt với phương sai - đo mức độ biến thiên của một biến) 1 ( )(y ) Cov( , ) (N 1) N i i i x x y X Y = − − = −  Low variance High variance Principal component analysis (PCA)  Mean (expected value): giá trị “mong muốn”, biểu diễn giá trị trung bình của một biến.  Standard Deviation: Độ lệch chuẩn đo tính biến động của giá trị mang tính thống kê. Nó cho thấy sự chênh lệch về giá trị của từng thời điểm đánh giá so với giá trị trung bình. Principal component analysis (PCA)  Representing Covariance between dimensions as a matrix e.g. for 3 dimensions:  cov(x,y) = cov(y,x) hence matrix is symmetrical about the diagonal  N-dimensional data will result in NxN covariance matrix Principal component analysis (PCA)  What is the interpretation of covariance calculations? e.g.: dữ liệu 2 chiều x: số lượng giờ học một môn học y: điểm số của một môn học covariance value ~ 104.53 what does this value mean? -> số lượng giờ học tăng  , điểm số  Principal component analysis (PCA)  Exact value is not as important as it’s sign.  A positive value of covariance indicates both dimensions increase or decrease together (e.g. as the number of hours studied increases, the marks in that subject increase.)  A negative value indicates while one increases the other decreases, or vice-versa (e.g. active social life v.s performance in class.)  If covariance is zero: the two dimensions are independent of each other (e.g. heights of students vs the marks obtained in a subject.) Principal component analysis (PCA) Principal component analysis (PCA)  Principal components analysis (PCA) là một phương pháp để đơn giản hóa một tập dữ liệu (simplify a dataset) , chằng hạn giảm số chiều của dữ liệu. “It is a linear transformation that chooses a new coordinate system for the data set such that  the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component),  the second greatest variance on the second axis  and so on. ”  PCA có thể được dùng để giảm số chiều bằng cách loại bỏ những thành phần chính không quan trọng. Principal component analysis (PCA)  Ví dụ: khi thực hiện các phân tích đa biến mà trong đó các biến có tương quan với nhau gây nhiều khó khăn loại bỏ sự tương quan này bằng cách xoay trục (cơ sở) dữ liệu trên trục mới đã giảm sự tương quan đáng kể (biến Y1 và Y2 gần như không tương quan) sự thay đổi của dữ liệu phụ thuộc phần lớn vào biến Y1 giảm số chiều dữ liệu mà không làm giàm quá nhiều “phương sai” của dữ liệu Principal component analysis (PCA)  Note:  Giúp giảm số chiều của dữ liệu;  Thay vì giữ lại các trục tọa độ của không gian cũ, PCA xây dựng một không gian mới ít chiều hơn, nhưng lại có khả năng biểu diễn dữ liệu tốt tương đương không gian cũ. Principal component analysis (PCA)  Ví dụ: Khám phá liên kết tiềm ẩn nhờ đổi hệ trục tọa độ, cách nhìn khác nhau về cùng một dữ liệu. Principal component analysis (PCA)  Ví dụ: Notice that "the maximum variance" and "the minimum error" are reached at the same time, namely when the line points to the magenta ticks Principal component analysis (PCA)  How to find the optimal linear transformation A ( where y = Ax) -1. Origin of PCA coordinate  mean of samples -2. Maximize projected variance -3. Minimize projection cost min x y− Principal component analysis (PCA) https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.12-Example-Principal-Components- Analysis/ Principal component analysis (PCA)  Note:  The eigenvectors of the covariance matrix define a new coordinate system  Eigenvector with largest eigenvalue captures the most variation among training vectors x  eigenvector with smallest eigenvalue has least variation  The eigenvectors are known as principal components https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.12-Example-Principal-Components-Analysis/ Principal component analysis (PCA)  Algorithm: 7 steps  Input : N mẫu input 1. Tính vector trung bình của toàn bộ dữ liệu: 2. Trừ mỗi điểm dữ liệu đi vector trung bình của toàn bộ dữ liệu: Note: subtracting the mean is equivalent to translating the coordinate system to the location of the mean Principal component analysis (PCA) 3. Tính covaricance matrix 4. Perform the eigendecomposition : tìm eigenvectors và eigenvalues của S và sắp xếp theo thứ từ giảm dần của eigenvalues . 5. Chọn K eigenvectors với K trị riêng lớn nhất để xây dựng ma trận UK (projection matrix) có các cột tạo thành một hệ trực giao. K vectors này, còn được gọi là các thành phần chính, tạo thành một không gian con gần với phân bố của dữ liệu ban đầu đã chuẩn hoá. Principal component analysis (PCA) 6. Chiếu dữ liệu ban đầu đã chuẩn hoá xuống không gian con tìm được. 7. Dữ liệu mới chính là toạ độ của các điểm dữ liệu trên không gian mới. Dữ liệu ban đầu có thể tính được xấp xỉ theo dữ liệu mới như sau: PCA Principal component analysis (PCA) Principal component analysis (PCA) Principal component analysis (PCA) Principal component analysis (PCA) Principal component analysis (PCA) Principal component analysis (PCA) PCA  Ví dụ: Principal component analysis (PCA)  Problem: In the case of the data in the two classes, after projection along the first principal eigenvector, remain well separated. In the case of the data in the two classes, after projection along the first principal eigenvector, is not well separated because the mean of the two classes are too close Principal component analysis (PCA)  Disadvantage of PCA Principal Component Analysis •higher variance •bad for discriminability Fisher Linear Discriminant Linear Discriminant Analysis •smaller variance •good discriminability Fisher’s linear discriminant analysis  The objective of LDA is to perform dimensionality reduction while preserving as much of the class discriminatory information as possible.  Assume we have a set of l-dimensional samples , N1 of which belong to class 1 , and N2 to class 2 . We seek to obtain a scalar y by projecting the samples x onto a line: }x,,x,{x 21 N xwy T= Fisher’s linear discriminant analysis ❖ Case: TWO CLASSES  Of all the possible lines we would like to select the one that maximizes the separability of the scalars. This is illustrated for the two-dimensional case in the following figures The two classes are not well separated when projected onto this line This line succeeded in separating the two classes and in the meantime reducing the dimensionality of our problem from two features (x1,x2) to only a scalar value y. Fisher’s linear discriminant analysis  The Fisher linear discriminant is defined as the linear function that maximizes the criterion function: xwT 2 2 2 1 2 21 ）-（   + =FDR where  1,  2 are the mean values of y, and 1 ,  2 are the variances of y in the two classes, respectively. LDA...two classes  In order to find the optimum projection w*, we need to express FDR as an explicit function of w.  For two equiprobable classes ➢ SW is known as the within-class scatter matrix, with S1 , S2 being the respective covariance matrices.  Now, the scatter of the projection y can then be expressed as a function of the scatter matrix in feature space x. )( 2 1 21w SSS += T ii i i mxmxN S i )()( 1 x −−=   wSw T w 2 2 2 1 =+  LDA...two classes  Sb is known as the between-class scatter matrix ➢ m0 is the overall mean of the data x in the original l-dimensional space and m1 , m2 are the mean values in the two classes, respectively  Similarly, the difference between the projected means (in y-space) can be expressed in terms of the means in the original feature space (x-space). TT mmmmmmmmS ))(( 2 1 ))(( 2 1 02020101b −−+−−= ww）-（ b 2 21 S T= LDA...two classes  We can finally express the Fisher criterion in terms of SW and Sb as: wS wS FDR w T b T w w）-（ 2 2 2 1 2 21 = + =   LDA...two classes  First method：eigen-decompostition  Second method： 0 w w dw d =        wS wS w T b T wwSS bw = −1 scalarFDR == )( w w maxargmaxarg 21 1* mmS wS wS FDRw w w T b T ww −=        == − where Fisher’s linear discriminant analysis  Case: C-classes Principal component analysis (PCA)  PCA vs. LDA LDA PCA Principal component analysis (PCA)  PCA vs. LDA Principal component analysis (PCA)  Improvements of PCA  Probabilistic PCA  Kernel PCA  Robust PCA ( - low rank matrix recovery) Linear Discriminant Analysis (LDA)  Improvements of LDA  Probabilistic LDA  Kernel Fisher Discriminant Analysis Example: Eigenface  Dùng PCA như một phương pháp rút trích đặc trưng (feature extraction) cho hình ảnh khuôn mặt (facial images).  Eigenface là một trong các phương pháp phổ biến nhất trong bài toán nhận dạng khuôn mặt.  Ý tưởng của Eigenface là đi tìm một không gian có số chiều nhỏ hơn để mô tả mỗi khuôn mặt, từ đó sử dụng vector trong không gian thấp này như là feature vector cho việc thực hiện classification.  Điều đáng nói là một bức ảnh khuôn mặt có kích thước khoảng 200 × 200 sẽ có số chiều là 40k - là một số cực lớn, trong khi đó, feature vector thường chỉ có số chiều bằng vài trăm.  Các Eigenfaces chính là các eigenvectors ứng với các trị riêng lớn nhất của ma trận hiệp phương sai. Eigenface  Face Identification using PCA  An image is a point in a high dimensional space  An N x M image is a point in RNxM  We can define vectors in this space as we did in the 2D case Eigenface: Training  Get 𝑀 training samples with variances Images are in same size Eigenface: Training  Step 0: Convert all the images in vector form.  Step 1: Calculate the mean (Average Face) Eigenface: Training  Step 2: Normalize vectors  Step 3: Form the covariance matrix Eigenface: Training  Step 4: We calculate the Eigen vectors of Covariance Matrix  Step 5: So, we choose k eigenvectors corresponding to k largest eigenvalues to construct a projection matrix Very high dimension So, we compute Eigenface: Training  Step 6-7: Training Images projected to face space. Eigenface: Training Eigenface: Test  Take query image t  Substact mean image : y = t –   Project y into eigenface space and compute projection  Compare projection I with all N training projections  Simple comparison metric: Euclidean  Classifier: multiclass-logistic regression, SVM, Eigenface Eigenface  Pros  Ease of implementation  No knowledge of geometry or specific feature of the face required  Cons  Sensitive to head scale  Applicable only to front view  Good performance only under controlled background (not including natural scenes) State-of-the-art face recognition performance  OpenFace  FaceNet embedding  InsightFace  Feature Selection Introduction: Feature Selection  Need for a small number of discriminative features  To avoid “curse of dimensionality”  To reduce feature measurement cost  To reduce computational burden  Feature selection algorithms can be categorized  Supervised: filter models, wrapper models, and embedded models  Unsupervised  Semi-supervised Introduction  A feature selection method consists of four basic steps  subset generation  subset evaluation  stopping criterion  result validation Introduction Feature Selection for Classification Introduction  Feature selection for classification attempts to select the minimally sized subset of features according to the following criteria:  the classification accuracy does not significantly decrease; and  the resulting class distribution, given only the values for the selected features, is as close as possible to the original class distribution, given all features. Introduction  Feature selection is an optimization problem.  Search the space of possible feature subsets.  Pick the subset that is optimal or near-optimal with respect to a certain criterion. Search strategies Evaluation strategies - Optimum - Filter methods - Heuristic - Wrapper methods - Randomized Introduction Algorithms for Flat Features  Three groups:  filter models  wrapper models  embedded models  Features are assumed to be independent Algorithms for Flat Features 1. Filter Models  Filter models evaluate features without utilizing any classification algorithms  The objective function evaluates feature subsets by their information content, typically interclass distance, statistical dependence or information- theoretic measures.  Performance criteria : Fisher score, Mutual information, ReliefF Algorithms for Flat Features ❖ Fisher Score  Features with high quality should assign similar values to instances in the same class and different values to instances from different classes.  The score for the i-th feature Si will be calculated by Fisher Score as Algorithms for Flat Features ❖ Mutual Information based on Methods  Information gain is used to measure the dependence between features and labels  Calculates the information gain between the i-th feature fi and the class labels C as  A feature is relevant if it has a high information gain Algorithms for Flat Features Algorithms for Flat Features 2. Wrapper Models  It utilize a specific classifier to evaluate the quality of selected features, and offer a simple and powerful way to address the problem of feature selection, regardless of the chosen learning machine  A typical wrapper model will perform the following steps:  Step 1: searching a subset of features,  Step 2: evaluating the selected subset of features by the performance of the classifier,  Step 3: repeating Step 1 and Step 2 until the desired quality is reached.  Search strategies: hill-climbing, best-first, GA, Algorithms for Flat Features Algorithms for Flat Features Algorithms for Flat Features 3. Embedded Models  Embedded Models embedding feature selection with classifier construction, have the advantages of (1) wrapper models — they include the interaction with the classification model and (2) filter models—and they are far less computationally intensive than wrapper methods.  Three types of embedded methods  Pruning methods: that first utilize all features to train a model and then attempt to eliminate some features by setting the corresponding coefficients to 0, while maintaining model performance such as recursive feature elimination using a support vector machine.  The second are models with a built-in mechanism for feature selection such as ID3 and C4.5.  The third are regularization models with objective functions that minimize fitting errors and in the meantime force the coefficients to be small or to be exactly zero. Features with coefficients that are close to 0 are then eliminated. Methods: Lasso Regularization, Adaptive Lasso, Bridge regularization, Elastic net regularization. Algorithms for Flat Features ❖ Regularization methods  Classifier induction and feature selection are achieved simultaneously by estimating w with properly tuned penalties.  Feature selection is achieved and only features with nonzero coefficients in w will be used in the classifier  Lasso Regularization: Lasso regularization is based on l1-norm of the coefficient of w and defined as l1 regularization can generate an estimation of w with exact zero coefficients. In other words, there are zero entities in w, which denotes that the corresponding features are eliminated during the classifier learning process. Therefore, it can be used for feature selection. Algorithms for Structured Features  For many real-world applications, the features exhibit certain intrinsic structures, e.g., spatial or temporal smoothness, disjoint/overlapping groups, trees, and graphs.  Incorporating knowledge about the structures of features may significantly improve the classification performance and help identify the important features.  We focus on linear classifiers with structured features to minimize a empirical error penalized by a regularization term as where G denotes the structure of features, and α controls the trade-off between data fitting and regularization Research ▪ Algorithms for Structured Features ▪ Algorithms for Streaming Features Discussions and Challenges 1. Scalability  With the growth of dataset sizes, the scalability of current algorithms may be in jeopardy, especially with these domains that require an online classifier.  The scalability of feature selection algorithms is a big problem. Usually, they require a sufficient number of samples to obtain, statically, adequate results. Discussions and Challenges 2. Stability  It is defined as the sensitivity of the selection process to data perturbation in the training set.  The underlying characteristics of data can greatly affect the stability of an algorithm.  These characteristics include dimensionality m, sample size n, and different data distribution across different folds, and the stability issue tends to be data dependent Discussions and Challenges 3. Linked Data  Linked data has become ubiquitous in real-world applications such as tweets in Twitter (tweets linked through hyperlinks), social networks in Facebook (people connected by friendships), and biological networks (protein interaction networks).  Linked data is patently not independent and identically distributed (i.i.d.).  Feature selection methods for linked data need to solve the following immediate challenges:  How to exploit relations among data instances; and  How to take advantage of these relations for feature selection.  Two attempts to handle linked data w.r.t. feature selection for classification are LinkedFS and FSNet Homework 1) Apply feature selection for Boston dataset recognition. Ref: https://datasciencebeginners.com/2018/11/26/using-scikit-learn-in- python-for-feature-selection/ Homework 2) Apply PCA as dimension reduction for iris recognition Ref : https://towardsdatascience.com/pca-using-python-scikit-learn- e653f8989e60