Introduction
Dimensionality reduction is one of the most popular techniques to remove noisy (i.e., irrelevant)
and redundant features.
Dimensionality reduction techniques: feature extraction v.s feature selection
feature extraction: given N features (set X), extract M new features (set Y) by linear or nonlinear combination of all the N features (i.e. PCA, LDA)
feature selection: choose a best subset of highly discriminant features of size M from the
available N features (i.e. Information Gain, ReliefF, Fisher Score)
81 trang |
Chia sẻ: thanhle95 | Lượt xem: 553 | Lượt tải: 1
Bạn đang xem trước 20 trang tài liệu Bài giảng Máy học nâng cao - Chương 9: Dimensionality reduction and feature selection - Trịnh Tấn Đạt, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên
Trịnh Tấn Đạt
Khoa CNTT – Đại Học Sài Gòn
Email: trinhtandat@sgu.edu.vn
Website: https://sites.google.com/site/ttdat88/
Contents
Introduction: dimensionality reduction and feature selection
Dimensionality Reduction
Principal Component Analysis (PCA)
Fisher’s linear discriminant analysis (LDA)
Example: Eigenface
Feature Selection
Homework
Introduction
High-dimensional data often contain redundant features
reduce the accuracy of data classification algorithms
slow down the classification process
be a problem in storage and retrieval
hard to interpret (visualize)
Why we need dimensionality reduction???
To avoid “curse of dimensionality”
To reduce feature measurement cost
To reduce computational cost
https://en.wikipedia.org/wiki/Curse_of_dimensionality
Introduction
Dimensionality reduction is one of the most popular techniques to remove noisy (i.e., irrelevant)
and redundant features.
Dimensionality reduction techniques: feature extraction v.s feature selection
feature extraction: given N features (set X), extract M new features (set Y) by linear or non-
linear combination of all the N features (i.e. PCA, LDA)
feature selection: choose a best subset of highly discriminant features of size M from the
available N features (i.e. Information Gain, ReliefF, Fisher Score)
Dimensionality Reduction
Principal component analysis (PCA)
❖ Variance v.s. Covariance
Variance : phương sai của một biến ngẫu nhiên là thước đo sự phân tán thống kê của
biến đó, nó hàm ý các giá trị của biến đó thường ở cách giá trị kỳ vọng bao xa.
Covariance: hiệp phương sai là độ đo sự biến thiên cùng nhau của hai biến ngẫu nhiên
(phân biệt với phương sai - đo mức độ biến thiên của một biến)
1
( )(y )
Cov( , )
(N 1)
N
i i
i
x x y
X Y =
− −
=
−
Low variance High variance
Principal component analysis (PCA)
Mean (expected value): giá trị “mong muốn”,
biểu diễn giá trị trung bình của một biến.
Standard Deviation: Độ lệch chuẩn đo tính
biến động của giá trị mang tính thống kê. Nó
cho thấy sự chênh lệch về giá trị của từng thời
điểm đánh giá so với giá trị trung bình.
Principal component analysis (PCA)
Representing Covariance between dimensions as a matrix e.g. for 3 dimensions:
cov(x,y) = cov(y,x) hence matrix is symmetrical about the diagonal
N-dimensional data will result in NxN covariance matrix
Principal component analysis (PCA)
What is the interpretation of covariance calculations?
e.g.: dữ liệu 2 chiều
x: số lượng giờ học một môn học
y: điểm số của một môn học
covariance value ~ 104.53
what does this value mean?
-> số lượng giờ học tăng , điểm số
Principal component analysis (PCA)
Exact value is not as important as it’s sign.
A positive value of covariance indicates both dimensions increase or
decrease together (e.g. as the number of hours studied increases, the marks
in that subject increase.)
A negative value indicates while one increases the other decreases, or
vice-versa (e.g. active social life v.s performance in class.)
If covariance is zero: the two dimensions are independent of each other
(e.g. heights of students vs the marks obtained in a subject.)
Principal component analysis (PCA)
Principal component analysis (PCA)
Principal components analysis (PCA) là một phương pháp để đơn giản hóa một tập dữ liệu
(simplify a dataset) , chằng hạn giảm số chiều của dữ liệu.
“It is a linear transformation that chooses a new coordinate system for the data set such that
the greatest variance by any projection of the data set comes to lie on the first axis
(then called the first principal component),
the second greatest variance on the second axis
and so on. ”
PCA có thể được dùng để giảm số chiều bằng cách loại bỏ những thành phần chính không
quan trọng.
Principal component analysis (PCA)
Ví dụ:
khi thực hiện các phân tích đa biến mà
trong đó các biến có tương quan với nhau
gây nhiều khó khăn
loại bỏ sự tương quan này bằng cách xoay trục (cơ sở)
dữ liệu trên trục mới đã giảm
sự tương quan đáng kể
(biến Y1 và Y2 gần như không
tương quan)
sự thay đổi của dữ liệu phụ
thuộc phần lớn vào biến Y1
giảm số chiều dữ liệu mà
không làm giàm quá nhiều
“phương sai” của dữ liệu
Principal component analysis (PCA)
Note:
Giúp giảm số chiều của dữ liệu;
Thay vì giữ lại các trục tọa độ của không gian cũ, PCA xây dựng một không gian
mới ít chiều hơn, nhưng lại có khả năng biểu diễn dữ liệu tốt tương đương không
gian cũ.
Principal component analysis (PCA)
Ví dụ:
Khám phá liên kết tiềm ẩn nhờ đổi hệ trục tọa độ, cách nhìn khác nhau về cùng một dữ liệu.
Principal component analysis (PCA)
Ví dụ:
Notice that "the maximum variance" and "the minimum
error" are reached at the same time, namely when the line
points to the magenta ticks
Principal component analysis (PCA)
How to find the optimal linear transformation A ( where y = Ax)
-1. Origin of PCA coordinate mean of samples
-2. Maximize projected variance
-3. Minimize projection cost min x y−
Principal component analysis (PCA)
https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.12-Example-Principal-Components-
Analysis/
Principal component analysis (PCA)
Note:
The eigenvectors of the covariance matrix define a new coordinate system
Eigenvector with largest eigenvalue captures the most variation among
training vectors x
eigenvector with smallest eigenvalue has least variation
The eigenvectors are known as principal components
https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.12-Example-Principal-Components-Analysis/
Principal component analysis (PCA)
Algorithm: 7 steps
Input : N mẫu input
1. Tính vector trung bình của toàn bộ dữ liệu:
2. Trừ mỗi điểm dữ liệu đi vector trung bình của toàn bộ dữ liệu:
Note: subtracting the mean is equivalent to translating the coordinate system to the location of the mean
Principal component analysis (PCA)
3. Tính covaricance matrix
4. Perform the eigendecomposition : tìm eigenvectors và eigenvalues của S và sắp xếp
theo thứ từ giảm dần của eigenvalues .
5. Chọn K eigenvectors với K trị riêng lớn nhất để xây dựng ma trận UK (projection
matrix) có các cột tạo thành một hệ trực giao.
K vectors này, còn được gọi là các thành phần chính, tạo thành một không gian
con gần với phân bố của dữ liệu ban đầu đã chuẩn hoá.
Principal component analysis (PCA)
6. Chiếu dữ liệu ban đầu đã chuẩn hoá xuống không gian con tìm được.
7. Dữ liệu mới chính là toạ độ của các điểm dữ liệu trên không gian mới.
Dữ liệu ban đầu có thể tính được xấp xỉ theo dữ liệu mới như sau:
PCA
Principal component analysis (PCA)
Principal component analysis (PCA)
Principal component analysis (PCA)
Principal component analysis (PCA)
Principal component analysis (PCA)
Principal component analysis (PCA)
PCA
Ví dụ:
Principal component analysis (PCA)
Problem:
In the case of the data in the two classes, after projection
along the first principal eigenvector, remain well separated.
In the case of the data in the two classes, after
projection along the first principal eigenvector, is
not well separated because the mean of the two
classes are too close
Principal component analysis (PCA)
Disadvantage of PCA Principal Component Analysis
•higher variance
•bad for discriminability
Fisher Linear Discriminant
Linear Discriminant Analysis
•smaller variance
•good discriminability
Fisher’s linear discriminant analysis
The objective of LDA is to perform dimensionality reduction while preserving as
much of the class discriminatory information as possible.
Assume we have a set of l-dimensional samples , N1 of which
belong to class 1 , and N2 to class 2 . We seek to obtain a scalar y by projecting
the samples x onto a line:
}x,,x,{x 21 N
xwy T=
Fisher’s linear discriminant analysis
❖ Case: TWO CLASSES
Of all the possible lines we would like to select the one that maximizes
the separability of the scalars. This is illustrated for the two-dimensional
case in the following figures
The two classes are not well
separated when projected onto
this line
This line succeeded in
separating the two classes
and in the meantime
reducing the dimensionality
of our problem from two
features (x1,x2) to only a
scalar value y.
Fisher’s linear discriminant analysis
The Fisher linear discriminant is defined as the linear function
that maximizes the criterion function:
xwT
2
2
2
1
2
21 )-(
+
=FDR
where 1, 2 are the mean values of y,
and 1 , 2 are the variances of y in the
two classes, respectively.
LDA...two classes
In order to find the optimum projection w*, we need to express FDR as an
explicit function of w.
For two equiprobable classes
➢ SW is known as the within-class scatter matrix, with S1 , S2 being the respective covariance
matrices.
Now, the scatter of the projection y can then be expressed as a function of the
scatter matrix in feature space x.
)(
2
1
21w SSS +=
T
ii
i
i mxmxN
S
i
)()(
1
x
−−=
wSw T w
2
2
2
1 =+
LDA...two classes
Sb is known as the between-class scatter matrix
➢ m0 is the overall mean of the data x in the original l-dimensional space and m1 , m2 are the
mean values in the two classes, respectively
Similarly, the difference between the projected means (in y-space) can be
expressed in terms of the means in the original feature space (x-space).
TT mmmmmmmmS ))((
2
1
))((
2
1
02020101b −−+−−=
ww)-( b
2
21 S
T=
LDA...two classes
We can finally express the Fisher criterion in terms of SW and Sb as:
wS
wS
FDR
w
T
b
T
w
w)-(
2
2
2
1
2
21 =
+
=
LDA...two classes
First method:eigen-decompostition
Second method:
0
w
w
dw
d
=
wS
wS
w
T
b
T
wwSS bw =
−1 scalarFDR ==
)(
w
w
maxargmaxarg 21
1* mmS
wS
wS
FDRw w
w
T
b
T
ww
−=
== −
where
Fisher’s linear discriminant analysis
Case: C-classes
Principal component analysis (PCA)
PCA vs. LDA
LDA PCA
Principal component analysis (PCA)
PCA vs. LDA
Principal component analysis (PCA)
Improvements of PCA
Probabilistic PCA
Kernel PCA
Robust PCA ( - low rank matrix recovery)
Linear Discriminant Analysis (LDA)
Improvements of LDA
Probabilistic LDA
Kernel Fisher Discriminant Analysis
Example: Eigenface
Dùng PCA như một phương pháp rút trích đặc trưng (feature extraction) cho
hình ảnh khuôn mặt (facial images).
Eigenface là một trong các phương pháp phổ biến nhất trong bài toán nhận dạng
khuôn mặt.
Ý tưởng của Eigenface là đi tìm một không gian có số chiều nhỏ hơn để mô tả mỗi
khuôn mặt, từ đó sử dụng vector trong không gian thấp này như là feature vector cho
việc thực hiện classification.
Điều đáng nói là một bức ảnh khuôn mặt có kích thước khoảng 200 × 200 sẽ có số
chiều là 40k - là một số cực lớn, trong khi đó, feature vector thường chỉ có số chiều
bằng vài trăm.
Các Eigenfaces chính là các eigenvectors ứng với các trị riêng lớn nhất của ma trận
hiệp phương sai.
Eigenface
Face Identification using PCA
An image is a point in a high dimensional space
An N x M image is a point in RNxM
We can define vectors in this space as we did in the 2D case
Eigenface: Training
Get 𝑀 training samples with variances
Images are in same size
Eigenface: Training
Step 0: Convert all the images in vector form.
Step 1: Calculate the mean (Average Face)
Eigenface: Training
Step 2: Normalize vectors
Step 3: Form the covariance matrix
Eigenface: Training
Step 4: We calculate the Eigen vectors of Covariance Matrix
Step 5: So, we choose k eigenvectors corresponding to k largest eigenvalues to
construct a projection matrix
Very high dimension
So, we compute
Eigenface: Training
Step 6-7: Training Images projected to face space.
Eigenface: Training
Eigenface: Test
Take query image t
Substact mean image : y = t –
Project y into eigenface space and compute projection
Compare projection I with all N training projections
Simple comparison metric: Euclidean
Classifier: multiclass-logistic regression, SVM,
Eigenface
Eigenface
Pros
Ease of implementation
No knowledge of geometry or specific feature of the face required
Cons
Sensitive to head scale
Applicable only to front view
Good performance only under controlled background (not including natural
scenes)
State-of-the-art face recognition performance
OpenFace
FaceNet embedding
InsightFace
Feature Selection
Introduction: Feature Selection
Need for a small number of discriminative features
To avoid “curse of dimensionality”
To reduce feature measurement cost
To reduce computational burden
Feature selection algorithms can be categorized
Supervised: filter models, wrapper models, and embedded models
Unsupervised
Semi-supervised
Introduction
A feature selection method consists of four basic steps
subset generation
subset evaluation
stopping criterion
result validation
Introduction
Feature Selection for Classification
Introduction
Feature selection for classification attempts to select the minimally sized
subset of features according to the following criteria:
the classification accuracy does not significantly decrease; and
the resulting class distribution, given only the values for the selected features,
is as close as possible to the original class distribution, given all features.
Introduction
Feature selection is an optimization problem.
Search the space of possible feature subsets.
Pick the subset that is optimal or near-optimal with respect to a certain
criterion.
Search strategies Evaluation strategies
- Optimum - Filter methods
- Heuristic - Wrapper methods
- Randomized
Introduction
Algorithms for Flat Features
Three groups:
filter models
wrapper models
embedded models
Features are assumed to be independent
Algorithms for Flat Features
1. Filter Models
Filter models evaluate features without
utilizing any classification algorithms
The objective function evaluates feature subsets
by their information content, typically interclass
distance, statistical dependence or information-
theoretic measures.
Performance criteria : Fisher score, Mutual
information, ReliefF
Algorithms for Flat Features
❖ Fisher Score
Features with high quality should assign similar values to instances in the same class and
different values to instances from different classes.
The score for the i-th feature Si will be calculated by Fisher Score as
Algorithms for Flat Features
❖ Mutual Information based on Methods
Information gain is used to measure the dependence between features and labels
Calculates the information gain between the i-th feature fi and the class labels C as
A feature is relevant if it has a high information gain
Algorithms for Flat Features
Algorithms for Flat Features
2. Wrapper Models
It utilize a specific classifier to evaluate the quality of selected features, and offer
a simple and powerful way to address the problem of feature selection,
regardless of the chosen learning machine
A typical wrapper model will perform the following steps:
Step 1: searching a subset of features,
Step 2: evaluating the selected subset of features by the performance of the
classifier,
Step 3: repeating Step 1 and Step 2 until the desired quality is reached.
Search strategies: hill-climbing, best-first, GA,
Algorithms for Flat Features
Algorithms for Flat Features
Algorithms for Flat Features
3. Embedded Models
Embedded Models embedding feature selection with classifier construction, have the
advantages of (1) wrapper models — they include the interaction with the
classification model and (2) filter models—and they are far less computationally
intensive than wrapper methods.
Three types of embedded methods
Pruning methods: that first utilize all features to train a model and then attempt to eliminate some features by
setting the corresponding coefficients to 0, while maintaining model performance such as recursive feature
elimination using a support vector machine.
The second are models with a built-in mechanism for feature selection such as ID3 and C4.5.
The third are regularization models with objective functions that minimize fitting errors and in the meantime
force the coefficients to be small or to be exactly zero. Features with coefficients that are close to 0 are then
eliminated. Methods: Lasso Regularization, Adaptive Lasso, Bridge regularization, Elastic net
regularization.
Algorithms for Flat Features
❖ Regularization methods
Classifier induction and feature selection are achieved simultaneously by estimating w with properly tuned
penalties.
Feature selection is achieved and only features with nonzero coefficients in w will be used in the classifier
Lasso Regularization: Lasso regularization is based on l1-norm of the coefficient of w and defined as
l1 regularization can generate an estimation of w with exact zero coefficients. In other words, there are zero
entities in w, which denotes that the corresponding features are eliminated during the classifier learning
process. Therefore, it can be used for feature selection.
Algorithms for Structured Features
For many real-world applications, the features exhibit certain intrinsic structures, e.g., spatial or
temporal smoothness, disjoint/overlapping groups, trees, and graphs.
Incorporating knowledge about the structures of features may significantly improve the
classification performance and help identify the important features.
We focus on linear classifiers with structured features to minimize a empirical error penalized by
a regularization term as
where G denotes the structure of features, and α controls the trade-off between data fitting and
regularization
Research
▪ Algorithms for Structured Features
▪ Algorithms for Streaming Features
Discussions and Challenges
1. Scalability
With the growth of dataset sizes, the scalability of current
algorithms may be in jeopardy, especially with these domains that
require an online classifier.
The scalability of feature selection algorithms is a big problem.
Usually, they require a sufficient number of samples to obtain,
statically, adequate results.
Discussions and Challenges
2. Stability
It is defined as the sensitivity of the selection process to data
perturbation in the training set.
The underlying characteristics of data can greatly affect the stability
of an algorithm.
These characteristics include dimensionality m, sample size n, and
different data distribution across different folds, and the stability
issue tends to be data dependent
Discussions and Challenges
3. Linked Data
Linked data has become ubiquitous in real-world applications such as tweets in
Twitter (tweets linked through hyperlinks), social networks in Facebook (people
connected by friendships), and biological networks (protein interaction
networks).
Linked data is patently not independent and identically distributed (i.i.d.).
Feature selection methods for linked data need to solve the following immediate
challenges:
How to exploit relations among data instances; and
How to take advantage of these relations for feature selection.
Two attempts to handle linked data w.r.t. feature selection for classification are
LinkedFS and FSNet
Homework
1) Apply feature selection for Boston dataset recognition.
Ref: https://datasciencebeginners.com/2018/11/26/using-scikit-learn-in-
python-for-feature-selection/
Homework
2) Apply PCA as dimension reduction for iris recognition
Ref : https://towardsdatascience.com/pca-using-python-scikit-learn-
e653f8989e60