Machine Learning - Feature Extraction - LDA (Linear Discriminant Analysis)

Section I: Brief Introduction on LDA

Linear Discriminant Analysis (LDA) can be used as a feature-extraction technique to increase computational efficiency and reduce the degree of overfitting caused by the curse of dimensionality in non-regularized models. The general concept behind LDA is very similar to PCA. Whereas PCA attempts to find the orthogonal component axes of maximum variance in a dataset, the goal of LDA is to find the feature subspace that optimizes class separability. In contrast to PCA, LDA is a supervised algorithm.
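The supervised/unsupervised contrast can be seen directly in the fit signatures: a minimal sketch, assuming the same Wine dataset used below, where PCA fits on `X` alone while LDA also requires the class labels `y`.

```python
# PCA is unsupervised (fit on X only); LDA is supervised (fit on X and y).
from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

X, y = datasets.load_wine(return_X_y=True)
X_pca = PCA(n_components=2).fit_transform(X)     # no labels needed
X_lda = LDA(n_components=2).fit_transform(X, y)  # labels required
print(X_pca.shape, X_lda.shape)  # (178, 2) (178, 2)
```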

Personal Views
LDA is a supervised learning algorithm: it uses the class labels to compute the between-class and within-class scatter matrices, and then, much like PCA, computes the eigenvalues and eigenvectors of the resulting matrix. From this perspective, LDA and PCA are somewhat similar.
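The idea above can be sketched by hand: compute the within-class scatter matrix S_W and between-class scatter matrix S_B from the labeled Wine data, then eigendecompose inv(S_W) @ S_B, analogous to the eigendecomposition step in PCA. This is an illustrative sketch, not scikit-learn's internal implementation.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)
n_features = X.shape[1]
mean_overall = X.mean(axis=0)

S_W = np.zeros((n_features, n_features))  # within-class scatter
S_B = np.zeros((n_features, n_features))  # between-class scatter
for label in np.unique(y):
    X_c = X[y == label]
    mean_c = X_c.mean(axis=0)
    # Within-class: scatter of each sample around its own class mean
    S_W += (X_c - mean_c).T @ (X_c - mean_c)
    # Between-class: weighted outer product of (class mean - overall mean)
    d = (mean_c - mean_overall).reshape(-1, 1)
    S_B += X_c.shape[0] * (d @ d.T)

# Eigendecompose inv(S_W) @ S_B; only n_classes - 1 eigenvalues are non-zero
eigvals = np.linalg.eigvals(np.linalg.inv(S_W) @ S_B).real
eigvals_sorted = np.sort(eigvals)[::-1]
print(eigvals_sorted[:3])  # two non-zero eigenvalues, the rest ~0
```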

FROM
Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd Edition. Nanjing: Southeast University Press, 2018.

Section II: Code Bundle

Code

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from PCA.visualize import plot_decision_regions

#Section 1: Configure matplotlib
plt.rcParams['figure.dpi']=200
plt.rcParams['savefig.dpi']=200
font = {'family': 'Times New Roman',
        'weight': 'light'}
plt.rc("font", **font)

#Section 2: Load data and split it into train/test dataset
wine=datasets.load_wine()
X,y=wine.data,wine.target
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=1,stratify=y)

st=StandardScaler()
X_train_std=st.fit_transform(X_train)
X_test_std=st.transform(X_test)

#Section 3: Use LDA for feature reduction
lda=LDA(n_components=2)
X_train_lda=lda.fit_transform(X_train_std,y_train)
print("Eigenvalue Ratios in Descending Order",lda.explained_variance_ratio_)

lr=LogisticRegression()
lr.fit(X_train_lda,y_train)

plot_decision_regions(X_train_lda,y_train,classifier=lr)
plt.xlabel('LD 1')
plt.ylabel('LD 2')
plt.title("LDA - Training Dataset")
plt.legend(loc='upper left')
plt.savefig('./fig1.png')
plt.show()

Results

[Figure: LDA decision regions on the training dataset (fig1.png)]

Eigenvalue Ratios in Descending Order [0.68259828 0.31740172]

From the output above, we can see that the number of linear discriminant components is at most the number of classes minus one. This can be checked by setting LDA's n_components to the original number of features. Interestingly, although the dataset has 13 features, running with n_components=13 still yielded only 2 eigenvalues, since there are 3 wine classes. (Note: newer scikit-learn versions raise a ValueError when n_components exceeds min(n_features, n_classes - 1), rather than silently capping it.)
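The cap described above can be demonstrated portably by leaving n_components unset, in which case scikit-learn uses the maximum min(n_classes - 1, n_features) itself. A small sketch on the Wine data:

```python
import numpy as np
from sklearn import datasets
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

X, y = datasets.load_wine(return_X_y=True)
n_classes = np.unique(y).size                     # 3 wine classes
max_components = min(n_classes - 1, X.shape[1])   # -> 2

lda = LDA()  # n_components=None: use the maximum allowed
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)                         # (178, 2)
print(lda.explained_variance_ratio_.size)  # 2
```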

Eigenvalue Ratios in Descending Order [0.68259828 0.31740172]

References
Sebastian Raschka, Vahid Mirjalili. Python Machine Learning, 2nd Edition. Nanjing: Southeast University Press, 2018.