Python iris: hand-coded PCA and LDA, dict to DataFrame, parallel coordinates plot

Article overview: using the iris dataset, convert sklearn's built-in iris from dict format to a DataFrame, visualize it with a parallel coordinates plot, conclude from the plot that PCA and LDA are worth doing, implement the dimensionality reduction by hand from the principles of PCA and LDA, and plot the results of each.

import pandas as pd
import matplotlib.pyplot as plt
import warnings
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import datasets
from pandas.plotting import parallel_coordinates
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese characters in plots
plt.rcParams['axes.unicode_minus'] = False    # display minus signs correctly
warnings.filterwarnings("ignore")             # suppress warnings

1) Read data

In [220]:

iris_data = datasets.load_iris() 

2) Convert the dict to a DataFrame

In [221]:

iris0 = pd.DataFrame(data=np.c_[iris_data['data'], iris_data['target']], columns=iris_data['feature_names'] + ['target'])
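As an aside, recent versions of scikit-learn (0.23 and later) can return a DataFrame directly, which skips the np.c_ step; a minimal sketch:

iris_frame = datasets.load_iris(as_frame=True).frame  # features plus a numeric 'target' column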

3) Convert target values 0, 1, 2 to strings

In [222]:

iris = iris0.copy()  # work on a copy so iris0 keeps the numeric labels
species = iris['target'].replace({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
iris['target'] = species

4) Basic information

In [223]:

iris.sample(6)

Out[223]:

  sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target
64 5.6 2.9 3.6 1.3 versicolor
32 5.2 4.1 1.5 0.1 setosa
144 6.7 3.3 5.7 2.5 virginica
86 6.7 3.1 4.7 1.5 versicolor
69 5.6 2.5 3.9 1.1 versicolor
46 5.1 3.8 1.6 0.2 setosa

In [224]:

iris.shape

Out[224]:

(150, 5)

In [225]:

iris.describe().T

Out[225]:

  count mean std min 25% 50% 75% max
sepal length (cm) 150.0 5.843333 0.828066 4.3 5.1 5.80 6.4 7.9
sepal width (cm) 150.0 3.057333 0.435866 2.0 2.8 3.00 3.3 4.4
petal length (cm) 150.0 3.758000 1.765298 1.0 1.6 4.35 5.1 6.9
petal width (cm) 150.0 1.199333 0.762238 0.1 0.3 1.30 1.8 2.5

5) Plot and inspect

In [226]:

parallel_coordinates(iris, 'target', color=('#556270', '#4ECDC4', '#C7F464'), alpha=0.5)
plt.title('Parallel Coordinates Graph of Iris', fontsize='xx-large')
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=3, fancybox=True, shadow=True)
plt.show()

[Figure: parallel coordinates plot of the iris dataset]

The plot shows that sepal length, petal length, and petal width each separate the three target classes fairly well, which suggests discriminant analysis. Moreover, on each of these three axes the green virginica samples lie at the top, the blue versicolor samples in the middle, and the black setosa samples at the bottom; the variables share a similar trend, which motivates principal component analysis.

2. PCA

1) Compute the covariance matrix

In [227]:

iris_covmat0 = np.cov(iris.iloc[:, :4].T)  # 4x4 covariance matrix of the four features
iris_covmat = pd.DataFrame(data=iris_covmat0, columns=iris_data['feature_names'], index=iris_data['feature_names'])
iris_covmat

Out[227]:

  sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
sepal length (cm) 0.685694 -0.042434 1.274315 0.516271
sepal width (cm) -0.042434 0.189979 -0.329656 -0.121639
petal length (cm) 1.274315 -0.329656 3.116278 1.295609
petal width (cm) 0.516271 -0.121639 1.295609 0.581006
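Since the point of the exercise is to hand-code PCA, it is worth checking np.cov against the textbook sample covariance formula S = Xc'Xc / (n - 1), where Xc is the column-centered data. A quick sanity check (new code, not in the original):

X = iris.iloc[:, :4].to_numpy()
Xc = X - X.mean(axis=0)                        # center each column
manual_cov = Xc.T @ Xc / (len(X) - 1)          # textbook sample covariance
print(np.allclose(manual_cov, iris_covmat0))   # True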

2) Compute eigenvalues and eigenvectors

In [228]:

eigenvalues, eigenvectors = np.linalg.eig(iris_covmat)
eigenvalues

Out[228]:

array([4.22824171, 0.24267075, 0.0782095 , 0.02383509])

The eigenvalues happen to come out already sorted from largest to smallest.
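Note, though, that np.linalg.eig makes no ordering guarantee, so relying on the returned order is fragile. A safer pattern is to sort the eigenpairs explicitly (here, order is just the identity permutation):

order = np.argsort(eigenvalues)[::-1]         # eigenvalue indices, largest first
eigenvalues_sorted = eigenvalues[order]
eigenvectors_sorted = eigenvectors[:, order]  # keep each column paired with its eigenvalue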

In [229]:

eigenvectors

Out[229]:

array([[ 0.36138659, -0.65658877, -0.58202985,  0.31548719],
       [-0.08452251, -0.73016143,  0.59791083, -0.3197231 ],
       [ 0.85667061,  0.17337266,  0.07623608, -0.47983899],
       [ 0.3582892 ,  0.07548102,  0.54583143,  0.75365743]])

3) Choose the number of principal components

In [230]:

var_percent = eigenvalues / np.sum(eigenvalues)  # proportion of variance per component
np.cumsum(var_percent)                           # cumulative proportion

Out[230]:

array([0.92461872, 0.97768521, 0.99478782, 1.        ])

After the linear transformation, the variance of the samples in the subspace spanned by the first two eigenvectors accounts for 97.77% of the total variance. That share is high enough, so the number of principal components is set to 2.
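The same rule can be written as a one-liner: pick the smallest k whose cumulative share clears a chosen threshold (0.95 here is an assumed cutoff for illustration):

threshold = 0.95                                                 # assumed variance cutoff
k = int(np.searchsorted(np.cumsum(var_percent), threshold)) + 1  # smallest k reaching it
print(k)                                                         # 2 for the eigenvalues above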

4) Obtain the projected coordinates via the linear transformation

In [231]:

iris_pca = np.dot(iris.iloc[:, :4], eigenvectors[:, :2])  # project onto the first two eigenvectors
iris_pca[:5, :]  # first five rows

Out[231]:

array([[ 2.81823951, -5.64634982],
       [ 2.78822345, -5.14995135],
       [ 2.61337456, -5.18200315],
       [ 2.75702228, -5.0086536 ],
       [ 2.7736486 , -5.65370709]])
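One detail worth flagging: the projection above multiplies the raw data by the eigenvectors without subtracting the column means. Centering only translates the projected cloud, so the scatter plot below is unaffected, but conventional PCA scores are computed from centered data; a sketch of the centered version:

X = iris.iloc[:, :4].to_numpy()
Xc = X - X.mean(axis=0)                       # center before projecting
iris_pca_centered = Xc @ eigenvectors[:, :2]  # conventional PCA scores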

5) Visualization

In [232]:

%matplotlib inline
iris_pca_df = pd.DataFrame(data=iris_pca, columns=['PC1', 'PC2'])  # to a DataFrame for plotting
iris_pca_df['target'] = species
sns.lmplot(x='PC1', y='PC2', data=iris_pca_df, hue='target', fit_reg=False, legend=True)

Out[232]:

<seaborn.axisgrid.FacetGrid at 0x2991adf97b8>

[Figure: scatter plot of the first two principal components, colored by species]

After reducing to two dimensions, the points spread out across the plane and the three target classes are fairly well separated, so the data are clearly well suited to discriminant analysis.

3. LDA

1) Split into training and test sets

In [233]:

x_train, x_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target, test_size=0.3, random_state=100)
x0_train = x_train[y_train == 0]  # training samples of each class
x1_train = x_train[y_train == 1]
x2_train = x_train[y_train == 2]
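A small aside: with only 150 rows, a random 70/30 split can leave the classes unevenly represented. train_test_split accepts a stratify argument that preserves the 50/50/50 class balance in both splits, e.g.:

x_train, x_test, y_train, y_test = train_test_split(
    iris_data.data, iris_data.target,
    test_size=0.3, random_state=100,
    stratify=iris_data.target)  # keep class proportions equal across the split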

2) Find the within-class scatter matrix

In [234]:

n0 = np.shape(x0_train)[0]  # number of training samples with target 0
n1 = np.shape(x1_train)[0]
n2 = np.shape(x2_train)[0]
n = np.shape(x_train)[0]
Sw0 = (n0 - 1) * np.cov(np.transpose(x0_train))  # each class's within-class scatter equals its covariance matrix times (n_k - 1)
Sw1 = (n1 - 1) * np.cov(np.transpose(x1_train))
Sw2 = (n2 - 1) * np.cov(np.transpose(x2_train))
Sw = (Sw0 + Sw1 + Sw2) / n  # pooled within-class scatter, scaled by n
Sw

Out[234]:

array([[0.27156253, 0.096888  , 0.16735612, 0.03506082],
       [0.096888  , 0.09970043, 0.06113117, 0.03120712],
       [0.16735612, 0.06113117, 0.18831667, 0.04396675],
       [0.03506082, 0.03120712, 0.04396675, 0.03860174]])

3) Find the between-class scatter matrix

In [235]:

mu0 = np.mean(x0_train, axis=0)  # class means
mu1 = np.mean(x1_train, axis=0)
mu2 = np.mean(x2_train, axis=0)
mu = np.mean(x_train, axis=0)    # overall mean
d0 = mu0 - mu                    # class-mean deviations from the overall mean
d1 = mu1 - mu
d2 = mu2 - mu
Sb = (n0 * np.outer(d0, d0) + n1 * np.outer(d1, d1) + n2 * np.outer(d2, d2)) / n
Sb
Sb

Out[235]:

array([[ 0.37662795, -0.12517371,  1.01197721,  0.43036775],
       [-0.12517371,  0.07380887, -0.36319466, -0.14548195],
       [ 1.01197721, -0.36319466,  2.7415246 ,  1.15841421],
       [ 0.43036775, -0.14548195,  1.15841421,  0.49196153]])
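The two steps above hardcode the three classes. For reference, a more general sketch (my addition) computes both scatter matrices with a loop over the labels, using the same scaling by n as above:

def scatter_matrices(X, y):
    """Within-class (Sw) and between-class (Sb) scatter matrices, divided by n."""
    n, p = X.shape
    mu = X.mean(axis=0)                      # overall mean
    Sw, Sb = np.zeros((p, p)), np.zeros((p, p))
    for label in np.unique(y):
        Xk = X[y == label]                   # samples of one class
        mu_k = Xk.mean(axis=0)
        Sw += (len(Xk) - 1) * np.cov(Xk.T)   # within-class scatter of this class
        Sb += len(Xk) * np.outer(mu_k - mu, mu_k - mu)
    return Sw / n, Sb / n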

4) Find eigenvalues and eigenvectors

In [236]:

eig_matrix = np.dot(np.linalg.inv(Sw), Sb)  # LDA solves the eigenproblem of Sw^-1 Sb
eig_values, eig_vectors = np.linalg.eig(eig_matrix)
eig_values

Out[236]:

array([3.12987432e+01, 2.92769629e-01, 1.31100176e-17, 1.10499984e-14])
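Only the first two eigenvalues are essentially nonzero, and that is no accident: with C = 3 classes, Sb is a sum of three rank-one terms whose class-mean deviations are linearly dependent, so its rank is at most C - 1 = 2; the last two values are numerical noise. Also, since Sw^-1 Sb is not symmetric, np.linalg.eig may in general return complex and unordered results, so a defensive pattern (not needed here, where everything came out real and sorted) is:

order = np.argsort(eig_values.real)[::-1]        # largest discriminant power first
eig_values_sorted = eig_values[order].real       # drop negligible imaginary parts, if any
eig_vectors_sorted = eig_vectors[:, order].real  # columns stay paired with eigenvalues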

5) Obtain the projected coordinates via the linear transformation

In [237]:

x_train_lda = np.dot(x_train, eig_vectors[:, :2])  # project onto the two discriminant directions
x_train_lda[:5, :]  # first five rows

Out[237]:

array([[-1.30026736,  1.98915717],
       [-1.12281316,  1.53624681],
       [-1.37932661,  2.19577832],
       [ 0.78989652,  1.48524196],
       [-1.28157792,  1.57919854]])

6) Visualization

A. Scatter plot of the training set projected to two dimensions

In [238]:

%matplotlib inline
x_train_lda_df = pd.DataFrame(data=x_train_lda, columns=['LDA1', 'LDA2'])
x_train_lda_df['target'] = y_train 
sns.lmplot(x='LDA1', y='LDA2', data=x_train_lda_df, hue='target', fit_reg=False, legend=True)

Out[238]:

<seaborn.axisgrid.FacetGrid at 0x299197e2ef0>

[Figure: scatter plot of the training set in the LDA subspace, colored by class]

The three target classes separate very cleanly along the LDA1 axis, with class centers far apart from one another; the classification structure is excellent.

B. Scatter plot of the test set projected to two dimensions

In [239]:

x_test_lda = np.dot(x_test, eig_vectors[:,:2])
%matplotlib inline
x_test_lda_df = pd.DataFrame(data=x_test_lda, columns=['LDA1', 'LDA2'])
x_test_lda_df['target'] = y_test
sns.lmplot(x='LDA1', y='LDA2', data=x_test_lda_df, hue='target', fit_reg=False, legend=True)

Out[239]:

<seaborn.axisgrid.FacetGrid at 0x2991aff6748>

[Figure: scatter plot of the test set in the LDA subspace, colored by class]

The test set also separates well after projection. Actually, I had originally planned to use LDA as a classifier and predict the test-set labels with the projection matrix, but I did not know how to make the predictions, nor did I work out estimated subspaces for targets 0, 1, and 2.
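To sketch an answer to that prediction question (this is an addition, not part of the original analysis): a simple option is a nearest-centroid rule in the LDA subspace: project each class mean from the training set, then assign every test point to the class whose projected centroid is closest. Reusing the variables defined above:

W = eig_vectors[:, :2]                     # the two discriminant directions
means = np.vstack([x0_train.mean(axis=0),  # per-class means in the original space
                   x1_train.mean(axis=0),
                   x2_train.mean(axis=0)])
centroids = means @ W                      # class centroids in the LDA subspace
dists = np.linalg.norm(x_test_lda[:, None, :] - centroids[None, :, :], axis=2)
y_pred = np.argmin(dists, axis=1)          # nearest centroid index = predicted class
print((y_pred == y_test).mean())           # test-set accuracy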

7) Comparing PCA and LDA

Both PCA and LDA reduce the original four-dimensional data to two dimensions. The differences:
a. Projecting the LDA scatter plot onto the LDA1 axis leaves almost no overlap between the three classes, while projecting the PCA scatter plot onto the PC1 axis leaves considerable overlap between the orange versicolor and the green virginica.
b. The PCA subspace dimension of 2 was chosen from the proportion of variance retained after projection; 3 or 4 would also be legitimate choices. The LDA subspace dimension of 2 is the maximum possible, since it cannot exceed the number of classes minus one (3 - 1 = 2); a dimension of 1 is also possible, and judging from the scatter plot, the LDA1 axis alone already separates the samples extremely well.