Python iris: hand-coded PCA and LDA, dict to DataFrame, parallel coordinates plot
Summary: using the iris data, convert sklearn's built-in iris from dict format to a DataFrame and visualize it with a parallel coordinates plot. The plot suggests that PCA and LDA are worthwhile, so we implement both dimensionality-reduction methods from first principles as hand-written functions and plot the results of each.
import pandas as pd
import matplotlib.pyplot as plt
import warnings
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn import datasets
from pandas.plotting import parallel_coordinates

plt.rcParams['font.sans-serif'] = ['SimHei']  # allow Chinese characters in plots
plt.rcParams['axes.unicode_minus'] = False    # render minus signs correctly
warnings.filterwarnings("ignore")             # suppress warnings
1) read data
In [220]:
iris_data = datasets.load_iris()
2) dict to DataFrame
In [221]:
iris0 = pd.DataFrame(data=np.c_[iris_data['data'], iris_data['target']],
                     columns=iris_data['feature_names'] + ['target'])
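As an aside, recent scikit-learn releases (0.23 or newer; whether the installed version qualifies is an assumption here) can return the same DataFrame directly, which would make the manual np.c_ conversion unnecessary:

iris_df = datasets.load_iris(as_frame=True).frame  # feature columns plus a numeric 'target' column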
3) map target codes 0, 1, 2 to species names
In [222]:
iris = iris0.copy()  # work on a copy so iris0 keeps the numeric targets
species = pd.Series(iris['target'])
species = species.replace({0: 'setosa', 1: 'versicolor', 2: 'virginica'})
iris['target'] = species
4) basic information
In [223]:
iris.sample(6)
Out[223]:
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | target |
|---|---|---|---|---|---|
| 64 | 5.6 | 2.9 | 3.6 | 1.3 | versicolor |
| 32 | 5.2 | 4.1 | 1.5 | 0.1 | setosa |
| 144 | 6.7 | 3.3 | 5.7 | 2.5 | virginica |
| 86 | 6.7 | 3.1 | 4.7 | 1.5 | versicolor |
| 69 | 5.6 | 2.5 | 3.9 | 1.1 | versicolor |
| 46 | 5.1 | 3.8 | 1.6 | 0.2 | setosa |
In [224]:
iris.shape
Out[224]:
(150, 5)
In [225]:
iris.describe().T
Out[225]:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| sepal length (cm) | 150.0 | 5.843333 | 0.828066 | 4.3 | 5.1 | 5.80 | 6.4 | 7.9 |
| sepal width (cm) | 150.0 | 3.057333 | 0.435866 | 2.0 | 2.8 | 3.00 | 3.3 | 4.4 |
| petal length (cm) | 150.0 | 3.758000 | 1.765298 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 |
| petal width (cm) | 150.0 | 1.199333 | 0.762238 | 0.1 | 0.3 | 1.30 | 1.8 | 2.5 |
5) plot and inspect
In [226]:
parallel_coordinates(iris, 'target', color=('#556270', '#4ECDC4', '#C7F464'), alpha=0.5)
plt.title('Parallel Coordinates Graph of Iris', fontsize='xx-large')
plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), ncol=3, fancybox=True, shadow=True)
plt.show()
The plot shows that sepal length, petal length, and petal width each separate the three targets fairly well, so discriminant analysis is worth considering. Moreover, on each of these three axes the green virginica samples sit at the top, the blue versicolor samples in the middle, and the black setosa samples at the bottom; this shared trend across variables suggests principal component analysis.
2. PCA
1) compute the covariance matrix
In [227]:
iris_covmat0 = np.cov(iris.iloc[:, 0:4].T)
iris_covmat = pd.DataFrame(data=iris_covmat0, columns=iris_data['feature_names'])
iris_covmat.index = iris_data['feature_names']
iris_covmat
Out[227]:
| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| sepal length (cm) | 0.685694 | -0.042434 | 1.274315 | 0.516271 |
| sepal width (cm) | -0.042434 | 0.189979 | -0.329656 | -0.121639 |
| petal length (cm) | 1.274315 | -0.329656 | 3.116278 | 1.295609 |
| petal width (cm) | 0.516271 | -0.121639 | 1.295609 | 0.581006 |
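As a quick cross-check, pandas computes the same sample covariance matrix directly (both pandas' cov and np.cov default to the n - 1 denominator):

iris.iloc[:, 0:4].cov()  # should agree with iris_covmat above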
2) compute eigenvalues and eigenvectors
In [228]:
eigenvalues, eigenvectors = np.linalg.eig(iris_covmat)
eigenvalues
Out[228]:
array([4.22824171, 0.24267075, 0.0782095 , 0.02383509])
Here the eigenvalues come out already sorted in descending order, although np.linalg.eig does not guarantee this; a sketch of explicit sorting follows after the eigenvectors below.
In [229]:
eigenvectors
Out[229]:
array([[ 0.36138659, -0.65658877, -0.58202985,  0.31548719],
       [-0.08452251, -0.73016143,  0.59791083, -0.3197231 ],
       [ 0.85667061,  0.17337266,  0.07623608, -0.47983899],
       [ 0.3582892 ,  0.07548102,  0.54583143,  0.75365743]])
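Since np.linalg.eig makes no ordering promise, a minimal sketch of sorting explicitly before taking leading components (a no-op here, since the order already happens to be descending):

order = np.argsort(eigenvalues)[::-1]  # eigenvalue indices, largest first
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]  # reorder columns to keep value/vector pairs matched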
3) choose the number of principal components
In [230]:
var_percent = eigenvalues / np.sum(eigenvalues)
np.cumsum(var_percent)
Out[230]:
array([0.92461872, 0.97768521, 0.99478782, 1. ])
After the linear transformation, the variance of the samples in the subspace spanned by the first two eigenvectors accounts for 97.77% of the total variance. That share is high, so we keep 2 principal components.
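The same rule can be wrapped in a small helper (choose_k is a hypothetical name, not from the original post) that returns the smallest number of components reaching a target variance ratio:

def choose_k(eigenvalues, threshold=0.95):
    ratios = np.cumsum(eigenvalues) / np.sum(eigenvalues)
    return int(np.argmax(ratios >= threshold) + 1)  # first index meeting the threshold, as a 1-based count

choose_k(eigenvalues)  # -> 2 for the eigenvalues above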
4) linear transformation: coordinates after projecting onto the subspace
In [231]:
iris_pca = np.dot(iris.iloc[:, 0:4], eigenvectors[:, :2])  # project onto the first two eigenvectors
iris_pca[:5, :]  # first five rows
Out[231]:
array([[ 2.81823951, -5.64634982],
       [ 2.78822345, -5.14995135],
       [ 2.61337456, -5.18200315],
       [ 2.75702228, -5.0086536 ],
       [ 2.7736486 , -5.65370709]])
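One subtlety worth noting: textbook PCA projects the mean-centered data. np.cov centers internally, but the projection above uses the raw coordinates, which shifts every score by the same constant vector without changing the shape of the scatter. A sketch of the centered version:

X = iris.iloc[:, 0:4].values
iris_pca_centered = np.dot(X - X.mean(axis=0), eigenvectors[:, :2])  # center first, then project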
5) visualization
In [232]:
%matplotlib inline
iris_pca_df = pd.DataFrame(data=iris_pca, columns=['PC1', 'PC2'])  # convert to a DataFrame for plotting
iris_pca_df['target'] = species
sns.lmplot(x='PC1', y='PC2', data=iris_pca_df, hue='target', fit_reg=False, legend=True)
Out[232]:
<seaborn.axisgrid.FacetGrid at 0x2991adf97b8>
After the reduction to two dimensions, the points scatter fairly randomly across the plane, yet the three target classes are well separated, so the data are clearly well suited to discriminant analysis.
3. LDA
1) split into training and test sets
In [233]:
x_train, x_test, y_train, y_test = train_test_split(iris_data.data, iris_data.target,
                                                    test_size=0.3, random_state=100)
x0_train = x_train[y_train == 0]
x1_train = x_train[y_train == 1]
x2_train = x_train[y_train == 2]
2) find within-class scatter matrix
In [234]:
n0 = np.shape(x0_train)[0]  # number of training samples with target 0
n1 = np.shape(x1_train)[0]
n2 = np.shape(x2_train)[0]
n = np.shape(x_train)[0]
Sw0 = (n0 - 1) * np.cov(np.transpose(x0_train))  # within-class scatter = (n_k - 1) times the class covariance
Sw1 = (n1 - 1) * np.cov(np.transpose(x1_train))
Sw2 = (n2 - 1) * np.cov(np.transpose(x2_train))
Sw = (Sw0 + Sw1 + Sw2) / n
Sw
Out[234]:
array([[0.27156253, 0.096888  , 0.16735612, 0.03506082],
       [0.096888  , 0.09970043, 0.06113117, 0.03120712],
       [0.16735612, 0.06113117, 0.18831667, 0.04396675],
       [0.03506082, 0.03120712, 0.04396675, 0.03860174]])
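The (n_k - 1) * cov shortcut works because the within-class scatter of class k is the sum of outer products of deviations from the class mean; a minimal sketch verifying Sw0 against that definition (mu0_check and Sw0_check are hypothetical names):

mu0_check = x0_train.mean(axis=0)
Sw0_check = sum(np.outer(x - mu0_check, x - mu0_check) for x in x0_train)
np.allclose(Sw0, Sw0_check)  # True: (n0 - 1) * cov equals the summed outer products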
3) find between-class scatter matrix
In [235]:
mu0 = np.mean(x0_train, axis=0)  # class means
mu1 = np.mean(x1_train, axis=0)
mu2 = np.mean(x2_train, axis=0)
mu = np.mean(x_train, axis=0)    # overall mean
d0 = np.matrix(mu0 - mu)
d1 = np.matrix(mu1 - mu)
d2 = np.matrix(mu2 - mu)
Sb = (n0 * np.dot(np.transpose(d0), d0) +
      n1 * np.dot(np.transpose(d1), d1) +
      n2 * np.dot(np.transpose(d2), d2)) / n
Sb
Out[235]:
matrix([[ 0.37662795, -0.12517371,  1.01197721,  0.43036775],
        [-0.12517371,  0.07380887, -0.36319466, -0.14548195],
        [ 1.01197721, -0.36319466,  2.7415246 ,  1.15841421],
        [ 0.43036775, -0.14548195,  1.15841421,  0.49196153]])
4) find eigenvalues and eigenvectors
In [236]:
eig_matrix = np.dot(np.linalg.inv(Sw), Sb)
eig_values, eig_vectors = np.linalg.eig(eig_matrix)
eig_values
Out[236]:
array([3.12987432e+01, 2.92769629e-01, 1.31100176e-17, 1.10499984e-14])
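Two of the four eigenvalues are numerically zero, which is expected: Sb is built from three weighted class-mean deviations that sum to zero, so its rank is at most C - 1 = 2, and Sw^{-1}Sb inherits that rank. A quick check:

np.linalg.matrix_rank(Sb)  # -> 2, so at most two informative discriminant directions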
5) coordinates after projecting onto the subspace
In [237]:
x_train_lda = np.dot(x_train, eig_vectors[:, :2])  # project onto the first two eigenvectors
x_train_lda[:5, :]  # first five rows
Out[237]:
matrix([[-1.30026736,  1.98915717],
        [-1.12281316,  1.53624681],
        [-1.37932661,  2.19577832],
        [ 0.78989652,  1.48524196],
        [-1.28157792,  1.57919854]])
6) visualization
A. scatter plot of the training set projected to 2D
In [238]:
%matplotlib inline
x_train_lda_df = pd.DataFrame(data=x_train_lda, columns=['LDA1', 'LDA2'])
x_train_lda_df['target'] = y_train
sns.lmplot(x='LDA1', y='LDA2', data=x_train_lda_df, hue='target', fit_reg=False, legend=True)
Out[238]:
<seaborn.axisgrid.FacetGrid at 0x299197e2ef0>
The three target classes separate very cleanly along the LDA1 axis, with class centers far apart from one another, so the discrimination is excellent.
B. scatter plot of the test set projected to 2D
In [239]:
x_test_lda = np.dot(x_test, eig_vectors[:, :2])
%matplotlib inline
x_test_lda_df = pd.DataFrame(data=x_test_lda, columns=['LDA1', 'LDA2'])
x_test_lda_df['target'] = y_test
sns.lmplot(x='LDA1', y='LDA2', data=x_test_lda_df, hue='target', fit_reg=False, legend=True)
Out[239]:
<seaborn.axisgrid.FacetGrid at 0x2991aff6748>
The test set also separates well after projection. Actually, I had originally planned to use LDA as a classifier and predict the test-set classes with the projection matrix, but I did not know how to make the prediction, nor did I work out anything like estimated subspaces for targets 0, 1, and 2.
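For completeness, one simple way to turn the projection into a classifier is a nearest-centroid rule in LDA space (a hedged sketch, not the full Gaussian LDA decision rule and not part of the original analysis): project the class means and assign each test point to the closest projected centroid.

W = np.asarray(eig_vectors[:, :2])                 # projection matrix as a plain array
centroids = np.dot(np.vstack([mu0, mu1, mu2]), W)  # class means in LDA space, shape (3, 2)
test_proj = np.asarray(x_test_lda)                 # test points in LDA space, shape (n_test, 2)
dists = np.linalg.norm(test_proj[:, None, :] - centroids[None, :, :], axis=2)  # (n_test, 3)
y_pred = np.argmin(dists, axis=1)                  # nearest centroid index = predicted target
np.mean(y_pred == y_test)                          # test-set accuracy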
7) comparing PCA and LDA
Both PCA and LDA reduce the original four-dimensional data to two dimensions. The differences:
a. When the LDA scatter plot is projected onto the LDA1 axis, the three classes barely overlap; when the PCA scatter plot is projected onto the PC1 axis, the orange versicolor and green virginica points overlap considerably.
b. PCA's 2-dimensional subspace was chosen by the proportion of variance retained after projection, and 3 or 4 dimensions would also be valid choices. LDA's subspace is 2-dimensional because its dimension cannot exceed the number of classes minus one (3 - 1 = 2); a 1-dimensional subspace is also possible, and judging from the scatter plot, the LDA1 axis alone already discriminates the samples extremely well.