Python 3 notes on Machine Learning in Action (《机器学习实战》): k-Nearest Neighbors

2.1 Implementing the kNN algorithm

The book's code is written for Python 2; below it is ported to Python 3.

import numpy as np
#operator module, used when sorting the vote counts
import operator
def createDataSet():
    group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

#k-nearest neighbors classifier
def classify0(inX, dataSet, labels, k):

    #number of rows in the training set
    dataSetSize = dataSet.shape[0]

    #tile repeats inX dataSetSize times (as rows) so it can be
    #subtracted element-wise; then compute the Euclidean distance
    #from inX to every training point
    diffMat = np.tile(inX, (dataSetSize, 1)) - dataSet
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)
    distances = sqDistances ** 0.5

    #argsort returns the indices that sort the distances in ascending order
    sortedDistIndices = distances.argsort()
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndices[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1
    #sort the vote counts in descending order (once, after the loop)
    sortedClassCount = sorted(classCount.items(),
                              key=operator.itemgetter(1), reverse=True)
    #return the label with the most votes among the k nearest neighbors
    return sortedClassCount[0][0]

The test results are as follows.
Input:

import kNN
group, labels = kNN.createDataSet()
print(kNN.classify0([0, 0], group, labels, 3))

Output:

B
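
For comparison, here is a minimal alternative sketch (not from the book) that produces the same result using NumPy broadcasting and collections.Counter instead of tile and operator; classify0_counter is a name made up here:

import numpy as np
from collections import Counter

def classify0_counter(inX, dataSet, labels, k):
    #broadcasting subtracts inX from every row, so no tile is needed
    distances = np.sqrt(((dataSet - inX) ** 2).sum(axis=1))
    #labels of the k nearest training points
    kNearest = [labels[i] for i in distances.argsort()[:k]]
    #majority vote; most_common(1) returns [(label, count)]
    return Counter(kNearest).most_common(1)[0][0]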

2.2 Using the k-nearest neighbors algorithm to improve matches on a dating site

Download the book's supporting material from https://github.com/frankstar007/kNN.
The dataset lives in 2.2data and can be downloaded from there (note that the file is named datingTestSet2; the book omits the 2).

2.2.1 Add the following code to kNN.py:

def file2matrix(filename):
    #open the file
    fr = open(filename)
    #readlines() reads every line (up to EOF) and returns them as a list,
    #which can then be processed with a for ... in ... loop
    arrayOLines = fr.readlines()
    #number of lines in the file
    numberOfLines = len(arrayOLines)
    #preallocate a numberOfLines x 3 matrix of zeros
    returnMat = np.zeros((numberOfLines, 3))
    #empty list for the class labels
    classLabelVector = []
    index = 0
    for line in arrayOLines:
        #strip the trailing newline characters
        line = line.strip()
        #split the line on the tab character \t into a list of fields
        listFromLine = line.split('\t')
        #store the first three fields in the feature matrix
        returnMat[index, :] = listFromLine[0:3]
        #store the last field of each line as an integer class label
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector

Test it in test.py:

import kNN
datingDataMat,datingLabels = kNN.file2matrix('datingTestSet2.txt')
print(datingDataMat)
print(datingLabels[0:20])
[[  4.09200000e+04   8.32697600e+00   9.53952000e-01]
 [  1.44880000e+04   7.15346900e+00   1.67390400e+00]
 [  2.60520000e+04   1.44187100e+00   8.05124000e-01]
 ..., 
 [  2.65750000e+04   1.06501020e+01   8.66627000e-01]
 [  4.81110000e+04   9.13452800e+00   7.28045000e-01]
 [  4.37570000e+04   7.88260100e+00   1.33244600e+00]]
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
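
Since every field in datingTestSet2.txt is numeric, the same parsing can also be done with np.loadtxt; a hedged alternative sketch (file2matrix_loadtxt is a made-up name):

import numpy as np

def file2matrix_loadtxt(filename):
    #load all four tab-separated numeric columns at once
    data = np.loadtxt(filename, delimiter='\t')
    returnMat = data[:, 0:3]  #the first three columns are the features
    classLabelVector = data[:, -1].astype(int).tolist()  #the last column holds the labels
    return returnMat, classLabelVector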

2.2.2 Analyzing the data: creating scatter plots with Matplotlib

import matplotlib.pyplot as plt
import kNN

datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])
plt.show()

[Figure: scatter plot of the second and third features]

The dataset has three features: the percentage of time spent playing video games, liters of ice cream consumed per week, and frequent flyer miles earned per year.
Each pair of features is plotted below, with color added to distinguish the classes:

import matplotlib.pyplot as plt
import numpy as np
import kNN

datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')

fig = plt.figure()
ax = fig.add_subplot(221)
ax2 = fig.add_subplot(222)
ax3 = fig.add_subplot(223)
ax4 = fig.add_subplot(224)

#scale marker size and color by the class label so the
#three classes are easier to tell apart
sizes = 15.0 * np.array(datingLabels)
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])
ax2.scatter(datingDataMat[:, 1], datingDataMat[:, 2], sizes, sizes)
ax3.scatter(datingDataMat[:, 0], datingDataMat[:, 2], sizes, sizes)
ax4.scatter(datingDataMat[:, 0], datingDataMat[:, 1], sizes, sizes)
plt.show()

[Figure: 2x2 grid of scatter plots, colored by class label]

2.2.3 Preparing the data: normalizing values

Looking at the values in the table and the Euclidean distance between two points, the attribute with the largest numeric spread dominates the result: when features sit on very different scales, differences in one feature swamp the overall distance. Normalization rescales features with different ranges onto a common one, typically 0 to 1 or -1 to 1. The following formula maps a feature of any range into the interval 0 to 1:

                        newValue = (oldValue-min) / (max-min)

Here newValue and oldValue denote a single entry of one column; in the code below the formula is applied to entire columns at once, so newValue ends up being array-valued.
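
A quick sanity check of the formula on a made-up column:

import numpy as np

col = np.array([0.0, 20.0, 50.0, 100.0])  #one feature column
newValue = (col - col.min()) / (col.max() - col.min())
print(newValue)  #[0.   0.2  0.5  1.  ]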

def autoNorm(dataSet): #input: the feature matrix
    minVals = dataSet.min(0)    #column-wise minima (a length-3 array)
    maxVals = dataSet.max(0)    #column-wise maxima (a length-3 array)
    ranges = maxVals - minVals  #per-column value ranges
    m = dataSet.shape[0]        #number of rows
    #tile the minima/ranges to the full matrix shape, then normalize
    normDataSet = dataSet - np.tile(minVals, (m, 1))
    normDataSet = normDataSet / np.tile(ranges, (m, 1)) #element-wise division
    return normDataSet, ranges, minVals #normalized matrix, ranges, minima

Test:

import kNN

datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
normMat, ranges, minVals = kNN.autoNorm(datingDataMat)
print(normMat, '\n', ranges, '\n', minVals)

Output:

[[ 0.44832535  0.39805139  0.56233353]
 [ 0.15873259  0.34195467  0.98724416]
 [ 0.28542943  0.06892523  0.47449629]
 ..., 
 [ 0.29115949  0.50910294  0.51079493]
 [ 0.52711097  0.43665451  0.4290048 ]
 [ 0.47940793  0.3768091   0.78571804]] #normalized matrix
 [  9.12730000e+04   2.09193490e+01   1.69436100e+00] #ranges: max - min
 [ 0.        0.        0.001156] #minima
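
As a side note, NumPy broadcasting makes the tile calls unnecessary; an equivalent sketch (autoNorm_broadcast is a made-up name):

def autoNorm_broadcast(dataSet):
    minVals = dataSet.min(0)
    ranges = dataSet.max(0) - minVals
    #broadcasting applies the per-column subtraction and division directly
    return (dataSet - minVals) / ranges, ranges, minVals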

2.2.4 Testing the algorithm: validating the classifier as a complete program

def datingClassTest():
    hoRatio = 0.10 #hold out 10% of the samples as test data
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt') #features and labels
    normMat, ranges, minVals = autoNorm(datingDataMat) #normalize, keeping the ranges and minima
    m = normMat.shape[0] #number of samples
    numTestVecs = int(m * hoRatio) #number of test samples
    errorCount = 0.0 #error counter
    for i in range(numTestVecs): #classify each test sample against the remaining 90%
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3) #kNN with k = 3
        print("The classifier came back with : %d , the real answer is : %d"
              % (classifierResult, datingLabels[i]))
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is : %f" % (errorCount / float(numTestVecs))) #report the error rate

Test code:

import kNN

kNN.datingClassTest()

Output:

The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 3 , the real answer is : 3
The classifier came back with : 1 , the real answer is : 1
The classifier came back with : 2 , the real answer is : 2
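
The book fixes k = 3; as an experiment sketch (not in the book), the same hold-out loop can be rerun for several k values to see how the error rate moves:

import kNN

datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
normMat, ranges, minVals = kNN.autoNorm(datingDataMat)
m = normMat.shape[0]
numTestVecs = int(m * 0.10)
for k in (1, 3, 5, 7):
    errors = sum(kNN.classify0(normMat[i, :], normMat[numTestVecs:m, :],
                               datingLabels[numTestVecs:m], k) != datingLabels[i]
                 for i in range(numTestVecs))
    print("k = %d , error rate : %f" % (k, errors / float(numTestVecs)))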

Final step: a dating-site prediction function

This last piece wires the classifier into a small interactive program that reads a person's feature values and prints a prediction.

def classfyPerson():
    resultList = ['not at all', 'in small doses', 'in large doses'] #possible verdicts
    #read the three feature values (input() replaces Python 2's raw_input())
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt') #training set
    normMat, ranges, minVals = autoNorm(datingDataMat) #normalize the training set
    inArr = np.array([ffMiles, percentTats, iceCream]) #assemble the input vector
    #classify0 takes: the input vector inX, the training set dataSet,
    #the label vector labels, and the number of neighbors k;
    #the input is normalized with the training set's minima and ranges
    classifierResult = classify0((inArr - minVals) / ranges, normMat, datingLabels, 3)
    #labels run from 1 to 3, so subtract 1 to index into resultList
    print("You will probably like this person : ", resultList[classifierResult - 1])

Output:

You will probably like this person :  in small doses

Thanks to a senior classmate's blog for the help.
Referenced blog: https://blog.csdn.net/qq_33638791/article/details/53163659
Based on Machine Learning in Action (《机器学习实战》).