Linear Regression
The Normal Equation Method
We usually solve linear regression with gradient descent, which needs many iterations to reach the optimum; the normal equation method gets there in a single step. Suppose we have the cost function $J(\theta) = a\theta^{2} + b\theta + c$. To find the $\theta$ that minimizes the cost, take the derivative of $J$ with respect to $\theta$; $J$ attains its minimum where that derivative equals zero.
The normal equation method simply differentiates the expression above and solves for the minimizer in closed form:

$$\theta = (X^{\top}X)^{-1}X^{\top}Y$$
Derivation
As the equation above shows, $X$ is an $m \times (n+1)$ matrix ($m$ samples, $n$ features plus a column of ones for the bias) and $Y$ is an $m \times 1$ column vector.
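In equation form, the derivation is short (a sketch assuming $X^{\top}X$ is invertible): write the cost in matrix form,

$$J(\theta) = (X\theta - Y)^{\top}(X\theta - Y),$$

then take the gradient and set it to zero:

$$\nabla_{\theta} J = 2X^{\top}X\theta - 2X^{\top}Y = 0 \;\Rightarrow\; X^{\top}X\theta = X^{\top}Y \;\Rightarrow\; \theta = (X^{\top}X)^{-1}X^{\top}Y.$$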
Example
The code is as follows:
# Linear regression (normal equation method)
import numpy as np
import matplotlib.pyplot as plt

def loadDataSet(filename):
    # Each line of ex0.txt: bias term (1.0), feature x, label y
    dataSet = []
    Labels = []
    for line in open(filename).readlines():
        fields = line.split()
        dataSet.append([float(fields[0]), float(fields[1])])
        Labels.append(float(fields[-1]))
    return dataSet, Labels
def standRegres(xArr, yArr):
    xMat = np.mat(xArr)
    yMat = np.mat(yArr).T  # convert labels to an m x 1 column vector first
    temp = xMat.T * xMat
    if np.linalg.det(temp) == 0.0:  # a zero determinant means X^T X is not invertible
        print("Determinant is 0; cannot compute the inverse")
        return
    else:
        w = temp.I * xMat.T * yMat  # the normal equation: w = (X^T X)^{-1} X^T Y
        return w
def draw():
    dataSet, Labels = loadDataSet("./Data/ex0.txt")
    plt.scatter([x[1] for x in dataSet], Labels)
    # matrix -> list so the fitted line can be plotted
    plt.plot([x[1] for x in dataSet], (np.mat(dataSet) * standRegres(dataSet, Labels)).T.tolist()[0])
    plt.show()
dataSet, Labels = loadDataSet("./Data/ex0.txt")
print(standRegres(dataSet, Labels))  # the computed w
draw()
print(np.corrcoef(np.mat(Labels), (np.mat(dataSet) * standRegres(dataSet, Labels)).T))  # correlation between y and the predictions
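As a quick sanity check (not part of the original listing), the same coefficients can be recovered with NumPy's built-in least-squares solver; this is a minimal sketch assuming ex0.txt sits at the path used above:

# Hypothetical check: compare the normal equation against np.linalg.lstsq
import numpy as np

X = np.array(dataSet)  # reuses dataSet/Labels loaded above
y = np.array(Labels)
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # minimizes ||Xw - y||^2
print(w_lstsq)  # should match standRegres up to floating-point error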
The fitted result looks like this:
Locally Weighted Regression
The problem with plain linear regression is that it tends to underfit. A common fix is locally weighted linear regression (LWLR): for each point we want to predict, we assign a weight to every sample point near it.
One look at the plot above makes it clear that the straight-line fit is strained and unsatisfactory. Borrowing the idea behind calculus, we can break the data into small local pieces and fit each piece, which greatly improves the fit.
Accordingly, the regression coefficients take the form

$$\hat{\theta} = (X^{\top}WX)^{-1}X^{\top}WY$$

where $W$ is a diagonal matrix of weights, one per sample, given by a Gaussian kernel:

$$W(i,i) = \exp\left(-\frac{\lVert x^{(i)} - x \rVert^{2}}{2k^{2}}\right)$$

Clearly, the larger $k$ is, the more slowly the weights fall off with distance, and vice versa. In other words, with a large $k$ the prediction at a point is only weakly dominated by its nearby neighbors, which can lead to underfitting; with a small $k$ the nearby points dominate completely, which can lead to overfitting.
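To get a feel for how $k$ controls the weights, here is a tiny sketch (the kernel is the same one used in the code below; the distances and $k$ values are purely illustrative):

# Illustrative only: Gaussian kernel weight as a function of distance, for several k
import numpy as np

dists = np.array([0.0, 0.05, 0.1, 0.5])  # distances from the query point
for k in (1.0, 0.1, 0.01):
    weights = np.exp(-dists ** 2 / (2 * k ** 2))
    print(f"k={k}: {np.round(weights, 4)}")
# k=1.0 keeps even the farthest point at a large weight (risk of underfitting);
# k=0.01 zeroes out everything but the closest point (risk of overfitting).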
Code
# Locally weighted linear regression
import numpy as np
import matplotlib.pyplot as plt

def loadDataSet(filename):
    # Each line of ex0.txt: bias term (1.0), feature x, label y
    dataSet = []
    Labels = []
    for line in open(filename).readlines():
        fields = line.split()
        dataSet.append([float(fields[0]), float(fields[1])])
        Labels.append(float(fields[-1]))
    return dataSet, Labels
# Compute the prediction for a single query point
def lwlr(testpoint, dataSet, Labels, k):
    xMat = np.mat(dataSet)
    yMat = np.mat(Labels)
    m = np.shape(xMat)[0]
    weights = np.mat(np.eye(m))  # diagonal weight matrix, one entry per sample
    for j in range(m):
        diffMat = testpoint - xMat[j, :]
        # Note: a single element of an np.mat must be indexed with one bracket pair:
        #   a = np.array([[1, 2], [2, 3]]); print(a[0][1])  # 2
        #   a = np.mat([[1, 2], [2, 3]]);   print(a[0, 1])  # 2
        weights[j, j] = np.exp(diffMat * diffMat.T / (-2 * (k ** 2)))  # Gaussian kernel
    xTx = xMat.T * weights * xMat
    if np.linalg.det(xTx) == 0.0:
        print("Determinant is 0; cannot compute the inverse")
        return
    ws = xTx.I * xMat.T * weights * yMat.T  # w = (X^T W X)^{-1} X^T W Y
    return testpoint * ws
# Compute predictions for every query point
def lwlrTest(testArr, xArr, yArr, k=0.01):
    m = np.shape(testArr)[0]
    yHat = np.zeros(m)
    for i in range(m):
        yHat[i] = float(lwlr(testArr[i], xArr, yArr, k))  # 1x1 matrix -> scalar
    return yHat
def draw():
    dataSet, Labels = loadDataSet("./Data/ex0.txt")
    plt.scatter([x[1] for x in dataSet], Labels)
    # The result is a curve, so sort by the x coordinate before plotting
    xIndex = np.array([x[1] for x in dataSet]).argsort()  # indices that sort x
    xSort = np.array([x[1] for x in dataSet])[xIndex]
    ySort = lwlrTest(dataSet, dataSet, Labels, 0.01)[xIndex]
    plt.plot(xSort, ySort, 'r')
    plt.show()
dataSet, Labels = loadDataSet("./Data/ex0.txt")
draw()
print(lwlrTest(dataSet, dataSet, Labels, 0.01))
# [ 3.12204471 3.73284336 4.69692033 4.25997574 4.67205815 3.89979584
#   ... (200 predicted values in total) ...
#   3.90132188 3.20415097]
Note that LWLR runs the full coefficient computation once for every query point, so each point gets its own regression coefficients. This makes it expensive (one m x m weight matrix and one matrix inversion per query), but the fit is much closer to the data.
The fit is drawn as the red line:
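To see the bias-variance trade-off directly, it helps to draw the curve for several values of k on the same axes. A minimal sketch reusing lwlrTest from above (the k values are illustrative, not tuned):

# Illustrative only: compare several smoothing parameters k
dataSet, Labels = loadDataSet("./Data/ex0.txt")
xIndex = np.array([x[1] for x in dataSet]).argsort()
xSort = np.array([x[1] for x in dataSet])[xIndex]
plt.scatter(xSort, np.array(Labels)[xIndex], s=10)
for k, color in ((1.0, 'g'), (0.01, 'r'), (0.003, 'b')):
    plt.plot(xSort, lwlrTest(dataSet, dataSet, Labels, k)[xIndex], color, label=f"k={k}")
plt.legend()
plt.show()
# k=1.0 collapses back toward the straight-line fit (underfitting),
# while a very small k chases the noise (overfitting).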
Thank you!!! A follow-up post will cover solving regression with gradient descent (stay tuned)...