Multiple Linear Regression
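I trained a multiple linear regression model with sklearn. The data is read with pandas from a CSV file, and the script computes the root mean squared error (RMSE) on the test set and plots the test values against the predictions. The dataset, Advertising.csv, can be downloaded from my resources.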
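1. The library for splitting the data (train_test_split lives in model_selection, next to the cross-validation tools)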
from sklearn.model_selection import train_test_split
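As a rough sketch of what train_test_split does, the snippet below splits a made-up 10-row array rather than the Advertising data. The toy arrays are assumptions for illustration; test_size=0.4 and random_state=0 mirror the values used in the script further down.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with 3 features each
X = np.arange(30).reshape(10, 3)
y = np.arange(10)

# Hold out 40% of the rows for testing; random_state makes the shuffle repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

# 6 rows go to training and 4 to testing; X and y rows stay paired
print(X_train.shape, X_test.shape)
```

Note that this is a single hold-out split, not k-fold cross-validation.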
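2. The two main pandas data structures, Series and DataFrame:
- A Series is similar to a one-dimensional array. It consists of a set of values together with a set of associated labels (the index).
- A DataFrame is a tabular data structure. It holds an ordered set of columns, and each column can have a different value type. A DataFrame has both a row index and a column index, and it can be thought of as a dictionary of Series.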
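The two structures described above can be sketched as follows. The values and the 'channel' column are made up for illustration and are not taken from Advertising.csv.

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
s = pd.Series([230.1, 44.5, 17.2], index=['a', 'b', 'c'])
print(s['b'])  # look a value up by its label

# A DataFrame: each column can hold a different value type
df = pd.DataFrame({
    'TV': [230.1, 44.5, 17.2],
    'channel': ['x', 'y', 'z'],
})

# Selecting one column gives a Series; a list of columns gives a DataFrame
col = df['TV']
sub = df[['TV', 'channel']]
print(type(col).__name__, type(sub).__name__)
```

This is why the script below writes data[['TV', 'radio', 'newspaper']] with double brackets for the features (a DataFrame) and data['sales'] with single brackets for the target (a Series).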
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split  # train/test splitting lives in model_selection
import matplotlib.pyplot as plt


def data_deal():
    # Raw string so the backslashes in the Windows path are not read as escapes
    data = pd.read_csv(r'I:\python Machine learning\multiple linear Regression/Advertising.csv')
    X = data[['TV', 'radio', 'newspaper']]  # features
    y = data['sales']                       # target
    # Hold out 40% of the rows as the test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
    return X_train, X_test, y_train, y_test


def cal_rmse(y_pred, y_test):
    # Root mean squared error (RMSE) on the test set, used to judge the fit
    y_test = np.array(y_test)
    total = 0  # named total to avoid shadowing the built-in sum()
    for i in range(len(y_test)):
        total += (y_test[i] - y_pred[i]) ** 2
    rmse = np.sqrt(total / len(y_test))
    print('Rmse is :', rmse)
    return rmse


def plot_show(y_pred, y_test):
    plt.plot(range(len(y_test)), y_test, color='r', label='test')
    plt.plot(range(len(y_pred)), y_pred, color='blue', label='predict')
    plt.legend(loc="upper right")  # show the line labels
    plt.xlabel("sample index")
    plt.ylabel('value of sales')
    plt.show()


def main():
    X_train, X_test, y_train, y_test = data_deal()
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    cal_rmse(y_pred, y_test)
    plot_show(y_pred, y_test)


if __name__ == '__main__':
    main()
Output: Rmse is : 1.71957177333
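The reported number is a root mean squared error, sqrt(mean((y_test - y_pred)^2)). As a small sanity check of that formula, here is a vectorized sketch on made-up numbers. The helper name rmse and the toy arrays are assumptions for illustration, not part of the script above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt of the mean squared difference."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy values (hypothetical): errors are 1, 0, -1, -2
y_true = [3, 5, 7, 9]
y_pred = [2, 5, 8, 11]

# Squared errors 1, 0, 1, 4 have mean 1.5, so the RMSE is sqrt(1.5)
value = rmse(y_true, y_pred)
print(value)
```

Because of the square root, the RMSE is in the same units as sales, which makes it easier to read than the plain mean squared error.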
The resulting plot is shown below: