"100 Days of ML Code" Study Notes — Day 3: Multiple Linear Regression
100-Days-Of-ML-Code
Chinese edition of "100 Days of ML Code"
GitHub: https://github.com/MLEveryday/100-Days-Of-ML-Code
Step 1: Data preprocessing
(1) Import the libraries:
# Importing the libraries
import pandas as pd
import numpy as np
(2) Import the dataset
# Importing the dataset
dataset = pd.read_csv('D:/PycharmProjects/DataSet/50_Startups.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 4].values
A sample of the data is shown in the figure below: the first four columns are features, and the fifth column is the output (the variable to be predicted).
(3) Encode the categorical data
# Encoding Categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
# OneHotEncoder's categorical_features parameter was removed in recent
# scikit-learn releases; use ColumnTransformer to one-hot encode column 3 instead
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough')
X = ct.fit_transform(X)
Here LabelEncoder turns the strings into 0, 1, 2 (corresponding to California, Florida, New York), and OneHotEncoder then encodes these as 100, 010, 001.
LabelEncoder can be thought of as a labeling machine: it encodes categorical feature values, i.e., maps discrete numbers or text to integer labels.
OneHotEncoder: turns a categorical feature with m possible values into m binary features; for each sample, exactly one of these m features is 1 and the rest are 0.
Reference: https://blog.****.net/quintind/article/details/79850455
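A minimal, self-contained sketch (using hypothetical toy values for the State column) of what the two encoders produce:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy values mimicking the State column
states = np.array(['New York', 'California', 'Florida', 'New York'])

# LabelEncoder maps each string to an integer (categories sorted
# alphabetically: California=0, Florida=1, New York=2)
labels = LabelEncoder().fit_transform(states)
print(labels)  # [2 0 1 2]

# OneHotEncoder expands each category into its own binary column
onehot = OneHotEncoder().fit_transform(states.reshape(-1, 1)).toarray()
print(onehot)  # rows: [0,0,1], [1,0,0], [0,1,0], [0,0,1]
```

Note that LabelEncoder assigns integers in alphabetical order of the category names, not in order of appearance.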
(4) Avoid the dummy variable trap
# Avoiding Dummy Variable Trap
X = X[: , 1:]
There is a so-called dummy variable trap. State has only 3 possible values, so in theory two binary columns are enough, yet one-hot encoding uses three (100, 010, 001). If the first column is dropped, the remaining codes 00, 10, 01 still distinguish the three states. Keeping all m dummy columns also means they sum to 1 for every row, which is perfectly collinear with the regression's intercept term; that collinearity is the trap. So we avoid it here by removing the first column.
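Instead of slicing the array by hand, newer scikit-learn versions (0.21+) let OneHotEncoder drop the first category directly; a small sketch:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

states = np.array(['California', 'Florida', 'New York']).reshape(-1, 1)

# drop='first' keeps only m-1 columns, so the dummy trap never arises
encoded = OneHotEncoder(drop='first').fit_transform(states).toarray()
print(encoded)
# California -> [0, 0], Florida -> [1, 0], New York -> [0, 1]
```

This produces the same m-1 columns as the manual `X = X[:, 1:]` slice.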
(5) Split the dataset into a training set and a test set
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
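With test_size = 0.2, 20% of the rows go to the test set; for the 50-row dataset that is a 40/10 split. A quick sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 50-row dataset (random values, hypothetical)
X_demo = np.random.rand(50, 5)
Y_demo = np.random.rand(50)

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=0)
print(X_tr.shape, X_te.shape)  # (40, 5) (10, 5)
```

Fixing random_state makes the split reproducible across runs.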
Step 2: Fit the multiple linear regression model on the training set
# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
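After fitting, the learned intercept b0 and coefficients b1…bn of y = b0 + b1·x1 + … + bn·xn are stored in regressor.intercept_ and regressor.coef_. A sketch on synthetic, noise-free data with known coefficients:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known linear rule: y = 3 + 2*x1 - x2
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 2)
y_demo = 3 + 2 * X_demo[:, 0] - X_demo[:, 1]

model = LinearRegression().fit(X_demo, y_demo)
print(model.intercept_)  # ~3.0
print(model.coef_)       # ~[2.0, -1.0]
```

Since the data are noise-free, the fit recovers the true coefficients up to floating-point precision.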
Step 3: Predict the results on the test set
# Predicting the Test set results
y_pred = regressor.predict(X_test)
print(X_test)
print(y_pred)
Step 4: Evaluate the regression model with r2_score
# regression evaluation
from sklearn.metrics import r2_score
print(r2_score(Y_test,y_pred))
The result is as follows:
The closer r2 is to 1, the better the model; the closer to 0, the worse (r2 can even be negative when the model does worse than simply predicting the mean).
For an introduction to r2_score, see: https://blog.****.net/qq_41929011/article/details/88877009
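r2 is defined as 1 − SS_res/SS_tot, where SS_res is the residual sum of squares and SS_tot is the total sum of squares around the mean. A small sketch (with made-up numbers) verifying the formula against sklearn's r2_score:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true and predicted values
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.3, 8.9])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1 - ss_res / ss_tot
print(r2_manual)                 # ≈ 0.9925
print(r2_score(y_true, y_pred))  # same value
```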
The complete code and the learning infographic are as follows:
# Importing the libraries
import pandas as pd
import numpy as np
# Importing the dataset
dataset = pd.read_csv('D:/PycharmProjects/DataSet/50_Startups.csv')
X = dataset.iloc[:, :-1].values
Y = dataset.iloc[:, 4].values
# Encoding Categorical data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
# OneHotEncoder's categorical_features parameter was removed in recent
# scikit-learn releases; use ColumnTransformer to one-hot encode column 3 instead
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough')
X = ct.fit_transform(X)
# Avoiding Dummy Variable Trap
X = X[: , 1:]
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
# Predicting the Test set results
y_pred = regressor.predict(X_test)
print(X_test)
print(y_pred)
# regression evaluation
from sklearn.metrics import r2_score
print(r2_score(Y_test,y_pred))