"100 Days of ML Code" Study Notes: Day 3_Multiple_Linear_Regression

100-Days-Of-ML-Code
Chinese edition of "100 Days of ML Code"
GitHub: https://github.com/MLEveryday/100-Days-Of-ML-Code

Step 1: Data preprocessing

(1) Import the libraries:

# Importing the libraries
import pandas as pd
import numpy as np

(2) Import the dataset

# Importing the dataset
dataset = pd.read_csv('D:/PycharmProjects/DataSet/50_Startups.csv')
X = dataset.iloc[ : , :-1].values
Y = dataset.iloc[ : ,  4 ].values

A sample of the data is shown in the figure below. The first four columns are features and the fifth column is the output (the variable to be predicted).
[Figure: sample rows of the dataset]
(3) Encode the categorical data as numbers

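# Encoding Categorical data
# OneHotEncoder(categorical_features=...) was removed in scikit-learn 0.22;
# ColumnTransformer selects the column to encode instead.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
# sparse_threshold=0 forces a dense array; the encoded columns come first
columntransformer = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough', sparse_threshold=0)
X = columntransformer.fit_transform(X).astype(float)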

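Here LabelEncoder turns the strings into 0, 1, 2 (California, Florida, and New York respectively), and OneHotEncoder then encodes those as 100, 010, 001. Since scikit-learn 0.20, OneHotEncoder also accepts string categories directly, so the LabelEncoder step is optional.

LabelEncoder can be thought of as a labeling machine: it encodes the values of a categorical feature, that is, discrete numbers or text, as integers.
OneHotEncoder: converts the m possible values of each categorical feature into m binary features. For every sample, exactly one of these m features is 1 and all the others are 0.
Reference: https://blog.****.net/quintind/article/details/79850455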
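
To make the two encodings concrete, here is a minimal NumPy-only sketch of what they compute. It does not call the scikit-learn encoders, and the sample state values are hypothetical rather than rows from the dataset.

```python
import numpy as np

# Hypothetical sample of the State column (not rows from 50_Startups.csv)
states = np.array(["New York", "California", "Florida", "New York"])

# Label encoding: sorted unique categories, each value replaced by its index
categories, codes = np.unique(states, return_inverse=True)

# One-hot encoding: row i of the identity matrix stands for category i
one_hot = np.eye(len(categories), dtype=int)[codes]

print(categories)
print(codes)
print(one_hot)
```

Each row of one_hot contains a single 1, in the position of that row's category, which matches the 100, 010, 001 pattern described above.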

(4) Avoid the dummy variable trap

# Avoiding Dummy Variable Trap
X = X[: , 1:]

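This step avoids the so-called dummy variable trap. State takes only three values, yet one-hot encoding represents it with three columns (100, 010, 001). Those three columns always sum to 1, so any one of them is fully determined by the other two and is perfectly collinear with the intercept term of the regression. Dropping the first column leaves 00, 10, 01, which still distinguishes the three states, so the code removes the first column.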
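
The collinearity can be checked numerically. The sketch below uses a small hypothetical set of one-hot rows and compares the rank of the design matrix with and without the first dummy column; it is an illustration, not part of the original tutorial code.

```python
import numpy as np

# Hypothetical one-hot rows for California, Florida, New York, Florida
dummies = np.array([[1, 0, 0],
                    [0, 1, 0],
                    [0, 0, 1],
                    [0, 1, 0]], dtype=float)
intercept = np.ones((dummies.shape[0], 1))

# Intercept plus all three dummies: 4 columns, but only rank 3
full = np.hstack([intercept, dummies])
rank_full = np.linalg.matrix_rank(full)

# Intercept plus the last two dummies: 3 columns, rank 3
reduced = np.hstack([intercept, dummies[:, 1:]])
rank_reduced = np.linalg.matrix_rank(reduced)

print(full.shape[1], rank_full)
print(reduced.shape[1], rank_reduced)
```

When the rank is lower than the number of columns, the regression coefficients are not uniquely determined. After one dummy column is dropped, the rank equals the column count again.
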
(5) Split the dataset into a training set and a test set

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
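
Conceptually, the split shuffles the row indices and holds out a fraction of them. The following is a rough NumPy sketch of that idea, assuming the dataset has 50 rows (inferred from the file name only). It will not select the same rows as train_test_split with random_state = 0.

```python
import numpy as np

n_samples, test_size = 50, 0.2   # 50 rows is an assumption based on the file name
rng = np.random.default_rng(0)   # fixed seed, playing the role of random_state

# Shuffle the row indices, then cut off the first 20% as the test set
indices = rng.permutation(n_samples)
n_test = int(np.ceil(n_samples * test_size))
test_idx, train_idx = indices[:n_test], indices[n_test:]

print(len(train_idx), len(test_idx))
```

Under that assumption, test_size = 0.2 leaves 40 rows for training and 10 for testing, and no row appears in both sets.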

Step 2: Train the multiple linear regression model on the training set

# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)
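
LinearRegression fits by ordinary least squares. As a sketch of what that means, the block below solves the same kind of problem with np.linalg.lstsq on synthetic, noise-free data. The data and the names X_demo, y_demo, and beta are made up for illustration and are unrelated to 50_Startups.csv.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): y = 3 + 2*x1 - 1*x2, with no noise
X_demo = rng.normal(size=(20, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

# Prepend a column of ones so the first coefficient is the intercept
A = np.hstack([np.ones((X_demo.shape[0], 1)), X_demo])

# Least-squares solution of A @ beta ≈ y
beta, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
intercept, coefficients = beta[0], beta[1:]

# A prediction is the intercept plus the weighted sum of the features
y_hat = intercept + X_demo @ coefficients

print(np.round(beta, 6))
```

In the fitted scikit-learn model, the corresponding values are available as regressor.intercept_ and regressor.coef_.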

Step 3: Predict the test set results

# Predicting the Test set results
y_pred = regressor.predict(X_test)
print(X_test)
print(y_pred)

Step 4: Evaluate the regression model with r2_score

# regression evaluation
from sklearn.metrics import r2_score
print(r2_score(Y_test,y_pred))

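The result is as follows:
[Figure: console output with the R² score]
The better the model, the closer r2 is to 1.
A score of 0 means the model does no better than always predicting the mean of Y_test, and r2 can be negative for a model that does worse than that.
For an introduction to r2_score, see: https://blog.****.net/qq_41929011/article/details/88877009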
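
The score follows the standard definition R² = 1 - SS_res / SS_tot. Here is a small sketch that computes it by hand on hypothetical numbers, which are not results from this model.

```python
import numpy as np

# Hypothetical targets and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 9.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot

# Always predicting the mean gives exactly 0
r2_mean = 1.0 - np.sum((y_true - y_true.mean()) ** 2) / ss_tot

print(r2, r2_mean)
```

Here ss_res is 0.5 and ss_tot is 20, so r2 is 0.975, while the mean-only baseline scores 0.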

The complete code and the learning map are as follows:

# Importing the libraries
import pandas as pd
import numpy as np

# Importing the dataset
dataset = pd.read_csv('D:/PycharmProjects/DataSet/50_Startups.csv')
X = dataset.iloc[ : , :-1].values
Y = dataset.iloc[ : ,  4 ].values

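# Encoding Categorical data
# OneHotEncoder(categorical_features=...) was removed in scikit-learn 0.22;
# ColumnTransformer selects the column to encode instead.
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
# sparse_threshold=0 forces a dense array; the encoded columns come first
columntransformer = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough', sparse_threshold=0)
X = columntransformer.fit_transform(X).astype(float)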

# Avoiding Dummy Variable Trap
X = X[: , 1:]

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)

# Fitting Multiple Linear Regression to the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, Y_train)

# Predicting the Test set results
y_pred = regressor.predict(X_test)
print(X_test)
print(y_pred)

# regression evaluation
from sklearn.metrics import r2_score
print(r2_score(Y_test,y_pred))

[Figure: learning map for Day 3, Multiple Linear Regression]