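Some notes on class imbalance in datasets: resampling and class weights
How imbalanced data affects algorithms, and what to do about it.
First, load the dataset and split it into training and test sets. The data comes from the credit card fraud competition (dataset download).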
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split
data = pd.read_csv('../input/creditcard.csv')
# Separate data into X/y
y = data['Class'].values
X = data.drop(['Class', 'Time'], axis=1).values
num_neg = (y==0).sum()
num_pos = (y==1).sum()
# Scaling..
scaler = RobustScaler()
X = scaler.fit_transform(X)
# Split into train/test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
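For readers who haven't met RobustScaler: with its default settings it centers each feature on its median and divides by the interquartile range, so a few extreme values don't dominate the scale. Below is a minimal pure-Python sketch of that idea for a single feature. The helper name robust_scale and the sample values are made up for illustration, and the quartile interpolation is assumed to match scikit-learn's.

```python
import statistics

def robust_scale(values):
    """Center by the median and divide by the interquartile range (Q3 - Q1)."""
    # method="inclusive" interpolates linearly between data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]

# Made-up feature values with one large outlier.
amounts = [1.0, 2.0, 3.0, 4.0, 100.0]
scaled = robust_scale(amounts)
print(scaled)
```

The outlier stays an outlier after scaling, but it no longer squeezes the other values together the way min-max scaling would.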
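Let's look at the class distribution.
Fraud cases (class=1) are rare while non-fraud cases (class=0) make up the vast majority, which tells us there are still far more good people than bad in this world.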
import seaborn as sns
print(data.groupby('Class').size())
sns.countplot(x="Class", data=data)
Class
0 284315
1 492
dtype: int64
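To get a feel for how lopsided these counts are, here is a small sketch that scores a hypothetical "classifier" which labels every transaction as non-fraud. Only the two counts come from the output above; the baseline itself is an illustration, not something the post trains.

```python
# Class counts taken from the groupby output above.
num_neg = 284315  # non-fraud (Class = 0)
num_pos = 492     # fraud (Class = 1)
total = num_neg + num_pos

# Hypothetical baseline: always predict "non-fraud".
baseline_accuracy = num_neg / total  # every non-fraud case is "correct"
baseline_recall = 0 / num_pos        # not a single fraud case is found

print(f"fraud share:       {num_pos / total:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"baseline recall:   {baseline_recall:.0%}")
```

Accuracy comes out at roughly 99.83% even though the baseline is useless for fraud detection, which is why the per-class precision and recall are the numbers to watch.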
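Let's first pretend the imbalance doesn't exist and test the waters with plain Logistic Regression.
The result is disappointing: on the fraud class, recall is only 0.53, so a plain LR clearly isn't good enough for data like this.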
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from mlxtend.plotting import plot_decision_regions, plot_confusion_matrix
from matplotlib import pyplot as plt
lr = LogisticRegression()
# Fit..
lr.fit(X_train, y_train)
# Predict..
y_pred = lr.predict(X_test)
# Evaluate the model
print(classification_report(y_test, y_pred))
plot_confusion_matrix(confusion_matrix(y_test, y_pred))
             precision    recall  f1-score   support
          0       1.00      1.00      1.00     56849
          1       0.87      0.53      0.66       113
avg / total       1.00      1.00      1.00     56962
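One possible fix is to tell logistic regression about the class imbalance and weight the errors in proportion to that imbalance (class_weight='balanced').
However, this brings a problem of its own: almost all fraud cases are now detected (recall 0.90), but at the cost of a lot of false positives (precision drops to 0.07).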
lr = LogisticRegression(class_weight='balanced')
# Fit..
lr.fit(X_train, y_train)
# Predict..
y_pred = lr.predict(X_test)
# Evaluate the model
print(classification_report(y_test, y_pred))
plot_confusion_matrix(confusion_matrix(y_test, y_pred))
             precision    recall  f1-score   support
          0       1.00      0.98      0.99     56849
          1       0.07      0.90      0.13       113
avg / total       1.00      0.98      0.99     56962
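To see why class_weight='balanced' swings so hard towards the fraud class, it helps to compute the weights it implies. scikit-learn documents the rule as n_samples / (n_classes * count_of_class); the sketch below applies it to the full-dataset counts printed earlier. The model above was fit on the training split only, so its actual weights differ slightly; treat the numbers as an approximation.

```python
# Full-dataset class counts from the groupby output earlier in the post.
counts = {0: 284315, 1: 492}

n_samples = sum(counts.values())
n_classes = len(counts)

# 'balanced' rule: n_samples / (n_classes * count_of_class)
balanced = {label: n_samples / (n_classes * count) for label, count in counts.items()}

print(balanced)
print("fraud weight / non-fraud weight:", balanced[1] / balanced[0])
```

A mistake on a fraud case ends up costing several hundred times more than a mistake on a normal transaction, so the model happily trades precision for recall.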
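Another option is to tune the class weights by hand and look for a trade-off between false positives, false negatives and detected fraud cases. The F1-score is a good fit for this job, since it combines precision and recall into one number. This is where I have to pull out my cheat sheet of confusion-matrix metrics.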
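As a quick refresher on those metrics, here is a minimal sketch of precision, recall and F1 for the fraud class. The function names and the tp/fp/fn counts are made up for illustration; the precision/recall pairs in the last lines are the class-1 rows of the two reports above.

```python
def precision_recall_f1(tp, fp, fn):
    """Metrics for the positive (fraud) class from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of everything flagged as fraud, how much really is fraud
    recall = tp / (tp + fn)     # of all real fraud, how much was flagged
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

def f1_from(precision, recall):
    """F1 directly from precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Made-up counts, purely for illustration.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(p, r, f1)

# Class-1 rows of the two reports above: plain LR, then class_weight='balanced'.
f1_plain = f1_from(0.87, 0.53)
f1_balanced = f1_from(0.07, 0.90)
print(round(f1_plain, 2), round(f1_balanced, 2))
```

The last line reproduces the 0.66 and 0.13 shown in the reports. Because F1 is a harmonic mean, it is dragged down by whichever of precision or recall is worse.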
Now let's see how adjusting the weights affects F1:
from sklearn.model_selection import GridSearchCV
weights = np.linspace(0.05, 0.95, 20)
gsc = GridSearchCV(
estimator=LogisticRegression(),
param_grid={
'class_weight': [{0: x, 1: 1.0-x} for x in weights]
},
scoring='f1',
cv=3
)
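# Note: the search runs on the full X, y, so the held-out test rows also influence the
# chosen weights; fitting on X_train, y_train instead would avoid that leakage.
grid_result = gsc.fit(X, y)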
print("Best parameters : %s" % grid_result.best_params_)
# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
'weight': weights })
dataz.plot(x='weight')
Best parameters : {'class_weight': {0: 0.14473684210526316, 1: 0.85526315789473684}}
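To make the effect of these weights concrete, the sketch below scales a per-sample log loss by the class weight, which is roughly how class_weight enters the logistic regression objective. It is a simplification that ignores regularization and the solver. weighted_log_loss is a hypothetical helper, and the two predicted probabilities are made up.

```python
import math

def weighted_log_loss(y_true, p_fraud, class_weight):
    """Mean log loss where each sample's loss is scaled by the weight of its true class."""
    total = 0.0
    for y, p in zip(y_true, p_fraud):
        sample_loss = -math.log(p) if y == 1 else -math.log(1.0 - p)
        total += class_weight[y] * sample_loss
    return total / len(y_true)

# Best weights reported by the grid search above.
best = {0: 0.14473684210526316, 1: 0.85526315789473684}

# Two equally confident mistakes (made-up probabilities):
missed_fraud = weighted_log_loss([1], [0.1], best)  # real fraud, predicted P(fraud) = 0.1
false_alarm = weighted_log_loss([0], [0.9], best)   # normal case, predicted P(fraud) = 0.9

penalty_ratio = missed_fraud / false_alarm
print(penalty_ratio)
```

With the tuned weights a missed fraud costs about 5.9 times as much as an equally confident false alarm, which is far gentler than the several-hundred-fold factor implied by 'balanced'.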
With the tuned parameter we train another LR model, and this one performs much better.
lr = LogisticRegression(**grid_result.best_params_)
# Fit..
lr.fit(X_train, y_train)
# Predict..
y_pred = lr.predict(X_test)
# Evaluate the model
print(classification_report(y_test, y_pred))
plot_confusion_matrix(confusion_matrix(y_test, y_pred))
             precision    recall  f1-score   support
          0       1.00      1.00      1.00     56849
          1       0.81      0.78      0.79       113
avg / total       1.00      1.00      1.00     56962
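Another approach is to resample the data to balance the ratio between the two classes; here we over-sample the fraud class with SMOTE. The results are similar to class weighting. Here's the code: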
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
pipe = make_pipeline(
SMOTE(),
LogisticRegression()
)
# Fit..
pipe.fit(X_train, y_train)
# Predict..
y_pred = pipe.predict(X_test)
# Evaluate the model
print(classification_report(y_test, y_pred))
plot_confusion_matrix(confusion_matrix(y_test, y_pred))
             precision    recall  f1-score   support
          0       1.00      0.98      0.99     56849
          1       0.07      0.90      0.13       113
avg / total       1.00      0.98      0.99     56962
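For intuition about what SMOTE() does inside the pipeline above: it creates new minority samples by interpolating between an existing minority point and one of its nearest minority neighbours. The following is a deliberately simplified pure-Python sketch of that idea, not imbalanced-learn's implementation. The function name, k=2 and the toy points are all made up for illustration.

```python
import math
import random

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points on segments between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Toy minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Every synthetic point lies between two real minority points, so SMOTE fills in the region the fraud cases already occupy rather than duplicating rows.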
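SMOTE has much the same problem as automatic balanced weighting: the final predictions contain a lot of false fraud cases (false positives).
Time for a round of manual tuning: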
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
pipe = make_pipeline(
SMOTE(),
LogisticRegression()
)
weights = np.linspace(0.005, 0.05, 10)
gsc = GridSearchCV(
estimator=pipe,
param_grid={
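# Note: imbalanced-learn 0.4+ names this parameter 'sampling_strategy' instead of
# 'ratio', so the key there would be 'smote__sampling_strategy'.
#'smote__ratio': [{0: int(num_neg), 1: int(num_neg * w) } for w in weights]
'smote__ratio': weights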
},
scoring='f1',
cv=3
)
# Note: this searches over the full X, y; using X_train, y_train would keep the test set unseen.
grid_result = gsc.fit(X, y)
print("Best parameters : %s" % grid_result.best_params_)
# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
'weight': weights })
dataz.plot(x='weight')
Best parameters : {'smote__ratio': 0.015000000000000003}
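A float ratio is read as the desired minority/majority ratio after resampling, whereas the default SMOTE() resamples all the way to 1:1. The sketch below estimates how many synthetic fraud rows each setting would ask for. It uses the full-dataset counts printed earlier, while the grid search actually resamples inside each CV training fold, so the numbers are only indicative. synthetic_needed is a hypothetical helper, and its rounding is an assumption about the library's behaviour.

```python
def synthetic_needed(n_majority, n_minority, ratio):
    """Synthetic minority rows needed so that minority / majority is about `ratio`."""
    target_minority = int(n_majority * ratio)
    return max(target_minority - n_minority, 0)

n_majority, n_minority = 284315, 492  # counts from the class distribution above

needed = {
    ratio: synthetic_needed(n_majority, n_minority, ratio)
    for ratio in (0.005, 0.015, 0.05, 1.0)
}
for ratio, n_new in needed.items():
    print(f"ratio={ratio}: {n_new} synthetic fraud rows")
```

At the best value found, 0.015, only a few thousand synthetic rows are added, compared with a few hundred thousand for full 1:1 balancing, which fits the observation that aggressive balancing floods the predictions with false positives.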
Fit the model with the best parameter found by the search:
pipe = make_pipeline(
SMOTE(ratio=0.015),  # sampling_strategy=0.015 in imbalanced-learn 0.4+
LogisticRegression()
)
# Fit..
pipe.fit(X_train, y_train)
# Predict..
y_pred = pipe.predict(X_test)
# Evaluate the model
print(classification_report(y_test, y_pred))
plot_confusion_matrix(confusion_matrix(y_test, y_pred))
             precision    recall  f1-score   support
          0       1.00      1.00      1.00     56849
          1       0.81      0.79      0.80       113
avg / total       1.00      1.00      1.00     56962
That's all!! ✿✿ヽ(°▽°)ノ✿