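Some thoughts on class imbalance in datasets: resampling and weighting

How imbalanced data affects algorithms, and what to do about it.

First, load the dataset and split it into training and test sets. The dataset can be downloaded here: dataset download (credit card fraud competition).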

import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split

data = pd.read_csv('../input/creditcard.csv')

# Separate data into X/y
y = data['Class'].values
X = data.drop(['Class', 'Time'], axis=1).values

num_neg = (y==0).sum()
num_pos = (y==1).sum()

# Scaling..
scaler = RobustScaler()
X = scaler.fit_transform(X)

# Split into train/test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
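
With so few positives, an unstratified random split can leave the test set with noticeably more or fewer fraud cases than the overall rate would suggest. scikit-learn's `train_test_split` accepts a `stratify=y` argument for this. The sketch below only illustrates the idea in plain Python on a toy label list; the helper name `stratified_split` and the toy data are assumptions for illustration, not part of the original notebook.

```python
import random

def stratified_split(labels, test_size=0.2, seed=0):
    """Return (train_idx, test_idx), keeping each class's share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for label, indices in by_class.items():
        rng.shuffle(indices)
        # Take the same fraction from every class
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 1000 negatives and 10 positives (hypothetical data)
labels = [0] * 1000 + [1] * 10
train_idx, test_idx = stratified_split(labels)

n_pos_test = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), n_pos_test)
```

Each class contributes the same fraction to the test set, so the positive rate in the test split matches the overall rate as closely as rounding allows.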

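Let's look at the class distribution.

Fraud cases (class=1) are rare, while non-fraud cases (class=0) make up the vast majority, which tells us most people in this world are honest.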

import seaborn as sns

print(data.groupby('Class').size())

sns.countplot(x="Class", data=data)


Class
0    284315
1       492
dtype: int64
Out[2]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8bb4cc2748>
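
To put numbers on the imbalance, the small sketch below uses the two counts printed above to compute the fraud rate and the accuracy of a trivial model that always predicts class 0. It is only arithmetic on the printed counts, not part of the original notebook.

```python
# Class counts taken from the groupby output above
num_neg = 284315
num_pos = 492
total = num_neg + num_pos

fraud_rate = num_pos / total
# Accuracy of a trivial model that always predicts class 0
baseline_accuracy = num_neg / total

print(f"fraud rate: {fraud_rate:.4%}")
print(f"always-negative accuracy: {baseline_accuracy:.4%}")
```

Fraud is about 0.17% of the rows, so a model that never flags anything already scores above 99.8% accuracy. That is why the rest of this post looks at precision, recall and F1 on class 1 rather than accuracy.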

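Let's first pretend the imbalance doesn't exist and give Logistic Regression a try.

The result is bad: recall on the fraud class is only 0.53. Clearly plain LR is not good enough for data like this.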

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from mlxtend.plotting import plot_decision_regions, plot_confusion_matrix
from matplotlib import pyplot as plt

lr = LogisticRegression()

# Fit..
lr.fit(X_train, y_train)

# Predict..
y_pred = lr.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
plot_confusion_matrix(confusion_matrix(y_test, y_pred))
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     56849
          1       0.87      0.53      0.66       113

avg / total       1.00      1.00      1.00     56962

Out[3]:

(<matplotlib.figure.Figure at 0x7f8bf42a8b38>,
 <matplotlib.axes._subplots.AxesSubplot at 0x7f8ba82ac0b8>)

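One possible fix is to tell logistic regression about the class imbalance and weight the errors in proportion to it.

However, this brings a problem of its own: almost all fraud is detected, but at the cost of a lot of false positives.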

lr = LogisticRegression(class_weight='balanced')

# Fit..
lr.fit(X_train, y_train)

# Predict..
y_pred = lr.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
plot_confusion_matrix(confusion_matrix(y_test, y_pred))
             precision    recall  f1-score   support

          0       1.00      0.98      0.99     56849
          1       0.07      0.90      0.13       113

avg / total       1.00      0.98      0.99     56962

Out[4]:

(<matplotlib.figure.Figure at 0x7f8bb14b76a0>,
 <matplotlib.axes._subplots.AxesSubplot at 0x7f8ba7e5f4e0>)
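
For a feel of how strong `class_weight='balanced'` is, the sketch below applies the heuristic documented by scikit-learn, `n_samples / (n_classes * np.bincount(y))`, to the full-dataset counts. Using the full-dataset counts is an assumption made for illustration; the model above is fitted on the training split, so its actual weights differ slightly.

```python
# Class counts from the full dataset (the notebook fits on the training split,
# so its actual weights differ slightly)
counts = {0: 284315, 1: 492}
n_samples = sum(counts.values())
n_classes = len(counts)

# Heuristic documented for class_weight='balanced':
# n_samples / (n_classes * count_of_class)
balanced = {c: n_samples / (n_classes * n) for c, n in counts.items()}

print(balanced)
print("relative weight of fraud:", balanced[1] / balanced[0])
```

A misclassified fraud case ends up weighing several hundred times as much as a misclassified normal transaction, which fits the high recall and very low precision in the report above.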

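Another option is to tune the class weights by hand and look for a trade-off between FP, FN and the fraud cases we actually catch. F1-score is just the right tool for this. Time to pull out my cheat sheet: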

[Image: cheat sheet]
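
In case the image does not load, here is a minimal sketch of the standard definitions of precision, recall and F1 for the positive class. The counts are illustrative only: they were picked to be roughly consistent with the `class_weight='balanced'` report above and are not the notebook's actual confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Illustrative counts only, picked to be roughly consistent with the
# class_weight='balanced' report above; not the notebook's confusion matrix.
tp, fp, fn = 102, 1355, 11
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, a very low precision drags it down even when recall is high, so it penalizes both kinds of error.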

Now let's see how adjusting the weights affects F1:

from sklearn.model_selection import GridSearchCV

weights = np.linspace(0.05, 0.95, 20)

gsc = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={
        'class_weight': [{0: x, 1: 1.0-x} for x in weights]
    },
    scoring='f1',
    cv=3
)
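# Note: this searches on the full data (X, y), test split included;
# fitting on X_train, y_train would keep the test set untouched.
grid_result = gsc.fit(X, y)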

print("Best parameters : %s" % grid_result.best_params_)

# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
                       'weight': weights })
dataz.plot(x='weight')
Best parameters : {'class_weight': {0: 0.14473684210526316, 1: 0.85526315789473684}}
Out[5]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ba7df25f8>
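
To read the best parameters more easily, the sketch below rebuilds the same weight grid in plain Python and computes how much heavier a fraud error is than a non-fraud error at the reported best point. It assumes the best point is the third grid value, which matches the printed 0.1447.

```python
# Same grid as the search above, rebuilt without NumPy
n_points = 20
weights = [0.05 + i * (0.95 - 0.05) / (n_points - 1) for i in range(n_points)]
grid = [{0: w, 1: 1.0 - w} for w in weights]

# The reported best setting is the third grid point
best = grid[2]
relative = best[1] / best[0]

print(best)
print("fraud errors weigh about %.1fx a non-fraud error" % relative)
```

So the search settles on a fraud weight roughly 6 times the non-fraud weight, far milder than the ratio of several hundred implied by `'balanced'`. Keep in mind that with regularized LR the absolute scale of the weights can matter a little too, so the ratio is only part of the picture.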

With the tuned parameter we train another LR model, and this one performs much better.

lr = LogisticRegression(**grid_result.best_params_)

# Fit..
lr.fit(X_train, y_train)

# Predict..
y_pred = lr.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
plot_confusion_matrix(confusion_matrix(y_test, y_pred))
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     56849
          1       0.81      0.78      0.79       113

avg / total       1.00      1.00      1.00     56962

Out[6]:

(<matplotlib.figure.Figure at 0x7f8bb14a4f98>,
 <matplotlib.axes._subplots.AxesSubplot at 0x7f8ba05662e8>)

Yet another approach is to re-sample the data to balance the ratio between the two classes. The result is similar to weighting. Here's the code:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline

pipe = make_pipeline(
    SMOTE(),
    LogisticRegression()
)

# Fit..
pipe.fit(X_train, y_train)

# Predict..
y_pred = pipe.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
plot_confusion_matrix(confusion_matrix(y_test, y_pred))
             precision    recall  f1-score   support

          0       1.00      0.98      0.99     56849
          1       0.07      0.90      0.13       113

avg / total       1.00      0.98      0.99     56962

Out[7]:

(<matplotlib.figure.Figure at 0x7f8ba0530d30>,
 <matplotlib.axes._subplots.AxesSubplot at 0x7f8ba04a86a0>)
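
For intuition about what SMOTE does, here is a minimal plain-Python sketch of its core step: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the segment between them. This is a simplified illustration, not imbalanced-learn's implementation; the function name `smote_like_sample` and the toy points are made up for the example.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and a near neighbour."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)

    # Squared Euclidean distance from base to another point
    def dist2(p):
        return sum((a - b) ** 2 for a, b in zip(base, p))

    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=dist2)[:k]
    neighbour = rng.choice(neighbours)

    # Interpolate: base + gap * (neighbour - base), with gap in [0, 1)
    gap = rng.random()
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))

# Hypothetical 2-D minority samples
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0)]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print(point)
```

Every synthetic point lies between two existing minority points, so the minority class gets denser without exact duplicates.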

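SMOTE has a problem similar to automatic balanced weighting: the final predictions contain a lot of false fraud cases (false positives).

Time for a round of manual tuning: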

import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

pipe = make_pipeline(
    SMOTE(),
    LogisticRegression()
)

weights = np.linspace(0.005, 0.05, 10)

gsc = GridSearchCV(
    estimator=pipe,
    param_grid={
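        #'smote__ratio': [{0: int(num_neg), 1: int(num_neg * w) } for w in weights]
        # Note: imbalanced-learn 0.4 renamed `ratio` to `sampling_strategy` and
        # 0.6 removed `ratio`; on newer versions use 'smote__sampling_strategy'.
        'smote__ratio': weights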
    },
    scoring='f1',
    cv=3
)
grid_result = gsc.fit(X, y)

print("Best parameters : %s" % grid_result.best_params_)

# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
                       'weight': weights })
dataz.plot(x='weight')
Best parameters : {'smote__ratio': 0.015000000000000003}
Out[8]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f8ba04d5940>
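
As far as I understand the imbalanced-learn documentation, a float passed here is the desired minority-to-majority ratio after resampling. The sketch below works out what 0.015 would mean in sample counts. It uses the full-dataset counts as an assumption for illustration; inside the pipeline SMOTE only sees the training folds, so the real numbers are smaller.

```python
# Full-dataset class counts (the pipeline actually resamples training folds,
# so the real numbers are smaller)
num_neg = 284315
num_pos = 492

ratio = 0.015  # best value reported by the grid search above
target_pos = int(num_neg * ratio)
n_synthetic = target_pos - num_pos

print("minority/majority before:", round(num_pos / num_neg, 5))
print("target minority size:", target_pos)
print("synthetic samples to add:", n_synthetic)
```

So the best setting only grows the fraud class from about 0.17% to 1.5% of the majority size, a long way from the 1:1 balance that the default `SMOTE()` aims for.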

Fit the model with the best parameter found by the search:

pipe = make_pipeline(
    SMOTE(ratio=0.015),  # sampling_strategy=0.015 on imbalanced-learn >= 0.4
    LogisticRegression()
)

# Fit..
pipe.fit(X_train, y_train)

# Predict..
y_pred = pipe.predict(X_test)

# Evaluate the model
print(classification_report(y_test, y_pred))
plot_confusion_matrix(confusion_matrix(y_test, y_pred))
             precision    recall  f1-score   support

          0       1.00      1.00      1.00     56849
          1       0.81      0.79      0.80       113

avg / total       1.00      1.00      1.00     56962

Out[9]:

(<matplotlib.figure.Figure at 0x7f8ba0462e10>,
 <matplotlib.axes._subplots.AxesSubplot at 0x7f8ba03fd2b0>)

That's it!! ✿✿ヽ(°▽°)ノ✿
