Spam Email Identification

Dataset: the Enron-Spam dataset

An example of a normal (ham) email:

Subject: christmas baskets
the christmas baskets have been we have ordered several baskets individual earth - sat freeze - smith barney group baskets rodney keys matt rodgers charlie notis jon davis move
team
phillip randle chris hyde
harvey
freese
faclities

An example of a spam email:

Subject: fw : this is the solution i mentioned lsc
oo
thank you ,
your email address was obtained from a purchased list , reference # 2020 mid = 3300 . if you wish to unsubscribe from this list , please click here and enter

your name into the remove box . if you have previously unsubscribed
and are still receiving this message , you may email our abuse
control center , or call 1 - 888 - 763 - 2497 , or write us at : nospam , 6484 coral way , miami , fl , 33155 " . 2002

web credit inc . all rights reserved .

Three text representations from the text-processing domain are used here: the bag-of-words model, the vocabulary model, and the TF-IDF model.
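
To make the difference between the first and the third representation concrete, here is a minimal sketch (scikit-learn only; the two toy sentences and variable names are purely illustrative and not part of the original code) that turns a tiny corpus into bag-of-words counts and then re-weights them with TF-IDF. The vocabulary model is used later through tflearn's VocabularyProcessor in get_features_by_tf.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ["free credit click here", "team meeting moved to monday"]   # toy corpus, illustrative only
cv = CountVectorizer()
counts = cv.fit_transform(docs)                    # bag-of-words: raw term counts per document
tfidf = TfidfTransformer().fit_transform(counts)   # TF-IDF: counts re-weighted by inverse document frequency
print(sorted(cv.vocabulary_))                      # vocabulary learned from the toy corpus
print(counts.toarray())
print(tfidf.toarray())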

Naive Bayes with the bag-of-words model:


The larger max_features is, the higher the model's evaluation accuracy, but the running time of the whole pipeline also grows. Once max_features exceeds roughly 13,000, accuracy starts to drop again, so setting max_features to around 13,000 yields the best accuracy, close to 96.4%. The experiments also show, however, that above about 5,000 the computation time becomes noticeably long while the accuracy gain is marginal, so as a compromise max_features = 5,000 is sufficient for these experiments.
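
For reference, a minimal sketch of this Naive Bayes step, assuming the helper functions and imports from the complete listing at the end of this section (get_features_by_wordbag, train_test_split, GaussianNB, metrics); it mirrors do_nb_wordbag in that listing:

x, y = get_features_by_wordbag()    # bag-of-words features, at most max_features columns
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
gnb = GaussianNB()                  # Gaussian Naive Bayes classifier
gnb.fit(x_train, y_train)
y_pred = gnb.predict(x_test)
print metrics.accuracy_score(y_test, y_pred)
print metrics.confusion_matrix(y_test, y_pred)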

Support vector machine (SVM):

Randomly split the dataset into training and test sets, with the test set making up 40%:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.4, random_state = 0)

Instantiate the SVM, train it on the training set, and predict on the test set:

clf = svm.SVC()
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

Evaluate the accuracy of the results and the four values TP, FP, TN, FN:

print metrics.accuracy_score(y_test, y_pred)
print metrics.confusion_matrix(y_test, y_pred)
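
If the four counts are needed individually rather than as a matrix, they can be unpacked from the confusion matrix. This small addition is not part of the original code; it only uses the metrics API shown above, with label 0 (ham) as the negative class and label 1 (spam) as the positive class:

tn, fp, fn, tp = metrics.confusion_matrix(y_test, y_pred).ravel()   # [[TN, FP], [FN, TP]] flattened row by row
print "TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp)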

Multi-layer perceptron (MLP):

Randomly split the dataset into training and test sets, with the test set making up 40%:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.4, random_state = 0)

Instantiate the MLP, train it on the training set, and predict on the test set:

clf = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    hidden_layer_sizes=(5, 2),
                    random_state=1)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

Evaluate the accuracy of the results and the four values TP, FP, TN, FN:

print metrics.accuracy_score(y_test, y_pred)

print metrics.confusion_matrix(y_test, y_pred)

Deep learning with a CNN:

Randomly split the dataset into training and test sets, with the test set making up 40%:

x, y = get_features_by_tf()

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.4, random_state = 0)

Pad and convert the training and test data: sequences shorter than the maximum length are padded with 0, and since this is a binary classification problem the labels are converted to binary (one-hot) vectors. The maximum input length is set to the maximum document length:

trainX = pad_sequences(trainX, maxlen=max_document_length, value=0.)
testX = pad_sequences(testX, maxlen=max_document_length, value=0.)
# Converting labels to binary vectors
trainY = to_categorical(trainY, nb_classes=2)
testY = to_categorical(testY, nb_classes=2)
network = input_data(shape=[None, max_document_length], name='input')

Define the CNN model, which processes the data with three 1-D convolution layers, each with 128 filters and kernel sizes of 3, 4, and 5 respectively:

network = tflearn.embedding(network, input_dim=1000000, output_dim=128)
branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
network = merge([branch1, branch2, branch3], mode='concat', axis=1)
network = tf.expand_dims(network, 2)
network = global_max_pool(network)
network = dropout(network, 0.8)
network = fully_connected(network, 2, activation='softmax')
network = regression(network, optimizer='adam', learning_rate=0.001,
                     loss='categorical_crossentropy', name='target')

Instantiate the CNN model and train it on the data for 5 epochs:

model = tflearn.DNN(network, tensorboard_verbose=0)
model.fit(trainX, trainY,
          n_epoch=5, shuffle=True, validation_set=(testX, testY),
          show_metric=True, batch_size=100, run_id="spam")

Deep learning with an RNN:

The RNN workflow is: 1) extract a vocabulary from the Enron-Spam dataset files; 2) randomly split the data into a training set and a test set; 3) train the RNN on the training set to obtain the model; 4) use the model to predict on the test set; 5) validate the RNN's prediction performance.

Randomly split the dataset into training and test sets, with the test set making up 40%:

x,y=get_features_by_tf()
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.4, random_state = 0)

Pad and convert the training and test data: sequences shorter than the maximum length are padded with 0, and since this is a binary classification problem the labels are converted to binary (one-hot) vectors. The maximum input length is set to the maximum document length:

trainX = pad_sequences(trainX, maxlen=max_document_length, value=0.)
testX = pad_sequences(testX, maxlen=max_document_length, value=0.)
# Converting labels to binary vectors
trainY = to_categorical(trainY, nb_classes=2)
testY = to_categorical(testY, nb_classes=2)

Define the RNN structure, using the simplest single-layer LSTM:

# Network building
net = tflearn.input_data([None, max_document_length])
net = tflearn.embedding(net, input_dim=1024000, output_dim=128)
net = tflearn.lstm(net, 128, dropout=0.8)
net = tflearn.fully_connected(net, 2, activation='softmax')
net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy')

Instantiate the RNN model and train it on the data for 5 epochs:

# Training
model = tflearn.DNN(net, tensorboard_verbose=0)
model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
          batch_size=10, run_id="spm-run", n_epoch=5)

The complete code for this section is as follows:

from sklearn.feature_extraction.text import CountVectorizer
import os
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
import numpy as np
from sklearn import svm
from sklearn.feature_extraction.text import TfidfTransformer
import tensorflow as tf
import tflearn
from tflearn.layers.core import input_data, dropout, fully_connected
from tflearn.layers.conv import conv_1d, global_max_pool
from tflearn.layers.conv import conv_2d, max_pool_2d
from tflearn.layers.merge_ops import merge
from tflearn.layers.estimator import regression
from tflearn.data_utils import to_categorical, pad_sequences
from sklearn.neural_network import MLPClassifier
from tflearn.layers.normalization import local_response_normalization
from tensorflow.contrib import learn


max_features=5000
max_document_length=100



# Read a single mail file and concatenate its lines into one string
def load_one_file(filename):
    x=""
    with open(filename) as f:
        for line in f:
            line=line.strip('\n')
            line = line.strip('\r')
            x+=line
    return x

# Load every regular file under rootdir as one document string
def load_files_from_dir(rootdir):
    x=[]
    list = os.listdir(rootdir)
    for i in range(0, len(list)):
        path = os.path.join(rootdir, list[i])
        if os.path.isfile(path):
            v=load_one_file(path)
            x.append(v)
    return x

# Load ham and spam mails from the Enron-Spam enron1 subset
def load_all_files():
    ham=[]
    spam=[]
    for i in range(1,2):
        path="/Users/zhanglipeng/Desktop/2book-master/data/mail/enron%d/ham/" % i
        print "Load %s" % path
        ham+=load_files_from_dir(path)
        path="/Users/zhanglipeng/Desktop/2book-master/data/mail/enron%d/spam/" % i
        print "Load %s" % path
        spam+=load_files_from_dir(path)
    return ham,spam

# Bag-of-words features: each mail becomes a vector of raw term counts (at most max_features columns)
def get_features_by_wordbag():
    ham, spam=load_all_files()
    x=ham+spam
    y=[0]*len(ham)+[1]*len(spam)
    vectorizer = CountVectorizer(
                                 decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1 )
    print vectorizer
    x=vectorizer.fit_transform(x)
    x=x.toarray()
    return x,y

# Sweep max_features and plot the resulting GaussianNB accuracy for each value
def show_diffrent_max_features():
    global max_features
    a=[]
    b=[]
    for i in range(1000,20000,2000):
        max_features=i
        print "max_features=%d" % i
        x, y = get_features_by_wordbag()
        x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4, random_state=0)
        gnb = GaussianNB()
        gnb.fit(x_train, y_train)
        y_pred = gnb.predict(x_test)
        score=metrics.accuracy_score(y_test, y_pred)
        a.append(max_features)
        b.append(score)
        plt.plot(a, b, 'r')
    plt.xlabel("max_features")
    plt.ylabel("metrics.accuracy_score")
    plt.title("metrics.accuracy_score VS max_features")
    plt.legend()
    plt.show()

def do_nb_wordbag(x_train, x_test, y_train, y_test):
    print "NB and wordbag"
    gnb = GaussianNB()
    gnb.fit(x_train,y_train)
    y_pred=gnb.predict(x_test)
    print metrics.accuracy_score(y_test, y_pred)
    print metrics.confusion_matrix(y_test, y_pred)

def do_svm_wordbag(x_train, x_test, y_train, y_test):
    print "SVM and wordbag"
    clf = svm.SVC()
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print metrics.accuracy_score(y_test, y_pred)
    print metrics.confusion_matrix(y_test, y_pred)

# TF-IDF features: bag-of-words counts re-weighted by inverse document frequency
def get_features_by_wordbag_tfidf():
    ham, spam=load_all_files()
    x=ham+spam
    y=[0]*len(ham)+[1]*len(spam)
    vectorizer = CountVectorizer(binary=False,
                                 decode_error='ignore',
                                 strip_accents='ascii',
                                 max_features=max_features,
                                 stop_words='english',
                                 max_df=1.0,
                                 min_df=1 )
    print vectorizer
    x=vectorizer.fit_transform(x)
    x=x.toarray()
    transformer = TfidfTransformer(smooth_idf=False)
    print transformer
    tfidf = transformer.fit_transform(x)
    x = tfidf.toarray()
    return  x,y


def do_cnn_wordbag(trainX, testX, trainY, testY):
    global max_document_length
    print "CNN and tf"

    trainX = pad_sequences(trainX, maxlen=max_document_length, value=0.)
    testX = pad_sequences(testX, maxlen=max_document_length, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)

    # Building convolutional network
    network = input_data(shape=[None,max_document_length], name='input')
    network = tflearn.embedding(network, input_dim=1000000, output_dim=128)
    branch1 = conv_1d(network, 128, 3, padding='valid', activation='relu', regularizer="L2")
    branch2 = conv_1d(network, 128, 4, padding='valid', activation='relu', regularizer="L2")
    branch3 = conv_1d(network, 128, 5, padding='valid', activation='relu', regularizer="L2")
    network = merge([branch1, branch2, branch3], mode='concat', axis=1)
    network = tf.expand_dims(network, 2)
    network = global_max_pool(network)
    network = dropout(network, 0.8)
    network = fully_connected(network, 2, activation='softmax')
    network = regression(network, optimizer='adam', learning_rate=0.001,
                         loss='categorical_crossentropy', name='target')
    # Training
    model = tflearn.DNN(network, tensorboard_verbose=0)
    model.fit(trainX, trainY,
              n_epoch=5, shuffle=True, validation_set=(testX, testY),
              show_metric=True, batch_size=100,run_id="spam")

def do_rnn_wordbag(trainX, testX, trainY, testY):
    global max_document_length
    print "RNN and wordbag"

    trainX = pad_sequences(trainX, maxlen=max_document_length, value=0.)
    testX = pad_sequences(testX, maxlen=max_document_length, value=0.)
    # Converting labels to binary vectors
    trainY = to_categorical(trainY, nb_classes=2)
    testY = to_categorical(testY, nb_classes=2)

    # Network building
    net = tflearn.input_data([None, max_document_length])
    net = tflearn.embedding(net, input_dim=10240000, output_dim=128)
    net = tflearn.lstm(net, 128, dropout=0.8)
    net = tflearn.fully_connected(net, 2, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=0.001,
                             loss='categorical_crossentropy')

    # Training
    model = tflearn.DNN(net, tensorboard_verbose=0)
    model.fit(trainX, trainY, validation_set=(testX, testY), show_metric=True,
              batch_size=10,run_id="spm-run",n_epoch=5)


def do_dnn_wordbag(x_train, x_test, y_train, y_test):
    print "DNN and wordbag"

    # Building deep neural network
    clf = MLPClassifier(solver='lbfgs',
                        alpha=1e-5,
                        hidden_layer_sizes = (5, 2),
                        random_state = 1)
    print  clf
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print metrics.accuracy_score(y_test, y_pred)
    print metrics.confusion_matrix(y_test, y_pred)



# Vocabulary-model features: each mail becomes a fixed-length sequence of word ids (length max_document_length)
def get_features_by_tf():
    global max_document_length
    x=[]
    y=[]
    ham, spam=load_all_files()
    x=ham+spam
    y=[0]*len(ham)+[1]*len(spam)
    vp=tflearn.data_utils.VocabularyProcessor(max_document_length=max_document_length,
                                              min_frequency=0,
                                              vocabulary=None,
                                              tokenizer_fn=None)
    x=vp.fit_transform(x, unused_y=None)
    x=np.array(list(x))
    return x,y



if __name__ == "__main__":
    print "Hello spam-mail"
    print "get_features_by_wordbag"
    x,y=get_features_by_wordbag()
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.4, random_state = 0)
    do_svm_wordbag(x_train, x_test, y_train, y_test)
    #do_nb_wordbag(x_train, x_test, y_train, y_test)
    #do_dnn_wordbag(x_train, x_test, y_train, y_test)

    print "get_features_by_wordbag_tfidf"
    x,y=get_features_by_wordbag_tfidf()
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.4, random_state = 0)
    do_svm_wordbag(x_train, x_test, y_train, y_test)
    #do_dnn_wordbag(x_train, x_test, y_train, y_test)
    #NB
    #do_nb_wordbag(x_train, x_test, y_train, y_test)
    #show_diffrent_max_features()

    #SVM
    #do_svm_wordbag(x_train, x_test, y_train, y_test)

    #DNN
    #do_dnn_wordbag(x_train, x_test, y_train, y_test)

    #print "get_features_by_tf"
    #x,y=get_features_by_tf()
    #x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.4, random_state = 0)
    #CNN
    #do_cnn_wordbag(x_train, x_test, y_train, y_test)


    #RNN
    #do_rnn_wordbag(x_train, x_test, y_train, y_test)

Output:

Hello spam-mail

get_features_by_wordbag

Load /Users/zhanglipeng/Desktop/2book-master/data/mail/enron1/ham/

Load /Users/zhanglipeng/Desktop/2book-master/data/mail/enron1/spam/

CountVectorizer(analyzer=u'word', binary=False, decode_error='ignore',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=5000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents='ascii', token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

SVM and wordbag

/anaconda2/envs/python27/lib/python2.7/site-packages/sklearn/svm/base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)

0.7491541807636539

[[1460    3]
 [ 516   90]]

get_features_by_wordbag_tfidf

Load /Users/zhanglipeng/Desktop/2book-master/data/mail/enron1/ham/

Load /Users/zhanglipeng/Desktop/2book-master/data/mail/enron1/spam/

CountVectorizer(analyzer=u'word', binary=False, decode_error='ignore',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=5000, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents='ascii', token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
TfidfTransformer(norm=u'l2', smooth_idf=False, sublinear_tf=False,
         use_idf=True)

SVM and wordbag

0.7071048815853069

[[1463    0]
 [ 606    0]]

(python27) zhanglipengdeMacBook-Pro:code zhanglipeng$