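User Profiling

Tag each user with labels (age, gender, education level) based on their search keyword data.

Overall workflow

(1) Data preprocessing

Convert the file encoding
Segment the search queries into words
Filter words by part of speech
Check the data

(2) Feature selection

Build a word2vec word-vector model
Compute the average word vector over each user's search data

(3) Modeling and prediction

Predict each user's class with a logistic regression model

Convert the raw data to UTF-8 to avoid encoding problems in later steps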

import csv

def code_conversion(filename):
    # Path of the raw data file
    data_path = 'F:\\data_load\\' + filename
    # Read GB18030 (undecodable bytes are dropped) and write UTF-8.
    # newline='' stops the csv module from emitting blank rows on Windows.
    with open(data_path, 'r', encoding='gb18030', errors='ignore') as f, \
         open(data_path + '.csv', 'w', encoding='utf-8', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['ID', 'age', 'Gender', 'Education', 'QueryList'])
        for line in f:
            data = line.rstrip('\r\n').split('\t')
            if len(data) < 4:
                # Skip blank or malformed lines
                continue
            # The first four fields are the ID and labels; the remaining fields are
            # the queries, kept tab-separated inside a single QueryList column
            writer.writerow(data[:4] + ['\t'.join(data[4:])])

code_conversion('user_tag_query.10W.TRAIN')
code_conversion('user_tag_query.10W.TEST')

Generate the corresponding data tables

import pandas as pd

# Files produced by the encoding conversion above
trainname = 'F:\\data_load\\user_tag_query.10W.TRAIN.csv'
testname = 'F:\\data_load\\user_tag_query.10W.TEST.csv'
data = pd.read_csv(trainname, encoding='utf-8')
print(data.head())

# Write the three label columns (age, gender, education) to separate files.
# header=False keeps the files purely numeric so np.loadtxt can read them later.
data.age.to_csv('F:\\data_load\\train_age.csv', index=False, header=False)
data.Gender.to_csv('F:\\data_load\\train_gender.csv', index=False, header=False)
data.Education.to_csv('F:\\data_load\\train_education.csv', index=False, header=False)
# Write the search queries to their own file
data.QueryList.to_csv('F:\\data_load\\train_querylist.csv', index=False, header=False)

# Load the test data
data = pd.read_csv(testname, encoding='utf-8')
data.info()
data.QueryList.to_csv('F:\\data_load\\test_querylist.csv', index=False, header=False)
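
The re-encoding idea behind this step can be checked in isolation. The sketch below is only an illustration and not part of the pipeline: the sample line and the helper name `gb18030_to_utf8` are made up for the demo.

```python
def gb18030_to_utf8(raw: bytes) -> bytes:
    """Decode GB18030 bytes (dropping undecodable bytes) and re-encode as UTF-8."""
    text = raw.decode('gb18030', errors='ignore')
    return text.encode('utf-8')

# Hypothetical sample line: an ID, three labels and two queries, tab-separated
sample = 'u001\t1\t1\t3\t清华大学\t补课'
raw_gb = sample.encode('gb18030')

converted = gb18030_to_utf8(raw_gb)
fields = converted.decode('utf-8').split('\t')
# The first four fields are ID and labels, the rest are the queries
print(fields[:4], fields[4:])
```

The byte sequences differ between the two encodings, but the decoded text is identical, which is all the conversion step relies on.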


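Segment the users' search data into words and filter by part of speech

The same steps must be run on both the training set and the test set; just change the file paths accordingly.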

import time

import numpy as np
import jieba
import jieba.posseg

def read_lines(path):
    # Read raw bytes; jieba decodes them itself (UTF-8 first, then GBK)
    with open(path, 'rb') as f:
        return f.readlines()

start = time.perf_counter()  # time.clock() was removed in Python 3.8

filepath = 'F:\\data_load\\test_querylist.csv'
QueryList = read_lines(filepath)

writepath = 'F:\\data_load\\test_querylist_writefile.csv'
# Keep only words whose POS tag starts with n (noun), v (verb) or j (abbreviation)
allowPOS = ['n', 'v', 'j']
POS = {}  # frequency of every POS tag seen, for inspection
with open(writepath, 'w') as csvfile:
    for query in QueryList:
        kept = []
        # Precise-mode segmentation with POS tags
        for word, flag in jieba.posseg.cut(query):
            POS[flag] = POS.get(flag, 0) + 1
            if flag[0] in allowPOS and len(word) >= 2:
                kept.append(word)
        # One output line per user, even if no word survives the filter
        csvfile.write(' '.join(kept) + '\n')

end = time.perf_counter()
print('total time: %f s' % (end - start))

total time: 10012.192463 s
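
The filter rule itself is easy to try without running jieba. In this small sketch the (word, flag) pairs are written by hand to stand in for what `jieba.posseg.cut` yields, and `keep_word` is a helper name I made up for the demo.

```python
ALLOW_POS = ('n', 'v', 'j')

def keep_word(word, flag):
    """Keep words of length >= 2 whose POS tag starts with n, v or j."""
    return flag[:1] in ALLOW_POS and len(word) >= 2

# Hand-written (word, flag) pairs, not real jieba output
tagged = [('清华大学', 'nt'), ('的', 'uj'), ('报名', 'v'),
          ('很', 'd'), ('人', 'n'), ('北大', 'j')]

kept = [word for word, flag in tagged if keep_word(word, flag)]
print(' '.join(kept))
```

Note that '人' is dropped even though it is a noun, because single-character words are filtered out as well.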

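Building a word2vec word-vector model with the Gensim library

Parameters (names as in gensim 3.x; gensim 4.0 renamed size to vector_size and iter to epochs):

sentences: the training corpus, for example a list of tokenized sentences (a list of lists of words).
sg: selects the training algorithm. The default 0 uses CBOW; sg=1 uses skip-gram.
size: dimensionality of the word vectors, default 100. A larger size needs more training data but can give better results. Typical values range from tens to a few hundred.
window: maximum distance between the current word and the predicted word within a sentence.
alpha: the initial learning rate.
seed: seed for the random number generator, used when initializing the word vectors.
min_count: vocabulary cutoff. Words that occur fewer than min_count times are discarded. The default is 5.
max_vocab_size: limits RAM use while building the vocabulary. If there are more unique words than this, the least frequent ones are pruned. Every 10 million unique words need about 1 GB of RAM. Set to None for no limit.
workers: number of parallel worker threads used for training.
hs: if 1, hierarchical softmax is used. If 0 (the default) and negative is non-zero, negative sampling is used.
negative: if > 0, negative sampling is used, and the value sets how many noise words are drawn.
iter: number of training iterations over the corpus, default 5.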
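
To make the min_count cutoff concrete, here is a rough pure-Python sketch of the same idea. It is not gensim's implementation: `prune_vocab` and the toy corpus are made up for illustration.

```python
from collections import Counter

def prune_vocab(sentences, min_count=5):
    """Count word frequencies and keep only words seen at least min_count times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word: n for word, n in counts.items() if n >= min_count}

# Toy corpus: one list of words per user
toy = [['清华', '北大', '补课'], ['清华', '补课'], ['清华', '校长']]

vocab = prune_vocab(toy, min_count=2)
print(vocab)
```

With min_count=2 only '清华' (3 occurrences) and '补课' (2 occurrences) survive; words that are dropped this way get no vector at all, which matters later when averaging.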

from gensim.models import word2vec

# Convert the data to a list of lists: one list of words per user
train_path = 'F:\\data_load\\train_querylist_writefile.csv'
with open(train_path, 'r') as f:
    My_list = [line.split() for line in f]

# size is the gensim 3.x name; gensim >= 4.0 calls it vector_size
model = word2vec.Word2Vec(My_list, size=300, window=10, workers=4)
# Save the model
savepath = '_word2vec_' + '300' + '.model'
model.save(savepath)

Next, let's look at how well the model works:

model.wv.most_similar('大哥')

[('黑社会', 0.5804992914199829),
 ('大嫂', 0.562471330165863),
 ('男儿', 0.49141111969947815),
 ('二哥', 0.48255860805511475),
 ('四爷', 0.48209255933761597),
 ('莫磊', 0.4794856309890747),
 ('阿哲', 0.47818657755851746),
 ('铁蛋', 0.4763179421424866),
 ('招惹', 0.4760439097881317),
 ('钟情', 0.4748595356941223)]

Based on the training data, these are the words whose vectors are closest to '大哥' (big brother).

model.wv.most_similar('清华')

[('清华大学', 0.6093192100524902),
 ('劝阻', 0.5702500343322754),
 ('北大', 0.5518572330474854),
 ('开课', 0.5475547909736633),
 ('闹事', 0.5349792242050171),
 ('特教', 0.5277222394943237),
 ('附属中学', 0.5266857147216797),
 ('北京大学', 0.5259680151939392),
 ('校长', 0.516852617263794),
 ('补课', 0.4978478252887726)]

Judging from these results, the model works quite well.
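
The scores returned by most_similar are cosine similarities between word vectors. As a rough sketch of what that ranking does, here is a pure-Python version on made-up 2-D vectors; the words 'a' to 'd', their vectors and the local `most_similar` function are all invented for the demo and have nothing to do with the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to the query word."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Made-up toy vectors
toy_vectors = {'a': [1.0, 0.0], 'b': [0.9, 0.1], 'c': [0.0, 1.0], 'd': [-1.0, 0.0]}

ranked = most_similar('a', toy_vectors)
print(ranked)
```

A vector pointing in nearly the same direction scores close to 1, an orthogonal one scores 0, and an opposite one scores -1.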

Load the trained word2vec model and compute the average vector of each user's search queries

import numpy as np
from gensim.models import word2vec

train_path = 'F:\\data_load\\train_querylist_writefile.csv'
cur_model = word2vec.Word2Vec.load('_word2vec_300.model')
with open(train_path, 'r') as f:
    lines = f.readlines()

# One 300-dimensional row per user
doc_cev = np.zeros((len(lines), 300))
for cur_index, line in enumerate(lines):
    word_vec = np.zeros(300)
    word_num = 0
    # Average the vectors of all in-vocabulary words
    for word in line.split():
        if word in cur_model.wv:
            word_num += 1
            word_vec += cur_model.wv[word]
    # Leave the row as zeros when no word is in the vocabulary,
    # instead of dividing by zero and producing NaN
    if word_num > 0:
        doc_cev[cur_index] = word_vec / word_num
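
The averaging step can be tried on toy data without a trained model. This is a minimal pure-Python sketch, assuming a plain dict of made-up 2-D vectors in place of the word2vec model; `average_vector` is a hypothetical helper name.

```python
def average_vector(words, vectors, dim):
    """Average the vectors of the words found in `vectors`; zeros if none are found."""
    total = [0.0] * dim
    found = 0
    for word in words:
        vec = vectors.get(word)
        if vec is None:
            continue  # out-of-vocabulary word, e.g. pruned by min_count
        total = [t + x for t, x in zip(total, vec)]
        found += 1
    if found == 0:
        return total  # avoid dividing by zero
    return [t / found for t in total]

# Made-up vectors standing in for the trained model
toy_vectors = {'清华': [1.0, 3.0], '补课': [3.0, 1.0]}

doc = average_vector(['清华', '补课', '未登录词'], toy_vectors, dim=2)
empty = average_vector(['未登录词'], toy_vectors, dim=2)
print(doc, empty)
```

The second call shows the edge case worth guarding against: a user whose words are all out of vocabulary gets a zero vector rather than a division by zero.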

Next, build the user labels: gender, education level and age

genderlabel = np.loadtxt(open('F:\\data_load\\train_gender.csv','r')).astype(int)
educationlabel = np.loadtxt(open('F:\\data_load\\train_education.csv', 'r')).astype(int)
agelabel = np.loadtxt(open('F:\\data_load\\train_age.csv', 'r')).astype(int)

Some labels were not collected and were filled with 0 instead. Since the dataset is large, these rows can simply be dropped.

def removezero(x, y):
    nozero = np.nonzero(y)
    y = y[nozero]
    x = np.array(x)
    x = x[nozero]
    return x, y

# Filter each label column with its own mask
gender_train, genderlabel = removezero(doc_cev, genderlabel)
age_train, agelabel = removezero(doc_cev, agelabel)
education_train, educationlabel = removezero(doc_cev, educationlabel)
print(gender_train.shape, genderlabel.shape)
print(age_train.shape, agelabel.shape)
print(education_train.shape, educationlabel.shape)

(87790, 300) (87790,)

(Only the gender shape is shown; the age and education shapes depend on how many zero labels each of those columns contains.)
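
Each label column has its own missing entries, so each one needs its own mask. A tiny pure-Python sketch with made-up data shows how the surviving rows differ per label; `remove_zero` here is a list-based stand-in for the NumPy version above, and the user IDs and labels are invented.

```python
def remove_zero(rows, labels):
    """Drop every (row, label) pair whose label is 0."""
    kept = [(row, label) for row, label in zip(rows, labels) if label != 0]
    return [row for row, _ in kept], [label for _, label in kept]

# Made-up users and labels; 0 means the label was not collected
features = ['u1', 'u2', 'u3', 'u4']
gender = [1, 0, 2, 1]
age = [0, 3, 0, 2]

gender_x, gender_y = remove_zero(features, gender)
age_x, age_y = remove_zero(features, age)
print(gender_x, gender_y)
print(age_x, age_y)
```

Here three users remain for gender but only two for age, and they are not the same users, so the feature matrix has to be filtered separately for each task.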

Plotting function: draw the confusion matrix, using gender as the example

import matplotlib.pyplot as plt
import itertools
%matplotlib inline

def plot_confusion_matix(cm, classes, title='Gender_Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)
    
    thresh = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i,j],
                horizontalalignment='center',
                color = 'white' if cm[i, j] > thresh else 'black')
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predict label')

First, build a baseline classification model

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# gender_train holds each user's averaged word vector; genderlabel holds the user's gender
X_train, X_test, y_train, y_test = train_test_split(gender_train, genderlabel,
                                                    test_size=0.2, random_state=0)
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
y_pred = lr_model.predict(X_test)
print(lr_model.score(X_test, y_test))

# Rows are true labels, columns are predicted labels
cnf_matrix = confusion_matrix(y_test, y_pred)
# Recall of the second class: correct predictions / all true members of that class
print('Recall metric in the testing dataset:', cnf_matrix[1, 1] / (cnf_matrix[1, 1] + cnf_matrix[1, 0]))
print('accuracy metric in the testing dataset:', (cnf_matrix[0, 0] + cnf_matrix[1, 1]) /
      (cnf_matrix[0, 0] + cnf_matrix[0, 1] + cnf_matrix[1, 0] + cnf_matrix[1, 1]))

# Plot the non-normalized confusion matrix,
# using the class labels the model was fitted on as tick labels
classes = lr_model.classes_
plt.figure()
plot_confusion_matix(cm=cnf_matrix,
                     classes=classes,
                     title='Gender_Confusion matrix')
plt.show()

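Accuracy reaches 81.7%, but recall is only 77.2%, so the result is mediocre. Gender is used here only as an example; age, education level, interests and so on can be classified in the same way and attached to users as tags, gradually filling out the user profile. What to add depends on the business requirements.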
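
For reference, this is how accuracy and per-class recall are read off a confusion matrix whose rows are true labels and whose columns are predicted labels. The matrix below is made up for the arithmetic and is not the result of the run above.

```python
def accuracy(cm):
    """Correct predictions (the diagonal) divided by all predictions."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def recall(cm, k):
    """Correct predictions for class k divided by all true members of class k (row k)."""
    return cm[k][k] / sum(cm[k])

# Made-up 2x2 confusion matrix: rows = true class, columns = predicted class
toy_cm = [[50, 10],
          [5, 35]]

print('accuracy:', accuracy(toy_cm))
print('recall of class 0:', recall(toy_cm, 0))
print('recall of class 1:', recall(toy_cm, 1))
```

Recall is computed per class, so a model can have decent overall accuracy while still missing a sizeable share of one class.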
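The mediocre result in this example comes from the uneven class distribution. Strictly speaking, every dataset is imbalanced to some degree. Class imbalance can be addressed in the following ways:
1. Sampling
Sampling processes the training set to turn an imbalanced dataset into a balanced one, which improves the final result in most cases. It comes in two forms: oversampling, which duplicates minority-class samples, and undersampling, which removes some majority-class samples, or equivalently keeps only a subset of them.
2. Data synthesis
Data synthesis generates additional samples from the existing ones. It has many success stories in small-data settings such as medical image analysis.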
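
Random oversampling and undersampling as described in point 1 can be sketched in a few lines of plain Python. This is only an illustration on made-up data, not something that was run in this post; the function names and the fixed seed are my own choices.

```python
import random
from collections import Counter

def _group_by_class(samples, labels):
    groups = {}
    for x, y in zip(samples, labels):
        groups.setdefault(y, []).append(x)
    return groups

def oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until every class matches the largest."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = max(len(xs) for xs in groups.values())
    pairs = []
    for y, xs in groups.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        pairs.extend((x, y) for x in xs + extra)
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]

def undersample(samples, labels, seed=0):
    """Keep a random subset of each class, as large as the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_class(samples, labels)
    target = min(len(xs) for xs in groups.values())
    pairs = []
    for y, xs in groups.items():
        pairs.extend((x, y) for x in rng.sample(xs, target))
    rng.shuffle(pairs)
    return [x for x, _ in pairs], [y for _, y in pairs]

# Made-up imbalanced data: 7 samples of class 1, 3 of class 2
samples = list(range(10))
labels = [1] * 7 + [2] * 3

over_x, over_y = oversample(samples, labels)
under_x, under_y = undersample(samples, labels)
print(Counter(over_y), Counter(under_y))
```

Either resampling should be applied to the training split only, so that the test set keeps the original class distribution.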
These are the only two methods I have worked with so far; for more, see this blog post: https://blog.csdn.net/lujiandong1/article/details/52658675
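
For the data synthesis idea, one common approach is to create new minority samples by interpolating between two existing ones, which is the core idea behind SMOTE. The sketch below is a simplification I wrote for illustration: real SMOTE interpolates toward one of the k nearest minority neighbours, while this version just picks a random partner. The points and the name `synthesize` are made up.

```python
import random

def synthesize(minority, n_new, seed=0):
    """Create n_new points, each on the segment between two random minority samples."""
    rng = random.Random(seed)
    new_points = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)   # two distinct existing samples
        t = rng.random()                 # position along the segment, in [0, 1)
        new_points.append([ai + t * (bi - ai) for ai, bi in zip(a, b)])
    return new_points

# Made-up minority-class feature vectors
minority = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0]]

synthetic = synthesize(minority, n_new=5)
for point in synthetic:
    print(point)
```

Because each new point lies between two real samples, the synthetic data stays inside the region the minority class already occupies instead of being exact copies.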