Data Mining Study Notes for Beginners 1

Affinity Analysis with Python

Affinity is measured by how often two products are purchased together. This tells us which products are suited to being sold side by side.

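These notes are based on Python 3.6.4. Before starting the analysis, install the numpy, scipy, and scikit-learn libraries (the code in these notes only imports numpy).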

Load the dataset, affi.txt. It has been uploaded to Baidu Wenku; download it and convert it to txt format yourself: https://wenku.baidu.com/view/5ba316c9710abb68a98271fe910ef12d2af9a987

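Count the number of transactions in the dataset and give the columns meaningful names. Each sample is one transaction, and sample[3] indicates whether apples were purchased.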
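For readers who do not have affi.txt at hand, the loading step can be tried on a tiny stand-in. This is a minimal sketch, assuming the file is whitespace-separated 0/1 values with one transaction per row; the five rows in sample_text are made up for illustration and are not taken from the real dataset.

```python
import io
import numpy as np

# Hypothetical five-transaction sample; the real affi.txt is larger.
sample_text = """\
0 1 0 0 0
1 1 0 0 0
0 0 1 0 1
1 1 0 1 1
0 0 1 1 1
"""

# np.loadtxt accepts a file name or a file-like object such as StringIO
X_demo = np.loadtxt(io.StringIO(sample_text))
n_samples, n_features = X_demo.shape

# Column names, in the same order as the columns of the data
features = ["bread", "milk", "cheese", "apples", "bananas"]

print("{0} samples, {1} features".format(n_samples, n_features))
print("column 3 is", features[3])
```

With the real file, the same call takes the file name instead of the StringIO object.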

Count how many people bought apples by checking whether sample[3] equals 1.

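rule_valid is the number of people who bought apples (sample[3]) and also bought bananas (sample[4]). rule_invalid is the number of people who bought apples but did not buy bananas. Together these give an association rule between apples and bananas.

The quality of a rule is usually measured by its support and confidence, both of which can be computed from how often the rule occurs in the dataset. In these notes, support is the number of transactions in which the rule holds, and confidence is that number divided by the number of transactions that satisfy the premise.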
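The two measures can be checked by hand on a small example. The sketch below uses a made-up six-row array (X_demo is hypothetical, not the real dataset) and boolean masks instead of an explicit loop. Support is kept as a raw count, as in the rest of these notes.

```python
import numpy as np

# Hypothetical transactions; columns are bread, milk, cheese, apples, bananas
X_demo = np.array([
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 1, 0],
])

bought_apples = X_demo[:, 3] == 1
bought_bananas = X_demo[:, 4] == 1

# Rule "apples -> bananas": valid when both were bought, invalid when only apples
rule_valid = int(np.sum(bought_apples & bought_bananas))
rule_invalid = int(np.sum(bought_apples & ~bought_bananas))

support = rule_valid
confidence = rule_valid / (rule_valid + rule_invalid)

print("support: {0}, confidence: {1:.3f}".format(support, confidence))
```

Three of the six transactions contain apples, and two of those also contain bananas, so the support is 2 and the confidence is 2/3.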

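This gives the support and confidence of the apples and bananas rule. To collect statistics for every rule in the dataset, create dictionaries that store the counts for the valid and the invalid cases. Each dictionary key is a (premise, conclusion) pair. A defaultdict is used here so that looking up a key that does not exist yet does not raise an error.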
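The difference between a defaultdict and a plain dict can be seen in a few lines. This is a small illustration only; the names counts and plain are made up for the example.

```python
from collections import defaultdict

# defaultdict(int) creates a missing key with the value int() == 0
counts = defaultdict(int)
counts[(3, 4)] += 1
counts[(3, 4)] += 1

# A plain dict raises KeyError when a missing key is incremented
plain = {}
raised = False
try:
    plain[(3, 4)] += 1
except KeyError:
    raised = True

print(counts[(3, 4)])  # count stored under the (premise, conclusion) key
print(counts[(0, 1)])  # never incremented, so the default value is returned
print(raised)
```

Note that merely reading a missing key from a defaultdict inserts that key with its default value.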

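The computation uses nested loops that process each transaction and each feature within it.

The first feature is the premise of the rule: the customer bought a given item, sample[premise].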

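Check whether the individual bought the item. If not, continue to the next premise; if so, increment the occurrence count for that premise. During the traversal, skip pairs where the premise and the conclusion are the same, because a rule such as "if a person buys apples, they also buy apples" is meaningless. If the rule holds for the individual, increment its entry in the valid_rules dictionary; otherwise increment its entry in invalid_rules.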
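As a cross-check on the loop, the same counts can be obtained with matrix operations. This is an alternative sketch and not the method used in these notes. It assumes the data is strictly 0/1 and that every item is bought at least once, so there is no division by zero. X_demo and the array names are hypothetical.

```python
import numpy as np

# Hypothetical transactions; columns are bread, milk, cheese, apples, bananas
X_demo = np.array([
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 1, 0, 1, 0],
])

# How many transactions contain each item (the premise counts)
num_occurrences = X_demo.sum(axis=0)

# valid[p, c]: transactions containing both item p and item c
valid = X_demo.T @ X_demo

# invalid[p, c]: transactions containing p but not c
invalid = num_occurrences[:, None] - valid

# confidence[p, c] = valid[p, c] / occurrences of p
confidence = valid / num_occurrences[:, None]

# The diagonal (premise == conclusion) carries no meaning and should be ignored
print(valid[3, 4], invalid[3, 4], round(confidence[3, 4], 3))
```

The dictionary version is easier to follow step by step, while the matrix version avoids the Python-level loops on larger datasets.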

Print the computed support and confidence for each rule.

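Print a specific rule.

Many of the rules have very low support and are of no practical value, so sort the rules by support and print the top five.

In Python 3, the dictionary method items() returns a view of (key, value) tuples that can be iterated over and passed to sorted().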
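The sorting step can be tried on a small dictionary first. This is a minimal sketch; support_demo and its counts are invented for the example and do not come from the dataset.

```python
from operator import itemgetter

# Hypothetical support counts keyed by (premise, conclusion)
support_demo = {(3, 4): 21, (1, 0): 14, (2, 4): 27, (0, 3): 5}

# items() yields (key, value) pairs; itemgetter(1) picks the value to sort on
sorted_demo = sorted(support_demo.items(), key=itemgetter(1), reverse=True)

# Slicing is safe even when there are fewer entries than requested
top_two = sorted_demo[:2]

for rank, (rule, count) in enumerate(top_two, start=1):
    print("Rule #{0}: {1} with support {2}".format(rank, rule, count))
```

Slicing with [:5] instead of indexing inside range(5) would also avoid an IndexError if a dataset ever produced fewer than five rules.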

Print the top five rules sorted by support.

Print the top five rules sorted by confidence.

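The results show that apples, cheese, and bananas are strongly associated, so in practice placing these products together is convenient for customers. Apples and cheese also have the highest confidence, which suggests that discounting them in a promotion is unlikely to raise their sales by much, because customers already tend to buy the two together. The complete code is as follows: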

import numpy as np
dataset_filename = "affi.txt"
X = np.loadtxt(dataset_filename)
n_samples, n_features = X.shape
print(X[:10])
print("This dataset has {0} samples and {1} features".format(n_samples, n_features))
features = ["bread", "milk", "cheese", "apples", "bananas"]  # names of the feature columns

# Example: count how many transactions include apples (column 3)
num_apple_purchase = 0
for sample in X:
    if sample[3] == 1:
        num_apple_purchase += 1
print("{0} people bought Apples".format(num_apple_purchase))

# Count the cases where a person bought apples, with and without bananas
rule_valid = 0
rule_invalid = 0
for sample in X:
    if sample[3] == 1:
        if sample[4] == 1:
            rule_valid += 1
        else:
            rule_invalid += 1
print("{0} cases of the rule being valid were discovered".format(rule_valid))
print("{0} cases of the rule being invalid were discovered".format(rule_invalid))
print("*" * 100)
# Compute the support and confidence of the rule "apples -> bananas"
support = rule_valid  # support is kept as a raw count, not a fraction
confidence = rule_valid / num_apple_purchase
print("The support is {0} and the confidence is {1:.3f}".format(support, confidence))
# .3f formats the number with three digits after the decimal point
print(confidence)
print("*"*100)

# Compute all possible rules
from collections import defaultdict
# defaultdict(int) returns 0 for a missing key instead of raising KeyError
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)

# Each rule consists of a premise and a conclusion
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0:
            continue
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:
                continue
            if sample[conclusion] == 1:
                valid_rules[(premise, conclusion)] += 1
            else:
                invalid_rules[(premise, conclusion)] += 1

# Support is the raw count of transactions in which each rule holds
support = valid_rules
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    # Share of transactions with the premise that also contain the conclusion
    confidence[rule] = valid_rules[rule] / num_occurences[premise]
for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0}, they will also buy {1}".format(premise_name, conclusion_name))
    print("- confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print("- support: {0}".format(support[(premise, conclusion)]))
    print("")
print("*" * 100)

# Wrap the rule printing in a reusable function
def print_rule(premise, conclusion, support, confidence, features):
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a person buys {0}, they will also buy {1}".format(premise_name, conclusion_name))
    print("- confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print("- support: {0}".format(support[(premise, conclusion)]))
    print("")
print("*" * 100)
# Rule: if a person buys milk (index 1), they will also buy apples (index 3)
premise = 1
conclusion = 3
print_rule(premise, conclusion, support, confidence, features)

# Sort the rules by support
# dict.items() returns a view of (key, value) pairs
from pprint import pprint
pprint(list(support.items()))
from operator import itemgetter
# Sort on the value (element 1 of each pair), largest first
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_support[index][0]
    print_rule(premise, conclusion, support, confidence, features)

# Sort the rules by confidence (pprint and itemgetter are already imported above)
pprint(list(confidence.items()))
sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    (premise, conclusion) = sorted_confidence[index][0]
    print_rule(premise, conclusion, support, confidence, features)