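User-based filtering: recommendation system

Problem description:

I know this isn't a specific coding question, but this is the best place to ask it, so please bear with me.

Suppose I have a dictionary like the one below, listing 10 likes for each person: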

likes={
    "rajat":{"music","x-men","programming","hindi","english","himesh","lil wayne","rap","travelling","coding"},
    "steve":{"travelling","pop","hanging out","friends","facebook","tv","skating","religion","english","chocolate"},
    "toby":{"programming","pop","rap","gardens","flowers","birthday","tv","summer","youtube","eminem"},
    "ravi":{"skating","opera","sony","apple","iphone","music","winter","mango shake","heart","microsoft"},
    "katy":{"music","pics","guitar","glamour","paris","fun","lip sticks","cute guys","rap","winter"},
    "paul":{"office","women","dress","casuals","action movies","fun","public speaking","microsoft","developer"},
    "sheila":{"heart","beach","summer","laptops","youtube","movies","hindi","english","cute guys","love"},
    "saif":{"women","beach","laptops","movies","himesh","world","earth","rap","fun","eminem"},
    "mark":{"pilgrimage","programming","house","world","books","country music","bob","tom hanks","beauty","tigers"},
    "stuart":{"rap","smart girls","music","wrestling","brock lesnar","country music","public speaking","women","coding","iphone"},
    "grover":{"skating","mountaineering","racing","athletics","sports","adidas","nike","women","apple","pop"},
    "anita":{"heart","sunidhi","hindi","love","love songs","cooking","adidas","beach","travelling","flowers"},
    "kelly":{"travelling","comedy","tv","facebook","youtube","cooking","horror","movies","dublin","animals"},
    "dino":{"women","games","xbox","x-men","assassin's creed","pop","rap","opera","need for speed","jeans"},
    "priya":{"heart","mountaineering","sky diving","sony","apple","pop","perfumes","luxury","eminem","lil wayne"},
    "brenda":{"cute guys","xbox","shower","beach","summer","english","french","country music","office","birds"}
}

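How can I determine which people have similar likes, or perhaps which two people are most similar to each other? It would also be helpful if you could point me to examples or tutorials on user-based or item-based filtering.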

+1

[Chapter 2](http://books.google.co.uk/books?id=fEsZ3Ey-Hq4C&lpg=PP1&pg=PA7#v=onepage&q&f=false) of Programming Collective Intelligence covers this quite thoroughly. The example code is in Python, which is another plus. – 2012-07-16 10:52:02

+0

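I know the book, but it is quite old (published in 2007) and the web has changed a lot since then, so I don't think most of its examples would still work today. – 2012-07-16 10:55:37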

+4

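The basic techniques still apply to the sample data you've provided. If you're looking for something more sophisticated or scalable, you may want to mention that in your question. It's probably also worth mentioning what you have already tried or considered. – 2012-07-16 10:59:54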

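(Disclaimer: I'm not well versed in this field and have only a passing knowledge of collaborative filtering. What follows is simply a collection of resources I have found useful.)

The basics of this are covered quite comprehensively in Chapter 2 of the "Programming Collective Intelligence" book. The example code is in Python, which is another plus.

You may also find this site useful: A Programmer's Guide to Data Mining, in particular Chapter 2 and Chapter 3, which discuss recommendation systems and item-based filtering.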

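In short, you can use techniques such as the Pearson Correlation Coefficient, Cosine Similarity, k-nearest neighbours, etc. to determine the similarity between users based on the items they have liked/bought/voted on.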
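As a rough sketch of one of these measures: when likes are plain sets with no ratings attached, the cosine similarity between two users reduces to the number of shared likes divided by the geometric mean of the two set sizes. The function names `cosine_similarity_sets` and `most_similar`, and the choice of a three-user subset of the question's data, are assumptions made for this illustration only.

```python
import math

# A three-user subset of the question's data, to keep the example short.
likes = {
    "rajat": {"music", "x-men", "programming", "hindi", "english",
              "himesh", "lil wayne", "rap", "travelling", "coding"},
    "stuart": {"rap", "smart girls", "music", "wrestling", "brock lesnar",
               "country music", "public speaking", "women", "coding", "iphone"},
    "grover": {"skating", "mountaineering", "racing", "athletics", "sports",
               "adidas", "nike", "women", "apple", "pop"},
}


def cosine_similarity_sets(a, b):
    """Cosine similarity of two sets treated as binary (0/1) vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def most_similar(user, likes):
    """Rank every other user by similarity to `user`, best first."""
    scores = [(other, cosine_similarity_sets(likes[user], likes[other]))
              for other in likes if other != user]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("rajat", likes)
print(ranking)
```

With this subset, stuart ranks first for rajat (three shared likes out of ten each) and grover last (no shared likes).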

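Note that there are various Python libraries written for this purpose, e.g. pysuggest, Crab, python-recsys, and SciPy.stats.stats.pearsonr.

For large datasets where the number of users exceeds the number of items, you can scale the solution better by inverting the data and computing the correlation between items instead (i.e. item-based filtering), then using that to infer similar users. Of course, you wouldn't do this in real time; you would schedule periodic recomputation as a backend task. Some of the approaches can be parallelised/distributed to drastically shorten the computation time (assuming you have the resources to throw at it).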
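A minimal sketch of the "inverting the data" step, assuming likes are stored as a dict of sets as in the question. The helper names `invert` and `jaccard` are made up for this example, and the data is trimmed to three likes per user; Jaccard overlap is used here as a simple stand-in for a correlation measure.

```python
from collections import defaultdict

# Trimmed subset of the question's data (three likes per user).
likes = {
    "rajat": {"music", "rap", "coding"},
    "stuart": {"rap", "music", "women"},
    "grover": {"women", "apple", "pop"},
}


def invert(likes):
    """Turn {user: {items}} into {item: {users who like it}}."""
    item_users = defaultdict(set)
    for user, items in likes.items():
        for item in items:
            item_users[item].add(user)
    return dict(item_users)


def jaccard(a, b):
    """Shared members divided by total distinct members of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


item_users = invert(likes)
# Items liked by exactly the same users score 1.0.
music_vs_rap = jaccard(item_users["music"], item_users["rap"])
music_vs_women = jaccard(item_users["music"], item_users["women"])
print(music_vs_rap, music_vs_women)
```

The item-to-item scores can then be precomputed offline and looked up when needed.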

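The most basic approach I can think of is to find the intersection between each pair of people's lists of likes; the two people who match best will have the largest intersection.

You can use something like list(set(list1).intersection(list2)). This returns a list containing the items that make up the intersection.

Keep in mind that this approach doesn't scale well to a large number of entries, because it requires every user's likes to be compared with every other user's; its complexity is roughly O(n^2), where n is the number of users.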
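The pairwise comparison described above could be sketched as follows. The function name `most_similar_pair` and the four-user subset of the question's data are choices made for this illustration; `itertools.combinations` visits each unordered pair once, which is n(n-1)/2 comparisons for n users.

```python
from itertools import combinations

# A four-user subset of the question's data.
likes = {
    "rajat": {"music", "x-men", "programming", "hindi", "english",
              "himesh", "lil wayne", "rap", "travelling", "coding"},
    "stuart": {"rap", "smart girls", "music", "wrestling", "brock lesnar",
               "country music", "public speaking", "women", "coding", "iphone"},
    "grover": {"skating", "mountaineering", "racing", "athletics", "sports",
               "adidas", "nike", "women", "apple", "pop"},
    "mark": {"pilgrimage", "programming", "house", "world", "books",
             "country music", "bob", "tom hanks", "beauty", "tigers"},
}


def most_similar_pair(likes):
    """Compare every pair of users once; return the pair sharing the most likes."""
    best = max(combinations(sorted(likes), 2),
               key=lambda pair: len(likes[pair[0]] & likes[pair[1]]))
    return best, likes[best[0]] & likes[best[1]]


pair, shared = most_similar_pair(likes)
print(pair, sorted(shared))
```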

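In some of your comments you mention collaborative filtering, but that usually applies when the same items are ranked by different users and you then look for similarities between the rankings, so that you can extrapolate for users who have ranked some items the same way but have not ranked others (for those, you use the rankings given by users with similar rankings on the shared items). I don't think that is the same problem.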
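To make the distinction concrete, here is a small sketch of that rating-based setting. Everything in it is hypothetical: the users, the items, the 1 to 5 ratings, and the similarity measure (one over one plus the mean absolute rating difference), which is only a simple stand-in for something like Pearson correlation.

```python
# Hypothetical ratings on a 1-5 scale; "bob" has not rated item "C".
ratings = {
    "ann": {"A": 5, "B": 3, "C": 4},
    "bob": {"A": 5, "B": 3},
    "cat": {"A": 1, "B": 5, "C": 2},
}


def similarity(r1, r2):
    """1 / (1 + mean absolute rating difference) over co-rated items."""
    common = set(r1) & set(r2)
    if not common:
        return 0.0
    mean_diff = sum(abs(r1[i] - r2[i]) for i in common) / len(common)
    return 1.0 / (1.0 + mean_diff)


def predict(user, item, ratings):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        weight = similarity(ratings[user], their)
        num += weight * their[item]
        den += weight
    return num / den if den else None


prediction = predict("bob", "C", ratings)
print(prediction)
```

Because bob's ratings agree with ann's and disagree with cat's, the prediction for item "C" lands much closer to ann's rating than to cat's.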

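SequenceMatcher in difflib is useful for this kind of thing. If you use ratio(), it returns a value between 0 and 1 corresponding to the similarity between the two sequences. From the documentation:

Return a measure of the sequences' similarity as a float in the range [0, 1]. Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1.0 if the sequences are identical, and 0.0 if they have nothing in common.

Using your example, comparing just 'rajat' against everyone else (with the inner {} switched to [] to fix the dictionary):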

import difflib 
for key in likes: 
    print 'rajat', key, difflib.SequenceMatcher(None,likes['rajat'],likes[key]).ratio() 
#Output: 
rajat sheila 0.2 
rajat katy 0.2 
rajat brenda 0.1 
rajat saif 0.2 
rajat dino 0.2 
rajat toby 0.2 
rajat mark 0.1 
rajat steve 0.1 
rajat priya 0.1 
rajat grover 0.0 
rajat ravi 0.1 
rajat rajat 1.0 
rajat stuart 0.2 
rajat kelly 0.1 
rajat paul 0.0 
rajat anita 0.2 
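One caveat worth illustrating: SequenceMatcher matches elements in order, so the score depends on how each list happens to be ordered. In the output above, steve scores 0.1 even though he shares two likes with rajat. A hedged sketch of an order-independent variant of the same 2.0*M/T formula, computed on sets (the helper name `dice` is an assumption of this example, not part of difflib):

```python
import difflib

# The two lists in the order given in the question.
rajat = ["music", "x-men", "programming", "hindi", "english",
         "himesh", "lil wayne", "rap", "travelling", "coding"]
steve = ["travelling", "pop", "hanging out", "friends", "facebook",
         "tv", "skating", "religion", "english", "chocolate"]


def dice(a, b):
    """Order-independent 2*M/T, where M is the number of shared items."""
    a, b = set(a), set(b)
    total = len(a) + len(b)
    return 2.0 * len(a & b) / total if total else 1.0


# Order-sensitive: only one of the two shared likes is matched.
seq_ratio = difflib.SequenceMatcher(None, rajat, steve).ratio()
# Order-independent: both shared likes count.
set_ratio = dice(rajat, steve)
print(seq_ratio, set_ratio)
```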
+0

Thanks, but I'm looking for something like "collaborative filtering". Any help with collaborative filtering would be appreciated. – 2012-07-16 10:57:33

for k in likes:
    # The set intersection is empty (falsy) when nothing is shared
    if likes["rajat"] & likes[k]:
        print k, likes["rajat"] & likes[k]
    else:
        print k, "No Likes with rajat"

Output 

sheila set(['hindi', 'english']) 
katy set(['music', 'rap']) 
brenda set(['english']) 
saif set(['himesh', 'rap']) 
dino set(['x-men', 'rap']) 
toby set(['programming', 'rap']) 
mark set(['programming']) 
steve set(['travelling', 'english']) 
priya set(['lil wayne']) 
grover No Likes with rajat 
ravi set(['music']) 
rajat set(['lil wayne', 'x-men', 'himesh', 'coding', 'programming', 'music', 'hindi', 'rap', 'english', 'travelling']) 
stuart set(['music', 'coding', 'rap']) 
kelly set(['travelling']) 
paul No Likes with rajat 
anita set(['travelling', 'hindi']) 

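This gives the likes that 'rajat' has in common with the other members of the dictionary. There must be a better way to do this.

Using the python-recsys library [http://ocelma.net/software/python-recsys/build/html/quickstart.html]: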

from recsys.algorithm.factorize import SVD 
from recsys.datamodel.data import Data 

likes={ 
    "rajat":{"music","x-men","programming","hindi","english","himesh","lil wayne","rap","travelling","coding"}, 
    "steve":{"travelling","pop","hanging out","friends","facebook","tv","skating","religion","english","chocolate"}, 
    "toby":{"programming","pop","rap","gardens","flowers","birthday","tv","summer","youtube","eminem"}, 
    "ravi":{"skating","opera","sony","apple","iphone","music","winter","mango shake","heart","microsoft"}, 
    "katy":{"music","pics","guitar","glamour","paris","fun","lip sticks","cute guys","rap","winter"}, 
    "paul":{"office","women","dress","casuals","action movies","fun","public speaking","microsoft","developer"}, 
    "sheila":{"heart","beach","summer","laptops","youtube","movies","hindi","english","cute guys","love"}, 
    "saif":{"women","beach","laptops","movies","himesh","world","earth","rap","fun","eminem"}, 
    "mark":{"pilgrimage","programming","house","world","books","country music","bob","tom hanks","beauty","tigers"}, 
    "stuart":{"rap","smart girls","music","wrestling","brock lesnar","country music","public speaking","women","coding","iphone"}, 
    "grover":{"skating","mountaineering","racing","athletics","sports","adidas","nike","women","apple","pop"}, 
    "anita":{"heart","sunidhi","hindi","love","love songs","cooking","adidas","beach","travelling","flowers"}, 
    "kelly":{"travelling","comedy","tv","facebook","youtube","cooking","horror","movies","dublin","animals"}, 
    "dino":{"women","games","xbox","x-men","assassin's creed","pop","rap","opera","need for speed","jeans"}, 
    "priya":{"heart","mountaineering","sky diving","sony","apple","pop","perfumes","luxury","eminem","lil wayne"}, 
    "brenda":{"cute guys","xbox","shower","beach","summer","english","french","country music","office","birds"} 
} 

data = Data() 
VALUE = 1.0 
for username in likes: 
    for user_likes in likes[username]: 
     data.add_tuple((VALUE, username, user_likes)) # Tuple format is: <value, row, column> 

svd = SVD() 
svd.set_data(data) 
k = 5 # Usually, in a real dataset, you should set a higher number, e.g. 100 
svd.compute(k=k, min_values=3, pre_normalize=None, mean_center=False, post_normalize=True) 

svd.similar('sheila') 
svd.similar('rajat') 

Results:

In [11]: svd.similar('sheila') 
Out[11]: 
[('sheila', 0.99999999999999978), 
('brenda', 0.94929845546505753), 
('anita', 0.85943494201162518), 
('kelly', 0.53385495931440263), 
('saif', 0.39985366653259058), 
('rajat', 0.30757664244952165), 
('toby', 0.28541364367155014), 
('priya', 0.26184289111194581), 
('steve', 0.25043700194182622), 
('katy', 0.21812807229358305)] 

In [12]: svd.similar('rajat') 
Out[12]: 
[('rajat', 1.0000000000000004), 
('mark', 0.89164019482177692), 
('katy', 0.65207273451425907), 
('stuart', 0.61675507205285718), 
('steve', 0.55730648750670264), 
('anita', 0.49836982296014803), 
('brenda', 0.42759524471725929), 
('kelly', 0.40436047539358799), 
('toby', 0.35972227835054826), 
('ravi', 0.31113813325818901)] 
+0

Thanks! I've been looking for something like this for a while – nickromano 2013-09-18 00:49:24

+0

Great library! (I notice you're the author.) It isn't compatible with Python 3, though. – Siddhartha 2017-12-10 18:42:48

You can also use scikit-learn to do user-based filtering.

To take a simpler example, if you have:

"stuart":{"rap","rock"}

and you want to see how similar his taste in music is to:

"toby":{"hip-hop","pop","rap"}

you can use sklearn's pairwise cosine similarity function:

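from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# The two users' likes as lists of strings
stuart_list = ["rap", "rock"]
toby_list = ["hip-hop", "pop", "rap"]

# Character-level count vectors, with the vocabulary learned from stuart's likes
vec = CountVectorizer(analyzer='char')
vec.fit(stuart_list)

# Rows: stuart's likes; columns: toby's likes
x = cosine_similarity(vec.transform(stuart_list),
                      vec.transform(toby_list))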

which will give you a cosine matrix such as:

[[ 0.166 0.327 1] 
[ 0.123 0.267 0.230]] 

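where the first row represents the cosine similarity of rap with each of toby's 3 choices. Note that 1 means an exact match: in trigonometric terms, the angle between the 2 choices is 0° (i.e. they are identical), so the cosine is 1.

The second row similarly represents the cosine similarity of rock with each of toby's choices.

I couldn't find a way in sklearn to get an overall similarity between two lists. However, given the cosine matrix, you can count the number of 1s in it and use that as a similarity figure. Or you can count the values of 0.9 and above, to account for nearly identical words such as "hip-hop" and "hiphop".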
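The counting idea can be sketched without sklearn, using only the standard library. This is a stand-in, not the sklearn code above: `char_cosine` and `count_matches` are hypothetical helpers, and they count every character in both strings rather than only those in a fitted vocabulary. A threshold of 0.9 is used instead of testing for exactly 1, which also sidesteps floating-point rounding.

```python
import math
from collections import Counter


def char_cosine(s, t):
    """Cosine similarity between the character-count vectors of two strings."""
    a, b = Counter(s), Counter(t)
    dot = sum(a[ch] * b[ch] for ch in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def count_matches(likes_a, likes_b, threshold=0.9):
    """Number of cross-list pairs whose similarity reaches the threshold."""
    return sum(1 for s in likes_a for t in likes_b
               if char_cosine(s, t) >= threshold)


# "hiphop" is added here to show a near-identical spelling being caught.
stuart_list = ["rap", "rock", "hiphop"]
toby_list = ["hip-hop", "pop", "rap"]
matches = count_matches(stuart_list, toby_list)
print(matches)
```

Note that character counts ignore letter order, so anagrams would also score 1 under this measure.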

(sklearn also provides Euclidean distance, which can be used as an alternative to cosine similarity.)