Common Knowledge in Content-Based Recommender Systems (ACM Summer School)

How to represent content to improve information access and build a new generation of services for user modeling and recommender systems?

1. Overview

Why?
1. Why do we need intelligent information access?
2. Why do we need content?
3. Why do we need semantics?

How?
1. How to introduce semantics?
2. Basics of Natural Language Processing
3. Encoding exogenous semantics, i.e. explicit semantics
4. Encoding endogenous semantics, i.e. implicit semantics

What?
1. Explanation of Recommendations
2. Serendipity in Recommender Systems

2. Why?

Why do we need intelligent information access?

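Because of information overload, it is physiologically impossible to follow the information stream in real time.

Challenge: to cope effectively with information overload and bounded rationality, we need to filter the information stream. This is why we need techniques and algorithms for intelligent information access.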

Why do we need content?

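In the recommender systems field, content-based recommendation is not strictly required. Existing approaches include recommenders based on collaborative filtering (matrix factorization, tensor factorization), content-based recommenders, and hybrid recommenders. However, content can compensate for the weaknesses of collaborative filtering, such as the data sparsity problem.

There are roughly three reasons:

(1) Put simply, to extend and improve user modeling (for example, by exploiting information spread on social media). (2) To overcome typical problems of collaborative filtering and matrix factorization. (3) Because search engines cannot work without content.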

Why do we need semantics?

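Deep rationality requires a deep understanding of the information conveyed by textual content. To achieve this, it is crucial to improve the quality of user profiles and the effectiveness of intelligent information access platforms. In a content-based recommender: (1) recommendations are made by matching item descriptions against user interests; (2) recommendations are generated by matching the features stored in the user profile against the features of the items to be recommended.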

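Relying on text alone is unreliable, because text clearly suffers from polysemy, figurative imagery, and similar phenomena.

Clearly: (1) a purely content-based representation cannot handle polysemous words; (2) a purely content-based representation can easily push the recommender system to two extremes; (3) multi-word concepts and synonyms, such as "Artificial intelligence" and "AI", are not matched; (4) content-based recommenders are language-dependent (e.g. English, Chinese, German), whereas semantics-based recommenders are language-independent.

Studying semantics-aware recommender systems therefore brings the following benefits: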

(1) In general, to improve content representation in intelligent information access platforms; (2) to avoid typical issues of natural language representations (polysemy, synonymy, multi-word concepts, etc.); (3) to model user preferences in an effective way; (4) to better understand the information spread on social media; (5) to provide multilingual recommendations.
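To make the synonymy and multi-word problem concrete, here is a minimal Python sketch. The `CONCEPTS` table, the concept identifier `CONCEPT_AI`, and both helper functions are hypothetical and exist only for illustration; a real system would obtain concepts from an external knowledge source rather than a hand-written dictionary.

```python
import re

def keyword_overlap(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

# Hypothetical mapping from surface forms to a shared concept identifier.
CONCEPTS = {
    "artificial intelligence": "CONCEPT_AI",
    "ai": "CONCEPT_AI",
}

def to_concepts(text: str) -> str:
    """Replace known surface forms with their concept identifier."""
    text = text.lower()
    # Longer surface forms first, so multi-word concepts are matched as a whole.
    for surface in sorted(CONCEPTS, key=len, reverse=True):
        text = re.sub(rf"\b{re.escape(surface)}\b", CONCEPTS[surface], text)
    return text

profile = "books about artificial intelligence"
item = "an AI primer"

keyword_score = keyword_overlap(profile, item)                            # no shared word
concept_score = keyword_overlap(to_concepts(profile), to_concepts(item))  # shared concept
print(keyword_score, concept_score)
```

With plain keywords the profile and the item share no word, so the overlap is zero even though both are about the same topic; once both texts are mapped to concepts, the overlap becomes positive.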

3. How?

How to introduce semantics?

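How do we connect users to the information they are looking for (search tasks) or the information they want to be exposed to (recommendation and user modeling tasks)? (1) We need some "intelligent" support, in the form of intelligent information access technologies; (2) we need to understand and represent content better. The most fundamental basis for both is natural language processing.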

Basics of Natural Language Processing

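(1) Normalization: remove unwanted characters and markup (e.g. HTML/XML tags, punctuation, digits); (2) tokenization: split the text into tokens; (3) stopword removal: drop common words that carry little semantic content; (4) lemmatization: reduce inflected forms to their base form, e.g. am, are, is -> be; (5) stemming: reduce terms to their "root", e.g. automate(s), automatic, automation all reduce to automat.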
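The five steps above can be sketched in plain Python as follows. This is a toy illustration, not a production pipeline: the stopword list, the lemma lookup table, and the suffix-stripping stemmer are all simplifying assumptions chosen to reproduce the examples in the text, and real systems typically rely on an NLP library instead. Both lemmatization and stemming are applied here only to show each step; in practice one usually picks one of the two.

```python
import re

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}  # tiny illustrative list
LEMMAS = {"am": "be", "are": "be", "is": "be"}            # toy lemma lookup
SUFFIXES = ("ion", "ic", "es", "e", "s")                  # naive stemmer suffixes

def normalize(text):
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML/XML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # strip punctuation and digits
    return text.lower()

def tokenize(text):
    return text.split()                        # whitespace tokenization

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def lemmatize(tokens):
    return [LEMMAS.get(t, t) for t in tokens]

def stem(token):
    # Strip the first matching suffix, keeping at least three characters.
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = tokenize(normalize(text))
    tokens = remove_stopwords(tokens)
    tokens = lemmatize(tokens)
    return [stem(t) for t in tokens]

result = preprocess("<p>The robots automate the tasks; automation is automatic.</p>")
print(result)
# prints ['robot', 'automat', 'task', 'automat', 'be', 'automat']
```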

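After this simple NLP processing, each feature needs to be given a weight. A common choice is TF-IDF (term frequency-inverse document frequency): the weight of a term is the product of its term-frequency weight and its inverse-document-frequency weight.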

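tf: the number of times the term occurs in the document; idf: depends on the rarity of the term in the collection; tf-idf: increases with the number of occurrences in the document and with the rarity of the term in the collection.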
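As a minimal sketch of this weighting, the function below uses the raw count as tf and log(N / df) as idf, where N is the number of documents and df is the number of documents containing the term. This is one common variant and an assumption on my part; the notes do not state which exact formula the lecture used. The three toy documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    ["space", "movie", "space", "alien"],
    ["romance", "movie"],
    ["space", "movie", "documentary"],
]
w = tf_idf(docs)
print(w[0])
```

In the first document, "movie" gets weight 0 because it occurs in every document, while "alien", which occurs only once but in a single document, outweighs "space", which occurs twice but in two of the three documents.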

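The subsequent processing is conventional machine learning, typically based on the Vector Space Model and on similarity between vectors. However, such word-centric content-based recommenders often perform poorly, because individual words struggle to capture the semantics of the content, that is, what the text is actually about. A more reasonable way to use content is therefore to focus on the concepts and themes of the text, not on keywords alone.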
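A minimal sketch of the Vector Space Model step: the user profile and the items are sparse term-weight vectors, and items are ranked by cosine similarity to the profile. The profile, the item names, and all weights below are made up for illustration, and `recommend` is a hypothetical helper, not part of any library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def recommend(profile, items, k=2):
    """Return the names of the k items most similar to the profile."""
    scored = [(cosine(profile, vec), name) for name, vec in items.items()]
    return [name for _, name in sorted(scored, reverse=True)[:k]]

# Illustrative weights only (e.g. as produced by a TF-IDF step).
profile = {"space": 0.8, "alien": 1.1}
items = {
    "sci-fi film": {"space": 0.8, "alien": 1.1, "movie": 0.2},
    "love story": {"romance": 1.1, "movie": 0.2},
    "space documentary": {"space": 0.4, "documentary": 1.1},
}

top = recommend(profile, items)
print(top)
# prints ['sci-fi film', 'space documentary']
```

Note that "love story" scores exactly zero because it shares no term with the profile, which is the keyword-matching limitation discussed above.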

Exogenous semantics, i.e. explicit semantics & Endogenous semantics, i.e. implicit semantics

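Semantic representation = explicit semantics + implicit semantics. Explicit semantics: a top-down approach that represents content by integrating external knowledge, and that can bring linguistic, cultural, and background knowledge into the content representation. Implicit semantics: a bottom-up approach that determines the meaning of a word by analyzing the rules of its usage in the context of ordinary and concrete language behavior.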

Encoding exogenous semantics, i.e. explicit semantics

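(1) Introduce semantics by mapping the features that describe an item to semantic concepts; (2) introduce semantics by linking items to a knowledge graph.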
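The second option can be sketched as follows, assuming a tiny hand-written knowledge graph. The `KG` dictionary and both helper functions are hypothetical; a real system would query an external knowledge source such as the Linked Open Data cloud mentioned in the references.

```python
# Hypothetical toy knowledge graph: item -> list of (property, value) pairs.
KG = {
    "The Matrix": [("genre", "Science fiction"), ("director", "The Wachowskis")],
    "Inception": [("genre", "Science fiction"), ("director", "Christopher Nolan")],
    "Titanic": [("genre", "Romance"), ("director", "James Cameron")],
}

def semantic_features(item):
    """Features an item gains by being linked to the knowledge graph."""
    return {f"{prop}={value}" for prop, value in KG.get(item, [])}

def shared_features(a, b):
    """Knowledge-graph features two items have in common."""
    return semantic_features(a) & semantic_features(b)

print(shared_features("The Matrix", "Inception"))
# prints {'genre=Science fiction'}
print(shared_features("The Matrix", "Titanic"))
# prints set()
```

The two science fiction titles have no title word in common, yet the linked graph exposes a shared feature that can be used for matching.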

Encoding endogenous semantics, i.e. implicit semantics

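We can also learn word representations directly from large amounts of content. Semantics learned from the way terms are used are called "distributional". The distributional hypothesis: terms used in similar contexts have similar meanings.

Distributional semantics: the meaning of a word is determined by its usage, so by analyzing large corpora of text data we can infer information about the usage (meaning) of terms. For example, beer and wine, or dog and cat, share a similar meaning because they are often used in similar contexts.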

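The advantages of distributional semantics: (1) a semantic vector-space representation of terms can be learned directly from a corpus; (2) the semantics are lightweight and not formally defined; (3) it is highly flexible, since every term can be represented by a vector; (4) the context can be defined at different granularities.

The drawbacks of distributional semantics: (1) large amounts of content are needed for learning; (2) the resulting term-context matrix is very large and hard to build (there are too many features, so they must be pruned).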
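A minimal sketch of the distributional idea, assuming a four-sentence toy corpus and a symmetric context window of two words (both are illustrative choices, not taken from the lecture). Each term is represented by the counts of the words that appear around it, which corresponds to one row of the term-context matrix.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Map each term to a Counter of the terms seen within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, term in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(c * v[t] for t, c in u.items() if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

corpus = [
    "i drink cold beer tonight".split(),
    "i drink cold wine tonight".split(),
    "my dog chases the ball".split(),
    "my cat chases the ball".split(),
]
vecs = cooccurrence_vectors(corpus)

beer_wine = cosine(vecs["beer"], vecs["wine"])  # same contexts, similarity close to 1
beer_dog = cosine(vecs["beer"], vecs["dog"])    # disjoint contexts, similarity 0
print(beer_wine, beer_dog)
```

Even on this tiny corpus the vectors grow with the vocabulary, which hints at why the full matrix becomes large and needs pruning or dimensionality reduction on real data.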

4. What?

Explanation of Recommendations

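Goals of explanations: (1) Transparency: explain how the system works; (2) Scrutability: allow users to tell the system it is wrong; (3) Persuasiveness: convince users to try or buy; (4) Trust: increase users' confidence in the system; (5) Effectiveness: help users make good decisions; (6) Efficiency: help users make decisions faster; (7) Satisfaction: increase ease of use or enjoyment.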

Serendipity in Recommender Systems

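Serendipity = attractive + unexpected. Personalized recommendation is a good thing, but recommendations that always look the same end up boring users, so serendipity is another factor a recommender system has to consider; it can strengthen user retention. But how can serendipity be introduced into the recommendation process? Clearly, semantic matching alone is not a good solution: semantic profiles may provide more accurate recommendations than keyword-based profiles, but more accurate does not mean more unexpected.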

References:

Semantics-aware Recommender Systems:

C. Musto, G. Semeraro, M. de Gemmis, P. Lops. A Hybrid Recommendation Framework Exploiting Linked Open Data and Graph-based Features. UMAP 2017.

Cross-language Recommender Systems:

F. Narducci, P. Basile, C. Musto, P. Lops, A. Caputo, M. de Gemmis, L. Iaquinta, G. Semeraro. Concept-based item representations for a cross-lingual content-based recommendation process. Inf. Sci. 374: 15-31 (2016).

Explanations:

C. Musto, F. Narducci, P. Lops, M. de Gemmis, G. Semeraro. ExpLOD: A Framework for Explaining Recommendations based on the Linked Open Data Cloud. In Proc. of the 10th ACM Conference on Recommender Systems (RecSys '16). ACM, New York, NY, USA, 151-154.

Serendipity:

M. de Gemmis, P. Lops, G. Semeraro, C. Musto. An Investigation on the Serendipity Problem in Recommender Systems. Information Processing and Management, 2015. DOI: 10.1016/j.ipm.2015.06.008
