UoG Text as Data, Lecture 1

1. Text Processing

First, the concept of a token: a text is split into individual units (tokens).


e.g., “House of Tartan sells blankets” --------> “House”, “of”, “Tartan”, “sells”, “blankets”


e.g., “House”, “of”, “Tartan”, “sells”, “blankets” --------> “house”, “of”, “tartan”, “sell”, “blanket”
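The first of the two steps above (splitting into tokens) and the lowercasing part of the second can be sketched in a few lines of Python. This is only a minimal illustration, assuming that a token is a run of letters or digits; the function names `tokenize` and `lowercase` are made up for these notes. Mapping “sells” to “sell” and “blankets” to “blanket” needs stemming or lemmatization and is not done here.

```python
import re

def tokenize(text):
    # Assumption: a token is a maximal run of ASCII letters or digits,
    # so punctuation and whitespace are simply dropped.
    return re.findall(r"[A-Za-z0-9]+", text)

def lowercase(tokens):
    # Case folding only; inflections such as the plural "s" are kept.
    return [t.lower() for t in tokens]

tokens = tokenize("House of Tartan sells blankets")
print(tokens)             # ['House', 'of', 'Tartan', 'sells', 'blankets']
print(lowercase(tokens))  # ['house', 'of', 'tartan', 'sells', 'blankets']
```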



Definition (lemmatization): the process of grouping together the different inflected forms of a word under a base form (the lemma)




Definition (stemming): the process of reducing inflected words to their stem or root form

stemming: “computer”, “computers”, “computing”, “compute” --------> “comput”


Difference: stemming:      wolves --------> wolv

            lemmatization: wolves --------> wolf
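To make the difference concrete, here is a toy sketch. It is not the Porter stemmer or any real lemmatizer: the suffix list `SUFFIXES` and the lookup table `LEMMAS` are hypothetical and chosen only so that the examples from these notes come out as shown. The point is that a stemmer chops suffixes by rule and may produce non-words, while a lemmatizer looks the word up and returns a real base form.

```python
# Toy suffix rules, tried in this order (an assumption for illustration).
SUFFIXES = ["ing", "ers", "er", "es", "e", "s"]

def toy_stem(word):
    # Strip the first matching suffix, keeping at least 3 characters.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Tiny hand-written lemma dictionary (hypothetical, not a real resource).
LEMMAS = {"wolves": "wolf", "sells": "sell", "blankets": "blanket"}

def toy_lemmatize(word):
    # Dictionary lookup; unknown words are returned unchanged.
    return LEMMAS.get(word, word)

for w in ["computer", "computers", "computing", "compute"]:
    print(w, "->", toy_stem(w))   # each prints "... -> comput"

print(toy_stem("wolves"))        # wolv
print(toy_lemmatize("wolves"))   # wolf
```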

2. Text collections

N = number of all token occurrences (word count), i.e. how many tokens the text contains in total
V = vocabulary = set of types (unique normalized tokens), i.e. the dictionary of unique tokens in the text
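A quick sketch of the two quantities, using a made-up token list (not from the lecture) that is assumed to be already tokenized and normalized:

```python
# Made-up example text, already tokenized and lowercased.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

N = len(tokens)  # number of token occurrences
V = set(tokens)  # vocabulary: the set of types (unique tokens)

print(N)       # 6
print(len(V))  # 5, because "the" occurs twice but is one type
```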


1) Representing the text (once we have the dictionary, how do we represent a document?)

one-hot encoding

Each document is represented by a vector of dimension |V|: the entry for a word is 1 if the word occurs in the document and 0 if it does not.

Implementation: store V in a dictionary, and keep adding new words to it as new documents come in.

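This way every document can be represented as a vector. These vectors are almost always sparse, so they compress very well (a sparse representation saves storage space).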

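Note: We may already have “frozen” our dictionary; new words are then “OOV” (out-of-vocabulary), also known as “UNK” for unknown terms. The dictionary therefore needs an extra key, <UNK>, to represent unknown terms. Once the dictionary has been built, new documents will contain some odd (uncommon) words, and these can be counted as <UNK>. Rare terms can also all be mapped to <UNK>, otherwise the dictionary becomes too large. Alternatively, hash every word to a key, so that one key may correspond to several words.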
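The ideas above (a growing dictionary, a frozen dictionary with an <UNK> key, a sparse representation, and hashing) can be sketched as follows. This is a minimal illustration under my own assumptions: all function names are hypothetical, <UNK> is given index 0, the sparse form is just a sorted list of the indices that are 1, and the hashing variant uses `zlib.crc32` only because it is deterministic across runs.

```python
import zlib

UNK = "<UNK>"

def build_vocab(documents):
    # Map each new word to the next free index; index 0 is reserved for <UNK>.
    vocab = {UNK: 0}
    for tokens in documents:
        for token in tokens:
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def one_hot(tokens, vocab):
    # Dense |V|-dimensional 0/1 vector; the vocabulary is treated as frozen,
    # so out-of-vocabulary words fall back to the <UNK> index.
    vector = [0] * len(vocab)
    for token in tokens:
        vector[vocab.get(token, vocab[UNK])] = 1
    return vector

def sparse_one_hot(tokens, vocab):
    # Sparse form: store only the indices whose value is 1.
    return sorted({vocab.get(token, vocab[UNK]) for token in tokens})

def hashed_index(token, num_buckets):
    # Hashing alternative: no dictionary at all, several words may share a bucket.
    return zlib.crc32(token.encode("utf-8")) % num_buckets

docs = [["house", "of", "tartan"], ["tartan", "blanket"]]
vocab = build_vocab(docs)
# vocab == {"<UNK>": 0, "house": 1, "of": 2, "tartan": 3, "blanket": 4}

print(one_hot(["tartan", "scarf"], vocab))         # [1, 0, 0, 1, 0]
print(sparse_one_hot(["tartan", "scarf"], vocab))  # [0, 3]
```

Here “scarf” was never seen when the dictionary was built, so it lands on the <UNK> position.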

3. Text Similarity


E.g. grouping together tweets or news articles about the same event (clustering of tweets).

E.g. identifying documents similar to a user's query (information retrieval: finding documents for a query).

All of the similarity measures below are set-based (set-based similarity): documents are represented with one-hot encoding, and we do not count how many times a term occurs in each document. The reasons are:
1) They work well for short pieces of text.
2) They are simple (trivial) to compute with basic data structures.
3) They are a fundamental building block of more complex (learned) functions, which come later.
4) There are fast and efficient approximations, which matters later for large-scale data.

Method 1: Dice coefficient

1) First compute the one-hot encodings of the two documents, vector X and vector Y

2) Then compute the matching coefficient, the overlap coefficient, and the Dice coefficient
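The notes do not spell out the formulas here, so the sketch below uses the standard textbook definitions, treating each one-hot vector as the set of terms whose entry is 1: matching = |X ∩ Y|, overlap = |X ∩ Y| / min(|X|, |Y|), Dice = 2|X ∩ Y| / (|X| + |Y|). The function names and the choice to return 0.0 for empty sets are my own assumptions.

```python
def matching(x, y):
    # Number of shared terms.
    return len(x & y)

def overlap(x, y):
    # Shared terms relative to the smaller set.
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))

def dice(x, y):
    # Twice the shared terms over the total size of both sets.
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))

X = {"house", "of", "tartan"}
Y = {"tartan", "blanket"}

print(matching(X, Y))  # 1
print(overlap(X, Y))   # 0.5
print(dice(X, Y))      # 0.4
```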

Method 2: Jaccard Similarity
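Jaccard similarity is usually defined as the size of the intersection over the size of the union, |X ∩ Y| / |X ∪ Y|. A minimal sketch on term sets (the function name and the 0.0 result for two empty sets are assumptions of this sketch):

```python
def jaccard(x, y):
    # Shared terms over all distinct terms in either document.
    union = x | y
    if not union:
        return 0.0
    return len(x & y) / len(union)

X = {"house", "of", "tartan"}
Y = {"tartan", "blanket"}

print(jaccard(X, Y))  # 0.25 (1 shared term out of 4 distinct terms)
```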

Method 3: Tversky Index (1977), which combines (generalizes) Jaccard and Dice
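In its usual form the Tversky index is |X ∩ Y| / (|X ∩ Y| + α|X − Y| + β|Y − X|) with weights α, β ≥ 0. Setting α = β = 1 gives Jaccard, and α = β = 0.5 gives Dice. The sketch below assumes that form; the function name, parameter names and the 0.0 result for a zero denominator are my own choices.

```python
def tversky(x, y, alpha, beta):
    # alpha weights terms only in x, beta weights terms only in y.
    shared = len(x & y)
    denominator = shared + alpha * len(x - y) + beta * len(y - x)
    if denominator == 0:
        return 0.0
    return shared / denominator

X = {"house", "of", "tartan"}
Y = {"tartan", "blanket"}

print(tversky(X, Y, alpha=1.0, beta=1.0))  # 0.25, same as Jaccard
print(tversky(X, Y, alpha=0.5, beta=0.5))  # 0.4, same as Dice
```

With unequal weights the measure becomes asymmetric, i.e. tversky(X, Y, α, β) need not equal tversky(Y, X, α, β).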

