
UoG Text as Data Lecture2

Category: Articles • 2024-07-18 18:31:46

1. Term Frequency & Bag-of-words

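Unlike the one-hot encoding from the previous lecture, bag-of-words records the frequency of each term.

The bag-of-words assumption: if a term occurs lots in a document, it should imply something about what that document is about. This is a relaxation of the binary occurrence assumption. The more often a term appears in a document, the more relevant the document is to that term, i.e. the greater its aboutness.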

A bag-of-words vector can also be stored in a sparse form, which saves space.
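As a minimal sketch of the idea, a bag-of-words can be held as a mapping from term to count. The tokenizer here (lowercasing and splitting on whitespace) and the function name `bag_of_words` are assumptions made for illustration, not something the lecture prescribes.

```python
from collections import Counter

def bag_of_words(text):
    """Count how often each lowercased, whitespace-separated token occurs."""
    return Counter(text.lower().split())

doc = "the cat sat on the mat"
bow = bag_of_words(doc)

# Only terms that actually occur are stored, so the mapping is sparse:
# looking up an absent term returns 0 without storing an entry for it.
print(bow["the"], bow["cat"], bow["dog"])
```

Storing only the non-zero counts is what makes this representation cheaper than a dense vector over the whole vocabulary.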

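1) Document-term matrix (DTM)

In a document-term matrix, each row is a document, each column is a term, and each cell holds the number of times that term occurs in that document.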

Example: [figure: a sample document-term matrix]
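A document-term matrix can be sketched in a few lines. The three-document toy corpus below is invented for illustration, and the layout (rows are documents, columns are terms) is one common convention.

```python
from collections import Counter

# Toy corpus, invented for illustration.
docs = ["apple banana apple", "banana cherry", "cherry cherry apple"]

# Vocabulary: the sorted distinct terms, one matrix column per term.
vocab = sorted({term for d in docs for term in d.split()})

# One row per document; each cell is the term's count in that document.
counts = [Counter(d.split()) for d in docs]
dtm = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in dtm:
    print(row)
```

With a realistic vocabulary most cells are zero, which is why a sparse representation is usually preferred in practice.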

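2. Vector Space Model: representing documents as high-dimensional vectors

Documents differ in length, so comparing raw counts directly is unfair: a longer document has larger counts simply because it has more words. Instead, we can measure the angle between two high-dimensional vectors. A small angle means the vectors point in nearly the same direction, so the documents are similar. The cosine of the angle is used as the similarity score.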
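The length-independence of the angle can be checked with a small sketch. The vectors are made-up count vectors over a three-term vocabulary, and returning 0.0 for an all-zero vector is a convention chosen here, not part of the lecture.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_u * norm_v)

short_doc = [2, 1, 0]
long_doc = [20, 10, 0]   # same term proportions, ten times longer
other_doc = [0, 1, 3]    # mostly different terms

print(cosine_similarity(short_doc, long_doc))
print(cosine_similarity(short_doc, other_doc))
```

The first pair scores essentially 1 even though one document is ten times longer, while the second pair scores much lower because the counts are concentrated on different terms.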

Many search engines represent text with BoW vectors, and the same representation is commonly used for spam email filtering.


1) Stopwords

Before counting term frequencies, stopwords are removed: very common words that carry little information, such as a, I, you, me, the. They are not counted.
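Stopword removal is a simple filter applied before counting. The `STOPWORDS` set below contains only the examples named above; it is an illustrative assumption, not a standard list.

```python
# Illustrative stopword list built from the examples in the text (an assumption).
STOPWORDS = {"a", "i", "you", "me", "the"}

def remove_stopwords(tokens):
    """Drop tokens whose lowercased form is in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "I gave you the book".split()
filtered = remove_stopwords(tokens)
print(filtered)
```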

2) Term Frequency: how important a term is within a document

Using only the raw number of times a term occurs in a document is not enough.

Reason: aboutness does not increase linearly with term frequency.

E.g. a document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term, but it is not 10 times more relevant.

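We therefore replace the raw term frequency with a log-scaled weight, using a base-10 logarithm:

w(t, d) = 1 + log10(tf(t, d)) if tf(t, d) > 0, otherwise 0

A term that occurs 1,000 times gets weight 4, while a term that occurs once gets weight 1: four times as important, which seems reasonable.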
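The dampening effect of the log scaling is easy to check numerically. This sketch assumes a base-10 logarithm (the base that makes 1,000 occurrences come out as weight 4) and assigns weight 0 to a term that does not occur; the function name `log_tf` is made up here.

```python
import math

def log_tf(tf):
    """Log-scaled term frequency: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

for tf in (0, 1, 10, 1000):
    print(tf, log_tf(tf))
```

Each tenfold increase in the raw count adds only 1 to the weight, so very frequent terms no longer dominate the vector.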

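3) Document Frequency: how informative a term is across the whole corpus

idf(t) = log10(N / df(t))

N is the number of documents in the corpus, and df is the number of documents that contain the term. As with tf above, the log dampens the weight.

The smaller df is, the rarer the term (it appears in few documents of the corpus), and the larger its idf. A document that contains such a term is therefore more valuable for that term.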
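A short sketch shows how idf behaves for rare and common terms. The corpus size of one million documents is a made-up number for illustration, and the base-10 logarithm is an assumption carried over from the tf weighting.

```python
import math

def idf(n_docs, df):
    """Inverse document frequency: log10(N / df)."""
    if df <= 0:
        raise ValueError("df must be positive: the term must occur in some document")
    return math.log10(n_docs / df)

N = 1_000_000  # hypothetical corpus size
for df in (1, 1_000, N):
    print(df, idf(N, df))
```

A term that appears in every document gets idf 0, so it contributes nothing to the final weight, while a term found in a single document gets the largest idf.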

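4) tf-idf weighting: the weight of a term in a document

w(t, d) = (1 + log10(tf(t, d))) × log10(N / df(t))

Weight = importance of the term within the document × informativeness of the term across the corpus.

Note: when computing cosine similarity, this weight is usually used instead of the raw term frequency.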
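Putting the pieces together, the sketch below builds tf-idf vectors for a toy corpus and compares them with cosine similarity. The corpus, the whitespace tokenizer, the base-10 logarithm, and the helper names `tfidf_vectors` and `cosine` are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocab, vectors) using (1 + log10 tf) * log10(N / df) weights."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    vocab = sorted({t for doc in tokenized for t in doc})
    # df: number of documents that contain each term at least once.
    df = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)
        vectors.append([
            (1 + math.log10(tf[t])) * math.log10(n_docs / df[t]) if tf[t] > 0 else 0.0
            for t in vocab
        ])
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity; 0.0 for an all-zero vector (assumed convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

docs = ["the cat sat", "the cat ran", "the dog barked"]  # toy corpus
vocab, vecs = tfidf_vectors(docs)

print(vecs[0][vocab.index("the")])   # "the" is in every document, so idf = 0
print(cosine(vecs[0], vecs[1]))      # the two documents share "cat"
print(cosine(vecs[0], vecs[2]))      # they share only "the", which has weight 0
```

Because "the" appears in every document, its weight is 0 and it cannot make unrelated documents look similar, which raw term frequency would allow.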

Other variants of this weighting also exist.

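5) Zipf's Law

Zipf's law explains why we need idf: a handful of very common words account for a large share of all occurrences, and idf is what down-weights them.

Zipf's law: in a natural-language corpus, the frequency of a word is inversely proportional to its rank in the frequency table.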
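If frequency is inversely proportional to rank, then rank × frequency should stay roughly constant down the frequency table. The sketch below computes that product; the token list is synthetic, constructed to follow the law exactly, and is not real corpus data. The helper name `rank_frequency` is hypothetical.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, term, freq, rank * freq) rows, most frequent term first."""
    ranked = Counter(tokens).most_common()
    return [(rank, term, freq, rank * freq)
            for rank, (term, freq) in enumerate(ranked, start=1)]

# Synthetic tokens built so that freq = 60 / rank (not real data).
tokens = ["the"] * 60 + ["of"] * 30 + ["and"] * 20 + ["to"] * 15

for row in rank_frequency(tokens):
    print(row)
```

On a real corpus the product is only approximately constant, but the same function can be applied to any tokenized text to inspect how closely it follows the law.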