Removing stop words from tweets in Python

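Question:

I am trying to remove stop words from tweets that I imported from Twitter. After the stop words are removed, the resulting list of strings should be placed in a new column in the same row. I can easily do this one row at a time, but my attempts to loop the method over the entire DataFrame have not been successful.

How can I do this?

An excerpt of my data: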

tweets['text'][0:5] 
Out[21]: 
0 Why #litecoin will go over 50 USD soon ? So ma... 
1 get 20 free #bitcoin spins at... 
2 Are you Bullish or Bearish on #BMW? Start #Tra... 
3 Are you Bullish or Bearish on the S&amp;P 500?... 
4 TIL that there is a DAO ExtraBalance Refund. M...

The following works for a single row:

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize  # provides word_tokenize used below
stop_words = set(stopwords.words('english'))
tweets['text-filtered'] = ""

word_tokens = word_tokenize(tweets['text'][1])
filtered_sentence = [w for w in word_tokens if w not in stop_words]
tweets['text-filtered'][1] = filtered_sentence

tweets['text-filtered'][1] 
Out[22]: 
['get', 
'20', 
'free', 
'#', 
'bitcoin', 
'spins', 
'withdraw', 
'free', 
'#', 
'btc', 
'#', 
'freespins', 
'#', 
'nodeposit', 
'#', 
'casino', 
'#', 
'...', 
':']

My attempt at doing this in a loop was unsuccessful:

for i in tweets: 
    word_tokens = word_tokenize(tweets.get(tweets['text'][i], False)) 
    filtered_sentence = [w for w in word_tokens if not w in stop_words] 
    tweets['text-filtered'][i] = filtered_sentence

A snippet of the traceback:

Traceback (most recent call last): 

    File "<ipython-input-23-6d7dace7a2d0>", line 2, in <module> 
    word_tokens = word_tokenize(tweets.get(tweets['text'][i], False)) 

... 

KeyError: 'id'

Based on @Prune's reply, I managed to correct my mistake. Here is one possible solution:

count = 0  
for i in tweets['text']: 
    word_tokens = word_tokenize(i) 
    filtered_sentence = [w for w in word_tokens if not w in stop_words] 
    tweets['text-filtered'][count] = filtered_sentence 
    count += 1

My previous attempt looped over the columns of the DataFrame tweets rather than over its rows. The first column in tweets is "id".

tweets.columns 
Out[30]: 
Index(['id', 'user_bg_color', 'created', 'geo', 'user_created', 'text', 
     'polarity', 'user_followers', 'user_location', 'retweet_count', 
     'id_str', 'user_name', 'subjectivity', 'coordinates', 
     'user_description', 'text-filtered'], 
     dtype='object')
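
As an alternative to the counter-based loop, here is a minimal sketch that applies a per-tweet function to the whole column with Series.apply. The helper name remove_stop_words, the tiny stop-word set, the str.split tokenizer, and the sample rows are all stand-ins so the example runs without NLTK data; they are not part of the original code. With NLTK installed, one could pass word_tokenize and set(stopwords.words('english')) instead.

```python
import pandas as pd

# Hypothetical stand-in for set(stopwords.words('english'))
STOP_WORDS = {"are", "you", "or", "on", "the", "at"}


def remove_stop_words(text, stop_words=STOP_WORDS, tokenize=str.split):
    """Return the tokens of `text` that are not in `stop_words`."""
    return [w for w in tokenize(text) if w not in stop_words]


# Hypothetical sample rows standing in for the real tweets data
tweets = pd.DataFrame({
    "id": [101, 102],
    "text": [
        "get 20 free #bitcoin spins at",
        "are you bullish or bearish on #BMW",
    ],
})

# apply() calls the function once per value in the column and keeps
# each result aligned with its row, so no manual counter is needed
tweets["text-filtered"] = tweets["text"].apply(remove_stop_words)
```

Because apply works on the values of tweets['text'] directly, it does not depend on the rows being numbered 0, 1, 2, ..., which the counter-based loop quietly assumes.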

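When you arrive at a solution, please remember to upvote useful things and accept your favorite answer (even if you have to write it yourself), so Stack Overflow can properly archive the question. – Prune

Answer

You are confusing your indices: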

for i in tweets: 
    word_tokens = word_tokenize(tweets.get(tweets['text'][i], False)) 
    filtered_sentence = [w for w in word_tokens if not w in stop_words] 
    tweets['text-filtered'][i] = filtered_sentence

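Note that tweets behaves like a dictionary (the tweets.columns output shows it is a pandas DataFrame, keyed by column name); tweets['text'] is a sequence of strings. Thus, for i in tweets yields the keys of tweets, that is, the column names, not row numbers. "id" is the first one returned. When you then evaluate tweets['text']['id'], or try to assign tweets['text-filtered']['id'] = filtered_sentence, there is no such element, hence KeyError: 'id'.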
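
To make the point concrete, here is a small sketch, assuming tweets is a pandas DataFrame as the tweets.columns output in the question suggests. The two-row frame is made-up sample data, not the asker's.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the real tweets data
tweets = pd.DataFrame({
    "id": [101, 102],
    "text": ["get 20 free #bitcoin spins", "Are you Bullish or Bearish on #BMW?"],
})

# Iterating over the DataFrame yields column labels, not row positions
labels = [i for i in tweets]

# Using a column label as a row label on the 'text' column fails
try:
    tweets["text"]["id"]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Iterating over the column itself yields the tweet strings
texts = [t for t in tweets["text"]]
```

Here labels holds the column names, the lookup raises KeyError, and texts holds the actual tweet strings, which is what the tokenizer needs.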

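Try coding more gently: start from the inside, write a few lines at a time, and work your way out to the more complex control structures. Debug each addition before moving on. Here, you seem to have lost track of what is a numeric index, what is a list, and what is a dictionary.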
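
One way to follow that advice, sketched in plain Python with made-up data: get the innermost step working for a single string, check it, and only then wrap it in a loop. The names filter_tokens, texts, and the small stop_words set are hypothetical stand-ins.

```python
# Hypothetical stand-in for set(stopwords.words('english'))
stop_words = {"are", "you", "or", "on", "the"}


# Step 1: get the innermost operation right for a single string
def filter_tokens(text):
    return [w for w in text.split() if w not in stop_words]


# Check the single-string case before building anything around it
assert filter_tokens("are you bullish") == ["bullish"]

# Step 2: only then wrap it in a loop over many strings
texts = ["are you bullish", "get free spins on the house"]
filtered = []
for position, text in enumerate(texts):
    filtered.append(filter_tokens(text))
    # Printing each loop variable shows which one is the numeric
    # position and which one is the string being processed
    print(position, repr(text), "->", filtered[position])
```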

Since you haven't shown any debugging or provided context, I can't fix the whole program for you, but this should get you started.

The confusion between indices, lists, and dictionaries was exactly the problem! I've updated my answer based on your suggestion. – Kevin
