Using NLTK corpora with AWS Lambda functions in Python

Question:

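I'm having difficulty using NLTK corpora (in particular, stopwords) in AWS Lambda. I'm aware that the corpora need to be downloaded. I did that with nltk.download('stopwords') and included them in the zip file used to upload the Lambda module, under nltk_data/corpora/stopwords.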

The usage in the code is as follows:

from nltk.corpus import stopwords 
stopwords = stopwords.words('english') 
nltk.data.path.append("/nltk_data")

This is the output from the Lambda log:

module initialization error: 
********************************************************************** 
    Resource u'corpora/stopwords' not found. Please use the NLTK 
    Downloader to obtain the resource: >>> nltk.download() 
    Searched in: 
    - '/home/sbx_user1062/nltk_data' 
    - '/usr/share/nltk_data' 
    - '/usr/local/share/nltk_data' 
    - '/usr/lib/nltk_data' 
    - '/usr/local/lib/nltk_data' 
    - '/nltk_data' 
**********************************************************************

I also tried to load the data directly by including:

nltk.data.load("/nltk_data/corpora/stopwords/english")

This produced a different error, shown below:

module initialization error: Could not determine format for file:///stopwords/english based on its file 
extension; use the "format" argument to specify the format explicitly.

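It's possible that there is a problem loading the data from the Lambda zip and that it needs to be stored externally, say on S3, but that seems a bit odd.

Any idea what format to use?

Does anyone know where I might be going wrong?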

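Try 'stopwords = nltk.corpus.stopwords.words("english")'. Also, in the code block, it looks like it is searching the 'nltk_data' folder for corpora.stopwords, but the / in between is missing. This could just be a directory path issue. I'm not 100% sure this will work, since I can't see your system or files, but it looks plausible. – sconfluentus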

Use the full path, e.g. '/home/sbx_user1062/nltk_data', and try: http://*.com/a/22987374/610569 – alvas

If nothing works, see 'magically_find_nltk_data()' from http://*.com/questions/36382937/nltk-doesnt-add-nltk-data-to-search-path/36383314#36383314 – alvas

Answer

If your stopwords corpus is under /nltk_data (at the filesystem root, not in your home directory), you need to tell NLTK before you try to access the corpus:

import nltk
from nltk.corpus import stopwords

# Tell NLTK where the data lives before the corpus is first accessed.
nltk.data.path.append("/nltk_data")

stopwords = stopwords.words('english')
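
Note that a deployment zip is unpacked under the function's task directory rather than at the filesystem root, so the directory to append may not literally be /nltk_data. Here is a minimal sketch, assuming the corpus was bundled as an nltk_data folder at the top level of the zip. The helper names bundled_nltk_data_dir and load_stopwords are hypothetical. LAMBDA_TASK_ROOT is the environment variable the Lambda runtime sets to the code directory, and the fallback to the current directory outside Lambda is an assumption:

```python
import os


def bundled_nltk_data_dir(folder_name="nltk_data"):
    """Return the absolute path of the nltk_data folder shipped with the code."""
    # LAMBDA_TASK_ROOT points at the unpacked deployment package on Lambda;
    # outside Lambda, fall back to the current working directory (assumption).
    task_root = os.environ.get("LAMBDA_TASK_ROOT", os.getcwd())
    return os.path.join(task_root, folder_name)


def load_stopwords(language="english"):
    """Register the bundled data directory, then read the stopword list."""
    import nltk  # imported lazily so the path helper works without NLTK
    from nltk.corpus import stopwords

    data_dir = bundled_nltk_data_dir()
    if data_dir not in nltk.data.path:
        nltk.data.path.append(data_dir)
    return stopwords.words(language)
```

Calling load_stopwords() once at module level, or at the start of the handler, keeps the path registration ahead of the first corpus access.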

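I think the OP's problem runs deeper than it seems. Serverless systems assume that everything can be done through code, with minimal external resources (data/models) sitting on disk. – alvas

Quite possibly; but if the resource isn't on the path, none of that matters... – alexis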

Answer

You need to include the NLTK Python package with the Lambda and modify data.py, changing:

path += [ 
    str('/usr/share/nltk_data'), 
    str('/usr/local/share/nltk_data'), 
    str('/usr/lib/nltk_data'), 
    str('/usr/local/lib/nltk_data') 
]

to

path += [ 
    str('/var/task/nltk_data') 
    #str('/usr/share/nltk_data'), 
    #str('/usr/local/share/nltk_data'), 
    #str('/usr/lib/nltk_data'), 
    #str('/usr/local/lib/nltk_data') 
]

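You cannot include the entire nltk_data directory. Delete all the zip files, and if you only need stopwords, keep nltk_data -> corpora -> stopwords and discard the rest. If you need tokenizers, keep nltk_data -> tokenizers -> punkt. To download the nltk_data folder, use an Anaconda Jupyter notebook and run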

nltk.download()

or

https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip

or

python -m nltk.downloader all
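
The trimming step described above can also be scripted. Here is a minimal sketch, assuming the data has already been downloaded and unzipped into a local nltk_data folder. The function name copy_selected_resources and the build/ destination are hypothetical, and only standard-library calls are used:

```python
import os
import shutil


def copy_selected_resources(src_root, dst_root, resources):
    """Copy only the listed nltk_data subfolders (e.g. 'corpora/stopwords')."""
    copied = []
    for rel_path in resources:
        src = os.path.join(src_root, *rel_path.split("/"))
        dst = os.path.join(dst_root, *rel_path.split("/"))
        if not os.path.isdir(src):
            # Skip anything that was never downloaded or is still zipped.
            continue
        # Drop any stale copy so the result mirrors the source exactly.
        if os.path.isdir(dst):
            shutil.rmtree(dst)
        shutil.copytree(src, dst)
        copied.append(rel_path)
    return copied


if __name__ == "__main__":
    # Hypothetical layout: ./nltk_data is the full download, ./build gets zipped.
    wanted = ["corpora/stopwords", "tokenizers/punkt"]
    print(copy_selected_resources("nltk_data", os.path.join("build", "nltk_data"), wanted))
```

The returned list shows which of the requested resources were actually found, which makes a missing download easy to spot before the package is uploaded.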

Which data.py is the one that needs to be modified? –

Answer

I ran into the same problem, but I solved it using environment variables.

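Run "nltk.download()" and copy the downloaded data into the root folder of your AWS Lambda application. (The folder should be called "nltk_data".)
In the user interface for your Lambda function (in the AWS console), add "NLTK_DATA"="./nltk_data". See the image.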
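
NLTK reads the NLTK_DATA environment variable when it is imported and adds each listed directory to its search path. Because "./nltk_data" is relative, it resolves against the working directory of the process. Here is a small diagnostic sketch, assuming stopwords is the resource of interest. The function name check_nltk_data_env is hypothetical, and the code uses only the standard library, so it does not depend on NLTK itself:

```python
import os


def check_nltk_data_env(resource="corpora/stopwords", environ=None):
    """Report, per NLTK_DATA entry, whether the resource folder or zip exists."""
    environ = os.environ if environ is None else environ
    report = {}
    # NLTK_DATA may list several directories separated by os.pathsep.
    for entry in environ.get("NLTK_DATA", "").split(os.pathsep):
        if not entry:
            continue
        base = os.path.abspath(entry)  # relative entries resolve against the cwd
        target = os.path.join(base, *resource.split("/"))
        report[base] = os.path.isdir(target) or os.path.isfile(target + ".zip")
    return report
```

Printing check_nltk_data_env() at the top of the handler shows which absolute directories the variable points to and whether the corpus is actually there.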
