How can I extract articles from Hindi web pages with Goose?

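Question:

I am using Python Goose to extract articles from web pages. It works for many languages, but not for Hindi. I tried adding Hindi stop words as stopwords-hi.txt and setting target_language to hi, without success. Thanks, Eran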

How exactly does it fail?

The cleaned-text function does not return anything.

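Answer

Yes, I have the same problem. I have been working with articles in all the Indian regional languages, and I could not extract the content using Goose alone. If the article description by itself is enough for you, meta_description works perfectly. You can use it in place of cleaned_text, which returns nothing.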
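The fallback described above can be sketched roughly as follows. This is a minimal sketch, assuming the goose3 package (the Python 3 fork of Goose) is installed; pick_text and extract_with_fallback are hypothetical helper names and not part of Goose itself.

```python
def pick_text(article):
    """Return the article body if Goose found one, else the meta description."""
    body = (getattr(article, "cleaned_text", "") or "").strip()
    if body:
        return body
    # Body is empty (as reported for Hindi pages), so fall back to the description.
    return (getattr(article, "meta_description", "") or "").strip()


def extract_with_fallback(url):
    # Assumption: goose3 is installed. It is imported here so that
    # pick_text can be used and tested without the package.
    from goose3 import Goose

    article = Goose().extract(url=url)
    return pick_text(article)
```

Calling extract_with_fallback with the article URL would then return the description whenever the body comes back empty. Note that the description is usually only a sentence or two, not the full article.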

Another option, though it takes more lines of code:

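from urllib.request import urlopen  # Python 3; the original used Python 2's urllib.urlopen

from bs4 import BeautifulSoup

url = "http://www.jagran.com/news/national-this-pay-scale-calculator-will-tell-your-new-salary-after-7th-pay-commission-14132357.html"
html = urlopen(url).read()
soup = BeautifulSoup(html, "lxml")  # requires the lxml package

# Remove script, style, and link elements so that only the article content remains.
# ("href" and "formfield" are not HTML tag names, so they normally match nothing.)
for element in soup(["script", "style", "a", "href", "formfield"]):
    element.extract()

text = soup.get_text()

# Strip each line, split it on double spaces, and drop empty chunks.
# Splitting on a single space would put every word on its own line.
lines = (line.strip() for line in text.splitlines())
chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
text = "\n".join(chunk for chunk in chunks if chunk)

print(text)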

Full disclosure: I originally found this code somewhere on Stack Overflow and modified it a little.
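If installing BeautifulSoup and lxml is not an option, the same idea (drop script, style, and link content, then keep the remaining text) can be approximated with the standard library alone. This is a swapped-in technique using html.parser, not the code from the answer; VisibleTextParser and visible_text are hypothetical names, and the sample HTML is made up for illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text nodes, skipping anything nested inside SKIP_TAGS."""

    SKIP_TAGS = {"script", "style", "a"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Strip each line and drop the empty ones.
        lines = (line.strip() for line in "".join(self._chunks).splitlines())
        return "\n".join(line for line in lines if line)


def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up sample page containing Hindi text, a link, and inline script/style.
sample = """<html><head><style>p {color: red}</style></head>
<body><p>सातवां वेतन आयोग</p><a href="/x">और पढ़ें</a>
<script>var x = 1;</script><p>नया वेतन</p></body></html>"""
print(visible_text(sample))
```

For the sample, only the two paragraph texts are printed, one per line. The link text and the script and style contents are dropped. Real pages with malformed markup may need the more forgiving parsing that BeautifulSoup provides.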
