RSS feed has a "\n" at the start. How do I remove it? - Python

Question:

I want to extract data from this feed:

http://realbusiness.co.uk/feed/

But it looks different from the other feeds I pull from, which look like this one:

https://www.ft.com/companies?format=rss

Pulling everything I need from 'https://www.ft.com/companies?format=rss' is very simple, because I use minidom to pick the data apart, like this:

from xml.dom import minidom 
from urllib.request import urlopen 

url = 'https://www.ft.com/companies?format=rss&page=1' 
html = urlopen(url) 
dom = minidom.parse(html) 
item = dom.getElementsByTagName('item') 
for node in item: 
    pubdate = node.getElementsByTagName('pubDate')[0].childNodes[0].nodeValue 
    link = node.getElementsByTagName('link')[0].childNodes[0].nodeValue 
    title = node.getElementsByTagName('title')[0].childNodes[0].nodeValue

However, when I try to do the same for 'http://realbusiness.co.uk/feed/' using the code below:

from xml.dom import minidom 
from urllib.request import urlopen 

url = 'http://realbusiness.co.uk/feed/' 
html = urlopen(url) 
dom = minidom.parse(html)

I get the following error:

Traceback (most recent call last): 
    File "C:/Users/NAME/Desktop/Scripts/scrapesites/deleteme.py", line 6, in <module> 
    dom = minidom.parse(html) 
    File "C:\Python36\lib\xml\dom\minidom.py", line 1958, in parse 
    return expatbuilder.parse(file) 
    File "C:\Python36\lib\xml\dom\expatbuilder.py", line 913, in parse 
    result = builder.parseFile(file) 
    File "C:\Python36\lib\xml\dom\expatbuilder.py", line 207, in parseFile 
    parser.Parse(buffer, 0) 
xml.parsers.expat.ExpatError: XML or text declaration not at start of entity: line 2, column 0

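My conclusion is that this happens because the RSS structure of the two sites is slightly different. 'http://realbusiness.co.uk/feed/' has a '\n' on the first line of the page, while 'https://www.ft.com/companies?format=rss' does not.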
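
One way to check this conclusion without touching the network is to parse a small string with and without a leading newline. This is only a sketch: the feed text below is made up, and `parses` is a hypothetical helper name, not something from either site.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Made-up minimal feed; only the leading "\n" matters for this check.
feed = '<?xml version="1.0"?><rss><channel/></rss>'


def parses(text):
    """Return True if minidom accepts the text, False on an ExpatError."""
    try:
        minidom.parseString(text)
        return True
    except ExpatError:
        return False


print(parses(feed))         # declaration is the very first thing in the document
print(parses('\n' + feed))  # a newline comes before the declaration
```

The first call prints True and the second prints False, which is consistent with the "XML or text declaration not at start of entity" error above: the XML declaration has to be the very first thing in the document.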

How do I remove the "\n" so that I can parse the data?

If my diagnosis is wrong, what is the correct solution?

Thanks in advance.

I don't think this is the right approach... urlopen does not return a string. –

Answer

It might work to read the \n character off the response before parsing, like this:

html = urlopen(url)
# Consume exactly one byte (the leading '\n') so parsing starts at the XML declaration
html.read(1)
dom = minidom.parse(html)
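
Reading one byte only helps if the feed starts with exactly one stray character. A slightly more general variant, sketched here under the assumption that the feed is small enough to read into memory, strips all leading whitespace and hands the bytes to `minidom.parseString`. The name `parse_feed_bytes` and the sample bytes are hypothetical and stand in for `urlopen(url).read()`.

```python
from xml.dom import minidom


def parse_feed_bytes(raw):
    """Parse RSS bytes, ignoring any whitespace before the XML declaration."""
    # bytes.lstrip() with no arguments removes leading ASCII whitespace
    # (spaces, tabs, newlines), however many characters there are.
    return minidom.parseString(raw.lstrip())


# Stand-in for urlopen(url).read(); the content is invented for illustration.
sample = (b'\n<?xml version="1.0" encoding="UTF-8"?>'
          b'<rss><channel><item><title>Hello</title></item></channel></rss>')

dom = parse_feed_bytes(sample)
titles = [node.childNodes[0].nodeValue
          for node in dom.getElementsByTagName('title')]
print(titles)  # ['Hello']
```

With the real feed the call would be `dom = parse_feed_bytes(urlopen(url).read())`, after which the `getElementsByTagName` loop from the question can be used unchanged.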

The code uses 'minidom.parse', which takes a file-like object rather than a 'string'. It crashes when the file starts with a newline, space or tab. – ikkuh

I see that I got it wrong. I badly misunderstood something. I'm deleting my answer and reversing the downvote. Sorry for any trouble. Cheers. –
