Encoding error when parsing RSS with lxml
I want to parse a downloaded RSS feed with lxml, but I don't know how to handle the UnicodeDecodeError.
request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True,recover=True,encoding=encd)
tree = etree.parse(response, parser)
But I get an error:
tree = etree.parse(response, parser)
File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
File "parser.pxi", line 559, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64027)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 97: ordinal not in range(128)
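You should probably only try to define the character encoding yourself as a last resort, since the encoding is already clear from the XML prolog (and, if not there, from the HTTP headers). In any case, you don't need to pass an encoding to etree.XMLParser unless you want to override the declared one. So get rid of the encoding parameter and it should work.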
Edit: okay, the problem actually seems to be with lxml. The following works, for whatever reason:
parser = etree.XMLParser(ns_clean=True, recover=True)
etree.parse('http://wiadomosci.onet.pl/kraj/rss.xml', parser)
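To illustrate the point about the XML prolog, here is a minimal sketch, assuming Python 3 (the snippets in this thread are Python 2) and a tiny made-up feed instead of the real one. The names make_feed and read_title are hypothetical helpers, not part of lxml. The import falls back to the standard library's xml.etree.ElementTree only so the sketch still runs where lxml is not installed. No encoding is passed to the parser; it takes the encoding from the declaration in the bytes.

```python
try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Fallback so the sketch still runs without lxml; fromstring() and
    # findtext() are used the same way in both libraries here.
    import xml.etree.ElementTree as etree

TITLE = "Wiadomo\u015bci"  # "Wiadomości", the non-ASCII text from the feed


def make_feed(encoding):
    """Return a tiny made-up RSS document as bytes in the given encoding."""
    text = (
        '<?xml version="1.0" encoding="%s"?>\n'
        "<rss><channel><title>%s</title></channel></rss>" % (encoding, TITLE)
    )
    return text.encode(encoding)


def read_title(raw_bytes):
    """Parse raw bytes without passing any encoding to the parser."""
    root = etree.fromstring(raw_bytes)
    return root.findtext("channel/title")


for enc in ("utf-8", "iso-8859-2"):
    # Compare instead of printing the title, to avoid console encoding issues.
    print(enc, read_title(make_feed(enc)) == TITLE)
```

The same title comes back for both encodings, which is why guessing the encoding with chardet should not be needed when the document declares it.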
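It is often easier to load the content into a string, sort out its encoding yourself, and then call fromstring on it, rather than rely on the lxml.etree.parse() function and its hard-to-manage encoding options.
This particular RSS file starts with an encoding declaration, so everything should just work: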
<?xml version="1.0" encoding="utf-8"?>
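The code below demonstrates a few variations on how etree parses input in different encodings. You can also ask it to write the output in a different encoding, which then shows up in the XML declaration.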
import lxml.etree
import urllib2
request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request).read()
print [response]
# ['<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns=... <title>Wiadomo\xc5\x9bci...']
uresponse = response.decode("utf8")
print [uresponse]
# [u'<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns=... <title>Wiadomo\u015bci...']
tree = lxml.etree.fromstring(response)
res = lxml.etree.tostring(tree)
print [res]
# ['<feed xmlns="http://www.w3.org/2005/Atom">\n<title>Wiadomo&#347;ci...']
lres = lxml.etree.tostring(tree, encoding="latin1")
print [lres]
# ["<?xml version='1.0' encoding='latin1'?>\n<feed xmlns=...<title>Wiadomo&#347;ci...']
# works because the 38 character encoding declaration is sliced off
print lxml.etree.fromstring(uresponse[38:])
# throws ValueError(u'Unicode strings with encoding declaration are not supported.',)
print lxml.etree.fromstring(uresponse)
You can try the code here: http://scraperwiki.com/scrapers/lxml_and_encoding_declarations/edit/#
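I ran into a similar problem, and it turns out this has nothing to do with encodings. What is happening is that lxml is throwing a totally unrelated error. In this case, the error is that the .parse function expects a filename or URL, not a string containing the content itself. However, when it tries to print out the error, it chokes on the non-ASCII characters and shows that completely confusing error message instead. This is highly unfortunate, and other people have commented on the issue here: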
https://mailman-mail5.webfaction.com/pipermail/lxml/2009-February/004393.html
Luckily, yours is a very easy fix. Just replace .parse with .fromstring and you should be good to go:
request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True,recover=True,encoding=encd)
## lxml Y U NO MAKE SENSE!!!
tree = etree.fromstring(response, parser)
I just tested this on my machine and it worked fine. Hope it helps!
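For comparison, here is a minimal sketch of the two ways to hand already-downloaded content to the parser, assuming Python 3 and a small made-up feed in place of the real download. The helper names title_via_fromstring and title_via_parse are hypothetical. parse also accepts a file-like object, so wrapping the bytes in io.BytesIO is an alternative to switching to fromstring. The import falls back to the standard library's ElementTree only so the sketch runs without lxml.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Fallback so the sketch still runs without lxml; the calls below
    # exist in both libraries.
    import xml.etree.ElementTree as etree

# Made-up stand-in for the bytes returned by response.read().
RAW = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<rss><channel><title>Wiadomo\u015bci</title></channel></rss>"
).encode("utf-8")


def title_via_fromstring(raw):
    """fromstring() takes the document content itself."""
    root = etree.fromstring(raw)
    return root.findtext("channel/title")


def title_via_parse(raw):
    """parse() takes a filename, URL or file-like object, so wrap the bytes."""
    tree = etree.parse(io.BytesIO(raw))
    return tree.getroot().findtext("channel/title")


# Both routes should yield the same title.
print(title_via_fromstring(RAW) == title_via_parse(RAW))
```

Passing RAW directly to etree.parse would make it treat the content as a filename, which is the mistake described above.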
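I still get the same error when I run the script without the encoding parameter... ;/ Why does etree.XMLParser fail with an error even though the correct encoding is passed? – domi 2011-04-28 00:45:50
It is working now, but I had to upgrade lxml to version 2.2.8, because with 2.2.4 I could not parse remote URLs. Also, the code from my question works when I change this line: tree = etree.parse(StringIO.StringIO(response),parser) – domi 2011-04-28 20:46:39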