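How do I handle encoding in lxml to parse an HTML string correctly?
I have an XML file. Please download it and save it as blog.xml.
It is the export of my posts on Google's Blogger. I wrote some code to parse it, and lxml does something strange with it. How do I handle encoding in lxml so that the HTML string is parsed correctly?
Code 1: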
from stripogram import html2text
import feedparser

d = feedparser.parse('blog.xml')
for num, entry in enumerate(d.entries):
    string = entry.content[0]['value'].encode("utf-8")
    print html2text(string)
Code 1 produces the correct result.
Code 2:
import lxml.html
import feedparser

d = feedparser.parse('blog.xml')
for num, entry in enumerate(d.entries):
    string = entry.content[0]['value']
    myhtml = lxml.html.document_fromstring(string)
    print myhtml.text_content()
Code 2 fails with an error:
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
  File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82659)
ValueError: Unicode strings with encoding declaration are not supported.
Code 3:
import lxml.html
import feedparser

d = feedparser.parse('blog.xml')
for num, entry in enumerate(d.entries):
    string = entry.content[0]['value'].encode("utf-8")
    myhtml = lxml.html.document_fromstring(string)
    print myhtml.text_content()
Code 3 also fails with an error:
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
  File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82748)
  File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81546)
  File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78216)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
  File "parser.pxi", line 599, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74827)
lxml.etree.XMLSyntaxError: line 1395: Tag b:include invalid
How do I handle encoding in lxml to parse the HTML string correctly?
There is a bug in lxml. Check the output of this code:
import lxml.html
import feedparser

def test():
    try:
        lxml.html.document_fromstring('')
    except Exception as e:
        print e

d = feedparser.parse('blog.xml')
e = d.entries[0].content[0]['value'].encode('utf-8')
test()  # XMLSyntaxError: None
lxml.html.document_fromstring(e)
test()  # XMLSyntaxError: line 1407: Tag b:include invalid
So the error message is misleading: the real reason your parse fails is that you are passing an empty string to document_fromstring.
Try this code:
import lxml.html
import feedparser

d = feedparser.parse('blog.xml')
for num, entry in enumerate(d.entries):
    string = entry.content[0]['value'].encode("utf-8")
    if not string:
        continue
    myhtml = lxml.html.document_fromstring(string)
    print myhtml.text_content()
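The empty-content check above can also be pulled out into a small generator, so the parsing loop only ever sees non-empty byte strings. This is a minimal sketch, not part of feedparser or lxml: `iter_html_bytes` is a hypothetical helper name, and it assumes each entry is dict-like with an optional `content` list whose first item carries a `value` string (feedparser entries support this kind of access). Skipping whitespace-only content is an extra assumption on my part.

```python
def iter_html_bytes(entries, encoding="utf-8"):
    """Yield each entry's HTML content as encoded bytes, skipping empty ones.

    Hypothetical helper: `entries` is any iterable of dict-like objects
    shaped like feedparser entries.
    """
    for entry in entries:
        contents = entry.get("content") or []
        if not contents:
            continue  # entry has no content element at all
        value = contents[0].get("value") or u""
        if not value.strip():
            continue  # empty or whitespace-only: nothing to hand to the parser
        yield value.encode(encoding)


# Demo with plain dicts standing in for feedparser entries.
sample_entries = [
    {"content": [{"value": u"<p>hi</p>"}]},
    {"content": [{"value": u""}]},
    {},
]
html_chunks = list(iter_html_bytes(sample_entries))
```

With the real feed, each yielded chunk could then be passed to `lxml.html.document_fromstring` in place of the inline `if not string: continue` check.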
You can create a parser yourself instead of using document_fromstring:
from cStringIO import StringIO

import feedparser
from lxml import etree

d = feedparser.parse('blog.xml')
for num, entry in enumerate(d.entries):
    # Encode to bytes so lxml does not reject the encoding declaration.
    text = entry.content[0]['value'].encode('utf8')
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(text), parser)
    print ''.join(tree.xpath('.//text()'))
This works on a Blogger.com Atom feed export: it prints the text content of each entry's .content[0].value.
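For reference, the ValueError in the question comes from lxml refusing a unicode string that begins with an XML declaration naming an encoding. Encoding to bytes, as above, is one way around it. Another is to strip the declaration from the text before parsing. The following is a minimal standard-library sketch of that second approach; `strip_xml_declaration` is a hypothetical helper name, and the regex assumes the declaration sits at the very start of the string (optionally after whitespace).

```python
import re

# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="UTF-8"?>
# The \s after "xml" keeps it from matching e.g. <?xml-stylesheet ...?>.
_XML_DECLARATION = re.compile(r'^\s*<\?xml\s[^>]*\?>', re.IGNORECASE)


def strip_xml_declaration(text):
    """Return `text` without a leading XML declaration, if it has one."""
    return _XML_DECLARATION.sub(u'', text, count=1)


# Demo on a made-up snippet (not taken from blog.xml).
sample = u'<?xml version="1.0" encoding="UTF-8"?><html><body>x</body></html>'
cleaned = strip_xml_declaration(sample)
```

The cleaned unicode string no longer carries an encoding declaration, so it should be acceptable to `lxml.html.document_fromstring` as text. Whether that is preferable to passing bytes depends on whether the declared encoding is actually correct for the data.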
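1. Add 'from lxml import etree'. 2. Maybe it should be 'print tree.text_content()'. 3. But this still gives an error: Traceback (most recent call last): File "
@it_is_a_literature: Indeed, all corrected. – 2013-04-07 15:22:20
Traceback (most recent call last): File "
I suspect there *is* a parse error in those entries, but lxml ignores the exception in the wrong place. Python C-API exception handling requires code to check for exceptions at certain points; if that is not done, the exception pops up *later*, when another exception *does* get handled properly. What happens if you omit the first 'test' call? Do you get the same XMLSyntaxError? – 2013-04-16 08:13:39
In any case, this should definitely be reported to the lxml project. – 2013-04-16 08:14:27
@Martijn Pieters: Yes, the same error occurs. The first 'test' call is only there to show that the 'XMLSyntaxError' message changes after parsing 'e'. – gatto 2013-04-16 10:20:09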