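BeautifulSoup character encoding error
Problem description:
I am using BeautifulSoup to scrape website information. Specifically, I want to collect information (title, inventors, abstract, etc.) about patents from Google's patent search. I have a list of URLs, one per patent, but BeautifulSoup has trouble with some of the pages and gives me the following error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 531: invalid continuation byte
Here is the error traceback: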
Traceback (most recent call last):
soup = BeautifulSoup(the_page,from_encoding='utf-8')
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 172, in __init__
self._feed()
File "C:\Python27\lib\site-packages\bs4\__init__.py", line 185, in _feed
self.builder.feed(self.markup)
File "C:\Python27\lib\site-packages\bs4\builder\_lxml.py", line 195, in feed
self.parser.close()
File "parser.pxi", line 1209, in lxml.etree._FeedParser.close (src\lxml\lxml.etree.c:90597)
File "parsertarget.pxi", line 142, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99984)
File "parsertarget.pxi", line 130, in lxml.etree._TargetParserContext._handleParseResult (src\lxml\lxml.etree.c:99807)
File "lxml.etree.pyx", line 294, in lxml.etree._ExceptionContext._raise_if_stored (src\lxml\lxml.etree.c:9383)
File "saxparser.pxi", line 259, in lxml.etree._handleSaxData (src\lxml\lxml.etree.c:95945)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xcc in position 531: invalid continuation byte
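For context, the message means that byte 0xCC, which in UTF-8 opens a two-byte sequence, was not followed by a valid continuation byte (one in the range 0x80 to 0xBF). Below is a minimal sketch that reproduces the same class of error. The sample bytes are made up for illustration and are not taken from the patent page.

```python
# Hypothetical sample: 0xCC starts a two-byte UTF-8 sequence, but the next
# byte (a space, 0x20) is not a continuation byte in the range 0x80-0xBF.
sample = b"abc\xcc def"

try:
    sample.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception records where decoding failed and why.
print(error.start, error.reason)
```

A page that declares UTF-8 but contains even one such byte will trigger this error in a strict decoder.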
I checked the site's encoding, and it claims to be 'utf-8'. I also passed that encoding to BeautifulSoup. Here is my code:
import urllib, urllib2
from bs4 import BeautifulSoup
#url = 'https://www.google.com/patents/WO2001019016A1?cl=en' # This one works
url = 'https://www.google.com/patents/WO2006016929A2?cl=en' # This one doesn't work
user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
values = {'name' : 'Somebody',
          'location' : 'Somewhere',
          'language' : 'Python' }
headers = { 'User-Agent' : user_agent }
data = urllib.urlencode(values)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
the_page = response.read()
print response.headers['content-type']
print response.headers.getencoding()
soup = BeautifulSoup(the_page,from_encoding='utf-8')
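I have included two URLs. One causes the error and the other works fine (each is marked as such in the comments). In both cases I can print the HTML to the terminal, but BeautifulSoup keeps crashing on the failing one.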
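One way to narrow this down is to check whether the raw bytes really are valid UTF-8 and, if not, where they break. Below is a minimal sketch, assuming the response body is available as a byte string such as `the_page`. The helper name `find_invalid_utf8` and its `context` parameter are hypothetical.

```python
def find_invalid_utf8(raw, context=10):
    """Return (position, surrounding bytes) for each undecodable spot in raw."""
    problems = []
    start = 0
    while start < len(raw):
        try:
            raw[start:].decode("utf-8")
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as exc:
            pos = start + exc.start  # offset within the full byte string
            problems.append((pos, raw[max(0, pos - context):pos + context]))
            start = start + exc.end  # resume just after the bad byte(s)
    return problems


# Hypothetical sample with two invalid bytes, standing in for the_page.
print(find_invalid_utf8(b"abc\xcc def\xff"))
```

Calling `find_invalid_utf8(the_page)` on the failing page would show the bytes around each bad offset. These offsets are not guaranteed to match the 531 in the traceback, since the parser may count positions differently.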
Any suggestions? This is my first time using BeautifulSoup.
Answer
You should encode the string as UTF-8:
soup = BeautifulSoup(the_page.encode('UTF-8'))
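One caveat: in Python 2, `response.read()` returns a byte string, and calling `.encode('UTF-8')` on a byte string first decodes it implicitly as ASCII, which can itself fail on non-ASCII bytes. A hedged alternative, not part of the original answer, is to decode the bytes explicitly with a lenient error handler and pass the resulting text to BeautifulSoup. The helper name `decode_lenient` and the sample bytes below are hypothetical.

```python
def decode_lenient(raw, encoding="utf-8"):
    """Decode raw bytes, substituting U+FFFD for any bytes that are invalid."""
    return raw.decode(encoding, "replace")


# Hypothetical sample standing in for the_page from the question.
page_bytes = b"<title>Patent \xcc title</title>"
text = decode_lenient(page_bytes)
print(repr(text))
```

The decoded text could then be passed as `BeautifulSoup(text)` without `from_encoding`, which only applies to byte input. The trade-off is that each invalid byte becomes U+FFFD, so a little text is lost at the affected spot.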
I am using Python 2.7 and BeautifulSoup4 on Windows. – user1911297