Python sys.stdin raises a UnicodeDecodeError

Question:

I want to write a (very) basic web scraper using cURL and Python's BeautifulSoup library, because that is easier to understand than GNU awk and a pile of regular expressions.

Currently, I pipe the web page content into the program with curl (i.e. curl http://www.example.com/ | ./parse-html.py).

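For some reason, Python throws a UnicodeDecodeError because of an invalid start byte (I have looked at this answer and this answer about invalid start bytes, but could not work out from them how to solve the problem).

Specifically, I tried using a.encode('utf-8').split() from the first answer. The second answer only explains the problem (Python found an invalid start byte) and does not give a solution.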

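I have also tried redirecting curl's output to a file (i.e. curl http://www.example.com/ > foobar.html) and modifying the program to accept a file as a command-line argument, but this results in the same UnicodeDecodeError.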

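I also checked that the output of locale charmap is UTF-8, which, as far as I know, means that my system encodes characters in UTF-8 (which makes this UnicodeDecodeError especially puzzling).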

The exact line that causes the error is html_doc = sys.stdin.readlines().encode('utf-8').strip(). I have tried rewriting it as a for loop, but I get the same issue.

What exactly is causing the UnicodeDecodeError, and how can I fix the problem?

EDIT: Changing the line html_doc = sys.stdin.readlines().encode('utf-8').strip() to html_doc = sys.stdin fixes the problem.

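Answer

The problem occurs during reading, not encoding; the input resource is simply not encoded in UTF-8, but in another encoding. In a UTF-8 shell, you can easily reproduce the problem with: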

$ echo 2¥ | iconv -t iso8859-1 | python3 -c 'import sys;sys.stdin.readline()' 
Traceback (most recent call last): 
    File "<string>", line 1, in <module> 
    File "/usr/lib/python3.5/codecs.py", line 321, in decode 
    (result, consumed) = self._buffer_decode(data, self.errors, final) 
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa5 in position 1: invalid start byte
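
The same failure can be shown in pure Python, without a shell pipeline. This is only a minimal sketch: the helper name try_utf8 is made up for illustration, and the sample bytes are what the iconv call above would produce for the same input.

```python
# The sample text "2" followed by a yen sign (U+00A5), encoded as ISO-8859-1.
# This is the byte sequence that iconv -t iso8859-1 emits in the shell example.
raw = '2\u00a5\n'.encode('iso-8859-1')   # b'2\xa5\n'


def try_utf8(data):
    """Hypothetical helper: return (text, None) on success, (None, error) on failure."""
    try:
        return data.decode('utf-8'), None
    except UnicodeDecodeError as exc:
        return None, exc


# Decoding as UTF-8 is what sys.stdin does implicitly under a UTF-8 locale.
text, error = try_utf8(raw)
print(error)

# Decoding with the encoding the bytes were actually written in succeeds.
decoded = raw.decode('iso-8859-1')
```

The byte 0xa5 is a valid character in ISO-8859-1, but in UTF-8 it can only appear as a continuation byte, which is why the decoder reports an invalid start byte.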

You can read the file as binary (sys.stdin.buffer.read(), or with open(..., 'rb') as f: f.read()), which gives you a bytes object, then examine it and guess the encoding. The actual algorithm for doing this is documented in the HTML standard.
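
As a rough sketch of the "read bytes, then guess" approach, the snippet below tries a short list of candidate encodings in order. It is a simple heuristic and not the algorithm from the HTML standard; the function name decode_with_candidates and the candidate list are assumptions chosen for illustration, and an io.BytesIO object stands in for sys.stdin.buffer so the example runs without piped input.

```python
import io


def decode_with_candidates(data, candidates=('utf-8', 'cp1252', 'iso-8859-1')):
    """Hypothetical helper: try each encoding in order, return (text, encoding)."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Only reached if every candidate failed (for example with a custom list):
    # fall back to UTF-8 and substitute undecodable bytes.
    return data.decode('utf-8', errors='replace'), 'utf-8'


# Stand-in for sys.stdin.buffer; in a real script use: data = sys.stdin.buffer.read()
fake_stdin_buffer = io.BytesIO(b'2\xa5\n')
data = fake_stdin_buffer.read()

text, encoding = decode_with_candidates(data)
print(encoding)
```

Because single-byte encodings such as cp1252 accept almost any byte sequence, a successful decode does not prove the guess is right; it only means the bytes are valid in that encoding.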

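However, in many cases the encoding is not specified in the file itself, but in the HTTP Content-Type header. Unfortunately, your curl invocation does not capture this header. Instead of using curl and Python together, you can use Python on its own; it can already download URLs. Stealing the encoding detection algorithm from youtube-dl, we get something like this: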

import re
import urllib.request


def guess_encoding(content_type, webpage_bytes):
    # 1. Prefer the charset parameter of the HTTP Content-Type header.
    #    The header may be missing (None), so fall back to an empty string.
    m = re.match(
        r'[a-zA-Z0-9_.-]+/[a-zA-Z0-9_.-]+\s*;\s*charset="?([a-zA-Z0-9_-]+)"?',
        content_type or '')
    if m:
        encoding = m.group(1)
    else:
        # 2. Otherwise look for a <meta ... charset=...> tag near the top.
        m = re.search(br'<meta[^>]+charset=[\'"]?([a-zA-Z0-9_-]+)[ /\'">]',
                      webpage_bytes[:1024])
        if m:
            encoding = m.group(1).decode('ascii')
        elif webpage_bytes.startswith(b'\xff\xfe'):
            # 3. A UTF-16 little-endian byte order mark.
            encoding = 'utf-16'
        else:
            # 4. Default to UTF-8.
            encoding = 'utf-8'

    return encoding


def download_html(url):
    with urllib.request.urlopen(url) as urlh:
        content = urlh.read()
        encoding = guess_encoding(urlh.getheader('Content-Type'), content)
        return content.decode(encoding)

print(download_html('https://phihag.de/2016/iso8859.php'))
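
For the header part alone, the standard library can parse the charset parameter without a regular expression. The sketch below is an alternative to the first step of guess_encoding, not a replacement for the whole function; the name charset_from_content_type and the UTF-8 default are assumptions for illustration.

```python
from email.message import Message


def charset_from_content_type(content_type, default='utf-8'):
    """Hypothetical helper: extract the charset parameter from a Content-Type value."""
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() returns the charset in lower case,
    # or the given default when the parameter is absent.
    return msg.get_content_charset(default)


print(charset_from_content_type('text/html; charset=ISO-8859-1'))
print(charset_from_content_type('text/html'))
```

The headers attribute of the object returned by urllib.request.urlopen is itself a Message subclass, so inside download_html the same lookup could be written as urlh.headers.get_content_charset().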

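There are also libraries (though not in the standard library) that support this out of the box, namely requests. I also recommend reading up on the basics of what encodings are.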
