UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 30339: ordinal not in range(128)

Problem description:

Below is the problematic code, which raises UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 30339: ordinal not in range(128)

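# Initiate session handler
session = requests.Session()

# Programmatically get the SAML assertion.
# Open the initial IdP URL, follow all of the HTTP 302 redirects,
# and get the resulting login page.
formresponse = session.get(idpentryurl, verify=sslverification)

# Capture idpauthformsubmiturl, which is the final URL after all the 302s
idpauthformsubmiturl = formresponse.url

# Parse the response and extract all the necessary values
# in order to build a dictionary of all of the form values the IdP expects
formsoup = BeautifulSoup(formresponse.text.decode('utf8'))
payload = {}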

Debug output:

DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): myapps.microsoft.com 
DEBUG:urllib3.connectionpool:https://myapps.microsoft.com:443 "GET /signin/AWS%20CMD%20(Audit)/18216d2a-eef8-4fde-962c-50cf615f3f5b HTTP/1.1" 302 244
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): account.activedirectory.windowsazure.com 
DEBUG:urllib3.connectionpool:https://account.activedirectory.windowsazure.com:443 "GET /applications/signin/AWS%20CMD%20(Audit)/18216d2a-eef8-4fde-962c-50cf615f3f5b HTTP/1.1" 302 94 
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): login.microsoftonline.com 
DEBUG:urllib3.connectionpool:https://login.microsoftonline.com:443 "GET /common/oauth2/authorize?client_id=0000000c-0000-0000-c000-000000000000&redirect_uri=https%3A%2F%2Faccount.activedirectory.windowsazure.com%2F&response_mode=form_post&response_type=code%20id_token&scope=openid%20profile&state=OpenIdConnect.AuthenticationProperties%3DmIDzRLZskQlxxtgB9rjxiHrNVmQJpcUVaK8wuZ3A2PMIyBE8fzxkXDcroNhC4wyof9OK9OlhqH0J_stoYSEIhKiEzx4O3XDW4rS4xyFTitGmztuV3ozOJhX5uafmQm_XmKnXEjEt9CNwFbp2Kju3rRGLAXRViD3byQ7XpwdXkeXoDFLwmy5OIXQgzvPjSsc7Jx7xEXMHckDwElhBOBFXmJVYCkHYx6cB-3yjwGJHX6RQ2lfx6CUg7x2PqPkbo4WsUxbZDAJZsMqYXyVRZGSDqAgU3gSezlHNgZGh-nblkxj7Dw6rdMVKmpNWZWkjp3zI3OjWa91FTrVc0mC9gIQC-BC4zaF-FrwQ4rHPbQlisQoS6-S1qM8ca_cEi6CfFaHh2lrtB-xdNEVum97Mzmlg9g&nonce=1507770263.sCv6L2a21eQuLNKaXL3zog&nux=1 HTTP/1.1" 200 15838 
Traceback (most recent call last):
  File "formauth.py", line 62, in <module>
    formsoup = BeautifulSoup(formresponse.text.decode('utf8'))
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 30342: ordinal not in range(128)

Tricks like the one below did not help: replacing the non-ASCII bytes in the response body with spaces

formresponse.encoding = formresponse.apparent_encoding
formsoupba = bytearray(formresponse.text, 'utf8')
for i, val in enumerate(formsoupba):
    if val > 128:
        formsoupba[i] = 32
        formsoup = BeautifulSoup(formsoupba.decode('utf8'), "html.parser")

This produces the following error:

    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 30334: invalid start byte
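
A possible reading of this second error, sketched on a tiny stand-in string rather than the real response: U+2019 is encoded in UTF-8 as the bytes E2 80 99, and a `val > 128` test never replaces the byte 0x80 (which is exactly 128), so a lone continuation byte can survive the loop. The sample text and the helper name `blank_out` are illustrative assumptions, not part of the original script.

```python
def blank_out(data, threshold):
    """Replace every byte strictly greater than `threshold` with an ASCII space."""
    out = bytearray(data)
    for i, val in enumerate(out):
        if val > threshold:
            out[i] = 32
    return bytes(out)


# Stand-in for the response body: "It's" with a typographic apostrophe (U+2019).
sample = u"It\u2019s".encode("utf8")  # b'It\xe2\x80\x99s'

leaky = blank_out(sample, 128)  # 0x80 == 128 is not > 128, so it is kept
clean = blank_out(sample, 127)  # every non-ASCII byte is replaced

try:
    leaky.decode("utf8")
    leaky_error = None
except UnicodeDecodeError as exc:
    leaky_error = exc  # the stray 0x80 is not a valid start byte

clean_text = clean.decode("utf8")  # decodes, with spaces where the bytes were
```

Under that assumption, comparing against 127 instead of 128 would let the decode succeed, at the cost of turning each non-ASCII character into several spaces.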

Any help would be appreciated.

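You're trying to decode a unicode character (\u2019, a quotation mark) as utf-8, which should work fine. Something is then trying to encode it back to ascii, some bs4 parser maybe?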
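
One plausible candidate for that "something": in Python 2, `formresponse.text` is already a `unicode` object, and calling `.decode('utf8')` on a `unicode` object first encodes it implicitly with the ASCII codec, which is where a `UnicodeEncodeError` can come from. The sketch below spells out that two-step behaviour explicitly so that it also runs on Python 3; `py2_style_decode` and the sample string are assumptions made for illustration.

```python
def py2_style_decode(unicode_text, encoding="utf8"):
    """Emulate Python 2's unicode.decode(): implicit ASCII encode, then decode."""
    as_bytes = unicode_text.encode("ascii")  # the hidden step that can fail
    return as_bytes.decode(encoding)


# Stand-in for formresponse.text, which requests has already decoded to Unicode.
text = u"It\u2019s a login page"

try:
    py2_style_decode(text)
    error = None
except UnicodeEncodeError as exc:
    error = exc  # raised by the ASCII encode, before any UTF-8 decoding happens

# Pure-ASCII text passes through the same path without complaint.
ascii_only = py2_style_decode(u"plain login page")
```

If that is what is happening here, passing `formresponse.text` to BeautifulSoup without the `.decode('utf8')` call, or passing the raw bytes from `formresponse.content`, would sidestep the implicit encode.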

Never mind, here's the shotgun approach, if you're willing to lose the odd character:

clean_text = formresponse.text.encode('utf8').decode('ascii', 'ignore') 
formsoup = BeautifulSoup(clean_text, "html.parser") 

This simply ignores any encoding errors, which means you lose those characters. See the docs for some other options besides ignore: https://docs.python.org/2/library/codecs.html
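
As a rough illustration of those other options, here is how a few of the standard error handlers from the codecs documentation treat the same character when encoding to ASCII. The sample string is an assumption standing in for the page text.

```python
# Stand-in for a fragment of the page text containing U+2019.
text = u"It\u2019s"

ignored = text.encode("ascii", "ignore")                # drops the character
replaced = text.encode("ascii", "replace")              # substitutes "?"
as_entity = text.encode("ascii", "xmlcharrefreplace")   # numeric character reference
escaped = text.encode("ascii", "backslashreplace")      # Python-style escape sequence
```

For HTML that is about to be parsed, `xmlcharrefreplace` may be the gentlest choice, since the parser can turn the entity back into the original character.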

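A more in-depth approach would be to find the page's actual encoding. https://login.microsoftonline.com:443 claims to be <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">, but apparently it isn't if it contains these kinds of characters. I think this may be throwing BeautifulSoup off. Try giving bs4 some different encodings, such as cp1252 or latin-1.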
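
A minimal sketch of trying several encodings on the raw bytes, using only the standard library. The helper name `decode_with_fallback` and the sample byte strings are assumptions. Note that U+2019 is itself valid in UTF-8 (bytes E2 80 99), so the fallbacks only matter when strict UTF-8 decoding of the raw bytes actually fails.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_name) for the first candidate that decodes `raw`."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # Unreachable with the default list, because latin-1 accepts every byte.
    raise ValueError("none of the candidate encodings could decode the data")


# The same apostrophe as it would appear in UTF-8 and in cp1252.
utf8_result = decode_with_fallback(b"It\xe2\x80\x99s")
cp1252_result = decode_with_fallback(b"It\x92s")
```

With requests, the raw bytes would come from `formresponse.content`. BeautifulSoup can also be handed those bytes directly together with its `from_encoding` argument, which is probably the closest match to "giving bs4 some different encodings".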


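Your shotgun workaround did fix the error message, so thanks very much. On the other hand, I now get a new error that may or may not be related to the now-removed characters :) I guess I just have to figure out why I'm getting a "Response did not contain a valid SAML assertion". Anyway, thanks for the fix. – gbaz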