How to download a captcha image when there is no direct link to it
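Problem description:
I'm trying to access sci-hub.io from a command-line client, not to defeat its captcha system. When you post a DOI to its front page, it returns a PDF URL of the form http://moscow.sci-hub.io/abc123blah/foo.pdf. If you then request that link, you randomly get either the PDF or a captcha. The captcha page has this source: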
<html>
<head>
<title>Для просмотра статьи разгадайте капчу</title>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
<body style = "background:white">
<div>
<table style = "width:100%;height:100%"><tr><td style = "vertical-align:middle;text-align:center">
<h2 style = "color:gray;font-family:sans-serif;padding:18px">для просмотра статьи разгадайте капчу</h2>
<p></p>
<form action = "" method = "POST">
<p><img id="captcha" src="/captcha/securimage_show.php" /></p>
<input type="text" maxlength="6" name="captcha_code" style = "width:256px;font-size:18px;height:36px;margin-top:18px;text-align:center" autofocus /><br>
<a style = "color:gray;text-decoration:none" href="#" onclick="document.getElementById('captcha').src = '/captcha/securimage_show.php?' + Math.random(); return false">[ показать другую картинку ]</a>
<p style = "margin-top:22px"><input type = "submit" value= "Продолжить"></p>
</form>
</td></tr></table>
</div>
</body>
</html>
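All I can think of doing is to request securimage_show.php, save the image, display it to the user, grab their decoding of it, and then POST the response. An example PDF link is http://moscow.sci-hub.io/291193c259b69cc057d74e3eb4965c4f/ong2014.pdf. Something like: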
import requests
from PIL import Image
import io
pdf_url = "http://moscow.sci-hub.io/3dcd1bf3b82ea549c0a72e9ab195ab78/walter2015.pdf"
r1 = requests.get(pdf_url)
if r1.headers['Content-Type'] != 'application/pdf':
    print("Looks like Sci-hub gave us a captcha")
    # Fetch the captcha image and show it to the user.
    image = requests.get("http://moscow.sci-hub.io/captcha/securimage_show.php").content
    img = io.BytesIO(image)
    im = Image.open(img)
    im.show()
    captcha_text = input("Enter captcha text: ")
    # POST the user's answer back to the PDF URL.
    r2 = requests.post(pdf_url, data = {'captcha_code': captcha_text})
    if r2.headers['Content-Type'] != 'application/pdf':
        print("Looks like Sci-hub gave us another captcha")
    else:
        with open("filename.pdf", 'wb') as f:
            f.write(r2.content)
        print("saved!")
else:
    print("Got a PDF")
    with open("filename.pdf", 'wb') as f:
        f.write(r1.content)
    print("saved!")
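I have no way of getting the original captcha image that was generated when I first requested the PDF. When I request another captcha image from securimage_show.php, it generates a new image, so the response I POST is incorrect. How can I fix this?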
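Answer
Thanks to Andrew for pointing me in the right direction. I needed to establish a session with requests. I assume the session passes a cookie back and forth so that the server can keep track of the latest captcha it sent me. That is just a guess, as this is still a bit magical to me.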
import requests
from PIL import Image
from io import BytesIO
pdf_url = "http://moscow.sci-hub.io/3dcd1bf3b82ea549c0a72e9ab195ab78/walter2015.pdf"
# One session for every request, so cookies set by the server are sent back.
s = requests.Session()
r1 = s.get(pdf_url)
if r1.headers['Content-Type'] != 'application/pdf':
    print("Looks like Sci-hub gave us a captcha")
    # Fetch the captcha image through the same session and show it.
    image = s.get("http://moscow.sci-hub.io/captcha/securimage_show.php").content
    img = BytesIO(image)
    im = Image.open(img)
    im.show()
    captcha_text = input("Enter captcha text: ")
    # POST the answer through the same session as well.
    r2 = s.post(pdf_url, data = {'captcha_code': captcha_text})
    if r2.headers['Content-Type'] != 'application/pdf':
        print("Looks like Sci-hub gave us another captcha")
    else:
        with open("filename.pdf", 'wb') as f:
            f.write(r2.content)
        print("saved!")
else:
    print("Got a PDF")
    with open("filename.pdf", 'wb') as f:
        f.write(r1.content)
    print("saved!")
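One fragile spot in the code above is the exact comparison against 'application/pdf'. A server may append parameters to the header (for example a charset) or omit the header entirely, in which case the comparison misfires or raises a KeyError. I have not checked which headers Sci-Hub actually sends, so treat this as a defensive sketch; looks_like_pdf is a hypothetical helper name, not part of requests. It also accepts a body that starts with the %PDF- signature that PDF files begin with.

```python
def looks_like_pdf(headers, body):
    """Return True if a response appears to be a PDF.

    headers: a mapping with a .get method (r.headers from requests works).
    body:    the raw response bytes (r.content).
    """
    # Drop any parameters such as "; charset=..." before comparing.
    content_type = headers.get("Content-Type", "")
    media_type = content_type.split(";", 1)[0].strip().lower()
    # Accept either a PDF media type or the PDF file signature.
    return media_type == "application/pdf" or body.startswith(b"%PDF-")
```

In the script above this would replace each header comparison, for example `if not looks_like_pdf(r1.headers, r1.content):`.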
Maybe you should perform both operations within a single session? See http://docs.python-requests.org/en/master/user/advanced/
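The guess that a cookie ties the captcha to the client can be checked locally without touching the real site. The sketch below is a toy and makes no claim about how Sci-Hub or Securimage are implemented: ToyCaptchaHandler, the sid cookie name, and the /captcha and /paper.pdf paths are all made up for illustration, and the "image" is just the code sent as plain text. It uses only the standard library, with http.cookiejar.CookieJar playing the role that requests.Session plays in the answer.

```python
import http.cookiejar
import secrets
import threading
import urllib.parse
import urllib.request
from http.cookies import SimpleCookie
from http.server import BaseHTTPRequestHandler, HTTPServer

CODES = {}  # toy server state: session id -> latest captcha code issued


class ToyCaptchaHandler(BaseHTTPRequestHandler):
    """Hypothetical server that remembers one captcha code per session cookie."""

    def _session_id(self):
        cookie = SimpleCookie(self.headers.get("Cookie", ""))
        return cookie["sid"].value if "sid" in cookie else None

    def _reply(self, body, new_sid=None):
        self.send_response(200)
        if new_sid:
            self.send_header("Set-Cookie", f"sid={new_sid}; Path=/")
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self):
        # Issue a fresh code (standing in for the image) tied to this session.
        sid = self._session_id()
        new_sid = None if sid else secrets.token_hex(8)
        code = secrets.token_hex(3)
        CODES[sid or new_sid] = code
        self._reply(code.encode(), new_sid)

    def do_POST(self):
        # Compare the submitted answer with the code stored for this session.
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode())
        answer = form.get("captcha_code", [""])[0]
        expected = CODES.get(self._session_id())
        ok = expected is not None and answer == expected
        self._reply(b"pdf" if ok else b"captcha")

    def log_message(self, *args):
        pass  # keep the demo output quiet


def solve(opener, base):
    """Fetch a code, then POST it back using the same opener."""
    code = opener.open(base + "/captcha").read().decode()
    data = urllib.parse.urlencode({"captcha_code": code}).encode()
    return opener.open(base + "/paper.pdf", data=data).read().decode()


def run_demo():
    server = HTTPServer(("127.0.0.1", 0), ToyCaptchaHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    no_proxy = urllib.request.ProxyHandler({})  # talk to localhost directly
    try:
        # Client 1 forgets cookies, like separate requests.get/post calls.
        no_cookies = urllib.request.build_opener(no_proxy)
        # Client 2 keeps cookies between requests, like a session.
        jar = http.cookiejar.CookieJar()
        with_cookies = urllib.request.build_opener(
            no_proxy, urllib.request.HTTPCookieProcessor(jar)
        )
        return solve(no_cookies, base), solve(with_cookies, base)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

Against this toy server, the client without a cookie jar gets the captcha again even though it sent back the right code, because the server cannot tell which code it was issued. The client that keeps cookies gets the PDF response.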