Scrapy HTTPS proxy 403 error - works with curl

Problem description:

I have HttpProxyMiddleware enabled in a Scrapy 1.4.0 project on Linux, i.e. my settings.py includes this:

DOWNLOADER_MIDDLEWARES = { 
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 10, 
} 

When I run my spider (named sslproxies) with the following commands, I get an error:

export https_proxy=https://123.123.123.123:3128 
scrapy crawl sslproxies -o output/data.csv 
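For reference, Scrapy's HttpProxyMiddleware picks up proxies from the environment via urllib's getproxies(), so a quick sketch like this, run in the same shell where https_proxy was exported, confirms what the middleware will see:

from urllib.request import getproxies

# Prints the proxy mapping HttpProxyMiddleware reads at startup; with the
# export above, this should include {'https': 'https://123.123.123.123:3128'}.
print(getproxies())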

The relevant error:

2017-08-15 18:57:20 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.sslproxies.org/> (referer: None) 
2017-08-15 18:57:20 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.sslproxies.org/>: HTTP status code is not handled or not allowed 
2017-08-15 18:57:20 [scrapy.core.engine] INFO: Closing spider (finished) 

A 403 implies the request was forbidden. However, if I test the proxy with curl:

curl -vx https://123.123.123.123:3128 https://httpbin.org/headers 

I get a valid response, and it goes through the proxy:

* Establish HTTP proxy tunnel to httpbin.org:443 
> CONNECT httpbin.org:443 HTTP/1.1 
> Host: httpbin.org:443 
> User-Agent: curl/7.47.0 
> Proxy-Connection: Keep-Alive 
> 
< HTTP/1.1 200 Connection established 
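The same check can be reproduced in Python with the requests library, to rule out anything curl-specific (a sketch; assumes requests is installed and reuses the placeholder proxy address from above):

import requests

# Fetch httpbin.org/headers through the proxy, mirroring the curl command;
# requests also tunnels HTTPS through the proxy with CONNECT.
proxies = {"https": "https://123.123.123.123:3128"}
resp = requests.get("https://httpbin.org/headers", proxies=proxies, timeout=10)
print(resp.status_code)
print(resp.text)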

If I bypass the proxy by unsetting the https_proxy environment variable, the spider works. What am I missing in my Scrapy HTTP proxy middleware configuration?

Answer:

2017-08-15 18:57:20 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.sslproxies.org/> (referer: None) 

So your spider is making the request to https://www.sslproxies.org/.

To handle this, create another middleware like this:

class CustomProxyMiddleware(object):

    def process_request(self, request, spider):
        # Explicitly set the proxy for every outgoing request
        request.meta['proxy'] = "https://123.123.123.123:3128"

This way the proxy will be used for every request your spider makes.
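For the custom middleware to take effect, it also has to be registered in settings.py. A minimal sketch, assuming the class is defined in a module such as myproject.middlewares (a hypothetical path, adjust it to your project):

DOWNLOADER_MIDDLEWARES = {
    # Hypothetical module path - point this at wherever CustomProxyMiddleware lives.
    'myproject.middlewares.CustomProxyMiddleware': 350,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 400,
}

Lower numbers run first, so CustomProxyMiddleware sets request.meta['proxy'] before HttpProxyMiddleware inspects the request.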


Isn't that what HttpProxyMiddleware is supposed to do? –


I have never tried setting it through an environment variable, so I can't say whether that should work or not; I suggest you try my approach. Then I'll be able to help you further. – Umair