Python Scrapy FormRequest callback is not being called


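Problem description:

I'm writing a Python script that uses Scrapy to scrape a website that has a login page. I'm trying to fill in the form with FormRequest.from_response, but without success, and I don't know why. It looks as if the callback passed to from_response is never called.

My spider's code is as follows: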

from scrapy import Request, FormRequest
from scrapy.spiders import CrawlSpider


class user_scrape(CrawlSpider):
    name = "spyder"
    allowed_domains = ["domain.tld"]
    start_urls = [
        "http://domain.tld/page1",
        "http://domain.tld/page2",
    ]

    login_user = "username"
    login_pass = "secret"
    login_page = "http://domain.tld/login.php"

    def start_requests(self):
        # Fetch the login page first instead of the start URLs.
        yield Request(
            url=self.login_page,
            callback=self.login,
            dont_filter=True,
        )

    def login(self, response):
        print("----- LOGIN -----")
        # Fill in and submit the login form found in the response.
        return FormRequest.from_response(
            response,
            formname='form_login',
            formdata={
                'username': self.login_user,
                'password': self.login_pass,
                'cookietime': 'on',
            },
            callback=self.check_login_response,
        )

    def check_login_response(self, response):
        print(response.url)
        print(response.body)

        # Once logged in, continue with the regular start URLs.
        return [Request(url=url) for url in self.start_urls]

    def parse(self, response):
        print(response.url)

When I run it, it prints "LOGIN" and then seems to stop. It never enters check_login_response, from which the spider is supposed to continue.

The spider's log is as follows:

2016-01-21 16:34:23 [scrapy] INFO: Scrapy 1.0.4 started (bot: UsersScrape) 
2016-01-21 16:34:23 [scrapy] INFO: Optional features available: ssl, http11 
2016-01-21 16:34:23 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'UsersScrape.spiders', 'SPIDER_MODULES': ['UsersScrape.spiders'], 'RETRY_TIMES': 5, 'BOT_NAME': 'UsersScrape', 'RETRY_HTTP_CODES': [400, 408, 500, 501, 502, 503, 504, 505, 506, 507, 508, 509, 510, 511, 512, 513, 514, 515, 516, 517, 518, 519, 520, 521, 522, 523, 524, 525, 526, 527, 528, 529, 530], 'DOWNLOAD_DELAY': 1, 'USER_AGENT': 'Mozilla/5.0 (Android 4.4; Mobile; rv:41.0) Gecko/41.0 Firefox/41.0'} 
2016-01-21 16:34:24 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState 
2016-01-21 16:34:24 [scrapy] INFO: Enabled downloader middlewares: RetryMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats 
2016-01-21 16:34:24 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware 
2016-01-21 16:34:24 [scrapy] INFO: Enabled item pipelines: 
2016-01-21 16:34:24 [scrapy] INFO: Spider opened 
2016-01-21 16:34:24 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2016-01-21 16:34:24 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 
2016-01-21 16:34:24 [scrapy] DEBUG: Crawled (200) <GET http://domain.tld/login.php?> (referer: None) 
----- LOGIN ----- 
2016-01-21 16:34:25 [scrapy] DEBUG: Redirecting (302) to <GET http://domain.tld.com/> from <POST http://domain.tld/takelogin.php> 
2016-01-21 16:34:27 [scrapy] DEBUG: Redirecting (302) to <GET http://domain.tld/> from <GET http://domain.tld/> 
2016-01-21 16:34:27 [scrapy] DEBUG: Filtered duplicate request: <GET http://domain.tld/> 
2016-01-21 16:34:27 [scrapy] INFO: Closing spider (finished) 
2016-01-21 16:34:27 [scrapy] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 1261, 
'downloader/request_count': 3, 
'downloader/request_method_count/GET': 2, 
'downloader/request_method_count/POST': 1, 
'downloader/response_bytes': 3877, 
'downloader/response_count': 3, 
'downloader/response_status_count/200': 1, 
'downloader/response_status_count/302': 2, 
'dupefilter/filtered': 1, 
'finish_reason': 'finished', 
'finish_time': datetime.datetime(2016, 1, 21, 15, 34, 27, 101000), 
'log_count/DEBUG': 5, 
'log_count/INFO': 7, 
'request_depth_max': 1, 
'response_received_count': 1, 
'scheduler/dequeued': 3, 
'scheduler/dequeued/memory': 3, 
'scheduler/enqueued': 3, 
'scheduler/enqueued/memory': 3, 
'start_time': datetime.datetime(2016, 1, 21, 15, 34, 24, 238000)} 
2016-01-21 16:34:27 [scrapy] INFO: Spider closed (finished) 

The HTML code of the form:

<form method="post" name="login_form" action="takelogin.php" onsubmit="return startLoginVerify();"> 
    <table id="login_form" border="0" cellpadding=5> 
    <tr> 
    <td colspan="2" align="right"> 
     <img style="cursor:pointer;" onClick="close_login_box();" src="pic/close.gif" align="right"> 
    </td> 
    </tr> 
    <tr> 
    <td class=rowhead style="padding-left:25px;">User:</td> 
    <td align=left style="padding-right:25px;"> 
     <input type="text" size=30 name="username" id="navbar_login_menu_input_to_focus_on" /> 
    </td> 
    </tr> 
    <tr> 
    <td class=rowhead>Password:</td> 
    <td align=left><input type="password" size=30 name="password" /></td> 
    </tr> 
    .... 
    </table> 
</form> 

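I have checked the FormRequest documentation and cannot see any difference that would explain why mine does not work.

Thank you for your time and help!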

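The log shows that a request is being filtered out as a duplicate ("Filtered duplicate request: <GET http://domain.tld/>"). After the login POST, the site redirects to http://domain.tld/, which in turn redirects to the very same URL. That second, identical request is dropped by the duplicate filter, so no response ever reaches check_login_response and the spider finishes.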
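To spot this kind of problem quickly, the log can be scanned for the dupe-filter message. The following is only a small sketch: the helper name find_filtered_requests is made up for illustration, and the pattern is written against the message format seen in the log above, which may differ in other Scrapy versions.

```python
import re

# Pattern for the dupe-filter message as it appears in the log above.
FILTERED = re.compile(r"Filtered duplicate request: <(\w+) (\S+)>")


def find_filtered_requests(log_text):
    """Return (method, url) pairs for every request the dupe filter dropped."""
    return FILTERED.findall(log_text)


# Three lines copied from the log in the question.
sample_log = """\
2016-01-21 16:34:25 [scrapy] DEBUG: Redirecting (302) to <GET http://domain.tld.com/> from <POST http://domain.tld/takelogin.php>
2016-01-21 16:34:27 [scrapy] DEBUG: Redirecting (302) to <GET http://domain.tld/> from <GET http://domain.tld/>
2016-01-21 16:34:27 [scrapy] DEBUG: Filtered duplicate request: <GET http://domain.tld/>
"""

print(find_filtered_requests(sample_log))
```

If this returns a request that sits in the chain leading to your callback, the callback will never run.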

Try setting dont_filter=True on the login request:

FormRequest.from_response(
    response,
    formname='form_login',
    formdata={
        'username': self.login_user,
        'password': self.login_pass,
        'cookietime': 'on',
    },
    callback=self.check_login_response,
    dont_filter=True,
)
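Why this helps can be illustrated with a toy model of the scheduler. This is not Scrapy's code: the Request and Scheduler classes below are simplified stand-ins (Scrapy's real filter compares request fingerprints rather than method/URL pairs), and the model assumes that a redirected request keeps the dont_filter flag of the request it came from, which is my understanding of how the redirect middleware copies requests.

```python
class Request:
    """Simplified stand-in for a request; not scrapy.Request."""

    def __init__(self, method, url, dont_filter=False):
        self.method = method
        self.url = url
        self.dont_filter = dont_filter


class Scheduler:
    """Toy scheduler that drops requests it has already seen."""

    def __init__(self):
        self.seen = set()

    def enqueue(self, request):
        key = (request.method, request.url)
        if not request.dont_filter and key in self.seen:
            return False  # filtered as a duplicate
        self.seen.add(key)
        return True


def follow_login(dont_filter):
    """Replay the redirect chain from the log; True if the chain survives."""
    scheduler = Scheduler()
    chain = [
        ("POST", "http://domain.tld/takelogin.php"),
        ("GET", "http://domain.tld/"),  # first redirect target
        ("GET", "http://domain.tld/"),  # redirected to the same URL again
    ]
    for method, url in chain:
        # Each redirected request inherits dont_filter (assumption, see above).
        if not scheduler.enqueue(Request(method, url, dont_filter)):
            return False  # chain cut short, the callback never runs
    return True


print(follow_login(dont_filter=False))
print(follow_login(dont_filter=True))
```

In this model the chain is cut at the repeated GET unless dont_filter is set, which matches what the log in the question shows.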

You're right, thank you!