Different XPath results from Scrapy and the browser console

Problem description:

I am using Selenium and PhantomJS with Scrapy to scrape a university web page, and I am getting different XPath results from Scrapy and the browser console.

For testing purposes I am collecting professors' contact information (not for any malicious purpose). Let's say kw.txt is a file that only contains two last names and looks like:

max

import scrapy
from selenium import webdriver

from universities.items import UniversitiesItem

class iupui(scrapy.Spider):
    name = 'iupui'
    allowed_domains = ['iupui.com']
    start_urls = ['http://iupuijags.com/staff.aspx']

    def __init__(self):
        self.last_name = ''

    def parse(self, response):
        with open('kw.txt') as file_object:
            last_names = file_object.readlines()

        for ln in last_names:
            # driver = webdriver.PhantomJS("C:\\Users\yashi\AppData\Roaming\Python\Python36\Scripts\phantomjs.exe")
            driver = webdriver.Chrome('C:\\Users\yashi\AppData\Local\Programs\Python\Python36\chromedriver.exe')
            driver.set_window_size(1120, 550)
            driver.get('http://iupuijags.com/staff.aspx')

            kw_search = driver.find_element_by_id('ctl00_cplhMainContent_txtSearch')
            search = driver.find_element_by_id('ctl00_cplhMainContent_btnSearch')

            self.last_name = ln.strip()
            kw_search.send_keys(self.last_name)
            search.click()

            item = UniversitiesItem()
            results = response.xpath('//table[@class="default_dgrd staff_dgrd"]//tr[contains(@class,"default_dgrd_item '
                                     'staff_dgrd_item") or contains(@class, "default_dgrd_alt staff_dgrd_alt")]')
            for result in results:
                full_name = result.xpath('./td[@class="staff_dgrd_fullname"]/a/text()').extract_first()
                print(full_name)
                if self.last_name in full_name.split():
                    item['full_name'] = full_name
                    email = result.xpath('./td[@class="staff_dgrd_staff_email"]/a/href').extract_first()
                    if email is not None:
                        item['email'] = email[7:]
                    else:
                        item['email'] = ''
                    item['phone'] = result.xpath('./td[@class="staff_dgrd_staff_phone"]/text()').extract_first()
                yield item
            driver.close()

However, the result always gives me a bunch of names, and the output looks like this:

2017-09-12 15:27:13 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request 
Dr. Roderick Perry 
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx> 
{} 
Gail Barksdale 
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx> 
{} 
John Rasmussen 
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx> 
{} 
Jared Chasey 
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx> 
{} 
Denise O'Grady 
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx> 
{} 
Ed Holdaway 
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx> 
{} 

The length of the results is always the same on every iteration.

This is what it looks like when I run the XPath in the browser console: (screenshot: console result)

I really can't figure out what the problem is.

Answer:

So there are a few issues:

  • You are not using the response from your Selenium code. You navigate to the page in the browser and then do nothing with its page source (see the comparison sketch after this list).

  • Next, you are yielding the item even when no match is found, hence all the blank items.

  • Also, you create the item outside the loop when it should be created inside it.

  • The comparison you are doing is case-sensitive, so when you check for max but the result has Max, you miss the match.

  • You are also missing the @ before href in the email XPath.
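To see the first point concretely: the browser console evaluates XPath against the live DOM after the search postback has run, while response inside parse() still holds the raw HTML of the initial request, so the full, unfiltered table comes back on every iteration. You can compare the two environments roughly like this (the XPath is the one from the spider):

In the Chrome DevTools console, after clicking Search on the page:

    $x('//table[@class="default_dgrd staff_dgrd"]//tr')

In scrapy shell, against the page as first downloaded:

    scrapy shell 'http://iupuijags.com/staff.aspx'
    >>> len(response.xpath('//table[@class="default_dgrd staff_dgrd"]//tr'))

The second count should match the unchanging result length you saw, since no search was ever applied to that response.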

Below is a fixed version:

class iupui(scrapy.Spider):
    name = 'iupui'
    allowed_domains = ['iupui.com']
    start_urls = ['http://iupuijags.com/staff.aspx']

    # def __init__(self):
    #     self.last_name = ''

    def parse(self, response):
        # with open('kw.txt') as file_object:
        #     last_names = file_object.readlines()
        last_names = ["max"]
        for ln in last_names:
            # driver = webdriver.PhantomJS("C:\\Users\yashi\AppData\Roaming\Python\Python36\Scripts\phantomjs.exe")
            driver = webdriver.Chrome()
            driver.set_window_size(1120, 550)
            driver.get('http://iupuijags.com/staff.aspx')

            kw_search = driver.find_element_by_id('ctl00_cplhMainContent_txtSearch')
            search = driver.find_element_by_id('ctl00_cplhMainContent_btnSearch')

            self.last_name = ln.strip()
            kw_search.send_keys(self.last_name)
            search.click()

            # Key fix: wrap the browser's rendered page source in a new Scrapy
            # response, so the XPath below runs against what Selenium actually sees
            res = response.replace(body=driver.page_source)

            results = res.xpath('//table[@class="default_dgrd staff_dgrd"]//tr[contains(@class,"default_dgrd_item '
                                'staff_dgrd_item") or contains(@class, "default_dgrd_alt staff_dgrd_alt")]')
            for result in results:
                full_name = result.xpath('./td[@class="staff_dgrd_fullname"]/a/text()').extract_first()
                print(full_name)
                # Lower-case both sides so "max" still matches "Max"
                if self.last_name.lower() in full_name.lower().split():
                    # Create the item only when a match is found, inside the loop
                    item = UniversitiesItem()
                    item['full_name'] = full_name
                    # Note the @ before href; [7:] strips the leading "mailto:"
                    email = result.xpath('./td[@class="staff_dgrd_staff_email"]/a/@href').extract_first()
                    if email is not None:
                        item['email'] = email[7:]
                    else:
                        item['email'] = ''
                    item['phone'] = result.xpath('./td[@class="staff_dgrd_staff_phone"]/text()').extract_first()
                    yield item
            driver.close()
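One further refinement, echoed in the comments below: the spider above opens and closes a fresh browser for every keyword. The driver can instead be created once and reused for the whole crawl; a minimal sketch (only the driver lifecycle changes, the search-and-extract logic stays as above):

import scrapy
from selenium import webdriver

class iupui(scrapy.Spider):
    name = 'iupui'
    allowed_domains = ['iupui.com']
    start_urls = ['http://iupuijags.com/staff.aspx']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # One browser for the whole crawl instead of one per keyword
        self.driver = webdriver.Chrome()
        self.driver.set_window_size(1120, 550)

    def closed(self, reason):
        # Scrapy calls this when the spider finishes; quit the browser here
        self.driver.quit()

    def parse(self, response):
        last_names = ["max"]
        for ln in last_names:
            self.driver.get('http://iupuijags.com/staff.aspx')
            # ... same search-and-extract logic as above, using self.driver ...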

Thanks a lot. I don't have a machine to test it on right now, but it looks good. I think res = response.replace(body=driver.page_source) was the key issue. And 'max' was just my mistake; it should be 'Max' in keyword.txt. I have also been opening and closing the browser on every iteration, which was pretty silly, but now I think I can move it out of the loop so it stays open until the whole crawl is done. – user8314628


It works! Thanks! Would you mind taking a look at another similar question I asked? I think I should be replacing the response with something else. https://*.com/questions/46125667/scrapy-shell-cant-crawl-information-while-xpath-works-in-chrome-console?noredirect=1#comment79217663_46125667 – user8314628
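On that last comment: when there is no prior Scrapy response to call .replace() on (as in the linked question), one option is to build an HtmlResponse directly from the Selenium page source; a sketch, assuming a driver that has already loaded the page:

from scrapy.http import HtmlResponse

# Wrap whatever the browser rendered in a Scrapy response object,
# so the usual .xpath()/.css() selectors work on it
res = HtmlResponse(url=driver.current_url,
                   body=driver.page_source,
                   encoding='utf-8')
rows = res.xpath('//table[@class="default_dgrd staff_dgrd"]//tr')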