Different XPath results from Scrapy and the browser console
Problem description:
I am using Selenium with PhantomJS to scrape a university web page, and I get different XPath results from Scrapy and from the browser console.
For testing purposes I am collecting professors' contact information (not for any malicious purpose). Say kw.txt is a file containing only two last names, like so:
max
li
import scrapy
from selenium import webdriver
from universities.items import UniversitiesItem


class iupui(scrapy.Spider):
    name = 'iupui'
    allowed_domains = ['iupui.com']
    start_urls = ['http://iupuijags.com/staff.aspx']

    def __init__(self):
        self.last_name = ''

    def parse(self, response):
        with open('kw.txt') as file_object:
            last_names = file_object.readlines()
        for ln in last_names:
            # driver = webdriver.PhantomJS("C:\\Users\yashi\AppData\Roaming\Python\Python36\Scripts\phantomjs.exe")
            driver = webdriver.Chrome('C:\\Users\yashi\AppData\Local\Programs\Python\Python36\chromedriver.exe')
            driver.set_window_size(1120, 550)
            driver.get('http://iupuijags.com/staff.aspx')
            kw_search = driver.find_element_by_id('ctl00_cplhMainContent_txtSearch')
            search = driver.find_element_by_id('ctl00_cplhMainContent_btnSearch')
            self.last_name = ln.strip()
            kw_search.send_keys(self.last_name)
            search.click()
            item = UniversitiesItem()
            results = response.xpath('//table[@class="default_dgrd staff_dgrd"]//tr[contains(@class,"default_dgrd_item '
                                     'staff_dgrd_item") or contains(@class, "default_dgrd_alt staff_dgrd_alt")]')
            for result in results:
                full_name = result.xpath('./td[@class="staff_dgrd_fullname"]/a/text()').extract_first()
                print(full_name)
                if self.last_name in full_name.split():
                    item['full_name'] = full_name
                    email = result.xpath('./td[@class="staff_dgrd_staff_email"]/a/href').extract_first()
                    if email is not None:
                        item['email'] = email[7:]
                    else:
                        item['email'] = ''
                    item['phone'] = result.xpath('./td[@class="staff_dgrd_staff_phone"]/text()').extract_first()
                yield item
            driver.close()
However, the result ALWAYS gives me the same bunch of names, which looks like:
2017-09-12 15:27:13 [selenium.webdriver.remote.remote_connection] DEBUG: Finished Request
Dr. Roderick Perry
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx>
{}
Gail Barksdale
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx>
{}
John Rasmussen
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx>
{}
Jared Chasey
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx>
{}
Denise O'Grady
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx>
{}
Ed Holdaway
2017-09-12 15:27:13 [scrapy.core.scraper] DEBUG: Scraped from <200 http://iupuijags.com/staff.aspx>
{}
The length of the results is the same on every iteration.
Here is what it looks like in the console when I enter the XPath there: console result
I really can't figure out what the problem is.
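The mismatch between the browser console and Scrapy is expected: the console evaluates XPath against the JavaScript-rendered DOM, while the Scrapy response only holds the raw HTML the server sent, where the search results have not been filled in yet. A minimal sketch of that difference (stdlib only, with simplified markup and class names standing in for the real page):

```python
import xml.etree.ElementTree as ET

# What the Scrapy response body contains: the table element exists,
# but its rows are only added later by the page's JavaScript.
raw_html = '<html><body><table class="staff"></table></body></html>'

# What the browser console sees after the scripts have run.
rendered_html = ('<html><body><table class="staff">'
                 '<tr><td>Max Smith</td></tr>'
                 '</table></body></html>')

raw_rows = ET.fromstring(raw_html).findall('.//table/tr')
rendered_rows = ET.fromstring(rendered_html).findall('.//table/tr')

print(len(raw_rows), len(rendered_rows))  # 0 1
```

The same row-selecting path matches nothing in the raw document but one row in the rendered one, which is why the console "works" while the spider sees stale data.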
Answer:
So there are a few problems.
You are not using the response from your Selenium code. You browse to the page and then do nothing with the page's source.
Next, you are yielding the item even when no match is found, hence the blank items. Also, you create the item outside the loop when it should be inside.
The comparison you are doing is case-sensitive: you check for 'max' but the result has 'Max', so you miss the match. You are also missing the '@' in 'href' for the email.
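The last two slips can be reproduced without Scrapy at all; a small self-contained sketch (stdlib only, with a made-up mailto address):

```python
import xml.etree.ElementTree as ET

full_name = 'Max Smith'
keyword = 'max'

# Case-sensitive membership test: 'max' != 'Max', so the match is missed.
case_sensitive = keyword in full_name.split()

# Lower-casing both sides before splitting finds it.
case_insensitive = keyword.lower() in full_name.lower().split()

# XPath 'a/href' selects a child *element* named href (there is none);
# the attribute must be addressed as 'a/@href'. ElementTree's equivalent
# of the attribute axis is .get('href').
cell = ET.fromstring('<td><a href="mailto:max@example.edu">Email</a></td>')
child_element = cell.find('a/href')        # None: no <href> element exists
attribute = cell.find('a').get('href')     # 'mailto:max@example.edu'
email = attribute[7:]                      # drops the 7-char 'mailto:' prefix
```

This is why the original spider printed names but produced empty email fields: `a/href` quietly matched nothing.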
Here is a fixed version:
class iupui(scrapy.Spider):
    name = 'iupui'
    allowed_domains = ['iupui.com']
    start_urls = ['http://iupuijags.com/staff.aspx']

    # def __init__(self):
    #     self.last_name = ''

    def parse(self, response):
        # with open('kw.txt') as file_object:
        #     last_names = file_object.readlines()
        last_names = ["max"]
        for ln in last_names:
            # driver = webdriver.PhantomJS("C:\\Users\yashi\AppData\Roaming\Python\Python36\Scripts\phantomjs.exe")
            driver = webdriver.Chrome()
            driver.set_window_size(1120, 550)
            driver.get('http://iupuijags.com/staff.aspx')
            kw_search = driver.find_element_by_id('ctl00_cplhMainContent_txtSearch')
            search = driver.find_element_by_id('ctl00_cplhMainContent_btnSearch')
            self.last_name = ln.strip()
            kw_search.send_keys(self.last_name)
            search.click()
            res = response.replace(body=driver.page_source)
            results = res.xpath('//table[@class="default_dgrd staff_dgrd"]//tr[contains(@class,"default_dgrd_item '
                                'staff_dgrd_item") or contains(@class, "default_dgrd_alt staff_dgrd_alt")]')
            for result in results:
                full_name = result.xpath('./td[@class="staff_dgrd_fullname"]/a/text()').extract_first()
                print(full_name)
                if self.last_name.lower() in full_name.lower().split():
                    item = UniversitiesItem()
                    item['full_name'] = full_name
                    email = result.xpath('./td[@class="staff_dgrd_staff_email"]/a/@href').extract_first()
                    if email is not None:
                        item['email'] = email[7:]
                    else:
                        item['email'] = ''
                    item['phone'] = result.xpath('./td[@class="staff_dgrd_staff_phone"]/text()').extract_first()
                    yield item
            driver.close()
Thank you very much. I don't have my machine here to test it, but it looks good. res = response.replace(body=driver.page_source) — I think that was the key problem. And 'max' was just my mistake; it should be 'Max' in keyword.txt. I was also opening and closing the browser on every iteration, which was pretty silly. Now I think I can hoist it out of the loop so one browser lasts for the whole crawl. – user8314628
It works! Thanks! Would you mind taking a look at another similar question I asked? I think I should replace the response with something else there too. https://*.com/questions/46125667/scrapy-shell-cant-crawl-information-while-xpath-works-in-chrome-console?noredirect=1#comment79217663_46125667 – user8314628