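Extracting links and text with Python Scrapy

Question:

I'm new to Python and Scrapy. I want to extract information from the site http://www.vodafone.com.au/about/legal/critical-information-summary/plans, including the link to each file, its name, and when it is valid from.

I tried the code below, but it doesn't work. I'd appreciate it if someone could explain why and help me.

Here is the file vodafone.py: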

import scrapy 

from scrapy.linkextractor import LinkExtractor 
from scrapy.spiders import Rule, CrawlSpider 
from vodafone_scraper.items import VodafoneScraperItem 


class VodafoneSpider(scrapy.Spider): 
    name = 'vodafone' 
    allowed_domains = ['vodafone.com.au'] 
    start_urls = ['http://www.vodafone.com.au/about/legal/critical-information-summary/plans'] 

    def parse(self, response):
        for sel in response.xpath('//tbody/tr/td[1]/a'):
            item = VodafoneScraperItem()
            item['link'] = sel.xpath('href').extract()
            item['name'] = sel.xpath('text()').extract_first()

            yield item

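Answer

It doesn't work because the page content is generated dynamically by JavaScript. The elements you are trying to extract do not exist in the HTML source that Scrapy receives (you can see this for yourself by viewing the page source in your browser).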
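One rough way to confirm this is to count the links inside table bodies in whatever HTML you actually received. The sketch below uses only the standard library and loosely approximates the `//tbody/tr/td[1]/a` XPath from the question (it counts every `<a>` inside a `<tbody>`). Both HTML strings are made-up samples for illustration, not the real Vodafone markup.

```python
from html.parser import HTMLParser


class TableLinkCounter(HTMLParser):
    """Counts <a> tags that appear inside a <tbody> element."""

    def __init__(self):
        super().__init__()
        self.tbody_depth = 0
        self.links = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tbody":
            self.tbody_depth += 1
        elif tag == "a" and self.tbody_depth > 0:
            self.links += 1

    def handle_endtag(self, tag):
        if tag == "tbody" and self.tbody_depth > 0:
            self.tbody_depth -= 1


def count_table_links(html):
    """Return how many links sit inside table bodies in the given HTML."""
    counter = TableLinkCounter()
    counter.feed(html)
    return counter.links


# Hypothetical samples: what a JavaScript-driven page might look like
# before and after the browser has run its scripts.
raw_html = "<html><body><div id='cis-table'></div><script src='app.js'></script></body></html>"
rendered_html = "<table><tbody><tr><td><a href='/doc.pdf'>Plan</a></td></tr></tbody></table>"

print(count_table_links(raw_html))
print(count_table_links(rendered_html))
```

If the count for the downloaded source is zero while the browser clearly shows rows, the table is being filled in client-side.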

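You have two options:

Check whether the page uses an API. Look for XHR requests in the Network tab of your browser's developer tools. Fortunately, this particular page appears to fetch its data from requests like http://www.vodafone.com.au/rest/CIS?field:planCategory:equals=Mobile%20Plans&field:planFromDate:lessthaneq=20/08/2017, which return JSON that you can parse.
The other option is to render the page, JavaScript included, and then parse the result. I recommend Splash, because it integrates seamlessly with Scrapy through the scrapy-splash library.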
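For the second option, a minimal sketch of the project settings that the scrapy-splash README asks for might look like the following. It assumes a Splash instance is already running locally on port 8050; adjust `SPLASH_URL` to wherever yours actually runs.

```python
# settings.py (sketch): wiring scrapy-splash into a Scrapy project.
# Assumption: Splash is reachable at this address.
SPLASH_URL = 'http://localhost:8050'

# Middlewares that route requests through Splash and handle its cookies.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

# Deduplicates Splash arguments to save space in the request queue.
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

# Makes the duplicate filter aware of Splash-specific request data.
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

With these in place, the spider would yield `SplashRequest` objects (imported from `scrapy_splash`) instead of plain requests, for example with `args={'wait': 0.5}` to give the scripts time to run, and the original XPath could then be applied to the rendered response.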

Answer

Instead of requesting:

start_urls = ['http://www.vodafone.com.au/about/legal/critical-information-summary/plans']

you can set start_urls to:

start_urls = ['http://www.vodafone.com.au/rest/CIS?field:planCategory:equals=Mobile%20Plans&field:planFromDate:lessthaneq=22/08/2017']
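The date in that URL is hard-coded, so it may be worth building the URL from a date object instead. This is only a sketch: the helper name `build_cis_url` is made up, and it assumes the endpoint expects day/month/year (as 22/08/2017 suggests) and that `lessthaneq` means "on or before".

```python
from datetime import date
from urllib.parse import quote

# Endpoint taken from the XHR request seen in the browser's Network tab.
CIS_ENDPOINT = "http://www.vodafone.com.au/rest/CIS"


def build_cis_url(as_of, category="Mobile Plans"):
    """Build the CIS query URL for plans whose from-date is on or before `as_of`."""
    # Assumption: the date is formatted day/month/year, e.g. 22/08/2017.
    date_text = as_of.strftime("%d/%m/%Y")
    return (
        f"{CIS_ENDPOINT}"
        f"?field:planCategory:equals={quote(category)}"
        f"&field:planFromDate:lessthaneq={date_text}"
    )


url = build_cis_url(date(2017, 8, 22))
print(url)
```

In the spider this could be used as `start_urls = [build_cis_url(date.today())]`, so the query always covers plans up to the current day.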

Then parse response.body as JSON:

response_json = json.loads(response.body)

This gives you all the objects from the site. Now simply loop over them and pull out the data you need:

for item_json in response_json: 
    item["link"] = item_json["document"]["file"] 
    item["name"] = item_json["document"]["name"]
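If some entries turn out to lack a `document` or a `file`, indexing directly would raise a `KeyError`. A more defensive variant of the same loop can be sketched without Scrapy as below. The function name `extract_documents` and the sample payload are made up for illustration; only the `document`, `file`, and `name` keys come from the snippet above.

```python
import json


def extract_documents(body):
    """Return a list of {'link', 'name'} dicts parsed from the JSON body."""
    records = json.loads(body)
    results = []
    for record in records:
        # Tolerate entries that have no "document" object at all.
        document = record.get("document") or {}
        link = document.get("file")
        if link is None:
            continue  # skip entries without a file link
        results.append({"link": link, "name": document.get("name")})
    return results


# Hypothetical payload shaped like the structure the answer describes.
sample_body = b'''[
  {"document": {"file": "/docs/plan-a.pdf", "name": "Plan A"}},
  {"document": {"name": "No file here"}},
  {"other": "entry without a document"}
]'''

print(extract_documents(sample_body))
```

Whether skipping incomplete entries is the right behaviour depends on the data; logging them instead may be preferable if they are unexpected.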

Here is the complete code:

import json

import scrapy
from vodafone_scraper.items import VodafoneScraperItem


class VodafoneSpider(scrapy.Spider):
    name = 'vodafone'
    allowed_domains = ['vodafone.com.au']
    start_urls = [
        'http://www.vodafone.com.au/rest/CIS?field:planCategory:equals=Mobile%20Plans&field:planFromDate:lessthaneq=22/08/2017'
    ]

    # parse must be a method of the spider class, so it is indented here.
    def parse(self, response):
        # The endpoint returns a JSON array of plan objects.
        response_json = json.loads(response.body)
        for item_json in response_json:
            item = VodafoneScraperItem()
            item["link"] = item_json["document"]["file"]
            # Use the 'name' field, matching the item used in the question.
            item["name"] = item_json["document"]["name"]

            yield item
