Crawling and Saving Web Page Data with Scrapy

Crawling web pages was covered in the previous post, so the steps are not repeated here; refer to that post if you need them.

Link: https://blog.csdn.net/csdnmgq/article/details/88703019

Replace the code in itcast.py with the following:

import scrapy
from test002.items import ItcastItem
class ItcastSpider(scrapy.Spider):
    name = "itcast"
    allowed_domains = ["itcast.cn"]
    start_urls = ("http://www.itcast.cn/channel/teacher.shtml",)

    def parse(self, response):
        # Save the raw page body to a local file for later inspection
        filename = "test.html"
        with open(filename, 'wb') as f:
            f.write(response.body)
        items = []

        for each in response.xpath("//div[@class='li_txt']"):
            # Wrap the extracted data in an `ItcastItem` object
            item = ItcastItem()
            # extract() returns a list of unicode strings
            name = each.xpath("h3/text()").extract()
            title = each.xpath("h4/text()").extract()
            info = each.xpath("p/text()").extract()

            # Each list is expected to hold one element, so take the first
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]

            items.append(item)

        # Return all collected items at once
        return items
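
One thing to watch for in the loop above: indexing with `[0]` raises `IndexError` whenever an XPath query matches nothing, for example if an entry has no `<p>` text. Below is a minimal sketch of a guard. `first_or_default` is a hypothetical helper name, not part of Scrapy, and the sample lists are made-up stand-ins for what `extract()` would return.

```python
def first_or_default(values, default=""):
    """Return the first non-blank string in `values`, stripped, or `default`."""
    for value in values:
        text = value.strip()
        if text:
            return text
    return default


# Sample lists standing in for the results of extract() (made-up data)
name = ["  Teacher A "]
title = ["Senior Lecturer"]
info = []  # the XPath query matched nothing

record = {
    "name": first_or_default(name),
    "title": first_or_default(title),
    "info": first_or_default(info, default="N/A"),
}
print(record)
```

Inside `parse`, this would mean writing `first_or_default(name)` instead of `name[0]`. Scrapy's selector lists also offer `extract_first()`, which accepts a `default` argument and serves a similar purpose.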

Running the spider produces the result shown below:

[Image: Scrapy crawling web page data]
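
To save the scraped items, Scrapy's built-in feed export can write them to a file from the command line, e.g. `scrapy crawl itcast -o teachers.json` (the output file name is an assumption). The sketch below illustrates the same idea in plain Python, assuming the items have been converted to dictionaries. `save_items` and the sample records are hypothetical and only show the shape of the output.

```python
import json
import os
import tempfile


def save_items(items, path):
    """Write a list of dict-like items to `path` as a UTF-8 JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes
        json.dump([dict(item) for item in items], f, ensure_ascii=False, indent=2)


# Made-up sample records with the same three fields the spider fills in
sample_items = [
    {"name": "Teacher A", "title": "Senior Lecturer", "info": "Ten years of teaching."},
    {"name": "Teacher B", "title": "Lecturer", "info": "Specializes in Python."},
]

# Write to a temporary directory so the sketch leaves the project untouched
output_path = os.path.join(tempfile.mkdtemp(), "teachers.json")
save_items(sample_items, output_path)

# Read the file back to confirm the round trip
with open(output_path, encoding="utf-8") as f:
    loaded = json.load(f)
print(len(loaded), "items saved to", output_path)
```

If the JSON exported by Scrapy shows `\uXXXX` escapes instead of Chinese characters, setting `FEED_EXPORT_ENCODING = 'utf-8'` in settings.py is the usual fix.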