MySQL database error when using Scrapy
Problem description:
I am trying to save scraped data in a MySQL database. My script.py is:
# -*- coding: utf-8 -*-
import scrapy
import unidecode
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from lxml import html


class ElementSpider(scrapy.Spider):
    name = 'books'
    download_delay = 3
    allowed_domains = ["goodreads.com"]
    start_urls = ["https://www.goodreads.com/list/show/19793.I_Marked_My_Calendar_For_This_Book_s_Release",]
    rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//a[@class="next_page"]',)), callback="parse", follow= True),)

    def parse(self, response):
        for href in response.xpath('//div[@id="all_votes"]/table[@class="tableList js-dataTooltip"]/tr/td[2]/div[@class="js-tooltipTrigger tooltipTrigger"]/a/@href'):
            full_url = response.urljoin(href.extract())
            print full_url
            yield scrapy.Request(full_url, callback = self.parse_books)
            break;

        next_page = response.xpath('.//a[@class="next_page"]/@href').extract()
        if next_page:
            next_href = next_page[0]
            next_page_url = 'https://www.goodreads.com' + next_href
            print next_page_url
            request = scrapy.Request(next_page_url, self.parse)
            yield request

    def parse_books(self, response):
        yield{
            'url': response.url,
            'title':response.xpath('//div[@id="metacol"]/h1[@class="bookTitle"]/text()').extract(),
            'link':response.xpath('//div[@id="metacol"]/h1[@class="bookTitle"]/a/@href').extract(),
        }
And pipeline.py is:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import MySQLdb
import hashlib
from scrapy.exceptions import DropItem
from scrapy.http import Request
import sys


class SQLStore(object):
    def __init__(self):
        self.conn = MySQLdb.connect("localhost","root","","books")
        self.cursor = self.conn.cursor()
        print "connected to DB"

    def process_item(self, item, spider):
        print "hi"
        try:
            self.cursor.execute("""INSERT INTO books_data(next_page_url) VALUES (%s)""", (item['url']))
            self.conn.commit()
        except Exception, e:
            print e
When I run the script there are no errors. The spider runs fine, but I think execution never reaches process_item. It does not even print "hi".
Answer
Your method signature is wrong; it should take item and spider arguments:
process_item(self, item, spider)
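For reference, here is a minimal sketch of a pipeline with that signature. It is written for Python 3 and uses the standard-library sqlite3 module as a stand-in for MySQLdb so that it runs without a MySQL server. That swap, the in-memory database, the table layout and the example.com URL are assumptions for illustration, not the asker's setup. Note that sqlite3 uses ? placeholders where MySQLdb uses %s.

```python
import sqlite3


class SQLStore(object):
    """Pipeline sketch: stores each item's URL in a database table."""

    def open_spider(self, spider):
        # Called once when the spider starts.
        # sqlite3 stands in for MySQLdb here (assumption for illustration).
        self.conn = sqlite3.connect(":memory:")
        self.cursor = self.conn.cursor()
        self.cursor.execute(
            "CREATE TABLE IF NOT EXISTS books_data (next_page_url TEXT)"
        )

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.conn.close()

    def process_item(self, item, spider):
        # Called for every item the spider yields.
        # The query parameters are passed as a one-element tuple.
        self.cursor.execute(
            "INSERT INTO books_data (next_page_url) VALUES (?)",
            (item["url"],),
        )
        self.conn.commit()
        return item  # hand the item on to the next pipeline, if any


if __name__ == "__main__":
    # Exercise the pipeline by hand, without Scrapy.
    store = SQLStore()
    store.open_spider(spider=None)
    store.process_item({"url": "https://example.com/book"}, spider=None)
    print(store.cursor.execute("SELECT * FROM books_data").fetchall())
    store.close_spider(spider=None)
```

Returning the item from process_item matters once more than one pipeline is enabled, because each pipeline receives whatever the previous one returned.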
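You also need to enable the pipeline in your settings.py file. ITEM_PIPELINES is a dict that maps the pipeline's import path to an order value:
ITEM_PIPELINES = {"project_name.path.SQLStore": 300}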
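As a quick illustration of how the values in ITEM_PIPELINES are used: items pass through the enabled pipelines from the lowest number to the highest. The sketch below shows that ordering in plain Python; `project_name.path.CleanUp` is a hypothetical second pipeline added only for the example.

```python
# Hypothetical settings: dotted import path -> order value (customarily 0-1000).
ITEM_PIPELINES = {
    "project_name.path.SQLStore": 300,
    "project_name.path.CleanUp": 100,  # hypothetical, for illustration only
}

# Lower values run first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)

# A brace literal without ": value" is a set, not a dict,
# so it carries no order values at all.
print(type({"project_name.path.SQLStore"}).__name__)
```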
Your syntax is also incorrect; you need to pass the parameters as a tuple:
self.cursor.execute("""INSERT INTO books_data(next_page_url) VALUES (%s)""",
                    (item['url'],))  # <- add the trailing comma
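The difference the trailing comma makes can be checked in a plain Python session. This is a small sketch with a placeholder URL, independent of MySQLdb:

```python
url = "https://example.com/book"  # placeholder value

not_a_tuple = (url)   # parentheses alone only group; this is still a str
one_tuple = (url,)    # the trailing comma makes a one-element tuple

print(type(not_a_tuple).__name__)  # str
print(type(one_tuple).__name__)    # tuple
print(len(one_tuple))              # 1
```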
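I have already tried this, but it does not work. I added the pipeline in settings.py as follows: ITEM_PIPELINES = {'test1.pipelines.SQLStore': 300, }
What is in the __init__.py file in your pipelines directory? And do you also have 'process_item(self, item, spider)'?
So how does Scrapy find your SQLStore pipeline?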
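On that last point: Scrapy locates a pipeline by importing the dotted path used as the key in ITEM_PIPELINES, so the path has to be importable from the project root. Below is a rough stand-alone check of such a path. It is an approximation of that lookup, not Scrapy's own code, and `resolve` is a hypothetical helper name.

```python
from importlib import import_module


def resolve(dotted_path):
    """Import 'package.module.Name' and return the object called Name."""
    module_path, _, name = dotted_path.rpartition(".")
    module = import_module(module_path)
    return getattr(module, name)


# Stand-in path from the standard library so the sketch runs anywhere.
print(resolve("collections.OrderedDict"))

# For the project discussed here the check would be, run from the project root:
# resolve("test1.pipelines.SQLStore")
```

If pipelines is a directory rather than a single pipelines.py file, the class is only reachable as test1.pipelines.SQLStore when the package's __init__.py imports it; otherwise the module name has to appear in the path as well.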