Does anyone have example code for a SQLite pipeline in Scrapy?

Problem description:

I am looking for some example code of a SQLite pipeline in Scrapy. I know there is no built-in support for it, but I'm sure it has been done. Only actual code can help me, as I only know enough Python and Scrapy to accomplish my very limited task, and I need the code as a starting point.


Doesn't Scrapy, which runs asynchronously, require a non-blocking data store? Wouldn't SQLite fail to work in that case? – zelusp 2016-10-06 18:24:24


It seems sqlite3 is fast enough and smart enough to handle concurrency (up to a point). [see here](http://*.com/questions/4060772/sqlite3-concurrent-access) – zelusp 2016-10-08 00:19:25

If you feel comfortable with Twisted's adbapi, you can take this MySQL pipeline as a starting point: http://github.com/darkrho/scrapy-googledir-mysql/blob/master/googledir/pipelines.py

and use this line in `__init__`:

self.dbpool = adbapi.ConnectionPool("sqlite3", database="/path/sqlite.db") 
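(Not part of the original answer: below is a minimal sketch of what that adapted pipeline might look like. The class name and the `items(url, desc)` table are illustrative assumptions, not taken from the linked MySQL example.)

# A minimal sketch of a Twisted adbapi pipeline adapted for SQLite.
# Assumes an `items(url, desc)` table already exists in the database.
from twisted.enterprise import adbapi

class SqliteAdbapiPipeline(object):
    def __init__(self):
        # adbapi runs blocking DB calls in a thread pool; cp_min/cp_max=1
        # keeps a single connection, and check_same_thread=False lets
        # sqlite3 accept calls from the pool's worker thread.
        self.dbpool = adbapi.ConnectionPool(
            "sqlite3", database="/path/sqlite.db",
            check_same_thread=False, cp_min=1, cp_max=1)

    def process_item(self, item, spider):
        # runInteraction returns a Deferred; Scrapy waits on it before
        # passing the item on.
        d = self.dbpool.runInteraction(self._insert, item)
        d.addCallback(lambda _: item)
        return d

    def _insert(self, txn, item):
        txn.execute("INSERT INTO items (url, desc) VALUES (?, ?)",
                    (item.get('url'), item.get('desc')))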

If you use sqlite3, be aware of the threading requirement: it requires that sqlite objects be used from the same thread that created them – Lionel 2011-11-21 01:30:50
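(For reference, not from the original comment: the stdlib sqlite3 module can relax that check, leaving thread safety up to you.)

import sqlite3
# Relaxes sqlite3's same-thread check; safe only if you serialize access yourself.
conn = sqlite3.connect('./scrapedata.db', check_same_thread=False)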

I did something like this:

#
# Author: Jay Vaughan
#
# Pipelines for processing items returned from a scrape.
# Don't forget to add the pipeline to the ITEM_PIPELINES setting.
# See: http://doc.scrapy.org/topics/item-pipeline.html
#
from scrapy import log
import sqlite3 as sqlite  # stdlib sqlite3 replaces the old pysqlite2 dbapi2

# This pipeline takes the Item and stuffs it into scrapedata.db
class scrapeDatasqLitePipeline(object):
    def __init__(self):
        # Possibly we should be doing this in open_spider instead, but okay
        self.connection = sqlite.connect('./scrapedata.db')
        self.cursor = self.connection.cursor()
        self.cursor.execute('CREATE TABLE IF NOT EXISTS myscrapedata '
                            '(id INTEGER PRIMARY KEY, url VARCHAR(80), desc VARCHAR(80))')

    # Take the item and put it in the database - do not allow duplicates
    def process_item(self, item, spider):
        # Query parameters must be passed as a sequence, not a bare string
        self.cursor.execute("SELECT * FROM myscrapedata WHERE url=?",
                            (item['url'][0],))
        result = self.cursor.fetchone()
        if result:
            log.msg("Item already in database: %s" % item, level=log.DEBUG)
        else:
            self.cursor.execute(
                "INSERT INTO myscrapedata (url, desc) VALUES (?, ?)",
                (item['url'][0], item['desc'][0]))
            self.connection.commit()
            log.msg("Item stored: %s" % item, level=log.DEBUG)
        return item

    def handle_error(self, e):
        log.err(e)
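(Not in the original answer: to enable the pipeline, the ITEM_PIPELINES setting would look something like this, assuming the class lives in myproject/pipelines.py.)

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.scrapeDatasqLitePipeline': 300,
}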

For anyone trying to solve a similar problem, I just came across a nice SQLite Item Exporter: https://github.com/RockyZ/Scrapy-sqlite-item-exporter

After including it in your project settings, you can use it with:

scrapy crawl <spider name> -o sqlite.db -t sqlite

It could also be adapted for use as an item pipeline instead of an item exporter.
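(A guess at the wiring, not from the original answer: Scrapy resolves the `-t sqlite` format via the FEED_EXPORTERS setting, so assuming you copied the project's exporter class into yourproject/exporters.py, the hookup would be:)

# settings.py
FEED_EXPORTERS = {
    'sqlite': 'yourproject.exporters.SqliteItemExporter',
}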

Here is a SQLite pipeline based on SQLAlchemy. With SQLAlchemy you can easily change the database should you ever need to.

Add the database configuration to settings.py:

# settings.py 
# ... 
DATABASE = { 
    'drivername': 'sqlite', 
    # 'host': 'localhost', 
    # 'port': '5432', 
    # 'username': 'YOUR_USERNAME', 
    # 'password': 'YOUR_PASSWORD', 
    'database': 'books.sqlite' 
} 
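(For reference, not part of the original answer: this dict is simply keyword arguments for SQLAlchemy's URL helper, which the pipeline below turns into a connection string.)

# SQLAlchemy 1.x shown; on 1.4+ use URL.create(**DATABASE) instead
from sqlalchemy.engine.url import URL
print(URL(drivername='sqlite', database='books.sqlite'))
# -> sqlite:///books.sqlite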

Then add the following to pipelines.py:

# pipelines.py
import logging

from scrapy import signals
from sqlalchemy import Column, Integer, String, DateTime
from sqlalchemy import create_engine
from sqlalchemy.engine.url import URL
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from sqlalchemy.pool import NullPool

logger = logging.getLogger(__name__)

DeclarativeBase = declarative_base()

class Book(DeclarativeBase):
    __tablename__ = "books"

    id = Column(Integer, primary_key=True)
    title = Column('title', String)
    author = Column('author', String)
    publisher = Column('publisher', String)
    url = Column('url', String)
    scrape_date = Column('scrape_date', DateTime)

    def __repr__(self):
        return "<Book({})>".format(self.url)


class SqlitePipeline(object):
    def __init__(self, settings):
        self.database = settings.get('DATABASE')
        self.sessions = {}

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls(crawler.settings)
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def create_engine(self):
        # On SQLAlchemy 1.4+ use URL.create(**self.database) instead.
        # Note: connect_args like MySQL's 'charset' are not valid for sqlite.
        engine = create_engine(URL(**self.database), poolclass=NullPool)
        return engine

    def create_tables(self, engine):
        DeclarativeBase.metadata.create_all(engine, checkfirst=True)

    def create_session(self, engine):
        session = sessionmaker(bind=engine)()
        return session

    def spider_opened(self, spider):
        engine = self.create_engine()
        self.create_tables(engine)
        session = self.create_session(engine)
        self.sessions[spider] = session

    def spider_closed(self, spider):
        session = self.sessions.pop(spider)
        session.close()

    def process_item(self, item, spider):
        session = self.sessions[spider]
        book = Book(**item)
        link_exists = session.query(Book).filter_by(url=item['url']).first() is not None

        if link_exists:
            logger.info('Item {} is in db'.format(book))
            return item

        try:
            session.add(book)
            session.commit()
            logger.info('Item {} stored in db'.format(book))
        except Exception:
            logger.exception('Failed to add {} to db'.format(book))
            session.rollback()
            raise

        return item
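(Not in the original answer: the pipeline still needs to be enabled, e.g. like this, assuming it lives in yourproject/pipelines.py.)

# settings.py
ITEM_PIPELINES = {
    'yourproject.pipelines.SqlitePipeline': 300,
}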

items.py should look like this:

#items.py 
import scrapy 

class BookItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    publisher = scrapy.Field()
    url = scrapy.Field()  # the pipeline reads item['url'], so it must be declared
    scrape_date = scrapy.Field()

You could also consider moving `class Book` into items.py.
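(If you do, pipelines.py would then just import it; the module path here is assumed.)

# pipelines.py
from yourproject.items import Book, DeclarativeBase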