Scrapy CrawlSpider only crawls start_urls

Question:

My CrawlSpider only crawls the page in start_urls and never goes any further.

Here is my code.

import scrapy 
from scrapy.linkextractors import LinkExtractor 
from scrapy.spiders import CrawlSpider, Rule 


class ExampleSpider(CrawlSpider): 
    name = 'example' 
    allowed_domains = ['holy-bible-eng'] 
    start_urls = ['file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml'] 

    rules = (
        Rule(LinkExtractor(allow=r'OEBPS'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        return response

Below is the content of file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml, the page listed in start_urls.

<?xml version="1.0" encoding="UTF-8"?> 
 
<!DOCTYPE html 
 
    PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> 
 
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /><title>Holy Bible</title><link href="lds_ePub_scriptures.css" rel="stylesheet" type="text/css" /></head><body class="bible-toc"><div class="titleBlock"><h1 class="toc-title">The Names and Order of All the <br /><span class="dominant">Books of the Old and <br />New Testaments</span></h1></div><div class="bible-toc"><p><a href="bible_dedication.xhtml">Epistle Dedicatory</a> | <a href="quad_abbreviations.xhtml">Abbreviations</a></p><h2 class="toc-title"><a href="ot.xhtml">The Books of the Old Testament</a></h2><p><a href="gen.xhtml">Genesis</a> | <a href="ex.xhtml">Exodus</a> | <a href="lev.xhtml">Leviticus</a> | <a href="num.xhtml">Numbers</a> | <a href="deut.xhtml">Deuteronomy</a> | <a href="josh.xhtml">Joshua</a> | <a href="judg.xhtml">Judges</a> | <a href="ruth.xhtml">Ruth</a> | <a href="1-sam.xhtml">1 Samuel</a> | <a href="2-sam.xhtml">2 Samuel</a> | <a href="1-kgs.xhtml">1 Kings</a> | <a href="2-kgs.xhtml">2 Kings</a> | <a href="1-chr.xhtml">1 Chronicles</a> | <a href="2-chr.xhtml">2 Chronicles</a> | <a href="ezra.xhtml">Ezra</a> | <a href="neh.xhtml">Nehemiah</a> | <a href="esth.xhtml">Esther</a> | <a href="job.xhtml">Job</a> | <a href="ps.xhtml">Psalms</a> | <a href="prov.xhtml">Proverbs</a> | <a href="eccl.xhtml">Ecclesiastes</a> | <a href="song.xhtml">Song of Solomon</a> | <a href="isa.xhtml">Isaiah</a> | <a href="jer.xhtml">Jeremiah</a> | <a href="lam.xhtml">Lamentations</a> | <a href="ezek.xhtml">Ezekiel</a> | <a href="dan.xhtml">Daniel</a> | <a href="hosea.xhtml">Hosea</a> | <a href="joel.xhtml">Joel</a> | <a href="amos.xhtml">Amos</a> | <a href="obad.xhtml">Obadiah</a> | <a href="jonah.xhtml">Jonah</a> | <a href="micah.xhtml">Micah</a> | <a href="nahum.xhtml">Nahum</a> | <a href="hab.xhtml">Habakkuk</a> | <a href="zeph.xhtml">Zephaniah</a> | <a href="hag.xhtml">Haggai</a> | <a href="zech.xhtml">Zechariah</a> | <a 
href="mal.xhtml">Malachi</a></p><h2 class="toc-title"><a href="nt.xhtml">The Books of the New Testament</a></h2><p><a href="matt.xhtml">Matthew</a> | <a href="mark.xhtml">Mark</a> | <a href="luke.xhtml">Luke</a> | <a href="john.xhtml">John</a> | <a href="acts.xhtml">Acts</a> | <a href="rom.xhtml">Romans</a> | <a href="1-cor.xhtml">1 Corinthians</a> | <a href="2-cor.xhtml">2 Corinthians</a> | <a href="gal.xhtml">Galatians</a> | <a href="eph.xhtml">Ephesians</a> | <a href="philip.xhtml">Philippians</a> | <a href="col.xhtml">Colossians</a> | <a href="1-thes.xhtml">1 Thessalonians</a> | <a href="2-thes.xhtml">2 Thessalonians</a> | <a href="1-tim.xhtml">1 Timothy</a> | <a href="2-tim.xhtml">2 Timothy</a> | <a href="titus.xhtml">Titus</a> | <a href="philem.xhtml">Philemon</a> | <a href="heb.xhtml">Hebrews</a> | <a href="james.xhtml">James</a> | <a href="1-pet.xhtml">1 Peter</a> | <a href="2-pet.xhtml">2 Peter</a> | <a href="1-jn.xhtml">1 John</a> | <a href="2-jn.xhtml">2 John</a> | <a href="3-jn.xhtml">3 John</a> | <a href="jude.xhtml">Jude</a> | <a href="rev.xhtml">Revelation</a></p><h2 class="toc-title"><a href="bible-helps_title-page.xhtml">Appendix</a></h2><p><a href="tg.xhtml">Topical Guide</a> | <a href="bd.xhtml">Bible Dictionary</a> | <a href="bible-chron.xhtml">Bible Chronology</a> | <a href="harmony.xhtml">Harmony of the Gospels</a> | <a href="jst.xhtml">Joseph Smith Translation</a> | <a href="bible-maps.xhtml">Bible Maps</a> | <a href="bible-photos.xhtml">Bible Photographs</a></p></div></body></html>

And below is my console output.

(crawl) G:\kjvbible>scrapy crawl example 
...... 
...... 

2017-04-08 09:24:59 [scrapy.core.engine] INFO: Spider opened 
2017-04-08 09:24:59 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 
2017-04-08 09:24:59 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6026 
2017-04-08 09:24:59 [scrapy.core.engine] DEBUG: Crawled (200) <GET file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml> (referer: None) 
2017-04-08 09:24:59 [scrapy.core.engine] INFO: Closing spider (finished) 
2017-04-08 09:24:59 [scrapy.statscollectors] INFO: Dumping Scrapy stats: 
{'downloader/request_bytes': 237, 
'downloader/request_count': 1, 
'downloader/request_method_count/GET': 1, 
'downloader/response_bytes': 3693, 

It doesn't go any deeper.

Any suggestions would be welcome.

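From the CrawlSpider documentation:

"follow is a boolean which specifies if links should be followed from each response extracted with this rule. If callback is None, follow defaults to True; otherwise it defaults to False."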
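To make the quoted default concrete, here is a minimal sketch of the rule it describes. The helper name effective_follow is made up for illustration; this models the documented behaviour and is not Scrapy's own code.

```python
from typing import Optional


def effective_follow(callback: Optional[str], follow: Optional[bool] = None) -> bool:
    """Hypothetical model of the documented default for Rule's follow argument."""
    if follow is None:
        # No explicit value given: follow links only when the rule has no callback.
        return callback is None
    # An explicit value is used as given, whether or not a callback is set.
    return follow


if __name__ == "__main__":
    print(effective_follow(None))                # rule without a callback
    print(effective_follow("parse_item"))        # callback only, follow left unset
    print(effective_follow("parse_item", True))  # callback plus explicit follow=True
```

Under this reading, a rule like the one in the question (callback='parse_item', follow=True) still has link following switched on.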

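Note that this is only a default. Your rule sets follow=True explicitly, so the callback and link following can be combined, and the rule itself is not what stops the crawl.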

So the main idea behind CrawlSpider's rules is that the spider can find links, follow them, and extract data from the pages it reaches.

That said, Scrapy is not the best tool for going through your "local" files; for that, a simple script is enough.
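As a rough idea of what such a script could look like, here is a sketch that uses only the Python standard library to pull the links out of a page and resolve them against the page's URL. The names LinkCollector and extract_links are my own, and the inline sample stands in for the real bible-toc.xhtml; reading the actual file from disk is left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn a relative href such as "gen.xhtml" into an absolute URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(markup, base_url):
    """Return all absolute link URLs found in the given markup."""
    collector = LinkCollector(base_url)
    collector.feed(markup)
    return collector.links


if __name__ == "__main__":
    # Small sample in the shape of the table of contents from the question.
    sample = '<p><a href="gen.xhtml">Genesis</a> | <a href="ex.xhtml">Exodus</a></p>'
    base = "file:///G:/holy-bible-eng/OEBPS/bible-toc.xhtml"
    for link in extract_links(sample, base):
        print(link)
```

From there you could open each linked file in a loop and process it however you need, without a crawler in between.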

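Another mistake is that you are setting the allowed_domains class variable, which specifies the domains the spider should accept. Links to all other domains are rejected, and this only makes sense for links on the internet. If you don't want to reject any domains, or you are not using domains at all (your case), remove that variable.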
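To see why the links on your page get rejected: as far as I understand it, the offsite filter compares the host name of each request against allowed_domains, and a file:/// URL has no host name at all. The sketch below shows this with the standard library. The is_allowed function is a simplified model written for illustration, not Scrapy's actual check, and example.com is just a placeholder.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the host part of a URL, or '' when it has none."""
    return urlparse(url).hostname or ""


def is_allowed(url, allowed_domains):
    """Simplified model: the host must equal an allowed domain or be a subdomain of one."""
    host = host_of(url)
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


if __name__ == "__main__":
    local_page = "file:///G:/holy-bible-eng/OEBPS/gen.xhtml"
    web_page = "http://www.example.com/gen.xhtml"

    # The folder name appears in the path of the local URL, not in its host.
    print(repr(host_of(local_page)), is_allowed(local_page, ["holy-bible-eng"]))
    print(repr(host_of(web_page)), is_allowed(web_page, ["example.com"]))
```

Since 'holy-bible-eng' is only a folder name in the path, no extracted link can match it, which fits a crawl that stops right after the start page.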

Thanks for your reply. I just commented out 'allowed_domains' and it started following links! – Aaron

Glad to help! – eLRuLL

@Aaron Please remember to accept the answer if it helped you. – eLRuLL