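Preface

I have been reading a book on web crawlers lately, and this post is my attempt at a hands-on exercise. Where I stand now: I have stepped over the threshold and am standing in the doorway. I have learned to use a small pistol, I am enjoying it, and I am aiming at everything in sight, lining up rear sight, front sight, and target. From here the road is probably about growing in breadth and depth. There are other ways of shooting to learn besides this one, and other guns to get to know besides the one in my hand.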

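Approach

Grab the big piece first, then the small ones. After that, filter with regular expressions and crawl with multiple threads.

Crawl using a breadth-first search.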
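To make the breadth-first idea concrete, here is a minimal sketch that walks a toy link graph level by level. The function name bfs_order and the toy_links dictionary are made up for illustration; a real crawler would discover the links by downloading each page instead of reading them from a dictionary.

```python
from collections import deque

def bfs_order(start, links):
    """Return pages in the order a breadth-first crawl would visit them."""
    seen = {start}            # pages already queued, so nothing is visited twice
    queue = deque([start])    # FIFO queue: pages closest to the start come out first
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

# Hypothetical site layout: one index page that links to three chapter pages.
toy_links = {
    "index.html": ["1.html", "2.html", "3.html"],
    "1.html": ["index.html", "2.html"],
}
print(bfs_order("index.html", toy_links))
```

For this exercise the graph is only two levels deep (the index page, then the chapters), so the crawl reduces to "read the index, then fetch every chapter".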

Nesting

Practice

Target URL: https://www.kanunu8.com/book2/10748/index.html

1. Fetching the chapter links

[Image: Crawler in Practice, Scraping the Novel 《从你的全世界路过》]
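The first thing to do is work out which parts of the page we want to grab. Here you can see that all the chapters are wrapped in a tbody tag, so we pull that part out first.

1.1 Grabbing the big piece

Import the packages we need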

import re
import requests
import os
from multiprocessing.dummy import Pool

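Then fetch the page

import re
import requests
import os
from multiprocessing.dummy import Pool


# The site declares gb2312, so decode the raw bytes with that codec
html=requests.get("https://www.kanunu8.com/book2/10748/index.html").content.decode('gb2312')
# First attempt: take the first tbody block (re.search returns a match object, so .group(1) works)
html_ser=re.search("tbody>(.*?)</tbody",html,re.S).group(1)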

One question here: why decode with the gb2312 encoding?
Because that is the charset the site itself declares, so we follow it.
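Rather than hard-coding the codec, the declared charset can be read from the page bytes. The sketch below is one possible way to do it, not what the original script does. The helper name declared_charset and the sample bytes are invented for illustration, and it assumes the declaration appears in the first couple of kilobytes of the page.

```python
import re

def declared_charset(raw: bytes, default: str = "utf-8") -> str:
    """Read the charset name from an HTML meta tag, falling back to a default."""
    # The declaration itself is plain ASCII, so a lossy ASCII decode is enough to find it.
    head = raw[:2048].decode("ascii", errors="ignore")
    match = re.search(r'charset\s*=\s*["\']?([\w-]+)', head, re.I)
    return match.group(1).lower() if match else default

# Hypothetical page bytes, encoded the way the site says it encodes them.
sample = '<meta http-equiv="Content-Type" content="text/html; charset=gb2312"><p>第一夜</p>'.encode("gb2312")

encoding = declared_charset(sample)
text = sample.decode(encoding, errors="replace")
print(encoding, text)
```

Note that gbk is a superset of gb2312, so decoding with 'gbk' also works on pages that declare gb2312 and is more forgiving when a page contains a character outside the smaller set.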

Since I could not find a suitable unique cut point here, I use findall to collect every candidate and then pick the right one

import re
import requests
import os
from multiprocessing.dummy import Pool
import time


html=requests.get("https://www.kanunu8.com/book2/10748/index.html").content.decode('gb2312')
# Collect every tbody block, then keep the one that contains the first chapter title
html_ser=re.findall("tbody>(.*?)</tbody",html,re.S)
html_sers=None
for i in html_ser:
    if re.search("第一夜",i,re.S) is not None:
        html_sers=i
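The "big piece first, small pieces second" filtering can be tried offline on a small HTML string. The sample_html below is a made-up stand-in for the index page, and the chapter file names in it are hypothetical; only the tbody structure and the "第一夜" marker come from the steps above.

```python
import re

# Hypothetical stand-in for the index page: two tables, only one holds the chapters.
sample_html = """
<table><tbody><tr><td><a href="/">首页</a></td></tr></tbody></table>
<table><tbody>
<tr><td><a href="161336.html">第一夜</a></td></tr>
<tr><td><a href="161337.html">第二夜</a></td></tr>
</tbody></table>
"""

# Big piece: every tbody block on the page (re.S lets . match newlines).
blocks = re.findall(r"<tbody>(.*?)</tbody>", sample_html, re.S)

# Pick the block that contains the first chapter title, or None if nothing matches.
chapter_block = next((b for b in blocks if "第一夜" in b), None)

# Small pieces: the href values inside that one block.
hrefs = re.findall(r'href="(.*?)"', chapter_block) if chapter_block else []
print(hrefs)
```

Checking for None before the second findall avoids a confusing error when the page layout changes and the marker is no longer found.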

1.2 Filtering out the small pieces

Use findall() to get all the href links, then wrap the whole thing in a function

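def get_href():
    # Download the index page and decode it with the charset the site declares
    html=requests.get("https://www.kanunu8.com/book2/10748/index.html").content.decode('gb2312')
    html_ser=re.findall("tbody>(.*?)</tbody",html,re.S)
    html_sers=""
    for i in html_ser:
        # Keep the block that holds the chapter list
        if re.search("第一夜",i,re.S) is not None:
            html_sers=i

    # The hrefs are relative, so prepend the book's base URL
    s=re.findall('href="(.*?)"',html_sers,re.S)
    lst=[]
    for i in s:
        lst.append("https://www.kanunu8.com/book2/10748/"+i)
    return lst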
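String concatenation works as long as every href is a bare file name. As an alternative sketch, urllib.parse.urljoin from the standard library resolves relative, root-relative, and absolute links against the index URL. The helper name absolutize and the example hrefs are hypothetical.

```python
from urllib.parse import urljoin

INDEX_URL = "https://www.kanunu8.com/book2/10748/index.html"

def absolutize(hrefs, base=INDEX_URL):
    """Resolve each href against the page it was found on."""
    return [urljoin(base, h) for h in hrefs]

# Hypothetical hrefs: one relative to the index page, one relative to the site root.
links = absolutize(["161336.html", "/book2/10748/161337.html"])
print(links)
```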

2. Scraping a chapter

Now that we have the hrefs, we can write the scraper for a single chapter.

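def ends(href):
    # Chapter pages are decoded as gbk, a superset of gb2312
    html=requests.get(href).content.decode('gbk')
    # The chapter title sits in a crimson font tag
    title=re.search('<font color="#dc143c">(.*?)</font',html,re.S).group(1)
    # The chapter body is the first p block
    text_block=re.search('<p>(.*?)</p>',html,re.S).group(1)
    text_block=text_block.replace('<br>',"").replace('<br />',"")
    save(title,text_block)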
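The extracted body may still contain other br spellings, HTML entities, and stray whitespace. The sketch below shows one possible cleanup step. The function name clean_text and the sample string are invented for illustration, and unlike the replace call in ends, it turns each br into a newline instead of deleting it.

```python
import html
import re

def clean_text(block: str) -> str:
    """Turn an HTML fragment into plain text, one paragraph per line."""
    # Treat <br>, <br/>, <br /> and <BR> alike and turn them into newlines.
    text = re.sub(r"<br\s*/?>", "\n", block, flags=re.I)
    # Drop any other tags that are still inside the fragment.
    text = re.sub(r"<[^>]+>", "", text)
    # Decode entities such as &nbsp; into real characters.
    text = html.unescape(text)
    # Strip whitespace (including non-breaking spaces) per line and drop empty lines.
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)

# Hypothetical fragment in the shape a chapter body might have.
sample = "&nbsp;&nbsp;第一句。<br />\r\n&nbsp;&nbsp;第二句。<BR>"
print(clean_text(sample))
```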

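Write the scraped data to a txt file

def save(title,text):
    # Create the output folder first; open() fails if it does not exist
    os.makedirs('从你的全世界路过',exist_ok=True)
    with open(os.path.join('从你的全世界路过',title+'.txt'),'w',encoding="utf-8") as f:
        f.write(text)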
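Because the title goes straight into the file name, a title containing a character such as / or ? would make open() fail on some systems. A small sanitizing helper is one way to guard against that. The name safe_filename and the example titles are hypothetical, and the character set below is an assumption covering the characters Windows rejects plus the path separator.

```python
import re

def safe_filename(title: str, ext: str = ".txt") -> str:
    """Replace characters that common file systems reject in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title).strip(" .")
    # Fall back to a fixed name if nothing usable is left.
    return (cleaned or "untitled") + ext

print(safe_filename("第一夜"))
print(safe_filename("a/b?c"))
```

It would be called as safe_filename(title) in place of title+'.txt' when building the path.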

3. Starting the thread pool

href=get_href()
print(href)
# Five worker threads; map() calls ends() once per chapter link
pool=Pool(5)
pool.map(ends,href)
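To see what the thread pool does without hitting the site, the sketch below swaps the real download for a stand-in that just sleeps. The function fake_fetch and the example.com URLs are hypothetical; multiprocessing.dummy.Pool is the same thread-based pool used above.

```python
from multiprocessing.dummy import Pool  # thread-based pool with the multiprocessing.Pool API
import time

def fake_fetch(url):
    """Stand-in for ends(): sleep briefly instead of making an HTTP request."""
    time.sleep(0.1)
    return len(url)

# Hypothetical URLs, used only as input values.
urls = ["https://example.com/page%d.html" % i for i in range(10)]

start = time.perf_counter()
with Pool(5) as pool:
    # map() blocks until every call is done and returns results in input order.
    results = pool.map(fake_fetch, urls)
elapsed = time.perf_counter() - start

print(results)
print("elapsed: %.2f s" % elapsed)
```

Threads help here because the work is mostly waiting on the network. With five workers, ten waits of 0.1 s each should finish in roughly two rounds rather than ten, though the exact timing depends on the machine.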

4. Final result

Once the script finishes, each chapter is saved as its own .txt file in the 从你的全世界路过 folder.