Python Web Scraping: Yes, Yet Another Meizitu Image Crawler

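Python 3.6.4

I use requests and BeautifulSoup to fetch and parse the pages.

Target URL: http://www.mzitu.com/all/

I'm not sure how to begin, so let me paste the code first... come to think of it, how do you even paste code here...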

from bs4 import BeautifulSoup
import requests
import os
import time

url='http://www.mzitu.com/all/'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'}
url_get=requests.get(url,headers=headers)
url_soup=BeautifulSoup(url_get.content,'lxml')
soup_list=url_soup.find('div',class_="all").find_all('a')[1:]
title=0
for soup in soup_list:
    time.sleep(0.05)
    title+=1
    if not os.path.exists(os.path.join('E:\\meizitu',str(title))):
        os.makedirs(os.path.join('E:\\meizitu',str(title)))
    os.chdir(os.path.join('E:\\meizitu',str(title)))
    href=soup.get('href')#album base URL
    page_get=requests.get(href,headers=headers)
    page_soup=BeautifulSoup(page_get.content,'lxml')
    max_page_list=page_soup.find('div',class_='pagenavi').find_all('span')[-2].get_text()
    for max_page in range(1,int(max_page_list)+1):#+1 so the last page is included
        time.sleep(0.05)
        page_url = href + '/' + str(max_page)#URL of each page

        page_header={'Referer':href,
                     'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'}
        img_get = requests.get(page_url,headers=page_header)
        img_soup=BeautifulSoup(img_get.content,'lxml')
        img_url=img_soup.find('div',class_="main-image").find_all('img')#list of image tags
        for img in img_url:
            time.sleep(0.05)
            src=img.get('src')
            jpg=requests.get(src,headers=page_header)
            name=src[-6:-4]
            with open(name+'.jpg','wb') as f:#the file is closed automatically
                f.write(jpg.content)
            print(name+' downloaded')

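Turns out you can just paste it directly. OK.

The whole script doesn't seem to contain a single function... uh.

I'm a beginner myself and not used to writing functions yet, so... one step at a time.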

============================

The crawl starts here


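First, import the modules we need.

os is used to work with folders.

time keeps the crawler from working itself to death, or rather, keeps us from being blocked for sending requests too fast.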

from bs4 import BeautifulSoup
import requests
import os
import time

Next come the request headers and the URL.

url='http://www.mzitu.com/all/'
headers={'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'}

Then request the page with requests.get().

url_get=requests.get(url,headers=headers)

Then make the soup, that is, parse the page.

url_soup=BeautifulSoup(url_get.content,'lxml')
title=0

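Use find() and find_all() to get the tags we need. (Watch out: find() returns a single tag, while find_all() returns a list!)

A slice is used here to drop the first match, which is not something we need.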

soup_list=url_soup.find('div',class_="all").find_all('a')[1:]
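
To see what find(), find_all(), and the slice each return, here is a minimal sketch on a made-up HTML snippet. The markup, URLs, and link texts are assumptions rather than the real page, and the built-in html.parser replaces lxml so the example runs without that dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an index block whose first link is not an album.
html = """
<div class="all">
  <a href="http://example.com/old">archive</a>
  <a href="http://example.com/1">album one</a>
  <a href="http://example.com/2">album two</a>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
container = soup.find('div', class_='all')   # find() returns one tag (or None)
links = container.find_all('a')              # find_all() returns a list-like ResultSet
wanted = links[1:]                           # drop the first match

hrefs = [a.get('href') for a in wanted]
print(hrefs)
```

Because find() returns None when nothing matches, chaining .find_all() straight onto it raises AttributeError if the page layout changes, so a check on container may be worth adding in real use.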

The title counter (set to 0 above) will be used shortly to name the folders.

-------------------

With the soup ready, start processing each element of the list.

for soup in soup_list:
    time.sleep(0.05)
    title+=1
    if not os.path.exists(os.path.join('E:\\meizitu',str(title))):
        os.makedirs(os.path.join('E:\\meizitu',str(title)))
    os.chdir(os.path.join('E:\\meizitu',str(title)))
    href=soup.get('href')#album base URL
    page_get=requests.get(href,headers=headers)
    page_soup=BeautifulSoup(page_get.content,'lxml')
    max_page_list=page_soup.find('div',class_='pagenavi').find_all('span')[-2].get_text()

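A loop handles each element of the list.

time.sleep(0.05) makes the crawler rest for 0.05 seconds on every iteration.

title goes from 0 to 1, which is the name of the first folder. Every pass of the outer loop creates one folder, and the folder created on the n-th pass is named n.

The if not statement checks whether that folder already exists: if it does, nothing happens; if not, it is created.

This expression builds the folder path (it really doesn't have to be written this way):

os.path.join('E:\\meizitu',str(title))

Then switch into the folder, making it the current working directory:

os.chdir(os.path.join('E:\\meizitu',str(title)))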
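
As a possible alternative, here is a minimal sketch of the folder step that skips both the explicit existence check and os.chdir. The helper name album_dir and the temporary root directory are assumptions made for illustration; the script above uses E:\meizitu and changes the working directory instead.

```python
import os
import tempfile

def album_dir(root, index):
    """Create (if needed) and return the folder for album number `index`."""
    path = os.path.join(root, str(index))
    os.makedirs(path, exist_ok=True)   # no error if the folder already exists
    return path

root = tempfile.mkdtemp()              # stand-in for a real download root
first = album_dir(root, 1)
again = album_dir(root, 1)             # calling it twice is harmless
print(first)
```

Returning the full path means later code can open files with os.path.join(first, name) rather than relying on the current working directory.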

Pull the href out of each link to get the album's base URL (see the image below):

[image]

href=soup.get('href')#album base URL

Then comes more soup-making to find the highest page number. Look closely at the image and note pagenavi:

[image]

    page_get=requests.get(href,headers=headers)
    page_soup=BeautifulSoup(page_get.content,'lxml')
    max_page_list=page_soup.find('div',class_='pagenavi').find_all('span')[-2].get_text()

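Here the list is indexed with [-2] to pick the second-to-last span, and get_text() returns the text inside that tag, which is the highest page number.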
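
To make the [-2] lookup concrete, here is a small sketch run against made-up pager markup. The HTML, the page numbers, and the helper name max_page are assumptions, and the built-in html.parser stands in for lxml so the example needs no extra install.

```python
from bs4 import BeautifulSoup

# Hypothetical pager markup: the last span is "next", the one before it is the highest page.
html = """
<div class="pagenavi">
  <a href="#"><span>1</span></a>
  <a href="#"><span>2</span></a>
  <a href="#"><span>48</span></a>
  <a href="#"><span>next</span></a>
</div>
"""

def max_page(html_text):
    """Return the highest page number shown in the pager as an int."""
    soup = BeautifulSoup(html_text, 'html.parser')
    spans = soup.find('div', class_='pagenavi').find_all('span')
    return int(spans[-2].get_text())   # second-to-last span holds the page count

print(max_page(html))
```

If the real page lays out its pager differently, the index would have to change accordingly.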

for max_page in range(1,int(max_page_list)+1):#+1 so the last page is included
    time.sleep(0.05)
    page_url = href + '/' + str(max_page)#URL of each page

    page_header={'Referer':href,
                 'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'}
    img_get = requests.get(page_url,headers=page_header)
    img_soup=BeautifulSoup(img_get.content,'lxml')
    img_url=img_soup.find('div',class_="main-image").find_all('img')#list of image tags

Then comes another loop: after a short sleep, build the URL of each page, like this: [image]
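
The URL building can be sketched as a small function. The name page_urls and the example.com address are assumptions for illustration only.

```python
def page_urls(href, max_page):
    """Build the URL of every page in an album, pages 1..max_page inclusive."""
    base = href.rstrip('/')            # avoid a double slash if href ends with '/'
    return [base + '/' + str(n) for n in range(1, max_page + 1)]

urls = page_urls('http://example.com/album', 3)
for u in urls:
    print(u)
```

Note that range(1, n) stops at n - 1, so the upper bound has to be max_page + 1 for the last page to be included.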

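A different set of request headers is used here to get past hotlink protection (a Referer is added). Without it, the download returns the site's anti-hotlinking placeholder image instead of the real one.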
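
Since the same headers are rebuilt on every page, one option is a tiny helper like the sketch below. The names USER_AGENT and image_headers and the example.com address are assumptions; the User-Agent string is the one from the script.

```python
USER_AGENT = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
              '(KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36')

def image_headers(referer):
    """Headers for page and image requests: browser User-Agent plus a Referer."""
    return {'Referer': referer, 'User-Agent': USER_AGENT}

headers_for_album = image_headers('http://example.com/album')
print(headers_for_album['Referer'])
# Intended use inside the crawler (not executed here):
#   requests.get(src, headers=image_headers(href), timeout=10)
```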

More soup, and yet more soup... (I'm starting to see why functions matter.)

for img in img_url:
    time.sleep(0.05)
    src=img.get('src')
    jpg=requests.get(src,headers=page_header)
    name=src[-6:-4]
    with open(name+'.jpg','wb') as f:#the file is closed automatically
        f.write(jpg.content)
    print(name+' downloaded')

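Then the inner loop begins, with another sleep.

Get the URL from the src attribute in the same way.

Finally, the download itself: requests.get() fetches the image content.

The name= line takes a slice of src to use as the file name (take as many characters as you like, preferably two or more).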
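
For comparison, here is a sketch that derives the whole file name from the URL instead of slicing a fixed number of characters. The helper name image_name and the sample URL are made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit

def image_name(src):
    """Return the file name part of an image URL."""
    return posixpath.basename(urlsplit(src).path)

src = 'http://example.com/2018/01/15a03.jpg'   # made-up URL
short = src[-6:-4]        # the slice used in the post: two characters before '.jpg'
full = image_name(src)    # the complete file name, extension included
print(short, full)
```

Using the full name may help avoid two images overwriting each other when their last two characters happen to match.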

At last, save the file.
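
One thing to watch when saving: writing f.close without parentheses only names the method and never calls it, so the file is not explicitly closed. A with block sidesteps the issue. The sketch below uses a hypothetical save_bytes helper, a temporary folder, and fake bytes in place of a real download.

```python
import os
import tempfile

def save_bytes(folder, name, data):
    """Write raw bytes to folder/name and return the full path."""
    path = os.path.join(folder, name)
    with open(path, 'wb') as f:    # closed automatically, even if write() raises
        f.write(data)
    return path

demo_dir = tempfile.mkdtemp()      # stand-in for an album folder
saved = save_bytes(demo_dir, '01.jpg', b'\xff\xd8fake image bytes')
print(saved)
```

In the crawler, data would be jpg.content from the image request.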

A "downloaded" message is printed after every image, like this:

[image]

After that, look after your health. See the image below: [image]

So writing a blog post is this tiring...
