Scraping Images from a URL with Python

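The previous post covered how to fetch an image from its URL. This one looks at how to fetch the images that appear on a web page, given the page's address.

The code comes from the WeChat official account CVPy, which I stumbled upon on CSDN while training a face classifier with AdaBoost. The author is clearly an interesting and very capable person, and I am learning from him. While debugging the code I also ran into a few problems of my own: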


Error: ValueError: Couldn't find a tree builder with the features you requested: xml. Do you need to install a parser library?
Fix: install lxml, e.g. easy_install lxml (pip install lxml also works)
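
One way to guard against this error is to check whether lxml is importable before naming it as the parser, and to fall back to Python's built-in html.parser otherwise. This is only a minimal sketch and not part of the original code; the helper name pick_parser is made up for illustration.

```python
import importlib.util


def pick_parser():
    """Return "lxml" if the lxml package is importable, else "html.parser"."""
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec("lxml") is not None:
        return "lxml"
    return "html.parser"


print(pick_parser())
```

The returned name can be passed as the second argument to BeautifulSoup, for example BeautifulSoup(content, pick_parser()).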

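Error: BeautifulSoup cannot be imported or called properly
Fix: pip install BeautifulSoup alone is not enough; install the full BS4 package (pip install beautifulsoup4)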

Error: AttributeError: 'str' object has no attribute 'startwith'
Fix: the method that checks whether a string starts with a given prefix is startswith, not startwith
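
str.startswith also accepts a tuple of prefixes, which is handy for the kind of URL check used in the script. A small sketch follows; is_http_url is a hypothetical helper, and treating a missing src as "not a URL" is my own assumption rather than something the original code does.

```python
def is_http_url(src):
    """Return True if src is a string starting with http:// or https://."""
    # img.get('src') can return None when the tag has no src attribute.
    if not isinstance(src, str):
        return False
    return src.startswith(("http://", "https://"))


print(is_http_url("https://example.com/a.jpg"))  # True
print(is_http_url("/static/a.jpg"))              # False
print(is_http_url(None))                         # False
```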


The complete code is as follows:

# -*- coding: utf-8 -*-
import os
import urllib.request

import cv2
import numpy as np
import requests as req
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/question/37787176'
headers = {'User-Agent' : 'Mozilla/5.0 (Linux;Android 6.0;Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/58.0.3029.96 Mobile Safari/537.36'}
response = req.get(url, headers=headers)
content = response.text  # decoded HTML of the page
print(content)
soup = BeautifulSoup(content, "lxml")
images = soup.find_all('img')
print("Found %d images" % len(images))

if not os.path.exists("images"):
    os.mkdir("images")
for i, img in enumerate(images):
    print("Processing image %d..." % (i + 1))
    img_src = img.get('src')
    # Skip <img> tags without a src, as well as relative and data: URLs.
    if img_src and img_src.startswith("http"):
        resp = urllib.request.urlopen(img_src)
        # Turn the raw bytes into a uint8 array and decode it as a BGR image.
        data = np.asarray(bytearray(resp.read()), dtype="uint8")
        image = cv2.imdecode(data, cv2.IMREAD_COLOR)
        if image is None:  # the bytes were not a decodable image
            continue
        h, w = image.shape[:2]  # shape is (rows, cols), i.e. (height, width)
        print(w, h)
        # Target path in case the image is saved; not used by the display step.
        img_path = "images/" + str(i + 1) + ".jpg"
        # Only show reasonably large images, each for three seconds.
        if h >= 200 and w > 200:
            cv2.imshow("Image", image)
            cv2.waitKey(3000)
print("Done")

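Running the script shows each sufficiently large image in an OpenCV window. There are several ways to store the images; they are covered in the previous post, "Python通过url获取图片的几种方法" (Several Ways to Fetch an Image from a URL with Python).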
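
As one illustration of storing an image, the raw bytes returned by the server can be written straight to disk without decoding them. This is a minimal sketch that assumes the bytes have already been downloaded; save_image_bytes and its parameters are hypothetical names, not taken from the original post.

```python
import os


def save_image_bytes(data, directory, index, ext=".jpg"):
    """Write raw image bytes to <directory>/<index><ext> and return the path."""
    os.makedirs(directory, exist_ok=True)  # create the folder if needed
    path = os.path.join(directory, str(index) + ext)
    with open(path, "wb") as f:
        f.write(data)  # bytes are written unchanged, no re-encoding
    return path
```

In the download loop this could be called as save_image_bytes(raw, "images", i + 1), where raw is the result of resp.read() kept in a variable before decoding. Note that the .jpg extension is only a label here, since the bytes are stored exactly as received.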

