How do we scrape images from a website?
Problem description:
We want to get all the images from this website: http://www.theft-alerts.com. We need the images from 19 pages. So far we have this code, but it doesn't work. We want the images saved into a new folder. How do we scrape the images from the website?
#!/usr/bin/python
import urllib2
from bs4 import BeautifulSoup
from urlparse import urljoin
url = "http://www.theft-alerts.com/index-%d.html"
page = urllib2.urlopen(url).read()
soup = BeautifulSoup(page, "html.parser")
base = "http://www.theft-alerts.com"
images = [urljoin(base,a["href"]) for a in soup.select("td a[href^=images/]")]
for url in images:
    img = BeautifulSoup(urllib2.urlopen(url).read(), "lxml").find("img")["src"]
    with open("myimages/{}".format(img), "w") as f:
        f.write(urllib2.urlopen("{}/{}".format(url.rsplit("/", 1)[0], img)).read())
Answer:
You need to loop over each page and extract the images. You can keep requesting the next page for as long as the last anchor inside the code tag with the class resultnav has the text "Next":
import requests
from bs4 import BeautifulSoup
from urlparse import urljoin

def get_pages(start):
    soup = BeautifulSoup(requests.get(start).content)
    # Grab the image sources from the first page.
    images = [img["src"] for img in soup.select("div.itemspacingmodified a img")]
    yield images
    # The last anchor inside code.resultnav leads to the next page.
    nxt = soup.select("code.resultnav a")[-1]
    while True:
        soup = BeautifulSoup(requests.get(urljoin(url, nxt["href"])).content)
        nxt = soup.select("code.resultnav a")[-1]
        # Stop as soon as the last anchor is no longer the "Next" link.
        if nxt.text != "Next":
            break
        yield [img["src"] for img in soup.select("div.itemspacingmodified a img")]

url = "http://www.theft-alerts.com/"
for images in get_pages(url):
print(images)
That gives you the images from all 19 pages.
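
The generator above only yields the image URLs. Since the question asks for the images saved into a new folder, here is a minimal sketch of the download step, reusing get_pages and url from the answer above; the folder name myimages is taken from the question's code, and resolving each src against the site root is an assumption about how the links are written, not something confirmed by the answer:

import os
import requests
from urlparse import urljoin

out_dir = "myimages"  # folder name taken from the question's code
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

for images in get_pages(url):  # get_pages and url come from the answer above
    for src in images:
        # Resolve the (possibly relative) src against the site root and fetch the bytes.
        resp = requests.get(urljoin(url, src))
        # Use the last path component of the src as the local file name.
        with open(os.path.join(out_dir, os.path.basename(src)), "wb") as f:
            f.write(resp.content)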
Comment on the question: "It doesn't work", do you know why? At the very least, your url contains a parameter that you haven't filled in yet.
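
The parameter the comment refers to is the %d placeholder in the question's url string, which is never substituted before urlopen is called. A minimal sketch of filling it in, assuming the 19 pages are named index-1.html through index-19.html (an assumption inferred from the question, not verified against the site):

# Assumed page numbering: index-1.html ... index-19.html (not verified).
url_template = "http://www.theft-alerts.com/index-%d.html"
page_urls = [url_template % n for n in range(1, 20)]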