Python3 Crawler: A Simple Example, Part 1 (Requests + Regex)
Scraping the TOP 100 board of a movie site (Maoyan)
1. Analyze the page source in order to write the regular expression
Here is the key part of the page's HTML source:
Then write a regular expression targeting it:
```python
pattern = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S)
```
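To sanity-check the pattern, it can be run against a minimal, hand-written `<dd>` snippet modeled on the board's markup. Note the snippet below is illustrative, not the site's actual source:

```python
import re

# Illustrative snippet modeled on the board's markup; not the live page source.
html = '''<dd>
  <i class="board-index board-index-1">1</i>
  <img data-src="http://example.com/poster.jpg" alt="">
  <p class="name"><a href="/films/1" title="霸王别姬">霸王别姬</a></p>
  <p class="star">
    主演:张国荣,张丰毅,巩俐
  </p>
  <p class="releasetime">上映时间:1993-01-01</p>
  <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
</dd>'''

pattern = re.compile(
    r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
    r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
    r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
    re.S)

# Each match is a 7-tuple: (index, image, title, star, releasetime, integer, fraction)
item = re.findall(pattern, html)[0]
print(item[0], item[2], item[5] + item[6])  # 1 霸王别姬 9.6
```

`re.S` is essential here: it lets `.` match newlines, so the non-greedy `.*?` gaps can span the line breaks between tags.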
Then flesh it out into a complete script:
```python
import requests
from requests.exceptions import RequestException
import re
import json


def get_one_page(url, headers):
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def parse_one_page(html):
    pattern = re.compile(
        r'<dd>.*?board-index.*?>(\d+)</i>.*?data-src="(.*?)".*?name"><a'
        r'.*?>(.*?)</a>.*?star">(.*?)</p>.*?releasetime">(.*?)</p>'
        r'.*?integer">(.*?)</i>.*?fraction">(.*?)</i>.*?</dd>',
        re.S)
    items = re.findall(pattern, html)
    for item in items:
        yield {
            "index": item[0],
            "image": item[1],
            "title": item[2],
            "actor": item[3].strip()[3:],  # drop the "主演:" prefix
            "time": item[4].strip()[5:],   # drop the "上映时间:" prefix
            "score": item[5] + item[6],    # integer part + fraction part
        }


def write_to_file(content):
    with open("result.txt", "a", encoding="utf-8") as f:
        f.write(json.dumps(content, ensure_ascii=False) + "\n")


def main(offset):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"
    }
    url = "http://maoyan.com/board/4?offset=" + str(offset)
    html = get_one_page(url, headers)
    if html:  # skip pages that failed to download
        for item in parse_one_page(html):
            write_to_file(item)


if __name__ == '__main__':
    for i in range(10):
        main(i * 10)
```
A browser-like User-Agent header must be supplied. By default, requests identifies itself as python-requests/&lt;version&gt;, which the site rejects, so it has to be replaced with a browser's User-Agent string or the request will fail.
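The default identity and how to override it can be seen directly (the exact version number will vary with the installed requests release):

```python
import requests

# requests announces itself as "python-requests/<version>" unless overridden.
print(requests.utils.default_user_agent())  # e.g. python-requests/2.31.0

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36"
}
# Passing headers= replaces the default for that one request:
# requests.get(url, headers=headers)
```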
Exception handling is added around the request so that network errors do not crash the script.
The text content saved to result.txt looks like this:
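Each saved line is one movie's dict serialized as JSON. The `ensure_ascii=False` argument keeps the Chinese text readable in the file rather than escaping it as `\uXXXX` sequences. A minimal illustration with a hand-made item:

```python
import json

# Hand-made sample item, shaped like what parse_one_page yields.
item = {"index": "1", "title": "霸王别姬", "score": "9.6"}

print(json.dumps(item, ensure_ascii=False))
# {"index": "1", "title": "霸王别姬", "score": "9.6"}

print(json.dumps(item))  # default ensure_ascii=True escapes the Chinese
# {"index": "1", "title": "\u9738\u738b\u522b\u59ec", "score": "9.6"}
```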
The script can be improved further by adding a process pool to speed up the crawl.
```python
from multiprocessing import Pool

'''.................'''

if __name__ == '__main__':
    pool = Pool()
    pool.map(main, [i * 10 for i in range(10)])
```