[Hands-On] Free Proxies!
Introduction
For an individual crawler developer, one of the most frustrating problems is getting proxy IPs. Today we will build a usable proxy IP pool by hand.
Requirements
Scrape the usable high-anonymity proxies from the Xici proxy site (西刺代理).
Key Points
Fetching data: Requests
Parsing data: BeautifulSoup
Database: MongoDB
Main Code
The page structure is simple, so I won't walk through it in detail; here is part of the code.
Sending the request with Requests:
def get_response(self):
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36",
    }
    self.response = requests.get(self.url, headers=headers).text
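The call above has no timeout or status check, so a slow or broken page can hang the crawler. A more defensive variant (the function name and retry count here are illustrative, not from the project) might look like:

```python
import requests

def fetch(url, retries=3, timeout=10):
    """Fetch a page, retrying on network errors; return the body text.

    Re-raises the last error if every attempt fails.
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    }
    last_error = None
    for _ in range(retries):
        try:
            resp = requests.get(url, headers=headers, timeout=timeout)
            resp.raise_for_status()  # treat 4xx/5xx responses as failures too
            return resp.text
        except requests.RequestException as err:
            last_error = err
    raise last_error
```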
Extracting the target content:
def get_ip_info_list(self):
    soup = BeautifulSoup(self.response, "lxml")
    ip_list = soup.find(id="ip_list")
    ip_detail = ip_list.find_all(name="tr")[1:]  # skip the table header row
    for ip in ip_detail:
        tds = ip.find_all(name="td")
        item = {}
        item['ip'] = tds[1].string
        item['port'] = tds[2].string
        try:
            item['location'] = tds[3].find(name="a").string
        except AttributeError:  # some rows have no <a> in the location cell
            item['location'] = tds[3].string.strip()
        item['anonymous'] = tds[4].string
        item['type'] = tds[5].string
        item['speed'] = tds[6].find(class_='bar').attrs['title']
        item['connect_time'] = tds[7].find(class_='bar').attrs['title']
        item['alive_time'] = tds[8].string
        item['verify_time'] = tds[9].string
        # rows are ordered by verify time, so stop at the first stale one
        if self.check_verify_time("20" + item['verify_time'].split(" ")[0]):
            if not check_proxy_duplicate(item):
                yield item
        else:
            return
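check_verify_time is not shown above; it decides whether a proxy's last verification is recent enough to bother keeping. A possible sketch, assuming the caller passes a full date string (the site prints a two-digit year, which is why the code above prepends "20") and that "recent" means within the last few days:

```python
from datetime import datetime, timedelta

def check_verify_time(date_string, max_age_days=3):
    """Return True if the proxy was verified within the last few days.

    `date_string` is a full date like "2018-08-21", i.e. the site's
    two-digit year with "20" already prepended by the caller.
    The cutoff of 3 days is an assumption, not from the project.
    """
    verified = datetime.strptime(date_string, "%Y-%m-%d")
    return datetime.now() - verified <= timedelta(days=max_age_days)
```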
Checking whether an IP already exists:
def check_proxy_duplicate(proxy):
    ip = proxy['ip']
    client = pymongo.MongoClient()
    collection = client['proxy']['proxy']
    ip_exist = list(collection.find({"ip": ip}))
    client.close()
    if ip_exist:
        print("%s already exists" % ip)
        return True
    return False
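Materializing the whole cursor just to test for existence is wasteful. A leaner check, assuming pymongo 3.7+ (which provides count_documents) and a collection handle passed in so one MongoClient can be reused across calls, could look like:

```python
def check_proxy_duplicate(collection, proxy):
    """Return True if a proxy with this IP is already stored.

    `collection` is a pymongo collection handle passed in by the
    caller, e.g. pymongo.MongoClient()['proxy']['proxy'], so the
    connection is opened once rather than on every check.
    """
    exists = collection.count_documents({"ip": proxy['ip']}) > 0
    if exists:
        print("%s already exists" % proxy['ip'])
    return exists
```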
Storing into the database:
def save_mongo(item):
    client = pymongo.MongoClient()
    collection = client['proxy']['proxy']
    collection.insert_one(item)
    client.close()
    print("%s saved" % item['ip'])
Checking whether a proxy works:
def check_proxy_enable(proxy):
    # requests matches proxies by lowercase URL scheme, so normalize
    # the stored type ("HTTP"/"HTTPS") and test against a URL of the
    # same scheme; otherwise the request bypasses the proxy entirely
    # and the check always "succeeds"
    scheme = proxy['type'].lower()
    proxy_string = "%s://%s:%s" % (scheme, proxy['ip'], proxy['port'])
    proxy_for_check = {scheme: proxy_string}
    try:
        requests.get(scheme + "://www.sina.com.cn",
                     proxies=proxy_for_check, timeout=10)
    except requests.RequestException:
        del_proxy_from_mongo(proxy)     # dead proxy: drop it
    else:
        update_proxy_from_mongo(proxy)  # alive: refresh its record
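The lowercase-key requirement is easy to get wrong: a proxies dict keyed "HTTP" is silently ignored by requests. A small helper (the name build_proxies is hypothetical, not from the project) makes the normalization explicit and reusable:

```python
def build_proxies(proxy):
    """Build a requests-style proxies dict from a stored proxy record.

    requests looks proxies up by lowercase scheme, so the stored
    type ("HTTP"/"HTTPS") must be lowercased before use as the key.
    """
    scheme = proxy['type'].lower()
    address = "%s://%s:%s" % (scheme, proxy['ip'], proxy['port'])
    return {scheme: address}
```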
Results
Viewed in MongoDB:
Source Code
Link: https://pan.baidu.com/s/1MgwmhUKLnpTKnI-HF2JChQ Extraction code: 4lwi