python3 requests + bs4: crawling an IP proxy pool on a loop
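A while ago I built a small desktop tool for my company that scrapes Taobao reviews.
IP addresses were a constant headache: at a low request rate the downloads were far too slow,
and at a high rate the IP got banned after a few requests. So I read up on it and built an IP proxy pool.
Step 1: find a proxy site
I looked at a few and found Xici (西刺) the friendliest. Xici URL: http://www.xicidaili.com/nn/
The page is fairly simple. I originally wanted to use xpath, but I hadn't used bs4 much, so I decided to practice with bs4.
Here is the code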
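import requests
from bs4 import BeautifulSoup

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER'}


def get_ip(url):
    web_data = requests.get(url, headers=header).content.decode()
    # Name the parser explicitly; BeautifulSoup(web_data) on its own raises a "no parser specified" warning
    soup = BeautifulSoup(web_data, 'html.parser')
    # Find every <tr>
    ips = soup.find_all('tr')
    # Note: this one is ip_lists, with an s (all candidates)
    ip_lists = []
    # Start at 1 to skip the table header row
    for i in range(1, len(ips)):
        tds = ips[i].find_all('td')
        # requests looks proxies up by lowercase scheme, so the key must be 'http'/'https', not 'HTTP'
        scheme = tds[5].text.strip().lower()
        ip_lists.append({scheme: 'http://' + tds[1].text.strip() + ':' + tds[2].text.strip()})
    # Create the empty list outside the loop; every proxy that passes the check is appended
    # Note: this one is ip_list, without an s (validated proxies)
    ip_list = []
    print(ip_lists)
    for ip in ip_lists:
        addr = next(iter(ip.values()))
        try:
            # The timeout doubles as a response-time check
            # Caveat: an entry keyed 'https' is not exercised by this http:// test URL
            requests.get('http://tool.oschina.net/codeformat/js', proxies=ip, timeout=3)
            print(addr + ' saved')
            ip_list.append(ip)
        except requests.RequestException:
            print(addr + ' timed out, discarded')
    print('IP list updated')


if __name__ == '__main__':
    url = 'http://www.xicidaili.com/nn/'
    get_ip(url)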
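One thing I noticed: checking the proxies one after another with a 3 second timeout can take minutes for a full page. Below is a minimal sketch of running the checks in parallel with the standard library. The names filter_alive and fake_check are made up for this illustration; the check argument is any function that takes one proxies dict and returns True or False, for example a wrapper around the requests.get call above.

```python
from concurrent.futures import ThreadPoolExecutor


def filter_alive(proxies, check, workers=20):
    """Run check(proxy) for every proxy in parallel; keep those returning True."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with proxies
        results = list(pool.map(check, proxies))
    return [proxy for proxy, ok in zip(proxies, results) if ok]


if __name__ == '__main__':
    # Stand-in checker for demonstration only; a real one would issue the
    # requests.get(..., proxies=proxy, timeout=3) call and return True on success.
    def fake_check(proxy):
        return proxy['http'].endswith(':80')

    candidates = [{'http': 'http://10.0.0.1:80'}, {'http': 'http://10.0.0.2:8080'}]
    print(filter_alive(candidates, fake_check))
```

Threads fit here because the work is almost entirely waiting on the network; the worker count of 20 is an arbitrary starting point, not a tuned value.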
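That works, but it is nowhere near enough, because it is awkward to use... so, time to optimize.
Since it is awkward to use, let's bring in a database.
Even better if it crawls on a schedule and runs on a server.
So, let's get to it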
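For the scheduled crawl, threading.Timer from the standard library is one option. It fires only once per start(), so the function has to re-arm it after every run. Here is a rough sketch of that pattern; run_every and tick are names I made up for the illustration, and the stop event is an extra that a run-forever script may not need.

```python
import threading


def run_every(interval, func, *args):
    """Call func(*args) now, then again `interval` seconds after each run finishes.

    Returns a threading.Event; call .set() on it to stop the schedule.
    """
    stop = threading.Event()

    def tick():
        if stop.is_set():
            return
        func(*args)
        # threading.Timer fires only once, so schedule the next run by hand
        threading.Timer(interval, tick).start()

    tick()
    return stop


if __name__ == '__main__':
    import time

    stop = run_every(1, print, 'crawling...')  # prints immediately, then once a second
    time.sleep(3.5)
    stop.set()  # the pending timer fires once more, sees the flag and does nothing
```

With the crawler, the call would look like run_every(3600, get_ip, url), assuming get_ip does not start a timer of its own.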
import requests
from bs4 import BeautifulSoup
import random
import pymysql
from threading import Timer
user_agent_list = [
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
'Opera/8.0 (Windows NT 5.1; U; en)',
'Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.50',
'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0',
'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2 ',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36',
'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0) ',
'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
'Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E)',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)',
'Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20',
'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
'Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)',
'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.12) Gecko/20070731 Ubuntu/dapper-security Firefox/1.5.0.12',
'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; QQDownload 732; .NET4.0C; .NET4.0E; LBBROWSER)',
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1',
'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.3 Mobile/14E277 Safari/603.1.30',
'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
]
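# Messy, isn't it? Consider it a freebie: plenty of agents here, feel free to take them
# Pick one at random (this runs once, when the script starts)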
UserAgent = random.choice(user_agent_list)
header = {'User-Agent': UserAgent}
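def get_ip(url):
    web_data = requests.get(url, headers=header).content.decode()
    # Name the parser explicitly; BeautifulSoup(web_data) on its own raises a "no parser specified" warning
    soup = BeautifulSoup(web_data, 'html.parser')
    # Find every <tr>
    ips = soup.find_all('tr')
    # Note: this one is ip_lists, with an s (all candidates)
    ip_lists = []
    # Start at 1 to skip the table header row
    for i in range(1, len(ips)):
        tds = ips[i].find_all('td')
        # requests looks proxies up by lowercase scheme, so the key must be 'http'/'https', not 'HTTP'
        scheme = tds[5].text.strip().lower()
        ip_lists.append({scheme: 'http://' + tds[1].text.strip() + ':' + tds[2].text.strip()})
    # Create the empty list outside the loop; every proxy that passes the check is appended
    # Note: this one is ip_list, without an s (validated proxies)
    ip_list = []
    print(ip_lists)
    for ip in ip_lists:
        addr = next(iter(ip.values()))
        try:
            # The timeout doubles as a response-time check
            # Caveat: an entry keyed 'https' is not exercised by this http:// test URL
            requests.get('http://tool.oschina.net/codeformat/js', proxies=ip, timeout=3)
            print(addr + ' saved')
            ip_list.append(ip)
        except requests.RequestException:
            print(addr + ' timed out, discarded')
    # The one line added compared with the first version
    save_ip(ip_list)
    print('IP list updated')
    # Key point:
    # Timer comes from threading and runs a function once after the given delay. Because get_ip
    # re-arms it on every run, the crawl repeats every 3600 seconds. The url has to be passed
    # through args (otherwise the timed call fails), and nothing happens until start() is called.
    timer = Timer(3600, get_ip, args=(url,))
    timer.start()

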
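def save_ip(ip_list):
    # Connect to the database; I've masked the address, haha
    conn = pymysql.connect(host="129.28.***.***",
                           port=3306,
                           user="root",
                           password="werf96520",
                           database="tb"
                           )
    cursor = conn.cursor()  # Most tutorials call this "creating a cursor"; I think of it as creating an object
    # Every run stores a fresh list, so clear out the old data first
    cursor.execute("truncate table ip")
    # Pass the value as the second argument to execute(); that is what guards against injection.
    # Formatting it into the string with % does not (not that anyone is attacking me...)
    sql = 'insert into ip(ip) values (%s)'
    for proxy in ip_list:
        # Each entry is a one-item dict {scheme: address}; store the address
        for addr in proxy.values():
            cursor.execute(sql, (addr,))
    # Writes need commit(); reads use fetchall()
    conn.commit()
    # Close the cursor
    cursor.close()
    # Close the database connection
    conn.close()

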
if __name__ == '__main__':
    url = 'http://www.xicidaili.com/nn/'
    get_ip(url)
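A note on SQL injection in save_ip: a driver can only protect you when the values are handed to execute() as a separate argument, not when they are pasted into the SQL string with % formatting. The sketch below shows the idea with sqlite3 from the standard library, swapped in so it runs without a MySQL server. Be aware that sqlite3 writes placeholders as ?, while pymysql uses %s. The function name save_addresses and the sample values are made up; the table and column name ip come from the code above.

```python
import sqlite3


def save_addresses(conn, addresses):
    """Replace the contents of table `ip` with the given addresses."""
    cur = conn.cursor()
    cur.execute('CREATE TABLE IF NOT EXISTS ip (ip TEXT)')
    cur.execute('DELETE FROM ip')  # SQLite has no TRUNCATE; DELETE clears the table
    # Values travel separately from the SQL text, so quotes inside them are harmless
    cur.executemany('INSERT INTO ip (ip) VALUES (?)', [(a,) for a in addresses])
    conn.commit()
    cur.close()


if __name__ == '__main__':
    conn = sqlite3.connect(':memory:')  # throwaway in-memory database
    save_addresses(conn, ['http://10.0.0.1:80', 'weird"); DROP TABLE ip; --'])
    print(conn.execute('SELECT ip FROM ip ORDER BY rowid').fetchall())
    conn.close()
```

The second sample value is stored as plain text instead of being executed, which is the whole point of parameter binding.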
Think that's the end of it???
One last thing
To make it easier for myself and others to use, let's package it up.
pip install pyinstaller
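Once it is installed you can start packaging. -F (short for --onefile) means bundle everything into a single exe; without -F you end up with a whole pile of files, which feels cluttered.
Result
I put it on a server and it runs well. Note, though, that this only checks whether an IP is usable; it does not check whether the proxy is high-anonymity.
If you want to check that, you can try https://ip.cn yourself.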
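If you would rather check anonymity in code, one common approach is to send a request through the proxy to a service that echoes back what it received, then look for your real IP or for proxy-specific headers. The sketch below only covers the classification step. The function name classify_proxy, the header list and the three labels are my own assumptions based on the usual transparent / anonymous / high-anonymity convention, not something the script above does.

```python
import re

# Headers that proxies commonly add; the list is illustrative, not exhaustive
PROXY_HEADERS = ('via', 'x-forwarded-for', 'forwarded', 'x-real-ip', 'proxy-connection')


def classify_proxy(real_ip, origin, headers):
    """Classify a proxy from what an echo service saw.

    real_ip: your own public IP (looked up without a proxy)
    origin:  the client address the echo service reported through the proxy
    headers: the request headers the echo service received, as a dict
    """
    lowered = {k.lower(): str(v) for k, v in headers.items()}
    seen = [origin] + list(lowered.values())
    # Split on separators so '1.2.3.4' does not match inside '11.2.3.45'
    leaked = any(real_ip in re.split(r'[,;=\s]+', value) for value in seen)
    if leaked:
        return 'transparent'
    if any(name in lowered for name in PROXY_HEADERS):
        return 'anonymous'
    return 'high-anonymity'


if __name__ == '__main__':
    print(classify_proxy('1.2.3.4', '5.6.7.8', {'Via': '1.1 squid'}))
```

The inputs could come from an echo endpoint such as httpbin.org/get, whose JSON response (as far as I know) carries origin and headers fields; any service that reports the same information would do.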