Scraping the Youdict dictionary with Python
Using Python to crawl the Youdict dictionary and build an index of the results.
Prerequisites:
1. Basic Python
2. Basic networking knowledge
3. How web crawlers work
4. The requests module
5. The BeautifulSoup module (bs4)
6. Basic database knowledge
7. The pymysql module
I won't repeat here how to install Python or how to pip-install requests, pymysql, and BeautifulSoup; there are plenty of tutorials online (for this stage, a search engine is your friend).
With those in place, we can start writing the crawler.
1. Define the target. Target site: http://www.youdict.com/ciku/
Target elements: the word (English and Chinese), the word's link, and the image link.
2. Fetch a page and extract those elements:
```python
newsurl = 'http://www.youdict.com/ciku/id_5_0_0_0_0.html'
res = requests.get(newsurl)
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
divs = soup.select('.col-sm-6')                  # one div per dictionary entry
for each_div in divs:
    english = each_div.div.div.h3.a.text         # the English word
    imgurl = transurl(each_div.div.img['src'])   # absolute image URL
    chinese = each_div.div.p.text                # the Chinese definition
    insert(english, chinese, imgurl)             # write the entry to MySQL
```
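To see what that selector chain actually extracts, here is a sketch run against a hand-written HTML snippet. The snippet is my own guess at the structure of a `.col-sm-6` entry (an inner div holding an img, an h3/a with the word, and a p with the definition); the real page markup may differ.

```python
from bs4 import BeautifulSoup

# Hand-written snippet mimicking the assumed structure of one entry.
html = '''
<div class="col-sm-6">
  <div>
    <img src="/upload/abandon.jpg"/>
    <div><h3><a href="/w/abandon">abandon</a></h3></div>
    <p>v. 放弃; 抛弃</p>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
for each_div in soup.select('.col-sm-6'):
    english = each_div.div.div.h3.a.text                 # first h3/a under the inner divs
    imgurl = 'http://www.youdict.com' + each_div.div.img['src']
    chinese = each_div.div.p.text
    print(english, imgurl, chinese)
```

Note that `tag.div`, `tag.img`, etc. are BeautifulSoup shortcuts for `tag.find('div')` / `tag.find('img')`, which search descendants recursively; that is what lets the short attribute chain reach the nested elements.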
3. Build each page's URL from the site's pagination pattern:

```python
newsurl = 'http://www.youdict.com/ciku/id_5_0_0_0_' + str(i) + '.html'
```

where i is the page index supplied by the loop.
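Wrapped in a small helper, the pagination pattern above can be sketched like this (the helper name `build_page_url` is mine, not from the original code):

```python
def build_page_url(i):
    # Page i of the word list follows the site's id_5_0_0_0_<i>.html pattern.
    return 'http://www.youdict.com/ciku/id_5_0_0_0_' + str(i) + '.html'
```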
4. Connect to the database and insert each entry:

```python
def insert(english, chinese, imgurl):
    db = pymysql.connect(host='localhost', user='root',
                         password='your db pass', database='your db name')
    cursor = db.cursor()
    # Escape the values before splicing them into the SQL string
    english = pymysql.escape_string(english)
    chinese = pymysql.escape_string(chinese)
    imgurl = pymysql.escape_string(imgurl)
    sql = ("insert into reaserchwords(english, chinese, imgurl) "
           "values('" + english + "','" + chinese + "','" + imgurl + "')")
    cursor.execute(sql)
    db.commit()
    db.close()
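Escaping plus string concatenation works, but letting pymysql parameterize the values is safer and simpler. A minimal sketch of that alternative (the helper `insert_params`, which just builds the statement and parameter tuple, is mine; a real call would pass them to `cursor.execute(sql, params)` on a live connection):

```python
# Parameterized form: %s placeholders, values passed separately.
SQL = "insert into reaserchwords(english, chinese, imgurl) values (%s, %s, %s)"

def insert_params(english, chinese, imgurl):
    # Return the statement and the parameter tuple for cursor.execute(sql, params).
    return SQL, (english, chinese, imgurl)
```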
5. The complete crawler, put together:

```python
# coding=utf-8
'''
Created on 2018.8.18
@author: ZEC---
'''
import requests
import pymysql
from bs4 import BeautifulSoup


def insert(english, chinese, imgurl):
    db = pymysql.connect(host='localhost', user='root',
                         password='your db pass', database='your db name')
    cursor = db.cursor()
    # Escape the values before splicing them into the SQL string
    english = pymysql.escape_string(english)
    chinese = pymysql.escape_string(chinese)
    imgurl = pymysql.escape_string(imgurl)
    sql = ("insert into reaserchwords(english, chinese, imgurl) "
           "values('" + english + "','" + chinese + "','" + imgurl + "')")
    cursor.execute(sql)
    db.commit()
    db.close()


def transurl(url):
    # Turn the relative image path into an absolute URL
    url = 'http://www.youdict.com' + url
    return url.strip()


def main_thread(start, end):
    for i in range(start, end):
        newsurl = 'http://www.youdict.com/ciku/id_5_0_0_0_' + str(i) + '.html'
        res = requests.get(newsurl)
        res.encoding = 'utf-8'
        soup = BeautifulSoup(res.text, 'html.parser')
        for each_div in soup.select('.col-sm-6'):
            english = each_div.div.div.h3.a.text
            imgurl = transurl(each_div.div.img['src'])
            chinese = each_div.div.p.text
            insert(english, chinese, imgurl)
        print('page ' + str(i + 1) + ' is ok')


main_thread(67, 274)
```
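The name main_thread suggests the page range could be split across several worker threads. A sketch of that idea under my own assumptions (the splitting logic and the dummy worker are mine; a real version would have each thread run the crawl loop over its chunk):

```python
import threading

def split_range(start, end, n):
    # Split [start, end) into n roughly equal [lo, hi) chunks.
    step = (end - start + n - 1) // n
    return [(lo, min(lo + step, end)) for lo in range(start, end, step)]

# Example: cover pages 67..273 with 4 threads. Here a dummy worker just
# records the page numbers it would crawl, under a lock.
visited = []
lock = threading.Lock()

def worker(lo, hi):
    for i in range(lo, hi):
        with lock:
            visited.append(i)

threads = [threading.Thread(target=worker, args=chunk)
           for chunk in split_range(67, 274, 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Note that each thread would open its own database connection (as the insert function above already does per call), since pymysql connections are not meant to be shared across threads.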
The word-search page I built on top of this data is shown in the figure.
Live search example: www.senlear.com/words