Python Web Crawler Study Notes (13): Crawling WeChat Article Data

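I. Requirement: on the WeChat search site (weixin.sogou.com), given a search keyword and a number of result pages, crawl every WeChat article that matches.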

II. Analyzing the search page URL:

1. Enter any keyword in the search box, click "next page" on the results page, and copy down the URL of each page so they can be compared.


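2. The page number is set by page=X and the search keyword by query=X. To find out whether the other variables in the URL are required, delete them one at a time and test whether the page still loads. Avoid QQ Browser for this test: it is overly "smart" about WeChat and may still show a normal page when the URL is wrong.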
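The URL pattern described above can be sketched as a small helper. This is a minimal sketch, not the author's code: the function name build_search_url and the constant BASE_URL are hypothetical, and it assumes that type, query and page are the only parameters the site needs, which is the kind of thing the delete-and-test approach is meant to confirm.

```python
from urllib.parse import urlencode

# Base address taken from the crawler code in section IV.
BASE_URL = "http://weixin.sogou.com/weixin"


def build_search_url(keyword, page):
    """Build one search-result URL (hypothetical helper).

    Assumes only type, query and page are required parameters.
    urlencode percent-encodes the keyword, so non-ASCII keywords are safe.
    """
    params = {"type": 2, "query": keyword, "page": page}
    return BASE_URL + "?" + urlencode(params)


# Build the URLs for three result pages of one keyword.
urls = [build_search_url("Python", page) for page in range(1, 4)]
for url in urls:
    print(url)
```

Building the query string with urlencode keeps the encoding in one place, so the keyword is encoded exactly once no matter how many pages are requested.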

III. Analyzing the URLs of the individual search results:

1. Inspect the source code of the search results page.


Note that the URL of each result's article page is enclosed in <a target="_blank" href="...".
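A minimal sketch of how that markup can be matched with a regular expression. The HTML fragment below is invented for illustration (the example.com links are not real search results); only the pattern itself is the one used in the code in section IV.

```python
import re

# Made-up fragment that imitates the structure of a results page.
sample_html = '''
<div class="txt-box">
<h3><a target="_blank" href="http://example.com/article?id=1">First article</a></h3>
<h3><a target="_blank" href="http://example.com/article?id=2">Second article</a></h3>
</div>
'''

# Non-greedy (.*?) stops at the first closing quote, so each match is one URL.
# re.S lets "." match newlines in case an attribute wraps across lines.
pattern = re.compile(r'<a target="_blank" href="(.*?)"', re.S)
links = pattern.findall(sample_html)
print(links)
```

The non-greedy group matters here: a greedy (.*) would run from the first href to the last closing quote in the page and return one long, useless match.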

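2. However, opening a URL taken from the source code produces a "parameter error" page. (Open it by copying the URL into the address bar and pressing Enter, not by clicking the URL in the source view.)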


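Comparing it with the URL reached by a manual search shows that the URL in the source code contains extra "amp;" substrings, because each "&" is written there as the HTML entity "&amp;". With those substrings deleted the URL opens normally, so every crawled URL must have them found and removed.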
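The cleanup step can be sketched in two ways. The URL below is a made-up example; str.replace is the approach used in the code in section IV, while html.unescape from the standard library is an alternative that decodes every HTML entity rather than only this one.

```python
import html

# Made-up URL as it would appear inside the page source.
raw_url = "http://example.com/article?id=1&amp;src=3&amp;ver=1"

# Approach used in these notes: drop the literal "amp;" substrings.
by_replace = raw_url.replace("amp;", "")

# Alternative: decode HTML entities with the standard library.
by_unescape = html.unescape(raw_url)

print(by_replace)
print(by_unescape)
```

For URLs that contain only "&amp;" the two results are identical; html.unescape is the safer choice if other entities might appear in the link.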

IV. Writing the code:

import re
import time
import urllib.error
import urllib.request
# Helper: fetch one URL through a proxy server.
# Returns the response body as bytes, or None if the request failed.
def use_proxy(proxy_addr,url):
    # Wrap the request in exception handling
    try:
        req=urllib.request.Request(url)
        # Pretend to be a browser
        req.add_header("User-Agent","Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6726.400 QQBrowser/10.2.2265.400")
        proxy=urllib.request.ProxyHandler({'http':proxy_addr})
        opener=urllib.request.build_opener(proxy,urllib.request.HTTPHandler)
        urllib.request.install_opener(opener)
        data=urllib.request.urlopen(req).read()
        return data
    except urllib.error.URLError as e:
        if hasattr(e,"code"):
            print(e.code)
        if hasattr(e,"reason"):
            print(e.reason)
        # On a URLError, wait 10 seconds before continuing
        time.sleep(10)
    except Exception as e:
        print("exception:"+str(e))
        # On any other exception, wait 1 second before continuing
        time.sleep(1)

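# Search keyword
key="Python"
# Proxy server; this one may no longer work, so replace it with a proxy that does
proxy="127.0.0.1:8888"
# URL-encode the keyword once, outside the loop, so it is never encoded twice
quoted_key=urllib.request.quote(key)
# Number of pages to crawl
for i in range(0,10):
    thispageurl="http://weixin.sogou.com/weixin?type=2&query="+quoted_key+"&page="+str(i)
    thispagedata=use_proxy(proxy,thispageurl)
    # use_proxy returns None when the request failed
    if thispagedata is None:
        print("Page "+str(i)+" failed")
        continue
    # Check whether any data was actually fetched
    print(len(thispagedata))
    pat1='<a target="_blank" href="(.*?)"'
    rs1=re.compile(pat1,re.S).findall(thispagedata.decode("utf-8","ignore"))
    if len(rs1)==0:
        print("Page "+str(i)+" failed")
        continue
    for j in range(0,len(rs1)):
        thisurl=rs1[j]
        # Remove the "amp;" left over from the HTML entity "&amp;"
        thisurl=thisurl.replace("amp;","")
        # The directory F:/weixin must already exist
        file="F:/weixin/page"+str(i)+"_article"+str(j)+".html"
        thisdata=use_proxy(proxy,thisurl)
        if thisdata is None:
            print("Page "+str(i)+", article "+str(j)+" failed")
            continue
        try:
            # The with block closes the file even if the write fails
            with open(file,"wb") as fh:
                fh.write(thisdata)
            print("Page "+str(i)+", article "+str(j)+" saved")
        except Exception as e:
            print(e)
            print("Page "+str(i)+", article "+str(j)+" failed")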
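The code above reports a failed page and moves on without trying again. One possible extension, which is not part of the original notes, is a small retry wrapper. In this sketch the name fetch_with_retry and its parameters are hypothetical; fetch is assumed to be any callable that, like use_proxy, returns bytes on success and None on failure.

```python
import time


def fetch_with_retry(fetch, url, attempts=3, delay=10.0):
    """Call fetch(url) until it returns data or the attempts run out.

    Hypothetical helper: fetch must return None to signal a failure.
    """
    for attempt in range(attempts):
        data = fetch(url)
        if data is not None:
            return data
        # Pause between attempts, but not after the final one.
        if attempt < attempts - 1:
            time.sleep(delay)
    return None


# Possible use with the function defined above (assumes use_proxy and proxy exist):
# page = fetch_with_retry(lambda u: use_proxy(proxy, u), thispageurl)
```

Passing the fetch function in as an argument keeps the retry logic independent of the proxy setup, so it can be tried out with a stub before touching the network.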


Thanks to teacher Wei Wei (韦玮) for the guidance.
