Scraping Lagou job listings with Python

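I've been teaching myself web scraping lately and wanted a place to keep my code, so I'm trying out 51CTO. This is my first post and I'm still figuring out how posting works, so I'll start by sharing some code.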

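First, open the Lagou home page and type the keyword Python into the search box. Then open a packet-capture tool. I'm on macOS, so I used the timeline recording built into Safari. Looking at the captured POST request shows the full URL: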

http://www.lagou.com/jobs/positionAjax.json?

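The POST request carries three fields: first, kd and pn. first appears to indicate whether this is the first page, kd is the keyword you typed, and pn is the page number. first is true for the first page and false for every other page, so an if statement can decide which data to POST each time. If you enter the URL above in a browser, what comes back should be a string of JSON data, so json.loads() is needed to parse it. Once parsed, the JSON is used much like a multidimensional array. The last step is to scrape the fields I need and write them to a text file.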
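The three form fields can be sketched as a small helper. This is only an illustration written for Python 3; the function name build_post_data is made up here and is not part of the original program.

```python
from urllib.parse import urlencode


def build_post_data(keyword, page):
    """Return the three form fields described above for one results page.

    Hypothetical helper: the name and signature are assumptions.
    """
    return {
        "first": "true" if page == 1 else "false",  # only page 1 counts as first
        "kd": keyword,                               # the search keyword
        "pn": page,                                  # the page number
    }


# Form-encoded bodies for the first two pages of a "python" search.
for page in (1, 2):
    print(urlencode(build_post_data("python", page)))
```

The conditional expression plays the role of the if/else check on the page number.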

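# coding=utf-8
# Note: this script targets Python 2 (urllib2, reload, print statement).
import json
import sys
import urllib
import urllib2

# Python 2 workaround: make UTF-8 the default encoding so the Chinese
# text in the response can be concatenated and written to the file.
reload(sys)
sys.setdefaultencoding('utf-8')

page = 1    # page currently being requested
length = 0  # total number of records fetched
index = 1   # running row number in the output file

f = open('lagoudata.txt', 'a+')
while page < 5:  # fetches pages 1 to 4
    # 'first' is 'true' only for the first page
    if page == 1:
        post_data = {'first': 'true', 'kd': 'python', 'pn': page}
    else:
        post_data = {'first': 'false', 'kd': 'python', 'pn': page}
    page = page + 1
    # Passing a data argument makes urllib2 send a POST request
    r = urllib2.Request("http://www.lagou.com/jobs/positionAjax.json?px=default", urllib.urlencode(post_data))
    html = urllib2.urlopen(r).read()
    hjson = json.loads(html)
    result = hjson['content']['result']
    # print result
    length = length + len(result)
    for item in result:
        # One comma-separated line per job posting
        string = ','.join([str(index), item['companyName'], item['financeStage'],
                           item['positionAdvantage'], item['education'],
                           item['workYear'], item['city'], item['salary']])
        f.write(string)
        f.write('\r\n')
        index = index + 1
        # print string
f.close()

print length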
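The listing above is Python 2. For readers on Python 3, here is a rough sketch of the same request using only the standard library. The helper names make_request and fetch_results are made up for illustration. Whether the endpoint still answers a bare request like this (it may want extra headers or cookies) is an assumption that this sketch does not check, so the network call is defined but not executed.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# URL taken from the listing above.
URL = "http://www.lagou.com/jobs/positionAjax.json?px=default"


def make_request(post_data):
    """Build a POST request (hypothetical helper).

    Supplying a data argument is what switches urllib from GET to POST.
    """
    body = urlencode(post_data).encode("utf-8")  # Python 3 expects bytes
    return Request(URL, data=body)


def fetch_results(post_data):
    """Send the request and return the list found under content -> result.

    Assumes the response still has the shape the original listing reads.
    """
    with urlopen(make_request(post_data)) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["content"]["result"]


# Inspect the request without touching the network.
req = make_request({"first": "true", "kd": "python", "pn": 1})
print(req.get_method(), req.data)
```

In Python 3, urllib2 and urllib.urlencode become urllib.request and urllib.parse.urlencode, and the request body has to be encoded to bytes.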

Because Lagou returns JSON here, the response has to be processed first. Anyway, the figure below shows the data I ended up scraping.
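As an illustration of that processing step, the sketch below (Python 3) walks a parsed response and writes one row per record with the csv module instead of joining strings by hand. The csv module is swapped in here and is not what the original program does. Its benefit is that a field containing a comma gets quoted rather than splitting into two columns. The sample response is made up: only the nesting (content, then result) and the field names come from the listing, and the values are placeholders.

```python
import csv
import io
import json

# Field names read by the original listing, in the same order.
FIELDS = ["companyName", "financeStage", "positionAdvantage",
          "education", "workYear", "city", "salary"]

# Made-up sample response; the values are placeholders, not real data.
sample = json.loads("""
{"content": {"result": [
  {"companyName": "ExampleCo",
   "financeStage": "A",
   "positionAdvantage": "flexible hours, free lunch",
   "education": "Bachelor",
   "workYear": "1-3",
   "city": "Beijing",
   "salary": "10k-20k"}
]}}
""")


def write_rows(records, out, start_index=1):
    """Write one CSV row per record; return the next free row number."""
    writer = csv.writer(out)
    for offset, record in enumerate(records):
        writer.writerow([start_index + offset] + [record[name] for name in FIELDS])
    return start_index + len(records)


buffer = io.StringIO()  # stands in for the output text file
next_index = write_rows(sample["content"]["result"], buffer)
print(buffer.getvalue())
```

The parsed object is indexed like nested dictionaries and lists, which is the "multidimensional array" feel of working with JSON.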

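PS: In my program the number of pages to scrape is set by hand. A senior classmate told me the last page can be detected from how the text "下一页" ("next page") appears in the HTML: inspecting that element on the first page and on the last page shows that its class differs, so checking the class would tell you when the last page has been reached. Unfortunately Lagou turns out to be rendered with JavaScript, and it only serves 30 pages of data anyway, so this doesn't matter much here.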
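Since the class check is not available on a JavaScript-rendered page, one possible alternative is to keep requesting pages until one comes back empty, with the 30-page limit as a hard cap. This is a sketch in Python 3 under a loud assumption: that asking for a page past the end yields an empty result list, which has not been verified against the site. The names crawl_all and fake_fetch are invented for the example, and fake_fetch replaces the real network call.

```python
MAX_PAGES = 30  # the site reportedly serves only 30 pages


def crawl_all(fetch_page, max_pages=MAX_PAGES):
    """Collect records page by page until a page is empty or the cap is hit."""
    records = []
    for page in range(1, max_pages + 1):
        batch = fetch_page(page)
        if not batch:  # assumption: an empty list means we are past the last page
            break
        records.extend(batch)
    return records


def fake_fetch(page):
    """Stand-in for the real request: three pages of two rows, then nothing."""
    if page <= 3:
        return [{"page": page, "row": n} for n in range(2)]
    return []


print(len(crawl_all(fake_fetch)))
```

Passing the fetch function in as an argument keeps the stopping logic testable without touching the network.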

This article is reposted from the 51CTO blog of 努力的C. Original link: http://blog.51cto.com/fulin0532/1748561
