第一个爬虫和测试

1.网络爬虫

Python是简单网络爬虫：

简单的爬虫工具:BeautifulSoup

BeautifulSoup拥有强大的解析网页及查找元素的功能

在使用BeautifulSoup4 库之前，需要进行引用，由于这个库的名字非常特殊且采用面向对象方式组织，

可以用from…import 方式从库中直接引用BeautifulSoup 类，方法如下。

插入代码如下：

>>>from bs4 import BeautifulSoup

BeautifulSoup的安装：

在电脑上cmd中输入以下代码：

pip install python-bs4

BeautifulSoup 的属性：

第一个爬虫和测试

在BeautifulSoup4库中每一个Tag标签，也称为Tag对象

标签Tag有4个常用属性为：

第一个爬虫和测试

了解了BeautifulSoup，还要有一个依赖的工具requests

requests库是一个简洁且简单的处理HTTP请求的第三方库，并且request 库支持非常丰富的链接访问功能。

requests 库的安装代码为：

pip install requests

requests 库中的网页请求函数：

第一个爬虫和测试

response 对象的属性

第一个爬虫和测试

response 对象的方法

第一个爬虫和测试

了解了基础的知识，现在我们来进行网页爬虫：

对google主页进行爬虫：

1.利用request库的get（）函数访问google 20次，输入代码为：

import requests
def getHTMLText(url):
    print("第",i+1,"次访问")
    try:
        r=requests.get(url,timeout=30)
        r.raise_for_status()
        r.encoding='utf-8'
        print("网络状态码:",r.status_code)
        print("text属性长度:",len(r.text))
        print("content属性长度:",len(r.content))
        return r.text
    except:
        return "error"
url="http://www.google.cn"
for i in range(20):
    print(getHTMLText(url))

打印返回状态，其中text（）内容即为url对应的内容，其显示的结果为：（由于数据太多，只截取其中第一次的显示代码）

Python 3.7.3 (v3.7.3:ef4ec6ed12, Mar 25 2019, 21:26:53) [MSC v.1916 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license()" for more information.
>>> 
 RESTART: C:/Users/Administrator/AppData/Local/Programs/Python/Python37-32/python代码文件/py1.3.py 
第 1 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213
<!DOCTYPE html>
<html lang="zh">
  <meta charset="utf-8">
  <title>Google</title>
  <style>
    html { background: #fff; margin: 0 1em; }
    body { font: .8125em/1.5 arial, sans-serif; text-align: center; }
    h1 { font-size: 1.5em; font-weight: normal; margin: 1em 0 0; }
    p#footer { color: #767676; font-size: .77em; }
    p#footer a { background: url(//www.google.cn/intl/zh-CN_cn/images/cn_icp.gif) top right no-repeat; padding: 5px 20px 5px 0; }
    ul { margin: 2em; padding: 0; }
    li { display: inline; padding: 0 2em; }
    div { -moz-border-radius: 20px; -webkit-border-radius: 20px; border: 1px solid #ccc; border-radius: 20px; margin: 2em auto 1em; max-width: 650px; min-width: 544px; }
    div:hover, div:hover * { cursor: pointer; }
    div:hover { border-color: #999; }
    div p { margin: .5em 0 1.5em; }
    img { border: 0; }
  </style>
  <div>
    <a href="http://www.google.com.hk/webhp?hl=zh-CN&amp;sourceid=cnhp">
      <img src="//www.google.cn/landing/cnexp/google-search.png" alt="Google" width="586" height="257">
    </a>
    <h1><a href="http://www.google.com.hk/webhp?hl=zh-CN&amp;sourceid=cnhp"><strong id="target">google.com.hk</strong></a></h1>
    <p>请收藏我们的网址
  </div>
  <ul>
    <li><a href="http://translate.google.cn/?sourceid=cnhp">翻译</a>
  </ul>
  <p id="footer">&copy;2011 - <a href="http://www.miibeian.gov.cn/">ICP证合字B2-20070004号</a>
  <script>
    var gcn=gcn||{};gcn.IS_IMAGES=(/images\.google\.cn/.exec(window.location)||window.location.hash=='#images'||window.location.hash=='images');gcn.HOMEPAGE_DEST='http://www.google.com.hk/webhp?hl=zh-CN&sourceid=cnhp';gcn.IMAGES_DEST='http://images.google.com.hk/imghp?'+'hl=zh-CN&sourceid=cnhp';gcn.DEST_URL=gcn.IS_IMAGES?gcn.IMAGES_DEST:gcn.HOMEPAGE_DEST;gcn.READABLE_HOMEPAGE_URL='google.com.hk';gcn.READABLE_IMAGES_URL='images.google.com.hk';gcn.redirectIfLocationHasQueryParams=function(){if(window.location.search&&/google\.cn/.exec(window.location)&&!/webhp/.exec(window.location)){window.location=String(window.location).replace('google.cn','google.com.hk')}}();gcn.replaceHrefsWithImagesUrl=function(){if(gcn.IS_IMAGES){var a=document.getElementsByTagName('a');for(var i=0,len=a.length;i<len;i++){if(a[i].href==gcn.HOMEPAGE_DEST){a[i].href=gcn.IMAGES_DEST}}}}();gcn.listen=function(a,e,b){if(a.addEventListener){a.addEventListener(e,b,false)}else if(a.attachEvent){var r=a.attachEvent('on'+e,b);return r}};gcn.stopDefaultAndProp=function(e){if(e&&e.preventDefault){e.preventDefault()}else if(window.event&&window.event.returnValue){window.eventReturnValue=false;return false}if(e&&e.stopPropagation){e.stopPropagation()}else if(window.event&&window.event.cancelBubble){window.event.cancelBubble=true;return false}};gcn.resetChildElements=function(a){var b=a.childNodes;for(var i=0,len=b.length;i<len;i++){gcn.listen(b[i],'click',gcn.stopDefaultAndProp)}};gcn.redirect=function(){window.location=gcn.DEST_URL};gcn.setInnerHtmlInEl=function(a){if(gcn.IS_IMAGES){var b=document.getElementById(a);if(b){b.innerHTML=b.innerHTML.replace(gcn.READABLE_HOMEPAGE_URL,gcn.READABLE_IMAGES_URL)}}};
    gcn.listen(document, 'click', gcn.redirect);
    gcn.setInnerHtmlInEl('target');
  </script>

第 2 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 3 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 4 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 5 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 6 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 7 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 8 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 9 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 10 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 11 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 12 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 13 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 14 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 15 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 16 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 17 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 18 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 19 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

第 20 次访问
网络状态码: 200
text属性长度: 3185
content属性长度: 3213

>>>

利用菜鸟教程进行内容叙写文章及表格：

<!DOCTYPE html>

<html>

<head>

<meta charset="utf-8">

<title>菜鸟教程（runoob.com)</title>

</head>

<body>

         <hl>我的第一个标题学号25</hl>

         <p id="first">我的第一个段落。</p>

</body>

                  <table border="1">

          <tr>

                  <td>row 1, cell 1</td>

                  <td>row 1, cell 2</td>

         </tr>

         <tr>

                  <td>row 2, cell 1</td>

                  <td>row 2, cell 2</td>

         <tr>

</table>

</html>

由上面编程代码，结果显示为：

第一个爬虫和测试

爬取最好大学网2017年中国大学排名网站

软科中国大学排名2017网页片段

第一个爬虫和测试

输入如下代码为：

import requests
from bs4 import BeautifulSoup
allUniv = []
def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = 'utf-8'
        return r.text
    except:
        return ""
def fillUnivList(soup):
    data = soup.find_all('tr')
    for tr in data:
        ltd = tr.find_all('td')
        if len(ltd)==0:
            continue
        singleUniv = []
        for td in ltd:
            singleUniv.append(td.string)
        allUniv.append(singleUniv)
def printUnivList(num):
    print("{1:^2}{2:{0}^10}{3:{0}^6}{4:{0}^4}{5:{0}^10}".format(chr(12288),"排名","学校名称","省市","总分","科研规模"))
    for i in range(num):
        u=allUniv[i]
        print("{1:^2}{2:{0}^10}{3:{0}^6}{4:{0}^8.1f}{5:{0}^10}".format(chr(12288),i+1,u[1],u[2],eval(u[3]),u[6]))
def main():
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2017.html'
    html = getHTMLText(url)
    soup = BeautifulSoup(html, "html.parser")
    fillUnivList(soup)
    printUnivList(10)
main()

得出最好大学排名的前10名，结果显示为：
第一个爬虫和测试

此次的python爬虫学习到此结束..........................................................................

第一个爬虫和测试

1.网络爬虫

对google主页进行爬虫：

爬取最好大学网2017年中国大学排名网站

相关推荐