Python 爬虫实战（2）

Dec 19, 2017 | 爬虫

文章目录

Python 爬虫实战2

目标：获取上交所和深交所所有股票的名称和交易信息,存储到一个本地文件中

网站选择原则：股票信息静态存在于html页面中，非js代码生成，没有Robbts协议限制

选取方法：打开网页，查看源代码，搜索网页的股票价格数据是否存在于源代码中

下面的百度股市通中，股票的信息完全再html代码中，符合要求（并且发现网址中包含我们需要的关键字:sz代表深交所，而后面的数字就是股票代码了）

输出结果

除了单个股票的信息，我们需要所有交所和深交所的股票，访问 http://quote.eastmoney.com/stocklist.html 查看页面

股票代码

所有我们只需要先获取所有的股票代码，然后循环访问百度即可获得所有的股票信息

输出结果：

输出结果

大部分讲解都在 Python 爬虫实战（1）中介绍过了，需要请查看 http://blog.zj2626.com/2017/12/14/20171214_crawler

python文件读写

Python内置了读写文件的函数，用法和C是兼容的。

在磁盘上读写文件的功能都是由操作系统提供的，现代操作系统不允许普通的程序直接操作磁盘，所以，读写文件就是请求操作系统打开一个文件对象（通常称为文件描述符），然后，通过操作系统提供的接口从这个文件对象中读取数据（读文件），或者把数据写入这个文件对象（写文件）。

读文件

python内置的open()函数，返回一个文件对象;(参数中 r代表读 w代表写)

读文件

得到文件对象，则可以直接调用f.read()把文件内容读取到内存中来
1

f.read()

读取时发生的问题：

如果文件不存在，open()函数就会抛出一个IOError的错误，并且给出错误码和详细的信息告诉你文件不存在
‘gbk’ codec can’t decode byte 0xaf in position …

#问题2解决方案两个：
# 1. 打开文件的时候就指定编码的类型
    f = open('E:/Data.txt', 'r',encoding = 'utf-8')
    f.read()
# 2. 修改文件编码为utf-8

最后一步是调用close()方法关闭文件。文件使用完毕后必须关闭，因为文件对象会占用操作系统的资源，并且操作系统同一时间能打开的文件数量也是有限的

f.close()

为了防止中途出现异常而无法关闭文件，使用try finally语句

try:
    f = open('E:/Data.txt', 'r', encoding = 'utf-8')
    print(f.read())
finally:
    if f:
        f.close()

为了简化代码，python提供了一个更好的更简洁的方法读取文件(和try…finally一样的)

1 2	with open('E:/Data.txt', 'r', encoding = 'utf-8') as f: print(f.read())

read()方法一次把所有的文件内容读取进来，如果文件太大就不太好用，所以要反复调用read(size)来一部分一部分的读取，也可以调用readline()一次读取一行，或者调用readlines()一次读取全部并返回list
1
2
3
4

with open('E:/Data.txt', 'r', encoding = 'utf-8') as f:
print(f.readline())
print(f.readline())
print(f.readline())

如果文件很小，read()一次性读取最方便；如果不能确定文件大小，反复调用read(size)比较保险；如果是配置文件，调用readlines()最方便

写文件

第二个参数传入标识符’w’或者’wb’表示写文本文件或写二进制文件; write()函数会把数据替换掉原文件中内容

try:
    f = open('E:/Data.txt', 'w', encoding = 'utf-8')
    f.write('ffffffffffffffffffffffff')
finally:
    if f:
        f.close()

当我们写文件时，操作系统往往不会立刻把数据写入磁盘，而是放到内存缓存起来，空闲的时候再慢慢写入。只有调用close()方法时，操作系统才保证把没有写入的数据全部写入磁盘。忘记调用close()的后果是数据可能只写了一部分到磁盘，剩下的丢失了。

同读取一样系统提供更好的

1 2	with open('E:/Data.txt', 'w', encoding = 'utf-8') as f: f.write('kkkkkkkkkkkkkkkk')

二进制文件

前面讲的默认都是读取文本文件，并且是UTF-8编码的文本文件。要读取二进制文件，比如图片、视频等等，用’rb’模式打开文件即可

1 2	with open('D:/20171226101748.png', 'rb') as f: f.read()

个人代码：

from urllib import request
from bs4 import BeautifulSoup as bs
import re

def getCodes():
    url = 'http://quote.eastmoney.com/stocklist.html';
    resp = request.urlopen(url)
    resp_text = resp.read().decode('gbk')
    soap = bs(resp_text, 'html.parser')
    list = soap.find_all('div', id = 'quotesearch')[0].find_all('ul')[0].find_all('li')

    codeList = []
    for li in list:
        try: 
            #eg: sh603183
            codeList.append(re.findall(r"[s][hz]\d{6}", li.find('a')['href']))
        except:
            continue
        
    return codeList

def makeDict(code):
    infoDict = {}
    url = 'https://gupiao.baidu.com/stock/'+ code +'.html';
    resp = request.urlopen(url)
    resp_text = resp.read().decode('utf-8')
    soap = bs(resp_text, 'html.parser')
    
    try:
        stockInfo = soap.find_all(attrs = {'class','stock-bets'})
        name = soap.find(attrs = {'class', 'bets-name'})
        if name is None:
            return None
        infoDict['name'] = name.text.strip()
        keys = soap.find_all('dt')
        values = soap.find_all('dd')
        for i in range(len(keys)):
            infoDict[keys[i].text] = values[i].text

        return infoDict
    except:
        return None

def writeFile(codeList):
    i=0
    for code in codeList:
        i = i+1
        # 下面两个判断是因为前45个股票百度并没有信息，所以跳过了，200个以后的数据就不再取了，太多了，科科
        if i < 45:
            continue
        if i > 200:
            break
        infoDict = makeDict(code[0])
        if infoDict is None:
            continue
        print (infoDict)
        with open('E://Data.txt', 'a', encoding = 'utf-8') as f:
            f.write(str(infoDict) + '\n')

codeList = getCodes();
print("start")
writeFile(codeList)

别人家的代码【滑稽】：

# -*- coding: utf-8 -*-
 
import requests
from bs4 import BeautifulSoup
import traceback
import re
 
def getHTMLText(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
 
def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser') 
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue
 
def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html=="":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div',attrs={'class':'stock-bets'})
 
            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            infoDict.update({'股票名称': name.text.split()[0]})
             
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
             
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write( str(infoDict) + '\n' )
                count = count + 1
                print("\r当前进度: {:.2f}%".format(count*100/len(lst)),end="")
        except:
            count = count + 1
            print("\r当前进度: {:.2f}%".format(count*100/len(lst)),end="")
            continue
 
def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist=[]
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)
 
main()

转载自链接地址: http://python.jobbole.com/88350/

个人博客欢迎来访： http://blog.zj2626.com

python 爬虫

Python 爬虫实战（2）

Python 爬虫实战（2）

目标： 获取上交所和深交所所有股票的名称和交易信息,存储到一个本地文件中

python文件读写

个人代码：

别人家的代码【滑稽】：

相关推荐

目标：获取上交所和深交所所有股票的名称和交易信息,存储到一个本地文件中