Efficient web scraping in Python?

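Question:

I'm fairly new to Python and I'm trying to build a web scraper for a stock app. I'm basically using urllib to open the desired webpage for each stock in the argument list and read the full contents of that page's HTML. Then I slice it to find the quote I'm looking for. The method I've implemented works, but I doubt it's the most efficient way to get this result. I've spent some time looking into other, potentially faster ways of reading files, but none of them seem to relate to web scraping. Here is my code: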

from urllib.request import urlopen 

def getQuotes(stocks): 
    quoteList = {} 
    for stock in stocks: 
        html = urlopen("https://finance.google.com/finance?q={}".format(stock))
        webpageData = html.read()
        scrape1 = webpageData.split(str.encode('<span class="pr">\n<span id='))[1].split(str.encode('</span>'))[0]
        scrape2 = scrape1.split(str.encode('>'))[1]
        quote = bytes.decode(scrape2)
        quoteList[stock] = float(quote)
    return quoteList 

print(getQuotes(['FB', 'GOOG', 'TSLA']))

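Thanks very much in advance!

Check out [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) – Mako212

I would work with the 'requests' package rather than with 'urllib' directly. I'd expect the code above to run pretty fast, doesn't it? When you have a lot of requests, you could look into multithreading. Depending on the code, that should speed things up nicely. – Andras

Oh yes, and check out Beautiful Soup or lxml, as mentioned above. – Andras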

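Answer

"I'm basically using urllib to open the desired webpage for each stock in the argument list and read the full contents of that page's HTML. Then I slice it to find the quote I'm looking for."

Here is an implementation of that with Beautiful Soup and requests: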

import requests 
from bs4 import BeautifulSoup 

def get_quotes(*stocks): 
    quotelist = {} 
    base = 'https://finance.google.com/finance?q={}' 
    for stock in stocks: 
        url = base.format(stock)
        soup = BeautifulSoup(requests.get(url).text, 'html.parser')
        quote = soup.find('span', attrs={'class': 'pr'}).get_text().strip()
        quotelist[stock] = float(quote)
    return quotelist 

print(get_quotes('AAPL', 'GE', 'C')) 
# {'AAPL': 160.86, 'GE': 23.91, 'C': 68.79}
# 1 loop, best of 3: 1.31 s per loop
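One thing to watch in the loop above: `float()` raises `ValueError` on a price with a thousands separator such as `1,024.50`, and `soup.find(...)` returns `None` when the span is missing, so `.get_text()` fails with `AttributeError`. Below is a minimal sketch of a cleaning helper, assuming that returning NaN is an acceptable way to flag a quote that could not be read. The name `parse_quote` and the NaN policy are illustrative choices, not part of the original answer.

```python
import math


def parse_quote(text):
    """Convert scraped price text such as ' 1,024.50\n' to a float.

    Returns NaN when the text is missing or not numeric
    (an assumed policy; raising an error would be equally valid).
    """
    if text is None:
        return math.nan
    # Drop surrounding whitespace and thousands separators before converting
    cleaned = text.strip().replace(',', '')
    try:
        return float(cleaned)
    except ValueError:
        return math.nan


print(parse_quote('160.86\n'))     # 160.86
print(parse_quote(' 1,024.50 '))   # 1024.5
print(parse_quote(None))           # nan
```

In the scraper this would be called on the text of the matched span, with `None` passed in when `soup.find` finds nothing.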

As mentioned in the comments, you may want to look into multithreading or grequests.

Using grequests to make asynchronous HTTP requests:

import grequests  # patches the standard library via gevent when imported
from bs4 import BeautifulSoup

def get_quotes(*stocks):
    quotelist = {}
    base = 'https://finance.google.com/finance?q={}'
    # Build the unsent requests lazily, then send them all concurrently
    rs = (grequests.get(base.format(stock)) for stock in stocks)
    rs = grequests.map(rs)  # responses come back in request order
    for r, stock in zip(rs, stocks):
        soup = BeautifulSoup(r.text, 'html.parser')
        quote = soup.find('span', attrs={'class': 'pr'}).get_text().strip()
        quotelist[stock] = float(quote)
    return quotelist

%%timeit
get_quotes('AAPL', 'BAC', 'MMM', 'ATVI',
           'PPG', 'MS', 'GOOGL', 'RRC')
# 1 loop, best of 3: 2.81 s per loop
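As far as I know, `grequests.map` puts `None` in the result list for any request that failed (unless an `exception_handler` is supplied), in which case `r.text` in the loop above would raise `AttributeError`. Here is a rough sketch of separating the failures before parsing. The helper name `split_responses` is hypothetical, and plain strings stand in for real response objects so the snippet runs without a network connection.

```python
def split_responses(stocks, responses):
    """Pair each ticker with its response and set aside the failed ones.

    `responses` is expected to be in the same order as `stocks`,
    with None marking a request that did not complete.
    """
    ok = {}
    failed = []
    for stock, response in zip(stocks, responses):
        if response is None:
            failed.append(stock)
        else:
            ok[stock] = response
    return ok, failed


# Stand-in data: strings instead of real Response objects
ok, failed = split_responses(
    ['AAPL', 'BAC', 'MMM'],
    ['<html>a</html>', None, '<html>m</html>'],
)
print(sorted(ok), failed)  # ['AAPL', 'MMM'] ['BAC']
```

The tickers in `failed` could then be retried or reported, while only the entries in `ok` go on to the parsing step.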

Update: here is a version adapted from Dusty Phillips' Python 3 Object-Oriented Programming that uses the built-in threading module.

from threading import Thread 

from bs4 import BeautifulSoup 
import numpy as np 
import requests 


class QuoteGetter(Thread):
    def __init__(self, ticker):
        super().__init__()
        self.ticker = ticker

    def run(self):
        base = 'https://finance.google.com/finance?q={}'
        response = requests.get(base.format(self.ticker))
        soup = BeautifulSoup(response.text, 'html.parser')
        try:
            self.quote = float(soup.find('span', attrs={'class': 'pr'})
                               .get_text()
                               .strip()
                               .replace(',', ''))
        except AttributeError:
            # soup.find returned None: no matching span on the page
            self.quote = np.nan


def get_quotes(tickers):
    threads = [QuoteGetter(t) for t in tickers]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    quotes = dict(zip(tickers, [thread.quote for thread in threads]))
    return quotes

tickers = [ 
    'A', 'AAL', 'AAP', 'AAPL', 'ABBV', 'ABC', 'ABT', 'ACN', 'ADBE', 'ADI', 
    'ADM', 'ADP', 'ADS', 'ADSK', 'AEE', 'AEP', 'AES', 'AET', 'AFL', 'AGN', 
    'AIG', 'AIV', 'AIZ', 'AJG', 'AKAM', 'ALB', 'ALGN', 'ALK', 'ALL', 'ALLE', 
    ] 

%time get_quotes(tickers) 
# Wall time: 1.53 s
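The threaded version above starts one thread per ticker. For longer ticker lists, the same fan-out can be sketched with the standard library's `concurrent.futures.ThreadPoolExecutor`, which caps the number of threads running at once. This is only a sketch under assumptions: `get_quotes_pooled`, `fetch_quote`, and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the request-and-parse step so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def get_quotes_pooled(tickers, fetch_quote, max_workers=8):
    """Fetch one quote per ticker on a bounded pool of worker threads.

    fetch_quote is any callable that takes a ticker and returns a float.
    """
    tickers = list(tickers)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input tickers
        quotes = list(pool.map(fetch_quote, tickers))
    return dict(zip(tickers, quotes))


def fake_fetch(ticker):
    # Hypothetical stand-in: a real version would download and parse the page
    return float(len(ticker))


print(get_quotes_pooled(['A', 'AAL', 'AAPL'], fake_fetch))
# {'A': 1.0, 'AAL': 3.0, 'AAPL': 4.0}
```

Passing the fetch function in as an argument keeps the pooling logic separate from the scraping logic, so either can be changed or tested on its own.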

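Your first solution with BeautifulSoup actually ended up being slightly slower than my original implementation... but oh boy, pairing it with grequests really does the trick! Much faster results. Thanks again! –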

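@ChaseShankula Yes, I'm not surprised - BeautifulSoup isn't particularly known for its speed. In this case, what takes up the time is the underlying request and the parser. What BS4 is useful for is pulling multiple pieces of data out of a [document tree](http://web.simmons.edu/~grabiner/comm244/weekfour/document-tree.html). Have a read through the [docs](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) when you can; it will come in handy somewhere down the road. –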
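To illustrate the point about pulling several pieces of data out of one document tree, here is a rough sketch that collects multiple fields in a single parse. It swaps in the standard library's `html.parser.HTMLParser` for BeautifulSoup so it runs without extra packages. The sample fragment and the class names `ch` and `vol` are made up for illustration; only `pr` appears in the code in this thread.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect the text found inside every <span> that has a class."""

    def __init__(self):
        super().__init__()
        self.fields = {}   # class name -> accumulated text
        self._stack = []   # class (or None) of each currently open <span>

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self._stack.append(dict(attrs).get('class'))

    def handle_endtag(self, tag):
        if tag == 'span' and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        # Credit the text to the nearest enclosing span that has a class
        owner = next((c for c in reversed(self._stack) if c), None)
        if text and owner:
            self.fields[owner] = self.fields.get(owner, '') + text


# Hypothetical fragment shaped like the nested price span in the question
SAMPLE = (
    '<div><span class="pr">\n<span id="ref_1">160.86</span>\n</span>'
    '<span class="ch">+1.25</span><span class="vol">27.5M</span></div>'
)

collector = SpanCollector()
collector.feed(SAMPLE)
collector.close()
print(collector.fields)
# {'pr': '160.86', 'ch': '+1.25', 'vol': '27.5M'}
```

With string slicing, each extra field would need its own split chain; with a parsed tree, one pass over the document yields all of them.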

@ChaseShankula Updated to use 'threading' instead of 'grequests', since I ran into some issues with the latter. –
