Scraping data with Python

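Question:

The website lists data based on dropdown filters, so I tried to fetch that data by passing static dropdown values. I think the view state is what prevents me from getting the data.

Does anyone have an idea how to scrape data from an ASP.NET website that uses view state?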

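I am getting the following error:

Validation of viewstate MAC failed. If this application is hosted by a Web Farm or cluster, ensure that <machineKey> configuration specifies the same validationKey and validation algorithm. AutoGenerate cannot be used in a cluster.

Python script: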

import requests
from bs4 import BeautifulSoup

def get_viewstate():
    # Fetch the form page and read the hidden __VIEWSTATE field
    url = "http://xlnindia.gov.in/frm_G_Cold_S_Query.aspx?ST=GJ"
    req = requests.get(url)
    data = req.text

    bs = BeautifulSoup(data, "html.parser")
    return bs.find("input", {"id": "__VIEWSTATE"}).attrs['value']

url = "http://xlnindia.gov.in/frm_G_Cold_S_Query.aspx?ST=GJ"
# Post the dropdown values together with the view state
data = {"__VIEWSTATE": get_viewstate(), "ST": 'GJ', "ddldistrict": 'AMR', "ddltaluka": '', "btnSearch": 'Search'}
req = requests.post(url, data)

bs = BeautifulSoup(req.text, "html.parser")
print(bs.prettify())
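
One possible reason for the error, offered as an assumption and not verified against this site: ASP.NET WebForms pages usually carry several hidden fields besides __VIEWSTATE (commonly __EVENTVALIDATION and __VIEWSTATEGENERATOR), and the view state may be tied to the session cookie issued by the first request. The sketch below collects every hidden input from the form page and posts it back through one session object. The helper names hidden_form_fields and search_district are hypothetical; the field names ddldistrict, ddltaluka and btnSearch are taken from the script above.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects name/value pairs of every <input type="hidden"> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            # A missing value attribute is posted as an empty string
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_form_fields(html):
    """Return a dict of all hidden form fields found in the HTML."""
    parser = HiddenInputParser()
    parser.feed(html)
    return parser.fields


def search_district(session, url, district, taluka=""):
    """GET the form, then POST it back with the hidden fields included.

    session is expected to behave like requests.Session, so cookies set by
    the GET are sent again with the POST.
    """
    form_page = session.get(url)
    payload = hidden_form_fields(form_page.text)
    payload.update({"ddldistrict": district, "ddltaluka": taluka, "btnSearch": "Search"})
    return session.post(url, data=payload)
```

With requests installed, the call would look like search_district(requests.Session(), url, "AMR"). Whether this is enough for this particular site is not confirmed; pages that rebuild the form with JavaScript still need a real browser.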

Give Selenium a try?

Answer

I don't think you can do this with requests, but you can do it easily with selenium.

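Install Selenium with pip install selenium or pip3 install selenium.
Then download the latest ChromeDriver for your machine from this link and copy the driver into your working directory.

You can read the Selenium documentation here.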

import time
from selenium import webdriver
from selenium.webdriver.common.by import By

url = "http://xlnindia.gov.in/frm_G_Cold_S_Query.aspx?ST=GJ"
browser = webdriver.Chrome()
browser.get(url)

# Uncomment this to change the state from Gujarat to something else,
# or change the state simply by editing the "?ST=GJ" part of the URL
# state = browser.find_element(By.ID, "ddlState")
# state.send_keys("BR")

# send_keys types into the dropdown so the browser jumps to the matching option
district = browser.find_element(By.ID, "ddldistrict")
district.send_keys("AMR")

# Skip this if you want to include all categories in the result
category = browser.find_element(By.ID, "ddlCategory")
category.send_keys("R")

button = browser.find_element(By.ID, "btnSearch")
button.click()

# Crude fixed wait for the results to load
time.sleep(10)
# Selenium writes screenshots as PNG, so use a .png file name
browser.save_screenshot(browser.title + ".png")
html = browser.page_source
print(html)

# quit() closes every window and ends the driver session
browser.quit()
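
The html string printed by the script still has to be turned into records. As a minimal sketch, assuming the search results are rendered as an ordinary HTML table (the layout of this site's result page was not checked, and nested tables are not handled), the standard-library parser below gathers the text of each table row into a list of cells. The names TableRowParser and table_rows are hypothetical.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    """Return a list of rows, each row being a list of cell strings."""
    parser = TableRowParser()
    parser.feed(html)
    return parser.rows
```

Calling table_rows(html) on the page source from the Selenium script would return one list per table row, header rows included, which can then be filtered or written to a file.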

Note:
If you want to run selenium with a headless browser, use PhantomJS. To learn how to use PhantomJS, read this.
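
PhantomJS is no longer maintained, and recent Selenium releases no longer ship a PhantomJS driver, so headless Chrome is the usual substitute today. A minimal sketch, assuming Selenium 4 and a recent Chrome; the helper names headless_chrome_arguments and make_headless_chrome and the default window size are illustrative choices, not part of the original answer.

```python
def headless_chrome_arguments(width=1280, height=1024):
    """Command-line switches that make Chrome run without a visible window."""
    # "--headless=new" targets recent Chrome; older builds expect "--headless"
    return ["--headless=new", f"--window-size={width},{height}"]


def make_headless_chrome():
    """Start a headless Chrome session (requires selenium and Chrome)."""
    # Imported here so the helper above can be used without selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for argument in headless_chrome_arguments():
        options.add_argument(argument)
    return webdriver.Chrome(options=options)
```

Replacing webdriver.Chrome() in the script above with make_headless_chrome() should leave the remaining steps unchanged.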

Thanks MD. Khairul Basar, it works perfectly.

Hey MD. Khairul Basar, could you help me store the value of each field in a MySQL database?

@JunedAnsari Sure, can give it a try.
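
A rough sketch of the storage step asked about above. It uses sqlite3 from the standard library as a stand-in so that it runs without a database server; a MySQL driver that follows the Python DB-API is used in much the same way, although its placeholder is typically %s where sqlite3 uses ?. The table name cold_storage, its three columns, and the sample row are hypothetical, since the thread does not show the result fields.

```python
import sqlite3


def save_rows(conn, rows):
    """Insert (name, district, category) tuples.

    Returns the total number of rows in the table afterwards.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cold_storage ("
        "name TEXT, district TEXT, category TEXT)"
    )
    # Parameterized query: values are passed separately from the SQL text
    conn.executemany(
        "INSERT INTO cold_storage (name, district, category) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM cold_storage").fetchone()[0]


# In-memory database and a made-up row, for illustration only
conn = sqlite3.connect(":memory:")
print(save_rows(conn, [("Example Store", "AMR", "R")]))
```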
