python - Scraping an AJAX website with BeautifulSoup

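Problem description:

I want to scrape an e-commerce website that uses AJAX calls to load its next pages.

I can scrape the data on page 1, but page 2 is loaded automatically through an AJAX call when I scroll page 1 to the bottom.

My code: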

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq

# Page 1 is served as ordinary HTML
my_url = 'http://www.shopclues.com/mobiles-smartphones.html'
page = ureq(my_url).read()
page_soup = soup(page, "html.parser")
containers = page_soup.findAll("div", {"class": "column col3"})
for container in containers:
    name = container.h3.text
    price = container.find("span", {'class': 'p_price'}).text
    print("Name : " + name.replace(",", " "))
    print("Price : " + price)

# Pages 2 to 6 come from the AJAX endpoint, 36 products per page
for i in range(2, 7):
    my_url = "http://www.shopclues.com/ajaxCall/moreProducts?catId=1431&filters=&pageType=c&brandName=&start=" + str(36 * (i - 1)) + "&columns=4&fl_cal=1&page=" + str(i)
    page = ureq(my_url).read()
    print(page)  # debug: show the raw AJAX response
    page_soup = soup(page, "html.parser")
    containers = page_soup.findAll("div", {"class": "column col3"})
    for container in containers:
        name = container.h3.text
        price = container.find("span", {'class': 'p_price'}).text
        print("Name : " + name.replace(",", " "))
        print("Price : " + price)

To check whether I can open the AJAX page at all, I printed the page read by ureq. The output of print(page) is:

b''

Please provide me with a solution to scrape the remaining data.
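For reference, the paginated URL that the loop in the question assembles by string concatenation can also be built with urllib.parse.urlencode. This is only a sketch: the parameter names and values are copied from the question's URL, while the names AJAX_ENDPOINT, PAGE_SIZE and build_page_url are hypothetical helpers introduced here. It does not by itself explain the empty b'' response.

```python
from urllib.parse import urlencode

# Endpoint and page size taken from the URL in the question
AJAX_ENDPOINT = "http://www.shopclues.com/ajaxCall/moreProducts"
PAGE_SIZE = 36


def build_page_url(page, cat_id=1431):
    """Build the AJAX URL for a given page number (hypothetical helper)."""
    params = [
        ("catId", cat_id),
        ("filters", ""),
        ("pageType", "c"),
        ("brandName", ""),
        ("start", PAGE_SIZE * (page - 1)),  # offset of the first product
        ("columns", 4),
        ("fl_cal", 1),
        ("page", page),
    ]
    return AJAX_ENDPOINT + "?" + urlencode(params)


if __name__ == "__main__":
    for page_number in range(2, 7):
        print(build_page_url(page_number))
```

Keeping the query parameters in one list makes it easier to compare them with what the browser sends in its network tab.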

+1

Try using Selenium.

+1

I am new to web scraping. It would be kind of you if you could provide me with the code.

+2

I would suggest using their API: http://developer.shopclues.com/index.php/API_Basics#link

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as soup
import random
import time

# Block the browser's notification prompt
chrome_options = webdriver.ChromeOptions()
prefs = {"profile.default_content_setting_values.notifications": 2}
chrome_options.add_experimental_option("prefs", prefs)

# A randomized delay of 5 to 10 seconds
seconds = 5 + (random.random() * 5)
# create a new Chrome session
driver = webdriver.Chrome(options=chrome_options)
driver.implicitly_wait(30)
# driver.maximize_window()

# navigate to the application home page
driver.get("http://www.shopclues.com/mobiles-smartphones.html")
time.sleep(2 * seconds)
# Add more to range for more phones
for i in range(1):
    # Click the "load more" button through JavaScript
    element = driver.find_element(By.ID, "moreProduct")
    driver.execute_script("arguments[0].click();", element)
    time.sleep(2 * seconds)
html = driver.page_source
page_soup = soup(html, "html.parser")
containers = page_soup.findAll("div", {"class": "column col3"})
for container in containers:
    # Skip containers that lack a name or a price
    try:
        name = container.h3.text
        price = container.find("span", {'class': 'p_price'}).text
        print("Name : " + name.replace(",", " "))
        print("Price : " + price)
    except AttributeError:
        continue
driver.quit()

I used Selenium to load the website and click the button that loads more results. Then I took the resulting HTML and fed it into your code.
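The "feed the HTML into your code" step can be isolated as a small function that works on any HTML string, whether it comes from driver.page_source or from urlopen. This is a minimal sketch: the selectors ("column col3", h3, "p_price") are the ones used in the question, and the name extract_products is a hypothetical helper, not part of either library.

```python
from bs4 import BeautifulSoup


def extract_products(html):
    """Return a list of (name, price) pairs found in the given HTML."""
    page_soup = BeautifulSoup(html, "html.parser")
    products = []
    for container in page_soup.find_all("div", {"class": "column col3"}):
        heading = container.h3
        price_tag = container.find("span", {"class": "p_price"})
        # Skip tiles that lack a name or a price instead of raising
        if heading is None or price_tag is None:
            continue
        products.append(
            (heading.get_text(strip=True), price_tag.get_text(strip=True))
        )
    return products
```

With Selenium, it would be called as extract_products(driver.page_source) once the extra products have loaded.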

+1

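Welcome to Stack Overflow! Please provide an explanation or documentation in your answer to further help the original poster and anyone else who may search for this answer.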

+1

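Sorry, to be honest I did not understand what you are actually doing here. Also, I did not find any button to load more products, because the page loads more products by itself as I scroll down.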
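If the page loads more products on scroll rather than through a button, one common approach is to scroll to the bottom repeatedly until the page height stops growing. The sketch below assumes the page behaves that way, which the thread does not confirm. scroll_until_stable is a hypothetical helper, and the only thing it needs from driver is Selenium's execute_script method.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=20):
    """Scroll to the bottom until the page height stops changing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the AJAX call time to append products
        new_height = driver.execute_script("return document.body.scrollHeight")
        rounds += 1
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return rounds
```

Calling scroll_until_stable(driver) before reading driver.page_source would replace the button click in the answer's loop. The max_rounds cap keeps the loop from running forever on a page that never stops growing.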

+1

I tried it once and I did get the load-more button, but when I clicked it the browser took me to the URL: http://b.codeonclick.com/script/wait.php?stamat=m%7C% 2C%2Cg3F-9jZXoGU3B_9GH0dEdHP3xP.f10%2CqqtKzScrXaD6J-TdEPg201mBMiNRUBdz6CXReBfSkvUVRInI1LXqZThgGFzCEHMpF1lleptOU_QsrpOi6T7Hby7nsDmByZIpPmfQ9jTUqKnDJMkuuIUs2gNUMD-4q8sddxXk9SJ9DV0v5jXqlTWUZdtJQpypd5folRnCfkojHyAp_deich7xrxO_f1wrkstlYSw7fGuN7n6aoTbh6DiYEF0Ypi2LPx8j3rcuOvcI8SqWq0Nn017hDlPJJxhoMjvHa67t4aRUI7sl9iV308NqAjdhpD5WQ7sYXYpfMxy-KpDzCUiL5Ndf-N_giWqeVZ-5 TTC = t4xr44rc