Scraping Taobao Products and Reviews

1. Scraping Taobao products

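The search page can only be fetched with a cookie. The product data is also not in the visible HTML: it is embedded in the page source as a JSON object, so it has to be extracted with a regular expression.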

import json
import re

import requests


# Request headers copied from a browser session. The cookie below is only a
# sample; replace it with one from your own session.
headers1 = {
    "authority": "authority",
    "cookie": "t=9112f19ggggUjn6IZNGOI_GrdT9tGz36F",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36",
    "upgrade-insecure-requests": "1",
}


def request1(url):
    """Fetch a search page and return the parsed g_page_config object."""
    # time.sleep(3)
    print(url)
    html = requests.get(url, headers=headers1)
    # The product data is a JSON object assigned to g_page_config in the page
    # source. Capture it up to the last semicolon before g_srp_loadCss.
    matches = re.findall('g_page_config = (.*);.*?g_srp_loadCss', html.text, re.S)
    if not matches:
        # Without a valid cookie the page does not contain the data.
        raise ValueError("g_page_config not found in the page source")
    return json.loads(matches[0])


# The search keyword "笔记本" means "laptop".
taobao_url = "https://s.taobao.com/search?q=笔记本"
taobao_html = request1(taobao_url)

auctions = taobao_html["mods"]["itemlist"]["data"]["auctions"]
print(len(auctions))
for auction in auctions:
    print(auction["nick"])   # seller nickname
    print(auction["title"])  # product title
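
To see what the regular expression does without sending a request, here is a minimal offline sketch. `SAMPLE_HTML` is a made-up, heavily trimmed stand-in for the real page source, and `extract_page_config` is a hypothetical helper name; only the pattern itself comes from the script above.

```python
import json
import re

# Made-up stand-in for the page source; the real page is far larger.
SAMPLE_HTML = """
<script>
    g_page_config = {"mods": {"itemlist": {"data": {"auctions": [
        {"nick": "shop-a", "title": "laptop; 14 inch"}
    ]}}}};
    g_srp_loadCss();
</script>
"""


def extract_page_config(page_source):
    """Return the g_page_config object as a dict, or None if it is absent."""
    # re.S lets "." match newlines, because the JSON can span several lines.
    # The greedy group backtracks to the last ";" that is still followed by
    # g_srp_loadCss, so semicolons inside the JSON strings are kept.
    match = re.search(r"g_page_config = (.*);.*?g_srp_loadCss", page_source, re.S)
    if match is None:
        return None
    return json.loads(match.group(1))


config = extract_page_config(SAMPLE_HTML)
for auction in config["mods"]["itemlist"]["data"]["auctions"]:
    print(auction["nick"], auction["title"])
```

Returning `None` when the pattern is missing makes it easy to tell a page served without the data (for example, when the cookie is rejected) from a JSON parsing problem.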

2. Scraping Taobao reviews

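Analyzing the page's JS shows that the reviews are loaded from: https://rate.taobao.com/feedRateList.htm?auctionNumId=542463533286&userNumId=107201929&currentPageNum=1&pageSize=20
(This needs the product id and the user id, which are the auctionNumId and userNumId parameters. Both can be found in the page source.)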
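
As a small sketch of how this URL can be assembled for any product and page, the snippet below rebuilds it from its four query parameters. `build_rate_url` and its argument names are hypothetical; the endpoint and parameter names are taken from the URL above. The format of the response is not covered here.

```python
from urllib.parse import urlencode

RATE_URL = "https://rate.taobao.com/feedRateList.htm"


def build_rate_url(auction_num_id, user_num_id, page=1, page_size=20):
    """Build the review-list URL for one product and one page of reviews."""
    params = {
        "auctionNumId": auction_num_id,  # product id
        "userNumId": user_num_id,        # user id
        "currentPageNum": page,
        "pageSize": page_size,
    }
    return RATE_URL + "?" + urlencode(params)


# Reproduces the example URL, using the ids from that URL.
print(build_rate_url(542463533286, 107201929))
```

Changing `page` steps through further pages of reviews for the same product.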