Python: How do I extract a "data-bind" HTML element?

Problem description:

I want to extract data from a website, but the elements are hidden. When I use "View Source", the heading text is not shown:

<h4 data-bind="text: Name"></h4>

However, when I inspect the element in the browser, the text is visible:

<h4 data-bind="text: Name">STM1F-1S-HC</h4>
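The difference between the two snippets can be reproduced offline. The sketch below is only an illustration: it uses the standard library's `html.parser` instead of BeautifulSoup so it runs without extra installs, and `BoundTextParser` and `bound_texts` are hypothetical helper names. It assumes the binding is written exactly as `text: Name`.

```python
from html.parser import HTMLParser


class BoundTextParser(HTMLParser):
    """Collect the text of every <h4> whose data-bind attribute matches."""

    def __init__(self, binding):
        super().__init__()
        self.binding = binding
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h4" and dict(attrs).get("data-bind") == self.binding:
            self.inside = True
            self.texts.append("")

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data

    def handle_endtag(self, tag):
        if tag == "h4":
            self.inside = False


def bound_texts(html, binding="text: Name"):
    """Return the text content of each matching <h4> in the given HTML."""
    parser = BoundTextParser(binding)
    parser.feed(html)
    parser.close()
    return parser.texts


# What the server sends (View Source) versus what the browser shows (Inspect).
served = '<h4 data-bind="text: Name"></h4>'
rendered = '<h4 data-bind="text: Name">STM1F-1S-HC</h4>'

print(bound_texts(served))    # element is found, but its text is empty
print(bound_texts(rendered))  # text is present only after the page's JavaScript ran
```

An HTML parser only ever sees the first form, because the text is filled in by JavaScript after the page loads.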

The code I am using is:

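import urllib.request

from bs4 import BeautifulSoup


def getlink(link):
    """Fetch a product page and print the text of each bound <h4> element."""
    try:
        f = urllib.request.urlopen(link)
        # No parser is named here, which triggers the UserWarning shown in the output.
        soup0 = BeautifulSoup(f)
    except Exception as e:
        print(e)
        return  # nothing to parse if the request failed
    for row2 in soup0.find_all("h4", {"data-bind": "text: Name"}):
        Name = row2.text
        print(Name)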

# Find all links to the products for further processing.
# r1 (the parsed listing page) and productcount are defined in code not shown here.
i = 1
for row in r1.find_all('a', {"class": "col-xs-12 col-sm-6"}):
    link = 'https://www.truemfg.com/USA-Foodservice/' + row['href']
    print(link)
    getlink(link)
print(productcount)

The output is:

https://www.truemfg.com/USA-Foodservice/Products/Traditional-Reach-Ins 
C:\Users\Santosh\Anaconda3\lib\site-packages\bs4\__init__.py:181: UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently. 

The code that caused this warning is on line 193 of the file C:\Users\Santosh\Anaconda3\lib\runpy.py. To get rid of this warning, change code that looks like this: 

BeautifulSoup([your markup]) 

to this: 

BeautifulSoup([your markup], "lxml") 

    markup_type=markup_type)) 

https://www.truemfg.com/USA-Foodservice/Products/Specification-Series 

https://www.truemfg.com/USA-Foodservice/Products/Food-Prep-Tables 

https://www.truemfg.com/USA-Foodservice/Products/Undercounters 

https://www.truemfg.com/USA-Foodservice/Products/Worktops 

https://www.truemfg.com/USA-Foodservice/Products/Chef-Bases 

https://www.truemfg.com/USA-Foodservice/Products/Milk-Coolers 

https://www.truemfg.com/USA-Foodservice/Products/Glass-Door-Merchandisers 

https://www.truemfg.com/USA-Foodservice/Products/Air-Curtains 

https://www.truemfg.com/USA-Foodservice/Products/Display-Cases 

https://www.truemfg.com/USA-Foodservice/Products/Underbar-Refrigeration 

As you can see, no names are printed.

Can someone tell me how to get the names printed?

Thanks, Santosh

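The content you need is generated dynamically through an XHR request. You can try the code below to request the data directly and avoid parsing the HTML: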

import requests

url = 'https://prodtrueservices.azurewebsites.net/api/products/productline/403/1?skip=0&take=200&unit=Imperial'
r = requests.get(url)

# The JSON response holds a 'Products' list; print the name of each product.
for product in r.json()['Products']:
    print(product['Name'])

This should give you all the names.
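The `skip` and `take` parameters in that URL suggest the endpoint is paged, so a product line with more than 200 entries might need several requests. The following is a minimal sketch under that assumption, not something confirmed against the API: `build_url`, `fetch_page`, and `iter_product_names` are hypothetical helper names, and it uses the standard library's `urllib` in place of `requests`.

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE_URL = "https://prodtrueservices.azurewebsites.net/api/products/productline/403/1"


def build_url(skip, take, unit="Imperial"):
    """Build the request URL for one page of results."""
    return BASE_URL + "?" + urlencode({"skip": skip, "take": take, "unit": unit})


def fetch_page(skip, take):
    """Download one page and decode its JSON body (performs a network request)."""
    with urllib.request.urlopen(build_url(skip, take)) as response:
        return json.load(response)


def iter_product_names(fetch, take=200):
    """Yield product names page by page until a page comes back short or empty.

    Assumes skip/take behave like offset/limit; that is a guess from the URL.
    """
    skip = 0
    while True:
        products = fetch(skip, take).get("Products", [])
        for product in products:
            yield product["Name"]
        if len(products) < take:
            break
        skip += take
```

Calling `for name in iter_product_names(fetch_page): print(name)` would then print every name. Passing the fetch function as an argument keeps the paging loop easy to try out with canned data before hitting the real service.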

Thanks for your answer. Could you tell me how to find the URL that the XHR request uses? – lifeofpy

Can you clarify what exactly you want? – Andersson

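My requirement is this: for all the links in the output above, I want to extract all product information into an Excel sheet, with the folder and file paths of all images and files as a column in the sheet. How did you arrive at the URL 'https://prodtrueservices.azurewebsites.net/api/products/productline/403/1?skip=0&take=200&unit=Imperial'? – lifeofpy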