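Managing quotes in XPath (lxml)
Problem description:
I want to extract web elements from the table 'MANUFACTURING AT A GLANCE' on the given website. However, the name of the row I need contains a ' (single quote), and that breaks my syntax. How can I get around this? The same code works for the other rows.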
import requests
from lxml import html, etree
ism_pmi_url = 'https://www.instituteforsupplymanagement.org/ismreport/mfgrob.cfm?SSO=1'
page = requests.get(ism_pmi_url)
tree = html.fromstring(page.content)
PMI_CustomerInventories = tree.xpath('//strong[text()="Customers' Inventories"]/../../following-sibling::td/p/text()')
PMI_CustomerInventories_Curr_Val = PMI_CustomerInventories[0]
Answer
Here is my approach for sidestepping your problem. It may not be exactly what you need, but it might give you some ideas.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import lxml.html
import requests
from pprint import pprint


def load_lxml(response):
    # Parse the HTTP response body into an lxml element tree.
    return lxml.html.fromstring(response.text)


url = 'https://www.instituteforsupplymanagement.org/ismreport/mfgrob.cfm?SSO=1'
response = requests.get(url)
root = load_lxml(response)
headers = []
data = []
# Walk every row of the second table inside the feature container.
for index, row in enumerate(root.xpath('//*[@id="home_feature_container"]/div/div/div/span/table[2]/tbody/tr')):
    rows = []
    for cindex, column in enumerate(row.xpath('./th//text() | ./td//text()')):
        # Skip the second text node of each row.
        if cindex == 1:
            continue
        column = column.strip()
        # Ignore the first row and any empty text nodes.
        if index == 0 or not column:
            continue
        elif index == 1:
            # The second row holds the column headers.
            headers.append(column)
        else:
            rows.append(column)
    # Keep only rows that produced exactly six values.
    if rows and len(rows) == 6:
        data.append(rows)
data.insert(0, headers)
pprint(data)
Result:
[['Series Index',
'Feb',
'Series Index',
'Jan',
'Percentage',
'Point',
'Change',
'Direction',
'Rate of Change',
'Trend* (Months)'],
['65.1', '60.4', '+4.7', 'Growing', 'Faster', '6'],
['62.9', '61.4', '+1.5', 'Growing', 'Faster', '6'],
['54.2', '56.1', '-1.9', 'Growing', 'Slower', '5'],
['54.8', '53.6', '+1.2', 'Slowing', 'Faster', '10'],
['51.5', '48.5', '+3.0', 'Growing', 'From Contracting', '1'],
['47.5', '48.5', '-1.0', 'Too Low', 'Faster', '5'],
['68.0', '69.0', '-1.0', 'Increasing', 'Slower', '12'],
['57.0', '49.5', '+7.5', 'Growing', 'From Contracting', '1'],
['55.0', '54.5', '+0.5', 'Growing', 'Faster', '12'],
['54.0', '50.0', '+4.0', 'Growing', 'From Unchanged', '1']]
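For the quoting problem itself, here is a minimal sketch of two more direct options. XPath 1.0 string literals have no escape character, so the literal has to be delimited with the quote character it does not contain, and the surrounding Python string then has to escape (or avoid) that character. Alternatively, lxml accepts XPath variables as keyword arguments, which avoids quoting altogether. The HTML below is a hand-written stand-in for the real page: its markup, its values, and the assumption that the page uses a straight apostrophe are mine, and the variable name label is made up for the example.

```python
from lxml import html

# Hand-written sample that mimics the assumed row structure; not the real page.
SAMPLE = """
<table>
  <tr>
    <td><p><strong>Customers' Inventories</strong></p></td>
    <td><p>47.5</p></td>
  </tr>
</table>
"""

tree = html.fromstring(SAMPLE)

# Option 1: delimit the XPath literal with double quotes and escape the
# apostrophe at the Python level, because the Python string uses single quotes.
escaped = tree.xpath(
    '//strong[text()="Customers\' Inventories"]/../../following-sibling::td/p/text()'
)

# Option 2: pass the text as an XPath variable ($label), so the quote never
# appears inside the XPath expression at all.
via_variable = tree.xpath(
    '//strong[text()=$label]/../../following-sibling::td/p/text()',
    label="Customers' Inventories",
)

print(escaped, via_variable)
```

The variable form is usually the easier one to maintain, since it also copes with labels that contain both kinds of quote.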
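Thanks, wu4m4n. I have noticed that the website sometimes changes: the XPath goes from '/div/div/div/' to '/div/div[2]/div/' and so on, and this is completely unpredictable. That is why code using this technique has failed in the past.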
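One way to reduce that fragility, sketched under assumptions, is to anchor on the row label instead of on the chain of div positions. The helper row_values below is hypothetical (it is not part of lxml), and both layouts are hand-written examples of the kind of container reshuffling described, not copies of the real page.

```python
from lxml import html


def row_values(tree, label):
    """Return the non-empty cell texts of the row whose <strong> label matches."""
    cells = tree.xpath(
        # Find the row by its label, then take every cell after the first one.
        '//tr[.//strong[normalize-space()=$label]]/td[position()>1]//text()',
        label=label,
    )
    return [c.strip() for c in cells if c.strip()]


# Two made-up layouts with different container nesting around the same row.
LAYOUT_A = """<div><div><div><table>
<tr><td><p><strong>Customers' Inventories</strong></p></td><td><p>47.5</p></td><td><p>48.5</p></td></tr>
</table></div></div></div>"""

LAYOUT_B = """<div><div></div><div><span><table>
<tr><td><p><strong>Customers' Inventories</strong></p></td><td><p>47.5</p></td><td><p>48.5</p></td></tr>
</table></span></div></div>"""

values_a = row_values(html.fromstring(LAYOUT_A), "Customers' Inventories")
values_b = row_values(html.fromstring(LAYOUT_B), "Customers' Inventories")
print(values_a, values_b)
```

This still breaks if the label text itself changes, but it does not depend on how many div wrappers the site puts around the table.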