Python: Getting data from a table using XPath
Question:
I am trying to get the data from the table at the bottom of http://projects.fivethirtyeight.com/election-2016/delegate-targets/.
import requests
from lxml import html

url = "http://projects.fivethirtyeight.com/election-2016/delegate-targets/"
response = requests.get(url)
doc = html.fromstring(response.text)
tables = doc.findall('.//table[@class="delegates desktop"]')
election = tables[0]
election_rows = election.findall('.//tr')

def extractCells(row, isHeader=False):
    if isHeader:
        cells = row.findall('.//th')
    else:
        cells = row.findall('.//td')
    return [val.text_content() for val in cells]

import pandas

def parse_options_data(table):
    rows = table.findall(".//tr")
    header = extractCells(rows[1], isHeader=True)
    data = [extractCells(row, isHeader=False) for row in rows[2:]]
    return pandas.DataFrame(data, columns=header)

election_data = parse_options_data(election)
election_data
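I am having trouble with the top row, which holds the candidates' names ('Trump', 'Cruz', 'Kasich'). It sits in tr class="top", and right now I only get tr class="bottom" (the row that starts with "won/target").
Any help is much appreciated!
Answer
The candidates' names are in row 0: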
candidates = [val.text_content() for val in rows[0].findall('.//th')[1:]]
Or, if you reuse the same extractCells() function:
candidates = extractCells(rows[0], isHeader=True)[1:]
The [1:] slice skips the first th cell, which is empty.
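To make the row-0 idea easy to try without fetching the page, here is a minimal sketch. It swaps in the standard library's xml.etree.ElementTree for lxml, and the SAMPLE markup and the cell_text helper are hypothetical stand-ins, not the real page structure or part of the original code. It only illustrates reading the th cells of the first row and dropping the empty leading one.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the table (NOT the real page markup).
SAMPLE = """
<table class="delegates desktop">
  <tr class="top"><th></th><th>Trump</th><th>Cruz</th><th>Kasich</th></tr>
  <tr class="bottom"><th>won/target</th><th>won/target</th><th>won/target</th></tr>
  <tr><td>x</td><td>y</td><td>z</td></tr>
</table>
"""

def cell_text(cell):
    # Join all nested text; similar in spirit to lxml's text_content().
    return "".join(cell.itertext()).strip()

table = ET.fromstring(SAMPLE)
rows = table.findall(".//tr")

# Row 0 is the "top" row; its first <th> is empty, so slice it off.
top_cells = [cell_text(th) for th in rows[0].findall(".//th")]
candidates = top_cells[1:]
print(candidates)
```

With lxml the same findall calls apply, and text_content() replaces the cell_text helper.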
Answer
Not pretty (the column count is hard-coded), but it works the way you want.
def parse_options_data(table):
    rows = table.findall(".//tr")
    # Candidate names from the top row, skipping the empty first cell.
    candidate = extractCells(rows[0], isHeader=True)[1:]
    # Keep the first three labels of the second row, then append the names.
    header = extractCells(rows[1], isHeader=True)[:3] + candidate
    data = [extractCells(row, isHeader=False) for row in rows[2:]]
    return pandas.DataFrame(data, columns=header)
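If the hard-coded 3 is a concern, one possible variation is to derive the number of leading columns from the two header rows. The sketch below is an assumption-laden illustration: merge_headers is a hypothetical helper, the sample labels are placeholders rather than the page's real headers, and it assumes each candidate name covers exactly one trailing column.

```python
def merge_headers(top, bottom):
    """Combine the candidate-name row with the column-label row.

    Assumption: each non-empty entry in `top` labels exactly one of the
    trailing columns in `bottom`.
    """
    names = [t for t in top if t.strip()]
    n_fixed = len(bottom) - len(names)  # leading columns kept as-is
    if n_fixed < 0:
        raise ValueError("more candidate names than columns")
    return bottom[:n_fixed] + names

# Placeholder labels, made up for illustration only.
top = ["", "Trump", "Cruz", "Kasich"]
bottom = ["col A", "col B", "col C",
          "won/target", "won/target", "won/target"]

header = merge_headers(top, bottom)
print(header)
```

The resulting list can be passed as columns= to pandas.DataFrame in place of the [:3] + candidate expression, provided the one-column-per-candidate assumption holds for the actual table.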