How do I extract table information with BeautifulSoup?

Question:

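I need the information contained under Internship, Residency, and Fellowship. I can extract values from a table, but in this case I cannot decide which table to use, because the heading (such as Internship) sits outside the table as plain text under a div tag, and the table I need to extract comes after it. I also have many pages of this type, and not every page has all of these values; for example, some pages may have no Residency at all (which reduces the total number of tables on the page). An example of such a page is this. On this page, Internship does not exist at all.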

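The main problem I am facing is that all the tables have the same attribute values, so I cannot decide which table to use on different pages. If a value I am interested in does not appear on a page, I must return an empty string for that value.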

I am using BeautifulSoup in Python. Can someone point out how I could go about extracting these values?

Answer

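It looks like the ids of the headings and of the data both have unique values and standard suffixes. You can use that to search for the appropriate values. Here is my solution: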

from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3 (Python 2)

# Insert whatever networking stuff you're doing here. I'm going to assume
# that you've already downloaded the page and assigned it to a variable
# named 'html'

soup = BeautifulSoup(html)
headings = ['Internship', 'Residency', 'Fellowship']
values = []
for heading in headings:
    # With text=..., BeautifulSoup 3 returns the matching text node,
    # so its parent is the heading <span>.
    x = soup.find('span', text=heading)
    if x:
        span_id = x.parent['id']
        # Heading and data ids differ only in their suffix.
        table_id = span_id.replace('dnnTITLE_lblTitle', 'Display_HtmlHolder')
        values.append(soup.find('td', attrs={'id': table_id}).text)
    else:
        # Heading not on this page: record an empty string for it.
        values.append('')

print zip(headings, values)
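
The code above targets BeautifulSoup 3 on Python 2. As a rough sketch of the same id-suffix lookup with the current bs4 package on Python 3, something like the following could work. The sample markup, the id prefixes (dnn_ctr1_, dnn_ctr2_), the cell contents, and the helper name extract_sections are all made up for illustration; only the two suffixes come from the answer. Note that in bs4, find('span', string=...) returns the span tag itself rather than its text node, so the id is read from the span directly.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: these ids only imitate the pattern described above
# (a shared prefix plus the suffixes dnnTITLE_lblTitle / Display_HtmlHolder).
SAMPLE_HTML = """
<div><span id="dnn_ctr1_dnnTITLE_lblTitle">Residency</span></div>
<table><tr><td id="dnn_ctr1_Display_HtmlHolder">General Hospital, 1990</td></tr></table>
<div><span id="dnn_ctr2_dnnTITLE_lblTitle">Fellowship</span></div>
<table><tr><td id="dnn_ctr2_Display_HtmlHolder">City Clinic, 1993</td></tr></table>
"""


def extract_sections(html, headings):
    """Return {heading: cell text}, with '' for headings missing from the page."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for heading in headings:
        # In bs4 this returns the <span> tag whose text equals the heading.
        span = soup.find("span", string=heading)
        cell = None
        if span is not None and span.has_attr("id"):
            # Swap the heading suffix for the data suffix to get the cell id.
            cell_id = span["id"].replace("dnnTITLE_lblTitle", "Display_HtmlHolder")
            cell = soup.find("td", id=cell_id)
        # Fall back to '' when either the heading or its cell is absent.
        result[heading] = cell.get_text(strip=True) if cell is not None else ""
    return result


sections = extract_sections(SAMPLE_HTML, ["Internship", "Residency", "Fellowship"])
print(sections)
```

Here Internship is absent from the sample page, so it maps to an empty string, which matches the requirement in the question. The extra check on the td cell is an addition beyond the original answer, which would raise an error if a heading were present but its data cell were not.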

Thanks, it works! – Steve 2013-02-19 03:50:45
