How to loop over an HTML table data set in Python

Question:

I'm a first-time poster here, trying to pick up some Python skills; please be nice to me :-)

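While I'm no stranger to programming concepts (I've been messing around with PHP before), the transition to Python has turned out to be somewhat difficult for me. I guess this is mostly because I lack most, if not all, basic understanding of common "design patterns" (?) and the like.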

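Having said that, here is the problem. Part of my current project is to write a simple scraper using Beautiful Soup. The data to be processed has a structure similar to the one listed below.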

<table> 
    <tr> 
     <td class="date">2011-01-01</td> 
    </tr> 
    <tr class="item"> 
     <td class="headline">Headline</td> 
     <td class="link"><a href="#">Link</a></td> 
    </tr> 
    <tr class="item"> 
     <td class="headline">Headline</td> 
     <td class="link"><a href="#">Link</a></td> 
    </tr> 
    <tr> 
     <td class="date">2011-01-02</td> 
    </tr> 
    <tr class="item"> 
     <td class="headline">Headline</td> 
     <td class="link"><a href="#">Link</a></td> 
    </tr> 
    <tr class="item"> 
     <td class="headline">Headline</td> 
     <td class="link"><a href="#">Link</a></td> 
    </tr> 
</table>

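The main problem is that I simply can't get my head around how to 1) keep track of the current date (tr -> td class="date") while 2) looping over the items in the subsequent tr:s (tr class="item" -> td class="headline" and tr class="item" -> td class="link") and 3) storing the processed data in an array.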

Additionally, all data will be inserted into a database, where each entry must contain the following information:

Date
Headline
Link

Note that crud:ing the database is not part of the problem; I only mention it to better illustrate what I'm trying to accomplish here :-)

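Now, there are many different ways to skin a cat. So, while a solution to the problem at hand is very welcome, I'd be extremely grateful if someone would care to elaborate on the actual logic and strategy you would use to "attack" this kind of problem :-)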

Last but not least, sorry for such a noobish question.

Answer

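The basic problem is that this table is marked up for looks, not for semantic structure. Done properly, each date and its related items would share a parent. Unfortunately they don't, so we'll have to make do.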

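The basic strategy is to iterate through the rows of the table:

If the first table cell has class "date", we get the date value and update last_seen_date
Otherwise, we extract the headline and link, then save (last_seen_date, headline, link) to the database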

from bs4 import BeautifulSoup

fname = r'c:\mydir\beautifulSoup.html'
with open(fname, 'r') as f:
    soup = BeautifulSoup(f, 'html.parser')

items = []
last_seen_date = None  # date of the most recent date row
for el in soup.find_all('tr'):
    daterow = el.find('td', {'class': 'date'})
    if daterow is None:  # not a date - get headline and link
        headline = el.find('td', {'class': 'headline'}).text
        link = el.find('a').get('href')
        items.append((last_seen_date, headline, link))
    else:  # get new date
        last_seen_date = daterow.text
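
The loop above collects tuples in items rather than writing to a database. As a rough sketch of that last step, assuming SQLite and a made-up table named news (neither comes from the question), the tuples can be inserted in one call. The items list below is a stand-in for what the loop would produce from the sample table.

```python
import sqlite3

# Stand-in for the `items` list built by the loop above:
# one (date, headline, link) tuple per <tr class="item"> row.
items = [
    ("2011-01-01", "Headline", "#"),
    ("2011-01-01", "Headline", "#"),
    ("2011-01-02", "Headline", "#"),
    ("2011-01-02", "Headline", "#"),
]

# Hypothetical schema; an in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (date TEXT, headline TEXT, link TEXT)")

# executemany binds each tuple to the three placeholders, in order.
# Using the connection as a context manager commits on success.
with conn:
    conn.executemany(
        "INSERT INTO news (date, headline, link) VALUES (?, ?, ?)", items
    )

# Count the stored rows per date to confirm each item kept its date.
rows_per_date = conn.execute(
    "SELECT date, COUNT(*) FROM news GROUP BY date ORDER BY date"
).fetchall()
print(rows_per_date)
```

Because every tuple already carries its own date, the rows can be inserted in any order; the date tracking is only needed while walking the table.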

Hi Hugh, I decided to go with your suggestion and it worked out great. Thanks for your effort! :-) – Mattias 2011-01-08 03:00:20

Answer

You can use ElementTree, which is included with Python.

http://docs.python.org/library/xml.etree.elementtree.html

from xml.etree.ElementTree import ElementTree

tree = ElementTree()
tree.parse('page.xhtml')  # This is the XHTML provided in the OP
root = tree.getroot()  # Returns the root "table" element
print(root.tag)  # "table"
for eachTableRow in root:
    # Iterating over root yields all of the <tr> elements
    # So we're going to loop over them and check their attributes
    if 'class' in eachTableRow.attrib:
        # Good to go. Now we know to look for the headline and link
        pass
    else:
        # Okay, so look for the date
        pass

That should be enough to get you on your way to parsing this.
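
One possible way to fill in the two pass branches is sketched below. It is not the answerer's code: the SAMPLE string simply inlines the table from the question so the example runs on its own, and the date tracking with last_seen_date is borrowed from the other answer.

```python
import xml.etree.ElementTree as ET

# Sample markup from the question, inlined so the sketch is self-contained.
SAMPLE = """<table>
  <tr><td class="date">2011-01-01</td></tr>
  <tr class="item">
    <td class="headline">Headline</td>
    <td class="link"><a href="#">Link</a></td>
  </tr>
  <tr class="item">
    <td class="headline">Headline</td>
    <td class="link"><a href="#">Link</a></td>
  </tr>
  <tr><td class="date">2011-01-02</td></tr>
  <tr class="item">
    <td class="headline">Headline</td>
    <td class="link"><a href="#">Link</a></td>
  </tr>
  <tr class="item">
    <td class="headline">Headline</td>
    <td class="link"><a href="#">Link</a></td>
  </tr>
</table>"""

root = ET.fromstring(SAMPLE)

items = []
last_seen_date = None  # date of the most recent date row
for row in root:  # direct children of <table>, i.e. the <tr> elements
    if row.get("class") == "item":
        # Item row: pair its headline and link with the current date.
        headline = row.find("td[@class='headline']").text
        link = row.find("td[@class='link']/a").get("href")
        items.append((last_seen_date, headline, link))
    else:
        # Any other row: update the date if a date cell is present.
        date_cell = row.find("td[@class='date']")
        if date_cell is not None:
            last_seen_date = date_cell.text

for item in items:
    print(item)
```

Keep in mind that ElementTree only accepts well-formed XML, so this works for the XHTML sample but may raise a ParseError on real-world HTML with unclosed tags.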

Hi, thanks for your input. I'm currently using beautifulsoup for scraping purposes, but I'll most likely look into Element Tree soon. Cheers! :-) – Mattias 2011-01-08 03:11:12
