Scraping a table with BeautifulSoup

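Problem description:

I have a question that I suspect is fairly straightforward. I have pages of the following type, from which I would like to collect the information in the last table (if you scroll all the way down, it is the one in the box labelled "PROCEDURE"):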

http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-2&language=EN

The HTML of the table I want to scrape looks like this:

<tbody><tr class="doc_title"> 
<td style="background-image: url(&quot;/img/struct/navigation/gradient_blue.gif&quot;);" align="left" valign="top"><img src="/img/struct/functional/arrow_title_doc.gif" alt="" align="absmiddle" border="0" height="14" width="8"> <span style="font-weight: bold;">PROCEDURE</span></td><td style="background-image: url(&quot;/img/struct/navigation/gradient_blue.gif&quot;);" align="right" valign="top"> 
<table border="0" cellpadding="3" cellspacing="0" width="50"> 
<tbody><tr><td align="center"><a href="#top"><img src="/img/struct/functional/top_doc.gif" alt="" border="0" height="16" width="16"></a></td><td align="center"><img src="/img/struct/navigation/spacer.gif" alt="" border="0" height="10" width="15"></td><td align="center"><a href="#title2"><img src="/img/struct/functional/sort_up.gif" alt="" border="0" height="10" width="15"></a></td></tr></tbody></table></td></tr> 

<tr class="contents" valign="top"><td colspan="2"> 
<p></p><table style="border-collapse: collapse; width: 481.85pt;" align="center" cellspacing="0"> 
<tbody><tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Title</span></p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7"> 
<p style="">Mutual assistance for the recovery of claims relating to taxes, duties and other measures</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">References</span></p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7"> 
<p style=""><a href="http://ec.europa.eu/prelex/liste_resultats.cfm?CL=en&amp;ReqId=0&amp;DocType=COM&amp;DocYear=2009&amp;DocNum=0028">COM(2009)0028</a> – C6-0061/2009 – <a href="/oeil/FindByProcnum.do?lang=en&amp;procnum=CNS/2009/0007">2009/0007(CNS)</a></p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Date of consulting Parliament</span></p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7"> 
<p style="">16.2.2009</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Committee responsible</span></p> 

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date announced in plenary</p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7"> 
<p style="">ECON</p> 

<p style="">19.10.2009</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0pt 0.75pt; border-style: solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Committee(s) asked for opinion(s)</span></p> 

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date announced in plenary</p> 
</td> 
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2"> 
<p style="">CONT</p> 

<p style="">19.10.2009</p> 
</td> 
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2"> 
<p style="">JURI</p> 

<p style="">19.10.2009</p> 
</td> 
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2"> 
<p style="">&nbsp;</p> 
</td> 
<td style="border-width: 0.75pt 1pt 0pt 0pt; border-style: solid solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1"> 
<p style="">&nbsp;</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Not delivering opinions</span></p> 

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date of decision</p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2"> 
<p style="">CONT</p> 

<p style="">1.10.2009</p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2"> 
<p style="">JURI</p> 

<p style="">5.10.2009</p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2"> 
<p style="">&nbsp;</p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1"> 
<p style="">&nbsp;</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Rapporteur(s)</span></p> 

<p style="">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Date appointed</p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="3"> 
<p style="">Theodor Dumitru Stolojan</p> 

<p style="">21.7.2009</p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="2"> 
<p style="">&nbsp;</p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 20.59%;" rowspan="1" colspan="2"> 
<p style="">&nbsp;</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0pt 0.75pt; border-style: solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Discussed in committee</span></p> 
</td> 
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2"> 
<p style="">10.11.2009</p> 
</td> 
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2"> 
<p style="">1.12.2009</p> 
</td> 
<td style="border-width: 1pt 0pt 0pt; border-style: solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2"> 
<p style="">21.1.2010</p> 
</td> 
<td style="border-width: 0.75pt 1pt 0pt 0pt; border-style: solid solid none none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1"> 
<p style="">&nbsp;</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Date adopted</span></p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.88%;" rowspan="1" colspan="2"> 
<p style="">27.1.2010</p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2"> 
<p style="">&nbsp;</p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="2"> 
<p style="">&nbsp;</p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 15.3%;" rowspan="1" colspan="1"> 
<p style="">&nbsp;</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Result of final vote</span></p> 
</td> 
<td style="border-width: 0.75pt 0pt 1pt; border-style: solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 12.94%;" rowspan="1" colspan="1"> 
<p style="">+:</p> 

<p style="">–:</p> 

<p style="">0:</p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0pt; border-style: solid solid solid none; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 48.82%;" rowspan="1" colspan="6"> 
<p style="">39</p> 

<p style="">0</p> 

<p style="">1</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Members present for the final vote</span></p> 
</td> 
<td style="border-width: 0.75pt 1pt 0.75pt 0.75pt; border-style: solid; border-color: rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7"> 
<p style="">Burkhard Balz, Sharon Bowles, Udo Bullmann, Pascal Canfin, Nikolaos Chountis, George Sabin Cutaş, Leonardo Domenici, Derk Jan Eppink, Markus Ferber, Elisa Ferreira, Vicky Ford, José Manuel García-Margallo y Marfil, Jean-Paul Gauzès, Sylvie Goulard, Enikő Győri, Liem Hoang Ngoc, Eva Joly, Othmar Karas, Wolf Klinz, Jürgen Klute, Werner Langen, Astrid Lulling, Arlene McCarthy, Ivari Padar, Alfredo Pallone, Anni Podimata, Antolín Sánchez Presedo, Olle Schmidt, Edward Scicluna, Peter Simon, Peter Skinner, Theodor Dumitru Stolojan, Ivo Strejček, Kay Swinburne, Marianne Thyssen, Ramon Tremosa i Balcells</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border-left: 0.75pt solid rgb(0, 0, 0); border-right: 1pt solid rgb(0, 0, 0); border-top: 0.75pt solid rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 38.24%;" rowspan="1" colspan="1"> 
<p style=""><span style="font-weight: bold;">Substitute(s) present for the final vote</span></p> 
</td> 
<td style="border-left: 0.75pt solid rgb(0, 0, 0); border-right: 1pt solid rgb(0, 0, 0); border-top: 0.75pt solid rgb(0, 0, 0); padding: 2.8pt 5.1pt; vertical-align: top; width: 61.76%;" rowspan="1" colspan="7"> 
<p style="">Marta Andreasen, Sophie Briard Auconie, David Casa, Danuta Jazłowiecka, Arturs Krišjānis Kariņš, Philippe Lamberts, Andreas Schwab</p> 
</td> 
<td style="" rowspan="1" colspan="1"></td></tr> 

<tr style=""> 
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 38.24%;" rowspan="1" colspan="1"></td> 
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 12.94%;" rowspan="1" colspan="1"></td> 
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 2.94%;" rowspan="1" colspan="1"></td> 
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 4.71%;" rowspan="1" colspan="1"></td> 
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 10.58%;" rowspan="1" colspan="1"></td> 
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 10%;" rowspan="1" colspan="1"></td> 
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 5.29%;" rowspan="1" colspan="1"></td> 
<td style="border: medium none; margin: 0pt; padding: 0pt; width: 15.3%;" rowspan="1" colspan="1"></td> 
<td style="" rowspan="1" colspan="1"></td></tr> 
</tbody></table> 
</td></tr> 
</tbody> 

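The problem I am facing is that the table tags have no identifier (as far as I can tell), so I do not know how to select this table and scrape the information from it. I have been using BeautifulSoup to get other information from the site, but I am at a loss as to how to scrape this table.

If anyone could tell me how to proceed, I would be most grateful!

With kind regards,

Thomas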

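If you are a little clever, you can find the elements by their other attributes. I took a shot at grabbing your data; it may not be the best approach, but it gets you close.

The first thing I noticed is that the data you want definitely comes after the second occurrence of "PROCEDURE" (the first is a link, the second is the heading). So I split on that: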

data = html.split("PROCEDURE", 2)[2] 
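
To make the effect of that split concrete, here is a minimal sketch on a made-up string. The `html` value below is a hypothetical stand-in, not the real page. With a maximum of two splits, index 2 holds everything after the second occurrence of the separator:

```python
# Hypothetical stand-in for the downloaded page source.
html = '<a href="#proc">PROCEDURE</a> ... <span>PROCEDURE</span><table>rows</table>'

# maxsplit=2 yields at most three pieces; the last one is everything
# after the second "PROCEDURE".
parts = html.split("PROCEDURE", 2)
data = parts[2]
print(data)
```

If a page contains fewer than two occurrences of the marker, `parts[2]` raises an IndexError, so it is probably worth checking `len(parts)` before indexing.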

Then I looked for <td> tags with rowspan=1:

import BeautifulSoup  # BeautifulSoup 3 module, as used in this answer

bs = BeautifulSoup.BeautifulSoup(data)
tds = bs.findAll("td", { "rowspan": 1 })

Getting closer...

>>> tds[0].text 
u'Title' 
>>> tds[1].text 
u'Mutual assistance for the recovery of claims relating to taxes, duties and other measures' 
>>> tds[3].text 
u'References' 
>>> tds[4].text 
u'COM(2009)00282009/0007(CNS)2009 a>' 

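Note that I am skipping index 2 of tds, because they use it as a spacer or something (it is empty). Anyway, this is a start. The real trick I have found with BeautifulSoup is to feed it only the data from the area you know about, because then it gets less confused. It also prides itself on accepting bad-looking input, so don't be afraid to feed it junk.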

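I went a little further down the list of elements, and it is not perfect. You will need to refine the search, because the values sit in <p> elements inside the <td> cells.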
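
One possible way to refine the search is to walk the table row by row and collect the text of each <p> inside each cell. The sketch below is only an illustration under a few assumptions: it uses the newer `bs4` package with the standard `html.parser` backend (not the BeautifulSoup 3 module used above), the `rows_from_table` helper is a hypothetical name, and `sample` is a shortened copy of the "Committee responsible" row from the question:

```python
from bs4 import BeautifulSoup  # assumption: the bs4 package is installed

# Shortened sample based on the "Committee responsible" row of the table.
sample = """
<table><tbody><tr>
<td colspan="1"><p><span>Committee responsible</span></p>
<p>&nbsp;&nbsp;Date announced in plenary</p></td>
<td colspan="7"><p>ECON</p><p>19.10.2009</p></td>
<td colspan="1"></td>
</tr></tbody></table>
"""

def rows_from_table(html):
    """Return one list per <tr>; each cell is a list of its non-empty <p> texts."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        cells = []
        for td in tr.find_all("td"):
            texts = [p.get_text(strip=True) for p in td.find_all("p")]
            texts = [t for t in texts if t]  # drop paragraphs holding only &nbsp;
            if texts:  # drop empty spacer cells
                cells.append(texts)
        if cells:
            rows.append(cells)
    return rows

rows = rows_from_table(sample)
print(rows)
```

Keeping each cell as a list of paragraphs preserves the pairing between a label such as "Date announced in plenary" and its value. Note that `find_all("tr")` also descends into nested tables, so this works best when it is fed only the inner table.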

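Hi Jed, thank you very much for your help; this is a very good example of how to proceed. I don't want to take up too much of your time, but could you tell me how to turn the elements of the table into a spreadsheet format? I know it is probably a lot of work, so if you don't have the time, that's fine. Best, Thomas. – 2010-07-03 10:06:15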
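
For the spreadsheet question, one option (a sketch, not something given in the thread) is to write the extracted cells out as CSV with Python's standard `csv` module, since most spreadsheet programs can open CSV files. The `rows` list and the `rows_to_csv` helper below are hypothetical; the values are copied from the table in the question:

```python
import csv
import io

# Hypothetical rows: a label followed by its values, as they might look
# once the cells have been extracted from the table.
rows = [
    ["Title", "Mutual assistance for the recovery of claims relating to taxes, duties and other measures"],
    ["Date of consulting Parliament", "16.2.2009"],
    ["Committee responsible", "ECON", "19.10.2009"],
]

def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module quotes fields containing commas."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = rows_to_csv(rows)
print(csv_text)
```

To produce a file instead of a string, the same writer can be pointed at a file opened with `open(path, "w", newline="", encoding="utf-8")`; an explicit encoding matters here because the member names contain non-ASCII characters.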