Scraping an HTML table with BeautifulSoup does not return all tags

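Problem description:

For some reason, my code returns the labels (e.g. 'Due Date', 'Recorded', 'Work Required'), but not their values. For example, when I run the code it returns "Due Date:" but not "2014-Nov-27".

To make things stranger, if I change the code to take the URL from raw input, it returns everything (both labels and values).

Keep in mind that I am trying to loop through a list of URLs that all share the same HTML format.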

HTML

<table id="mcs-initial-abstract-grid">
     <tr class="mci-grid-row-header"> 
      <th > 
       <table style="width:100%"> 
        <tr> 
         <td>SOME STRING</td> 
         <td>SOME INTEGER </td> 
         <td>SOME STRING</td> 
        </tr> 
       </table> 
      </th> 
     </tr> 

     <tr> 
      <td> 
       <table style="width:100%"> 
        <tr class="mci-gridview-alternate"> 
         <td style="width:25%"><strong>Due Date:</strong></td> 
         <td style="width:20%">2014-Nov-27</td> 
         <td style="width:20%"><strong>Recorded:</strong></td> 
         <td style="width:35%">2015-Nov-7</td> 
        </tr> 
        <tr > 
         <td><strong>Work Required:</strong></td> 
         <td>$20</td> 
         <td><strong>Variable:</strong></td> 
         <td>2015-Nov-25 14:20</td> 
        </tr> 
       </table> 
      </td> 
     </tr> 
     <tr>

My code:

from bs4 import BeautifulSoup as bs
import requests

url = 'enter url here'  # placeholder; replace with the page to scrape
r = requests.get(url)
html_content = r.text
soup = bs(html_content, 'html5lib')

# Walk the outer table, then its tbody elements, then the highlighted rows.
for table in soup.find_all('table', id='mcs-initial-abstract-grid'):
    for tbody in table.find_all('tbody'):
        for tr in tbody.find_all('tr', {'class': 'mci-gridview-alternate'}):
            for td in tr.find_all('td'):
                print(td.text)

Perhaps 'html5lib' is being generous in supplying missing tags; still, I notice that 'tbody' is absent from the HTML listed in the question.

Answer

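The most likely culprit is tbody. It is one of the "special" tags that are usually generated by browsers. Since you fetch the page source with requests, no real browser is involved, so you will not get tbody in html_content.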
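To check whether tbody is really there, it may help to parse a small copy of the markup and count the tbody elements. The snippet below is only a sketch: SAMPLE_HTML is a trimmed, hand-written copy of the table from the question (an assumption, not the real page), and it uses 'html.parser', which keeps the markup as written instead of adding missing tags.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question (assumed for illustration);
# note that the source contains no <tbody> at all.
SAMPLE_HTML = """
<table id="mcs-initial-abstract-grid">
  <tr>
    <td>
      <table style="width:100%">
        <tr class="mci-gridview-alternate">
          <td><strong>Due Date:</strong></td>
          <td>2014-Nov-27</td>
        </tr>
      </table>
    </td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table", id="mcs-initial-abstract-grid")

# html.parser does not synthesize <tbody>, so a loop over
# table.find_all("tbody") would never execute its body here.
tbody_count = len(table.find_all("tbody"))

# Searching for the rows directly does not depend on <tbody> being present.
rows = table.find_all("tr", class_="mci-gridview-alternate")
cells = [td.get_text(strip=True) for row in rows for td in row.find_all("td")]

print(tbody_count)
print(cells)
```

If tbody_count is 0 for the real page as well, any loop nested under find_all('tbody') is skipped entirely, so searching for the rows directly is the safer route.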

And if removing tbody from your HTML-parsing logic does not help, try other parsers, html.parser or lxml, instead of html5lib.
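One way to make the extraction independent of both tbody and the choice of parser is to start from the label cells and read the cell next to each one. The following is a minimal sketch under assumptions: extract_pairs is a hypothetical helper name, SAMPLE_HTML is a hand-written copy of the question's markup, and it assumes every label is wrapped in strong and is followed by its value in the next td.

```python
from bs4 import BeautifulSoup

# Hand-written copy of the question's markup (assumed for illustration).
SAMPLE_HTML = """
<table id="mcs-initial-abstract-grid">
  <tr><td>
    <table style="width:100%">
      <tr class="mci-gridview-alternate">
        <td><strong>Due Date:</strong></td><td>2014-Nov-27</td>
        <td><strong>Recorded:</strong></td><td>2015-Nov-7</td>
      </tr>
      <tr>
        <td><strong>Work Required:</strong></td><td>$20</td>
        <td><strong>Variable:</strong></td><td>2015-Nov-25 14:20</td>
      </tr>
    </table>
  </td></tr>
</table>
"""


def extract_pairs(html, parser="html.parser"):
    """Hypothetical helper: map each <strong> label to the text of the next <td>."""
    soup = BeautifulSoup(html, parser)
    table = soup.find("table", id="mcs-initial-abstract-grid")
    if table is None:
        return {}
    pairs = {}
    for strong in table.find_all("strong"):
        label_cell = strong.find_parent("td")
        if label_cell is None:
            continue
        # The value is assumed to live in the <td> right after the label cell.
        value_cell = label_cell.find_next_sibling("td")
        if value_cell is not None:
            label = strong.get_text(strip=True).rstrip(":")
            pairs[label] = value_cell.get_text(strip=True)
    return pairs


pairs = extract_pairs(SAMPLE_HTML)
print(pairs)
```

The parser argument lets you rerun the same function with 'lxml' or 'html5lib' (if installed) and compare results. For a list of URLs, the same function could be called on requests.get(url).text for each url in turn.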

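I removed the tbody tag and got very similar results. I also tried 'lxml' and 'html.parser', with no luck. Also, when I change the code to prompt for the URL with raw_input, the data is returned. That is fine, but I need to loop over a series of URLs.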
