BeautifulSoup missing/skipping tags

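Problem description:

If you could point me in the right direction, it would be much appreciated. Is there a better way to do this and capture all of the data, HTML tags included, for the class "DocumentText"?

If I do it like this, I miss some tags. The final raw HTML string is about 20K in size (so it is a lot of data).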

from bs4 import BeautifulSoup

# r is the HTTP response for the page being scraped
soup = BeautifulSoup(r.content, 'html5lib')
c.case_html = str(soup.find('div', class_='DocumentText'))
print(c.case_html)

Below is the code used for scraping. It worked until now, but it broke as soon as a second new tag was added.

soup = BeautifulSoup(r.content, 'html5lib')
c.case_html = str(soup.find('div', class_='DocumentText').find_all(['p', 'center', 'small']))
print(c.case_html)
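A minimal sketch of why the list-based `find_all` call above can lose data: it returns only tags whose names are in the list, so any other tag inside the div is silently dropped, and a nested match such as `<center>` inside `<p>` is returned twice. The `<blockquote>` element is a hypothetical stand-in for "a newly added tag", and `html.parser` is used here only so the sketch depends on nothing but `bs4`.

```python
from bs4 import BeautifulSoup

# Shortened sample; the <blockquote> is a hypothetical "new" tag not in the list.
html = (
    '<div class="DocumentText">'
    '<p>PTag</p>'
    '<p><center>First center</center></p>'
    '<small>this is small</small>'
    '<blockquote>not in the list</blockquote>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="DocumentText")

# find_all searches recursively and keeps only the listed tag names.
matched = div.find_all(["p", "center", "small"])
names = [tag.name for tag in matched]
as_string = str(matched)

print(names)
print("not in the list" in as_string)    # the blockquote content is gone
print(as_string.count("First center"))   # nested <center> is serialized twice
```

Serializing the div itself with `str(div)` avoids both effects, because nothing is filtered by tag name.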

Sample HTML is below. The original string is around 20K in size.

<form name="form1" id="form1"> 
<div id="theDocument" class="DocumentText" style="position: relative; float: left; overflow: scroll; height: 739px;"> 
<p>PTag</p> 
<p> <center> First center </center> </p> 
<small> this is small</small> 
<p>...</p> 
<p> <center> Second Center </center> </p> 
<p>....</p> 
</div> 
</form>

The expected output is this:

<div id="theDocument" class="DocumentText" style="position: relative; float: left; overflow: scroll; height: 739px;"> 
<p>PTag</p> 
<p> <center> First center </center> </p> 
<small> this is small</small> 
<p>...</p> 
<p> <center> Second Center </center> </p> 
<p>....</p> 
</div>
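One possible way to collect every top-level tag inside the div, whatever its name, is `find_all(True, recursive=False)`. The sketch below assumes the shortened sample shown in its own string and uses `html.parser`, which keeps the nesting as written. `html5lib` follows the HTML5 parsing algorithm and may rearrange invalid nesting such as a `<center>` inside a `<p>`, so its output can differ from the input markup.

```python
from bs4 import BeautifulSoup

sample_html = """<div id="theDocument" class="DocumentText">
<p>PTag</p>
<p> <center> First center </center> </p>
<small> this is small</small>
<p>...</p>
</div>"""

soup = BeautifulSoup(sample_html, "html.parser")
div = soup.find("div", class_="DocumentText")

# True matches any tag name; recursive=False limits the search to direct children,
# so nested tags stay inside their parents instead of being listed again.
children = div.find_all(True, recursive=False)
child_names = [child.name for child in children]

# Join the serialized children to rebuild the inner HTML without duplicates.
case_html = "".join(str(child) for child in children)

print(child_names)
print(case_html)
```

Because each child is serialized together with its descendants, the `<center>` tags appear exactly once, inside their `<p>` parents.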

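'c.case_html = str(soup.find('div', class_='DocumentText')' Why did you convert it to a string?

Which text from the elements pasted above do you want to parse? – SIM

What is your expected output? – chad

Answer

You can try this. I am answering based only on the HTML you provided. If you need clarification, please let me know. Thanks!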

soup = BeautifulSoup(r.content, 'html5lib')
# select() returns a list of matches; select_one() returns the first matching tag
case_html = soup.select_one('div.DocumentText')
print(case_html.get_text())
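Note that `get_text()` returns only the text and discards the markup. If the tags need to be kept, as in the expected output in the question, serializing the tag may be closer to what is wanted. The following is a rough sketch under the assumption that the shortened sample string below stands in for the real page; `html.parser` is used so that it needs only `bs4`.

```python
from bs4 import BeautifulSoup

sample_html = """<form name="form1" id="form1">
<div id="theDocument" class="DocumentText">
<p>PTag</p>
<p> <center> First center </center> </p>
<small> this is small</small>
<p> <center> Second Center </center> </p>
</div>
</form>"""

soup = BeautifulSoup(sample_html, "html.parser")
div = soup.select_one("div.DocumentText")

text_only = div.get_text()          # text content, no tags
with_tags = str(div)                # the div and everything inside it, tags included
inner_only = div.decode_contents()  # the div's children only, without the <div> wrapper

print(with_tags)
```

`str(div)` keeps the enclosing `<div>`, while `decode_contents()` is an option when only the inner markup is needed.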
