Nokogiri: parsing an irregular "<"

Question:

<tr> 
<th>Total Weight</th> 
<td>< 1 g</td> 
<td style="text-align: right">0 %</td> 

</tr>    
<tr><td class="skinny_black_bar" colspan="3"></td></tr>

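However, I think the "<" sign in "< 1 g" is causing Nokogiri problems. Does anyone know of a workaround? Is there a way to escape the "<" sign? Or maybe there is a function I can call to just get the plain HTML segment?

Answer

A bare "less than" sign (<) isn't legal HTML, but browsers contain a lot of code for working out what the HTML was meant to say instead of just showing an error. That is why your invalid HTML sample displays the way you want in a browser.

So the trick is to make sure Nokogiri does the same work to compensate for bad HTML. Make sure you parse the file as HTML rather than XML: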

f = File.open("table.html") 
doc = Nokogiri::HTML(f)

This parses your file just fine, but it throws away the "< 1 g" text. Look at how the contents of the first two td elements are parsed:

doc.xpath('(//td)[1]/text()').to_s 
=> "\n " 

doc.xpath('(//td)[2]/text()').to_s 
=> "0 %"

Nokogiri threw away your invalid text but kept parsing the surrounding structure. You can even see the error message from Nokogiri:

doc.errors 
=> [#<Nokogiri::XML::SyntaxError: htmlParseStartTag: invalid element name>] 
doc.errors[0].line 
=> 3

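Yes, line 3 is the bad one.

So it seems Nokogiri does not recover from invalid HTML the way browsers do. I recommend preprocessing the file with some other library. I tried running TagSoup on your sample file, and it fixed the problem by changing < to &lt; like this: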

% java -jar tagsoup-1.1.3.jar foo.html | xmllint --format - 
src: foo.html 
<?xml version="1.0" standalone="yes"?> 
<html xmlns="http://www.w3.org/1999/xhtml"> 
    <body> 
    <table> 
     <tbody> 
     <tr> 
      <th colspan="1" rowspan="1">Total Weight</th> 
      <td colspan="1" rowspan="1">&lt;1 g</td> 
      <td colspan="1" rowspan="1" style="text-align: right">0 %</td> 
     </tr> 
     <tr> 
      <td colspan="3" rowspan="1" class="skinny_black_bar"/> 
     </tr> 
     </tbody> 
    </table> 
    </body> 
</html>
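
If adding a Java step is inconvenient, a similar preprocessing pass can be sketched in plain Ruby before the text is handed to Nokogiri. The helper name escape_stray_lt is hypothetical, and the rule it applies is an assumption, not a description of what TagSoup does internally: a "<" is treated as text unless it is immediately followed by a letter, "/", "!" or "?", that is, unless it looks like the start of a tag, a closing tag, a comment or doctype, or a processing instruction.

```ruby
# Hypothetical helper: escape every "<" that cannot start a tag.
# Assumption: real markup always has a letter, "/", "!" or "?"
# directly after the "<"; anything else is treated as literal text.
def escape_stray_lt(html)
  html.gsub(/<(?![A-Za-z\/!?])/, "&lt;")
end

row = "<tr>\n<th>Total Weight</th>\n<td>< 1 g</td>\n</tr>"
puts escape_stray_lt(row)
# The third line is printed as: <td>&lt; 1 g</td>
```

This only catches a "<" followed by a space, digit or other non-tag character. Text such as "a <b" still looks like the start of a tag and would be left alone, so it is a narrow fix for cases like "< 1 g" and not a general HTML repair tool.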

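Is there any Ruby package that parses HTML as robustly as TagSoup? – sampablokuper 2012-06-13 05:13:23

Answer

As a quick fix, I came up with this approach, which uses a regular expression to find unclosed tags: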

def fix_irregular_html(html)
  regexp = /<([^<>]*)(<|$)/

  # Repeat the substitution until nothing changes: each match consumes the
  # following "<", so back-to-back stray "<" signs need more than one pass.
  while (fixed_html = html.gsub(regexp, "&lt;\\1\\2")) && fixed_html != html
    html = fixed_html
  end

  fixed_html
end

See the full code, including tests, here: https://gist.github.com/796571

It worked well for me, and I would appreciate any feedback or improvements.
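
As a quick check, here is the function applied to the cell from the question. The function is repeated so the snippet runs on its own. The second call is an extra input, not taken from the gist, that shows one limitation: because "$" matches at the end of every line, an opening tag that is split across two lines is also treated as unclosed.

```ruby
# Copy of the function from the answer, repeated so this runs standalone.
def fix_irregular_html(html)
  regexp = /<([^<>]*)(<|$)/
  while (fixed_html = html.gsub(regexp, "&lt;\\1\\2")) && fixed_html != html
    html = fixed_html
  end
  fixed_html
end

# The stray "<" from the question is escaped; the real tags are kept.
puts fix_irregular_html("<td>< 1 g</td>")
# prints: <td>&lt; 1 g</td>

# Limitation: a tag broken across lines has its "<" escaped as well.
puts fix_irregular_html("<td\n class=\"x\">ok</td>")
# first printed line: &lt;td
```

If the input may contain tags with attributes on separate lines, it would probably be safer to join those lines first or to use a stricter pattern.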
