How do I parse HTML with HTML::TreeBuilder?

Problem description:

[...] 
<div class="item" style="clear:left;"> 
<div class="icon" style="background-image:url(http://nwn2db.com/assets/builder/icons/40x40/is_acidsplash.png);"> 
</div> 
    <h2>Acid Splash</h2> 
    <p>Caster Level(s): Wizard/Sorcerer 0 
    <br />Innate Level: 0 
    <br />School: Conjuration 
    <br />Descriptor(s): Acid 
    <br />Component(s): Verbal, Somatic 
    <br />Range: Medium 
    <br />Area of Effect/Target: Single 
    <br />Duration: Instant 
    <br />Save: None 
    <br />Spell Resistance: Yes 
    <p> 
    You fire a small orb of acid at the target for 1d3 points of acid damage. 
</div> 
[...]

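Here is my algorithm:

# $spells is assumed to be the HTML::Element (from HTML::TreeBuilder)
# that holds the spell entries; how it was obtained is not shown.
my $text = '';

scan_child($spells);

print $text, "\n";

sub scan_child {
    my $element = $_[0];
    # prune: skip <script> and <a> elements entirely
    return if ($element->tag eq 'script' or
               $element->tag eq 'a');
    foreach my $child ($element->content_list) {
        if (ref $child) {    # it's an element
            scan_child($child);    # recurse!
        } else {             # it's a text node!
            $child =~ s/(.*)\:/\\item \[$1\]/;    # itemize: "key:" becomes "\item [key]"
            $text .= $child;
            $text .= "\n";
        }
    }
    return;
}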

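It picks up the <key> : <value> patterns and prunes junk such as <script> or <a>...</a>. I would like to improve it so that it also captures the <h2>...</h2> heading and all the <p>...<p> blocks, so that I can add some LaTeX markup.

Any clues?

Thanks in advance.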

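Perhaps you should take a step back and work out what information you want to extract from the pages you are scraping, and how you want to store it. If you have a particular schema or data structure in mind, adding it to the question would help. If you just want to extract all the text, you are already well on your way. – 2014-09-25 20:58:14

Perhaps. I am still unclear about what HTML::TreeBuilder stores in its nodes. – Daniele 2014-09-25 21:39:22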

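Answer

Because this may be an XY problem ...

Mojo::DOM is a somewhat more modern framework for parsing HTML using CSS selectors. The following pulls the P element you want from the document: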

use strict; 
use warnings; 

use Mojo::DOM; 

my $dom = Mojo::DOM->new(do {local $/; <DATA>}); 

for my $h2 ($dom->find('h2')->each) { 
    next unless $h2->all_text eq 'Acid Splash'; 

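    # Get following P. Method names follow current Mojolicious releases;
    # older releases called these next_sibling / node / type.
    my $next_p = $h2;
    while ($next_p = $next_p->next_node) {
        last if $next_p->type eq 'tag' and $next_p->tag eq 'p';
    }

    # $next_p is undef when no P follows the heading
    print $next_p if $next_p;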
} 

__DATA__ 
<html> 
<body> 
<div class="item" style="clear:left;"> 
<div class="icon" style="background-image:url(http://nwn2db.com/assets/builder/icons/40x40/is_acidsplash.png);"> 
</div> 
    <h2>Acid Splash</h2> 
    <p>Caster Level(s): Wizard/Sorcerer 0 
    <br />Innate Level: 0 
    <br />School: Conjuration 
    <br />Descriptor(s): Acid 
    <br />Component(s): Verbal, Somatic 
    <br />Range: Medium 
    <br />Area of Effect/Target: Single 
    <br />Duration: Instant 
    <br />Save: None 
    <br />Spell Resistance: Yes 
    <p> 
    You fire a small orb of acid at the target for 1d3 points of acid damage. 
</div> 
</body> 
</html>

Output:

<p>Caster Level(s): Wizard/Sorcerer 0 
    <br>Innate Level: 0 
    <br>School: Conjuration 
    <br>Descriptor(s): Acid 
    <br>Component(s): Verbal, Somatic 
    <br>Range: Medium 
    <br>Area of Effect/Target: Single 
    <br>Duration: Instant 
    <br>Save: None 
    <br>Spell Resistance: Yes 
    </p>
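
The question also asks for the heading and every P block, not only the first one. A minimal sketch of how that might look with CSS selectors follows; it assumes a reasonably recent Mojolicious, and the helper names squish and extract_items, the hash keys, and the shortened sample markup are made up for illustration.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: collapse runs of whitespace and trim both ends.
sub squish {
    my ($s) = @_;
    $s =~ s/\s+/ /g;
    $s =~ s/^ | $//g;
    return $s;
}

# Hypothetical helper: one hash per <div class="item">, holding the
# heading text and the text of every paragraph inside that div.
sub extract_items {
    my ($html) = @_;
    my $dom = Mojo::DOM->new($html);
    my @items;
    for my $item ($dom->find('div.item')->each) {
        my $h2 = $item->at('h2');    # undef when the div has no heading
        push @items, {
            title      => $h2 ? squish($h2->all_text) : '',
            paragraphs => [ map { squish($_->all_text) } $item->find('p')->each ],
        };
    }
    return @items;
}

# Shortened copy of the markup from the question.
my $html = <<'HTML';
<div class="item" style="clear:left;">
    <h2>Acid Splash</h2>
    <p>Caster Level(s): Wizard/Sorcerer 0
    <br />Innate Level: 0
    <br />School: Conjuration
    <p>
    You fire a small orb of acid at the target for 1d3 points of acid damage.
</div>
HTML

for my $item (extract_items($html)) {
    print "$item->{title}\n";
    print "  $_\n" for @{ $item->{paragraphs} };
}
```

Note that all_text joins the text on either side of each BR into one string, so the key/value lines would still need to be split apart before adding LaTeX markup.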

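Answer

I use the look_down() method to scan the HTML. With look_down() I can first get a list of all the divs with class="item".

Then I can iterate over them, and find and process the h2 and p, which I will then split using // as the separator.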
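
The look_down() approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name items_to_latex is hypothetical, the \section and \item[...] commands are just one possible choice of LaTeX markup, the sample markup is a shortened copy of the question's, and the text is split on the child text nodes of each P rather than on an explicit separator.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: turn every <div class="item"> into lines of LaTeX.
sub items_to_latex {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @lines;

    # In list context look_down() returns every matching element.
    for my $item ($tree->look_down(_tag => 'div', class => 'item')) {
        # In scalar context it returns only the first match (or undef).
        my $h2 = $item->look_down(_tag => 'h2');
        push @lines, '\section{' . $h2->as_trimmed_text . '}' if $h2;

        for my $p ($item->look_down(_tag => 'p')) {
            # content_list mixes child elements (objects, such as <br>)
            # with text nodes (plain strings); keep only the text nodes.
            for my $node ($p->content_list) {
                next if ref $node;
                (my $text = $node) =~ s/^\s+|\s+$//g;
                next unless length $text;
                if ($text =~ /^([^:]+):\s*(.*)$/s) {
                    # "key: value" becomes a description-list entry
                    push @lines, sprintf('\item[%s] %s', $1, $2);
                }
                else {
                    push @lines, $text;    # free text, e.g. the description
                }
            }
        }
    }

    $tree->delete;    # release the tree explicitly
    return @lines;
}

# Shortened copy of the markup from the question.
my $html = <<'HTML';
<div class="item" style="clear:left;">
    <h2>Acid Splash</h2>
    <p>Caster Level(s): Wizard/Sorcerer 0
    <br />Innate Level: 0
    <br />School: Conjuration
    <p>
    You fire a small orb of acid at the target for 1d3 points of acid damage.
</div>
HTML

print "$_\n" for items_to_latex($html);
```

The ref test is the same one used in the question's scan_child: an HTML::Element node stores its children as a list in which elements are objects and text is kept as ordinary strings.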
