How do I parse HTML with HTML::TreeBuilder?
Problem description:
This is the HTML I want to parse:
[...]
<div class="item" style="clear:left;">
<div class="icon" style="background-image:url(http://nwn2db.com/assets/builder/icons/40x40/is_acidsplash.png);">
</div>
<h2>Acid Splash</h2>
<p>Caster Level(s): Wizard/Sorcerer 0
<br />Innate Level: 0
<br />School: Conjuration
<br />Descriptor(s): Acid
<br />Component(s): Verbal, Somatic
<br />Range: Medium
<br />Area of Effect/Target: Single
<br />Duration: Instant
<br />Save: None
<br />Spell Resistance: Yes
<p>
You fire a small orb of acid at the target for 1d3 points of acid damage.
</div>
[...]
This is my algorithm:
my $text = '';
scan_child($spells);
print $text, "\n";

sub scan_child {
    my $element = $_[0];
    return if ($element->tag eq 'script' or
               $element->tag eq 'a');   # prune!
    foreach my $child ($element->content_list) {
        if (ref $child) {               # it's an element
            scan_child($child);         # recurse!
        } else {                        # it's a text node!
            $child =~ s/(.*)\:/\\item \[$1\]/;  # itemize
            $text .= $child;
            $text .= "\n";
        }
    }
    return;
}
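It picks up the <key> : <value> patterns and prunes junk like <script> or <a>...</a>. I would like to improve it so that it also gets the <h2>...</h2> titles and all the <p>...<p> blocks, so that I can add some LaTeX markup.
Any clues?
Thanks in advance.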
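Answer
Because this may be an XY problem...
Mojo::DOM is a somewhat more modern framework for parsing HTML that uses CSS selectors. The following pulls the P element you need out of the document: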
use strict;
use warnings;
use Mojo::DOM;

my $dom = Mojo::DOM->new(do {local $/; <DATA>});

for my $h2 ($dom->find('h2')->each) {
    next unless $h2->all_text eq 'Acid Splash';
    # Get following P
    my $next_p = $h2;
    while ($next_p = $next_p->next_sibling()) {
        last if $next_p->node eq 'tag' and $next_p->type eq 'p';
    }
    print $next_p;
}
__DATA__
<html>
<body>
<div class="item" style="clear:left;">
<div class="icon" style="background-image:url(http://nwn2db.com/assets/builder/icons/40x40/is_acidsplash.png);">
</div>
<h2>Acid Splash</h2>
<p>Caster Level(s): Wizard/Sorcerer 0
<br />Innate Level: 0
<br />School: Conjuration
<br />Descriptor(s): Acid
<br />Component(s): Verbal, Somatic
<br />Range: Medium
<br />Area of Effect/Target: Single
<br />Duration: Instant
<br />Save: None
<br />Spell Resistance: Yes
<p>
You fire a small orb of acid at the target for 1d3 points of acid damage.
</div>
</body>
</html>
Output:
<p>Caster Level(s): Wizard/Sorcerer 0
<br>Innate Level: 0
<br>School: Conjuration
<br>Descriptor(s): Acid
<br>Component(s): Verbal, Somatic
<br>Range: Medium
<br>Area of Effect/Target: Single
<br>Duration: Instant
<br>Save: None
<br>Spell Resistance: Yes
</p>
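The loop above relies on next_sibling, node and type as they existed in Mojo::DOM when this answer was written. Current Mojolicious releases no longer provide next_sibling or node, and the element name is read with tag instead. A minimal sketch of the same lookup for a recent release, assuming a trimmed copy of the question's markup and using following() with a CSS selector, might look like this:

```perl
use strict;
use warnings;
use Mojo::DOM;

# Trimmed copy of the question's markup (an assumption, just enough to run).
my $html = <<'HTML';
<div class="item">
<h2>Acid Splash</h2>
<p>Caster Level(s): Wizard/Sorcerer 0
<br />Range: Medium
<p>
You fire a small orb of acid at the target for 1d3 points of acid damage.
</div>
HTML

my $dom = Mojo::DOM->new($html);

my @found;
for my $h2 ($dom->find('h2')->each) {
    next unless $h2->all_text eq 'Acid Splash';
    # following('p') collects the sibling <p> elements after this <h2>;
    # first() returns the nearest one, or undef if there is none.
    my $p = $h2->following('p')->first or next;
    push @found, $p;
    print $p->to_string, "\n";
}
```

Using a selector here avoids walking text nodes by hand, at the cost of only seeing element siblings.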
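Answer
I would use the look_down() method to scan the HTML. With look_down() I can first get a list of all the divs with class="item". Then I can iterate over them, finding and processing the h2 and the p, which I would then split using // as the delimiter.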
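The steps described above can be sketched roughly as follows. This is only an illustration, not the answerer's actual code: the markup is a trimmed copy of the question's HTML, the \subsection* and \item[...] output is a hypothetical choice of LaTeX markup, and the text is split at the <br> elements (an assumption) by keeping only the text children of each p.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Trimmed copy of the question's markup.
my $html = <<'HTML';
<html><body>
<div class="item" style="clear:left;">
<h2>Acid Splash</h2>
<p>Caster Level(s): Wizard/Sorcerer 0
<br />Innate Level: 0
<br />Range: Medium
<p>
You fire a small orb of acid at the target for 1d3 points of acid damage.
</div>
</body></html>
HTML

my $tree  = HTML::TreeBuilder->new_from_content($html);
my $latex = '';

# List context: every <div class="item"> in the document.
for my $item ($tree->look_down(_tag => 'div', class => 'item')) {
    # Scalar context: only the first <h2> at or below this div.
    my $h2 = $item->look_down(_tag => 'h2') or next;
    # \subsection* is a hypothetical choice of LaTeX markup.
    $latex .= '\subsection*{' . $h2->as_trimmed_text . "}\n";

    for my $p ($item->look_down(_tag => 'p')) {
        for my $node ($p->content_list) {
            next if ref $node;    # skip elements such as <br>
            (my $text = $node) =~ s/^\s+|\s+$//g;
            next unless length $text;
            if ($text =~ /^([^:]+):\s*(.*)$/) {
                $latex .= "\\item[$1] $2\n";    # "Key: value" line
            } else {
                $latex .= "$text\n";            # plain description text
            }
        }
    }
}
$tree->delete;
print $latex;
```

Anchoring the key pattern at the start of the line and stopping at the first colon keeps a later colon in the value from being swallowed into the key.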
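Maybe you should take a step back and work out what information you want to extract from the page you are scraping, and how you want to store it. If you have a specific schema or data structure in mind, adding it to the question would be helpful. If you just want to extract all the text, then you are already well on your way. – 2014-09-25 20:58:14
Maybe. I am still unclear about what HTML::TreeBuilder stores in its nodes. – Daniele 2014-09-25 21:39:22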
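On what the nodes hold: in an HTML::TreeBuilder tree, each element is an HTML::Element object and each piece of text is a plain Perl string sitting in its parent's content list, which is why the question's code can tell them apart with ref. A small sketch that walks a tree and prints both kinds, assuming a one-line sample of markup and a hypothetical helper named describe:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# One-line sample markup (an assumption, modelled on the question's HTML).
my $tree = HTML::TreeBuilder->new_from_content(
    '<div class="item"><h2>Acid Splash</h2>'
  . '<p>Range: Medium<br />Save: None</p></div>'
);

my @seen;

# describe() is a hypothetical helper, not part of HTML::TreeBuilder.
sub describe {
    my ($node, $depth) = @_;
    if (ref $node) {
        # Element: an object with a tag name, attributes and children.
        push @seen, 'element:' . $node->tag;
        print '  ' x $depth, '<', $node->tag, '>', "\n";
        describe($_, $depth + 1) for $node->content_list;
    } else {
        # Text node: just a string, with no methods of its own.
        push @seen, 'text:' . $node;
        print '  ' x $depth, '"', $node, '"', "\n";
    }
}

describe($tree, 0);
$tree->delete;
```

HTML::Element also provides a dump method that prints a similar outline of the tree, which can be handy for a quick look.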