How to extract text that contains an HTML link?

Problem description:

I am trying to parse an HTML page with BaseX. Given this piece of markup:

<td colspan="2" rowspan="1" class="light comment2 last2"> 
    <img class="textalign10" src="templates/comment10.png" 
     alt="*" width="10" height="10" border="0"/> 
    <a shape="rect" href="mypage.php?userid=26682">user</a> 
    : the text I'd like to keep [<a shape="rect" 
    href="http://alink" rel="nofollow">Link</a>] . with that part too. 
</td>

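I need to extract the message together with the HTML a link inside it, and remove the first ":" character at the beginning of the message.

I would like to obtain exactly this text: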

<message> 
the text I'd like to keep [<a shape="rect" href="http://alink" rel="nofollow">Link</a>] . with that part too. 
</message>

Using this function,

declare 
function gkm:node_message_from_comment($comment as item()*) { 
    if ($comment) then 
    copy $c := $comment 
    modify (
     delete node $c/img[1], 
     delete node $c/a[1], 
     delete node $c/@*, 
     rename node $c as 'message' 
    ) 
    return $c 
    else() 
};

I am able to extract the text, but I failed to remove the ":" from the beginning. That is, I get:

<message> 
: the text I'd like to keep [<a shape="rect" href="http://alink" rel="nofollow">Link</a>] . with that part too. 
</message>

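Answer

Using XQuery Update and transform statements seems a bit overcomplicated to me. You could also select the nodes following the mypage.php link; with more knowledge about the input, there might be even better ways to select the required nodes.

To remove the ": " substring, use substring-after. If you insist on using the transform statement, the pattern "cut the ':' off the first result node and return all other result nodes unchanged" applies there as well.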

let $comment :=<td colspan="2" rowspan="1" class="light comment2 last2"> 
    <img class="textalign10" src="templates/comment10.png" alt="*" width="10" height="10" border="0"/> 
    <a shape="rect" href="mypage.php?userid=26682">user</a> 
    : the text I'd like to keep [<a shape="rect" href="http://alink" rel="nofollow">Link</a>] . with that part too. 
</td> 
let $result := $comment/a[starts-with(@href, 'mypage.php')]/following-sibling::node() 
return <message>{ 
    $result[1]/substring-after(., ': '), 
    $result[position() > 1] 
}</message>
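
For readers who prefer to stay with the transform statement from the question, the same idea can be sketched as follows. This is only a sketch under assumptions: the namespace URI bound to gkm is hypothetical (the question does not show its declaration), the parameter type is narrowed to a single optional element, and the leading ": " is assumed to sit in the first text node after the user link, as in the sample markup.

```xquery
(: hypothetical namespace URI; the question does not show how gkm is declared :)
declare namespace gkm = "urn:example:gkm";

declare function gkm:node_message_from_comment($comment as element()?) {
  if ($comment) then
    copy $c := $comment
    modify (
      delete node $c/img[1],
      delete node $c/a[1],
      delete node $c/@*,
      (: assumption: the first text node after the user link starts with ": " :)
      let $first := $c/a[1]/following-sibling::text()[1]
      return replace value of node $first with substring-after($first, ': '),
      rename node $c as 'message'
    )
    return $c
  else ()
};
```

All updates inside modify are collected first and applied together, so $c/a[1] can still be used to locate the text node even though the same link is scheduled for deletion.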

Since BaseX supports XQuery 3.0, you can also use the helper functions head and tail in the return clause:

return <message>{ 
    head($result)/substring-after(., ': '), 
    tail($result) 
}</message>
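
Because the fragment above relies on $result from the earlier query, it may help to see it as a self-contained function. The following is a minimal sketch; the function name local:message and the shortened sample td element are made up for illustration and are not part of the original answer.

```xquery
(: hypothetical helper wrapping the head/tail variant :)
declare function local:message($comment as element()) as element(message) {
  (: everything that follows the user link inside the comment cell :)
  let $result := $comment/a[starts-with(@href, 'mypage.php')]/following-sibling::node()
  return <message>{
    (: strip everything up to and including ": " from the first node :)
    head($result)/substring-after(., ': '),
    (: keep the remaining nodes, including the link, unchanged :)
    tail($result)
  }</message>
};

(: shortened sample input, for illustration only :)
local:message(
  <td class="light comment2 last2"><a href="mypage.php?userid=26682">user</a>: some text [<a href="http://alink">Link</a>] more text.</td>
)
```

If no mypage.php link is present, $result is empty and the function returns an empty message element, which may or may not be what you want.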

Works perfectly, thanks :) – KumZ
