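PHP regular expression to match HTML tags in a string

Problem description:

I'm trying to fix this bug in Drupal's Hashtags module: http://drupal.org/node/1718154

I have this function, which matches every word in my text that is prefixed by "#", like #tag: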

function hashtags_get_tags($text) { 
    $tags_list = array(); 
    $pattern = "/#[0-9A-Za-z_]+/"; 
    preg_match_all($pattern, $text, $tags_list); 
    $result = implode(',', $tags_list[0]); 
    return $result; 
}

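I need to ignore in-page links such as <a href="#reference">link</a>, or more generally any word prefixed by # that appears inside an HTML tag (that is, preceded by < and followed by >).

Any idea how I can do this?

Obligatory warning: trying to match HTML with regular expressions leads to trouble. For matching hashtags in a small amount of text within limited HTML, I'd guess the worst case is probably content that looks odd. Still, it is easy to get wrong, and using regex on HTML makes it easy to introduce security problems. Be very, very careful. – 2012-08-08 02:43:58

Someone always links to: [Parsing HTML with regular expressions](http://*.com/a/1732454/1421049). – 2012-08-08 03:33:15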

Actually, I think I can narrow my requirement: mostly, I want to ignore "#tag" inside "<a>" tags... – gerlos 2012-08-08 03:34:10

Answer

Could you strip the tags first, before matching (using the strip_tags function)?

function hashtags_get_tags($text) { 

    $text = strip_tags($text); 

    $tags_list = array(); 
    $pattern = "/#[0-9A-Za-z_]+/"; 
    preg_match_all($pattern, $text, $tags_list); 
    $result = implode(',', $tags_list[0]); 
    return $result; 
}

The regex gets tricky if you want to match only the hashtags that are not inside HTML tags.
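For reference, one way to attempt the pure-regex route is a negative lookahead that rejects a hashtag when the next angle bracket after it is ">". This is only a sketch: the function name hashtags_get_tags_outside_markup is hypothetical, and the pattern assumes the text contains no stray ">" characters outside of tags (a sentence like "#a is > b" would wrongly drop #a).

```php
<?php
// Hypothetical variant of hashtags_get_tags(); the name is illustrative only.
function hashtags_get_tags_outside_markup($text) {
    $tags_list = array();
    // (?![^<>]*>) rejects a match when the next angle bracket after it is ">",
    // which means the hashtag sits between the "<" and ">" of a tag.
    $pattern = '/#[0-9A-Za-z_]+(?![^<>]*>)/';
    preg_match_all($pattern, $text, $tags_list);
    return implode(',', $tags_list[0]);
}

// Prints: #php,#regex
echo hashtags_get_tags_outside_markup('See <a href="#reference">link</a> about #php and #regex');
```

Note that this only skips hashtags inside the tag itself; a hashtag in the link text, such as <a href="#top">#inside</a>, is still matched.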

Answer

You can use preg_replace

function hashtags_get_tags($text) {
    $tags_list = array();
    $pattern = "/#[0-9A-Za-z_]+/";
    // Remove anything that looks like an HTML tag before matching.
    $text = preg_replace("/<[^>]*>/", "", $text);
    preg_match_all($pattern, $text, $tags_list);
    $result = implode(',', $tags_list[0]);
    return $result;
}
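A quick check of what the tag-stripping pattern does may help here. The sample string below is made up for illustration. It shows that the href="#reference" fragment disappears along with the tag, while a hashtag in the link text survives, which matters if hashtags inside links should be ignored as well.

```php
<?php
// Made-up sample input, for illustration only.
$text = 'Read <a href="#reference">the #notes</a> on #php';

// Same tag-stripping pattern as in the answer above.
$stripped = preg_replace("/<[^>]*>/", "", $text);
echo $stripped, "\n";             // Read the #notes on #php

preg_match_all("/#[0-9A-Za-z_]+/", $stripped, $m);
echo implode(',', $m[0]), "\n";   // #notes,#php
```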

Answer

I wrote this function using PHP DOM to filter out the tags beforehand.

It returns all the links that do not have a # in the href.

If you want it to drop only in-page hash links (an href that starts with #), replace this line:

if(strpos($link->getAttribute('href'), '#') === false) {

with this:

if(strpos($link->getAttribute('href'), '#') !== 0) {

Here is the function:

function no_hashtags($text) {
    $doc = new DOMDocument();
    $doc->loadHTML($text);
    $links = $doc->getElementsByTagName('a');
    $nohashes = array();
    foreach ($links as $link) {
        // Keep only links whose href contains no '#'.
        if (strpos($link->getAttribute('href'), '#') === false) {
            // Serialize the link on its own through a temporary document.
            $temp = new DOMDocument();
            $elem = $temp->importNode($link->cloneNode(true), true);
            $temp->appendChild($elem);
            $nohashes[] = $temp->saveHTML();
        }
    }
    // return $nohashes;
    return implode('', $nohashes);
    // return implode(',', $nohashes);
}
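For the narrowed requirement from the comments (ignore "#tag" inside links), a DOM-based alternative is to collect hashtags only from text nodes that have no <a> ancestor. The following is a minimal sketch, not the module's actual fix: the function name hashtags_outside_links is hypothetical, and wrapping the input in a <div> is an assumption made so that HTML fragments and empty strings load cleanly.

```php
<?php
// Hypothetical helper, not part of the Hashtags module.
function hashtags_outside_links($html) {
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting them.
    libxml_use_internal_errors(true);
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $tags = array();
    // Visit only text nodes that are not nested inside an <a> element.
    foreach ($xpath->query('//text()[not(ancestor::a)]') as $node) {
        if (preg_match_all('/#[0-9A-Za-z_]+/', $node->nodeValue, $m)) {
            $tags = array_merge($tags, $m[0]);
        }
    }
    return implode(',', $tags);
}

// Prints: #php,#regex
echo hashtags_outside_links('Read <a href="#reference">the #notes</a> on #php and #regex');
```

Because attribute values are never text nodes, href="#reference" is skipped automatically, and the ancestor test also skips #notes in the link text.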
