Split all HTML tags into an array

Problem description:

Let's say I have the following code:

<!DOCTYPE html> 
<html> 
<head> 
<meta charset="UTF-8"> 
<title>Title of the document</title> 
</head>  
<body> 
<div id="x">Hello</div> 
<p>world</p> 
<h1>my name</h1> 
</body> 
</html> 

I need to extract all the HTML tags and put them into an array, like this:

'0' => '<!DOCTYPE html>', 
'1' => '<html>', 
'2' => '<head>', 
'3' => '<meta charset="UTF-8">', 
'4' => '<title>Title of the document</title>', 
'5' => '</head>', 
'6' => '<body>', 
'7' => '<div id="x">Hello</div>', 
'8' => '<p>world</p>', 
'9' => '<h1>my name</h1>', 
.... 

In my case, I don't need everything inside each tag; capturing just the opening of each tag would already be good enough.

How can I do this?

Here is a solution using the preg_match_all function:

$html_content = '<!DOCTYPE html> 
<html> 
<head> 
<meta charset="UTF-8"> 
<title>Title of the document</title> 
</head>  
<body> 
<div id="x">Hello</div> 
<p>world</p> 
<h1>my name</h1> 
</body> 
</html>'; 

preg_match_all("/\<\w[^<>]*?\>([^<>]+?\<\/\w+?\>)?|\<\/\w+?\>/i", $html_content, $matches); 
// <!DOCTYPE html> is a document type declaration, not a tag, so the pattern does not match it 

print_r($matches[0]); 

Output:

Array 
(
    [0] => <html> 
    [1] => <head> 
    [2] => <meta charset="UTF-8"> 
    [3] => <title>Title of the document</title> 
    [4] => </head> 
    [5] => <body> 
    [6] => <div id="x">Hello</div> 
    [7] => <p>world</p> 
    [8] => <h1>my name</h1> 
    [9] => </body> 
    [10] => </html> 
) 
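
Since the question says that capturing only the opening of each tag is enough, a shorter pattern may also do. The following is a minimal sketch, not part of the original answer: the function name extract_opening_tags and the shortened sample input are made up for illustration. It assumes no attribute value contains a literal ">" and that the markup has no comments or script blocks with tag-like text.

```php
<?php
// Hypothetical helper: return only the opening tags found in $html.
function extract_opening_tags(string $html): array
{
    // "<" followed by a letter, then anything up to the next ">".
    // Closing tags (</...>) and the doctype (<!...>) have no letter
    // right after "<", so they are skipped.
    preg_match_all('/<[a-z][^<>]*>/i', $html, $matches);
    return $matches[0];
}

// Shortened sample input based on the document in the question.
$html_content = '<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Title of the document</title>
</head>
<body>
<div id="x">Hello</div>
</body>
</html>';

print_r(extract_opening_tags($html_content));
```

For this input the array holds <html>, <head>, <meta charset="UTF-8">, <title>, <body> and <div id="x">, in that order.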

The best approach is to load the HTML into the DOMDocument class and iterate over the nodes.

See the related question here: https://*.com/a/20025973/2870598
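
As a rough illustration of the DOMDocument approach, the sketch below walks every element and rebuilds a string for its opening tag. It is not taken from the linked answer: the function name list_opening_tags is hypothetical, and the code assumes PHP's DOM extension is enabled. Because the tags are rebuilt from the parsed tree, attribute quoting and letter case may differ from the original source text.

```php
<?php
// Hypothetical helper: list a rebuilt opening tag for every element,
// in document order, using the DOM parser instead of a regex.
function list_opening_tags(string $html): array
{
    $doc = new DOMDocument();

    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $tags = [];
    // '*' selects every element node in the document.
    foreach ($doc->getElementsByTagName('*') as $element) {
        $tag = '<' . $element->nodeName;
        foreach ($element->attributes as $attribute) {
            $tag .= ' ' . $attribute->nodeName
                  . '="' . htmlspecialchars($attribute->nodeValue) . '"';
        }
        $tags[] = $tag . '>';
    }
    return $tags;
}

$html_content = '<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Title of the document</title>
</head>
<body>
<div id="x">Hello</div>
<p>world</p>
<h1>my name</h1>
</body>
</html>';

print_r(list_opening_tags($html_content));
```

Unlike the regex version, this does not return closing tags or the doctype; if the text content is needed as well, $element->textContent is available on each node.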