Split all HTML tags into an array
Question:
Let's suppose I have the following code:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Title of the document</title>
</head>
<body>
<div id="x">Hello</div>
<p>world</p>
<h1>my name</h1>
</body>
</html>
I need to extract all the HTML tags and put them into an array, like this:
'0' => '<!DOCTYPE html>',
'1' => '<html>',
'2' => '<head>',
'3' => '<meta charset="UTF-8">',
'4' => '<title>Title of the document</title>',
'5' => '</head>',
'6' => '<body>',
'7' => '<div id="x">Hello</div>',
'8' => '<p>world</p>',
'9' => '<h1>my name</h1>',
....
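In my case, I don't need to capture all the content inside each tag; grabbing just the opening of each tag would already be good enough.
How can I do this?
Answer
Use the following solution with the preg_match_all function: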
$html_content = '<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Title of the document</title>
</head>
<body>
<div id="x">Hello</div>
<p>world</p>
<h1>my name</h1>
</body>
</html>';
preg_match_all("/\<\w[^<>]*?\>([^<>]+?\<\/\w+?\>)?|\<\/\w+?\>/i", $html_content, $matches);
// <!DOCTYPE html> is a document type declaration, not a tag, so the pattern does not match it
print_r($matches[0]);
Output:
Array
(
[0] => <html>
[1] => <head>
[2] => <meta charset="UTF-8">
[3] => <title>Title of the document</title>
[4] => </head>
[5] => <body>
[6] => <div id="x">Hello</div>
[7] => <p>world</p>
[8] => <h1>my name</h1>
[9] => </body>
[10] => </html>
)
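Since the question says that only the tags themselves are needed, and it also lists <!DOCTYPE html> in the expected array, here is a minimal sketch of a variant that returns each tag as its own element without the text in between. The helper name extract_tags is hypothetical and not part of the answer above. The pattern assumes simple markup: no ">" inside attribute values, and no comments or script blocks. For arbitrary real-world HTML, a parser such as DOMDocument is likely a safer choice than a regular expression.

```php
<?php
// Hypothetical helper (not from the original answer): returns every tag
// as a separate array element, dropping the text between tags.
function extract_tags(string $html): array
{
    // First alternative: the <!DOCTYPE ...> declaration.
    // Second alternative: opening tags like <div id="x"> and closing tags like </div>.
    // Assumption: no '>' inside attribute values, no comments or scripts.
    preg_match_all('/<!DOCTYPE[^>]*>|<\/?\w[^<>]*>/i', $html, $matches);
    return $matches[0];
}

$html_content = '<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Title of the document</title>
</head>
<body>
<div id="x">Hello</div>
<p>world</p>
<h1>my name</h1>
</body>
</html>';

print_r(extract_tags($html_content));
```

For the sample document this gives 16 entries, starting with <!DOCTYPE html> and ending with </html>. To keep only the opening tags, remove the `\/?` part from the second alternative.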