Preg_match_all <a href

Question:

Hello, I want to extract links such as <a href="/portal/clients/show/entityId/2121" > and I want a regex which gives me /portal/clients/show/entityId/2121. The number at the end (2121) is different in other links. Any ideas?

Do you want to use a regex to extract '2121' from '/portal/clients/show/entityId/2121'? – halocursed 2009-10-05 12:11:00

No, I want to extract '/portal/clients/show/entityId/2121'. Another link can have a different number instead of 2121. Any ideas? – streetparade 2009-10-05 12:13:19

A regex for parsing links looks something like this:

'/<a\s+(?:[^"\'>]+|"[^"]*"|\'[^\']*\')*href=("[^"]+"|\'[^\']+\'|[^<>\s]+)/i'

Given how horrible that is, I would suggest using Simple HTML DOM to at least get the links. Then you can check each link's href with a very basic regex.
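As a rough illustration of the "very basic regex" step, here is a minimal sketch that assumes the href values have already been collected by a parser. The $hrefs array and the $pattern variable are hypothetical stand-ins, and the path prefix is taken from the question.

```php
<?php
// Hypothetical input: href values already collected by an HTML parser.
$hrefs = [
    '/portal/clients/show/entityId/2121',
    '/portal/clients/show/entityId/4636',
    '/portal/help',
];

// Accept only paths of the form /portal/clients/show/entityId/<digits>.
// '~' is used as the delimiter so the slashes need no escaping.
$pattern = '~^/portal/clients/show/entityId/\d+$~';

$clientLinks = [];
foreach ($hrefs as $href) {
    if (preg_match($pattern, $href) === 1) {
        $clientLinks[] = $href;
    }
}

print_r($clientLinks);
```

Anchoring the pattern with ^ and $ keeps it from matching longer paths that merely contain the prefix.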

@streetparade You may want to avoid including the quotes that delimit the attribute value in the captured value, so adjust the capturing part of the regex accordingly: '/<a\s+(?:[^"\'>]+|"[^"]*"|\'[^\']*\')*href="([^"]+)"|\'[^\']+\'|[^<>\s]+/i' – 2014-08-28 16:56:32

Parsing links from HTML can be done using an HTML parser.

Once you have all the links, simply get the index of the last forward slash, and you have your number. No regex needed.
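The last-slash idea could be sketched roughly as follows, assuming a single href value is already in hand. The $href value is a hypothetical sample taken from the question.

```php
<?php
// Hypothetical href value, as it might be returned by an HTML parser.
$href = '/portal/clients/show/entityId/2121';

// Position of the last forward slash, or false if the string has none.
$pos = strrpos($href, '/');

// Everything after the last slash is the id; fall back to the whole
// string when there is no slash at all.
$id = ($pos === false) ? $href : substr($href, $pos + 1);

echo $id, "\n";
```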

Hmm... $html->find('href') or something? – streetparade 2009-10-05 12:11:52

I don't know. Where does this find(...) come from? – 2009-10-05 12:42:36

Simple PHP HTML DOM Parser example:

// Create DOM from string 
$html = str_get_html($links); 

// or load the page from a URL (the scheme is required)
$html = file_get_html('http://www.example.com/');

foreach($html->find('a') as $link) { 
    echo $link->href . '<br />'; 
} 
This gives the result " – streetparade 2009-10-05 12:26:21

But I only want to extract /portal/clients/show/entityId/4636, so this works: '/<a\s+(?:[^"'>]+|"[^"]*"|'[^']*')*href=("[^"]+"|'[^']+'|[^<>\s]+)/i' – streetparade 2009-10-05 12:26:57

@streetparade My bad, I forgot to say $link->href. Edited. – karim79 2009-10-05 12:30:13

When "parsing" HTML I mostly rely on PHPQuery: http://code.google.com/p/phpquery/, rather than on regular expressions.

Don't use regular expressions for processing XML/HTML. This can be done very easily with the built-in DOM parser:

$doc = new DOMDocument(); 
$doc->loadHTML($htmlAsString); 
$xpath = new DOMXPath($doc); 
$nodeList = $xpath->query('//a/@href'); 
for ($i = 0; $i < $nodeList->length; $i++) { 
    # Xpath query for attributes gives a NodeList containing DOMAttr objects. 
    # http://php.net/manual/en/class.domattr.php 
    echo $nodeList->item($i)->value . "<br/>\n"; 
} 
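To narrow this down to the links from the question, one possible variation filters inside the XPath query itself. This is only a sketch: the sample markup in $htmlAsString is made up for illustration, and silencing libxml warnings with libxml_use_internal_errors() is an assumption about how you would want to handle real-world pages that are not well-formed.

```php
<?php
// Sample markup standing in for a downloaded page (made up for this sketch).
$htmlAsString = <<<'HTML'
<html><body>
<a href="/portal/clients/show/entityId/2121" >Client A</a>
<a class="x" href='/portal/clients/show/entityId/4636'>Client B</a>
<a href="/portal/help">Help</a>
<p>unclosed paragraph
</body></html>
HTML;

// Collect libxml warnings about malformed markup instead of printing them.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($htmlAsString);

// Discard the collected warnings and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select only href attributes that start with the wanted path prefix.
$xpath = new DOMXPath($doc);
$nodeList = $xpath->query('//a/@href[starts-with(., "/portal/clients/show/entityId/")]');

$links = [];
foreach ($nodeList as $attr) {
    $links[] = $attr->value; // DOMAttr::$value holds the attribute text
}

print_r($links);
```

Filtering in XPath keeps the PHP loop trivial; the same check could equally be done on each value after the query.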

Here is my solution:

<?php
// get links
$website = file_get_contents("http://www.example.com"); // download contents of www.example.com

// Capture the value of every double-quoted href.
// The pattern needs explicit delimiters: without them PCRE treats the
// leading "<" and trailing ">" as the delimiters instead of matching them.
preg_match_all('/<a\s+href="(.+?)"/i', $website, $matches);

// $matches[1] already holds only the captured URLs, so no cleanup is needed

// output all matches
print_r($matches[1]);
?>

I suggest avoiding XML-based parsers, because you won't always know whether the document or website is well-formed.

Good luck