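Links are ignored when converting HTML to XML
Problem description:
I want to convert an HTML file to XML. Most of it works. The problem I am having is with links: right now the script seems to ignore the links in my test file completely.
Here is the conversion code: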
<?php
ini_set('display_errors', 1);
ini_set('log_errors', 1);
ini_set('error_log', dirname(__FILE__) . '/error_log.txt');
error_reporting(E_ALL);
function convertToXML()
{
$titleLength = 35;
$output = "";
$date = date("D, j M Y G:i:s T");
$fi = fopen("../newsTEST.htm", "r");
$fo = fopen("../newsfeed.xml", "w");
//This is the first parts of the XML
$output .= "<?xml version=\"1.0\"?>\n";
$output .= "<rss version=\"2.0\">\n";
$output .= "<channel>\n";
$output .= "\t<title>Wiggle 100 News</title>\n";
$output .= "\t<link>http://www.wiggle100.com/news.php</link>\n";
$output .= "\t<description>Wiggle 100 Daily News</description>\n";
$output .= "\t<language>en-us</language>\n";
$output .= "\t<pubDate>". $date ."</pubDate>\n";
$output .= "\t<managingEditor>[email protected]</managingEditor>\n";
$output .= "\t<webMaster>[email protected]</webMaster>\n";
$article = "";
$skip = true; //if false will continue to put lines into output until </p>
$newArticle = false;
while(!feof($fi))
{
$line = fgets($fi);
$link = "";
if(strpos($line, "<p") !== false)
{
$pos = strpos($line, "<p");
$line = substr($line, $pos);
$pos = strpos($line, ">");
$line = substr($line, $pos + 1);
$skip = false;
}
if(strpos($line, "</p>") !== false)
{
$pos = strpos($line, "</p>");
$line = substr($line, 0, $pos - 1);
$newArticle = true;
}
//This adds the line to the article
if(!$skip)
{
$article .= $line;
}
//This mixes the article, title, link, and date with
// XML and puts it into the output
if($newArticle)
{
//This if is to get rid of stuff like <p> </p>
if((strlen($article) > 10))
{
$link = findLink($article);
//$article = strip_tags($article);
$title = substr($article, 0, $titleLength) . "...";
$output .= "\t<item>\n";
$output .= "\t\t<title>". $title ."</title>\n";
$output .= "\t\t<link>". $link ."</link>\n";
$output .= "\t\t<description>". $article . "</description>\n";
$output .= "\t\t<pubDate>". $date . "</pubDate>\n";
$output .= "\t</item>\n\n";
}
$article = "";
$line = "";
$skip = true;
}
}
$output .= "</channel>\n";
$output .= "</rss>\n";
fwrite($fo, $output);
fclose($fi);
fclose($fo);
echo "<br /><br /> News converted to XML";
}
//*****************************************************************************
//*****************************************************************************
//Find and return a link in the input.
//Else use the a default
function findLink($input)
{
$link = "http://www.wiggle100.com/news.php";
if(strpos($input, "<a") !== false)
{
$startpos = strpos($input, "href");
$link = substr($input, $startpos + 5);
$endpos = strpos($link, ">");
$link = substr($link, 0, $endpos - 2);
}
return $link;
}
?>
Here is the HTML test file:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html><head><title>Test Page</title>
<meta name="GENERATOR" content="MSHTML 8.00.6001.18812">
<meta content="text/html; charset=unicode" http-equiv="Content-Type"></head>
<body bgcolor="#ffffff">
<p> </p>
<p>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p>
<p> </p>
<p>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p>
<p>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p>
<p> </p>
<p align="center"><font size="6">This is the news for today. Blah Blah Blah!</font>
<a href="http://www.thedailyreview.com/news/">
http://www.thedailyreview.com/news/</a></p>
</body>
</html>
Here is the XML output:
<rss version="2.0">
<channel>
<title>Wiggle 100 News</title>
<link>http://www.wiggle100.com/news.php</link>
<description>Wiggle 100 Daily News</description>
<language>en-us</language>
<pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate>
<managingEditor>[email protected]</managingEditor>
<webMaster>[email protected]</webMaster>
<item>
<title>This is an article. Blah. Blah. Bla...</title>
<link>http://www.wiggle100.com/news.php</link>
<description>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description>
<pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate>
</item>
<item>
<title>This is another article. Blah. Blah...</title>
<link>http://www.wiggle100.com/news.php</link>
<description>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description>
<pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate>
</item>
<item>
<title>This is the 3rd article. Blah. Blah...</title>
<link>http://www.wiggle100.com/news.php</link>
<description>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description>
<pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate>
</item>
<item>
<title><font size="6">This is the news for...</title>
<link>http://www.wiggle100.com/news.php</link>
<description><font size="6">This is the news for today. Blah Blah Blah!</font>
</description>
<pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate>
</item>
</channel>
</rss>
The font tags disappear when I uncomment strip_tags().
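As a side note on the strip_tags() remark, here is a minimal sketch of how one item could be built so that leftover markup and characters such as & do not end up raw in the feed. buildItem() is a hypothetical helper, not part of the script above, and it assumes the link has already been extracted from the article, because strip_tags() would remove the <a> tag that findLink() looks for.

```php
<?php
// Hypothetical helper: builds one <item> from an article that may still
// contain HTML. The link must be found BEFORE tags are stripped.
function buildItem($article, $link, $date, $titleLength = 35)
{
    // Remove markup such as <font> so it cannot leak into the title.
    $text = trim(strip_tags($article));
    $title = substr($text, 0, $titleLength) . "...";

    // Escape &, < and > so the text is safe inside XML elements.
    $esc = function ($s) {
        return htmlspecialchars($s, ENT_QUOTES | ENT_XML1, 'UTF-8');
    };

    $xml  = "\t<item>\n";
    $xml .= "\t\t<title>" . $esc($title) . "</title>\n";
    $xml .= "\t\t<link>" . $esc($link) . "</link>\n";
    $xml .= "\t\t<description>" . $esc($text) . "</description>\n";
    $xml .= "\t\t<pubDate>" . $esc($date) . "</pubDate>\n";
    $xml .= "\t</item>\n\n";
    return $xml;
}
```

In the original loop this would be called as buildItem($article, findLink($article), $date), keeping the findLink() call ahead of the stripping.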
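Answer
The problem turned out to be that I never reset $newArticle to false after writing an item to the XML output. Once $newArticle had been set to true (when the first </p> was found), the article was written out after every single line that was read, so a paragraph spanning several lines was cut off after its first line and the line containing the link never made it into the article. Setting $newArticle to false after writing the output makes the program keep appending lines to the article until it reaches </p>.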
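A minimal sketch of the corrected loop, reduced to the paragraph handling. collectParagraphs() is a hypothetical helper that works on an array of lines instead of a file handle; it also trims whitespace and cuts at the position of </p> rather than one character before it, which the original script does not do.

```php
<?php
// Hypothetical helper: same flag logic as convertToXML(), with the reset.
function collectParagraphs(array $lines)
{
    $articles = array();
    $article = "";
    $skip = true;         // true while we are outside a <p> element
    $newArticle = false;  // true once the closing </p> has been seen

    foreach ($lines as $line) {
        $pos = strpos($line, "<p");
        if ($pos !== false) {
            // Drop everything up to and including the end of the opening tag.
            $line = substr($line, $pos);
            $line = substr($line, strpos($line, ">") + 1);
            $skip = false;
        }
        $pos = strpos($line, "</p>");
        if ($pos !== false) {
            $line = substr($line, 0, $pos);
            $newArticle = true;
        }
        if (!$skip) {
            $article .= $line;
        }
        if ($newArticle) {
            // Ignore near-empty paragraphs such as <p> </p>.
            if (strlen(trim($article)) > 10) {
                $articles[] = trim($article);
            }
            $article = "";
            $skip = true;
            $newArticle = false; // the reset that was missing
        }
    }
    return $articles;
}
```

With the reset in place, a paragraph that spans three lines is collected as one article, so the <a> tag is still present when findLink() runs.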
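Answer
I did some testing and found that it works correctly when every paragraph in the input file sits on a single line, as in the example below. (Except that it reads the opening quote as part of the URL, but that is easy to fix.)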
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html><head><title>Test Page</title>
<meta name="GENERATOR" content="MSHTML 8.00.6001.18812">
<meta content="text/html; charset=unicode" http-equiv="Content-Type"></head>
<body bgcolor="#ffffff">
<p> </p>
<p>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p>
<p> </p>
<p>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p>
<p>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p>
<p> </p>
<p align="center"><font size="6">This is the news for today. Blah Blah Blah!</font> <a href="http://www.thedailyreview.com/news/"> http://www.thedailyreview.com/news/</a></p>
</body>
</html>
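One possible way to fix the quote issue is to capture only the value of the href attribute. The sketch below swaps the strpos()/substr() arithmetic for a regular expression; findLinkQuoted() is a hypothetical replacement for findLink(), and the default URL is the one used in the question.

```php
<?php
// Hypothetical replacement for findLink(): returns the href value of the
// first <a> tag without the surrounding quotes, or the default link.
function findLinkQuoted($input, $default = "http://www.wiggle100.com/news.php")
{
    // Group 1 is the optional opening quote, group 2 the URL itself;
    // \1 requires the same quote (or none) to close the value.
    $pattern = '/<a\b[^>]*\bhref\s*=\s*(["\']?)([^"\'\s>]+)\1/i';
    if (preg_match($pattern, $input, $m)) {
        return $m[2];
    }
    return $default;
}
```

This is still string matching rather than real HTML parsing, so unusual markup may defeat it.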
Thanks. This helped me find the problem. – 2009-10-24 23:00:36
Instead of parsing the HTML as a string, you could use an HTML parser in PHP. http://www.onderstekop.nl/articles/114/ – Xinus 2009-10-24 04:53:47
Why the downvote? – 2009-10-24 22:58:16
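To illustrate the parser suggestion in the comments, here is a rough sketch using PHP's bundled DOM extension, which is not necessarily the parser the linked article describes. extractArticles() is a hypothetical helper; it assumes the DOM extension is enabled and that each <p> element is one article.

```php
<?php
// Hypothetical helper: returns one entry per non-trivial <p> element,
// with the first link inside it or a default link.
function extractArticles($html, $defaultLink = "http://www.wiggle100.com/news.php")
{
    $doc = new DOMDocument();
    // Old or sloppy HTML triggers parser warnings; collect them silently.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $items = array();
    foreach ($doc->getElementsByTagName('p') as $p) {
        // textContent drops all tags; collapse line breaks and extra spaces.
        $text = trim(preg_replace('/\s+/', ' ', $p->textContent));
        if (strlen($text) <= 10) {
            continue; // skip empty paragraphs such as <p> </p>
        }
        $link = $defaultLink;
        $anchors = $p->getElementsByTagName('a');
        if ($anchors->length > 0 && $anchors->item(0)->getAttribute('href') !== '') {
            $link = $anchors->item(0)->getAttribute('href');
        }
        $items[] = array('text' => $text, 'link' => $link);
    }
    return $items;
}
```

Because the parser works on elements rather than lines, it does not matter whether a paragraph is written on one line or several.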