Link problem when trying to convert HTML to XML

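Problem description:

I want to convert an HTML file to XML. It is mostly working. The problem I am having is with links: right now it seems to completely ignore the link in my test file.

Here is the conversion code: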

<?php 
ini_set('display_errors', 1); 
ini_set('log_errors', 1); 
ini_set('error_log', dirname(__FILE__) . '/error_log.txt'); 
error_reporting(E_ALL); 

function convertToXML() 
{ 

    $titleLength = 35; 
    $output = ""; 
    $date = date("D, j M Y G:i:s T"); 
    $fi = fopen("../newsTEST.htm", "r"); 
    $fo = fopen("../newsfeed.xml", "w"); 

    //This is the first parts of the XML 
    $output .= "<?xml version=\"1.0\"?>\n"; 
    $output .= "<rss version=\"2.0\">\n"; 
    $output .= "<channel>\n"; 
    $output .= "\t<title>Wiggle 100 News</title>\n"; 
    $output .= "\t<link>http://www.wiggle100.com/news.php</link>\n"; 
    $output .= "\t<description>Wiggle 100 Daily News</description>\n"; 
    $output .= "\t<language>en-us</language>\n"; 
    $output .= "\t<pubDate>". $date ."</pubDate>\n"; 
    $output .= "\t<managingEditor>[email protected]</managingEditor>\n"; 
    $output .= "\t<webMaster>[email protected]</webMaster>\n"; 

    $article = ""; 
    $skip = true; //if false will continue to put lines into output until </p> 
    $newArticle = false; 

    while(!feof($fi)) 
    { 
     $line = fgets($fi); 
     $link = ""; 

     if(strpos($line, "<p") !== false) 
     { 
      $pos = strpos($line, "<p"); 
      $line = substr($line, $pos); 

      $pos = strpos($line, ">"); 
      $line = substr($line, $pos + 1); 

      $skip = false;   
     } 

     if(strpos($line, "</p>") !== false) 
     { 
      $pos = strpos($line, "</p>"); 
      $line = substr($line, 0, $pos - 1); 

      $newArticle = true; 
     } 

     //This adds the line to the article 
     if(!$skip) 
     { 
      $article .= $line; 
     } 

     //This mixes the article, title, link, and date with 
     // XML and puts it into the output 
     if($newArticle) 
     { 
      //This if is to get rid of stuff like <p>&nbsp;</p> 
      if((strlen($article) > 10)) 
      { 
       $link = findLink($article); 
       //$article = strip_tags($article); 
       $title = substr($article, 0, $titleLength) . "..."; 

       $output .= "\t<item>\n"; 
       $output .= "\t\t<title>". $title ."</title>\n"; 
       $output .= "\t\t<link>". $link ."</link>\n"; 
       $output .= "\t\t<description>". $article . "</description>\n"; 
       $output .= "\t\t<pubDate>". $date . "</pubDate>\n"; 
       $output .= "\t</item>\n\n"; 
      } 

      $article = ""; 
      $line = ""; 
      $skip = true; 
     } 
    } 

    $output .= "</channel>\n"; 
    $output .= "</rss>\n"; 

    fwrite($fo, $output); 

    fclose($fi); 
    fclose($fo); 

    echo "<br /><br /> News converted to XML"; 
} 

    //***************************************************************************** 
    //***************************************************************************** 

    //Find and return a link in the input. 
    //Else use the a default 
    function findLink($input) 
    { 
     $link = "http://www.wiggle100.com/news.php"; 

     if(strpos($input, "<a") !== false) 
     { 
      $startpos = strpos($input, "href"); 
      $link = substr($input, $startpos + 5); 
      $endpos = strpos($link, ">"); 
      $link = substr($link, 0, $endpos - 2); 
     } 
     return $link; 
    } 


?>

Here is the HTML test file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> 
<html><head><title>Test Page</title> 
<meta name="GENERATOR" content="MSHTML 8.00.6001.18812"> 
<meta content="text/html; charset=unicode" http-equiv="Content-Type"></head> 
<body bgcolor="#ffffff"> 
<p>&nbsp;</p> 
<p>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p align="center"><font size="6">This is the news for today. Blah Blah Blah!</font> 
<a href="http://www.thedailyreview.com/news/"> 
http://www.thedailyreview.com/news/</a></p> 
</body> 
</html>

Here is the XML output:

<rss version="2.0"> 
<channel> 
    <title>Wiggle 100 News</title> 
    <link>http://www.wiggle100.com/news.php</link> 
    <description>Wiggle 100 Daily News</description> 
    <language>en-us</language> 
    <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    <managingEditor>[email protected]</managingEditor> 
    <webMaster>[email protected]</webMaster> 
    <item> 
     <title>This is an article. Blah. Blah. Bla...</title> 
     <link>http://www.wiggle100.com/news.php</link> 
     <description>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
     <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
     <title>This is another article. Blah. Blah...</title> 
     <link>http://www.wiggle100.com/news.php</link> 
     <description>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
     <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
     <title>This is the 3rd article. Blah. Blah...</title> 
     <link>http://www.wiggle100.com/news.php</link> 
     <description>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah</description> 
     <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

    <item> 
     <title><font size="6">This is the news for...</title> 
     <link>http://www.wiggle100.com/news.php</link> 
     <description><font size="6">This is the news for today. Blah Blah Blah!</font> 
</description> 
     <pubDate>Fri, 23 Oct 2009 23:49:04 EDT</pubDate> 
    </item> 

</channel> 
</rss>

The font tags go away when I uncomment strip_tags().

Instead of parsing the HTML as a string, you could use an HTML parser in PHP. http://www.onderstekop.nl/articles/114/ – Xinus 2009-10-24 04:53:47

Why the downvote? – 2009-10-24 22:58:16
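
The article linked in that comment is not reproduced here. As one possible reading of the suggestion, the sketch below uses PHP's bundled DOM extension (DOMDocument) to collect each paragraph's text and first link. The function name `extractParagraphs` and the shape of the returned array are assumptions made for illustration, not part of the original code.

```php
<?php
// Hypothetical helper: returns one entry per <p> element, holding the
// paragraph's text and the href of its first <a> (or null if it has none).
function extractParagraphs(string $html): array
{
    $doc = new DOMDocument();
    // Old or sloppy HTML triggers parser warnings; collect them silently.
    libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();

    $items = [];
    foreach ($doc->getElementsByTagName('p') as $p) {
        $links = $p->getElementsByTagName('a');
        $items[] = [
            // textContent drops tags such as <font>, keeping only the text.
            'text' => trim($p->textContent),
            'link' => $links->length > 0
                ? $links->item(0)->getAttribute('href')
                : null,
        ];
    }
    return $items;
}
```

Because the parser works on the whole document rather than line by line, a paragraph that spans several lines is handled the same way as one that sits on a single line.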

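Answer

The problem turned out to be that I never reset $newArticle to false after writing an item to the XML output. As a result, once $newArticle had been set to true (when the first </p> was found), every later paragraph was output after only one line had been read. With $newArticle set back to false after each write, the program correctly keeps adding lines to the article until it reaches </p>.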
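
A minimal sketch of that fix, reduced to the paragraph-collecting part of the loop. The function name `collectArticles` and the array-of-lines input are assumptions made for illustration; the original reads the file with fgets() and writes XML instead of returning an array.

```php
<?php
// Hypothetical helper: gathers the contents of each <p>...</p> block from
// an array of lines, using the same flags as the loop in the question.
function collectArticles(array $lines): array
{
    $articles = [];
    $article = "";
    $skip = true;        // true while we are outside a paragraph
    $newArticle = false; // true once the closing </p> has been seen

    foreach ($lines as $line) {
        if (($pos = strpos($line, "<p")) !== false) {
            // Drop everything up to and including the end of the <p ...> tag.
            $line = substr($line, $pos);
            $line = substr($line, strpos($line, ">") + 1);
            $skip = false;
        }

        if (($pos = strpos($line, "</p>")) !== false) {
            // Keep everything before </p>. (The question uses $pos - 1,
            // which also cuts off the last character of the paragraph.)
            $line = substr($line, 0, $pos);
            $newArticle = true;
        }

        if (!$skip) {
            $article .= $line;
        }

        if ($newArticle) {
            // Ignore near-empty paragraphs such as <p>&nbsp;</p>.
            if (strlen($article) > 10) {
                $articles[] = $article;
            }
            $article = "";
            $skip = true;
            $newArticle = false; // the reset that was missing
        }
    }
    return $articles;
}
```

With the reset in place, a paragraph spread over several lines is accumulated as a whole, so the <a> tag on its second line is still part of the article when the link is looked up.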

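Answer

I did some testing and found that it works when every paragraph in the input file sits on a single line, as in the example below. (Except that it reads the opening quote as part of the URL, but that is easy to fix.)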

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> 
<html><head><title>Test Page</title> 
<meta name="GENERATOR" content="MSHTML 8.00.6001.18812"> 
<meta content="text/html; charset=unicode" http-equiv="Content-Type"></head> 
<body bgcolor="#ffffff"> 
<p>&nbsp;</p> 
<p>This is an article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p>This is another article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>This is the 3rd article. Blah. Blah. Blah. Blah. Blah. Blah. Blah.</p> 
<p>&nbsp;</p> 
<p align="center"><font size="6">This is the news for today. Blah Blah Blah!</font> <a href="http://www.thedailyreview.com/news/"> http://www.thedailyreview.com/news/</a></p> 
</body> 
</html>
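
The stray opening quote comes from findLink() skipping only the five characters of href= and then cutting at a fixed offset before the >. One way to avoid that, sketched here under the assumption that the attribute is always written as href="..." with double quotes and no spaces around the equals sign, is to cut between the two quote characters instead:

```php
<?php
// Returns the first double-quoted href value found in $input,
// or the default news page when no such attribute is present.
function findLink(string $input): string
{
    $default = "http://www.wiggle100.com/news.php";

    $start = strpos($input, 'href="');
    if ($start === false) {
        return $default;
    }
    $start += strlen('href="');         // first character of the URL
    $end = strpos($input, '"', $start); // position of the closing quote
    if ($end === false) {
        return $default;
    }
    return substr($input, $start, $end - $start);
}
```

For the last paragraph of the test file this would yield http://www.thedailyreview.com/news/ without the surrounding quotes.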

Thanks. This helped me find the problem. – 2009-10-24 23:00:36
