Trouble parsing just the IMG SRC from an RSS feed?
I'm trying to create an RSS reader based on this example:
http://www.w3schools.com/php/php_ajax_rss_reader.asp
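Specifically, I'm trying to modify this example so that the reader fetches and displays all available comic images (and nothing else) from any given webcomic RSS feed. I realize it may be necessary to make the code at least somewhat site-specific, but I'm trying to keep it as general-purpose as possible. So far, I have modified the original example to produce a reader that displays all the comics from a given list of RSS feeds. However, it also displays other unwanted text that I'm trying to get rid of. Here is my code so far, along with a few feeds that are giving me particular trouble:
index.php: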
<html>
<head>
<script>
function showRSS()
{
if (window.XMLHttpRequest)
{
// code for IE7+, Firefox, Chrome, Opera, Safari
xmlhttp=new XMLHttpRequest();
} else
{ // code for IE6, IE5
xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
}
xmlhttp.onreadystatechange=function()
{
if (xmlhttp.readyState==4 && xmlhttp.status==200)
{
document.getElementById("rssOutput").innerHTML=xmlhttp.responseText;
}
}
xmlhttp.open("GET","logger.php",true);
xmlhttp.send();
}
</script>
</head>
<body onload="showRSS()">
<div id="rssOutput"></div>
</body>
</html>
(I'm fairly sure there is nothing wrong with this file; I think the problem is in the next one, but I'm including this one for completeness.)
logger.php:
<?php
//function to get all comics from an rss feed
function getComics($xml)
{
$xmlDoc = new DOMDocument();
$xmlDoc->load($xml);
$x=$xmlDoc->getElementsByTagName('item');
foreach ($x as $x)
{
$comic_image=$x->getElementsByTagName('description')->item(0)->childNodes->item(0)->nodeValue;
//output the comic
echo ($comic_image . "</p>");
echo ("<br>");
}
}
//create array of all RSS feed URLs
$URLs =
[
"SMBC" => "http://www.smbc-comics.com/rss.php",
"garfieldMinusGarfield" => "http://garfieldminusgarfield.net/rss",
"babyBlues" => "http://www.comicsyndicate.org/Feed/Baby%20Blues",
];
//Loop through all RSS feeds
foreach ($URLs as $xml)
{
getComics($xml);
}
?>
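This approach includes extra text between the comic images (random text for SMBC, a few ad links for Garfield Minus Garfield, and a copyright link for Baby Blues). I looked at the RSS feeds and concluded that the problem is that the description tag contains the image source but also other content. Next, I tried modifying the getComics function to scan for image tags directly instead of looking up the description tag first. I replaced the part between the DOMDocument creation/loading and the URL list with: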
$images=$xmlDoc->getElementsByTagName('img');
print_r($images);
foreach ($images as $image)
{
//echo $image->item(0)->getAttribute('src');
echo $image->item(0)->nodeValue;
echo ("<br>");
}
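But apparently getElementsByTagName doesn't pick up image tags embedded inside the description tags, because I got no comic images in the output, and the print_r statement produced the following: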
DOMNodeList Object ([length] => 0) DOMNodeList Object ([length] => 0)
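Finally, I tried combining the two approaches by calling getElementsByTagName('img') inside the code that parses out the contents of the description tag. I replaced the line: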
$comic_image=$x->getElementsByTagName('description')->item(0)->childNodes->item(0)->nodeValue;
with:
$comic_image=$x->getElementsByTagName('description')->item(0)->getElementsByTagName('img');
print_r($comic_image);
But this also finds nothing, producing the output:
DOMNodeList Object ([length] => 0)
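The empty lists above are consistent with how such feeds are usually built: the markup inside the description is stored as escaped text, so the XML tree contains no img elements at all. The following is a minimal sketch of that situation, assuming a made-up one-item feed (the feed contents and the example.com URL are hypothetical, not taken from any of the feeds above):

```php
<?php
// Hypothetical one-item feed; the markup inside <description> is escaped text.
$feed = <<<'XML'
<rss version="2.0"><channel><item>
<description>&lt;p&gt;Today's strip&lt;/p&gt;&lt;img src="http://example.com/comic.png"&gt;</description>
</item></channel></rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($feed);

$description = $doc->getElementsByTagName('description')->item(0);

// No <img> elements exist in the XML tree, so this list is empty.
$imgCount = $doc->getElementsByTagName('img')->length;

// The description's only child is a text node holding the markup as a string.
$isText = $description->firstChild instanceof DOMText;

echo "img elements: $imgCount\n";
echo "child is text: " . ($isText ? 'yes' : 'no') . "\n";
echo "text content: " . $description->nodeValue . "\n";
```

The image tag only becomes reachable once that string is parsed a second time as HTML.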
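Sorry for the long background, but I'd like to know whether there is a way to parse just the img src out of a given RSS feed, without the other text and links that I don't want.
Any help would be appreciated.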
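Internally, the contents of the description are escaped, so getElementsByTagName sees them as plain text rather than as elements. The following code should work: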
foreach ($x as $y) {
$description = $y->getElementsByTagName('description')->item(0);
$decoded_description = htmlspecialchars_decode($description->nodeValue);
$description_xml = new DOMDocument();
$description_xml->loadHTML($decoded_description);
$comic_image = $description_xml->getElementsByTagName('img')->item(0)->getAttribute('src');
//output the comic
echo ($comic_image);
echo ("<br>");
}
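A possible variation on the loop above, offered as a sketch rather than a definitive fix: as far as I can tell, nodeValue already returns the description with its XML entities decoded, so the extra htmlspecialchars_decode call may be unnecessary, and it can turn an `&amp;` inside a URL into a bare `&`, which is one likely source of htmlParseEntityRef warnings from loadHTML. The helper name extractImageSources and the sample feed are hypothetical. The sketch also collects every image in the item and uses libxml_use_internal_errors to keep parser warnings out of the output.

```php
<?php
// Hypothetical helper: returns the src of every <img> found in an HTML fragment.
function extractImageSources(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() cannot parse an empty string
    }
    // Collect parser warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $fragment = new DOMDocument();
    $fragment->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $sources = [];
    foreach ($fragment->getElementsByTagName('img') as $img) {
        $src = $img->getAttribute('src');
        if ($src !== '') {
            $sources[] = $src;
        }
    }
    return $sources;
}

// Made-up feed item for illustration only.
$feed = <<<'XML'
<rss version="2.0"><channel><item>
<description>&lt;a href="http://example.com/?a=1&amp;amp;b=2"&gt;ad&lt;/a&gt;&lt;img src="http://example.com/comic.png"&gt;</description>
</item></channel></rss>
XML;

$doc = new DOMDocument();
$doc->loadXML($feed);

$sources = [];
foreach ($doc->getElementsByTagName('item') as $item) {
    $description = $item->getElementsByTagName('description')->item(0);
    if ($description !== null) {
        // nodeValue is passed to the HTML parser as-is, without a second decode.
        $sources = array_merge($sources, extractImageSources($description->nodeValue));
    }
}
print_r($sources);
```

Restoring the previous libxml error setting afterwards keeps the change local to this one parse, which is a little more targeted than prefixing the call with @.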
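For the reference of anyone else who reads this thread later, here is what my code ended up as. I replaced everything inside the foreach loop with a single call to a getImageSrc function, which in turn calls a function named getImageTag: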
//function to find an image tag within a specific section if there is one
function getImageTag ($item,$tagName)
{
//pull desired section from given item
$section = $item->getElementsByTagName($tagName)->item(0);
if (is_null($section)) //the item has no such section, so there is no image tag to find
{
return null;
}
//reparse the section's text as HTML, because its markup is stored as escaped text and getElementsByTagName can't see the image tag inside it directly
$decoded_section = htmlspecialchars_decode($section->nodeValue);
if (trim($decoded_section) === '') //loadHTML can't parse an empty string
{
return null;
}
$section_xml = new DOMDocument();
@$section_xml->loadHTML($decoded_section); //the @ is to suppress a bunch of warnings about characters this parser doesn't like
//pull image tag from section if there
$image_tag = $section_xml->getElementsByTagName('img')->item(0);
return $image_tag;
}
//function to get the image source URL from a given item
function getImageSrc ($item)
{
$image_tag = getImageTag($item,'description');
if (is_null($image_tag)) //if there was nothing with the tag name of image in the description section
{
//check in content:encoded section, because that's the next most likely place
$image_tag = getImageTag($item,'encoded');
if (is_null($image_tag)) //if there was nothing with the tag name of image in the encoded content section
{
//if the program gets here, it's probably because the feed is crap and doesn't include images,
//or it's because this particular item doesn't have a comic image in it
$image_src = '';
//THIS EXCEPTION WILL PROBABLY NEED TO BE HANDLED LATER TO AVOID POTENTIAL ERRORS
} else
{
$image_src = $image_tag->getAttribute('src');
}
} else
{
$image_src = $image_tag->getAttribute('src');
}
return $image_src;
}
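Thanks, I think I generally get what you're saying, and I tried your specific code. It works for some feeds but produces a strange error for others. For example, for SMBC it outputs 5 valid image URLs but repeatedly gives the following error: Warning: DOMDocument::loadHTML(): htmlParseEntityRef: expecting ';' in Entity, line: 30 in C:\xampp\htdocs\comic_database_logger\logger.php on line 30, which confuses me. I don't understand why a semicolon would be expected in the middle of some description text – user2472083
For Baby Blues, it works completely (although it outputs the image URL rather than the image itself, which I think I can sort out later), while for Garfield Minus Garfield it gives only the error listed above. Very confused – user2472083
Actually, I tried adding @ in front of the line causing the problem, since these are only warnings, and now everything works perfectly, except that I need to figure out how to display the images instead of the image source links – user2472083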
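On the last point, displaying the image instead of its URL mostly comes down to wrapping the URL in an img tag before echoing it. A minimal sketch, assuming a hypothetical helper named renderComicImage (not part of the code above) that also skips items for which no image source was found:

```php
<?php
// Hypothetical helper: turns an image URL into an <img> tag,
// or returns an empty string when no URL was found for the item.
function renderComicImage(string $src): string
{
    if ($src === '') {
        return '';
    }
    // Escape the URL so characters such as & and " cannot break the attribute.
    return '<img src="' . htmlspecialchars($src, ENT_QUOTES) . '" alt="comic"><br>';
}

// Example with a made-up URL.
echo renderComicImage('http://example.com/comic.png?a=1&b=2'), "\n";
```

In the loop, this would take the place of echoing the source directly, for example `echo renderComicImage(getImageSrc($item));`.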