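Personal project: "RSS FEED" XML parser

Problem description:

I'm relatively new to Java, and I've spent several days trying to figure out how to output the markup below. I would really appreciate some insight into this problem. Nothing I have found or tried displays it correctly. (Please excuse the cheesy news article.)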

<item> 
<pubDate>Sat, 21 Sep 2013 02:30:23 EDT</pubDate> 
<title> 
<![CDATA[ 
Carmen Bryan Lashes Out at Beyonce Fans for Throwing Shade (@carmenbryan) 
]]> 
</title> 
<link> 
http://www.vladtv.com/blog/174937/carmen-bryan-lashes-out-at-beyonce-fans-for-throwing-shade/ 
</link> 
<guid> 
http://www.vladtv.com/blog/174937/carmen-bryan-lashes-out-at-beyonce-fans-for-throwing-shade/ 
</guid> 
<description> 
<![CDATA[ 
<img ... /><br />. 
<p>In response to someone who reminded Bryan that Jay Z has Beyonce now, she tweeted.</p> 
<p>Check out what else Bryan had to say above.</p> 
<p>Source: </p> 
]]> 
</description> 
</item> 

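I have successfully parsed the XML and printed the contents of the title and description element tags, but the output for the description element also includes all of its child element markup. I hope to use this project in the future to build my Java portfolio, so please help!

My code so far: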

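import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class NewXmlReader {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            // parse(String) treats its argument as a URI and fetches the document from it.
            Document docXml = builder.parse(NewXMLReaderHandlers.inputHandler());
            docXml.getDocumentElement().normalize();

            NewXMLReaderHandlers.handleItemTags(docXml, "item");

        } catch (ParserConfigurationException | SAXException parserException) {
            System.out.println("You Are Not XML formatted !!");
            parserException.printStackTrace();
        } catch (IOException ioException) {
            System.out.println("URL NOT FOUND");
            // Print the failure; calling getCause() alone discards its result.
            ioException.printStackTrace();
        }
    }
}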

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class NewXMLReaderHandlers {

    private static int ARTICLELENGTH;

    // Prompts on standard input and returns the URL typed by the user.
    public static String inputHandler() throws IOException {
        InputStreamReader inputStream = new InputStreamReader(System.in);
        BufferedReader bufferRead = new BufferedReader(inputStream);
        System.out.println("Please Enter A Proper URL: ");
        String urlPageString = bufferRead.readLine();
        return urlPageString;
    }

    // Walks every element with the given tag name and prints selected children.
    public static void handleItemTags(Document document, String rssFeedParentTopicTag) {
        NodeList listOfArticles = document.getElementsByTagName(rssFeedParentTopicTag);
        NewXMLReaderHandlers.ARTICLELENGTH = listOfArticles.getLength();
        String rootElement = document.getDocumentElement().getNodeName();
        // Compare strings with equals(); == only compares object references.
        if ("rss".equals(rootElement)) {
            System.out.println("We Have An RSS Feed To Parse");

            for (int i = 0; i < NewXMLReaderHandlers.ARTICLELENGTH; i++) {
                Node itemNode = listOfArticles.item(i);
                if (itemNode.getNodeType() == Node.ELEMENT_NODE) {
                    Element itemElement = (Element) itemNode;
                    tagContent(itemElement, "title");
                    tagContent(itemElement, "description");
                }
            }
        }
    }

    // Prints the first child node of the named tag (and its next sibling).
    public static void tagContent(Element item, String tagName) {
        NodeList tagNodeList = item.getElementsByTagName(tagName);
        Element tagElement = (Element) tagNodeList.item(0);
        NodeList tagTElist = tagElement.getChildNodes();
        Node tagNode = tagTElist.item(0);

//      System.out.println(" - " + tagName + " : " + tagNode.getNodeValue() + "\n");
        if ("description".equals(tagName)) {
            System.out.println(" - " + tagName + " : " + tagNode.getNodeValue() + "\n\n");
            System.out.println(" Do We Have Any Siblings? " + tagNode.getNextSibling().getNodeValue() + "\n");
        }
    }
}

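For my money, the simplest solution is to use the XPath API.

Essentially, it is a query language for XML. See the XPath Tutorial for a primer.

This example uses an RSS feed from SO, which uses <entry...> instead of <item>, but I have used the same technique for other RSS (and XML) files, and even for very complex HTML documents...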

import java.io.IOException; 
import java.util.logging.Level; 
import java.util.logging.Logger; 
import javax.xml.parsers.DocumentBuilderFactory; 
import javax.xml.parsers.ParserConfigurationException; 
import javax.xml.xpath.XPath; 
import javax.xml.xpath.XPathConstants; 
import javax.xml.xpath.XPathExpression; 
import javax.xml.xpath.XPathExpressionException; 
import javax.xml.xpath.XPathFactory; 
import org.w3c.dom.Document; 
import org.w3c.dom.Element; 
import org.w3c.dom.Node; 
import org.w3c.dom.NodeList; 
import org.xml.sax.SAXException; 

public class TestRSSFeed { 

    public static void main(String[] args) { 
     try { 
      // Read the feed... 
      DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance(); 
      Document doc = factory.newDocumentBuilder().parse("http://*.com/feeds/tag?tagnames=java&sort=newest"); 
      Element root = doc.getDocumentElement(); 

      // Create a xPath instance 
      XPath xPath = XPathFactory.newInstance().newXPath(); 
      // Find all the nodes that are named <entry...> any where in 
      // the document that live under the parent node... 
      XPathExpression expression = xPath.compile("//entry"); 
      NodeList nl = (NodeList) expression.evaluate(root, XPathConstants.NODESET); 

      System.out.println("Found " + nl.getLength() + " items..."); 
      for (int index = 0; index < nl.getLength(); index++) { 
       Node node = nl.item(index); 
       // This is a sub node search. 
       // The search is based on the parent node and looks for a single 
       // node titled "title" that belongs to the parent node... 
       // I did this because I'm only expecting a single node... 
       expression = xPath.compile("title"); 
       Node child = (Node) expression.evaluate(node, XPathConstants.NODE); 
       System.out.println(child.getTextContent()); 
      } 

     } catch (IOException | ParserConfigurationException | SAXException exp) { 
      exp.printStackTrace(); 
     } catch (XPathExpressionException ex) { 
      ex.printStackTrace(); 
     } 
    } 

} 

Now, you can do some very complex queries, but I thought I'd start with a simple example ;)
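
To give a rough idea of what a slightly more involved query could look like, here is a minimal sketch that filters entries with XPath predicates. The class name `XPathPredicateSketch`, the `select` helper, and the in-memory `SAMPLE_FEED` document are all made up for illustration; they are not part of the feed or of the code above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathPredicateSketch {

    // Hypothetical sample document, standing in for a real feed.
    static final String SAMPLE_FEED =
          "<feed>"
        + "<entry><title>First</title><category term=\"java\"/></entry>"
        + "<entry><title>Second</title><category term=\"xml\"/></entry>"
        + "<entry><title>Third</title><category term=\"java\"/><category term=\"xml\"/></entry>"
        + "</feed>";

    // Hypothetical helper: evaluates the query against the XML string and
    // returns the text content of every matching node, in document order.
    static List<String> select(String xml, String query) {
        try {
            XPath xPath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xPath.evaluate(
                    query, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent());
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + query, e);
        }
    }

    public static void main(String[] args) {
        // Titles of entries that have a <category> whose term attribute is "xml".
        System.out.println(select(SAMPLE_FEED, "//entry[category/@term='xml']/title"));
        // Titles of entries whose title text contains "ir".
        System.out.println(select(SAMPLE_FEED, "//entry[contains(title, 'ir')]/title"));
    }
}
```

The predicate in square brackets is evaluated relative to each `entry`, so the filtering happens inside the query rather than in a Java loop.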

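This is much easier to implement than the regular DOM approach. I tried a few combinations, but I think I've run into a new problem, or perhaps the actual problem I had to face all along. The issue seems to be parsing the contents of a CDATA section as child elements of the parent node. I can't seem to get any of the information after it, i.e. the content up to ]]> seems impossible to traverse. I tried parsing the child elements, but the NodeList is always empty. Any other suggestions? – Khalismatic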

Never mind, I answered my own question! Thanks again for the help, MadProgrammer! – Khalismatic

Sorry, I meant to get back to you but got sidetracked :P – MadProgrammer

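In case anyone is still wondering how I managed to solve the CDATA mystery:

The logic is as follows:

Once your program extracts all of the XML and shows the correct node tree, just as the RSS feed displays it, the only way to access any XML data wrapped in a CDATA tag is to create a new XML document from the text content inside the CDATA tag. After parsing the new document, you should be able to access all the data you need.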
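
That logic could be sketched roughly like this. It is only an illustration under a few assumptions: the class `CdataReparseSketch`, both helper methods, and the `<wrapper>` root element are hypothetical names I made up, and the sample feed is a shortened stand-in for the real one. The wrapper is there because the CDATA text usually holds several sibling tags, and an XML document needs a single root. It also assumes the markup inside the CDATA is well-formed XML, which real-world HTML often is not.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class CdataReparseSketch {

    // Hypothetical helper: parses an XML string into a DOM document.
    static Document parseXml(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    // Hypothetical helper: returns the text of each <p> found inside the
    // CDATA section of the first <description> element in the feed.
    static List<String> descriptionParagraphs(String feedXml) {
        Document feed = parseXml(feedXml);
        Element description = (Element) feed.getElementsByTagName("description").item(0);
        // The parser treats CDATA as plain text, so this returns the raw markup.
        String cdataText = description.getTextContent();
        // Wrap in a made-up root so the fragment becomes a single-rooted document.
        Document inner = parseXml("<wrapper>" + cdataText + "</wrapper>");
        NodeList paragraphs = inner.getElementsByTagName("p");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < paragraphs.getLength(); i++) {
            result.add(paragraphs.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) {
        // Shortened, made-up stand-in for an <item> from the feed.
        String feedXml = "<rss><channel><item>"
                + "<title><![CDATA[ Sample title ]]></title>"
                + "<description><![CDATA[ <img src=\"x.png\" /><br />"
                + "<p>First paragraph.</p><p>Second paragraph.</p> ]]></description>"
                + "</item></channel></rss>";
        for (String paragraph : descriptionParagraphs(feedXml)) {
            System.out.println(paragraph);
        }
    }
}
```

If the embedded markup is not well-formed, the second parse throws, and an HTML parser would probably be a better fit than the XML one.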