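SAX parser for a very large XML file

Question:

I am processing a very large XML file (4 GB) and I always get an out-of-memory error, even though my Java heap is already set to the maximum. Here is the code: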

Handler h1 = new Handler("post");
Handler h2 = new Handler("comment");
posts = new Hashtable<Integer, Posts>();
comments = new Hashtable<Integer, Comments>();
edges = new Hashtable<String, Edges>();
try {
    output = new BufferedWriter(new FileWriter("gephi.gdf"));
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    SAXParser parser1 = SAXParserFactory.newInstance().newSAXParser();

    parser.parse(new File("G:\\posts.xml"), h1);
    parser1.parse(new File("G:\\comments.xml"), h2);
} catch (Exception ex) {
    ex.printStackTrace();
}

@Override
public void startElement(String uri, String localName, String qName,
        Attributes atts) throws SAXException {
    if (qName.equalsIgnoreCase("row") && type.equals("post")) {
        post = new Posts();
        post.id = Integer.parseInt(atts.getValue("Id"));
        post.postTypeId = Integer.parseInt(atts.getValue("PostTypeId"));
        if (atts.getValue("AcceptedAnswerId") != null)
            post.acceptedAnswerId = Integer.parseInt(atts.getValue("AcceptedAnswerId"));
        else
            post.acceptedAnswerId = -1;
        post.score = Integer.parseInt(atts.getValue("Score"));
        if (atts.getValue("OwnerUserId") != null)
            post.ownerUserId = Integer.parseInt(atts.getValue("OwnerUserId"));
        else
            post.ownerUserId = -1;
        if (atts.getValue("ParentId") != null)
            post.parentId = Integer.parseInt(atts.getValue("ParentId"));
        else
            post.parentId = -1;
    } else if (qName.equalsIgnoreCase("row") && type.equals("comment")) {
        comment = new Comments();
        comment.id = Integer.parseInt(atts.getValue("Id"));
        comment.postId = Integer.parseInt(atts.getValue("PostId"));
        if (atts.getValue("Score") != null)
            comment.score = Integer.parseInt(atts.getValue("Score"));
        else
            comment.score = -1;
        if (atts.getValue("UserId") != null)
            comment.userId = Integer.parseInt(atts.getValue("UserId"));
        else
            comment.userId = -1;
    }
}

@Override
public void endElement(String uri, String localName, String qName)
        throws SAXException {
    if (qName.equalsIgnoreCase("row") && type.equals("post")) {
        posts.put(post.id, post);
        //System.out.println("Size of hash table is " + posts.size());
    } else if (qName.equalsIgnoreCase("row") && type.equals("comment")) {
        comments.put(comment.id, comment);
    }
}

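Is there any way to optimize this code so that I don't run out of memory? Perhaps by using streams? If so, how would you do it?

If you don't like the SAX coding style and would rather use XPath, there is another option called extended VTD-XML. It loads the XML partially to save memory, and it is high-performance. Here is a paper: http://sdiwc.us/digitlib/journal_paper.php?paper=00000582.pdf – 2016-04-07 17:35:26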

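Answer

The SAX parser is efficient to a fault.

The posts, comments, and edges HashMaps immediately jump out at me as potential problems. I suspect you need to periodically flush these maps out of memory to avoid an OOME.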

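Yeah... let's build huge data structures in memory, but blame SAX. – 2011-04-16 03:57:22

How do you periodically flush those? – aherlambang 2011-04-16 04:08:32

@EquinoX To flush, you need to pause every X elements, write the data out to somewhere outside the JVM (e.g. a database, a disk file, etc.), and clear the maps for the next batch. – 2011-04-16 04:37:21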
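
The "pause every X elements, write out, clear" idea from the comment above could look roughly like the sketch below. It is only an illustration under assumptions: the class name BatchingRowHandler, the batchSize parameter, and the "Id,OwnerUserId" line format are made up for this example and are not part of the asker's code, which would need its own output format for gephi.gdf.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: keeps at most batchSize rows in memory, then writes them out. */
class BatchingRowHandler extends DefaultHandler {
    private final Writer out;                      // destination outside the heap (file, socket, ...)
    private final int batchSize;                   // assumed tuning knob, not from the original code
    private final List<String> batch = new ArrayList<>();

    BatchingRowHandler(Writer out, int batchSize) {
        this.out = out;
        this.batchSize = batchSize;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
            Attributes atts) throws SAXException {
        if (!qName.equalsIgnoreCase("row")) {
            return;
        }
        // Keep only the fields needed downstream; a missing owner becomes -1, as in the question.
        String owner = atts.getValue("OwnerUserId");
        batch.add(atts.getValue("Id") + "," + (owner != null ? owner : "-1"));
        if (batch.size() >= batchSize) {
            flush();
        }
    }

    @Override
    public void endDocument() throws SAXException {
        flush();                                   // write the final, possibly partial, batch
    }

    private void flush() throws SAXException {
        try {
            for (String line : batch) {
                out.write(line);
                out.write('\n');
            }
            out.flush();
        } catch (IOException e) {
            throw new SAXException(e);
        }
        batch.clear();                             // release the rows so they can be garbage collected
    }

    /** Small in-memory demo; real use would pass a file-backed Writer instead. */
    static String demo(String xml, int batchSize) throws Exception {
        StringWriter sink = new StringWriter();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new BatchingRowHandler(sink, batchSize));
        return sink.toString();
    }
}
```

With a file-backed Writer (for example a BufferedWriter around a FileWriter), heap use is bounded by batchSize rows rather than by the size of the input. Any step that needs to look up posts by id afterwards would then have to read them back from the file or a database instead of from an in-memory map.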

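Answer

Take a look at a project called SaxDoMix: http://www.devsphere.com/xml/saxdomix/

It lets you parse a large XML file and have certain elements returned as parsed DOM entities. Much easier than a pure SAX parser.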
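
For readers who want to see the general idea without the library, here is a minimal sketch that uses only JDK classes. It is not SaxDoMix's API (which is not shown in this thread); the SubtreeHandler class and its callback are hypothetical names. The handler streams through the file with SAX, builds a small DOM tree only for each element with the chosen name, hands that tree to a callback, and then drops it.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: SAX for the outer document, one small DOM per target element. */
class SubtreeHandler extends DefaultHandler {
    private final String target;                   // element name to materialize, e.g. "row"
    private final Consumer<Element> callback;      // receives each completed subtree
    private final DocumentBuilder builder;
    private final Deque<Element> open = new ArrayDeque<>();
    private Document doc;                          // fresh document per subtree

    SubtreeHandler(String target, Consumer<Element> callback) throws ParserConfigurationException {
        this.target = target;
        this.callback = callback;
        this.builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (open.isEmpty()) {
            if (!qName.equals(target)) {
                return;                            // outside any target subtree: build nothing
            }
            doc = builder.newDocument();
        }
        Element e = doc.createElement(qName);
        for (int i = 0; i < atts.getLength(); i++) {
            e.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        if (open.isEmpty()) {
            doc.appendChild(e);                    // root of this small document
        } else {
            open.peek().appendChild(e);
        }
        open.push(e);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!open.isEmpty()) {
            open.peek().appendChild(doc.createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (open.isEmpty()) {
            return;
        }
        Element done = open.pop();
        if (open.isEmpty()) {
            callback.accept(done);                 // subtree complete
            doc = null;                            // drop our reference to the small document
        }
    }

    /** Small in-memory demo that records the Id attribute and text of every row. */
    static List<String> demo(String xml) throws Exception {
        List<String> seen = new ArrayList<>();
        SubtreeHandler h = new SubtreeHandler("row",
                e -> seen.add(e.getAttribute("Id") + ":" + e.getTextContent()));
        SAXParserFactory.newInstance().newSAXParser().parse(new InputSource(new StringReader(xml)), h);
        return seen;
    }
}
```

Memory use then depends on the size of the largest single subtree, provided the callback does not keep the elements it receives. This sketch assumes the default, non-namespace-aware SAXParserFactory, so element and attribute names are taken from qName.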
