SAX parser for a very large XML file


I am dealing with a very large XML file (4 GB) and I keep getting an out-of-memory error, even though my Java heap is already at its maximum. Here is the code:

// One handler per input file; each handler fills its own table.
Handler h1 = new Handler("post");
Handler h2 = new Handler("comment");
posts = new Hashtable<Integer, Posts>();
comments = new Hashtable<Integer, Comments>();
edges = new Hashtable<String, Edges>();
try {
    output = new BufferedWriter(new FileWriter("gephi.gdf"));
    SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
    SAXParser parser1 = SAXParserFactory.newInstance().newSAXParser();

    // Parse both files; every row ends up in one of the tables above.
    parser.parse(new File("G:\\posts.xml"), h1);
    parser1.parse(new File("G:\\comments.xml"), h2);
} catch (Exception ex) {
    ex.printStackTrace();
}

@Override
public void startElement(String uri, String localName, String qName,
        Attributes atts) throws SAXException {
    if (qName.equalsIgnoreCase("row") && type.equals("post")) {
        post = new Posts();
        post.id = Integer.parseInt(atts.getValue("Id"));
        post.postTypeId = Integer.parseInt(atts.getValue("PostTypeId"));
        if (atts.getValue("AcceptedAnswerId") != null)
            post.acceptedAnswerId = Integer.parseInt(atts.getValue("AcceptedAnswerId"));
        else
            post.acceptedAnswerId = -1;
        post.score = Integer.parseInt(atts.getValue("Score"));
        if (atts.getValue("OwnerUserId") != null)
            post.ownerUserId = Integer.parseInt(atts.getValue("OwnerUserId"));
        else
            post.ownerUserId = -1;
        if (atts.getValue("ParentId") != null)
            post.parentId = Integer.parseInt(atts.getValue("ParentId"));
        else
            post.parentId = -1;
    }
    else if (qName.equalsIgnoreCase("row") && type.equals("comment")) {
        comment = new Comments();
        comment.id = Integer.parseInt(atts.getValue("Id"));
        comment.postId = Integer.parseInt(atts.getValue("PostId"));
        if (atts.getValue("Score") != null)
            comment.score = Integer.parseInt(atts.getValue("Score"));
        else
            comment.score = -1;
        if (atts.getValue("UserId") != null)
            comment.userId = Integer.parseInt(atts.getValue("UserId"));
        else
            comment.userId = -1;
    }
}



@Override
public void endElement(String uri, String localName, String qName)
        throws SAXException {
    if (qName.equalsIgnoreCase("row") && type.equals("post")) {
        posts.put(post.id, post);
        //System.out.println("Size of hash table is " + posts.size());
    } else if (qName.equalsIgnoreCase("row") && type.equals("comment"))
        comments.put(comment.id, comment);
}

Is there any way to optimize this code so that I don't run out of memory? Should I use streams? If so, how would you do that?


The SAX parser is efficient to a fault.

The posts, comments, and edges Hashtables immediately jump out at me as potential problems. I suspect you will need to periodically flush those maps out of memory to avoid an OOME; see the sketch below.
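For illustration, here is a minimal, self-contained sketch of that idea taken to its limit: write each row to the output as soon as it is parsed, so no table ever builds up and memory use stays constant. The class name, the output line format, and the choice of attributes are assumptions for the example, not the asker's actual code:

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    // Hypothetical streaming handler: each <row> is written straight to
    // the .gdf output inside startElement, so nothing accumulates no
    // matter how large the input file is.
    public class StreamingPostHandler extends DefaultHandler {
        private final BufferedWriter output;

        public StreamingPostHandler(BufferedWriter output) {
            this.output = output;
        }

        @Override
        public void startElement(String uri, String localName, String qName,
                Attributes atts) throws SAXException {
            if (!qName.equalsIgnoreCase("row"))
                return;
            try {
                // One output line per row; nothing is kept after this call.
                output.write(atts.getValue("Id") + "," + atts.getValue("Score"));
                output.newLine();
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }

        public static void main(String[] args) throws Exception {
            try (BufferedWriter out = new BufferedWriter(new FileWriter("gephi.gdf"))) {
                SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
                parser.parse(new File("G:\\posts.xml"), new StreamingPostHandler(out));
            }
        }
    }

If the GDF output genuinely needs posts and comments joined together, the cross-references could be resolved in a second pass, or the flush could be batched: write out and clear the table every N rows, as suggested above.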


Have a look at a project called SaxDoMix http://www.devsphere.com/xml/saxdomix/

It allows you to parse a large XML file and have certain elements returned as parsed DOM entities. That is much easier to work with than a pure SAX parser.
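If SaxDoMix is not an option, the "streams" idea from the question is also available in the standard library: the StAX pull API (javax.xml.stream, included since Java 6) reads one event at a time, so memory use is flat regardless of file size. A rough sketch, with attribute names assumed from the posts.xml handling shown above:

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxPostReader {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader = factory.createXMLStreamReader(
                    new FileInputStream("G:\\posts.xml"));
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "row".equalsIgnoreCase(reader.getLocalName())) {
                    // Each <row> is visible only while the cursor is on it;
                    // process it here and let it go.
                    String id = reader.getAttributeValue(null, "Id");
                    String score = reader.getAttributeValue(null, "Score");
                    // ... write one gephi.gdf line from id/score ...
                }
            }
            reader.close();
        }
    }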
