Which XML parser to use here?

Developer https://www.devze.com 2023-03-27 20:28 Source: web
I am receiving an XML file as input, whose size can vary from a few KB to a lot more. I am getting this file over a network. I only need to extract a small number of nodes, so most of the document is useless to me. I have no memory constraints; I just need speed.

Considering all this, I concluded:

  1. No DOM here (the document may be huge, I have no CRUD requirement, and the source is the network).

  2. No SAX, as I only need a small subset of the data.

  3. StAX could be the way to go, but I am not sure it is the fastest.

  4. JAXB came up as another option, but what sort of parser does it use? I read that it uses Xerces by default (which type is that, push or pull?), although I can configure it to use StAX or Woodstox as per this link

I have been reading a lot and am still confused by so many options! Any help would be appreciated.

Thanks !

Edit: I want to add one more question here: what is wrong with using JAXB here?


The fastest solution is by far a StAX parser, especially since you only need a specific subset of the XML file. With StAX you can simply skip whatever you don't need, whereas a SAX parser would push every event to you anyway.

But it's also a little more complicated than using SAX or DOM. I recently had to write a StAX parser for the following XML:

<?xml version="1.0"?>
<table>
    <row>
        <column>1</column>
        <column>Nome</column>
        <column>Sobrenome</column>
        <column>email@gmail.com</column>
        <column></column>
        <column>2011-06-22 03:02:14.915</column>
        <column>2011-06-22 03:02:25.953</column>
        <column></column>
        <column></column>
    </row>
</table>    

Here's what the final parser code looks like:

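// The Commons imports assume Apache Commons IO and Commons Lang 3; adjust the packages to the versions you use.
import java.io.ByteArrayInputStream;
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang3.StringEscapeUtils;
import org.apache.commons.lang3.text.WordUtils;

public class Parser {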

private String[] files ;

public Parser(String ... files) {
    this.files = files;
}

private List<Inscrito> process() {

    List<Inscrito> inscritos = new ArrayList<Inscrito>();


    for ( String file : files ) {

        XMLInputFactory factory = XMLInputFactory.newFactory();

        try {

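            // Assumes the files are UTF-8 (the XML declaration names no encoding); an explicit charset avoids platform-default surprises.
            String content = StringEscapeUtils.unescapeXml( FileUtils.readFileToString( new File(file), "UTF-8" ) );

            XMLStreamReader parser = factory.createXMLStreamReader( new ByteArrayInputStream( content.getBytes( "UTF-8" ) ) );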

            String currentTag = null;
            int columnCount = 0;
            Inscrito inscrito = null;           

            while ( parser.hasNext() ) {

                int currentEvent = parser.next();

                switch ( currentEvent ) {
                case XMLStreamReader.START_ELEMENT: 

                    currentTag = parser.getLocalName();

                    if ( "row".equals( currentTag ) ) {
                        columnCount = 0;
                        inscrito = new Inscrito();                      
                    }

                    break;
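                case XMLStreamReader.END_ELEMENT:

                    currentTag = parser.getLocalName();

                    if ( "row".equals( currentTag ) ) {
                        inscritos.add( inscrito );
                    }

                    if ( "column".equals( currentTag ) ) {
                        columnCount++;
                    }

                    // Clear the tag so the whitespace between elements is not handled as column text.
                    currentTag = null;

                    break;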
                case XMLStreamReader.CHARACTERS:

                    if ( "column".equals( currentTag ) ) {

                        String text = parser.getText().trim().replaceAll( "\n" , " "); 

                        switch( columnCount ) {
                        case 0:
                            inscrito.setId( Integer.valueOf( text ) );
                            break;
                        case 1:                         
                            inscrito.setFirstName( WordUtils.capitalizeFully( text ) );
                            break;
                        case 2:
                            inscrito.setLastName( WordUtils.capitalizeFully( text ) );
                            break;
                        case 3:
                            inscrito.setEmail( text );
                            break;
                        }

                    }

                    break;
                }

            }

            parser.close();

        } catch (Exception e) {
            throw new IllegalStateException(e);
        }           

    }

    Collections.sort(inscritos);

    return inscritos;

}

public Map<String,List<Inscrito>> parse() {

    List<Inscrito> inscritos = this.process();

    Map<String,List<Inscrito>> resultado = new LinkedHashMap<String, List<Inscrito>>();

    for ( Inscrito i : inscritos ) {

        List<Inscrito> lista = resultado.get( i.getInicial() );

        if ( lista == null ) {
            lista = new ArrayList<Inscrito>();
            resultado.put( i.getInicial(), lista );
        }

        lista.add( i );

    }

    return resultado;
}

}

The identifiers are in Portuguese, but the code should be straightforward to follow. Here's the repo on GitHub.


If you're only extracting a small number of nodes, consider using XPath, as this is somewhat simpler than walking through the whole document yourself.
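
To illustrate the XPath suggestion, here is a minimal sketch using the standard javax.xml.xpath API from the JDK. The class name XPathSketch, the method extract, and the sample path are made up for this example. Keep in mind that, as far as I know, this API parses the whole input into an in-memory tree before evaluating the expression, so it trades speed and memory for simplicity.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of any answer's code.
final class XPathSketch {

    // Evaluates an XPath expression against an XML string and returns the
    // string value of the first matching node (empty string if none matches).
    static String extract(String xml, String expression) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String xml = "<table><row><column>1</column><column>Nome</column></row></table>";
        // Second <column> of the first <row>
        System.out.println(extract(xml, "/table/row[1]/column[2]"));
    }
}
```

For a stream coming from the network, the InputSource could wrap the InputStream directly instead of a StringReader.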


Note: I'm the EclipseLink JAXB (MOXy) lead, and a member of the JAXB 2 (JSR-222) expert group.

StAX (JSR-173) is generally the fastest way to parse XML, and Woodstox is known for being a fast StAX parser. In addition to parsing, you need to collect the XML data. This is where a combination of StAX and JAXB comes in handy.

To make sure the JAXB implementation uses Woodstox as its StAX implementation, configure your environment to use Woodstox (this is as simple as adding Woodstox to your classpath). Then create an instance of XMLStreamReader and pass it as the source that JAXB should unmarshal.
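
The StAX half of that combination can be sketched with the JDK alone. The example below only shows how to advance an XMLStreamReader to the element of interest; the class and method names (StaxSkipSketch, advanceTo, firstText) are invented for illustration. With JAXB on the classpath, the positioned reader could be handed to Unmarshaller.unmarshal(XMLStreamReader, Class) instead of calling getElementText(); that step is left out here because it needs the JAXB dependency and an annotated class.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only.
final class StaxSkipSketch {

    // Advances the reader to the first start tag with the given local name.
    // Returns true if found; the reader is then positioned on that START_ELEMENT.
    static boolean advanceTo(XMLStreamReader reader, String localName) throws XMLStreamException {
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && localName.equals(reader.getLocalName())) {
                return true;
            }
        }
        return false;
    }

    // Returns the text of the first matching element, or null if there is none.
    // Assumes the element contains text only (getElementText() rejects child elements).
    static String firstText(String xml, String localName) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newFactory().createXMLStreamReader(new StringReader(xml));
        try {
            return advanceTo(reader, localName) ? reader.getElementText() : null;
        } finally {
            reader.close();
        }
    }
}
```

Because the method returns as soon as the element is found, the rest of the document is never parsed, which matters when only a few nodes near the start are needed.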


Either SAX or StAX could handle this, at the cost of some fiddly work to figure out when you have reached something you want. For extracting a small set of things by explicit path, though, you might be best off with XPath.

Another potential tactic is to first filter to only the parts you want using XSLT and then parse with anything you like, as the result of the filter will be a much smaller document.
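
A rough sketch of that filtering step, using the javax.xml.transform API that ships with the JDK. The stylesheet, the output element names (emails, email), and the class name XsltFilterSketch are assumptions made up for this example; it reuses the table/row/column layout from the earlier answer. Note that typical XSLT 1.0 processors build the full input tree before transforming, so this is not necessarily faster than the other options.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper for illustration only.
final class XsltFilterSketch {

    // Made-up stylesheet: keeps only the 4th <column> of each <row>.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<emails><xsl:for-each select='/table/row'>"
        + "<email><xsl:value-of select='column[4]'/></email>"
        + "</xsl:for-each></emails>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Runs the stylesheet over the input and returns the (much smaller) filtered document.
    static String filter(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

The filtered result can then be parsed with whatever API is most convenient, since its size no longer depends on the original document.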


I think you should use SAX or a parser built on SAX; I'd recommend Apache Digester. SAX is event-driven and does not store state, which is what you need here, since you only have to extract a small part of the document (I guess one tag).
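
Since SAX keeps no state for you, the handler has to track where it is. Here is a minimal sketch with the plain JDK SAX API (not Digester); the names SaxCollectSketch, Collector, and collect are invented for illustration, and the code assumes the wanted element contains text only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only.
final class SaxCollectSketch {

    // Collects the text of every element named `wanted` and ignores everything else.
    static final class Collector extends DefaultHandler {
        private final String wanted;
        private final List<String> values = new ArrayList<>();
        private StringBuilder buffer; // non-null only while inside a wanted element

        Collector(String wanted) {
            this.wanted = wanted;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (wanted.equals(qName)) {
                buffer = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one text node in several chunks, so append.
            if (buffer != null) {
                buffer.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (buffer != null && wanted.equals(qName)) {
                values.add(buffer.toString());
                buffer = null;
            }
        }
    }

    static List<String> collect(String xml, String wanted) throws Exception {
        Collector handler = new Collector(wanted);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.values;
    }
}
```

Unlike the pull-style StAX loop, the parser here still walks the entire document and calls the handler for every event; the handler just discards what it does not need.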
