I've a problem with SAX and Java.
I'm parsing the XML file of the dblp digital library database (which lists journals, conferences and papers). The XML file is very large (> 700MB).
However, my problem is that when the text between two tags contains entities, the characters() callback only gives me the part of the string that starts at the last entity found.
For example, the original author name held between the <author> tags is
Rüdiger Mecke
but the result (the String built from the arguments of the characters(ch[], start, length) method) is
üdiger Mecke
I would like to know:
- how to prevent the parser from automatically resolving entities?
- how to solve the truncated characters problem described above?
characters() is not guaranteed to return all of the characters in a single call. From the Javadoc:
The Parser will call this method to report each chunk of character data. SAX parsers may return all contiguous character data in a single chunk, or they may split it into several chunks.
You need to append the characters returned in all of the calls, something like:
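import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// The class name is only illustrative.
public class BufferingHandler extends DefaultHandler {
    private final StringBuilder tempValue = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        tempValue.setLength(0); // clear the buffer at the start of each element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        tempValue.append(ch, start, length); // append this chunk to the buffer
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        String value = tempValue.toString(); // the complete text of the element
    }
}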
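To try the buffering approach without the 700MB file, here is a small self-contained sketch. The class name AuthorTextDemo, the helper lastAuthor and the inline sample document (with uuml declared in an internal DTD subset) are made up for illustration and are not taken from dblp itself. It assumes the default JDK parser, which is not namespace-aware, so the element name arrives in qName.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class AuthorTextDemo {

    // Hypothetical helper: returns the text of the last <author> element in the XML.
    static String lastAuthor(String xml) {
        final StringBuilder buffer = new StringBuilder();
        final String[] result = new String[1];

        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                buffer.setLength(0); // start collecting text for the new element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element; keep every chunk.
                buffer.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("author".equals(qName)) {
                    result[0] = buffer.toString(); // all chunks joined
                }
            }
        };

        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        // Sample document invented for this sketch; the entity is declared inline.
        String xml =
            "<?xml version=\"1.0\"?>"
            + "<!DOCTYPE dblp [<!ENTITY uuml \"&#252;\">]>"
            + "<dblp><article><author>R&uuml;diger Mecke</author></article></dblp>";
        System.out.println(lastAuthor(xml));
    }
}
```

Because the value is only read in endElement(), it does not matter how many chunks the parser decides to deliver around the entity.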
I don't think you can turn off entity resolution.
The characters method can be called multiple times for a single tag, and you have to collect the characters across the multiple calls rather than expecting them all to arrive at once.