I inherited a data store that used plain text files to save documents.
Each document had a few attributes (date, title, and text), which were encoded in the filename as <date>-<title>.txt, with the body of the file being the text.
In reality, though, documents in the system have many more attributes, and still more have been proposed.
It seemed logical to switch to an XML format, and I have done so, with each document now encoded in its own XML file.
However, reading the files in from XML is now ridiculously slow: 2000 articles in the .txt format took seconds to read, whereas 2000 articles in the .xml format take more than 10 minutes.
I was originally using a DOM parser; after I discovered how slow the reading was, I switched to a SAX parser, but it is still about as slow (somewhat faster, but still around 10 minutes).
Is XML just that slow, or am I doing something strange? Any thoughts would be appreciated.
The system is written in Java SE 1.6. The parser is created like this:
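/*
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;
*/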
SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser saxParser;
try {
saxParser = factory.newSAXParser();
ArticleSaxHandler handler = new ArticleSaxHandler();
saxParser.parse(is, handler);
return handler.getArticle();
} catch (ParserConfigurationException e) {
throw new IOException(e);
} catch (SAXException e) {
throw new IOException(e);
} finally {
if (is != null) {
try {
is.close();
} catch (IOException e) {
logger.error(e);
}
}
}
}
private class ArticleSaxHandler extends DefaultHandler {
private URI uri = null;
private String source = null;
private String author = null;
private DateTime articleDatetime = null;
private DateTime processedDatetime = null;
private String title = null;
private String text = null;
private ArticleElement currentElement;
private final StringBuilder builder = new StringBuilder();
public Article getArticle() {
return new Article(uri, source, author, articleDatetime, processedDatetime, title, text);
}
/** Receive notification of the start of an element. */
public void startElement(String uri, String localName, String qName, Attributes attributes) {
if (builder.length() != 0) {
throw new RuntimeException(new SAXParseException(currentElement + " was not finished before " + qName + " was started", null));
}
currentElement = ArticleElement.getElement(qName);
}
public void endElement(String uri, String localName, String qName) {
final String elementText = builder.toString();
builder.delete(0, builder.length());
if (currentElement == null) {
return;
}
switch (currentElement) {
case ARTICLE:
break;
case URI:
try {
this.uri = new URI(elementText);
} catch (URISyntaxException e) {
throw new RuntimeException(e);
}
break;
case SOURCE:
source = elementText;
break;
case AUTHOR:
author = elementText;
break;
case ARTICLE_DATE_TIME:
articleDatetime = getDateTimeFormatter().parseDateTime(elementText);
break;
case PROCESSED_DATE_TIME:
processedDatetime = getDateTimeFormatter().parseDateTime(elementText);
break;
case TITLE:
title = elementText;
break;
case TEXT:
this.text = elementText;
break;
default:
throw new IllegalStateException("Unexpected ArticleElement: " + currentElement);
}
currentElement = null;
}
/** Receive notification of character data inside an element. */
public void characters(char[] ch, int start, int length) {
builder.append(ch, start, length);
}
public void error(SAXParseException e) {
fatalError(e);
}
public void fatalError(SAXParseException e) {
logger.error("currentElement: " + currentElement + " ||builder: " + builder.toString() + "\n\n" + e.getMessage(), e);
}
}
private enum ArticleElement {
ARTICLE(ARTICLE_ELEMENT_NAME),
URI(URI_ELEMENT_NAME),
SOURCE(SOURCE_ELEMENT_NAME),
AUTHOR(AUTHOR_ELEMENT_NAME),
ARTICLE_DATE_TIME(ARTICLE_DATETIME_ELEMENT_NAME),
PROCESSED_DATE_TIME(PROCESSED_DATETIME_ELEMENT_NAME),
TITLE(TITLE_ELEMENT_NAME),
TEXT(TEXT_ELEMENT_NAME);
private String name;
private ArticleElement(String name) {
this.name = name;
}
public static ArticleElement getElement(String qName) {
for (ArticleElement element : ArticleElement.values()) {
if (element.name.equals(qName)) {
return element;
}
}
return null;
}
}
Reading data from an unbuffered stream could explain these performance problems. This is not directly related to the change from text to XML, but perhaps your new implementation no longer uses a BufferedInputStream.
Following that path, check specifically whether the input stream (the variable is) passed in this call is buffered:
saxParser.parse(is, handler);
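A minimal sketch of that check, assuming the stream arrives unbuffered: wrap it in a BufferedInputStream before handing it to the parser. The class BufferedSaxRead, the method parseBuffered, and the TextHandler stand-in for ArticleSaxHandler are hypothetical names used only for illustration. How much this helps depends on the parser implementation and on how the stream was opened.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class BufferedSaxRead {

    /** Hypothetical stand-in for ArticleSaxHandler: collects all character data. */
    static class TextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    /** Parses the stream through a buffer and returns the collected text. */
    static String parseBuffered(InputStream raw) throws IOException {
        // Wrap only if the caller has not already supplied a buffered stream.
        InputStream is = (raw instanceof BufferedInputStream)
                ? raw
                : new BufferedInputStream(raw);
        try {
            SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
            TextHandler handler = new TextHandler();
            saxParser.parse(is, handler);
            return handler.text.toString();
        } catch (ParserConfigurationException e) {
            throw new IOException(e);
        } catch (SAXException e) {
            throw new IOException(e);
        } finally {
            is.close(); // closing the wrapper also closes the underlying stream
        }
    }
}
```

In the code from the question, the equivalent change would be to wrap is once, at the point where the file is opened.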
I ran into this problem too, with slow loading using a SAX parser. The issue was actually related to my XML file, which has a DTD reference from the W3C:
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" >
<html xmlns="http://www.w3.org/TR/1999/REC-html-in-xml" xml:lang="en"
lang="en">
An excerpt from Chapter 2 of "Core Java, Volume II" about SAX and XML describes what's going on and also how to address it:
An XHTML file starts with a tag that contains a DTD reference, and the parser will want to load it. Understandably, the W3C isn’t too happy to serve billions of copies of files such as www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd. At one point, they refused altogether, but at the time of this writing, they serve the DTD at a glacial pace. If you don’t need to validate the document, just call
SAXParserFactory factory = SAXParserFactory.newInstance();
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
This fixed it for me. In addition, I used the IntelliJ IDE, which showed that my XML file had an extra (unnecessary) <HTML> tag and an extra <meta charset="UTF-8"/>. Removing them got rid of some SAX exceptions.
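For completeness, here is a small self-contained sketch of the setFeature approach quoted above. The class NoExternalDtd and its methods are hypothetical names. The feature URI is specific to Xerces-based parsers (including the one bundled with the JDK), so other implementations may reject it with an exception.

```java
import java.io.IOException;
import java.io.InputStream;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class NoExternalDtd {

    // Xerces-specific feature; assumed to be supported by the parser in use.
    static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    /** Creates a non-validating parser that does not fetch external DTDs. */
    static SAXParser newParser() throws ParserConfigurationException, SAXException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(false);
        factory.setFeature(LOAD_EXTERNAL_DTD, false);
        return factory.newSAXParser();
    }

    /** Parses the stream and returns the qualified name of the root element. */
    static String rootElementName(InputStream is) throws IOException {
        final String[] root = new String[1];
        try {
            newParser().parse(is, new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    if (root[0] == null) {
                        root[0] = qName; // remember only the first element seen
                    }
                }
            });
        } catch (ParserConfigurationException e) {
            throw new IOException(e);
        } catch (SAXException e) {
            throw new IOException(e);
        }
        return root[0];
    }
}
```

Note that with the external DTD skipped, entities declared only in that DTD (for example &nbsp; in XHTML) are no longer defined, so documents that rely on them may need an EntityResolver that serves a local copy instead.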