I have a Java SAX parser that downloads and parses XML using parse(new InputSource(conn.getInputStream())). Unfortunately, it sometimes fails on one site's XML with the error "XML or text declaration not at start of entity". Apparently the XML is malformed: the declaration has to come first, but this document has:
<!DOCTYPE ... stuff here ...>
<?xml ... stuff here ...?>
Unfortunately, there doesn't seem to be any way to ignore this error. I suppose I could download the entire XML, fix it with a regex or something similar, and then parse it, but that would lose the benefit of parsing while it's downloading. Is there a way to fix it while it's being parsed?
Easy solution: read the first line from the stream, consuming those bytes, and then pass the rest of the stream to the parser.
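A minimal sketch of that idea, assuming the misplaced DOCTYPE occupies exactly the first line and that the document does not need it to parse (skipping the line discards it). The class name SkipFirstLine and the helper skipLine are made up for illustration; in the real code the stream would come from conn.getInputStream() instead of the in-memory sample.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class SkipFirstLine {

    /** Consumes bytes up to and including the first '\n', or to end of stream. */
    static void skipLine(InputStream in) throws IOException {
        int b;
        do {
            b = in.read();
        } while (b != -1 && b != '\n');
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for conn.getInputStream(): DOCTYPE wrongly placed before the declaration.
        byte[] sample = "<!DOCTYPE root>\n<?xml version=\"1.0\"?>\n<root/>"
                .getBytes(StandardCharsets.UTF_8);

        // Buffer the stream so the byte-by-byte skip is cheap on a network connection.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(sample));

        skipLine(in); // drop the first line; the declaration is now at the start

        // The parser reads the remainder of the same stream, so parsing still
        // happens while the rest of the document is downloading.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(in), new DefaultHandler());
        System.out.println("parsed without error");
    }
}
```

If the DOCTYPE spans several lines, or the declaration sits on the same line as it, this simple skip is not enough and the prefix has to be inspected more carefully.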
Proper Java solution: create an intermediate stream interface that wraps any kind of stream and offers a SAX parser compatible stream in return. Then create a class implementing that interface specifically for your case.
That way, you can detect the problematic header before it ever reaches the SAX parser.
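One way such a wrapper might look, as a sketch rather than a definitive implementation. It buffers only a small prefix of the stream, moves a misplaced XML declaration to the front, and hands back a stream that continues with the untouched remainder, so the parser can still work while the download is in progress. The class name XmlDeclFixer, the method fix, and the 4096-byte prefix limit are assumptions made for this example, as is the requirement that the declaration is written as "<?xml " followed by a space and that the encoding is ASCII-compatible.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class XmlDeclFixer {

    // Assumed upper bound on how far into the document the declaration can appear.
    private static final int PREFIX_LIMIT = 4096;

    /** Returns a stream with the same content, but with the XML declaration moved first. */
    static InputStream fix(InputStream in) throws IOException {
        byte[] buf = new byte[PREFIX_LIMIT];
        int n = 0;
        // Read until the prefix buffer is full or the stream ends.
        while (n < buf.length) {
            int r = in.read(buf, n, buf.length - n);
            if (r == -1) break;
            n += r;
        }

        // ISO-8859-1 maps every byte to exactly one char, so the round trip is lossless.
        String head = new String(buf, 0, n, StandardCharsets.ISO_8859_1);
        int start = head.indexOf("<?xml ");
        int end = start < 0 ? -1 : head.indexOf("?>", start);
        if (start > 0 && end > 0) {
            end += 2; // include the closing "?>"
            head = head.substring(start, end)   // the declaration itself
                 + head.substring(0, start)     // whatever wrongly preceded it
                 + head.substring(end);         // the rest of the buffered prefix
        }

        byte[] fixed = head.getBytes(StandardCharsets.ISO_8859_1);
        // Serve the repaired prefix first, then continue with the original stream.
        return new SequenceInputStream(new ByteArrayInputStream(fixed), in);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for conn.getInputStream().
        byte[] sample = "<!DOCTYPE root>\n<?xml version=\"1.0\"?>\n<root/>"
                .getBytes(StandardCharsets.UTF_8);
        InputStream repaired = fix(new ByteArrayInputStream(sample));
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(repaired), new DefaultHandler());
        System.out.println("parsed without error");
    }
}
```

Unlike skipping the first line, this keeps the DOCTYPE, so the parser may still try to resolve any external DTD it references. A well-formed document passes through unchanged because the declaration is already at index 0.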
Edit: I would just use the Apache commons XML parser, or a DOM parser instead of SAX. Also, unless your XML is really long, there's not much difference in parsing it during or after the download.
Have a look at Jsoup. It can deal with malformed XML.