I am writing a crawler/parser that should be able to process different types of content: RSS, Atom, and plain HTML files. To determine the correct parser, I wrote a class called ParseFactory, which takes a URL, tries to detect the content type, and returns the correct parser.
Unfortunately, checking the content type using the method provided by URLConnection doesn't always work. For example,
String contentType = url.openConnection().getContentType();
doesn't always provide the correct content type (e.g. "text/html" where it should be RSS), or doesn't make it possible to distinguish between RSS and Atom (e.g. "application/xml" could be either an Atom or an RSS feed). To solve this problem, I started looking for clues in the InputStream. The problem is that I am having trouble coming up with an elegant class design in which I need to download the InputStream only once. In my current design, I first wrote a separate class that determines the correct content type. Next, the ParseFactory uses this information to create an instance of the corresponding parser, which in turn, when its parse() method is called, downloads the entire InputStream a second time.
public Parser createParser() {
    InputStream inputStream = null;
    String contentType = null;
    String contentEncoding = null;
    ContentTypeParser contentTypeParser = new ContentTypeParser(this.url);
    Parser parser = null;
    try {
        // First download: only used to detect the content type.
        inputStream = new BufferedInputStream(this.url.openStream());
        contentTypeParser.parse(inputStream);
        contentType = contentTypeParser.getContentType();
        contentEncoding = contentTypeParser.getContentEncoding();
        assert (contentType != null);
        // Close the first stream before it is replaced, so it is not leaked.
        inputStream.close();
        // Second download: this is the one I would like to get rid of.
        inputStream = new BufferedInputStream(this.url.openStream());
        if (contentType.equals(ContentTypes.rss)) {
            logger.info("RSS feed detected");
            parser = new RssParser(this.url);
        } else if (contentType.equals(ContentTypes.atom)) {
            logger.info("Atom feed detected");
            parser = new AtomParser(this.url);
        } else if (contentType.equals(ContentTypes.html)) {
            logger.info("html detected");
            parser = new HtmlParser(this.url);
            parser.setContentEncoding(contentEncoding);
        } else if (contentType.equals(ContentTypes.UNKNOWN)) {
            logger.debug("Unable to recognize content type");
        }
        // Parse exactly once, whichever parser was selected.
        if (parser != null) {
            parser.parse(inputStream);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // openStream() may have failed, leaving inputStream null.
        if (inputStream != null) {
            try {
                inputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    return parser;
}
Basically, I am looking for a solution that allows me to eliminate the second "inputStream = new BufferedInputStream(this.url.openStream())".
Any help would be greatly appreciated!
Side note 1: Just for the sake of being complete, I also tried using the URLConnection.guessContentTypeFromStream(inputStream) method, but this returns null way too often.
Side note 2: The XML-parsers (Atom and Rss) are based on SAXParser, the Html-parser on Jsoup.
Can you just call mark() and reset()?
inputStream = new BufferedInputStream(this.url.openStream());
inputStream.mark(2048); // Read limit: reset() is only guaranteed to work if detection reads at most this many bytes
contentTypeParser.parse(inputStream);
contentType = contentTypeParser.getContentType();
contentEncoding = contentTypeParser.getContentEncoding();
inputStream.reset(); // Rewind to the mark so the parser can read from the start
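To make the mark/reset idea concrete, here is a minimal, self-contained sketch. The class name MarkResetSniffer, the SNIFF_LIMIT constant, and the naive substring checks are all assumptions made for illustration; they are not the asker's ContentTypeParser, and real feeds (RSS 1.0/RDF, for example) would need better detection. The point is only that the detection step reads no more than the mark limit, so the same stream can be handed to the parser afterwards.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

class MarkResetSniffer {
    // Hypothetical limit on how many bytes the detection step may inspect.
    static final int SNIFF_LIMIT = 2048;

    /** Peeks at the head of the stream, then rewinds it to where it started. */
    static String sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(SNIFF_LIMIT);
        byte[] head = new byte[SNIFF_LIMIT];
        int total = 0;
        int n;
        // Read up to SNIFF_LIMIT bytes, never more, so the mark stays valid.
        while (total < head.length
                && (n = in.read(head, total, head.length - total)) != -1) {
            total += n;
        }
        in.reset(); // the next reader starts again at the marked position

        // Naive heuristic, for illustration only.
        String text = new String(head, 0, total, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        if (text.contains("<rss")) return "rss";
        if (text.contains("<feed")) return "atom";
        if (text.contains("<html")) return "html";
        return "unknown";
    }

    public static void main(String[] args) throws IOException {
        byte[] doc = "<?xml version=\"1.0\"?><rss version=\"2.0\"><channel/></rss>"
                .getBytes(StandardCharsets.UTF_8);
        // In the crawler this would wrap url.openStream() instead.
        InputStream in = new BufferedInputStream(new ByteArrayInputStream(doc));
        System.out.println(sniff(in));        // prints "rss"
        System.out.println((char) in.read()); // prints "<", the stream was rewound
    }
}
```

With this shape, the factory would open the URL once, call the detection step, and then pass the very same stream to the selected parser.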
Perhaps your ContentTypeParser should cache the content internally and feed it to the appropriate ContentParser instead of reacquiring the data from the InputStream.
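One way to read the caching suggestion is sketched below. CachedContent is a hypothetical helper, not part of the asker's code: it downloads the stream once into a byte array and then hands out as many independent in-memory streams as needed, one for detection and one for the chosen parser. This assumes the documents are small enough to hold in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class CachedContent {
    private final byte[] bytes;

    private CachedContent(byte[] bytes) {
        this.bytes = bytes;
    }

    /** Drains the given stream once and keeps its bytes in memory. */
    static CachedContent readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new CachedContent(out.toByteArray());
    }

    /** Returns a fresh stream over the cached bytes, starting at byte 0. */
    InputStream newStream() {
        return new ByteArrayInputStream(bytes);
    }

    int size() {
        return bytes.length;
    }

    public static void main(String[] args) throws IOException {
        byte[] page = "<html><body>hi</body></html>".getBytes(StandardCharsets.UTF_8);
        // In the crawler this would be url.openStream(), read exactly once.
        CachedContent content = CachedContent.readFully(new ByteArrayInputStream(page));
        InputStream forDetection = content.newStream();
        InputStream forParser = content.newStream();
        forDetection.read();
        forDetection.read();
        // The second stream is unaffected by reads on the first one.
        System.out.println((char) forParser.read()); // prints "<"
    }
}
```

Compared with mark/reset, this avoids having to guess a read limit for the detection step, at the cost of buffering the whole document.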