
InputStream handled by different objects depending on the content

Developer (开发者) https://www.devze.com 2023-03-21 18:34 Source: web
I am writing a crawler/parser that should be able to process different types of content: RSS, Atom and plain HTML files. To determine the correct parser, I wrote a class called ParseFactory, which takes a URL, tries to detect the content type, and returns the correct parser.

Unfortunately, checking the content type using the method provided by URLConnection doesn't always work. For example,

String contentType = url.openConnection().getContentType();

doesn't always provide the correct content type (e.g. "text/html" where it should be RSS), or doesn't make it possible to distinguish between RSS and Atom (e.g. "application/xml" could be either an Atom or an RSS feed). To solve this problem, I started looking for clues in the InputStream. The problem is that I am having trouble coming up with an elegant class design in which I download the InputStream only once. In my current design I wrote a separate class that first determines the correct content type; the ParseFactory then uses this information to create an instance of the corresponding parser, which in turn, when its 'parse()' method is called, downloads the entire InputStream a second time.

public Parser createParser(){

    InputStream inputStream = null;
    String contentType = null;
    String contentEncoding = null;

    ContentTypeParser contentTypeParser = new ContentTypeParser(this.url);
    Parser parser = null;

    try {

        inputStream = new BufferedInputStream(this.url.openStream());
        contentTypeParser.parse(inputStream);
        contentType = contentTypeParser.getContentType();
        contentEncoding = contentTypeParser.getContentEncoding();

        assert (contentType != null);

        inputStream = new BufferedInputStream(this.url.openStream());

        if (contentType.equals(ContentTypes.rss))
        {
            logger.info("RSS feed detected");
            parser = new RssParser(this.url);
            // parsing happens once, after the if/else chain
        }
        else if (contentType.equals(ContentTypes.atom))
        {
            logger.info("Atom feed detected");
            parser = new AtomParser(this.url);
        }
        else if (contentType.equals(ContentTypes.html))
        {
            logger.info("html detected");
            parser = new HtmlParser(this.url);
            parser.setContentEncoding(contentEncoding);
        }
        else if (contentType.equals(ContentTypes.UNKNOWN))
            logger.debug("Unable to recognize content type");

        if (parser != null)
            parser.parse(inputStream);

    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (inputStream != null) { // openStream() may have failed
            try {
                inputStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    return parser;

}

Basically, I am looking for a solution that allows me to eliminate the second "inputStream = new BufferedInputStream(this.url.openStream())".

Any help would be greatly appreciated!

Side note 1: Just for the sake of being complete, I also tried using the URLConnection.guessContentTypeFromStream(inputStream) method, but this returns null way too often.

Side note 2: The XML-parsers (Atom and Rss) are based on SAXParser, the Html-parser on Jsoup.


Can you just call mark and reset?

inputStream = new BufferedInputStream(this.url.openStream());
inputStream.mark(2048); // Or some other sensible number

contentTypeParser.parse(inputStream);
contentType = contentTypeParser.getContentType();
contentEncoding = contentTypeParser.getContentEncoding();

inputStream.reset(); // Let the parser have a crack at it now
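
For what it's worth, here is a minimal, self-contained sketch of the mark/reset idea. The class MarkResetSketch, the sniff() method and the crude substring checks are all made up for illustration; sniff() only stands in for whatever ContentTypeParser.parse() really does. The one thing to watch is that detection must not read more bytes than the limit passed to mark(), otherwise reset() may fail.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical demo class, not part of the asker's code base.
class MarkResetSketch {

    // Assumed upper bound on how many bytes detection needs to look at.
    static final int SNIFF_LIMIT = 2048;

    // Peeks at up to SNIFF_LIMIT bytes, guesses the type, then rewinds the stream.
    static String sniff(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(SNIFF_LIMIT);
        byte[] head = new byte[SNIFF_LIMIT];
        int total = 0;
        int n;
        while (total < head.length
                && (n = in.read(head, total, head.length - total)) != -1) {
            total += n;
        }
        in.reset(); // the stream is back at its first byte

        // Very crude heuristics, for illustration only.
        String text = new String(head, 0, total, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        if (text.contains("<rss")) return "rss";
        if (text.contains("<feed")) return "atom";
        if (text.contains("<html")) return "html";
        return "unknown";
    }

    // Sniffs an in-memory document, then checks that every byte is still readable.
    static String demo(String document) {
        byte[] bytes = document.getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new BufferedInputStream(new ByteArrayInputStream(bytes))) {
            String type = sniff(in);
            int remaining = 0;
            while (in.read() != -1) {
                remaining++;
            }
            return type + ":" + (remaining == bytes.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the real factory the BufferedInputStream would wrap this.url.openStream() instead of a ByteArrayInputStream, and the same stream object would be handed to the parser after the reset.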


Perhaps your ContentTypeParser should cache the content internally and feed it to the appropriate ContentParser instead of reacquiring the data from the InputStream.
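
A rough sketch of what that caching could look like, assuming the documents are small enough to hold in memory. CachedContentSketch, readFully() and countBytes() are hypothetical names; countBytes() is only a stand-in for a real consumer such as the type detection or a parser.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the asker's code base.
class CachedContentSketch {

    // Drains the stream once into a byte array (one download only).
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toByteArray();
    }

    // Stand-in for any consumer of the content: it just counts the bytes it sees.
    static int countBytes(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    // Reads the source once, then serves two independent streams from the cache.
    static String demo(String document) {
        try {
            InputStream source =
                    new ByteArrayInputStream(document.getBytes(StandardCharsets.UTF_8));
            byte[] cached = readFully(source);

            // First pass: e.g. content-type detection.
            int first = countBytes(new ByteArrayInputStream(cached));
            // Second pass: e.g. the actual parser, starting again from byte 0.
            int second = countBytes(new ByteArrayInputStream(cached));
            return first + "," + second;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The trade-off compared with mark/reset is memory: the whole document is held in a byte array, but detection can read as far as it likes and any number of consumers can get a fresh stream.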
