I need to parse a large number of HTML pages (100+) for specific content: a few lines of text that are almost the same.
I have tried Scanner objects with regular expressions, and jsoup with its HTML parser.
Both methods are slow, and with jsoup I get the following error: java.net.SocketTimeoutException: Read timed out (this happens on multiple computers with different connections).
Is there anything better?
EDIT:
Now that I've gotten jsoup to work, I think a better question is: how do I speed it up?
Did you try lengthening the timeout on JSoup? It's only 3 seconds by default, I believe. See e.g. this.
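For what it's worth, here is a minimal sketch of setting the timeout explicitly, assuming jsoup is on the classpath. The class name, the helper methods, the URL, and the 10-second value are placeholders I made up for illustration; only Jsoup.connect, timeout, get, and select are jsoup's own calls. The default timeout has differed between jsoup releases, so setting it yourself avoids depending on it.

```java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupTimeoutSketch {

    // Fetches a page with an explicit request timeout, given in milliseconds.
    static Document fetch(String url, int timeoutMillis) throws IOException {
        return Jsoup.connect(url)
                .timeout(timeoutMillis)
                .get();
    }

    // Returns the text of the first element matching the CSS selector,
    // or an empty string when nothing matches.
    static String extract(Document doc, String cssQuery) {
        Element first = doc.select(cssQuery).first();
        return first == null ? "" : first.text();
    }

    public static void main(String[] args) throws IOException {
        // Placeholder URL and selector; substitute your own pages and query.
        Document doc = fetch("https://example.com/", 10_000);
        System.out.println(extract(doc, "h1"));
    }
}
```

If the read still times out at 10 seconds, the server may simply be slow, and a larger value or a retry is probably the next thing to try.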
I would suggest Nutch, an open-source web-search solution that includes support for HTML parsing. It's a very mature project. It uses Lucene under the hood, and I find it to be a very reliable crawler.
A great skill to learn would be XPath. It would be perfect for that job! I just started learning it myself for automation testing. If you have questions, shoot me a message. I'd be glad to help you out, even though I'm not an expert.
Here's a nice link since you are interested in Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html
XPath is also a good thing to know when you're not using Java, which is why I would choose that route.
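To give an idea of what this looks like in Java, here is a small sketch using only the javax.xml.xpath API that ships with the JDK. The class name, the sample markup, and the "price" class are made up for illustration. One caveat: the standard parser expects well-formed XML, so real-world HTML would probably need to be cleaned up into XHTML first, which is not shown here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Parses well-formed markup and returns the string value of the first
    // node matching the XPath expression (empty string if nothing matches).
    static String evaluate(String xml, String expression) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, doc);
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical page fragment; it must be well-formed for this parser.
        String page = "<html><body>"
                + "<div class=\"title\">Widget</div>"
                + "<div class=\"price\">19.99</div>"
                + "</body></html>";
        System.out.println(evaluate(page, "//div[@class='price']")); // prints 19.99
    }
}
```

The same expression works in other languages and tools that support XPath, which is part of the appeal.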