I am looking for a simple, lightweight Java library that parses HTML. I have searched a lot and there are many options out there, but I cannot find anything simple. I would really like something like pyquery in Python, but for Java. My requirements are: fast, easy to use, and lightweight.
What do I need it for? Not sure if this matters, but I need to index parts of an HTML document. So I am hoping to be able to select part of that document quickly and then parse it.
I have used HTMLParser in the past and wasn't very happy with it. I found TagSoup and jsoup, and I really like jsoup. I haven't used it extensively yet, but you can do something like:
Elements resultLinks = doc.select("h3 > a"); // <a> elements that are direct children of an <h3>
Try Groovy. It has a number of "slurpers," which are DSLs for reading in markup like XML and HTML, as well as JSON. Here, for example.
Use TagSoup to normalize the HTML into XHTML, and XOM to parse the resulting document. It's not that hard.
XPath will give you easy selection similar to CSS selectors.
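To give a feel for the XPath side of this, here is a rough sketch. It assumes the HTML has already been normalized into well-formed XHTML (for instance by TagSoup), and it swaps in the JDK's built-in DOM parser and javax.xml.xpath instead of XOM, purely so it runs without extra jars. The class name XPathSelectExample, the method selectHrefs, and the sample markup are made up for illustration. The sample carries no XHTML namespace declaration; real normalized output usually does, and may need extra handling.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of any library mentioned in this thread.
public class XPathSelectExample {

    // Returns the href of every <a> that is a direct child of an <h3>,
    // roughly the XPath counterpart of the CSS selector "h3 > a".
    static List<String> selectHrefs(String xhtml) throws Exception {
        // Assumption: the input is already well-formed XHTML.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate("//h3/a", doc, XPathConstants.NODESET);

        List<String> hrefs = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            hrefs.add(((Element) nodes.item(i)).getAttribute("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample document; the middle link is nested in a <span>,
        // so it is not a direct child of the <h3> and should be skipped.
        String xhtml = "<html><body>"
                + "<h3><a href=\"/first\">First</a></h3>"
                + "<h3><span><a href=\"/nested\">Nested</a></span></h3>"
                + "<h3><a href=\"/second\">Second</a></h3>"
                + "</body></html>";
        System.out.println(selectHrefs(xhtml));
    }
}
```

The expression //h3/a plays the same role as "h3 > a" in a CSS-style selector, so moving between the two approaches is mostly a matter of syntax.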
Look at Jerry, which looks very promising: http://jodd.org/doc/jerry/