I need to process some HTML pages in my Android app, and I would prefer to use XPath to extract the relevant information. For regular J2SE there are many possible implementations for parsing regular HTML into an org.w3c.dom.Document:
- jTidy
- TagSoup
- Jericho
- NekoHTML
- HTMLCleaner
(List may be incomplete - it has been extracted from https://stackoverflow.com/questions/2009897/recommend-an-alternative-to-jtidy)
But it is very hard to estimate whether and how well those libraries work on Android (library size, CPU and memory consumption).
Based on your experience - what is the library of your choice for Android?
OK, it looks like no one can answer that question, so I have to check it myself.
jTidy
I downloaded the latest jTidy sources, compiled them, and added the resulting jar file as a library to my Android app. There were no problems using jTidy in my app (emulator and real phone). At runtime jTidy also works fine, but it does not seem to be a good fit for the limited Android environment: it is really slow. According to the Logcat output, even parsing a ~10 KB HTML file makes the garbage collector work heavily.
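For anyone who wants to compare the candidates in numbers rather than by watching Logcat, a rough measuring harness could look like the sketch below. It is not what I used for the observations above; the class `ParseBenchmark`, its `measure` method and the `Result` type are made-up names, and the task in `main` is only a stand-in (the stock XML parser on a tiny well-formed string) for whichever HTML library is under test. The heap figure is only a coarse indicator, because the garbage collector may run at any time during the measurement.

```java
import java.io.StringReader;
import java.util.concurrent.Callable;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

class ParseBenchmark {
    /** Hypothetical result holder: wall-clock time and rough change in used heap. */
    static final class Result {
        final long elapsedMillis;
        final long heapDeltaBytes;

        Result(long elapsedMillis, long heapDeltaBytes) {
            this.elapsedMillis = elapsedMillis;
            this.heapDeltaBytes = heapDeltaBytes;
        }
    }

    /** Runs the task the given number of times and reports total time and heap growth. */
    static Result measure(Callable<?> task, int runs) throws Exception {
        Runtime rt = Runtime.getRuntime();
        long heapBefore = rt.totalMemory() - rt.freeMemory();
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.call();
        }
        long elapsedMillis = (System.nanoTime() - start) / 1000000L;
        long heapAfter = rt.totalMemory() - rt.freeMemory();
        // The delta can be negative if a garbage collection happened in between.
        return new Result(elapsedMillis, heapAfter - heapBefore);
    }

    public static void main(String[] args) throws Exception {
        // Stand-in input; replace the task body with a call into the HTML parser under test.
        final String markup = "<html><body><p>Hello</p></body></html>";
        Result r = measure(() -> DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(markup))), 100);
        System.out.println("elapsed ms: " + r.elapsedMillis
                + ", heap delta bytes: " + r.heapDeltaBytes);
    }
}
```

Running the same task with the same input file for each library, on a real device rather than the emulator, should give at least a comparable order of magnitude.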
HTMLCleaner
In my experience HTMLCleaner also works nicely on Android; the library is relatively small (106 KB for v2.2). However, the parsed DOM it creates is not what I expected: HTMLCleaner inserts, for example, additional <span> elements into the DOM. This may be OK if you want to display the result as an HTML file, but for my use case - extracting information via XPath expressions - this is a no-go!
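To illustrate why extra elements are a problem for XPath, here is a minimal sketch that uses only the standard javax.xml.xpath API. The two sample documents are invented for the illustration: the second one wraps each link in a <span> at a position I chose myself, which is not necessarily where HTMLCleaner puts them. Both samples are well-formed, so the stock XML parser stands in for the HTML parser, and the class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSpanDemo {
    // Hypothetical page as the author of the XPath expression expects it.
    static final String EXPECTED =
        "<html><body>"
        + "<div class=\"item\"><a href=\"/a\">First</a></div>"
        + "<div class=\"item\"><a href=\"/b\">Second</a></div>"
        + "</body></html>";

    // Same page with an extra <span> around each link (position is an assumption).
    static final String WITH_SPAN =
        "<html><body>"
        + "<div class=\"item\"><span><a href=\"/a\">First</a></span></div>"
        + "<div class=\"item\"><span><a href=\"/b\">Second</a></span></div>"
        + "</body></html>";

    /** Parses well-formed markup with the stock XML parser (stand-in for an HTML parser). */
    static Document parseWellFormed(String markup) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(markup)));
    }

    /** Returns the text of every <a> that is a direct child of a div with class "item". */
    static List<String> linkTexts(Document doc) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            "//div[@class='item']/a", doc, XPathConstants.NODESET);
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(linkTexts(parseWellFormed(EXPECTED)));  // prints [First, Second]
        System.out.println(linkTexts(parseWellFormed(WITH_SPAN))); // prints []
    }
}
```

An expression written against the original page structure silently returns nothing once a wrapper element appears. Using the descendant axis (`//div[@class='item']//a`) would tolerate the wrapper, but then every expression has to be written defensively against whatever the parser decides to insert.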
TagSoup
Not tested
Jericho
Not tested
NekoHTML
Not tested
JSoup
Not tested