I am using the HtmlCleaner library for HTML content extraction. It works fairly well, but with a few limitations.
It cannot handle special characters such as £ or quotation marks. For example, for the URL http://www.basicelegancefurnishings.co.uk/alaska-3-and-2-seater-sofa-setspan-classukmadespan-p-280.html, evaluating an XPath for the price gives me "&pound;" in place of £.
Is there a property I can set in HtmlCleaner to handle this, or is there another solution?
Thanks
Jitendra
No, I don't believe HtmlCleaner can do this. However, you can use Apache Commons StringEscapeUtils to "unescape" the HTML, like this:
StringEscapeUtils.unescapeHtml("&pound;679.00");
will produce £679.00.
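If it helps to see what "unescaping" amounts to, here is a minimal plain-Java sketch. It is not the Commons implementation: the class name EntitySketch, the method unescapeNamed, and the four-entry entity table are made up for illustration, and a real library knows the full HTML entity set.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntitySketch {
    // Hypothetical, deliberately tiny table; only these four names are handled.
    private static final Map<String, String> NAMED = Map.of(
        "pound", "\u00A3",   // pound sign
        "quot", "\"",
        "amp", "&",
        "nbsp", "\u00A0");   // non-breaking space

    // Matches named references such as &pound; or &quot;
    private static final Pattern ENTITY = Pattern.compile("&([A-Za-z]+);");

    static String unescapeNamed(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String replacement = NAMED.get(m.group(1));
            // Names missing from the table are left exactly as they appeared.
            m.appendReplacement(out, Matcher.quoteReplacement(
                replacement != null ? replacement : m.group()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the price with a real pound sign instead of the entity.
        System.out.println(unescapeNamed("&pound;679.00"));
    }
}
```

The input is scanned only once, so text that was escaped twice (for example &amp;pound;) is unescaped by one level only.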
Instead of HtmlCleaner, I would recommend you try JSoup.
The version of HtmlCleaner I am using is 2.2, and org.htmlcleaner.CleanerProperties - setTransSpecialEntitiesToNCR(true)
works for me. However, I still have to call string.replace(" ", " ")
to get the HTML content completely right.
This can now be done through org.htmlcleaner.CleanerProperties - setTransSpecialEntitiesToNCR(true).
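As far as I understand it, with this property set the special entities come out in numeric form (for example &#163; for the pound sign) rather than as literal characters. If you need the characters themselves in plain text, a rough sketch for decoding such references could look like this. NcrSketch and decodeNcr are hypothetical names, not part of HtmlCleaner.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NcrSketch {
    // Matches hexadecimal (&#xA3;) and decimal (&#163;) character references.
    private static final Pattern NCR =
        Pattern.compile("&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));");

    static String decodeNcr(String text) {
        Matcher m = NCR.matcher(text);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String decoded;
            try {
                int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2), 10);
                // Out-of-range values are left as they appeared in the input.
                decoded = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            } catch (NumberFormatException e) {
                decoded = m.group(); // too many digits to be a code point
            }
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decodeNcr("&#163;679.00"));
    }
}
```

For example, decodeNcr("&#163;679.00") gives £679.00. Non-breaking spaces written as &#160; decode to the character U+00A0, which can then be swapped for an ordinary space with replace('\u00A0', ' ').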