I'm using tagsoup to clean some HTML I'm scraping from the internet, and I'm getting the following error when parsing through pages with comments:
The data "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " is not legal for a JDOM comment: Comment data cannot start with a hyphen.
I'm using JDOM 1.1, and here's the code that does the actual cleaning:
// Use TagSoup as the underlying SAX parser so that malformed HTML can still be read.
SAXBuilder builder = new org.jdom.input.SAXBuilder("org.ccil.cowan.tagsoup.Parser");
// Don't resolve the doctype! At our usage rate, we'd get 503 responses
// from the W3C servers.
builder.setEntityResolver(dummyEntityResolver);
Reader in = new StringReader(str);
org.jdom.Document doc = builder.build(in);
String cleanXmlDoc = new org.jdom.output.XMLOutputter().outputString(doc);
Any idea what's going wrong, or how to fix this? I need to be able to parse pages with long comment strings of <!--------- data ------------>
An XML/HTML/SGML comment begins with --, ends with --, and does not contain --. A comment declaration (the part between <! and >) contains zero or more comments.
Your example string can be reformatted as:
<!----
----
- data
----
----
---->
As you can see, the leftover "- data" is not a valid comment, and therefore the document is not valid HTML. In your specific case you can probably fix it by replacing every match of the regular expression /<!--.*?-->/ with the empty string before parsing, but be aware that this change might also break some valid documents.
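The regular-expression replacement suggested above could be sketched in Java roughly as follows. This is only a minimal sketch: the class name CommentStripper and the method stripComments are made up for illustration, and the pattern <!--.*?--> is assumed to be what the answer intends. Pattern.DOTALL is added so that comments spanning several lines are removed as well.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of JDOM or TagSoup): removes anything that
// looks like <!-- ... --> from the raw HTML before it is handed to the parser.
public final class CommentStripper {

    // Reluctant match from "<!--" to the nearest "-->".
    // DOTALL lets '.' match line terminators, so multi-line comments are covered.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    private CommentStripper() {
        // static utility, no instances
    }

    public static String stripComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }
}
```

With the code from the question, that would mean building the reader as new StringReader(CommentStripper.stripComments(str)). The comments are dropped entirely, so JDOM never sees the illegal comment data. As noted above, this is a blunt tool: it also removes comment-like text inside script or style blocks, which some pages rely on.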