JDOM 1.1: hyphen is not a valid comment character

Developer https://www.devze.com 2022-12-26 04:35 Source: Internet
I'm using tagsoup to clean some HTML I'm scraping from the internet, and I'm getting the following error when parsing through pages with comments:

The data "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - " is not legal for a JDOM comment: Comment data cannot start with a hyphen.

I'm using JDOM 1.1, and here's the code that does the actual cleaning:

    SAXBuilder builder = new org.jdom.input.SAXBuilder("org.ccil.cowan.tagsoup.Parser"); // build
    // Don't check the doctype! At our usage rate, we'll get 503 responses
    // from the w3.
    builder.setEntityResolver(dummyEntityResolver);
    Reader in = new StringReader(str);
    org.jdom.Document doc = builder.build(in);
    String cleanXmlDoc = new org.jdom.output.XMLOutputter().outputString(doc);

Any idea what's going wrong, or how to fix this? I need to be able to parse pages with long comment strings of <!--------- data ------------>


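In SGML, and therefore in HTML 4, a comment begins with --, ends with --, and does not contain --. A comment declaration (<! ... >) contains zero or more such comments. XML is stricter still: -- may not appear anywhere inside a comment.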

Your example string can be reformatted as:

<!----
  ----
  - data
  ----
  ----
  ---->

As you can see, - data is not a valid comment, so the document is not valid HTML. In your specific case you can probably fix it by replacing every match of the regular expression /<!--.*?-->/ with the empty string before parsing, but be aware that this change might also break some valid documents.
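
A minimal sketch of that workaround in plain Java, assuming it is acceptable to throw every comment away before parsing. The class name CommentStripper and the method stripComments are hypothetical names used only for illustration; they are not part of JDOM or TagSoup. The DOTALL flag is an addition so that comments spanning several lines are matched as well.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of JDOM or TagSoup.
final class CommentStripper {
    // Lazy match from "<!--" to the first following "-->".
    // DOTALL lets '.' match line terminators, so multi-line comments are covered.
    private static final Pattern COMMENT =
            Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    private CommentStripper() {}

    /** Returns the input with every "<!-- ... -->" span removed. */
    static String stripComments(String html) {
        return COMMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>a</p><!--------- data ------------><p>b</p>";
        System.out.println(stripComments(html));
    }
}
```

In the code from the question, this would be applied to str before it is wrapped in the StringReader. As noted above, it is a blunt tool: it will also remove anything that merely looks like a comment, for example inside a script block.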
