NekoHTML SAX fragment parsing

Developer https://www.devze.com 2023-04-02 09:51 Source: web

I'm trying to parse a simple fragment of HTML with NekoHTML:

<h1>This is a basic test</h1>

To do so, I enabled the NekoHTML feature that is supposed to stop the parser from firing startElement(..) callbacks for the HTML, HEAD and BODY tags.

Unfortunately, it doesn't work for me. I must have missed something, but I can't figure out what.

Here is a minimal piece of code that reproduces the problem:

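 // Needs: org.xml.sax.Attributes, org.xml.sax.SAXException,
 //        org.xml.sax.helpers.DefaultHandler
 // Extends DefaultHandler so that only the callbacks of interest are overridden;
 // implementing ContentHandler directly would require every method of the interface.
 public static class MyContentHandler extends DefaultHandler {

     // Print the text content of the current node
     @Override
     public void characters(char[] ch, int start, int length) throws SAXException {
         String text = String.valueOf(ch, start, length);
         System.out.println(text);
     }

     // Print the qualified name of each opening tag
     @Override
     public void startElement(String nameSpaceURI, String localName, String rawName, Attributes attributes) throws SAXException {
         System.out.println(rawName);
     }

     // Print the local name of each closing tag
     @Override
     public void endElement(String nameSpaceURI, String localName, String rawName) throws SAXException {
         System.out.println("end " + localName);
     }
 }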

And the main() method that runs the test:

  public static void main(String[] args) throws SAXException, IOException {
       SAXParser saxReader = new SAXParser();
       // enable the fragment feature as explained in the documentation: http://nekohtml.sourceforge.net/faq.html#fragments
       saxReader.setFeature("http://cyberneko.org/html/features/balance-tags/document-fragment", true);
       saxReader.setContentHandler(new MyContentHandler());
       saxReader.parse(new InputSource(new StringInputStream("<h1>This is a basic test</h1>")));
  }

The corresponding output :

HTML
HEAD
end HEAD
BODY
H1
This is a basic test
end H1
end BODY
end HTML

whereas I was expecting

H1
This is a basic test
end H1

Any ideas?


I finally got it!

I was parsing my HTML string in a GWT application to which I had added the gwt-dev.jar dependency. This jar bundles a lot of external libraries, including xercesImpl, and the version of the embedded Xerces classes does not match the one required by NekoHTML.

As a (strange) result, the NekoHTML SAX parser appears to ignore any custom feature when it runs against the Xerces version embedded in gwt-dev.

So I had to rework some code to remove the gwt-dev dependency, which, by the way, is not recommended as a dependency of a standard GWT project.
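
If you suspect a similar conflict, one way to check is to print which jar a class was actually loaded from. The sketch below uses only standard JDK calls. The class name WhichJar and the helper locate() are made up for illustration, and the two class names checked in main() are my guess at the relevant Xerces and NekoHTML entry points, so adjust them to whatever your classpath contains.

```java
import java.security.CodeSource;

public class WhichJar {

    // Describes where the named class was loaded from (hypothetical helper).
    static String locate(String className) {
        try {
            Class<?> c = Class.forName(className);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // Classes from the JDK runtime itself typically have no code source
            return src == null ? "(bootstrap / JDK runtime)" : String.valueOf(src.getLocation());
        } catch (ClassNotFoundException | LinkageError e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // Assumed class names: the Xerces base parser and the NekoHTML SAX parser
        String[] names = {
            "org.apache.xerces.parsers.AbstractSAXParser",
            "org.cyberneko.html.parsers.SAXParser"
        };
        for (String name : names) {
            System.out.println(name + " -> " + locate(name));
        }
    }
}
```

If the Xerces class resolves to gwt-dev.jar rather than to the xercesImpl jar you expected, that would be consistent with the behaviour described above.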
