How can I get the content of the HTML <body>?

开发者 (Developer) https://www.devze.com 2022-12-14 14:28 Source: the web

When I have this HTML:

<html>
<head>
</head>
<body>
 text
  <div>
  text2
    <div>
    text3
    </div>
  </div>
</body>
</html>

How can I get the content of the body, i.e. text <div> text2 <div> text3 </div> </div>, with a DOM parser in Java? The getTextContent method returns text text2 text3, so the tags are missing.

It is possible with SAX, but is it possible with DOM, too?

getTextContent is behaving as I would expect: it returns the textual content of the HTML fragment. Can you check the API docs for the DOM parser and see if there's a similar method with a name like getHtmlContent?
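To make the behaviour described in the question concrete, here is a minimal sketch that parses a well-formed document with the JDK's built-in DOM parser and prints what getTextContent gives for the body. The class name TextContentDemo and the helpers parse and bodyText are made up for illustration, and the sketch assumes the input is well-formed XML (as the question's HTML happens to be).

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo class; the names are not from the original post.
class TextContentDemo {
    // Parses a well-formed XML/XHTML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    // Returns the text of the first <body> element, with runs of
    // whitespace collapsed to single spaces for easier reading.
    static String bodyText(String xml) {
        Node body = parse(xml).getElementsByTagName("body").item(0);
        return body.getTextContent().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body> text <div> text2 "
                + "<div> text3 </div> </div></body></html>";
        // Only the character data of the descendants survives; tags are dropped.
        System.out.println(bodyText(html)); // text text2 text3
    }
}
```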

You would need to parse the document into a DOM and serialise only the portion of the DOM you want. Using the DOM Level 3 LS interfaces you can serialise the outer XML of a single node with:

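// `implementation` is assumed to be a DOMImplementationLS, e.g. obtained via
// (DOMImplementationLS) document.getImplementation().getFeature("LS", "3.0")
LSSerializer serializer = implementation.createLSSerializer();
String html = serializer.writeToString(node);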

To get the inner XML you would need to call writeToString on each child node in turn (e.g. appending the results to a StringBuffer).
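Put together, the child-by-child approach might look like the sketch below. It assumes the JDK's default DOM implementation (which supports the "LS" feature) and well-formed input; the class InnerXmlSketch and the helpers parse and innerXml are hypothetical names. The xml-declaration parameter is switched off so that serialised fragments are not prefixed with an XML declaration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSSerializer;
import org.xml.sax.InputSource;

// Hypothetical sketch class; the names are not from the original answer.
class InnerXmlSketch {
    // Parses a well-formed XML/XHTML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    // Serialises each child of `parent` in document order and concatenates
    // the results, i.e. the markup between parent's start and end tags.
    static String innerXml(Node parent) {
        Document doc = (parent instanceof Document)
                ? (Document) parent
                : parent.getOwnerDocument();
        // getFeature returns null if the implementation lacks LS support.
        DOMImplementationLS ls =
                (DOMImplementationLS) doc.getImplementation().getFeature("LS", "3.0");
        if (ls == null) {
            throw new IllegalStateException("DOM Level 3 LS is not supported");
        }
        LSSerializer serializer = ls.createLSSerializer();
        // Suppress the <?xml ...?> declaration on the output.
        serializer.getDomConfig().setParameter("xml-declaration", Boolean.FALSE);

        StringBuilder sb = new StringBuilder();
        for (Node child = parent.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            sb.append(serializer.writeToString(child));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<html><head></head><body> text <div> text2 "
                + "<div> text3 </div> </div></body></html>";
        Node body = parse(html).getElementsByTagName("body").item(0);
        // Prints the body's content with the nested <div> tags preserved.
        System.out.println(innerXml(body));
    }
}
```

A StringBuilder is used here instead of StringBuffer only because no synchronisation is needed; either works.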

Depending on which DOM implementation you are using, there may be alternative non-standard methods. There may also be risks in serialising HTML as XML, if that's what you're doing. For example, a standard XML serialiser may output a self-closing tag for an empty element, which can confuse browsers parsing the output as legacy HTML.
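One possible way around the self-closing-tag risk, which is not part of the answer above, is to serialise through a JAXP identity Transformer with its output method set to html, so that an empty element such as a div is written with an explicit end tag. The sketch below assumes the JDK's bundled transformer is in use; the class HtmlOutputSketch and the helpers parse and toHtml are hypothetical names.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch class; the names are not from the original answer.
class HtmlOutputSketch {
    // Parses a well-formed XML/XHTML string into a DOM Document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    // Serialises `node` and its subtree using the XSLT "html" output method.
    static String toHtml(Node node) {
        try {
            // newTransformer() without a stylesheet gives an identity transform.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "html");
            t.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialise node", e);
        }
    }

    public static void main(String[] args) {
        // The inner <div/> is empty, so an XML serialiser may keep it self-closed.
        Document doc = parse("<body><div/></body>");
        Node emptyDiv = doc.getDocumentElement().getFirstChild();
        System.out.println(toHtml(emptyDiv));
    }
}
```

The inner-content loop shown for LSSerializer applies here as well: call toHtml on each child of the body and concatenate the results.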