
How to convert the HTML source of a webpage into an org.w3c.dom.Document in Java?

Developer https://www.devze.com 2022-12-20 13:20 Source: web

How do I convert the HTML source of a webpage into an org.w3c.dom.Document in Java?


I suggest http://about.validator.nu/htmlparser/, which implements the HTML5 parsing algorithm. Firefox is in the process of replacing its own HTML parser with this one.


I have just been playing with JSoup, which is a fantastic Java HTML parser that works a little like jQuery. Really easy to use.


That's actually a fairly difficult thing to do robustly, because arbitrary HTML pages are sometimes malformed (the major browsers are fairly tolerant of this). You may want to look into the Swing HTML parser (javax.swing.text.html.parser), which I've never tried but which looks like it may be the best option. You could also try something along the lines of the following and handle any parsing exceptions that come up (although I've only ever tried this for XML):

import java.io.IOException;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.Document;
import org.xml.sax.SAXException;

...

try {
    DocumentBuilderFactory docBuilderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder docBuilder = docBuilderFactory.newDocumentBuilder();
    // InputStreamYouBuiltEarlierFromAnHTTPRequest is a placeholder for the response body stream
    Document doc = docBuilder.parse(InputStreamYouBuiltEarlierFromAnHTTPRequest);
}
catch (ParserConfigurationException e)
{
    // the factory could not create a parser with the requested configuration
    ...
}
catch (SAXException e)
{
    // the input is not well-formed XML, which can happen with real-world HTML
    ...
}
catch (IOException e)
{
    // the stream could not be read
    ...
}

...
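
For the Swing HTML parser mentioned above, here is a rough sketch of how its callbacks could be turned into an org.w3c.dom.Document using only the JDK. The class name SwingHtmlToDom and its parse method are made up for this illustration, and there are some assumptions baked in: the Swing parser is based on an old HTML 3.2 DTD, so tags it does not know are reported as "simple" tags and end up as empty elements here, and comments and script contents are dropped. Treat it as a starting point rather than a robust solution.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Enumeration;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;

import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper: builds a W3C DOM tree from Swing's HTML parser callbacks.
final class SwingHtmlToDom {

    static Document parse(Reader html) throws IOException, ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Deque<Node> open = new ArrayDeque<>(); // stack of currently open nodes
        open.push(doc);

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                Element element = element(tag, attrs);
                append(element);
                open.push(element);
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (open.size() > 1) { // never pop the Document itself
                    open.pop();
                }
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // End tags of unknown elements arrive here marked with ENDTAG; skip them.
                if (!attrs.isDefined(HTML.Attribute.ENDTAG)) {
                    append(element(tag, attrs));
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                append(doc.createTextNode(new String(data)));
            }

            private Element element(HTML.Tag tag, MutableAttributeSet attrs) {
                Element element = doc.createElement(tag.toString());
                Enumeration<?> names = attrs.getAttributeNames();
                while (names.hasMoreElements()) {
                    Object name = names.nextElement();
                    if (name == IMPLIED || name == HTML.Attribute.ENDTAG) {
                        continue; // parser bookkeeping, not a real attribute
                    }
                    try {
                        element.setAttribute(name.toString(), String.valueOf(attrs.getAttribute(name)));
                    } catch (DOMException e) {
                        // attribute name is not a legal XML name; leave it out
                    }
                }
                return element;
            }

            private void append(Node child) {
                Node parent = open.peek();
                // A Document may hold only one root element and no text nodes.
                boolean illegalAtTop = parent == doc
                        && (!(child instanceof Element) || doc.getDocumentElement() != null);
                if (!illegalAtTop) {
                    parent.appendChild(child);
                }
            }
        };

        new ParserDelegator().parse(html, callback, true); // true: ignore charset directives
        return doc;
    }
}
```

You would call it with something like SwingHtmlToDom.parse(new StringReader(htmlSource)) and then work with the result through the usual DOM API (getDocumentElement(), getElementsByTagName(...), and so on).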