I am using jsoup and I have HTML that looks something like this:
<html><head><style type="text/css">
</style></head>
<body><div style="font-family:times new roman,new york,times,serif;font-size:14pt">first text<br><div><br></div><div style="font-family: times new roman,new york,times,serif; font-size: 14pt;"><br><div style="font-family: times new roman,new york,times,serif; font-size: 12pt;"><font size="2" face="Tahoma"><hr size="1"><b><span style="font-weight: bold;">one:</span></b> second text<br><b><span style="font-weight: bold;">two:</span></b> third text<br><b><span style="font-weight: bold;">three:</span></b> fourth text<br><b><span style="font-weight: bold;">five:</span></b> fifth text<br></font><br>
I want to extract the first div that contains text (the whole div), to get output like:
<div style="font-family:times new roman,new york,times,serif;font-size:14pt">first text<br></div>
One more question: how do I get the first HTML tag (in general) that contains text? The first text may be inside a <p> or a <span>.
Thanks in advance.
You can use a SAX-style HTML parser, like TagSoup.
To do this, initialize the parser with a handler that extends DefaultHandler
and caches the last element visited in a member variable. Then detect the first time the characters(...)
method is called, and print out the cached element and the text.
See http://sax.sourceforge.net/quickstart.html for some direction on how to set up the parser.
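A rough sketch of that handler idea, in case it helps. To keep it self-contained it uses the SAX parser bundled with the JDK, so the input here has to be well-formed XHTML; for real-world HTML you would plug the same handler into TagSoup instead. The names FirstTextElement, FirstTextHandler and findFirstText are made up for this example. Two caveats: whitespace-only text is skipped, and the cached element is simply the last one opened, so text that follows a child such as <br/> would be attributed to that child.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class FirstTextElement {

    /** Caches the most recently opened element and records the first non-blank text seen. */
    static class FirstTextHandler extends DefaultHandler {
        private String lastElement; // name of the last element opened so far
        String element;             // element that preceded the first non-blank text
        String text;                // first non-blank text chunk (may be partial, see below)

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            lastElement = qName;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (element != null) {
                return; // already found the first text, ignore the rest
            }
            // SAX may deliver one text node in several chunks; this keeps only the first chunk.
            String chunk = new String(ch, start, length).trim();
            if (!chunk.isEmpty()) {
                element = lastElement;
                text = chunk;
            }
        }
    }

    /** Parses well-formed XHTML and returns the handler holding the result. */
    static FirstTextHandler findFirstText(String xhtml) throws Exception {
        FirstTextHandler handler = new FirstTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div><br/></div><div>first text<br/></div></body></html>";
        FirstTextHandler h = findFirstText(xhtml);
        System.out.println(h.element + ": " + h.text);
    }
}
```

If you also need the attributes (the style in your example), you could copy them from atts in startElement, since the parser may reuse that object after the callback returns.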
Use an HTML parser or, if you know that the HTML is XHTML, an XSLT processor.
Here is the list of open-source HTML parsers.
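For the XSLT route, something along these lines might work, assuming the input really is well-formed XHTML. It uses the XSLT processor that ships with the JDK (javax.xml.transform); the class name FirstTextXslt and the method firstElementWithText are just illustrative. The XPath (//*[text()[normalize-space()]])[1] selects the first element, in document order, that has a non-blank text node as a direct child.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class FirstTextXslt {

    // Stylesheet that copies the first element having a non-blank text child.
    private static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
        + "<xsl:template match='/'>"
        + "<xsl:copy-of select='(//*[text()[normalize-space()]])[1]'/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Runs the stylesheet over the given XHTML and returns the serialized result. */
    static String firstElementWithText(String xhtml) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xhtml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><div><br/></div><p>hello <b>world</b></p></body></html>";
        System.out.println(firstElementWithText(xhtml));
    }
}
```

Note that xsl:copy-of copies the element together with everything nested inside it, so with markup like yours, where the later divs end up inside the first one, you would still need to trim the children you do not want.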
What about loading the HTML into a temporary DOM (a DocumentFragment, see http://ejohn.org/blog/dom-documentfragments/) and then using jQuery to find the div you want inside the fragment?