HTML content extraction using HtmlUnit

Developer https://www.devze.com 2023-03-18 09:05 Source: Internet
I have a series of HTML files that all share the same structure.

Take this example:

>     <html>
>     <head>
>     <title>main page</title>
>     </head>
>     <body>
>     <table><tr>
>     <td>content1</td>
>     </tr></table>
>     </body>
>     </html>

I want to extract the content of the title tag and of the td tag. How can I do this with HtmlUnit? I am new to HtmlUnit, so any help is appreciated.


See this instructive snippet from the HTMLUnit page.

There you first construct a client, then retrieve your page, and finally either ask for the title text (page.getTitleText()) or get the entire page as an XML string (page.asXml()). You could then assertContains on that string.

There are plenty of other options, such as retrieving elements by id. It is best to look through the examples yourself.
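
As a rough illustration of the steps just described, the sketch below builds a client, loads a page, prints the title, and prints the text of every td element. It assumes HtmlUnit 2.x on the classpath (package com.gargoylesoftware.htmlunit; newer 3.x releases use org.htmlunit instead). The class name HtmlUnitExtract and the file URL are made up for the example.

```java
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Hypothetical class name; assumes HtmlUnit 2.x is on the classpath.
class HtmlUnitExtract {
    public static void main(String[] args) throws Exception {
        // Placeholder location; pass the real URL of one of your files as the first argument.
        String url = args.length > 0 ? args[0] : "file:///tmp/page.html";

        // WebClient is AutoCloseable, so try-with-resources releases it when done.
        try (WebClient client = new WebClient()) {
            HtmlPage page = client.getPage(url);

            // Content of the <title> tag.
            System.out.println("title: " + page.getTitleText());

            // Text content of every <td> element, in document order.
            for (DomElement td : page.getElementsByTagName("td")) {
                System.out.println("td: " + td.getTextContent().trim());
            }
        }
    }
}
```

For a whole directory of files with the same structure, the same body could be run in a loop over the file URLs while reusing one client.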


HtmlUnit is a testing tool, not a DOM parser.

To parse HTML into a DOM, use http://about.validator.nu/htmlparser/ and its HtmlDocumentBuilder class.

Once you have a Document you can do myDocument.getElementsByTagName("title") to find the title element.
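
The getElementsByTagName lookup can be sketched with nothing but the JDK. Note the substitution: because the sample page from the question happens to be well-formed XML, this sketch uses the JDK's built-in DocumentBuilder rather than HtmlDocumentBuilder. For real-world HTML you would build the Document with the validator.nu parser instead; the lookup on the resulting Document stays the same. The class and helper names (DomExtract, parse, textOf) are invented for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical class name; uses only JDK classes.
class DomExtract {
    // The sample page from the question, which is well-formed XML.
    static final String SAMPLE =
        "<html><head><title>main page</title></head>"
      + "<body><table><tr><td>content1</td></tr></table></body></html>";

    // Assumption: the JDK XML parser stands in for HtmlDocumentBuilder here,
    // which only works because the input is well-formed.
    static Document parse(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(markup)));
    }

    // Collects the trimmed text of every element with the given tag name.
    static List<String> textOf(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent().trim());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(textOf(doc, "title")); // prints [main page]
        System.out.println(textOf(doc, "td"));    // prints [content1]
    }
}
```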
