How to extract specific text from a webpage? [duplicate]

开发者 https://www.devze.com 2023-04-05 12:34 Source: the web
This question already has answers here: Text Extraction from HTML Java (8 answers) Closed 10 years ago.

I'm trying to extract specific text from a webpage.

This is the part of the webpage which contains the specific text:

<div class="module">
<div class="body">
<dl class="per_info">
<dt>F.Name:</dt>
<dd><a class="nm" href="http://">a Variable Name1</a></dd>
<dt>L.Name:</dt>
<dd><a class="nm" href="http://">a Variable Name2</a></dd>
</dl>
</div>
</div>

How can I extract the contents of Variable Name1 and Variable Name2?

Is there an HTML parser that can do this extraction?


Well, you can try Selenium. It loads the HTML page into your Java code in a DOM-aware fashion, so that afterwards you can pick out the content of HTML elements by id, XPath, etc.

http://seleniumhq.org/
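
To illustrate what picking elements by XPath looks like, here is a minimal sketch. It is not Selenium: it uses only the XPath API that ships with the JDK, and it assumes the fragment from the question is well-formed XML (which it happens to be). The class name XPathNameExtractor and the SAMPLE constant are hypothetical names made up for this example. A similar XPath expression could presumably be handed to a browser-driving tool instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of any library.
class XPathNameExtractor {
    // The fragment from the question; assumed to be well-formed XML.
    static final String SAMPLE =
        "<div class=\"module\"><div class=\"body\"><dl class=\"per_info\">"
        + "<dt>F.Name:</dt><dd><a class=\"nm\" href=\"http://\">a Variable Name1</a></dd>"
        + "<dt>L.Name:</dt><dd><a class=\"nm\" href=\"http://\">a Variable Name2</a></dd>"
        + "</dl></div></div>";

    // Returns the text of every <a class="nm"> inside <dl class="per_info">.
    static List<String> extractNames(String xml) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
            "//dl[@class='per_info']/dd/a[@class='nm']",
            new InputSource(new StringReader(xml)),
            XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getTextContent());
        }
        return names;
    }
}
```

Calling XPathNameExtractor.extractNames(XPathNameExtractor.SAMPLE) returns the two link texts in document order. Real-world pages are rarely well-formed XML, so this plain-JDK approach only works on clean input like the fragment above.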


TagSoup is a SAX-compliant parser that can parse HTML as found in the "wild", so the input does not need to be well-formed XML.
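
Since TagSoup exposes the SAX interface, the extraction logic lives in a SAX handler. The following is a rough sketch of such a handler, run here against the JDK's built-in (strict) SAX parser so that it needs no extra dependency. It assumes the input is the well-formed fragment from the question. The names SaxNameExtractor, NameHandler and SAMPLE are hypothetical. With TagSoup, a handler like this would be attached to TagSoup's XMLReader instead, and the element-name checks may need adjusting depending on its namespace settings.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of TagSoup or any other library.
class SaxNameExtractor {
    // The fragment from the question; assumed to be well-formed XML.
    static final String SAMPLE =
        "<div class=\"module\"><div class=\"body\"><dl class=\"per_info\">"
        + "<dt>F.Name:</dt><dd><a class=\"nm\" href=\"http://\">a Variable Name1</a></dd>"
        + "<dt>L.Name:</dt><dd><a class=\"nm\" href=\"http://\">a Variable Name2</a></dd>"
        + "</dl></div></div>";

    // Collects the text of every <a class="nm"> element seen during parsing.
    static class NameHandler extends DefaultHandler {
        final List<String> names = new ArrayList<>();
        private StringBuilder current; // non-null only while inside a matching <a>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            if ("a".equals(qName) && "nm".equals(attrs.getValue("class"))) {
                current = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text may arrive in several chunks, so append rather than assign.
            if (current != null) {
                current.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (current != null && "a".equals(qName)) {
                names.add(current.toString().trim());
                current = null;
            }
        }
    }

    // Parses the given markup and returns the collected link texts.
    static List<String> extractNames(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        NameHandler handler = new NameHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.names;
    }
}
```

The handler only tracks state while it is inside a matching anchor, which is the usual trade-off with SAX: low memory use, but you keep track of the context yourself.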


jsoup is a Java library that can parse HTML and extract element data. To use jsoup, you first create a jsoup Document by parsing it from a file, a URL, a whole document string, or an HTML fragment string. An HTML fragment looks something like this:

String html = "<div class='module'>" +
    "<div class='body'>" +
    "<dl class='per_info'>" +
    "<dt>F.Name:</dt>" +
    "<dd><a class='nm' href='http://'>a Variable Name1</a></dd>" +
    "<dt>L.Name:</dt>" +
    "<dd><a class='nm' href='http://'>a Variable Name2</a></dd>" +
    "</dl>" +
    "</div>" +
    "</div>";
Document doc = Jsoup.parseBodyFragment(html);

With the document, you can use jsoup's selectors to locate specific elements:

// select all <a> elements from the document
Elements anchors = doc.select("a");

With the element collection, you can iterate over the elements and extract their text content:

for (Element anchor : anchors) {
    String contents = anchor.text();
    System.out.println(contents);
}