
Html 2 Text - Remove "hidden" Text

Developer https://www.devze.com 2023-02-16 05:40 Source: Internet
I am currently looking for ways to read the visible text of a website and store it in a plain-text string using Java.

In other words, I'd like to convert something like this:

Hello <span style="display: none">stupid</span> World into "Hello World"

or something like

<span>Un</span>friendly into "Unfriendly" (and not something like "Un friendly")

or

Hello

World

into "Hello World" (as new lines are ignored in HTML)

Do you know of any lib capable of assisting in this task?

Cheers,

Matthias


Boilerpipe is a Java library that strips boilerplate from HTML pages and extracts the main text.


Have a look at Cobra to see if the API provides any method to render the HTML and convert it into plain text.
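
Neither answer includes code, so here is a rough sketch of the three conversions from the question using only the Java standard library. It does not use the Boilerpipe or Cobra API. The class name VisibleText and its extract method are made up for illustration. It also rests on some big assumptions: text is hidden only through an inline style="display: none" attribute, hidden elements do not contain a nested element with the same tag name, and stylesheets, scripts and HTML entities are ignored. Block-level tags such as <p> are simply dropped without inserting a separator, so a real HTML parser or renderer is the safer choice for anything beyond simple fragments.

```java
import java.util.regex.Pattern;

public class VisibleText {
    // An element whose inline style contains display:none, matched up to the
    // first closing tag of the same name (same-tag nesting is NOT handled).
    private static final Pattern HIDDEN = Pattern.compile(
        "<(\\w+)\\b[^>]*\\bstyle\\s*=\\s*([\"'])[^\"']*display\\s*:\\s*none[^\"']*\\2[^>]*>.*?</\\1\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Any remaining tag; removed without adding a space so that
    // "<span>Un</span>friendly" stays one word.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Runs of whitespace, including newlines, collapse to one space.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Hypothetical helper: returns an approximation of the visible text.
    public static String extract(String html) {
        String s = HIDDEN.matcher(html).replaceAll("");
        s = TAG.matcher(s).replaceAll("");
        s = WHITESPACE.matcher(s).replaceAll(" ");
        return s.trim();
    }

    public static void main(String[] args) {
        System.out.println(extract("Hello <span style=\"display: none\">stupid</span> World")); // Hello World
        System.out.println(extract("<span>Un</span>friendly"));                                 // Unfriendly
        System.out.println(extract("Hello\n\nWorld"));                                          // Hello World
    }
}
```

The order matters: hidden elements are removed first while their tags are still present, then the remaining tags are stripped, and whitespace is collapsed last so that the gap left by a removed element does not produce a double space.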
