Extract all text from a HTML page without losing context

Developer https://www.devze.com 2022-12-29 05:23 Source: Internet
For a translation program, I am trying to extract text from an HTML file with about 95% accuracy so that I can translate its sentences and links.

For example:

<div><a href="stack">Overflow</a> <span>Texts <b>go</b> here</span></div>

This should give me two results to translate:

Overflow

Texts <b>go</b> here

Are there any suggestions or commercial packages available for this problem?


I'm not exactly sure what you're asking, but look at simplehtmldom, specifically the "Extract Contents from HTML" tab under Quick Start on its front page (can't link directly, sigh). With that you can extract the text of a website without all those pesky tags.
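
If the goal is the exact output from the question (inline tags like <b> kept inside a segment, everything else treated as a boundary), stripping all tags is not quite enough. Here is a rough sketch of one way to do it, using Python's standard html.parser rather than simplehtmldom. The names INLINE_TAGS, SKIP_TAGS, SegmentExtractor and extract_segments are made up for this example, and the choice of which tags count as inline is an assumption you would want to tune for your own pages.

```python
import html
from html.parser import HTMLParser

# Assumption: these tags are inline formatting and stay inside a segment.
INLINE_TAGS = {"b", "i", "em", "strong", "u", "sub", "sup"}
# Assumption: the content of these tags is never translatable text.
SKIP_TAGS = {"script", "style"}


class SegmentExtractor(HTMLParser):
    """Collects translatable segments, splitting at every non-inline tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.segments = []
        self._parts = []        # pieces of the segment being built
        self._has_text = False  # True once the segment has visible text
        self._skip_depth = 0    # > 0 while inside <script> or <style>

    def _flush(self):
        # Emit the current segment only if it contains real text.
        if self._has_text:
            self.segments.append("".join(self._parts).strip())
        self._parts = []
        self._has_text = False

    def handle_starttag(self, tag, attrs):
        if tag in INLINE_TAGS:
            self._parts.append(f"<{tag}>")  # attributes are dropped
            return
        self._flush()
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in INLINE_TAGS:
            self._parts.append(f"</{tag}>")
            return
        self._flush()
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return
        if data.strip():
            self._has_text = True
        # Re-escape the text so each segment stays a valid HTML fragment.
        self._parts.append(html.escape(data, quote=False))

    def close(self):
        super().close()
        self._flush()  # emit any trailing text after the last tag


def extract_segments(html_text):
    parser = SegmentExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.segments


if __name__ == "__main__":
    sample = '<div><a href="stack">Overflow</a> <span>Texts <b>go</b> here</span></div>'
    for segment in extract_segments(sample):
        print(segment)
```

On the example from the question this prints Overflow and Texts <b>go</b> here as two separate segments. Note that <a> is treated as a boundary here so that link text becomes its own segment, which matches the expected output above but splits sentences that contain links mid-text.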
