Getting the text of a webpage with HTTPClient

Developer https://www.devze.com 2023-01-05 21:18 Source: the web

I'm just getting started with HTTPClient, and I want to fetch a webpage and extract its raw text, minus all the HTML markup.

Can HTTPClient accomplish that? If so, how? Or is there another library I should be looking at?

For example, if the page contains

<body><p>para1 test info</p><div><p>more stuff here</p></div>

I'd like it to output

para1 test info more stuff here

I'd suggest using HttpComponents Client (HTTPClient 4) instead of the version 3 you've linked to.

That said, the problem is independent of the HTTP client library (there are others). What you need is a way to convert the HTML into plain text. This could be of interest: http://www.rgagnon.com/javadetails/java-0424.html
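
As a rough sketch of that conversion step, assuming the page body has already been read into a String by whichever HTTP client you use, the HTML parser that ships with the JDK (javax.swing.text.html) can collect the text runs and drop the tags. The class name HtmlText and the method extract are made up for illustration, and joining the runs with single spaces is just one possible choice.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of any HTTP client library.
final class HtmlText {

    // Returns the text found in the HTML, with text runs joined by single spaces.
    static String extract(String html) throws IOException {
        final StringBuilder text = new StringBuilder();

        // The parser calls handleText once per run of character data it finds.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        // Third argument: ignore any charset declared inside the document,
        // since the input is already a decoded String.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // The sample markup from the question.
        String html = "<body><p>para1 test info</p><div><p>more stuff here</p></div>";
        System.out.println(extract(html));
    }
}
```

One caveat with this approach: because every text run is joined with a space, inline markup in the middle of a word (for example a <b> around part of it) would split that word. A dedicated HTML parsing library gives you more control over cases like that.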

No. HttpClient handles the network protocol - sending requests and receiving responses. It's up to you to figure out what to do with the response once you receive it. That said, you can use other libraries to parse the HTML, as others suggested.

The HTML Parser library might be what you are looking for. It allows for extraction of content from an HTML document.

As others have mentioned, you need an HTML parsing library. Here is a relevant question.