I'm trying to fetch an entire web page through a URLConnection.
What's the most efficient way to do this?
I'm doing this already:
URL url = new URL("http://www.google.com/");
URLConnection connection;
connection = url.openConnection();
InputStream in = connection.getInputStream();
BufferedReader bf = new BufferedReader(new InputStreamReader(in));
StringBuffer html = new StringBuffer();
String line = bf.readLine();
while(line!=null){
html.append(line);
line = bf.readLine();
}
bf.close();
html has the entire HTML page.
I think this is the best way. The size of the page is fixed ("it is what it is"), so you can't improve on memory. Perhaps you can compress the contents once you have them, but they aren't very useful in that form. I would imagine that eventually you'll want to parse the HTML into a DOM tree.
Anything you do to parallelize the reading would overly complicate the solution.
I'd recommend using a StringBuilder with an initial capacity of 2048 or 4096.
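To make that concrete, here is a minimal sketch of the same line-by-line loop with a pre-sized StringBuilder. The class and method names (PageText, readLines) are made up for illustration, and it takes any Reader so it can be tried without a network connection. Note that readLine() strips line terminators, so the sketch appends a newline after each line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class PageText {
    // Hypothetical helper: collects all lines from the reader into one String.
    static String readLines(Reader source) throws IOException {
        // Initial capacity of 4096 chars; the builder still grows if the page is larger.
        StringBuilder html = new StringBuilder(4096);
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() drops the line terminator, so put one back.
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.print(readLines(new StringReader("<html>\n<body>hi</body>\n</html>")));
    }
}
```

With a URLConnection you would pass new InputStreamReader(connection.getInputStream()) as the source.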
Why are you thinking that the code you posted isn't sufficient? You sound like you're guilty of premature optimization.
Run with what you have and sleep at night.
What do you want to do with the obtained HTML? Parse it? It may be good to know that any half-decent HTML parser already has a constructor or method that takes a URL or an InputStream directly, so you don't need to worry about streaming performance like that.
Assuming that all you want to do is what you described in your previous question, with Jsoup, for example, you could obtain all those news links extraordinarily easily, as follows:
Document document = Jsoup.connect("http://news.google.com.ar/nwshp?hl=es&tab=wn").get();
Elements newsLinks = document.select("h2.title a:eq(0)");
for (Element newsLink : newsLinks) {
System.out.println(newsLink.attr("href"));
}
This yields the following after only a few seconds:
http://www.infobae.com/mundo/541259-100970-0-Pinera-confirmo-que-el-rescate-comenzara-las-20-y-durara-24-y-48-horas
http://www.lagaceta.com.ar/nota/403112/Argentina/Boudou-disculpo-con-DAIA-pero-volvio-cuestionar-medios.html
http://www.abc.es/agencias/noticia.asp?noticia=550415
http://www.google.com/hostednews/epa/article/ALeqM5i6x9rhP150KfqGJvwh56O-thi4VA?docId=1383133
http://www.abc.es/agencias/noticia.asp?noticia=550292
http://www.univision.com/contentroot/wirefeeds/noticias/8307387.shtml
http://noticias.terra.com.ar/internacionales/ecuador-apoya-reclamo-argentino-por-ejercicios-en-malvinas,3361af2a712ab210VgnVCM4000009bf154d0RCRD.html
http://www.infocielo.com/IC/Home/index.php?ver_nota=22642
http://www.larazon.com.ar/economia/Cristina-Fernandez-Censo-indispensable-pais_0_176100098.html
http://www.infobae.com/finanzas/541254-101275-0-Energeticas-llevaron-la-Bolsa-portena-ganancias
http://www.telam.com.ar/vernota.php?tipo=N&idPub=200661&id=381154&dis=1&sec=1
http://www.ambito.com/noticia.asp?id=547722
http://www.canal-ar.com.ar/noticias/noticiamuestra.asp?Id=9469
http://www.pagina12.com.ar/diario/cdigital/31-154760-2010-10-12.html
http://www.lanacion.com.ar/nota.asp?nota_id=1314014
http://www.rpp.com.pe/2010-10-12-ganador-del-pulitzer-destaca-nobel-de-mvll-noticia_302221.html
http://www.lanueva.com/hoy/nota/b44a7553a7/1/79481.html
http://www.larazon.com.ar/show/sdf_0_176100096.html
http://www.losandes.com.ar/notas/2010/10/12/batista-siento-comodo-dieron-respaldo-520595.asp
http://deportes.terra.com.ar/futbol/los-rumores-empiezan-a-complicar-la-vida-de-river-y-vuelve-a-sonar-gallego,a24483b8702ab210VgnVCM20000099f154d0RCRD.html
http://www.clarin.com/deportes/futbol/Exigieron-Roman-regreso-Huracan_0_352164993.html
http://www.el-litoral.com.ar/leer_noticia.asp?idnoticia=146622
http://www.nuevodiarioweb.com.ar/nota/181453/Locales/C%C3%A1ncer_mama:_200_casos_a%C3%B1o_Santiago.html
http://www.ultimahora.com/notas/367322-Funcionarios-sanitarios-capacitaran-sobre-cancer-de-mama
http://www.lanueva.com/hoy/nota/65092f2044/1/79477.html
http://www.infobae.com/policiales/541220-101275-0-Se-suspendio-la-declaracion-del-marido-Fernanda-Lemos
http://www.clarin.com/sociedad/educacion/titulo_0_352164863.html
Did someone already say that regex is absolutely the wrong tool to parse HTML? ;)
See also:
- Pros and cons of HTML parsers in Java
Your approach looks pretty good; however, you can make it somewhat more efficient by avoiding the creation of an intermediate String object for each line.
The way to do this is to read directly into a temporary char[] buffer.
Here is a slightly modified version of your code that does this (minus all the error checking, exception handling etc. for clarity):
URL url = new URL("http://www.google.com/");
URLConnection connection;
connection = url.openConnection();
InputStream in = connection.getInputStream();
BufferedReader bf = new BufferedReader(new InputStreamReader(in));
StringBuffer html = new StringBuffer();
char[] charBuffer = new char[4096];
int count;
// read() returns -1 once the end of the stream is reached
while ((count = bf.read(charBuffer, 0, charBuffer.length)) != -1) {
    html.append(charBuffer, 0, count);
}
bf.close();
For even more performance, you can of course do little extra things like pre-allocating the character array and StringBuffer if this code is going to be called frequently.
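As a rough sketch of that pre-allocation idea, the buffers can live in an object that is reused across calls. Everything named here (ReusablePageReader, readAll, the 64K initial capacity) is an assumption for illustration, and it uses StringBuilder rather than StringBuffer since no synchronization is needed. A single instance is not safe to share between threads.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReusablePageReader {
    // Allocated once and reused by every call to readAll().
    private final char[] buffer = new char[4096];
    private final StringBuilder html = new StringBuilder(64 * 1024);

    // Hypothetical helper: drains the reader and returns its content.
    String readAll(Reader source) throws IOException {
        html.setLength(0); // discard the previous page but keep the allocated capacity
        int count;
        while ((count = source.read(buffer, 0, buffer.length)) != -1) {
            html.append(buffer, 0, count);
        }
        return html.toString();
    }

    public static void main(String[] args) throws IOException {
        ReusablePageReader reader = new ReusablePageReader();
        System.out.println(reader.readAll(new StringReader("<html>first</html>")));
        System.out.println(reader.readAll(new StringReader("<html>second</html>")));
    }
}
```

Whether this is measurably faster than allocating fresh buffers depends on page sizes and call frequency, so it is worth profiling before committing to it.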
You can try using Commons IO from Apache (http://commons.apache.org/io/api-release/org/apache/commons/io/IOUtils.html):
new String(IOUtils.toCharArray(connection.getInputStream()))
There are some technical considerations. You may wish to use HttpURLConnection instead of URLConnection.
HttpURLConnection supports chunked transfer encoding, which allows you to process the data in chunks, rather than buffering all of the content before you start doing work. This can lead to an improved user experience.
Also, HttpURLConnection supports persistent connections. Why close that connection if you're going to request another resource right away? Keeping the TCP connection open with the web server allows your application to quickly download multiple resources without paying the overhead (latency) of establishing a new TCP connection for each resource.
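A minimal sketch of what that might look like, assuming Java 9+ (for InputStream.readAllBytes) and a UTF-8 page; the class and method names are invented for the example. The point is to read the response to the end and close the stream, without calling disconnect(), so the JDK is free to keep the underlying socket alive for the next request to the same host.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class KeepAliveFetch {
    // Hypothetical helper: fetches one page and leaves the connection reusable.
    static String fetch(URL url) throws IOException {
        // For http:// URLs, openConnection() returns an HttpURLConnection.
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        int status = connection.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            throw new IOException("Unexpected HTTP status " + status + " for " + url);
        }
        // Read the whole body and close the stream; no disconnect() call,
        // so the socket can go back to the keep-alive cache.
        try (InputStream in = connection.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8); // assumes UTF-8
        }
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.google.com/");
        System.out.println(fetch(url).length());
        System.out.println(fetch(url).length()); // may reuse the same TCP connection
    }
}
```

Whether the second request actually reuses the socket is up to the JDK and the server, so treat the reuse as an optimization you allow rather than one you control.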
Tell the server that you support gzip and wrap a BufferedReader around GZIPInputStream if the response header says the content is compressed.
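Something along these lines, as a sketch: the GzipAwareReader name is made up, and UTF-8 is assumed for the page charset (in real code you would take it from the Content-Type header). setRequestProperty and getContentEncoding are the standard URLConnection methods for the request and response headers.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

class GzipAwareReader {
    // Wraps the body in a GZIPInputStream only if the server says it is gzip-compressed.
    static BufferedReader open(InputStream body, String contentEncoding) throws IOException {
        InputStream in = "gzip".equalsIgnoreCase(contentEncoding)
                ? new GZIPInputStream(body)
                : body;
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8)); // assumes UTF-8
    }

    // Asks for gzip and picks the right stream based on the Content-Encoding response header.
    static BufferedReader open(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        connection.setRequestProperty("Accept-Encoding", "gzip");
        return open(connection.getInputStream(), connection.getContentEncoding());
    }
}
```

The server is free to ignore the Accept-Encoding header, which is why the check on the response header matters.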