With HttpClient, is there a way to get the character set of a page with a HEAD request?

Developer https://www.devze.com 2023-01-06 11:51 Source: web
I'm doing a basic HEAD request using the HttpClient library. I'm curious how I'd be able to get the character set that Apache returns, e.g. utf-8, iso-8859-1, etc. Thanks!

  // Connection and socket read timeouts, in milliseconds
  HttpParams httpParams = new BasicHttpParams();
  HttpConnectionParams.setConnectionTimeout(httpParams, 2000);
  HttpConnectionParams.setSoTimeout(httpParams, 2000);

  DefaultHttpClient httpclient = new DefaultHttpClient(httpParams);
  httpclient.getParams().setParameter("http.useragent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)");

  HttpContext localContext = new BasicHttpContext();
  // HEAD request: the server returns the headers only, no body
  HttpHead httpHead = new HttpHead(url);

  HttpResponse response = httpclient.execute(httpHead, localContext);

  this.sparrowResult.statusCode = response.getStatusLine().getStatusCode();

WORKING RESULT UPDATED

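Header contentType = response.getFirstHeader("Content-Type");
// getFirstHeader returns null if the server sent no Content-Type header.
// getValue() is the full header value, e.g. "text/html; charset=utf-8".
String charset = (contentType != null) ? contentType.getValue() : null;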
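
Since the header value holds the media type plus its parameters, the charset still has to be picked out of it. Here is a minimal sketch of that step using only the JDK. The class name `ContentTypeCharset` and the method `charsetOf` are made up for illustration and are not part of HttpClient.

```java
import java.util.Locale;

class ContentTypeCharset {
    /** Returns the charset parameter of a Content-Type value, or null if there is none. */
    static String charsetOf(String contentTypeValue) {
        if (contentTypeValue == null) {
            return null;
        }
        // Parameters follow the media type and are separated by ';'
        String[] parts = contentTypeValue.split(";");
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            int eq = part.indexOf('=');
            if (eq < 0) {
                continue;
            }
            // Parameter names are case-insensitive
            String name = part.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (name.equals("charset")) {
                String value = part.substring(eq + 1).trim();
                // Strip optional surrounding double quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/plain; charset=utf-8")); // prints utf-8
        System.out.println(charsetOf("text/html"));                 // prints null
    }
}
```

With the code above you would call it as `ContentTypeCharset.charsetOf(contentType.getValue())` and decide on a default yourself when it returns null.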


If using HttpClient 4.2

import java.nio.charset.Charset;
import org.apache.http.entity.ContentType;

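// entity comes from response.getEntity(); it may be null for a HEAD response, which carries no body
ContentType contentType = ContentType.getOrDefault(entity);
Charset charSet = contentType.getCharset();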


If using HttpClient 4.1 (latest):

import org.apache.http.protocol.HTTP;
import org.apache.http.util.EntityUtils;

String charset = EntityUtils.getContentCharSet(entity);
if (charset == null) {
    charset = HTTP.DEFAULT_CONTENT_CHARSET;
}


In HTTP 1.1, the character set is in the Content-Type header:

HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8

So it should be available in the response headers:

HttpResponse.Headers

and something like this should work:

HttpResponse.Headers["Content-Type"]

** I didn't test this, but you get the idea.


In some cases the server does not give you the charset in the header, but it is written in the content, e.g. this URL: http://seniv.dlmostil.ru/jacket/p/kupit-sportivnie-bryki-adidas-s-dostavkoy/

When you do

ContentType contentType = ContentType.getOrDefault(entity); 
Charset charSet = contentType.getCharset();

then charSet is null.

In that case I read the stream and try to extract the charset from the HTML with a regular expression. Once you have read the content from the input stream into

ByteArrayOutputStream out = new ByteArrayOutputStream();

then you can do this:

// Decode as ISO-8859-1 first: it maps every byte to a character, so the ASCII markup stays readable
String help = new String(out.toByteArray(), "ISO-8859-1");
Pattern charSet = Pattern.compile("charset\\s*=\\s*\"?(.*?)[\";\\>]", Pattern.CASE_INSENSITIVE);
Matcher m = charSet.matcher(help);
String encoding = m.find() ? m.group(1).trim() : "UTF-8";
try {
    Charset.forName(encoding); // throws if the name is illegal or not supported by this JVM
} catch (IllegalArgumentException e) {
    encoding = "UTF-8";
}
String html = new String(out.toByteArray(), encoding);

I hope you get the idea. This is a last resort for when none of the other methods work.
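
For reference, the same fallback can be wrapped in one self-contained helper. This is only a sketch under a few assumptions: the class `MetaCharsetSniffer` and its methods are hypothetical names, the regular expression is a stricter variant of the one above (it accepts only characters that can appear in a charset name), and only the first 4096 bytes are scanned on the assumption that the declaration sits near the top of the page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharsetSniffer {
    // Matches charset=NAME with an optional opening quote before the name
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the markup, or UTF-8 if none is usable. */
    static Charset sniff(byte[] body) {
        // ISO-8859-1 maps every byte to one char, so the ASCII markup survives decoding
        int len = Math.min(body.length, 4096);
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: fall through to the default
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Decodes the whole body with the sniffed charset. */
    static String decode(byte[] body) {
        return new String(body, sniff(body));
    }
}
```

Usage would be `String html = MetaCharsetSniffer.decode(out.toByteArray());` once the stream has been read into `out`.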
