I'm doing a basic HEAD request using the HttpClient library. How can I get the character set that Apache returns, e.g. utf-8, iso-8859-1, etc.? Thanks!
HttpParams httpParams = new BasicHttpParams();
HttpConnectionParams.setConnectionTimeout(httpParams, 2000);
HttpConnectionParams.setSoTimeout(httpParams, 2000);
DefaultHttpClient httpclient = new DefaultHttpClient(httpParams);
httpclient.getParams().setParameter("http.useragent", "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)");
HttpContext localContext = new BasicHttpContext();
HttpHead httpget = new HttpHead(url);
HttpResponse response = httpclient.execute(httpget, localContext);
this.sparrowResult.statusCode = response.getStatusLine().getStatusCode();
WORKING RESULT (UPDATED)
Header contentType = response.getFirstHeader("Content-Type");
String charset= contentType.getValue();
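Note that getValue() returns the whole header value (for example "text/html; charset=utf-8"), not only the charset. A minimal sketch of pulling out just the charset parameter with plain string handling follows; the class and method names (ContentTypeCharset, charsetOf) are made up for illustration, and it assumes parameter values do not contain a ';' inside quotes.

```java
class ContentTypeCharset {
    /** Returns the charset parameter of a Content-Type value, or null if there is none. */
    static String charsetOf(String contentTypeValue) {
        if (contentTypeValue == null) {
            return null;
        }
        // The media type comes first; parameters follow, separated by ';'.
        String[] parts = contentTypeValue.split(";");
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            int eq = part.indexOf('=');
            if (eq < 0) {
                continue; // not a name=value parameter
            }
            String name = part.substring(0, eq).trim();
            if (name.equalsIgnoreCase("charset")) {
                String value = part.substring(eq + 1).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=utf-8"));
    }
}
```

With the snippet above you would call it as charsetOf(contentType.getValue()), after checking that contentType itself is not null.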
If using HttpClient 4.2:
import java.nio.charset.Charset;
import org.apache.http.entity.ContentType;
ContentType contentType = ContentType.getOrDefault(entity);
Charset charSet = contentType.getCharset();
If using HttpClient 4.1 (latest):
import org.apache.http.protocol.HTTP;
import org.apache.http.util.EntityUtils;
String charset = EntityUtils.getContentCharSet(entity);
if (charset == null) {
charset = HTTP.DEFAULT_CONTENT_CHARSET;
}
In HTTP/1.1, the character set is carried in the Content-Type header:
HTTP/1.1 200 OK
Content-Type: text/plain; charset=utf-8
So that should be buried in
HttpResponse.Headers
So, this should work
HttpResponse.Headers["Content-Type"]
** I didn't test this, but you get the idea.
In some cases the server will not give you the charset in the header, but it's written in the content, e.g. this URL: http://seniv.dlmostil.ru/jacket/p/kupit-sportivnie-bryki-adidas-s-dostavkoy/
When you do
ContentType contentType = ContentType.getOrDefault(entity);
Charset charSet = contentType.getCharset();
then charSet is null.
In that case I read the stream and try to extract the charset from the HTML with a regular expression. Once you have read the content from the input stream into
ByteArrayOutputStream out = new ByteArrayOutputStream();
then you can do this:
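// Decode as ISO-8859-1 only to search for the declaration: every byte maps to one char, so the ASCII markup stays readable.
String help = new String(out.toByteArray(), "ISO-8859-1");
Pattern charSet = Pattern.compile("charset\\s*=\\s*\"?(.*?)[\";\\>]", Pattern.CASE_INSENSITIVE);
Matcher m = charSet.matcher(help);
String encoding = m.find() ? m.group(1).trim() : "UTF-8";
// Fall back to UTF-8 when the declared name is not a charset this JVM knows.
if (Charset.availableCharsets().get(encoding) == null) encoding = "UTF-8";
String html = new String(out.toByteArray(), encoding);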
I hope you get the idea: this is a last resort for when none of the other methods work.
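For reference, the same fallback can be wrapped into a small helper that takes the raw bytes and returns the decoded text. This is only a sketch under some assumptions: the class and method names (HtmlCharsetSniffer, sniff, decode) are invented here, it only looks at the first 4096 bytes, it uses a stricter pattern than the one above, and it falls back to UTF-8 when nothing usable is declared.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlCharsetSniffer {
    // Matches charset=NAME, with an optional opening quote before NAME.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in the markup, or UTF-8 if none is declared or supported. */
    static Charset sniff(byte[] body) {
        // The declaration normally sits near the top, so inspect only the head of the document.
        int n = Math.min(body.length, 4096);
        // ISO-8859-1 maps every byte to a char, so the ASCII markup survives decoding.
        String head = new String(body, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        if (m.find()) {
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name);
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name: fall through to the default.
            }
        }
        return StandardCharsets.UTF_8;
    }

    /** Decodes the body using the sniffed charset. */
    static String decode(byte[] body) {
        return new String(body, sniff(body));
    }
}
```

With the code above, this would be called as HtmlCharsetSniffer.decode(out.toByteArray()). Charset.isSupported also accepts aliases such as "latin1", which a lookup in Charset.availableCharsets() would miss.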