
Character encoding in a web page using Java

Developer https://www.devze.com 2023-02-12 07:58 Source: Internet

How do I find out the type of character encoding in a web page using Java?


Open a connection to the URL (using URL.openConnection()), and then parse the content type returned by the getContentType() method, which should contain the charset (for example, text/html; charset=UTF-8). If the charset is not present in this header, you might have to parse the HTML content and look for a tag such as

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
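As a rough illustration of both steps, here is a minimal sketch. The class name CharsetSniffer and its method names are made up for this example, and the regular expression is only a simple approximation: it handles the common charset=... forms in a Content-Type header or a meta tag, not every case a browser would accept.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLConnection;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class CharsetSniffer {
    // Matches charset=NAME, with optional quotes, in a header value or a <meta> tag.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([^\\s\"';/>]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the first charset name found in the text, or null if there is none. */
    static String findCharset(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Returns the charset named in the HTTP Content-Type header, or null if absent. */
    static String fromHeader(String urlString) throws IOException {
        URLConnection conn = URI.create(urlString).toURL().openConnection();
        return findCharset(conn.getContentType());
    }
}
```

If fromHeader returns null, the same findCharset method can be applied to the first few kilobytes of the page (decoded as ISO-8859-1, which maps every byte to a character) to pick up a meta tag like the one above.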


I believe this does exactly what you need. It has both code and explanation: http://nadeausoftware.com/node/73

A quick summary is as follows:

Create a WebFile class where:

  1. The constructor public WebFile( String urlString ) opens a URLConnection and reads in the headers, including the character encoding. If the encoding is not present in the headers, you'll have to read it from the web page itself. If it is not present there either, you could try your luck with a character encoding detection algorithm.
  2. The method private Object readStream(int length, java.io.InputStream stream) reads the page data from the stream and returns a String decoded with the character encoding, i.e. return new String( bytes, charset ). If there is no encoding present, or if there's an encoding exception, it returns the byte array created by reading the stream instead.
  3. Getters and setters expose the page content and the encoding; readStream is invoked just once and its result is reused.
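The summary above could be sketched roughly as follows. This is not the code from the linked article: the class name WebFileSketch, the helpers charsetOf and decode, and the simplified readStream signature (no length parameter, static, charset passed in) are assumptions made for illustration, and the sketch does not attempt the fallback of reading the encoding from the page itself.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical sketch of the WebFile idea; names and signatures are illustrative.
class WebFileSketch {
    private String charset;  // null when the Content-Type header names no charset
    private Object content;  // a String if decoding worked, otherwise the raw byte[]

    WebFileSketch(String urlString) throws IOException {
        URLConnection conn = URI.create(urlString).toURL().openConnection();
        charset = charsetOf(conn.getContentType());
        try (InputStream in = conn.getInputStream()) {
            content = readStream(in, charset);  // read the page exactly once
        }
    }

    /** Extracts the charset parameter from a Content-Type value, or returns null. */
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                return p.substring(8).replace("\"", "").trim();
            }
        }
        return null;
    }

    /** Reads the whole stream, then decodes it if a usable charset is known. */
    static Object readStream(InputStream stream, String charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = stream.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return decode(buffer.toByteArray(), charset);
    }

    /** Returns a String when the charset is known and supported, else the bytes. */
    static Object decode(byte[] bytes, String charset) {
        if (charset == null) {
            return bytes;
        }
        try {
            return new String(bytes, charset);
        } catch (UnsupportedEncodingException e) {
            return bytes;  // unknown encoding: hand back the raw data
        }
    }

    String getCharset() { return charset; }
    Object getContent() { return content; }
}
```

A caller would check whether getContent() returned a String or a byte[] before using it, since the byte array signals that no usable encoding was found.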