How to get non-Latin characters from a website?

Developer https://www.devze.com 2023-02-14 05:08 Source: Internet
I'm trying to fetch data from latata.pl/pl.php and display all the characters correctly (Polish, ISO-8859-2):

    final URL url = new URL("http://latata.pl/pl.php");
    final URLConnection urlConnection = url.openConnection();
    final BufferedReader in = new BufferedReader(new InputStreamReader(
            urlConnection.getInputStream()));
    String inputLine;

    while ((inputLine = in.readLine()) != null) {
        System.out.println(inputLine);
    }
    in.close();

It doesn't work. :( Any ideas?


InputStreamReader has several constructors, and in a case like this you should (in fact, have to) specify the encoding through one of them.
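A minimal sketch of that constructor in use, not a definitive fix. The class name `ExplicitCharsetReader` and the helper `readAll` are made up for illustration, and the byte array is only a stand-in for `urlConnection.getInputStream()`: it holds the ten body bytes that the characters shown in this thread imply, assuming the page really is ISO-8859-2.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class ExplicitCharsetReader {

    // Reads every line from the stream, decoding the bytes with the given
    // charset instead of the platform default.
    public static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        // Assumed stand-in for the network stream: the ten body bytes of pl.php.
        byte[] body = {
            (byte) 0xB1, (byte) 0xEA, (byte) 0xB3, (byte) 0xF3, (byte) 0xBF,
            (byte) 0xA1, (byte) 0xCA, (byte) 0xA3, (byte) 0xAF, (byte) 0xAC
        };
        // Charset.forName throws if the JVM does not support the named charset.
        Charset latin2 = Charset.forName("ISO-8859-2");
        System.out.print(readAll(new ByteArrayInputStream(body), latin2));
    }
}
```

With a real connection, the same `readAll` call would take `urlConnection.getInputStream()` as its first argument. Whether the console then shows the Polish letters also depends on the console's own encoding.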


Your InputStreamReader will be attempting to convert the bytes coming back over the TCP connection using your platform default encoding (which is most likely UTF-8 or one of the horrible Windows ones). You should explicitly specify an encoding.

Assuming the web server is doing a good job, you can find the correct encoding in the Content-Type HTTP header (in its charset parameter). Or you can just assume it's ISO-8859-2, but that might break later.
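One way to pick the encoding from the header rather than hard-coding it, as a rough sketch. The class `ContentTypeCharset` and the method `charsetOf` are hypothetical names, and the parsing is deliberately simple (it does not handle every legal header form).

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

public class ContentTypeCharset {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=ISO-8859-2". Returns the fallback when the header
    // is missing, has no charset parameter, or names an unknown charset.
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.regionMatches(true, 0, "charset=", 0, 8)) {
                String name = param.substring(8).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.ISO_8859_1;
        System.out.println(charsetOf("text/html; charset=ISO-8859-2", fallback));
        System.out.println(charsetOf("text/html", fallback));
    }
}
```

With a `URLConnection`, the header value is available from `urlConnection.getContentType()`, and the resulting `Charset` can be passed to the `InputStreamReader` constructor. For a response that declares no charset at all, only the fallback applies, so the caller still has to choose a sensible one.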


This is too long for a comment, but who set up that web page? You? From what I can see, it doesn't look correct.

Here's what you get back:

$ telnet latata.pl 80
Trying 91.205.74.65...
Connected to latata.pl.
Escape character is '^]'.
GET /pl.php HTTP/1.0
Host: latata.pl

HTTP/1.1 200 OK
Date: Sun, 27 Feb 2011 13:49:19 GMT
Server: Apache/2
X-Powered-By: PHP/5.2.16
Vary: Accept-Encoding,User-Agent
Content-Length: 10
Connection: close
Content-Type: text/html

����ʣ��Connection closed by foreign host.

The HTML is simply:

<html>
<head></head>
<body>±ê³ó¿¡Ê£¯¬</body>
</html>

And that's how your page appears from a browser. Is there a valid reason why no charset is specified in that HTML page?


The output of your PHP script pl.php is faulty. It sends the HTTP header Content-Type: text/html without declaring a charset. Without a declared charset, the client has to assume ISO-8859-1, according to the HTTP specification. Interpreted as ISO-8859-1, the body that was sent reads ±ê³ó¿¡Ê£¯¬.

The bytes sent by the PHP script would represent ąęłóżĄĘŁŻŹ if the response were declared as

Content-Type: text/html; charset=ISO-8859-2

You can check this with a simple code fragment, which turns the wrongly decoded ISO-8859-1 string back into bytes and decodes them as ISO-8859-2:

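// The body as it appears when wrongly decoded as ISO-8859-1
final String test = "±ê³ó¿¡Ê£¯¬";
// Recover the original bytes, then decode them as ISO-8859-2
String repaired = new String(test.getBytes("ISO-8859-1"), "ISO-8859-2");
System.out.println(repaired);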

The output will be ąęłóżĄĘŁŻŹ, which are Polish characters.

As a quick fix, have your PHP script send the HTTP header Content-Type: text/html; charset=ISO-8859-2.

But you should consider switching to UTF-8 encoded output anyway.


As someone has already stated, no charset is specified for the response. Forcing the response document to be viewed as ISO-8859-2 (typically used in central Europe) results in legitimate Polish characters being displayed, so I assume this is the encoding actually being used. Since no encoding has been specified, ISO-8859-1 will be assumed, as this is the default.

The response needs to include the header Content-Type: text/html; charset=ISO-8859-2 for the bytes to be interpreted correctly. A client can then use this charset when decoding the response stream.
