I am quite new to this, and this might be very easy for most people, but I have been struggling with it for days.
I'm writing a web crawler in Perl that extracts certain information using LWP and some simple regular expressions.
This information is saved in a MySQL database, which will be used on an Android device. However, when I tested the web crawler, I realized that some of the information is in Chinese (典華) written as HTML numeric character references (&#20856; &#33775;), and some of it uses iso-8859-1 encoding (Zhífú). I solved the Chinese part using the Perl HTML::Entities module, and the result displays correctly when I set my console to utf8. However, the other letters (Zhífú) can only be displayed in iso-8859-1. If I try to display them in utf8, they become Zh�f�.
My questions are:
- How can I determine which encoding a page uses, and how can I display each one correctly?
- Can I store the text in MySQL directly, or should I process it first? (Correct me if I am wrong, but my understanding is that MySQL uses utf8 as its default encoding.)
- Would this cause any problems when I display it on an Android device?
Thank you very much.
You wrote: "However, the other letters (Zhífú) can only be displayed in iso-8859-1. If I try to display it in utf8, it will become Zh�f�."
That's completely false. You can display "Zhífú" in both iso-8859-1 and UTF-8 terminals/applications/whatever. Indeed, seeing "Zhífú" here is proof that it can be displayed in UTF-8, since this is a UTF-8 web page. If you're getting "Zh�f�", it's because you didn't encode the string as UTF-8 before giving it to the terminal/application/whatever that expects UTF-8.
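As a minimal sketch of that point, the following encodes the same decoded string both ways with the core Encode module and prints the resulting bytes. The helper name hex_bytes is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Decoded text: a string of Unicode code points, not bytes ("Zhífú").
my $text = "Zh\x{ED}f\x{FA}";

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# The same text, encoded for a UTF-8 consumer and for an iso-8859-1 consumer.
my $utf8_bytes   = encode('UTF-8',      $text);
my $latin1_bytes = encode('iso-8859-1', $text);

print "UTF-8:      ", hex_bytes($utf8_bytes),   "\n";   # 5a 68 c3 ad 66 c3 ba
print "iso-8859-1: ", hex_bytes($latin1_bytes), "\n";   # 5a 68 ed 66 fa
```

Sending the second byte sequence to something that expects the first is what produces "Zh�f�": the lone bytes ed and fa are not valid UTF-8.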
Anyway, on to the question. I'm assuming that you're storing text, not HTML.
Decode every input! Encode every output! Then no problem.
    From the web
    5a 68 c3 ad 66 c3 ba
         |
       decode       Done for you by ->decoded_content (LWP::UA)
         |          or by ->content (WWW::Mech)
         v
    Decoded text    Manipulate as desired
    Zhífú
         |
       encode       Done for you by DBI
         |
         v
    Database
    5a 68 c3 ad 66 c3 ba
In fact, the decoding should already be done for you by ->decoded_content, and the encoding should already be done for you by DBI, so I don't see why you're having trouble with this.
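Here is a rough sketch of the input side. It assumes the page declares its charset in the Content-Type header. To keep it runnable without a network, it builds an HTTP::Response by hand instead of calling $ua->get($url), and the page content is invented for the example.

```perl
use strict;
use warnings;
use HTTP::Response;
use HTML::Entities qw(decode_entities);

# Stand-in for the response LWP would return: iso-8859-1 bytes plus
# numeric character references (hypothetical page content).
my $res = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=iso-8859-1' ],
    "Zh\xEDf\xFA &#20856;&#33775;",
);

# decoded_content picks up the declared charset and returns decoded text.
my $html = $res->decoded_content;

# decode_entities replaces the numeric references with the characters.
my $text = decode_entities($html);

# Print code points rather than the text itself, so no output layer is needed.
printf "%d characters: %s\n",
    length($text),
    join ' ', map { sprintf 'U+%04X', ord($_) } split //, $text;
```

After these two steps, the iso-8859-1 letters and the Chinese characters live in the same decoded string, so you no longer need to care which encoding each page used.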
Same thing when you read from the database and output to the screen/whatever.
    Database
    5a 68 c3 ad 66 c3 ba
         |
       decode       Done for you by DBI if you use
         |          the ..._utf8 flag for your driver
         v
    Decoded text    Manipulate as desired
    Zhífú
         |
       encode       use open ':std', ':locale';
         |
         v
    Screen
    5a 68 c3 ad 66 c3 ba
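The output side can be sketched the same way. This example assumes the text has already been decoded on the way in (with DBD::mysql, the flag in question is, as far as I know, the mysql_enable_utf8 connection attribute). An in-memory file stands in for the screen so the bytes can be inspected.

```perl
use strict;
use warnings;

# Decoded text, as it would come back from the database ("Zhífú").
my $text = "Zh\x{ED}f\x{FA}";

# For a real terminal you would put an encoding layer on STDOUT instead, e.g.
#     binmode(STDOUT, ':encoding(UTF-8)');
# Here an in-memory file plays the role of the screen.
my $screen = '';
open(my $out, '>:encoding(UTF-8)', \$screen) or die "open failed: $!";
print {$out} $text;
close($out);

# Show the bytes that were actually written.
print join(' ', map { sprintf '%02x', ord($_) } split //, $screen), "\n";
```

The layer does the encoding at the last possible moment, which is why the rest of the program can work with decoded text only.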