I'm a web developer mostly working in Ruby (and Rails) and C#.
I'm currently reading "The Ruby Programming Language", the one with input from Matz and drawings by "_Why the lucky stiff", to sharpen my knowledge on how Ruby really works.
The chapter on strings speaks a lot about encoding, multibyte characters etc., and I seem to remember Joel Spolsky blogging about how every developer should know x about encoding. But at what point do you really start to see the effects of that?
For instance, the original Rails screencast didn't open with a 20-minute intro on encoding, yet some developers say it's crucial knowledge.
So how much do you need to know and when?
Back in my day, we didn't care ever. Everything was text. Then along came Microsoft with their ASCII extensions, and the next thing we knew everything went all to heck. :-) HEY YOU MICROSOFT, GET OFF MY LAWN!
Unfortunately, in today's internet and web world, it's important to consider encoding from the first line of code or text content that is created.
When your site is generating the output, you have an advantage and can make sure that all your source and text and templates use UTF-8 encoding.
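As a rough illustration of keeping your own output honest, here is a minimal Ruby sketch that refuses to write anything that isn't valid UTF-8. The helper names `utf8_clean?` and `write_utf8` are made up for this example, and it assumes Ruby 1.9 or later, where strings carry an encoding.

```ruby
# Hypothetical helper: true only if the string is tagged as UTF-8
# and its bytes really are well-formed UTF-8.
def utf8_clean?(text)
  text.encoding == Encoding::UTF_8 && text.valid_encoding?
end

# Hypothetical helper: write text to a file, stating the encoding explicitly
# instead of relying on whatever the process default happens to be.
def write_utf8(path, text)
  raise ArgumentError, "refusing to write non-UTF-8 text" unless utf8_clean?(text)
  File.open(path, "w:UTF-8") { |file| file.write(text) }
end
```

The point of the guard is to fail loudly at the edge of your own system, rather than discover mangled characters in a browser later.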
If you are ingesting other people's content via parsing or scraping, your task gets a lot harder, because web servers like to lie about what they're sending you, the HTML pages like to lie, and, hard as it is to believe, even the XML pages will lie, though they're not supposed to. Because of that, your code has to be very defensive and ready to transcode multibyte text when you sense characters in a "foreign" character set. You might have to jump through a few hoops to convert back to your chosen encoding, whether that's UTF-8, which is my recommendation, or ISO-8859-1, or CP1252, or whatever it is. Make sure you're using rescue blocks, and test, test, test if you want to make your system hardened and bullet-proof.
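To make the "defensive" part concrete, here is one way it could look in Ruby. This is only a sketch under my own assumptions: the helper name `to_utf8`, the fallback order, and the choice of "?" as a replacement character are illustrative, not a standard recipe. It treats the declared encoding as a hint, checks it, and falls back when the hint turns out to be a lie.

```ruby
# Assumed fallback order: try UTF-8 first, then Windows-1252 (CP1252),
# which is a common real encoding behind mislabeled pages.
FALLBACK_ENCODINGS = ["UTF-8", "Windows-1252"].freeze

# Hypothetical helper: normalize bytes of uncertain origin to UTF-8.
# `declared` is whatever the server or document claimed; it may be wrong or nil.
def to_utf8(raw, declared = nil)
  candidates = ([declared] + FALLBACK_ENCODINGS).compact.uniq
  candidates.each do |name|
    begin
      text = raw.dup.force_encoding(name)  # raises ArgumentError on unknown names
      next unless text.valid_encoding?     # the claim did not match the bytes
      return text.encode("UTF-8")          # may raise if a byte has no mapping
    rescue ArgumentError, EncodingError
      next                                 # bad claim, try the next candidate
    end
  end
  # Last resort: keep the ASCII bytes and replace everything else with "?".
  raw.dup.force_encoding("BINARY")
     .encode("UTF-8", invalid: :replace, undef: :replace, replace: "?")
end
```

For example, `to_utf8(body, "utf-8")` still gives usable text when `body` is actually CP1252, because the UTF-8 claim fails validation and the next candidate is tried. Feed it the ugliest pages you can find before trusting it.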
That's my recommendation, based on some hard-won knowledge writing many scrapers in Perl and Ruby.