Working with encoding in ruby

Developer https://www.devze.com 2023-03-20 16:05 Source: the web


I'm making a simple Sinatra-based web app to display Chinese text, and I know enough about encoding to know that I can potentially lose information if I don't do it properly, but I feel a bit lost in the space of encoding. It's also the first time I'm working with non-English text in Ruby.

Are there any areas in particular that I have to be careful about within my programming stack? Also are there extra libraries I should know about to ensure I encode/decode properly?

My programming stack currently consists of:

ruby 1.9.2
sinatra 1.2.6
possibly postgresql
textmate editor (currently set to UTF8 encoding) - do I need to change my encoding here?

Ruby 1.9 works pretty well with UTF-8 encoding, so you shouldn't have any problems with it.

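But any source file that itself contains non-ASCII characters (for example Chinese string literals) needs the magic comment # encoding: UTF-8 on its first line, because Ruby 1.9 assumes US-ASCII source by default.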
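A minimal sketch of what the magic comment affects, assuming the file is saved as UTF-8 (the variable names are only for illustration). On Ruby 1.9 a non-ASCII literal like the one below is rejected without the comment; on Ruby 2.0 and later UTF-8 is already the default source encoding.

```ruby
# encoding: UTF-8
# The magic comment has to be the first line of the file
# (or the second, directly after a shebang line).

text = "中文"                     # literal containing non-ASCII characters

source_encoding = __ENCODING__   # encoding Ruby assumes for literals in this file
puts source_encoding             # UTF-8
puts text.encoding               # UTF-8
puts text.length                 # 2 (characters)
puts text.bytesize               # 6 (bytes, three per character in UTF-8)
```

The difference between length and bytesize is a quick sanity check that Ruby is treating the string as characters rather than raw bytes.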

You can read this http://blog.grayproductions.net/articles/understanding_m17n to understand encoding in Ruby.

The best post I've read on the ruby charset implementation was written by one of the guys behind most of the code involved:

http://yokolet.blogspot.com/2009/07/design-and-implementation-of-ruby-m17n.html

I ran into it while looking into ICU support in ruby:

http://redmine.ruby-lang.org/issues/2034

I've been screen scraping Chinese characters for a few months at http://sinograms.com. I'm using Rails 3, Ruby 1.9.2, and Heroku.

I found no encoding issues; however, I'm only accepting Unicode characters. UTF-8 is an encoding of Unicode that is backwards compatible with ASCII, so if you stick with it you should be fine.
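To make the Unicode versus UTF-8 distinction concrete, here is a small sketch. The GBK part is an assumption for illustration only (for example, a scraped page served in a legacy Chinese encoding); nothing in this thread says which encodings the scraped sites actually use. The \u escapes keep the example independent of the file's own source encoding.

```ruby
ascii = "ruby"
han   = "\u4E2D"   # code point U+4E2D, the character for "middle"

# ASCII text is byte-for-byte identical in UTF-8.
p ascii.encode("UTF-8").bytes.to_a == ascii.encode("US-ASCII").bytes.to_a  # true

# One Unicode code point, different bytes depending on the encoding.
p han.encode("UTF-8").bytes.to_a     # [228, 184, 173]
p han.encode("UTF-16BE").bytes.to_a  # [78, 45]
p han.encode("GBK").bytes.to_a       # [214, 208]

# Bytes as they might arrive from outside, with no encoding attached yet.
raw = han.encode("GBK").force_encoding("BINARY")

# Labelling them with the wrong encoding does not convert anything.
mislabelled = raw.dup.force_encoding("UTF-8")
p mislabelled.valid_encoding?        # false

# Label with the real encoding first, then transcode to UTF-8.
utf8 = raw.dup.force_encoding("GBK").encode("UTF-8")
p utf8 == han                        # true
```

force_encoding only changes the label on the bytes, while encode actually converts them, which is where information gets lost if the label was wrong.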

This is the best resource I found for ruby and encoding:

http://blog.grayproductions.net/articles/ruby_19s_string

You can check whether a character is a Chinese character (a CJK unified ideograph) by testing its Unicode code point against the main CJK ranges with the following script:

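 # Code point ranges treated as Chinese characters.
 CJK_RANGES = [
   0x4E00..0x9FFF,    # CJK Unified Ideographs
   0x3400..0x4DBF,    # CJK Unified Ideographs Extension A
   0x20000..0x2A6DF,  # CJK Unified Ideographs Extension B
   0x2A700..0x2B73F   # CJK Unified Ideographs Extension C
 ]

 # Returns true if the first character of the UTF-8 string +char+
 # falls in one of the ranges above, false otherwise.
 def check(char)
   codepoint = char.unpack('U*').first
   CJK_RANGES.any? { |range| range.include?(codepoint) }
 end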
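As an alternative to listing code point ranges by hand, Ruby 1.9's regular expression engine understands Unicode script properties. The sketch below swaps in \p{Han} for the range check above; the helper names han? and han_chars are hypothetical. Note that the Han script is broader than the four ranges in the script above (it also covers later extensions and compatibility ideographs), so the two checks are not exactly equivalent.

```ruby
# The /u flag forces the pattern to UTF-8 regardless of the file's source encoding.

# True if +char+ is exactly one Han character.
def han?(char)
  !!(char =~ /\A\p{Han}\z/u)
end

# All Han characters found in +text+, in order.
def han_chars(text)
  text.scan(/\p{Han}/u)
end

sample = "Ruby \u4E2D\u6587 1.9"  # Latin text with two Chinese characters in the middle

puts han?("\u4E2D")               # true
puts han?("a")                    # false
puts han_chars(sample).length     # 2
```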