How do I escape a Unicode string with Ruby?

Developer https://www.devze.com 2023-02-22 03:16 Source: Internet
I need to encode/convert a Unicode string to its escaped form, with backslashes. Anybody know how?


In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.

>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"

>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""

>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil

In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:

>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""
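On Ruby 2.0 and later the same byte-level view can probably be had more directly. This is only a sketch, assuming a UTF-8 source string; the variable name simply mirrors the example above:

```ruby
# Assumes Ruby 2.0+, where String#b returns a binary (ASCII-8BIT) copy.
multi_byte_str = "hello\u0639!"        # U+0639 is the two bytes D8 B9 in UTF-8
escaped = multi_byte_str.b.inspect     # bytes >= 0x80 come out as \xNN escapes
puts escaped
```

Because the copy is tagged as binary, inspect has no multi-byte characters to show and falls back to escaping each high byte.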

In both Ruby 1.8 and 1.9, if you are instead interested in the (escaped) Unicode code points, you could do this (though it escapes printable ASCII characters too):

>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"
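If escaping the printable ASCII characters is unwanted, something along these lines might do on Ruby 1.9 or later. It is a sketch, not a tested library routine: escape_non_ascii is a hypothetical helper name, and it assumes the input is a valid UTF-8 string.

```ruby
# Hypothetical helper: escape only non-ASCII code points, leave ASCII as-is.
def escape_non_ascii(str)
  str.each_char.map do |ch|
    cp = ch.ord
    if cp < 0x80
      ch                         # plain ASCII passes through untouched
    elsif cp <= 0xFFFF
      format('\\u%04x', cp)      # four hex digits, e.g. \u0639
    else
      format('\\u{%x}', cp)      # braced form for code points above U+FFFF
    end
  end.join
end

puts escape_non_ascii("hello\u0639!")
puts escape_non_ascii("\u{1F61E}")
```

The braced branch matters because a plain "\u" escape takes exactly four hex digits, so padding a five-digit code point would not round-trip as a Ruby literal.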


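To use a Unicode character in a Ruby string literal (Ruby 1.9 and later), use the "\uXXXX" escape, where XXXX is the character's Unicode code point written as four hexadecimal digits (the code point itself, not a UTF-16 code unit). Code points above U+FFFF need the braced form, e.g. "\u{1F61E}". See http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/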
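A quick way to check what such an escape produces, assuming Ruby 1.9 or later (the variable names are just for illustration):

```ruby
micro = "\u00b5"      # MICRO SIGN, U+00B5
face  = "\u{1F61E}"   # U+1F61E lies above U+FFFF, so it needs the braced form

puts micro.ord.to_s(16)   # code point in hex
puts face.ord.to_s(16)
puts face.bytes.length    # number of UTF-8 bytes used to store it
```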


If you have Rails kicking around you can use the JSON encoder for this:

require 'active_support'
x = ActiveSupport::JSON.encode('µ')
# x is now "\u00b5"

The usual non-Rails JSON encoder doesn't "\u"-ify Unicode by default.
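That said, the json library that ships with Ruby does appear to offer an opt-in for this through the ascii_only generator option. A minimal sketch, assuming a reasonably recent json version:

```ruby
require 'json'

# ascii_only: true asks the generator to escape every non-ASCII character.
# The value is wrapped in an array because older json versions reject a bare string.
encoded = JSON.generate(["\u00b5"], ascii_only: true)
puts encoded
```

The result is JSON text, so it carries the surrounding brackets and double quotes along with the escape.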


There are two components to your question as I understand it: Finding the numeric value of a character, and expressing such values as escape sequences in Ruby. Further, the former depends on what your starting point is.

Finding the value:

Method 1a: from Ruby with String#dump:

If you already have the character in a Ruby String object (or can easily get it into one), this may be as simple as displaying the string in the repl (depending on certain settings in your Ruby environment). If not, you can call the #dump method on it. For example, with a file called unicode.txt that contains some UTF-8 encoded data in it – say, the currency symbols €£¥$ (plus a trailing newline) – running the following code (executed either in irb or as a script):

s = File.read("unicode.txt", :encoding => "utf-8") # this may be enough, from irb
puts s.dump # this will definitely do it.

... should print out:

"\u20AC\u00A3\u00A5$\n"

Thus you can see that € is U+20AC, £ is U+00A3, and ¥ is U+00A5. ($ is not converted, since it's straight ASCII, though it's technically U+0024. The code below could be modified to give that information, if you actually need it. Or just add leading zeroes to the hex values from an ASCII table, or reference one that already does so.)

(Note: a previous answer suggested using #inspect instead of #dump. That sometimes works, but not always. For example, running ruby -E UTF-8 -e 'puts "\u{1F61E}".inspect' prints an unhappy face for me, rather than an escape sequence. Changing inspect to dump, though, gets me the escape sequence back.)
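To make the difference concrete, here is a small sketch of #dump on that same character. The round trip through String#undump is an extra and assumes Ruby 2.5 or later:

```ruby
face = "\u{1F61E}"

dumped = face.dump            # always ASCII-only, whatever the locale settings
puts dumped

# String#undump (assumed available: Ruby 2.5+) reverses the escaping.
restored = dumped.undump
puts restored == face
```

What #inspect prints for the same string depends on the encodings Ruby is running with, which is why it is left out here.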

Method 1b: with Ruby using String#encode and rescue:

Now, if you're trying this with a larger input file, the output may prove unwieldy: it may be hard to even find the escape sequences in mostly-ASCII text, or to tell which sequences go with which characters. In such a case, one might replace the second line of the snippet above with the following:

encodings = {} # hash to store mappings in
s.each_char do |c| # loop through each character
  begin
    c.encode("ASCII") # try to encode it to ASCII
  rescue Encoding::UndefinedConversionError => e # but if that fails
    encodings[c] = e.error_char.dump # capture a dump, mapped to the source character
  end
end
# And then print out all the captured non-ASCII characters:
encodings.each do |char, dumped|
  puts "#{char} encodes to #{dumped}."
end

With the same input as above, this would then print:

€ encodes to "\u20AC".
£ encodes to "\u00A3".
¥ encodes to "\u00A5".
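For comparison, a variant that skips the encode-and-rescue step and filters on String#ascii_only? instead should give the same mapping. This is a different technique from the one above, sketched here with the sample text inlined rather than read from unicode.txt:

```ruby
# Sample data inlined for the sketch; in practice s would come from File.read.
s = "\u20AC\u00A3\u00A5$\n"

# Keep each distinct non-ASCII character and pair it with its dumped form.
non_ascii = s.each_char.reject(&:ascii_only?).uniq
mappings = Hash[non_ascii.map { |c| [c, c.dump] }]

mappings.each do |char, dumped|
  puts "#{char} encodes to #{dumped}."
end
```

Avoiding exceptions for control flow keeps the loop simpler, though the rescue-based version has the advantage of working with target encodings other than ASCII.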

Note that it's possible for this to be a bit misleading. If there are combining characters in the input, the output will print each component separately. For example, for input of
