Adding backslash to fix character encoding in Ruby string

Developer https://www.devze.com 2023-03-23 06:36 Source: Internet
I'm sure this is very easy but I'm getting tied in a knot with all these backslashes.

I have some data that I'm scraping (politely) from a website. Occasionally a sentence comes to me looking something like this:

u00a362 000? you must be joking

Which should of course be '£62 000? you must be joking'. A short test in irb deciphered it.

ruby-1.9.2-p180 :001 > string = "u00a3"
  => "u00a3" 
ruby-1.9.2-p180 :002 > string = "\u00a3"
  => "£" 

Of course: add a backslash and it will be decoded. I created the following with the help of this question:

puts str.gsub('u00', '\\u00') 

which resulted in \u00a3 being output. This is all well and good, but I want it to be £ in the string itself. Just putsing it isn't enough.
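A small sketch of why the backslash trick prints \u00a3 rather than £, assuming Ruby 1.9 or later: the \u escape is decoded by the parser only when it appears in a double-quoted literal in the source. A backslash added at runtime is just another character. The variable names here are illustrative only.

```ruby
# The parser decodes the escape in a double-quoted literal: one character.
literal = "\u00a3"

# Built at runtime: a real backslash followed by the text "u00a3".
replaced = "u00a3".gsub('u00', '\\u00')

puts literal.length       # 1
puts replaced.length      # 6
puts literal == replaced  # false
```

So the string has to be decoded explicitly, by turning the hex digits into a code point, rather than by re-adding the backslash.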

It's no good doing gsub('u00a3', '£') as there will doubtless be other characters I'm missing.

Thanks for any help.


Try the Iconv library for converting the incoming string. You might also take a look at the stringex gem. It has methods to "go the other way", but it may provide the mappings you're looking for. That said, if you've got bad encoding, it can be impossible to get it right.


Warning, the following is not really pretty.

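str = "u00a362 000? you must be joking"
# Wrap each u00XX sequence in markers, then split so every sequence becomes its own element.
split_unicode = str.gsub(/(u00[0-9a-f]{2})/, "split_here\\1split_here").split(/split_here/)
final = split_unicode.map do |elem|
  if elem =~ /^u00/
    # Parse the two hex digits as a code point and pack it into a UTF-8 character.
    [("0x" + elem.gsub(/u00/, '')).hex].pack("U*")
  else
    elem
  end
end
puts final.join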

So the idea here is to find u00xx values and parse their hex digits into integers (the code points). From there, we can use the pack method to output the right Unicode characters.
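The same idea can probably be written without the split markers by using the block form of gsub. This is only a sketch: decode_bare_escapes is a hypothetical helper name, not something from the answer above, and it assumes the input only ever contains u00XX sequences with two hex digits, which could also match ordinary text by accident.

```ruby
# Hypothetical helper: decode bare "u00XX" sequences in a string.
def decode_bare_escapes(str)
  str.gsub(/u00([0-9a-fA-F]{2})/) do
    # $1 holds the two hex digits; parse them as a code point
    # and pack that code point into a UTF-8 character.
    [$1.hex].pack("U")
  end
end

puts decode_bare_escapes("u00a362 000? you must be joking")
```

Because gsub returns a new string, the decoded text can be assigned to a variable and used later rather than only printed.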

It can also be crunched into a horrible one-liner!

puts str.gsub(/(u00[0-9a-f]{2})/, "split_here\\1split_here").split(/split_here/).map { |elem| elem =~ /^u00/ ? [("0x" + elem.gsub(/u00/, '')).hex].pack("U*") : elem }.join

There might be a better solution (I hope!) but this one works.
