Hpricot encodings in Ruby 1.9

Developer https://www.devze.com 2023-02-05 01:31 Source: Internet
I have a Rails 3 application running on Ruby 1.9, and I'm having some pain making encodings work.

My task was to open a remote HTML page and parse some information from it. All my code and my database are in UTF-8; I'm using the # coding: UTF-8 magic comment, the MySQL fix, and so on.

The page I open is in charset ISO-8859-1, and when my parser finds strange characters it complains that they are not valid UTF-8.

I tried calling .force_encoding("UTF-8") on every string I parsed, but the error persists. When I try to convert the whole page, I get this:

a = open("someurl")
b = a.read.encode("UTF-8")
Encoding::UndefinedConversionError: "\xE9" from ASCII-8BIT to UTF-8
    from (irb):7:in `encode'
    from (irb):7
    from /Users/tscolari/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.0/lib/rails/commands/console.rb:44:in `start'
    from /Users/tscolari/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.0/lib/rails/commands/console.rb:8:in `start'
    from /Users/tscolari/.rvm/gems/ruby-1.9.2-p0/gems/railties-3.0.0/lib/rails/commands.rb:23:in `<top (required)>'
    from script/rails:6:in `require'
    from script/rails:6:in `<main>'

How can I fix this? It seems things already went wrong when the ISO-8859-1 page was "converted" to ASCII.
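
One reading of the trace above: the bytes come back tagged ASCII-8BIT (binary), and Ruby has no conversion from binary to UTF-8 for bytes above 0x7F. Below is a minimal sketch of relabeling the bytes before transcoding, assuming the page really is ISO-8859-1. The sample bytes are a stand-in for the remote page and are not from the original post.

```ruby
# Stand-in for a.read: the bytes of "caf\xE9" (ISO-8859-1), tagged as binary.
raw = [0x63, 0x61, 0x66, 0xE9].pack("C*")
puts raw.encoding

# Transcoding straight from binary fails on the 0xE9 byte.
direct_failed = false
begin
  raw.encode("UTF-8")
rescue Encoding::UndefinedConversionError => e
  direct_failed = true
  puts "direct encode failed: #{e.message}"
end

# Relabel the bytes with their real charset first (no bytes change),
# then transcode them to UTF-8 (bytes do change).
utf8 = raw.dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts utf8.encoding
puts utf8.valid_encoding?
```

The two steps can also be written as raw.encode("UTF-8", "ISO-8859-1"), which names the source encoding explicitly instead of relabeling first.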

UPDATE

I tried opening the URL using 'r:iso-8859-1:utf-8', but apparently my problem is now with Hpricot, which I use for parsing.

>a = open(b, 'r:iso-8859-1:utf-8')
>a.read.encoding
 => #<Encoding:UTF-8>
> Hpricot(a).inner_html.encoding
 => #<Encoding:ASCII-8BIT> 

And all the errors come back... This is probably an Hpricot issue, but if anyone knows a fix, please share.
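
If the parser's output holds correct UTF-8 bytes that are merely tagged ASCII-8BIT, relabeling the result may be enough. This is an assumption about Hpricot's output, not something the post confirms. The sketch below uses a hypothetical helper, relabel_as_utf8, on plain strings rather than on a real Hpricot document: it tries the UTF-8 label first and falls back to transcoding from ISO-8859-1 when the bytes are not valid UTF-8.

```ruby
# Hypothetical helper (not part of Hpricot): return a UTF-8 string from
# bytes whose encoding tag cannot be trusted.
def relabel_as_utf8(str, fallback = "ISO-8859-1")
  candidate = str.dup.force_encoding("UTF-8")
  return candidate if candidate.valid_encoding?

  # Not valid UTF-8: treat the bytes as the fallback charset and transcode.
  str.dup.force_encoding(fallback).encode("UTF-8")
end

# Stand-in for parser output: UTF-8 bytes tagged as binary.
mislabeled = "caf\u00E9".dup.force_encoding(Encoding::BINARY)
# Stand-in for untouched ISO-8859-1 bytes tagged as binary.
latin1 = [0x63, 0x61, 0x66, 0xE9].pack("C*")

fixed_a = relabel_as_utf8(mislabeled)
fixed_b = relabel_as_utf8(latin1)
puts fixed_a.encoding
puts fixed_b.encoding
```

The validity check is a heuristic: some ISO-8859-1 byte sequences also happen to be valid UTF-8, so knowing the page's declared charset is more reliable when it is available.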


Hpricot - UTF-8 issues invalid byte sequence in UTF-8 (ArgumentError)

require 'hpricot'
require 'open-uri'

doc = open('http://www.amazon.co.jp/') { |f| Hpricot(f.read) }

puts doc.to_html

open('http://www.amazon.co.jp/') { |f| Hpricot(f.read.encode("UTF-8")) }


a = open("someurl", "r:iso-8859-1:utf-8")
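
The mode string above names an external encoding (how the bytes are stored) and an internal one (what reads should be transcoded to). Here is a small sketch of that behaviour on a local temporary file, so it runs without a network. The file name prefix and sample bytes are made up for illustration, and the sketch does not show how open-uri treats the same mode string for HTTP URLs.

```ruby
require "tempfile"

text = nil
Tempfile.create("latin1_page") do |f|
  # Write the ISO-8859-1 bytes for "caf\xE9" without any transcoding.
  f.binmode
  f.write([0x63, 0x61, 0x66, 0xE9].pack("C*"))
  f.flush

  # external encoding = iso-8859-1, internal encoding = utf-8:
  # reads are transcoded to UTF-8 on the way in.
  text = File.open(f.path, "r:iso-8859-1:utf-8") { |io| io.read }
end

puts text.encoding
puts text.valid_encoding?
puts text.bytesize
```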

See this other SO question for more details...
