
HTML encode UTF-8 string gets mangled into latin1

Developer https://www.devze.com 2022-12-25 13:32 Source: Internet

I'm parsing my nginx logs, and I want to discover some details from the HTTP_REFERER string, for example, the query string used to find the web site. One user typed in "México" which gets encoded in the log as "query=M%E9xico".

Passing this through Rack::Utils.parse_query('query=M%E9xico') gives you a hash: {"query" => "M?xico"}.

When you try to stuff "M?xico" into Postgres (but not the more forgiving SQLite), it pukes because the string isn't proper UTF-8. Looking at http://rack.rubyforge.org/doc/Rack/Utils.html#M000324, unescape is packing a hex string.
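To see why the database rejects the value, it helps to look at the raw bytes. The snippet below is only a sketch: it uses the standard library's URI.decode_www_form_component as a stand-in for Rack's unescape (an assumption, so that it runs without Rack installed), and the variable names are made up for illustration.

```ruby
require 'uri'

# Decode the percent-escapes. The result is tagged as UTF-8 by default,
# but the bytes themselves are whatever the browser sent.
raw = URI.decode_www_form_component('M%E9xico')

# Hex dump of the decoded bytes: 0xE9 is "é" in ISO-8859-1 (latin1).
# In UTF-8 the same character would be the two bytes 0xC3 0xA9.
hex_dump = raw.bytes.map { |b| format('%02X', b) }.join(' ')

# A lone 0xE9 followed by "x" is not a legal UTF-8 sequence.
is_valid = raw.valid_encoding?

puts hex_dump   # 4D E9 78 69 63 6F
puts is_valid   # false
```

So the "?" is a single latin1 byte sitting in a string that claims to be UTF-8, which is exactly what a strict database refuses to store.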

How can I convert the string back to UTF-8, or can I get parse_query to return UTF-8 in the first place?


unescape will decode the URL encoding:

Rack::Utils.parse_query(URI.unescape('query=M%E9xico'))

Or

Rack::Utils.parse_query(Rack::Utils.unescape('query=M%E9xico'))


The problem here happens well before you get ahold of the data. You need to fix the problem upstream if you can. If you can't, my suggestion is to find out the encoding and convert it on input, using conversion libraries in Ruby (iconv, for example).
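A minimal sketch of converting on input, assuming the offending values were sent as ISO-8859-1 (the log itself cannot tell you this, so treat it as a guess to verify against your traffic). It uses String#encode rather than iconv, since iconv is no longer bundled with current Ruby versions, and URI.decode_www_form_component from the standard library instead of Rack. The helper name decode_to_utf8 is hypothetical.

```ruby
require 'uri'

# Hypothetical helper: decode a percent-encoded value into valid UTF-8.
# Assumption: any value that is not already valid UTF-8 was sent as
# ISO-8859-1 (or whatever fallback encoding the caller passes in).
def decode_to_utf8(escaped, fallback = Encoding::ISO_8859_1)
  decoded = URI.decode_www_form_component(escaped) # tagged as UTF-8
  return decoded if decoded.valid_encoding?

  # Relabel the raw bytes as the fallback encoding, then transcode.
  decoded.force_encoding(fallback).encode(Encoding::UTF_8)
end

latin1_value = decode_to_utf8('M%E9xico')     # browser sent latin1
utf8_value   = decode_to_utf8('M%C3%A9xico')  # browser sent UTF-8

puts latin1_value.valid_encoding?  # true
puts latin1_value == utf8_value    # true
```

Checking valid_encoding? first means values that are already UTF-8 pass through untouched, and only the broken ones get reinterpreted. The weakness is that the fallback is a guess: a referrer sent in some other single-byte encoding would convert without error but to the wrong characters.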

The problem is not in PostgreSQL, though.
