sanitizing content from open(url).read

Developer https://www.devze.com 2023-03-18 23:06 Source: Internet
I am using Ruby to open a URL and read its content. The content type of the file I am reading is 'text/plain'.

The issue is that this contains some characters which I want to escape. For example, one of the characters that is coming up in the plain text is "\240" which is ASCII for a hyphen.

I am curious how this is being generated, because I don't see a hyphen anywhere in the text. Yet it exists invisibly and "\240" shows up when I use puts to print the text in the console.

Second of all, how do I escape such instances of weird characters? Ideally, I want to escape all characters which are of the form "\[some number]". I am using

"\240".gsub(Regexp.new("\\\d+"),"")

but it doesn't seem to work.

Are there more traditional ways of sanitizing plain text content read from opening a URL?


You might want to check on the character set of the text that's getting returned. It could be UTF-8, which frequently has characters that high. Ruby 1.9 has great support for character sets and switching between them. I've used str.encode("US-ASCII", :invalid => :replace, :undef => :replace, :replace => "?") to force a string to standard ASCII, replacing any odd characters with a ?.
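
A minimal sketch of that approach, in case it helps. Note that "\240" is an octal escape for the single byte 0xA0, not a backslash followed by digits, which is why a regex looking for backslash-plus-digits finds nothing to match. In ISO-8859-1 that byte is a non-breaking space. The helper name to_ascii and the ISO-8859-1 default are assumptions for illustration; check the Content-Type header of the real response for the actual charset.

```ruby
# Hypothetical helper (not from the original answer): transcode text to US-ASCII,
# replacing anything that has no ASCII equivalent with "?".
# source_encoding is an assumption; use the charset the server actually reports.
def to_ascii(text, source_encoding = "ISO-8859-1")
  # dup leaves the caller's string untouched; force_encoding only relabels the bytes.
  labeled = text.dup.force_encoding(source_encoding)
  labeled.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "?")
end

sample = "10\xA0km"      # byte 0xA0 (octal 240) between "10" and "km"
puts to_ascii(sample)    # prints 10?km
```

Passing replace: "" instead of "?" would drop the characters rather than mark them.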


After having a play with this, I found the following regular expression which does the trick for me:

str.gsub(/[^\x00-\x7F]/,'')
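
One caveat with that regex: on Ruby 1.9 and later, gsub can raise an ArgumentError if the string is tagged as UTF-8 but contains invalid byte sequences (a lone "\240" byte is one). A rough sketch that sidesteps this by working on a binary copy of the string; strip_non_ascii is a hypothetical name, not part of any library.

```ruby
# Hypothetical helper: remove every byte outside the 7-bit ASCII range.
def strip_non_ascii(text)
  # String#b returns a copy tagged as binary (ASCII-8BIT), so each byte is
  # treated as one character and invalid sequences cannot raise.
  cleaned = text.b.gsub(/[^\x00-\x7F]/, "")
  # Only ASCII bytes remain, so relabeling the result as UTF-8 is safe.
  cleaned.force_encoding("UTF-8")
end

puts strip_non_ascii("10\xA0km")   # prints 10km
```

Because it works byte by byte, a multi-byte UTF-8 character is removed entirely, with no placeholder left behind.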