I am using Ruby to open a URL and read its content. The content type of the file I am reading is 'text/plain'.
The issue is that this contains some characters which I want to escape. For example, one of the characters that is coming up in the plain text is "\240" which is ASCII for a hyphen.
I am curious how this is being generated, because I don't see a hyphen anywhere in the text. Yet it exists invisibly and "\240" shows up when I use puts
to print the text in the console.
Second of all, how do I escape such instances of weird characters? Ideally, I want to escape all characters which are of the form "\[some number]". I am using
"\240".gsub(Regexp.new("\\\d+"),"")
but it doesn't seem to work.
Are there more traditional ways of sanitizing plain text content read from opening a URL?
You might want to check the character set of the text that's being returned. It could be UTF-8, which frequently contains bytes above the ASCII range. Ruby 1.9 has great support for character sets and switching between them. I've used
str.encode("US-ASCII", :invalid => :replace, :undef => :replace, :replace => "?")
to force a string to standard ASCII, replacing any odd characters with a ?.
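For what it's worth, "\240" is an octal escape for the single byte 0xA0, which in ISO-8859-1 is a non-breaking space. That would explain why nothing is visible in the text. Here is a minimal sketch of the encode approach, assuming the downloaded bytes really are ISO-8859-1 (check the response's charset before relying on that). The helper name to_ascii and the sample string are made up for illustration.

```ruby
# Hypothetical helper: label raw bytes with an ASSUMED source encoding,
# then transcode to US-ASCII, replacing anything unrepresentable with "?".
def to_ascii(bytes, source_encoding = "ISO-8859-1")
  labeled = bytes.dup.force_encoding(source_encoding)
  labeled.encode("US-ASCII", invalid: :replace, undef: :replace, replace: "?")
end

sample = "price:\xA0100".b   # \xA0 is the same byte as octal \240
puts to_ascii(sample)        # prints "price:?100"
```

If the real charset turns out to be something else, pass it as the second argument instead of the default.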
After having a play with this, I found the following regular expression which does the trick for me:
str.gsub(/[^\x00-\x7F]/,'')
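As for why the gsub in the question found nothing: "\240" in a double-quoted string is one byte, not a backslash followed by the digits 2, 4 and 0, so a pattern that looks for digit characters has nothing to match. A possible variant of the regular expression above, sketched under the assumption that the string is already valid UTF-8, turns non-breaking spaces into ordinary spaces before dropping the rest. The helper name strip_non_ascii is hypothetical.

```ruby
# Hypothetical helper: convert non-breaking spaces (U+00A0) to plain spaces,
# then delete every remaining character outside the 7-bit ASCII range.
# Assumes the input is a validly encoded UTF-8 string.
def strip_non_ascii(text)
  text.gsub("\u00A0", " ").gsub(/[^\x00-\x7F]/, "")
end

# Input contains a non-breaking space, an en dash and an accented e.
puts strip_non_ascii("10\u00A0km \u2013 caf\u00E9")  # prints "10 km  caf"
```

Replacing the non-breaking space first keeps words from running together, which plain deletion would cause.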