Given an HTML email, I'm using the following to strip down to just the text:
body = body.gsub(/\r\n?/, "\n")
body = body.gsub(/\n\n?/, "\n")
body = simple_format(body)
body = strip_tags(body)
But I'm now seeing that one tag gets past this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
Which outputs like so:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
Any ideas why?
I guess strip_tags, which looks like it's been deprecated, considers the doctype statement neither a tag nor a comment. You could strip it out separately:
string.gsub(/<!.*?$/,'')
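Note that the pattern above deletes everything from `<!` to the end of the line, so any markup that follows the doctype on the same line is lost too. A narrower sketch, assuming the doctype has no internal subset containing `>` (the helper name `strip_doctype` is made up for illustration):

```ruby
# Hypothetical helper: remove only the doctype declaration itself.
# [^>]* also matches newlines, so a doctype split across lines is handled.
def strip_doctype(html)
  html.sub(/<!DOCTYPE[^>]*>/i, "")
end

html = %(<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><p>Hi</p>)
puts strip_doctype(html)  # prints <p>Hi</p>
```

Run this before strip_tags so the remaining markup is still handled by the normal tag stripping.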
I ended up using Hpricot to extract the text, and it worked great.
I'd recommend using Nokogiri for your parsing needs. It's well supported, fast, very flexible, and the basis of many other HTML/XML gems. It has an Hpricot mode, though I'm not sure why anyone would need it, since Nokogiri's own syntax is more full-featured.
In particular, to strip tags from HTML, I'd recommend looking into Loofah. It can whitelist tags and offers several levels of cleansing.