Rails - strip_tags - Not catching DOCTYPE?

开发者 https://www.devze.com 2023-02-19 07:47 Source: web
Given an HTML email, I'm using the following to strip down to just the text:

  body = body.gsub(/\r\n?/, "\n")
  body = body.gsub(/\n\n?/, "\n")
  body = simple_format(body)
  body = strip_tags(body)

But I'm now seeing that one tag gets past this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

Which outputs like so:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

Any ideas why?


I guess strip_tags, which looks like it's been deprecated, considers the doctype declaration neither a tag nor a comment. You could strip it out separately:

string.gsub(/<!.*?$/,'')
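Note that the pattern above removes everything from `<!` to the end of the line, so it would also drop any markup that shares a line with the doctype. A narrower sketch, assuming the declaration contains no `>` before its closing one (true of the HTML 4.01 doctype in the question, but not of a doctype with an internal subset), might look like this. `strip_doctype` and `DOCTYPE_PATTERN` are hypothetical names, not part of Rails:

```ruby
# Hypothetical helper (not part of Rails): removes a DOCTYPE declaration
# so the remaining string can be handed to strip_tags.
# Assumes the declaration has no ">" before the one that closes it.
DOCTYPE_PATTERN = /<!DOCTYPE[^>]*>/i

def strip_doctype(html)
  html.gsub(DOCTYPE_PATTERN, "")
end

html = %(<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><p>Hello</p>)
puts strip_doctype(html)
# prints <p>Hello</p>
```

Calling `strip_doctype(body)` before `strip_tags(body)` should leave the rest of the pipeline in the question unchanged.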


I ended up using Hpricot to convert the HTML to text, and it worked great.


I'd recommend using Nokogiri for your parsing needs. It's very well supported, plenty fast, very flexible, and the basis of a lot of other HTML/XML type gems. It has a Hpricot mode, though I'm not sure why anyone would need that, as Nokogiri's own syntax is more full-featured.

In particular, to strip tags from HTML, I'd recommend looking into Loofah. It can whitelist tags, and has several layers of cleansing it can do.
