Rails HTML Sanitizing

Developer https://www.devze.com 2023-03-15 20:42 Source: Internet
I am trying to sanitize an HTML file and it isn't working correctly. I want the output to be entirely plain text, except for paragraph and line break tags. Here is my sanitization code (the dots stand for other code in my class that isn't relevant to the problem):

.
.
.
include ActionView::Helpers::SanitizeHelper
.
.
.
def remove_html(html_content)
    sanitized_content_1 = sanitize(html_content, :tags => %w(p br))
    sanitized_content_2 = Nokogiri::HTML(sanitized_content_1)
    sanitized_content_2.css("style","script").remove
    return sanitized_content_2
end

It isn't working correctly. Here is the original HTML file from which the function is reading its input, and here is the "sanitized" code it is returning. It is leaving in the bodies of CSS and JavaScript blocks, as well as HTML comments. It might be leaving in other things that I have not noticed. How can I thoroughly remove all CSS, HTML, and JavaScript other than paragraph and line break tags?


I don't think you want to sanitize it. Sanitizing strips HTML, leaving the text behind, except for the HTML elements you deem OK. It is intended for allowing a user-input field to contain some markup.

Instead, you probably want to parse it. For example, the following will print the text content of the <p> tags in a given html string.

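require 'nokogiri'

# html is the HTML string you read from the file
doc = Nokogiri::HTML.parse(html)

# Print the text content of every <p> element
doc.search('p').each do |el|
  puts el.text
end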


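You can also escape the HTML using the CGI module. Note that this does not remove any markup; it converts the tags to entities so they are displayed as plain text.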

require 'cgi'
str = "<html><head><title>Hello</title></head><body></body></html>"
p str
p CGI::escapeHTML str

Running this script gives the following result.

$ ruby sanitize.rb
"<html><head><title>Hello</title></head><body></body></html>"
"&lt;html&gt;&lt;head&gt;&lt;title&gt;Hello&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;/body&gt;&lt;/html&gt;"
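
To make the difference from stripping concrete, here is a small sketch using only Ruby's standard cgi library. The input string is made up for illustration. It shows that escaping neutralises the tags but keeps the script body in the output as visible text, and that the escaping can be reversed.

```ruby
require 'cgi'

# Made-up input containing a script block.
html = '<p>Hi</p><script>alert("x")</script>'

escaped = CGI.escapeHTML(html)
puts escaped

# The JavaScript source is still present, just no longer executable markup.
puts escaped.include?('alert')

# Escaping is reversible, so nothing was actually removed.
puts CGI.unescapeHTML(escaped) == html
```

So escaping is a reasonable choice when you want to display the raw markup safely, but it will not give you plain text with only paragraph and line break tags.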