I am trying to sanitize an HTML file and it isn't working correctly. I want it all to be plain text except for paragraph and line break tags. Here is my sanitization code (the dots stand for other code in my class that isn't relevant to the problem):
.
.
.
include ActionView::Helpers::SanitizeHelper
.
.
.
def remove_html(html_content)
  sanitized_content_1 = sanitize(html_content, :tags => %w(p br))
  sanitized_content_2 = Nokogiri::HTML(sanitized_content_1)
  sanitized_content_2.css("style","script").remove
  return sanitized_content_2
end
Here is the original HTML file from which the function is reading its input, and here is the "sanitized" code it is returning. It is leaving in the bodies of the CSS and JavaScript tags, as well as HTML comments. It might be leaving in other things that I have not noticed. How can I thoroughly remove all CSS, HTML, and JavaScript other than paragraph and line break tags?
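I don't think you want to sanitize it. Sanitizing strips HTML, leaving the text behind, except for the HTML elements you deem OK. It is intended for allowing a user-input field to contain some markup. That is probably also why the bodies of your style and script tags survive: once sanitize has stripped the tags themselves, their contents are just text, so the later Nokogiri call finds no style or script elements left to remove.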
Instead, you probably want to parse it. For example, the following will print the text content of the <p> tags in a given HTML string.
require 'nokogiri'

# Parse the HTML string and print the text inside each <p> element
doc = Nokogiri::HTML.parse(html)
doc.search('p').each do |el|
  puts el.text
end
You can also use the CGI module, although this escapes the markup rather than removing it.
require 'cgi'
str = "<html><head><title>Hello</title></head><body></body></html>"
p str
p CGI::escapeHTML str
Running this script gives the following result.
$ ruby sanitize.rb
"<html><head><title>Hello</title></head><body></body></html>"
"&lt;html&gt;&lt;head&gt;&lt;title&gt;Hello&lt;/title&gt;&lt;/head&gt;&lt;body&gt;&lt;/body&gt;&lt;/html&gt;"
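Note that escaping does not remove anything: every tag is still in the string, just rewritten as entities so that a browser displays it as literal text instead of interpreting it. A small sketch to illustrate, using only the standard library (the sample string is made up for this example):

```ruby
require 'cgi'

snippet = '<p>Hello <b>world</b></p><script>alert("x")</script>'

# escapeHTML turns the markup characters into entities; nothing is deleted.
escaped = CGI.escapeHTML(snippet)
puts escaped

# unescapeHTML reverses the conversion, so the original markup comes back.
puts CGI.unescapeHTML(escaped) == snippet
```

So this approach fits when you want to show the markup as text, not when you want the script and style bodies gone.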