How can unwanted tags be removed from HTML using Nokogiri?

开发者 https://www.devze.com 2022-12-23 01:09 Source: web
I need to strip out all font tags from a document. When attempting to do so with the following Ruby code, other elements and text within the font tags are lost. I've also attempted to iterate through all children elements and make them siblings of the font tag before unlinking the font tag--which also results in lost HTML. What is a good method for removing tags which can contain other elements and/or text?

  doc.css('font').each do |element|
    element.unlink
  end

UPDATE (in response to first solution):

The problem with using node.children to obtain the children and then move the children to the font node's parent node is that none of the children nodes include the text found within the font node. As soon as the font tag is removed (unlinked), all text within the font tag also disappears from the document.

My revised question is thus: how do I use Nokogiri to obtain the text of the font node, and how can this text be moved to replace the font tag, in the font node's position?


I created a more generic solution based on the code in your comment:

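require 'nokogiri'

module Filter
    # Removes every descendant element whose name is in +list+,
    # leaving its children (text included) in its place.
    def remove_tags_preserve_content!(*list)
        xpath('.//*').each do |element|
            next unless list.include?(element.name)
            # replace moves the existing child nodes rather than cloning
            # them, so matching tags nested inside this one are handled too.
            element.replace(element.children)
        end
    end
end

class Nokogiri::XML::Element
    include Filter
end

class Nokogiri::XML::NodeSet
    include Filter
end

# Needed so the method can be called on a parsed document, as below.
class Nokogiri::XML::Document
    include Filter
end

# === Example ===

doc.remove_tags_preserve_content!('font')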


The problem is you're lopping off the node, which also trims the child nodes. You need to preserve the children then append them to the parent node. Once you've done that you can delete the target node.

Take a look at "Replace Node w/ Children" - http://rubyforge.org/pipermail/nokogiri-talk/2009-June/000333.html

In that message Aaron is talking about replacing XML nodes, but it's all the same once an HTML document has been parsed by Nokogiri. You'll need to do some minor tweaks but it should get you going.
