Parsing inner tags using Nokogiri

Developer https://www.devze.com 2023-03-07 11:56 Source: Internet

I'm stuck trying to parse irregularly nested HTML tags. Is there a way to strip all HTML tags from a node while keeping all of its text?

I'm using the code:

rows = doc.search('//table[@id="table_1"]/tbody/tr')

# Build one hash per table row, keyed by field name.
details = rows.collect do |row|
  detail = {}
  [
    [:word, 'td[1]/text()'],
    [:meaning, 'td[6]/font'],
  ].each do |name, xpath|
    detail[name] = row.at_xpath(xpath).to_s.strip
  end
  detail
end

Using the XPath:

[:meaning, 'td[6]/font']

generates

:meaning: ! '<font size="3">asking for information specifying <font
    color="#CC0000" size="3">what is your name?</font> /what/ as in, <font color="#CC0000" size="3">I'm not sure what you mean</font>
    /what/ as in <a style="text-decoration: none;" href="http://somesecretlink.com">what</a></font>

On the other hand, using the XPath:

'td/font/text()'

generates

:meaning: asking for information specifying

thus ignoring the text inside all child elements of the node. What I want to achieve is this:

:meaning: asking for information specifying what is your name? /what/ as in, I'm not sure what you mean /what/ as in what? I can't hear you


This depends on what you need to extract. If you want all the text inside the font elements, you can get it with the following XPath:

'td/font//text()'

The double slash selects text nodes at any depth inside the font element, not only its direct children. If you want all text nodes in the cell, use:

'td//text()'

You can also call the text method on a Nokogiri node, which returns the text of the node and all of its descendants with the tags stripped:

row.at_xpath(xpath).text


I added an answer for this same sort of question the other day. It's a very easy process.

Take a look at: Convert HTML to plain text and maintain structure/formatting, with ruby
