How do I correctly deal with non-breaking spaces using Nokogiri?

Developer https://www.devze.com 2023-03-05 10:10 Source: Internet
I am using Nokogiri to parse an HTML page, but I am having odd problems with non-breaking spaces. I tried different encodings, replacing the whitespace, and a few other headache-inducing attempts.

Here is the HTML snippet in question:

<td>Amount 15,300&nbsp;at&nbsp;dollars</td>

Note the change in the &nbsp; representation after I use Nokogiri:

<td>Amount 15,300&#xa0;at&#xa0;dollars</td>

And outputting the inner_text:

Amount 15,300Â atÂ dollars

This is my basic Nokogiri call; I tried a few alternatives to solve the problem but failed miserably:

doc = Nokogiri::HTML(open(url))

And then I do a doc.search for the item in question.

Note that if I inspect the doc, that line shows up with the &#xa0; in it.

Clarification: I do not think I clearly stated the difficulty I am having. I can't get the inner_text to show up without the strange Â symbol.


Unless you really, really want to keep the &nbsp; notation, there shouldn't be a problem here.

A0 is the hex character code for a non-breaking space. As such, &#xa0; prints a non-breaking space, and is exactly equivalent to &nbsp;. &#160; does the same thing, too.
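The equivalence can be checked in plain Ruby without Nokogiri. This is only a small sketch; the constant name NBSP is chosen here for illustration and does not come from the question.

```ruby
# U+00A0 (NO-BREAK SPACE) is the character behind &nbsp;, &#xa0; and &#160;.
NBSP = "\u00A0"

puts NBSP.ord              # code point as a decimal number
puts NBSP.ord.to_s(16)     # the same code point in hexadecimal
puts 0xA0 == 160           # hex A0 and decimal 160 are the same number
puts NBSP == " "           # an ordinary space (U+0020) is a different character
```

The last line is the reason a plain gsub(' ', ...) never touches these characters: they look like spaces but do not compare equal to one.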

What Nokogiri is doing here is reading the text node, recognizing the entities, and converting them to their actual string representation internally. Then, when converting it back to an HTML-friendly version of the text node, it represents the non-breaking space by its hex code, rather than taking the performance overhead of looking it up in an entity table, since it's equivalent, anyway.

Assuming that Â was what you were seeing and wasn't just an issue pasting into Stack Overflow, this is a text encoding issue: the output software (browser?) isn't in UTF-8 mode, so it doesn't know how to handle character code A0 and does the best it can. If this is a browser, adding <meta charset="utf-8"> to the head will solve this issue, and will make the rest of the output more Unicode-friendly.
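A rough way to reproduce that kind of mismatch in plain Ruby is to take the UTF-8 bytes of a non-breaking space and reinterpret them as ISO-8859-1. The choice of ISO-8859-1 as the wrong encoding is an assumption for illustration; the actual encoding of the asker's console or browser is not stated.

```ruby
NBSP = "\u00A0"

# In UTF-8, U+00A0 is stored as two bytes.
UTF8_BYTES = NBSP.bytes.map { |b| format("%02X", b) }

# Simulate a reader that wrongly assumes ISO-8859-1: each byte becomes one character.
MISREAD = NBSP.dup.force_encoding("ISO-8859-1").encode("UTF-8")

puts UTF8_BYTES.join(" ")
puts MISREAD.inspect
```

The first byte, C2, is the letter Â in ISO-8859-1, which is one plausible source of a stray symbol appearing in front of each space.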

If you really, really want &nbsp;, use gsub to replace them in your final output. Otherwise, don't worry about it.
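A minimal sketch of that gsub, assuming the serialized HTML is already in a string. The helper name restore_nbsp_entities and the pattern constant are made up for this example.

```ruby
# Matches the numeric entities and the raw U+00A0 character itself.
NBSP_PATTERN = /&#xa0;|&#160;|\u00A0/i

# Hypothetical helper: put the named entity back into serialized HTML.
def restore_nbsp_entities(html)
  html.gsub(NBSP_PATTERN, "&nbsp;")
end

serialized = "<td>Amount 15,300&#xa0;at&#xa0;dollars</td>"
puts restore_nbsp_entities(serialized)
```

Run it on the final output only, since parsing the result again would turn the entities back into raw characters.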


I know this is old, but it took me an hour to find out how to solve this problem, and it is really easy once you know. Just pass your string to this function and it will be "de-nbsp-fied".

def strip_html(str)
  nbsp = Nokogiri::HTML("&nbsp;").text
  str.gsub(nbsp,'')
end

You could also replace it with a space if you wished. May many of you find this answer!
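The space-replacing variant might look like the sketch below. It writes the character as "\u00A0" instead of asking Nokogiri for it, which should be equivalent for UTF-8 strings; the name nbsp_to_space is hypothetical.

```ruby
NBSP = "\u00A0"

# Hypothetical variant of strip_html: swap each non-breaking space for a
# regular space instead of deleting it, so adjacent words stay separated.
def nbsp_to_space(str)
  str.gsub(NBSP, " ")
end

text = "Amount 15,300#{NBSP}at#{NBSP}dollars"
puts nbsp_to_space(text)
```

With the sample text from the question, deleting the character outright would run the words together ("15,300atdollars"), so the space replacement is usually what you want for inner_text.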


As @sawa says, the main problem is what you see when writing to the console. It's not correctly displaying the non-breaking space after Nokogiri converts it to the appropriate binary value.

The usual way to fix the problem is to preprocess the content:

require 'nokogiri'

html = '<td>Amount 15,300&nbsp;at&nbsp;dollars</td>'
doc = Nokogiri::HTML::DocumentFragment.parse(html.gsub(/&(?:#xa0|#160|nbsp);/i, ' '))
puts doc.to_html

Which outputs:

<td>Amount 15,300 at dollars</td>