I am using Nokogiri to parse an HTML page, but I am having odd problems with non-breaking spaces. I tried different encodings, replacing the whitespace, and a few other headache-inducing attempts.
Here is the HTML snippet in question:
<td>Amount 15,300 at dollars</td>
Note the change for the &nbsp; representation after I use Nokogiri:
<td>Amount 15,300 at dollars</td>
And outputting the inner_text:
Amount 15,300 at dollars
This is my base Nokogiri grab; I tried a few alternatives to solve the problem but failed miserably:
doc = Nokogiri::HTML(open(url))
And then I do a doc.search for the item in question.
Note that if I look at the doc, the line shows up with the &#xA0; on that line.
Clarification: I do not think I clearly stated the difficulty I am having. I can't get the inner_text to show up without the strange Â symbol.
Unless you really, really want to keep the &nbsp; notation, there shouldn't be a problem here.
A0 is the hex character code for a non-breaking space. As such, &#xA0; prints a non-breaking space, and is exactly equivalent to &nbsp;. &#160; does the same thing, too.
What Nokogiri is doing here is reading the text node, recognizing the entities, and converting them to their actual string representation internally. Then, when converting it back to an HTML-friendly version of the text node, it represents the non-breaking space by its hex code, rather than taking the performance overhead of looking it up in an entity table, since it's equivalent, anyway.
Assuming that Â was what you were seeing and wasn't just an issue pasting into StackOverflow, this is a text encoding issue: the output software (browser?) isn't in UTF-8 mode, so doesn't know how to handle character code A0, so does the best it can. If this is a browser, adding <meta charset="utf-8"> to the head will solve this issue, and will make the rest of the output more Unicode-friendly.
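The stray Â can be reproduced without Nokogiri at all. The following is a minimal sketch, assuming the display side decodes the UTF-8 bytes as ISO-8859-1; the sample string is made up for illustration and is not the asker's actual data.

```ruby
# U+00A0 (non-breaking space) is encoded as two bytes in UTF-8: 0xC2 0xA0.
text = "15,300\u00A0at" # made-up sample containing one non-breaking space

# Show the raw UTF-8 bytes of the non-breaking space as hex.
nbsp_bytes = "\u00A0".bytes.map { |b| format('%02X', b) }

# Simulate a viewer that wrongly treats those UTF-8 bytes as ISO-8859-1:
# byte 0xC2 is read as "Â" and byte 0xA0 as a Latin-1 non-breaking space.
misread = text.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts nbsp_bytes.join(' ')
puts misread
```

If the output looks like this in a terminal or browser, the data itself is probably fine and only the decoding on the display side is off.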
If you really, really want &nbsp;, use gsub to replace them in your final output. Otherwise, don't worry about it.
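For anyone who does want the named entity back, here is a rough sketch of that gsub. The helper name restore_nbsp_entity and the sample string are hypothetical, and it assumes the serialized output may contain either the literal U+00A0 character or one of the numeric entities.

```ruby
# Hypothetical helper: rewrite every form of the non-breaking space
# (literal U+00A0, &#xA0;, &#160;) as the named entity &nbsp;.
def restore_nbsp_entity(html)
  html.gsub(/\u00A0|&#xA0;|&#160;/i, '&nbsp;')
end

# Stand-in for serialized output; the position of the space is an assumption.
serialized = '<td>Amount 15,300&#xA0;at dollars</td>'
puts restore_nbsp_entity(serialized)
```

This runs on the final string, after Nokogiri has done its work, so it does not affect parsing or searching.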
I know this is old, but it took me an hour to find out how to solve this problem, and it is really easy once you know. Just pass your string to this function and it will be "de-nbsp-fied".
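require 'nokogiri'
def strip_html(str)
  # Let Nokogiri decode the entity so we get the literal non-breaking space character.
  nbsp = Nokogiri::HTML("&nbsp;").text
  str.gsub(nbsp, '')
end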
You could also replace it with a space if you wished. May many of you find this answer!
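A possible variant that swaps in a plain space, sketched without Nokogiri by writing the character as "\u00A0" directly. The name normalize_nbsp and the sample text are made up for illustration.

```ruby
NBSP = "\u00A0" # the literal non-breaking space character

# Hypothetical variant of strip_html: replace each non-breaking space
# with an ordinary space instead of deleting it.
def normalize_nbsp(str)
  str.gsub(NBSP, ' ')
end

sample = "Amount 15,300#{NBSP}at dollars" # made-up input
puts normalize_nbsp(sample)

# \s only covers ASCII whitespace, so it does not match U+00A0;
# the POSIX bracket class [[:space:]] does.
puts NBSP.match?(/\s/)
puts NBSP.match?(/[[:space:]]/)
```

The last two lines are why a plain strip or a \s-based gsub tends to leave these characters behind.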
As @sawa says, the main problem is what you see when writing to the console. It's not correctly displaying the non-breaking space after Nokogiri converts it to the appropriate binary value.
The usual way to fix the problem is to preprocess the content:
require 'nokogiri'
html = '<td>Amount 15,300 at dollars</td>'
doc = Nokogiri::HTML::DocumentFragment.parse(html.gsub(/&(?:#xa0|#160|nbsp);/i, ' '))
puts doc.to_html
Which outputs:
<td>Amount 15,300 at dollars</td>
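To see what the pattern in that gsub covers, here is a small sketch that applies it to all three spellings of the entity. The Nokogiri parsing step is left out, and the sample strings are assumptions: where the entity sits in the real page is not known.

```ruby
# Same pattern as in the answer: matches &#xA0;, &#160; and &nbsp;, case-insensitively.
NBSP_ENTITY = /&(?:#xa0|#160|nbsp);/i

# Made-up inputs, one per spelling of the entity.
samples = [
  '<td>Amount 15,300&nbsp;at dollars</td>',
  '<td>Amount 15,300&#xA0;at dollars</td>',
  '<td>Amount 15,300&#160;at dollars</td>'
]

# Replace the entity with a plain space before any parsing happens.
cleaned = samples.map { |html| html.gsub(NBSP_ENTITY, ' ') }
puts cleaned.uniq
```

All three inputs collapse to the same cleaned string, which can then be handed to Nokogiri as in the answer.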