Parsing document with special characters, using Nokogiri

Developer https://www.devze.com 2023-02-04 03:40 Source: web

I'm using Nokogiri to parse a webpage that contains special characters, but the special characters are not parsed correctly and show up as "genealÃ³gica".

doc=Nokogiri::HTML(open("#{BASE_URL}search=#{book}#{chapters}&version=NVI")).css('.result-text-style-normal')
doc.css('.footnotes').remove
doc.css('h4').remove
doc

Any ideas how I could fix this?

EDIT: I did a bit more work looking at the page and at how you're trying to process it, and I think this works better. I also changed how you process the page, because the original wasn't as clear as I like to see for maintainability and readability.

require 'addressable/uri'
require 'nokogiri'
require 'open-uri'

def get_chapter(base_url, params={})
  uri = Addressable::URI.parse(base_url)
  uri.query_values = params

  # URI.open comes from open-uri; plain Kernel#open no longer fetches URLs on Ruby 3.0+
  doc = Nokogiri::XML(URI.open(uri.to_s))
  doc.encoding = 'UTF-8'

  div = doc.at_css('.result-text-style-normal')
  div.css('.footnotes').remove
  div.css('h4').remove

  doc
end

page = get_chapter('http://www.biblegateway.com/passage/', :search => 'Mateo1-2', :version => 'NVI')
puts page.content

Rather than build the URL by string interpolation as you did, I prefer to pass it in as chunks, with the base URL and the parameters split. I build the URI using the Addressable gem, which is my go-to for munging URLs. Ruby's built-in URI is having some growing pains right now, related to the encoding of parameters.
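If you would rather not add a gem, the same "base URL plus parameters" split can be sketched with the standard library. This is only an illustration, not what the code above does: it swaps Addressable for Ruby's built-in URI and URI.encode_www_form, and build_url is a hypothetical helper name.

```ruby
require 'uri'

# Hypothetical helper: joins a base URL and a hash of query parameters.
# URI.encode_www_form percent-encodes each key and value as form data,
# so non-ASCII characters and spaces are escaped for us.
def build_url(base_url, params = {})
  uri = URI.parse(base_url)
  uri.query = URI.encode_www_form(params) unless params.empty?
  uri.to_s
end

url = build_url('http://www.biblegateway.com/passage/',
                search: 'Mateo1-2', version: 'NVI')
puts url

# A value with an accent and a space is escaped automatically.
puts build_url('http://www.biblegateway.com/passage/', search: 'Génesis 1')
```

Keeping the parameters in a hash means none of the escaping has to be done by hand, which is the main thing string interpolation gets wrong.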

The document at the far end of the URL you gave declares itself to be XHTML, so it should meet the XHTML specs. You can parse XHTML using Nokogiri::HTML(), but I think you get better results using Nokogiri::XML(), which is stricter.

To give Nokogiri an additional nudge in the right direction for parsing the content, I add:

doc.encoding = 'UTF-8'
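To see why the encoding matters, here is a small plain-Ruby sketch (no Nokogiri involved) that reproduces the garbled word from the question. It assumes the cause is UTF-8 bytes being read as ISO-8859-1, which matches the "Ã³" pattern; the variable names are only for illustration.

```ruby
original = "genealógica"

# Pretend the UTF-8 bytes are ISO-8859-1, then transcode to UTF-8.
# The two bytes of "ó" (0xC3 0xB3) become the two characters "Ã" and "³".
garbled = original.dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts garbled

# Reversing the steps recovers the text, which confirms the diagnosis.
repaired = garbled.encode("ISO-8859-1").force_encoding("UTF-8")
puts repaired
```

If text you scraped round-trips like this, the bytes on the wire were fine and the parser simply guessed the wrong encoding, so telling it the right one up front is the cleaner fix.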

I prefer finding the desired div and assigning it to a temporary variable, and working from that point, rather than doing it chained to the parse step like you did. It's a bit more idiomatic and readable this way because we're dealing with chunks of the document.

Running the code outputs what appears to be nice, clean content. There is some embedded JavaScript, but that is unavoidable because JavaScript is treated as text inside the <script> tags. That isn't an issue if you are presenting the HTML for a browser to render.

Changing Nokogiri::HTML(...) to Nokogiri::HTML5(...) should help.

EXAMPLE:

url = 'https://www.youtube.com/watch?v=4r6gr7uytQA'

# Requires open-uri; on Ruby 3.0+ use URI.open, since Kernel#open no longer fetches URLs
doc = Nokogiri::HTML(URI.open(url))
doc.title
=> "Josh Waitzkin â\u0080\u0094 How to Cram 2 Months of Learning into 1 Day | The Tim Ferriss Show - YouTube"

doc = Nokogiri::HTML5(URI.open(url))
doc.title
=> "Josh Waitzkin — How to Cram 2 Months of Learning into 1 Day | The Tim Ferriss Show - YouTube"

If you are using 1.9 you can simply put