HTML returned by Nokogiri is different from actual HTML source code

Developer https://www.devze.com 2023-01-22 11:29 Source: Internet

I have been successfully screen-scraping certain sites but have come across some very odd behavior with Nokogiri today on a certain site.

If I compare the HTML source code pulled down by Nokogiri with the actual HTML source code from the site, it is truncated on certain pages. Some pages work just fine and all the data is there, while others wig out and get truncated.

www.bento.com/revj/0172.html (Doesn't work: truncated HTML returned by Nokogiri)
www.bento.com/revj/0101.html (Works great)

require 'nokogiri'
require 'open-uri'
scraped_jpage = Nokogiri::HTML(URI.open(page_to_scrape))
puts scraped_jpage

I have tried all sorts of different code and changed the encoding (UTF-8, Shift_JIS, etc.), but I cannot see any reason whatsoever why Nokogiri truncates the returned HTML.

The English versions of these pages all work perfectly:

www.bento.com/rev/0172.html
www.bento.com/rev/0101.html

Thanks for any help - hopefully it's something obvious I have missed and not a bug.

Because that source page has malformed HTML.

Try printing the parse errors: