
Unicode and UTF-8 encoding issue with Scrapy XPath selector text

https://www.devze.com 2023-02-24 02:51 Source: web
I'm using Scrapy and Python (as part of a Django project) to scrape a site with German content. I have libxml2 installed as the backend for Scrapy selectors.

If I extract the word 'Hüftsitz' (this is how it is displayed on the site) through selectors, I get: u'H\ufffd\ufffdftsitz' (Scrapy XPath selectors return Unicode strings).

If I encode this into UTF-8, I get 'H\xef\xbf\xbd\xef\xbf\xbdftsitz', and printing that gives 'H??ftsitz', which isn't correct. I am wondering why this is happening.
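For reference, '\xef\xbf\xbd' is exactly the UTF-8 encoding of U+FFFD itself, which can be checked like this (Python 3 syntax here, unlike the rest of this question):

```python
# The bytes in the question are what U+FFFD encodes to in UTF-8,
# not a mangled u-umlaut.
s = "H\ufffd\ufffdftsitz"
print(s.encode("utf-8"))       # b'H\xef\xbf\xbd\xef\xbf\xbdftsitz'
print("ü".encode("utf-8"))     # b'\xc3\xbc' -- what a real u-umlaut would give
```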

The character set on the site is set to UTF-8. I am testing the above in a Python shell with sys.getdefaultencoding() returning UTF-8. I see the same behaviour in the Django application, where the data from the XPath selectors is written to a MySQL database with a UTF-8 character set.

Am I overlooking something obvious here? Any clues or help will be greatly appreciated.


u'\ufffd' is the Unicode replacement character (U+FFFD), which is usually printed as a question mark inside a black diamond. It is NOT a u-umlaut. So the problem must be somewhere upstream. Check what encoding the web page's headers say is being returned, and verify that the content is, in fact, what it claims to be.

The replacement character is inserted in place of an illegal or unrecognized byte sequence. Several things could cause that, but the likeliest is that the encoding is not what it claims to be.
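As a quick illustration (Python 3 syntax; the 'Hüftsitz' sample is taken from the question): decoding valid UTF-8 bytes with the wrong codec and errors="replace" produces exactly this kind of result.

```python
# errors="replace" substitutes U+FFFD for every byte (or sequence)
# that is invalid in the chosen codec.
data = "Hüftsitz".encode("utf-8")        # valid UTF-8 bytes
print(data.decode("utf-8"))              # Hüftsitz -- correct codec
print(data.decode("ascii", "replace"))   # 'H\ufffd\ufffdftsitz' -- wrong codec
```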


Thanks very much for your answers, John and Steven. Your answers got me thinking differently, which led me to find the source of the problem and also a working solution.

I was working with the following test code:

import urllib2
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse

URL = "http://jackjones.bestsellershop.com/DE/jeans/clark-vintage-jos-217-sup/37246/37256"

url_handler = urllib2.build_opener()
urllib2.install_opener(url_handler)

handle = url_handler.open(URL)
response = handle.read()
handle.close()

html_response = HtmlResponse(URL).replace(body=response) # Problematic line
hxs = HtmlXPathSelector(html_response)

desc = hxs.select('//span[@id="attribute-content"]/text()')
desc_text = desc.extract()[0]
print desc_text
print desc_text.encode('utf-8')

Inside the Scrapy shell, the description data extracted fine, which gave me reason to suspect something was wrong in my own code, because at the pdb prompt I was seeing the replacement characters in the extracted data.

I went through the Scrapy docs for the Response class and adjusted the code above to this:

import urllib2
from scrapy.selector import HtmlXPathSelector
from scrapy.http import HtmlResponse

URL = "http://jackjones.bestsellershop.com/DE/jeans/clark-vintage-jos-217-sup/37246/37256"

url_handler = urllib2.build_opener()
urllib2.install_opener(url_handler)

handle = url_handler.open(URL)
response = handle.read()
handle.close()

#html_response = HtmlResponse(URL).replace(body=response)
html_response = HtmlResponse(URL, body=response)
hxs = HtmlXPathSelector(html_response)

desc = hxs.select('//span[@id="attribute-content"]/text()')
desc_text = desc.extract()[0]
print desc_text
print desc_text.encode('utf-8')

The change I made was to replace the line html_response = HtmlResponse(URL).replace(body=response) with html_response = HtmlResponse(URL, body=response). It is my understanding that the replace() method was somehow mangling the special characters from an encoding point of view.
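Here is a hypothetical sketch of the mechanism I suspect (Python 3 syntax; this is NOT Scrapy's actual code, and the class and attribute names are made up): a response object that fixes its encoding at construction time, and whose replace() carries the old encoding over to the new body, would show exactly this behaviour.

```python
# Made-up class illustrating the suspected bug pattern, not Scrapy internals.
class FakeResponse:
    def __init__(self, body=b"", encoding=None):
        # With no declared encoding, fall back to ascii (the suspected culprit).
        self.encoding = encoding or "ascii"
        self.body = body

    def replace(self, body):
        # Keeps the old self.encoding instead of re-detecting it from the new body.
        return FakeResponse(body=body, encoding=self.encoding)

utf8_body = "Hüftsitz".encode("utf-8")
good = FakeResponse(body=utf8_body, encoding="utf-8")
bad = FakeResponse().replace(body=utf8_body)        # encoding stays 'ascii'
print(good.body.decode(good.encoding))              # Hüftsitz
print(bad.body.decode(bad.encoding, "replace"))     # replacement characters
```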

If anyone would like to chip in with any details of what exactly the replace() method did wrong, I'd very much appreciate the effort.

Thank you once again.


U+FFFD is the replacement character that you get when you do some_bytes.decode('some-encoding', 'replace') and some substring of some_bytes can't be decoded.

You have TWO of them: u'H\ufffd\ufffdftsitz'. This indicates that the u-umlaut was represented as TWO bytes, each of which failed to decode. Most likely, the site is encoded in UTF-8 but the software is attempting to decode it as ASCII. Attempting to decode as ASCII usually happens when there is an unexpected conversion to Unicode with ASCII used as the default encoding; in that case, however, one would not expect the 'replace' argument to be used. More likely, the code takes in an encoding and was written by someone who thinks "doesn't raise an exception" means the same as "works".
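To illustrate the diagnosis (Python 3 syntax; 'Hüftsitz' used as sample data): the same UTF-8 bytes mis-decoded two different ways look quite different, and only errors='replace' yields U+FFFD, which is what makes it a useful fingerprint here.

```python
# A plain wrong-codec decode produces mojibake; only errors="replace"
# produces U+FFFD, so seeing U+FFFD points at a decode(..., 'replace')
# somewhere upstream.
raw = "Hüftsitz".encode("utf-8")         # b'H\xc3\xbcftsitz'
print(raw.decode("latin-1"))             # 'HÃ¼ftsitz' -- mojibake, no U+FFFD
print(raw.decode("ascii", "replace"))    # two replacement characters
```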

Edit your question to provide the URL, and show the minimum code that produces u'H\ufffd\ufffdftsitz'.
