Safe decoding in python ('?' symbol instead of exception)

开发者 https://www.devze.com 2023-02-15 13:59 Source: web

I have code:

encoding = guess_encoding()    
text = unicode(text, encoding)

When the text contains a byte sequence that cannot be decoded, a UnicodeDecodeError is raised. How can I silently skip the exception and replace each bad symbol with '?'?

Try

text = unicode(text, encoding, "replace")

From the documentation:

'replace' causes the official Unicode replacement character, U+FFFD, to be used to replace input characters which cannot be decoded.

If you want to use "?" instead of the official Unicode replacement character, you can do

text = text.replace(u"\uFFFD", "?")

after converting to unicode.

In Python 3, you can decode a bytes object into a string using the decode method. It accepts two parameters:

encoding, which is "utf-8" by default, and
errors, which defines what to do with byte sequences that cannot be decoded. The default value is "strict", which raises a UnicodeDecodeError; the alternatives include "ignore", which drops the offending bytes, and "replace", which substitutes the Unicode replacement character "\uFFFD" for them.
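To make the difference between the handlers concrete, here is a minimal sketch that decodes the same bytes three ways. The sample bytes are made up for illustration; 0xFF is used because it can never occur in valid UTF-8.

```python
# Made-up sample: 0xFF is never valid in UTF-8 data.
raw = b"caf\xff latte"

# errors="strict" (the default) raises UnicodeDecodeError.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# errors="ignore" silently drops the undecodable byte.
ignored = raw.decode("utf-8", errors="ignore")

# errors="replace" substitutes U+FFFD for the undecodable byte.
replaced = raw.decode("utf-8", errors="replace")

print(strict_failed)
print(repr(ignored))
print(repr(replaced))
```

Note that "ignore" loses the information that something was wrong at all, which is why "replace" is usually the safer choice for text that will be shown to a user.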

Therefore, you'd need to do this to decode-and-replace:

encoding = guess_encoding()
text = text_bytes.decode(encoding, errors='replace').replace('\uFFFD', '?')
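One caveat with the two-step approach is that it also turns any genuine U+FFFD already present in the input into '?'. If that matters, a possible alternative is to register a custom error handler with codecs.register_error so that '?' is emitted directly. This is only a sketch: the handler name "qmark" and the helper safe_decode are made up for this example and are not part of the original answer.

```python
import codecs


def _question_mark(exc):
    # Only handle decode errors; re-raise anything else unchanged.
    if isinstance(exc, UnicodeDecodeError):
        # Substitute '?' and resume decoding right after the bad bytes.
        return ("?", exc.end)
    raise exc


# "qmark" is a hypothetical handler name chosen for this sketch.
codecs.register_error("qmark", _question_mark)


def safe_decode(data, encoding):
    """Decode bytes, replacing undecodable sequences with '?'."""
    return data.decode(encoding, errors="qmark")


# 0xFF is invalid; the trailing three bytes are a real, valid U+FFFD.
text = safe_decode(b"ok \xff \xef\xbf\xbd", "utf-8")
print(repr(text))
```

Here the invalid byte becomes '?', while the valid U+FFFD that was already in the data is left untouched.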

As Sven Marnach pointed out in a comment, you can supply the errors argument directly to open. Otherwise, if you open the file in text mode, the decode error is raised while reading the file, as soon as a byte falls outside the character map of the chosen encoding.
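A small sketch of passing errors to open follows. The temporary file and its contents are invented purely so the example is self-contained; in real code you would pass your own path and the guessed encoding.

```python
import os
import tempfile

# Create a throwaway file containing a byte (0xFF) that is invalid in UTF-8.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"good \xff bad\n")
    path = tmp.name

try:
    # errors="replace" is applied while reading, so no exception is raised.
    with open(path, encoding="utf-8", errors="replace") as f:
        content = f.read().replace("\ufffd", "?")
finally:
    os.remove(path)

print(repr(content))
```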