For some reason, Python seems to be having issues with the BOM when reading Unicode strings from a UTF-8 file. Consider the following:
with open('test.py') as f:
    for line in f:
        print unicode(line, 'utf-8')
Seems straightforward, doesn't it?
That's what I thought, until I ran it from the command line and got:
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>
A brief visit to Google revealed that the BOM has to be stripped manually:
import codecs
with open('test.py') as f:
    for line in f:
        print unicode(line.replace(codecs.BOM_UTF8, ''), 'utf-8')
This one runs fine. However, I'm struggling to see any merit in this.
Is there a rationale behind the behavior described above? In contrast, UTF-16 works seamlessly.
The 'utf-8-sig' encoding will consume the BOM signature on your behalf.
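A minimal sketch of reading the question's file through that codec, using codecs.open (which decodes on read in Python 2):

import codecs

# 'utf-8-sig' strips a leading BOM if one is present and behaves
# exactly like 'utf-8' on BOM-less files, so it is safe either way.
with codecs.open('test.py', encoding='utf-8-sig') as f:
    for line in f:
        print line.rstrip('\n')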
You wrote:
UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>
When you specify the "utf-8" encoding in Python, it takes you at your word. UTF-8 files aren't supposed to contain a BOM. One is neither required nor recommended, and endianness makes no sense with 8-bit code units.
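A quick sketch of what "takes you at your word" means in practice: the plain "utf-8" codec decodes the three BOM bytes into a real U+FEFF character rather than discarding them.

data = '\xef\xbb\xbfhello'            # UTF-8 BOM followed by ASCII text
print repr(data.decode('utf-8'))      # u'\ufeffhello' -- BOM kept as data
print repr(data.decode('utf-8-sig'))  # u'hello'       -- BOM consumed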
BOMs screw things up, too, because you can no longer just do:
$ cat a b c > abc
if those UTF-8 files have extraneous (read: any) BOMs in them. See now why BOMs are so stupid/bad/harmful in UTF-8? They actually break things.
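Here is a small sketch (with made-up file contents) of how that plays out:

import codecs

# Two UTF-8 "files", each saved with a BOM.
a = codecs.BOM_UTF8 + 'hello\n'
b = codecs.BOM_UTF8 + 'world\n'

# What `cat a b > ab` produces: the second BOM lands mid-stream.
ab = a + b
print repr(ab.decode('utf-8'))      # u'\ufeffhello\n\ufeffworld\n'
print repr(ab.decode('utf-8-sig'))  # u'hello\n\ufeffworld\n'
                                    # only the leading BOM is consumed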
A BOM is metadata, not data, and the UTF-8 encoding spec makes no allowance for them the way the UTF-16 and UTF-32 specs do. So Python took you at your word and followed the spec. Hard to blame it for that.
If you are trying to use the BOM as a filetype magic number to specify the contents of the file, you really should not be doing that. You are supposed to use a higher-level protocol for such metadata purposes, just as you would with a MIME type.
This is just another lame Windows bug, the workaround for which is to use the alternate encoding "utf-8-sig" when decoding the file in Python.