Why do Python unicode strings require special treatment for UTF-8 BOM?

Developer https://www.devze.com 2023-04-01 17:13 Source: Internet
For some reason, Python seems to be having issues with BOM when reading unicode strings from a UTF-8 file. Consider the following:

with open('test.py') as f:
   for line in f:
      print unicode(line, 'utf-8')

Seems straightforward, doesn't it?

That's what I thought until I ran it from command line and got:

UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>

A quick Google search revealed that the BOM has to be stripped manually:

import codecs
with open('test.py') as f:
   for line in f:
      print unicode(line.replace(codecs.BOM_UTF8, ''), 'utf-8')

This one runs fine. However, I'm struggling to see any merit in this.

Is there a rationale behind the behavior described above? In contrast, UTF-16 works seamlessly.


The 'utf-8-sig' encoding will consume the BOM signature on your behalf.
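As a minimal sketch of that suggestion, the snippet below reads the same BOM-prefixed file twice, once as 'utf-8' and once as 'utf-8-sig'. The helper name read_lines and the temporary file are illustrative assumptions, not part of the original post. It uses io.open, which accepts an encoding argument on both Python 2.6+ and Python 3.

```python
import codecs
import io
import os
import tempfile


def read_lines(path, encoding):
    """Return the decoded lines of `path` (hypothetical helper for this demo)."""
    with io.open(path, encoding=encoding) as f:
        return [line.rstrip(u"\n") for line in f]


# Create a small throwaway file that starts with the UTF-8 BOM bytes.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(codecs.BOM_UTF8 + b"first line\nsecond line\n")

plain = read_lines(demo_path, "utf-8")      # BOM survives as U+FEFF
sig = read_lines(demo_path, "utf-8-sig")    # BOM is consumed by the codec
os.remove(demo_path)
```

With 'utf-8' the first line still begins with U+FEFF, which is the character the console codec in the question could not encode; with 'utf-8-sig' it is gone before the text reaches print.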


You wrote:

 UnicodeEncodeError: 'charmap' codec can't encode character u'\ufeff' in position 0: character maps to <undefined>

When you specify the "utf-8" encoding in Python, it takes you at your word. UTF-8 files aren’t supposed to contain a BOM. A BOM is neither required nor recommended for UTF-8, because endianness makes no sense with 8-bit code units.

BOMs screw things up, too, because you can no longer just do:

$ cat a b c > abc 

if those UTF-8 files have extraneous (read: any) BOMs in them. See now why BOMs are so stupid/bad/harmful in UTF-8? They actually break things.
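The concatenation problem can be reproduced in a few lines. This is only a sketch with made-up in-memory data standing in for the files a and b; it shows that even 'utf-8-sig' cannot help once a BOM ends up in the middle of the stream.

```python
import codecs

# Two in-memory "files", each starting with a UTF-8 BOM (illustrative data).
file_a = codecs.BOM_UTF8 + b"alpha\n"
file_b = codecs.BOM_UTF8 + b"beta\n"

# Byte-for-byte concatenation, which is what `cat a b > ab` does.
combined = file_a + file_b

# 'utf-8-sig' only strips a BOM at the very start of the data,
# so the second file's BOM survives as U+FEFF in the middle of the text.
text = combined.decode("utf-8-sig")
stray_positions = [i for i, ch in enumerate(text) if ch == u"\ufeff"]
```

Here stray_positions is non-empty: the second BOM sits right after the first file's trailing newline, where no decoder will treat it as a signature.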

A BOM is metadata, not data, and the UTF-8 encoding spec makes no allowance for them the way the UTF-16 and UTF-32 specs do. So Python took you at your word and followed the spec. Hard to blame it for that.
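The difference the question noticed between UTF-16 and UTF-8 can be seen directly with Python's codecs. The following is a small sketch using in-memory bytes rather than real files; the variable names are illustrative.

```python
import codecs

# UTF-16: the generic 'utf-16' codec reads the BOM to pick the byte order
# and does not return it as part of the text.
utf16_bytes = codecs.BOM_UTF16_LE + u"hi".encode("utf-16-le")
from_utf16 = utf16_bytes.decode("utf-16")

# UTF-8: the plain 'utf-8' codec treats the same signature as ordinary data,
# so it comes back as the character U+FEFF.
utf8_bytes = codecs.BOM_UTF8 + u"hi".encode("utf-8")
from_utf8 = utf8_bytes.decode("utf-8")

# 'utf-8-sig' is the opt-in variant that consumes a leading signature.
from_utf8_sig = utf8_bytes.decode("utf-8-sig")
```

So the UTF-16 case "works seamlessly" because the BOM is part of how that codec determines byte order, while for UTF-8 stripping the signature is something you have to ask for.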

If you are trying to use the BOM as a filetype magic number to specify the contents of the file, you really should not be doing that. You are supposed to use a higher-level protocol for this kind of metadata, just as you would with a MIME type.

This is just another lame Windows bug; the workaround is to pass the alternate encoding "utf-8-sig" to Python.
