I wrote a simple Python script to translate Chinese punctuation into its English equivalents.
import codecs, sys

def trcn():
    tr = lambda x: x.translate(str.maketrans(""",。!?;:、()【】『』「」﹁﹂“”‘’《》~¥…—×""", """,.!?;:,()[][][][]""''<>~$^-*"""))
    out = codecs.getwriter('utf-8')(sys.stdout)
    for line in sys.stdin:
        out.write(tr(line))

if __name__ == '__main__':
    if not len(sys.argv) == 1:
        print("usage:\n\t{0} STDIN STDOUT".format(sys.argv[0]))
        sys.exit(-1)
    trcn()
    sys.exit(0)
But something goes wrong with Unicode handling, and the script does not run to completion. The error message is:
Traceback (most recent call last):
  File "trcn.py", line 13, in <module>
    trcn()
  File "trcn.py", line 7, in trcn
    out.write(tr(line))
  File "C:\Python31\Lib\codecs.py", line 356, in write
    self.stream.write(data)
TypeError: must be str, not bytes
After that, I tested out.write() in IDLE and in the console. They produced different results, and I don't know why.
In IDLE
Python 3.1.2 (r312:79149, Mar 21 2010, 00:41:52) [MSC v.1500 32 bit (Intel)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import sys,codecs
>>> out = codecs.getwriter('utf-8')(sys.stdout)
>>> out.write('hello')
hello
>>>
In Console
Python 3.1.2 (r312:79149, Mar 21 2010, 00:41:52) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys,codecs
>>> out = codecs.getwriter('utf-8')(sys.stdout)
>>> out.write('hello')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python31\Lib\codecs.py", line 356, in write
    self.stream.write(data)
TypeError: must be str, not bytes
>>>
Platform: Windows XP EN
Your encoded output comes out of the encoder as bytes, so it must be written to sys.stdout.buffer:
out = codecs.getwriter('utf-8')(sys.stdout.buffer)
I'm not entirely sure why your code behaves differently in IDLE than in the console, but the above may help. Perhaps IDLE's sys.stdout actually expects bytes instead of characters (hopefully it has a .buffer that also expects bytes).
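As a rough sketch of that fix applied to the script's main loop: the helper name write_translated and the shortened translation table are illustrative assumptions, not the asker's exact code, and io.BytesIO stands in for sys.stdout.buffer so the written bytes can be inspected.

```python
import codecs
import io

# Illustrative subset of the question's table (full-width -> ASCII);
# the real script would list every character pair.
TABLE = str.maketrans("。【】《》¥", ".[]<>$")

def write_translated(lines, binary_stream):
    """Translate each line and write it as UTF-8 bytes to binary_stream."""
    # The codecs writer encodes str to bytes, so it needs a binary stream.
    out = codecs.getwriter('utf-8')(binary_stream)
    for line in lines:
        out.write(line.translate(TABLE))

# io.BytesIO stands in for sys.stdout.buffer in this demonstration.
demo = io.BytesIO()
write_translated(["你好。【x】\n"], demo)
print(demo.getvalue())
```

In the actual script, the call would presumably be write_translated(sys.stdin, sys.stdout.buffer).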
IDLE redirects stdout to its own GUI output, which apparently accepts bytes as well as strings; the normal stdout does not.
Either decode the data to Unicode (str) before writing it to sys.stdout, or write the bytes to sys.stdout.buffer.
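The mismatch can be reproduced without a console. In this small sketch, io.StringIO is assumed to stand in for a text-only stream such as the console's sys.stdout, and io.BytesIO for sys.stdout.buffer; both options mentioned above are shown.

```python
import codecs
import io

text_stream = io.StringIO()    # accepts str only, like the console's sys.stdout
binary_stream = io.BytesIO()   # accepts bytes only, like sys.stdout.buffer

# The codecs writer encodes 'hello' to bytes, then calls stream.write(bytes).
# On a text stream that raises TypeError, as in the question's traceback.
try:
    codecs.getwriter('utf-8')(text_stream).write('hello')
    failed = False
except TypeError:
    failed = True

# Option 1: give the text stream a str (decode any bytes first).
text_stream.write(b'hello'.decode('utf-8'))

# Option 2: give the encoded bytes to a binary stream.
codecs.getwriter('utf-8')(binary_stream).write('hello')
```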
It is fairly clear that the console's encoding is not UTF-8. Python lets you override the encoding used for stdin and stdout when it starts, for example through the PYTHONIOENCODING environment variable; see the Python docs for details.
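To check which encoding the console stream uses, and to force UTF-8 from inside the script instead, something along these lines may work. This is only a sketch: io.TextIOWrapper is used here in place of the codecs writer from the question, and io.BytesIO stands in for sys.stdout.buffer.

```python
import io
import sys

# The encoding Python picked for stdout (may be None for replaced streams).
console_encoding = getattr(sys.stdout, 'encoding', None)

# Wrap a binary stream in a text layer with an explicit encoding.
# In a real script, raw would be sys.stdout.buffer.
raw = io.BytesIO()
out = io.TextIOWrapper(raw, encoding='utf-8', newline='\n')
out.write('你好.\n')
out.flush()  # push the encoded bytes through to the underlying stream
```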