Piping latin-1 encoded output of a program to a Python 3 script

Developer (https://www.devze.com), 2023-02-16 16:40. Source: web
I want to process the output of a running program line-by-line (think tail -f) with a Python 3 script (on Linux).

The program's output, which is piped to the script, is encoded in latin-1. In Python 2, I used the codecs module to decode the input from sys.stdin properly:

#!/usr/bin/env python
import sys, codecs

sin = codecs.getreader('latin-1')(sys.stdin)
for line in sin:
    print '%s "%s"' % (type (line), line.encode('ascii','xmlcharrefreplace').strip())

This worked:

<type 'unicode'> "Hi! &#246;&#228;&#223;"
...

However, in Python 3, sys.stdin.encoding is UTF-8, and if I just read naively from stdin:

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    print ('type:{0} line:{1}'.format(type (line), line))

I get this error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 4: invalid start byte

How can I read non-UTF-8 text data piped to stdin in Python 3?
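
The error can be reproduced without a pipe, which may help explain it. This is a minimal sketch, assuming the piped bytes are the latin-1 encoding of the sample line shown earlier: 0xf6 is "ö" in latin-1, but no valid UTF-8 sequence starts with that byte.

```python
# Sample bytes, assumed to match what the program writes: "Hi! öäß" in latin-1.
raw = b"Hi! \xf6\xe4\xdf"

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc  # same kind of failure as when iterating over sys.stdin

# latin-1 maps every byte value to exactly one character, so this cannot fail.
text = raw.decode("latin-1")

print(utf8_error.reason, "at position", utf8_error.start)
print(ascii(text))
```

The reported position is the index of the first byte that is not valid UTF-8, which is why the traceback points at position 4, right after "Hi! ".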


import sys
import io

# Reopen the stdin file descriptor in text mode, decoding the bytes as latin-1.
with io.open(sys.stdin.fileno(), 'r', encoding='latin-1') as sin:
    for line in sin:
        print('type:{0} line:{1}'.format(type(line), line))

yields

type:<class 'str'> line:Hi! öäß
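
A variant of the same idea, offered as a sketch rather than a definitive recipe: wrap the binary buffer in io.TextIOWrapper instead of reopening the file descriptor. The helper name read_text_lines and the in-memory sample are assumptions for illustration; in a real script the argument would be sys.stdin.buffer.

```python
import io

def read_text_lines(binary_stream, encoding="latin-1"):
    """Yield decoded lines from a binary stream (hypothetical helper)."""
    # TextIOWrapper decodes the bytes incrementally as lines are requested.
    text_stream = io.TextIOWrapper(binary_stream, encoding=encoding)
    for line in text_stream:
        yield line.rstrip("\n")

# In-memory stand-in for the pipe; a real script would pass sys.stdin.buffer.
sample = io.BytesIO(b"Hi! \xf6\xe4\xdf\nsecond line\n")
lines = list(read_text_lines(sample))
for line in lines:
    # ascii() keeps the demo output printable on any terminal encoding.
    print("type:{0} line:{1}".format(type(line), ascii(line)))
```

Taking the stream as a parameter keeps the decoding logic testable without a real pipe.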


Take a look at the documentation for sys.stdin. The relevant part is:

The standard streams are in text mode by default. To write or read binary data to these, use the underlying binary buffer. For example, to write bytes to stdout, use sys.stdout.buffer.write(b'abc'). Using io.TextIOBase.detach() streams can be made binary by default. This function sets stdin and stdout to binary:

import sys

def make_streams_binary():
    sys.stdin = sys.stdin.detach()
    sys.stdout = sys.stdout.detach()

After doing this, you can decode the binary input using whatever encoding you need.
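
As a rough illustration of that last step, assuming the input is line-oriented: iterating over a binary stream yields bytes objects, and each one can be converted to str explicitly. The name decode_lines is made up for this sketch, and the BytesIO object stands in for sys.stdin.buffer (or for sys.stdin after detach()).

```python
import io

def decode_lines(binary_stream, encoding="latin-1"):
    """Decode each raw line of a binary stream (hypothetical helper)."""
    for raw_line in binary_stream:  # each item is a bytes object
        yield raw_line.decode(encoding).rstrip("\r\n")

# In-memory stand-in for the binary stdin stream.
binary_input = io.BytesIO(b"Hi! \xf6\xe4\xdf\n")
decoded = list(decode_lines(binary_input))
print([ascii(line) for line in decoded])
```

Decoding line by line is fine for single-byte encodings such as latin-1. For an encoding like UTF-16, where a newline byte can occur inside a character, a stream decoder is the safer choice.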


Also see this post: How to set sys.stdout encoding in Python 3?
The suggestion from that post was to use:

sys.stdin = codecs.getreader("utf-8")(sys.stdin.detach())
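
For the input in this question, the encoding name would presumably be "latin-1" rather than "utf-8". A small sketch under that assumption, using an in-memory stream in place of sys.stdin.detach():

```python
import codecs
import io

# Stand-in for sys.stdin.detach(), which returns the underlying binary buffer.
binary_stdin = io.BytesIO(b"Hi! \xf6\xe4\xdf\n")

# codecs.getreader returns a StreamReader class; calling it wraps the stream.
reader = codecs.getreader("latin-1")(binary_stdin)
lines = [line.rstrip("\n") for line in reader]
print([ascii(line) for line in lines])
```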