Why don't I see the Hebrew characters when I print text from a UTF-8 file in Python?

Developer https://www.devze.com 2023-03-22 11:00 Source: the web

I'm trying to read Hebrew from a text file:

def task1():
    f = open('C:\\Users\\royi\\Desktop\\final project\\corpus-haaretz.txt', 'r',"utf-8")
    print 'success'
    return f

a = task1()

When I read it, it shows me this:

'[\xee\xe0\xee\xf8 \xee\xf2\xf8\xeb\xfa \xf9\xec \xe4\xf0\xe9\xe5-\xe9\xe5\xf8\xf7 \xe8\xe9\xe9\xee\xf1: \xf2\xec \xe1\xe9\xfa \xe4\xee\xf9\xf4\xe8 \xec\xe1\xe8\xec \xe0\xfa \xe7\xe5\xf7 \xe4\xe7\xf8\xed, \xec\xe8\xe5\xe1\xfa \xe9\xf9\xf8\xe0\xec \xee\xe0\xfa \xf0\xe9\xe5

and many more.

How do I read it?

You print it like this:

print task1().read().encode('your terminal encoding here')

You must be sure that your terminal is able to display Hebrew characters. For example, under a fully UTF-8 Linux distribution with Hebrew locales installed:

print task1().read().encode('utf-8')

Be careful with open:

With Python 2.7, the built-in open has no encoding parameter. Use the codecs module.
With Python 3+, encoding is the fourth parameter, not the third as in your call. You probably mean something like open(path, 'r', encoding='utf-8'). You can even omit 'r'.
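As a minimal sketch of the point about open, the following reads a file as decoded text in a way that should behave the same on Python 2.6+ and Python 3. It uses io.open, which is my substitution here (the answer itself recommends codecs; codecs.open would work similarly). The helper name read_text and the temporary demo file are hypothetical, only there to make the example self-contained.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def read_text(path, encoding='utf-8'):
    """Return the file's contents as decoded text, not raw bytes."""
    # io.open accepts encoding= on Python 2.6+; on Python 3 it is the built-in open.
    with io.open(path, 'r', encoding=encoding) as f:
        return f.read()


# Demo with a throwaway file (hypothetical data, not the asker's corpus).
sample = u'\u05e9\u05dc\u05d5\u05dd'  # the Hebrew word "shalom"
fd, demo_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(demo_path, 'w', encoding='utf-8') as f:
    f.write(sample)

text = read_text(demo_path)
os.remove(demo_path)
```

Passing encoding as a keyword avoids the positional mix-up with the buffering argument described above.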

So why would you use encode?

Well, when you read a file and tell Python the encoding, it returns a unicode object, not a str object. For example, on my system:

>>> import codecs
>>> content = codecs.open('/etc/fstab', encoding='utf-8').read()
>>> type(content)
<type 'unicode'>
>>> type('')
<type 'str'>
>>> type(u'')
<type 'unicode'>

If it contains non-ASCII characters, you need to encode it back to a byte string (str) before printing it:

>>> type(content.encode('utf-8'))
<type 'str'>

We use encode because here we are talking about a more or less generic text object (unicode is as generic as you can get for text manipulation), and you turn it (encode it) into a specific representation (utf-8).

And we need this specific representation because your system doesn't know about Python's internals and can only print ASCII characters if you don't specify the encoding. So when you output, you encode to an encoding your system can understand. For me it's luckily 'utf-8', so it's easy. If you are on Windows, it can get tricky.
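To illustrate the "encode on output" step, here is a small sketch that turns text into bytes for a given target encoding. The function name encode_for_output is hypothetical, and falling back to 'utf-8' when the terminal encoding cannot be detected is an assumption of this sketch, not something the answer prescribes. The 'replace' error handler is also my choice: it substitutes '?' for characters the target encoding cannot represent instead of raising an error.

```python
import sys


def encode_for_output(text, encoding=None):
    """Encode text to bytes in the given (or detected) output encoding."""
    # sys.stdout.encoding may be missing or None when output is redirected,
    # so fall back to UTF-8 (an assumption of this sketch).
    enc = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    # 'replace' emits '?' for characters the target encoding lacks.
    return text.encode(enc, 'replace')


alef = u'\u05d0'  # HEBREW LETTER ALEF
as_utf8 = encode_for_output(alef, 'utf-8')
as_cp1255 = encode_for_output(alef, 'cp1255')
as_ascii = encode_for_output(alef, 'ascii')
```

The same character becomes two bytes in UTF-8, one byte in cp1255, and a placeholder in ASCII, which is why the output encoding has to match what the terminal expects.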

You need to use the codecs module to open a file. The built-in open() call (see the docs) doesn't take a third argument like that; the third argument is the buffer size.

Specifically codecs.open(). Always decode when you read, encode when you output :-)

From the look of it, it seems to me that the encoding of the string you get is 'windows-1255', not 'utf-8'. Try to open the file using that encoding instead.
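A quick way to check this guess is to take the first few bytes from the output in the question and try both decodings. This is only a sketch on a short sample (the first two words); the variable names are mine.

```python
# First two words of the byte string shown in the question.
raw = b'\xee\xe0\xee\xf8 \xee\xf2\xf8\xeb\xfa'

# These bytes are not valid UTF-8, so a strict decode fails.
try:
    raw.decode('utf-8')
    is_utf8 = True
except UnicodeDecodeError:
    is_utf8 = False

# cp1255 is Python's name for windows-1255.
text = raw.decode('cp1255')
```

Decoded as cp1255, the sample yields Hebrew letters (U+05D0 to U+05EA), which is consistent with the file being windows-1255 rather than UTF-8.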

Your description of how you read the file appears to be incorrect. It is puzzling that "it" manages to show you bytes that are obviously Hebrew text encoded in cp1255.

We need to be shown unambiguously what is in the first few (say 200) bytes of your file. Please run one of the following commands in a Command Prompt window, depending on what Python you are using:

Python 2.x (assuming 2.7 installed in the standard place):

prompt>c:\python27\python -c "import locale; print locale.getpreferredencoding(), repr(open('your_file.txt', 'rb').read(200))"

or Python 3.x

prompt>c:\python32\python -c "import locale; print(locale.getpreferredencoding(),ascii(open('your_file.txt', 'rb').read(200)))"

Edit your question and (1) copy/paste the output from the command (2) tell us what version of Python you are using.
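If it helps, the same inspection can be sketched as a small script that reads the first bytes in binary mode and reports the first candidate encoding that decodes them cleanly. The names guess_encoding and sniff_file and the candidate list are hypothetical. The order matters: cp1255 accepts most byte sequences, so UTF-8 is tried first. An incremental decoder with final=False is used so that a multi-byte UTF-8 character cut off at the 200-byte boundary is not reported as an error.

```python
import codecs


def guess_encoding(raw, candidates=('utf-8', 'cp1255')):
    """Return the first candidate that decodes raw without error, else None."""
    for enc in candidates:
        decoder = codecs.getincrementaldecoder(enc)()
        try:
            # final=False tolerates a truncated multi-byte sequence at the end.
            decoder.decode(raw, final=False)
        except UnicodeDecodeError:
            continue
        return enc
    return None


def sniff_file(path, size=200):
    """Guess the encoding from the first `size` bytes of the file at path."""
    with open(path, 'rb') as f:
        return guess_encoding(f.read(size))
```

A successful decode does not prove the guess is right, so the repr of the raw bytes requested above is still the unambiguous evidence.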
