I'm trying to read Hebrew from a text file:
def task1():
    f = open('C:\\Users\\royi\\Desktop\\final project\\corpus-haaretz.txt', 'r',"utf-8")
    print 'success'
    return f

a = task1()
When I read it, it shows me this:
'[\xee\xe0\xee\xf8 \xee\xf2\xf8\xeb\xfa \xf9\xec \xe4\xf0\xe9\xe5-\xe9\xe5\xf8\xf7 \xe8\xe9\xe9\xee\xf1: \xf2\xec \xe1\xe9\xfa \xe4\xee\xf9\xf4\xe8 \xec\xe1\xe8\xec \xe0\xfa \xe7\xe5\xf7 \xe4\xe7\xf8\xed, \xec\xe8\xe5\xe1\xfa \xe9\xf9\xf8\xe0\xec \xee\xe0\xfa \xf0\xe9\xe5
and much more.
How do I read it?
You print it like this:
print task1().read().encode('your terminal encoding here')
You must be sure that your terminal can display Hebrew characters. For example, on a fully UTF-8 Linux distribution with Hebrew locales installed:
print task1().read().encode('utf-8')
Careful with open:
- With Python 2.7, the built-in open has no encoding parameter. Use the codecs module.
- With Python 3+, encoding is the fourth parameter, not the third as in your code. You probably mean something like open(path, 'r', encoding='utf-8'). You can even omit 'r'.
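As a minimal sketch of the point above: io.open accepts an encoding keyword on both Python 2.7 and Python 3, so it is one way to avoid the positional-argument trap. The helper name read_text and the throwaway sample file are made up for illustration; they are not from the question.

```python
import io
import os
import tempfile


def read_text(path, encoding):
    """Return the file's contents decoded with the given encoding."""
    # Pass encoding by keyword so it cannot be mistaken for the buffer size.
    with io.open(path, 'r', encoding=encoding) as f:
        return f.read()


# Demo on a temporary file (hypothetical sample, not the asker's corpus).
sample = u'\u05e9\u05dc\u05d5\u05dd'  # the Hebrew word "shalom"
fd, tmp_path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with io.open(tmp_path, 'w', encoding='utf-8') as f:
    f.write(sample)

text = read_text(tmp_path, 'utf-8')
os.remove(tmp_path)
```

The value returned is text (unicode on Python 2, str on Python 3), not raw bytes, so it still has to be encoded before it is written to a terminal or another file.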
So why would you use encode?
Well, when you read a file and tell Python the encoding, it returns a unicode object, not a str object. For example, on my system:
>>> import codecs
>>> content = codecs.open('/etc/fstab', encoding='utf-8').read()
>>> type(content)
<type 'unicode'>
>>> type('')
<type 'str'>
>>> type(u'')
<type 'unicode'>
If it contains non-ASCII characters, you need to encode it back to a str to get a printable string:
>>> type(content.encode('utf-8'))
<type 'str'>
We use encode because we start with a more or less generic text object (unicode is as generic as you can get for text manipulation) and turn it into (encode it as) a specific representation (utf-8).
And we need this specific representation because your system doesn't know about Python's internals and can only print ASCII characters if you don't specify the encoding. So when you output, you encode to an encoding your system can understand. For me that's luckily 'utf-8', so it's easy. If you are on Windows, it can get tricky.
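One way to pick the output encoding instead of hard-coding it is sketched below. The helper to_terminal_bytes is a made-up name, and falling back to locale.getpreferredencoding() when stdout reports no encoding is an assumption, not something the answer prescribes. Characters the target encoding cannot represent are replaced with '?' rather than raising an error.

```python
import locale
import sys


def to_terminal_bytes(text, encoding=None):
    """Encode text for output; unencodable characters become '?'."""
    if encoding is None:
        # Assumption: use the locale's preferred encoding when stdout
        # does not report one (for example, when output is piped).
        encoding = (getattr(sys.stdout, 'encoding', None)
                    or locale.getpreferredencoding())
    return text.encode(encoding, 'replace')


shin = u'\u05e9'  # the Hebrew letter shin
as_utf8 = to_terminal_bytes(shin, 'utf-8')   # representable in UTF-8
as_ascii = to_terminal_bytes(shin, 'ascii')  # not representable in ASCII
```

On Python 2 the result can be passed straight to print. On Python 3, print encodes text itself, so a helper like this is mostly useful when writing to a binary stream.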
You need to use the codecs module to open the file, specifically codecs.open(). The built-in open() (see docs) doesn't take a third argument like that; its third argument would be the buffer size (bufsize).
Always decode when you read, encode when you output :-)
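The "decode when you read, encode when you output" rule can be illustrated in memory, without touching a file. The input bytes here are a hypothetical cp1255 sample, and UTF-8 as the output encoding is only an assumption for the demo.

```python
# Bytes as they might arrive from a cp1255-encoded file (hypothetical sample).
raw_in = b'\xf9\xec\xe5\xed'

# 1. Decode at the input boundary: bytes -> text.
text = raw_in.decode('cp1255')

# 2. Work on text only; len() now counts characters, not bytes.
n_chars = len(text)

# 3. Encode at the output boundary: text -> bytes.
raw_out = text.encode('utf-8')
```

Each Hebrew letter takes one byte in cp1255 and two bytes in UTF-8, which is why the byte strings differ in length while the text in the middle stays the same.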
Judging by the output, the string you get is encoded in 'windows-1255', not 'utf-8'. Try opening the file with that encoding instead.
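A quick way to check this guess is to take the first word of the output shown in the question and try both encodings. This is only a sketch on those four bytes; the variable names are illustrative.

```python
# First word of the output quoted in the question.
raw = b'\xee\xe0\xee\xf8'

# These bytes are not valid UTF-8, so decoding them as UTF-8 fails.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# As cp1255 (Python's name for windows-1255) they decode to four Hebrew letters.
word = raw.decode('cp1255')
```

Here word is made of the letters mem, alef, mem, resh, which is consistent with the file being Hebrew text stored as windows-1255.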
Your description of how you read the file appears to be incorrect. It is puzzling that "it" manages to show you bytes that are obviously Hebrew text encoded in cp1255.
We need to be shown unambiguously what is in the first few (say 200) bytes of your file. Please run one of the following commands in a Command Prompt window, depending on what Python you are using:
Python 2.x (assuming 2.7 installed in the standard place):
prompt>c:\python27\python -c "import locale; print locale.getpreferredencoding(), repr(open('your_file.txt', 'rb').read(200))"
or Python 3.x
prompt>c:\python32\python -c "import locale; print(locale.getpreferredencoding(),ascii(open('your_file.txt', 'rb').read(200)))"
Edit your question and (1) copy/paste the output from the command (2) tell us what version of Python you are using.