#source file is encoded in utf8
import urllib2
import re
req = urllib2.urlopen('http://people.w3.org/rishida/scripts/samples/hungarian.html')
c = req.read()#.decode('utf-8')
p = r'title="This is Latin script \(Hungarian language\)">(.+)'
text = re.search(p, c).group(1)
name = text[:10]+'.txt' #file name will have special chars in it
f = open(name, 'wb')
f.write(text) #content of file will have special chars in it
f.close()
x = raw_input('done')
As you can see, the script does a couple of things:
- Reads content that is known to contain non-ASCII characters from a web page into a variable.
  (The source file is saved as UTF-8, but that should not make a difference unless unicode strings are defined as literals in the source code. Here the string is built dynamically at run time, so the encoding of the source file shouldn't matter.)
- Writes a file whose name contains those characters.
- Writes the same content into that file as well.
Here's the weird behavior I get (Windows 7, Python 2.7).
When I don't use the decode function:
c = req.read()
The NAME of the file comes out as gibberish, but the CONTENT of the file is readable (that is, you can see the correct Hungarian characters).
Yet, when I USE the decode function:
c = req.read().decode('utf-8')
Opening the file (really, creating it in 'wb' mode) does NOT raise an error, and the resulting file's NAME is readable: it now shows the correct Hungarian characters.
So far so good, right? Well, it then DOES raise an error when trying to write the unicode content to the file:
f.write(text) #content of file will have special chars in it
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 8: ordinal not in range(128)
You see, I can't seem to have my cake and eat it too. Either I can write the NAME of the file correctly, or I can write the CONTENT of the file correctly.
How can I do both?
I've also tried writing the file with
f = codecs.open(name, encoding='utf-8', mode='wb')
But that raises an error as well.
Your only real problem seems to be the "unreadable" file name produced by your original script (the version without decode). This should solve it:
import sys
f = open(name.decode('utf-8').encode(sys.getfilesystemencoding()), 'wb')
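For what it's worth, the gibberish name is most likely the UTF-8 bytes of the name being interpreted in the system's ANSI code page. The snippet below is only a small sketch of that effect. It assumes the code page is cp1252, which depends on the Windows locale, and it uses a single sample character rather than the real page content.

```python
# -*- coding: utf-8 -*-
# Assumption: the ANSI code page is cp1252; other locales use other pages.
original = u'\xe1'                     # LATIN SMALL LETTER A WITH ACUTE
utf8_bytes = original.encode('utf-8')  # two bytes: 0xC3 0xA1

# Reading those bytes with the wrong codec turns one character into two.
garbled = utf8_bytes.decode('cp1252')

# Decoding with the codec that produced the bytes restores the character.
repaired = utf8_bytes.decode('utf-8')
```

Here `garbled` ends up as two characters instead of one, which is the kind of name that shows up in the file browser when the undecoded byte string is passed to open().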
While winterTTR's answer does work, I've realized that this approach is convoluted. All you really need to do is encode the data you write to the file. The name doesn't need to be encoded: pass it as a unicode string, and both the name and the content will come out "readable".
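# Placeholder UTF-8 bytes standing in for the downloaded text
content = '\xc3\xa1rv\xc3\xadzt\xc5\xb1r\xc5\x91'.decode('utf-8')
f = open(content[:5] + '.txt', 'wb')  # unicode file name, passed as-is
f.write(content.encode('utf-8'))      # encode only the data being written
f.close()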
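The same idea can also be sketched with io.open, which is available in Python 2.6+ and does the encoding on write for you. This is only an illustration under a few assumptions: save_text is a hypothetical helper name, the sample string stands in for the text scraped from the page, and the file goes into a temporary directory instead of the current one.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile


def save_text(directory, text):
    """Write unicode `text` to a file named after its first 10 characters."""
    # Keep the name as a unicode string; strip() drops a trailing space.
    name = text[:10].strip() + u'.txt'
    path = os.path.join(directory, name)
    # Text mode with an explicit encoding: write() accepts unicode directly.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return path


# Sample Hungarian text (an assumption, not the content of the real page).
sample = u'\xe1rv\xedzt\u0171r\u0151 t\xfck\xf6rf\xfar\xf3g\xe9p'
out_dir = tempfile.mkdtemp()
saved_path = save_text(out_dir, sample)

# Read the file back to check that the content survived the round trip.
with io.open(saved_path, 'r', encoding='utf-8') as f:
    round_trip = f.read()
```

The point is the same as above: the name stays a unicode string from start to finish, and the only place bytes are produced is at the moment of writing.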