Write unicode content and unicode file name in Windows


Developer https://www.devze.com 2023-03-11 08:58 Source: Internet


Related topics: python windows

#source file is encoded in utf8
import urllib2
import re

req = urllib2.urlopen('http://people.w3.org/rishida/scripts/samples/hungarian.html')
c = req.read()#.decode('utf-8')

p = r'title="This is Latin script \(Hungarian language\)">(.+)'
text = re.search(p, c).group(1)

name = text[:10]+'.txt'  #file name will have special chars in it

f = open(name, 'wb')
f.write(text)  #content of file will have special chars in it
f.close()   


x = raw_input('done')

As you can see, the script does a couple of things:

 - Reads content that is known to contain unicode characters from a web page into a variable. (The source file is saved as UTF-8, but that should not make a difference unless unicode strings are actually defined in the source code. Here the unicode string is assigned to a variable dynamically, so the encoding of the source file shouldn't matter.)
 - Writes a file whose name contains unicode characters.
 - Writes unicode content into that file as well.

Here's the weird behavior I get (Windows 7, Python 2.7). When I don't use the decode function:

c = req.read()

The NAME of the file comes out as gibberish, but the CONTENT of the file comes out readable (that is, you can see the correct Hungarian characters).
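
A likely explanation, sketched under assumptions: in Python 2 on Windows, a byte-string file name is handed to the system as-is and interpreted in the ANSI code page, so UTF-8 bytes show up as mojibake. The snippet below imitates that by decoding UTF-8 bytes as cp1252. The code page and the sample word are assumptions for illustration, not taken from the script above.

```python
# Assumption: the Windows ANSI code page is cp1252; it varies by locale.
utf8_name = u'\xe1rv\xedz'.encode('utf-8')     # UTF-8 bytes, like those req.read() returns
seen_by_windows = utf8_name.decode('cp1252')   # the same bytes misread as ANSI text

# Each accented letter turns into two unrelated characters.
assert seen_by_windows == u'\xc3\xa1rv\xc3\xadz'
```

The content looks fine in this case because the bytes written to the file are still valid UTF-8; only the name is misinterpreted.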

Yet, when I USE the decode function:

c = req.read().decode('utf-8')

It will NOT ERROR on opening the file (really, creating it, since it is opened in 'wb' mode), and the resulting file's NAME will be readable: it now shows the correct unicode characters.

So far so good right? Well, then it WILL ERROR on trying to write the unicode content to the file:

    f.write(text)  #content of file will have special chars in it
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 8: ordinal not in range(128)
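
For context, a small sketch of what seems to be happening: when Python 2 is asked to write a unicode object to a byte-oriented file, it first converts it with the default 'ascii' codec, which amounts to the explicit call below. The sample string is hypothetical; only the character u'\xe1' comes from the traceback.

```python
text = u'Magyarorsz\xe1g'   # hypothetical decoded text containing u'\xe1'

error = None
try:
    text.encode('ascii')    # roughly what the implicit conversion does
except UnicodeEncodeError as exc:
    error = exc             # same exception type as in the traceback

data = text.encode('utf-8') # encoding explicitly succeeds and yields bytes
```

Passing `data` rather than `text` to `f.write` avoids the implicit conversion.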

You see, I can't seem to have my cake and eat it too. Either I can correctly write the NAME of the file or I can correctly write the CONTENT of the file.

How can I do both?

I've also tried writing the file with

f = codecs.open(name, encoding='utf-8', mode='wb')

But that also raises an error.

Your only problem seems to be the "unreadable" file name produced by your original script. Re-encoding the name for the file system should solve it:

import sys
f = open(name.decode('utf-8').encode(sys.getfilesystemencoding()), 'wb')
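
The one-liner above can be unpacked into a small helper, shown here as a sketch. The function name `utf8_name_to_fs_bytes` and the sample name are hypothetical. One caveat, stated as an assumption: on Windows with Python 2 the file system encoding is typically 'mbcs', which may not represent every character, so this conversion can be lossy.

```python
import sys


def utf8_name_to_fs_bytes(name_bytes):
    """Re-encode a UTF-8 byte string into the file system encoding."""
    fs_encoding = sys.getfilesystemencoding() or 'utf-8'
    # Decode to unicode first, then encode for the file system.
    return name_bytes.decode('utf-8').encode(fs_encoding)


raw_name = u'\xe1rv\xedz.txt'.encode('utf-8')   # hypothetical UTF-8 file name
fs_name = utf8_name_to_fs_bytes(raw_name)
```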

While winterTTR's answer does work, I've realized that the approach is convoluted. All you really need to do is encode the data you write to the file. The name doesn't need to be encoded: pass it as a unicode string, and both the name and the content will come out "readable".

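# example UTF-8 bytes for a Hungarian word, standing in for the downloaded data
content = '\xc3\xa1rv\xc3\xadzt\xc5\xb1r\xc5\x91'.decode('utf-8')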
f = open(content[:5]+'.txt', 'wb')
f.write(content.encode('utf-8'))
f.close()
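
For comparison, here is a minimal sketch of the same idea using `io.open`, which exists in Python 2.6+ and Python 3 and encodes the content on write. The helper name `write_unicode_file`, the temporary directory, and the sample text are hypothetical. It assumes the text's first characters are all legal in a file name.

```python
import io
import os
import tempfile


def write_unicode_file(directory, text, name_length=10):
    """Write `text` to a file named after its first `name_length` characters."""
    name = text[:name_length] + u'.txt'   # unicode name, passed through unencoded
    path = os.path.join(directory, name)
    # io.open takes unicode text and encodes it as UTF-8 when writing.
    with io.open(path, 'w', encoding='utf-8') as f:
        f.write(text)
    return path


if __name__ == '__main__':
    sample = u'\xe1rv\xedzt\u0171r\u0151 t\xfck\xf6rf\xfar\xf3g\xe9p'  # hypothetical text
    out = write_unicode_file(tempfile.mkdtemp(), sample)
    with io.open(out, 'r', encoding='utf-8') as f:
        assert f.read() == sample
```

The design choice is the same as in the answer above: keep the name as unicode and let only the content go through an explicit encoding.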
