I am working on a program that reads a downloaded webpage (stored as 'something'.html) and parses it accordingly. I am having some trouble getting the encoding and decoding correct for this program. It's my understanding that most webpages are encoded in ISO-8859-1, and when I checked the response from this page that is indeed the charset I was given:
>>> print r.info()
Content-Type: text/html; charset=ISO-8859-1
Connection: close
Cache-Control: no-cache
Date: Sun, 20 Feb 2011 15:16:31 GMT
Server: Apache/2.0.40 (Red Hat Linux)
X-Accel-Cache-Control: no-cache
However, in the meta tags of the page it declares 'utf-8' as its encoding:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
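As an aside, this is the rough check I use to see what the page itself claims, so I can compare it against the HTTP header (the regex is only a sketch, not a real HTML parser):

```python
import re

# Charset as declared in the page's own <meta> tag (sketch only; a proper
# HTML parser should be preferred over a regex for arbitrary pages).
html = b'<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
m = re.search(br'charset=([-\w]+)', html)
print(m.group(1))   # the meta tag says UTF-8, while the HTTP header said ISO-8859-1
```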
So, in Python I've tried a number of approaches to read these pages, parse them, and write UTF-8, including reading the file in normally and writing normally:
with open('../results/1.html','r') as f:
    page = f.read()
...
with open('../parsed.txt','w') as f:
    for key in fieldD:
        f.write(key+'\t'+fieldD[key]+'\n')
I have tried explicitly telling the file which encoding to use during the read & write process:
with codecs.open('../results/1.html','r','utf-8') as f:
    page = f.read()
...
with codecs.open('../parsed.txt','w','utf-8') as f:
    for key in fieldD:
        f.write(key+'\t'+fieldD[key]+'\n')
Explicitly telling the file to read from 'iso-8859-1' and write to 'utf-8':
with codecs.open('../results/1.html','r','iso_8859_1') as f:
    page = f.read()
...
with codecs.open('../parsed.txt','w','utf-8') as f:
    for key in fieldD:
        f.write(key+'\t'+fieldD[key]+'\n')
As well as all the permutations of these ideas, including writing as UTF-16, encoding each string separately before it is added to the dictionary, and other erroneous ideas. I'm not sure what the best approach here is. It seems I've had the best luck not using ANY encoding, because at least SOME text editors (Emacs, TextWrangler) will then display the results correctly.
I've read through a couple of posts on here regarding this topic and still can't seem to make heads or tails of what is going on.
Thanks.
I followed your instructions. The displayed page is NOT encoded in UTF-8; decoding using UTF-8 fails. According to an experimental character set detector that I muck about with occasionally, it is encoded in a Latin-based encoding ... one of ISO-8859-1, cp1252, and ISO-8859-15, and the language appears to be 'es' (Spanish) or 'fr' (French). According to me looking at it, it's Spanish. Firefox (View >>> view encoding) says it's ISO-8859-1.
So now what you need to do is experiment with what tools will display your saved files correctly. If you can't find one, you will need to transcode your files to UTF-8 i.e. data.decode('ISO-8859-1').encode('UTF-8') and find a tool that displays UTF-8 correctly. Shouldn't be too hard. Firefox can nut out the encoding and display it correctly for just about any encoding that I've thrown at it.
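A minimal sketch of that transcoding step, using the actual bytes from the page (and assuming the saved file really is ISO-8859-1, as the header claims):

```python
# Transcode ISO-8859-1 bytes to UTF-8: decode once, encode once.
data = b'Iglesia Cat\xf3lica'                       # bytes as saved from the server
utf8_data = data.decode('ISO-8859-1').encode('UTF-8')
print(repr(utf8_data))                              # \xf3 becomes the pair \xc3\xb3
```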
Update after request for "intuition":
In your 3rd block of code, you include only the input and the output, with "..." between. The input code should produce unicode objects OK. However, in the output code you use the str function (why???). Assuming that you still have unicode objects after the "...", applying str() to them would raise an exception if your system's default encoding is 'ascii' (as it should be) or silently mangle your data if it is 'utf8' (as it shouldn't be). Please publish (1) the contents of "...", (2) the result of doing import sys; print sys.getdefaultencoding(), (3) what you "see" in the output file instead of the expected ó in "Iglesia Católica" -- is it Ã³? (4) the actual byte(s) in the file (use print repr(the data)) instead of the expected ó.
SOLVED You say in a comment that you see Iglesia Cat√É¬≥lica ... note that there are FOUR characters (√É¬≥) displayed instead of the ONE expected. This is symptomatic of encoding in UTF-8 twice. The next puzzle was what was displaying those characters, two of which are not mapped in ISO-8859-1 nor cp1252. I tried the old DOS codepages cp437 and cp850, still used in Windows' Command Prompt window, but they didn't fit. koi8r wasn't going to fit either; it needs a Latin-based character set. Hmm, what about macroman? Tada!! You sent the doubly-encoded guff to stdout on your Mac Terminal. See the demonstration below.
>>> from unicodedata import name
>>> oacute = u"\xf3"
>>> print name(oacute)
LATIN SMALL LETTER O WITH ACUTE
>>> guff = oacute.encode('utf8').decode('latin1').encode('utf8')
>>> guff
'\xc3\x83\xc2\xb3'
>>> for c in guff.decode('macroman'):
... print name(c)
...
SQUARE ROOT
LATIN CAPITAL LETTER E WITH ACUTE
NOT SIGN
GREATER-THAN OR EQUAL TO
>>>
Inspecting the saved file: I too saved the web page to a file (plus a directory containing *.jpg, a css file, etc.) -- using Firefox "save page as". Try this with your saved page and publish the results.
>>> data = open('g0.htm', 'rb').read()
>>> uc = data.decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "c:\python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xb7 in position 1130: invalid start byte
>>> pos = data.find("Iglesia Cat")
>>> data[pos:pos+20]
'Iglesia Cat\xf3lica</a>'
>>> # Looks like one of ISO-8859-1 and its cousins to me.
Note carefully: If your file is encoded in UTF-8, then reading it with the UTF-8 codec will produce unicode. If you don't mangle the data somehow when parsing, and write the parsed unicode with the UTF-8 codec, it will NOT be doubly encoded. You need to look carefully at your code for instances of "str" (remember the "typo"?), "unicode", "encode", "decode", "utf", "UTF", etc. Do you call a 3rd-party library to do the parsing? What do you see when you do print repr(key), repr(fieldD[key]) just before writing to the output file?
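For comparison, here is a minimal end-to-end sketch of what the pipeline should look like: decode the saved bytes exactly once, write the unicode through a UTF-8 codec exactly once. (The file name and sample bytes are made up for illustration; io.open behaves like codecs.open for this purpose.)

```python
import io

# Stand-in for the saved page's bytes (ISO-8859-1 per the Content-Type header).
raw = b'Iglesia Cat\xf3lica'
text = raw.decode('iso-8859-1')          # bytes -> unicode, exactly once

with io.open('parsed.txt', 'w', encoding='utf-8') as f:
    f.write(text)                        # unicode -> UTF-8 bytes, exactly once

# Reading the file back as UTF-8 round-trips cleanly -- no double encoding.
with io.open('parsed.txt', 'r', encoding='utf-8') as f:
    assert f.read() == text
```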
This is becoming tedious. Consider putting your code and saved page on the web somewhere we can look at it instead of guessing.
32766.html: I've just realised that you are the guy who had blown all his inodes trying to write too many files to a folder on a vfat file system (or something like that). So you are not doing a manual "save as". Please publish the code that you have used to "save" these files.
>>> url = 'http://213.97.164.119/ABSYS/abwebp.cgi/X5104/ID31295/G0?ACC=DCT1'
>>> data = urllib2.urlopen(url).read()[4016:4052]; data
'Iglesia+Cat%f3lica">Iglesia Cat\xf3lica'
>>> data.decode('latin-1')
u'Iglesia+Cat%f3lica">Iglesia Cat\xf3lica'
>>> data.decode('latin-1').encode('utf-8')
'Iglesia+Cat%f3lica">Iglesia Cat\xc3\xb3lica'
What do you get?