I have a bunch of HTML files that I downloaded using the httplib2 package in Python. ' ' is showing up as 'Â '.
<font color="#ff0000">02/12/2004Â </font> is showing, while <font color="#ff0000">02/12/2004 </font> is the desired format.
How do I replace the 'Â ' with ' ' in Python? Thanks a lot!
You've got an encoding problem. Instead of trying to remove these characters, find out the encoding of the page, and then, when you read the file, use the codecs module instead of open(), passing the proper character encoding.
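As a minimal sketch of the advice above: the snippet reads a file through codecs.open() with an explicit encoding and then replaces the non-breaking space (U+00A0) with a plain space. It assumes the page is UTF-8, which the question does not confirm; read_html and the temporary demo file are hypothetical names used only for illustration.

```python
import codecs
import os
import tempfile


def read_html(path, encoding="utf-8"):
    """Read a file as text, decoding it with the given encoding (assumed UTF-8 here)."""
    with codecs.open(path, "r", encoding=encoding) as f:
        return f.read()


# Demo input: a small UTF-8 file containing a non-breaking space (U+00A0).
sample = u'<font color="#ff0000">02/12/2004\u00a0</font>'
fd, demo_path = tempfile.mkstemp(suffix=".html")
os.close(fd)
with codecs.open(demo_path, "w", encoding="utf-8") as f:
    f.write(sample)

# Decoded with the right encoding, no stray 'Â' appears in the text.
text = read_html(demo_path)

# Replace the non-breaking space with an ordinary space.
cleaned = text.replace(u"\u00a0", u" ")

os.remove(demo_path)
print(cleaned)
```

If the page declares a different charset (in its Content-Type header or a meta tag), that name would go in the encoding argument instead.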
import string
filtered_content = ''.join(filter(lambda x: x in string.printable, content))
This solved my problem. Thank you!
s = s.replace('Â ', ' ')
However, while I haven't used httplib2, I'm pretty sure something is wrong if the content of the HTML files is being changed when you download them. It may be that there's a decoding problem going on. What version of Python are you using? If it's Python 3, the contents will be byte sequences, not strings, so you'll have to specify the right encoding to decode the bytes with.
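To illustrate the decoding issue, here is a small sketch of how a stray 'Â' can appear. It assumes the server sent UTF-8 and the bytes were decoded as Latin-1. That is a common cause of this symptom, but it is only an assumption about these particular files. The variable raw stands in for the bytes returned by the download.

```python
# Bytes as a UTF-8 server might send them: the date followed by a
# non-breaking space (U+00A0), which UTF-8 encodes as 0xC2 0xA0.
raw = u"02/12/2004\u00a0".encode("utf-8")

# Decoding with the wrong encoding turns 0xC2 into a visible 'Â'.
wrong = raw.decode("latin-1")

# Decoding with the right encoding yields the intended text.
right = raw.decode("utf-8")

# Text that was already mis-decoded can often be repaired by reversing
# the wrong step and decoding again with the correct encoding.
repaired = wrong.encode("latin-1").decode("utf-8")

print(repr(wrong))
print(repr(right))
print(repaired == right)
```

The repair step only works if the wrong decoding really was Latin-1, so decoding correctly at download time is the more reliable fix.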
http://code.google.com/p/httplib2/wiki/ExamplesPython3
EDIT: If you aren't limited to using just httplib2, perhaps you could try the urllib, urllib2, or httplib modules that are part of the Python 2.6 standard library?