I'm Nicola, a new Python user without a real background in computer programming, so I'd really appreciate some help with a problem I have. I wrote some code to scrape data from this webpage:
http://finanzalocale.interno.it/sitophp/showQuadro.php?codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02
Basically, the goal of my code is to scrape the data from all the tables on the page and write it to a txt file. Here is my code:
#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os

def extract(soup):
    table = soup.findAll("table")[1]
    for row in table.findAll('tr')[1:19]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)
    table = soup.findAll("table")[2]
    for row in table.findAll('tr')[1:21]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)
    table = soup.findAll("table")[3]
    for row in table.findAll('tr')[1:44]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)
    table = soup.findAll("table")[4]
    for row in table.findAll('tr')[1:18]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)
    table = soup.findAll("table")[5]
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)
    table = soup.findAll("table")[6]
    for row in table.findAll('tr')[1:]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        print >> outfile, "|".join(record)

outfile = open("modena_quadro02.txt", "w")
br = Browser()
br.set_handle_robots(False)
url = "http://finanzalocale.interno.it/sitophp/showQuadro.php?codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02"
page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1)
outfile.close()
Everything would work fine, but the first column of some tables on that page contains words with accented characters. When I run the code, I get the following:
Traceback (most recent call last):
File "modena2.py", line 158, in <module>
extract(soup1)
File "modena2.py", line 98, in extract
print >> outfile, "|".join(record)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 32: ordinal not in range(128)
I know that the problem is with the encoding of the accented characters. I tried to find a solution to this, but it really goes beyond my knowledge. I want to thank in advance everybody who is going to help me. I really appreciate it! And sorry if the question is too basic, but, as I said, I'm just getting started with Python and I'm learning everything by myself.
Thanks! Nicola
I'm going to try again based on feedback. Since you are using the print statement to produce the output, your output must be bytes, not characters (that's the reality of present-day operating systems). By default, Python's sys.stdout (what the print statement writes to) uses the 'ascii' character encoding. Because only byte values 0 to 127 are defined by ASCII, those are the only byte values you can print. Hence the error for the character u'\xe0'.
You can change the character encoding of sys.stdout to UTF-8 by doing this:
import codecs, sys
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
print u'|'.join([u'abc', u'\u0100'])
The print statement above will not complain about printing a Unicode string that cannot be represented in ASCII. However, the code below, which prints bytes rather than characters, raises a UnicodeDecodeError, so beware:
import codecs, sys
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
print '|'.join(['abc', '\xe0'])
You may find that your code is trying to print characters, and that setting the character encoding of sys.stdout to UTF-8 (or ISO-8859-1) fixes it. But you might find that the code is trying to print bytes (obtained from the BeautifulSoup API), in which case the fix might be something like this:
import codecs, sys
sys.stdout = codecs.getwriter('UTF-8')(sys.stdout)
print '|'.join(['abc', '\xe0']).decode('ISO-8859-1')
I'm not familiar with the BeautifulSoup package, but I advise testing it with various documents to see whether its detection of character encoding is correct. Your code does not explicitly provide an encoding, so BeautifulSoup is clearly deciding on one on its own. If that decision comes from the document's meta encoding tag, then great.
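If you want to see what the page itself declares, a rough sniff of the raw bytes can serve as a sanity check. This is only a sketch: sniff_meta_charset is a hypothetical helper, not part of BeautifulSoup, and the ISO-8859-1 fallback is just an assumption; a real parser handles many more cases.

```python
import re

def sniff_meta_charset(html_bytes, default='ISO-8859-1'):
    # Look for a "charset=..." declaration in the raw HTML bytes
    # (e.g. inside a meta Content-Type tag); fall back to a default.
    m = re.search(br'charset=["\']?([A-Za-z0-9_\-]+)', html_bytes, re.IGNORECASE)
    return m.group(1).decode('ascii') if m else default

sniff_meta_charset(b'<meta http-equiv="Content-Type" '
                   b'content="text/html; charset=utf-8">')
# returns 'utf-8'
```

Comparing the sniffed value against what BeautifulSoup decided tells you whether its guess matches the document's own declaration.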
Edit: I just tried it, and since I assume you want a table in the end, here is a solution that produces a CSV file.
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
import urllib2, os
import csv

def extract(soup):
    table = soup.findAll("table")[1]
    for row in table.findAll('tr')[1:19]:
        col = row.findAll('td')
        voce = col[0].string
        accertamento = col[1].string
        competenza = col[2].string
        residui = col[3].string
        record = (voce, accertamento, competenza, residui)
        outfile.writerow([s.encode('utf8') if type(s) is unicode else s for s in record])
    # swap the print statement for the writerow() call in all other blocks as well
    # ...

outfile = csv.writer(open(r'modena_quadro02.csv', 'wb'))
br = Browser()
br.set_handle_robots(False)
url = "http://finanzalocale.interno.it/sitophp/showQuadro.php?codice=2080500230&tipo=CO&descr_ente=MODENA&anno=2009&cod_modello=CCOU&sigla=MO&tipo_cert=C&isEuro=0&quadro=02"
page1 = br.open(url)
html1 = page1.read()
soup1 = BeautifulSoup(html1)
extract(soup1)
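The encode-before-write step is the heart of the fix above. Here is a minimal sketch of what it does, with a made-up record (BeautifulSoup's .string returns Unicode strings, or None for an empty cell; the (s or u'') idiom is a variant of the type check above that also maps None to an empty field):

```python
# Hypothetical record, as the col[i].string lookups might return it:
# Unicode strings, with None standing in for an empty cell.
record = (u'Imposte e tasse \xe0', u'1.234,56', None, u'0')

# Replace None with an empty string, then encode each field to UTF-8
# bytes so the csv writer receives plain byte strings.
encoded = [(s or u'').encode('utf8') for s in record]
# encoded[0] is b'Imposte e tasse \xc3\xa0' -- the UTF-8 bytes for the accented character
```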
I had a similar issue last week. It was easy to fix in my IDE (PyCharm).
Here is my fix:
Starting from the PyCharm menu bar: File -> Settings... -> Editor -> File Encodings, then set "IDE Encoding", "Project Encoding" and "Default encoding for properties files" ALL to UTF-8, and it now works like a charm.
Hope this helps!
The issue is with printing Unicode text to a binary file:
>>> print >>open('e0.txt', 'wb'), u'\xe0'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe0' in position 0: ordinal not in range(128)
To fix it, either encode the Unicode text into bytes (u'\xe0'.encode('utf-8')) or open the file in text mode:
#!/usr/bin/env python
from __future__ import print_function
import io

with io.open('e0.utf8.txt', 'w', encoding='utf-8') as file:
    print(u'\xe0', file=file)
Try changing this line:
html1 = page1.read()
to this:
html1 = page1.read().decode(encoding)
where encoding would be, for example, 'UTF-8', 'ISO-8859-1', etc. I'm not familiar with the mechanize package, but hopefully there is a way to discover the encoding of the document returned by the read() method. It seems that the read() method is giving you a byte string, not a character string, and therefore the join call later on must assume ASCII as the encoding.