I am more than a bit tired, but here goes:
I am doing some HTML scraping in Python 2.6.5 with BeautifulSoup on an Ubuntu box.
Reason for Python 2.6.5: BeautifulSoup sucks under 3.1.
I am trying to run the following code:
# dataretriveal from html files from DETHERM
# -*- coding: utf-8 -*-
import sys,os,re,csv
from BeautifulSoup import BeautifulSoup
sys.path.insert(0, os.getcwd())
raw_data = open('download.php.html','r')
soup = BeautifulSoup(raw_data)
for numdiv in soup.findAll('div', {"id" : "sec"}):
    currenttable = numdiv.find('table',{"class" : "data"})
    if currenttable:
        numrow=0
        numcol=0
        data_list=[]
        for row in currenttable.findAll('td', {"class" : "dataHead"}):
            numrow=numrow+1
        for ncol in currenttable.findAll('th', {"class" : "dataHead"}):
            numcol=numcol+1
        for col in currenttable.findAll('td'):
            col2 = ''.join(col.findAll(text=True))
            if col2.index('±'):
                col2=col2[:col2.index('±')]
            print(col2.encode("utf-8"))
    ref=numdiv.find('a')
    niceref=''.join(ref.findAll(text=True))
Now, due to the ± signs, I get the following error when trying to run the code with:
python code.py
Traceback (most recent call last):
  File "detherm-wtest.py", line 25, in <module>
    if col2.index('±'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
How do I solve this? Putting a u in front, so that '±' becomes u'±', results in:
Traceback (most recent call last):
  File "detherm-wtest.py", line 25, in <module>
    if col2.index(u'±'):
ValueError: substring not found
The code file's encoding is UTF-8.
Thank you.
Byte strings like "±" (in Python 2.x) are encoded in the source file's encoding, which might not be what you want. If col2 is really a Unicode object, you should use u"±" instead, as you already tried. Passing a byte string to a method of a Unicode object makes Python decode the byte string with the ASCII codec first, and that is where the UnicodeDecodeError comes from. Note also that somestring.index raises a ValueError if it doesn't find an occurrence, whereas somestring.find returns -1. Therefore, this
if col2.index('±'):
    col2=col2[:col2.index('±')] # this is not indented correctly in the question BTW
print(col2.encode("utf-8"))
should be
if u'±' in col2:
    col2=col2[:col2.index(u'±')]
print(col2.encode("utf-8"))
so that the if statement doesn't lead to an exception.
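For comparison, here is a minimal sketch of the same idea as a small helper. The name strip_uncertainty and the sample cell values are made up for illustration and are not part of the original script. It uses partition instead of the in/index pair, which is a swapped-in alternative rather than what the code above does: partition returns the whole string when the separator is missing, so no exception handling is needed. It assumes any byte strings it receives are UTF-8 encoded.

```python
# -*- coding: utf-8 -*-
# Hypothetical helper for illustration; not part of the original script.

PLUS_MINUS = u'\u00b1'  # the ± sign written as a Unicode escape


def strip_uncertainty(cell_text):
    """Return the text before the first ± sign, or all of it if there is none."""
    # Assumption: byte strings are UTF-8 encoded, so decode them explicitly
    # instead of letting Python fall back to the ASCII codec.
    if isinstance(cell_text, bytes):
        cell_text = cell_text.decode('utf-8')
    # partition() never raises: with no ± present it returns (cell_text, u'', u'').
    value, _sep, _rest = cell_text.partition(PLUS_MINUS)
    return value


if __name__ == '__main__':
    # Made-up sample cells: one with an uncertainty, one without.
    for cell in [u'1.234 \u00b1 0.005', u'298.15']:
        print(strip_uncertainty(cell).encode('utf-8'))

    # The difference between find() and index() mentioned above:
    print(u'298.15'.find(PLUS_MINUS))  # find() signals "not found" with -1
    try:
        u'298.15'.index(PLUS_MINUS)
    except ValueError:
        print('index() raised ValueError')
```

Whether to strip the trailing space left before the ± sign is a separate choice; the sketch leaves it in place, just as the slicing version does.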