Special character use in Python 2.6

Developer https://www.devze.com 2023-01-19 16:41 Source: Internet

I am more than a bit tired, but here goes:

I am doing some HTML scraping in Python 2.6.5 with BeautifulSoup on an Ubuntu box.

Reason for Python 2.6.5: BeautifulSoup sucks under 3.1.

I try to run the following code:

# dataretriveal from html files from DETHERM
# -*- coding: utf-8 -*-

import sys,os,re,csv
from BeautifulSoup import BeautifulSoup


sys.path.insert(0, os.getcwd())

raw_data = open('download.php.html','r')
soup = BeautifulSoup(raw_data)

for numdiv in soup.findAll('div', {"id" : "sec"}):
    currenttable = numdiv.find('table',{"class" : "data"})
    if currenttable:
        numrow=0
        numcol=0
        data_list=[]
        for row in currenttable.findAll('td', {"class" : "dataHead"}):
            numrow=numrow+1
        for ncol in currenttable.findAll('th', {"class" : "dataHead"}):
            numcol=numcol+1
        for col in currenttable.findAll('td'):
            col2 = ''.join(col.findAll(text=True))
        if col2.index('±'):
        col2=col2[:col2.index('±')]
            print(col2.encode("utf-8"))
        ref=numdiv.find('a')
        niceref=''.join(ref.findAll(text=True))

Now, due to the ± signs, I get the following error when trying to run the code with:

python code.py

Traceback (most recent call last):
  File "detherm-wtest.py", line 25, in <module>
    if col2.index('±'):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

How do I solve this? Putting a u in front, so that '±' becomes u'±', results in:

Traceback (most recent call last):
  File "detherm-wtest.py", line 25, in <module>
    if col2.index(u'±'):
ValueError: substring not found

The code file is currently encoded as UTF-8.

thank you

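Byte strings like "±" (in Python 2.x) are encoded in the source file's encoding, which might not be what you want. In a UTF-8 file that is the two bytes 0xc2 0xb1, and when col2 is a Unicode object, Python 2 tries to decode the byte string as ASCII before searching, which fails on 0xc2 and raises the UnicodeDecodeError. If col2 is really a Unicode object, you should use u"±" instead, as you already tried. Note that somestring.index raises a ValueError if it doesn't find an occurrence, whereas somestring.find returns -1. Therefore, this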

    if col2.index('±'):
        col2=col2[:col2.index('±')] # this is not indented correctly in the question BTW
        print(col2.encode("utf-8"))

should be

    if u'±' in col2:
        col2=col2[:col2.index(u'±')]
        print(col2.encode("utf-8"))

so that the if statement doesn't lead to an exception.
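
The same idea can be tried outside the scraper. The following is only a minimal sketch: the helper name strip_uncertainty and the sample cell values are made up for illustration and do not come from the original script. It also shows why the byte string fails: the UTF-8 form of ± starts with 0xc2, which cannot be decoded as ASCII.

```python
# -*- coding: utf-8 -*-
# Sketch only: strip_uncertainty and the sample values are hypothetical,
# they are not part of the original script.

PLUS_MINUS = u'\u00b1'  # the ± sign as a Unicode string


def strip_uncertainty(text):
    """Return the part of `text` before the first ±, or `text` unchanged."""
    if PLUS_MINUS in text:  # a membership test never raises
        return text[:text.index(PLUS_MINUS)]
    return text


# In a UTF-8 source file the byte string '±' is the two bytes 0xc2 0xb1.
utf8_bytes = PLUS_MINUS.encode('utf-8')

# Decoding those bytes as ASCII fails on 0xc2. Python 2 attempts this
# decode implicitly when a byte string is searched for in a Unicode object.
try:
    utf8_bytes.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(strip_uncertainty(u'298.15 \u00b1 0.05').strip())  # cell with ±
print(strip_uncertainty(u'101.3'))                       # cell without ±
print(u'101.3'.find(PLUS_MINUS))                         # find gives -1
```

Testing with `in` also avoids a quieter problem in the original condition: index returns 0 when the ± is the first character, and 0 counts as false in an if statement, so such a cell would be skipped.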
