How to: remove part of a Unicode string in Python following a special character

Developer https://www.devze.com 2023-01-18 22:56 Source: web

First, a short summary:

Python version: 3.1; system: Linux (Ubuntu)

I am trying to do some data retrieval through Python and BeautifulSoup.

Unfortunately, some of the tables I am trying to process contain cells where the following text string appears:

789.82 ± 10.28

For this to work I need two things:

How do I handle "weird" symbols such as ±, and how do I remove the part of the string containing ± and everything to the right of it?

Currently I get an error like: SyntaxError: Non-ASCII character '\xc2' in file ......

Thank you for your help

[edit]:

# data retrieval from HTML files from DETHERM
# -*- coding: utf8 -*-

import sys,os,re
from BeautifulSoup import BeautifulSoup


sys.path.insert(0, os.getcwd())

raw_data = open('download.php.html','r')
soup = BeautifulSoup(raw_data)


for numdiv in soup.findAll('div', {"id" : "sec"}):
    currenttable = numdiv.find('table',{"class" : "data"})
    if currenttable:
        numrow=0
        for row in currenttable.findAll('td', {"class" : "dataHead"}):
            numrow=numrow+1

        for col in currenttable.findAll('td'):
            col2 = ''.join(col.findAll(text=True))
            # index() raises ValueError when '±' is absent, so test membership first
            if '±' in col2:
                col2=col2[:col2.index('±')]
            print(col)
        print(numrow)
        ref=numdiv.find('a')
        niceref=''.join(ref.findAll(text=True))
        print(niceref)

This code now fails with the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)

Where did the ASCII reference come from?


You need to save your Python file encoded in UTF-8. Other than that, it's quite trivial:

>>> s = '789.82 ± 10.28'
>>> s[:s.index('±')]
'789.82 '
>>> s.partition('±')
('789.82 ', '±', ' 10.28')
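
If you want this as a reusable step for each table cell, something along these lines might work. It is only a sketch, assuming Python 3; the helper name `strip_uncertainty` and the constant `PLUS_MINUS` are made up for illustration. The ± sign is written as the escape `"\u00b1"`, so the example does not depend on how the source file is saved.

```python
# Sketch only: strip_uncertainty and PLUS_MINUS are hypothetical names.
PLUS_MINUS = "\u00b1"  # the ± sign, written as an escape so this file stays pure ASCII


def strip_uncertainty(cell_text, sep=PLUS_MINUS):
    """Return the text before `sep`, with surrounding whitespace removed.

    str.partition returns the whole string as its first element when `sep`
    is absent, so cells without a ± pass through unchanged (apart from strip).
    """
    value, _, _ = cell_text.partition(sep)
    return value.strip()


print(strip_uncertainty("789.82 \u00b1 10.28"))  # 789.82
print(strip_uncertainty("42.0"))                 # 42.0
```

Unlike `s.index('±')`, `partition` does not raise `ValueError` when the separator is missing, so no separate membership test is needed.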