I am trying to extract the address text from this value:
[<span class="street-address">
510 E Airline Way
</span>]
I have used this clean function to remove everything between < and >:
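import re

def clean(val):
    # Coerce non-string input to a string before applying the regexes
    if not isinstance(val, str):
        val = str(val)
    val = re.sub(r'<.*?>', '', val)   # drop anything between < and >
    val = re.sub(r'\s+', ' ', val)    # collapse runs of whitespace
    return val.strip()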
It produces [ 510 E Airline Way ].
I would like to extend the clean function so that it also removes the '[' and ']' characters. Basically, I just want to get "510 E Airline Way".
Does anyone know what I can add to the clean function?
Thank you.
Using re:
>>> import re
>>> s='[<span class="street-address">\n 510 E Airline Way\n </span>]'
>>> re.sub(r'\[|\]|\s*<[^>]*>\s*', '', s)
'510 E Airline Way'
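Since the question asks what to add to clean itself, here is a minimal sketch of how the bracket removal could be folded into that function. It assumes Python 3 and that every square bracket in the value is unwanted; the extra re.sub line is the only real change, and the sample string s is just the value from the question.

```python
import re

def clean(val):
    # Coerce non-string input so the regexes below always see a str
    if not isinstance(val, str):
        val = str(val)
    val = re.sub(r'<[^>]*>', '', val)   # drop anything between < and >
    val = re.sub(r'[\[\]]', '', val)    # drop literal '[' and ']' characters
    val = re.sub(r'\s+', ' ', val)      # collapse runs of whitespace
    return val.strip()

s = '[<span class="street-address">\n 510 E Airline Way\n </span>]'
print(clean(s))
```

If brackets inside the address text should be kept, replacing the second re.sub with val.strip('[] ') at the end would only trim them from the edges.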
Using BeautifulSoup:
>>> from BeautifulSoup import BeautifulSoup
>>> s='[<span class="street-address">\n 510 E Airline Way\n </span>]'
>>> b = BeautifulSoup(s)
>>> b.find('span').getText()
u'510 E Airline Way'
Using lxml:
>>> from lxml import html
>>> s='[<span class="street-address">\n 510 E Airline Way\n </span>]'
>>> h = html.document_fromstring(s)
>>> h.cssselect('span')[0].text.strip()
'510 E Airline Way'
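If installing a third-party parser is not an option, something similar can probably be done with the standard library's html.parser module. This is only a sketch, assuming Python 3; TextExtractor and text_only are hypothetical names made up for the example, not part of any library.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags, including the '[' and ']'
        self.parts.append(data)

def text_only(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # Collapse whitespace, then trim brackets and spaces from both ends
    text = ' '.join(''.join(parser.parts).split())
    return text.strip('[] ')

s = '[<span class="street-address">\n 510 E Airline Way\n </span>]'
print(text_only(s))
```

Note that strip('[] ') only trims the edges, so brackets in the middle of the text would be left alone.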