I am trying to build a parser and save the results as an XML file, but I am running into problems.
Could you please have a look at my code?
Traceback: TypeError: expected string or buffer
import urllib2, re
from xml.dom.minidom import Document
from BeautifulSoup import BeautifulSoup as bs
osc = open('OSCTEST.html','r')
oscread = osc.read()
soup=bs(oscread)
doc = Document()
root = doc.createElement('root')
doc.appendChild(root)
countries = doc.createElement('countries')
root.appendChild(countries)
findtags1 = re.compile ('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>', re.DOTALL | re.IGNORECASE).findall(soup)
findtags2 = re.compile ('<span class="content_text">(.*?)</span>', re.DOTALL | re.IGNORECASE).findall(soup)
for header in findtags1:
    title_elem = doc.createElement('title')
    countries.appendChild(title_elem)
    header_elem = doc.createTextNode(header)
    title_elem.appendChild(header_elem)
for item in findtags2:
    art_elem = doc.createElement('artikel')
    countries.appendChild(art_elem)
    s = item.replace('<P>','')
    t = s.replace('</P>','')
    text_elem = doc.createTextNode(t)
    art_elem.appendChild(text_elem)
print doc.toprettyxml()
It's good that you're trying to use BeautifulSoup to parse HTML, but this won't work:
re.compile('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>',
re.DOTALL | re.IGNORECASE).findall(soup)
You're trying to run a regular expression over a BeautifulSoup object, but findall expects a string, which is why you get the TypeError. Instead you should be using the findAll method on the soup, like this:
regex = re.compile('^title metadata_title content_perceived_text', re.IGNORECASE)
for tag in soup.findAll('h1', attrs = { 'class' : regex }):
    print tag.contents
If you really do want to parse the document as text with a regular expression, then don't use BeautifulSoup; just read the document into a string and parse that. But I'd suggest you take the time to learn how BeautifulSoup works, as this is the preferred way to do it. See the documentation for more details.
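For the plain-string route, here is a minimal sketch of what that might look like. It uses only the standard library. The sample HTML, the build_xml helper, and the `[^>]*>` I added to close the h1 tag in the pattern are my own assumptions, since I can't see your OSCTEST.html; adjust them to your actual markup.

```python
import re
from xml.dom.minidom import Document

# Hypothetical stand-in for the contents of OSCTEST.html.
html = (
    '<h1 class="title metadata_title content_perceived_text">Norway</h1>'
    '<span class="content_text"><P>First article</P></span>'
)

# These patterns are applied to a plain string, not to a BeautifulSoup object.
# The "[^>]*>" part is an assumption about how the opening h1 tag ends.
title_re = re.compile(
    r'<h1 class="title metadata_title content_perceived_text[^>]*>(.*?)</h1>',
    re.DOTALL | re.IGNORECASE)
text_re = re.compile(
    r'<span class="content_text">(.*?)</span>',
    re.DOTALL | re.IGNORECASE)


def build_xml(html_text):
    """Build a minidom Document from the headers and articles in html_text."""
    doc = Document()
    root = doc.createElement('root')
    doc.appendChild(root)
    countries = doc.createElement('countries')
    root.appendChild(countries)

    # One <title> element per matched header.
    for header in title_re.findall(html_text):
        title_elem = doc.createElement('title')
        title_elem.appendChild(doc.createTextNode(header.strip()))
        countries.appendChild(title_elem)

    # One <artikel> element per matched span, with the <P> tags removed.
    for item in text_re.findall(html_text):
        art_elem = doc.createElement('artikel')
        text = item.replace('<P>', '').replace('</P>', '')
        art_elem.appendChild(doc.createTextNode(text.strip()))
        countries.appendChild(art_elem)

    return doc


doc = build_xml(html)
print(doc.toprettyxml())
```

With your real file you would pass oscread (the string returned by osc.read()) to build_xml instead of the sample, and write doc.toprettyxml() to a file opened for writing to save the result.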