Python regex on list

Developer https://www.devze.com 2022-12-31 12:27 Source: Internet

I am trying to build a parser and save the results as an XML file, but I am running into problems.

Would you experts please have a look at my code?

Traceback: TypeError: expected string or buffer

import urllib2, re
from xml.dom.minidom import Document
from BeautifulSoup import BeautifulSoup as bs
osc = open('OSCTEST.html','r')
oscread = osc.read()
soup=bs(oscread)
doc = Document()
root = doc.createElement('root')
doc.appendChild(root)
countries = doc.createElement('countries')
root.appendChild(countries)
findtags1 = re.compile ('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>', re.DOTALL |  re.IGNORECASE).findall(soup)
findtags2 = re.compile ('<span class="content_text">(.*?)</span>', re.DOTALL |  re.IGNORECASE).findall(soup)
for header in findtags1:
    title_elem = doc.createElement('title')
    countries.appendChild(title_elem)
    header_elem = doc.createTextNode(header)
    title_elem.appendChild(header_elem)
for item in findtags2:
    art_elem = doc.createElement('artikel')
    countries.appendChild(art_elem)
    s = item.replace('<P>','')
    t = s.replace('</P>','')
    text_elem = doc.createTextNode(t)
    art_elem.appendChild(text_elem)    

print doc.toprettyxml()

It's good that you're trying to use BeautifulSoup to parse HTML, but this won't work:

re.compile('<h1 class="title metadata_title content_perceived_text(.*?)`</h1>',
           re.DOTALL | re.IGNORECASE).findall(soup)

You're trying to parse a BeautifulSoup object using a regular expression. Instead you should be using the findAll method on the soup, like this:

regex = re.compile('^title metadata_title content_perceived_text', re.IGNORECASE)
for tag in soup.findAll('h1', attrs = { 'class' : regex }):
    print tag.contents
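
The TypeError from the question can be reproduced without BeautifulSoup at all. The sketch below is only an illustration: FakeSoup is a hypothetical stand-in for a parsed document object and is not part of any library. It shows that a compiled pattern's findall accepts only strings, and that passing str() of the object avoids the error (at the cost of treating the document as plain text again).

```python
import re

class FakeSoup(object):
    """Hypothetical stand-in for a parsed document object (not a real library class)."""
    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup

pattern = re.compile('<span class="content_text">(.*?)</span>',
                     re.DOTALL | re.IGNORECASE)
soup_like = FakeSoup('<span class="content_text">Hello</span>')

try:
    pattern.findall(soup_like)   # not a string, so re raises TypeError
    raised = False
except TypeError:
    raised = True

# Converting to a string first gives re something it can search.
matches = pattern.findall(str(soup_like))
print(raised)
print(matches)
```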

If you do actually want to parse the document as text with a regular expression then don't use BeautifulSoup - just read the document into a string and parse that. But I'd suggest you take the time to learn how BeautifulSoup works as this is the preferred way to do it. See the documentation for more details.
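
For the text-only route, a minimal sketch using just the standard library might look like the following. The sample markup in html is made up for illustration (in the question it would come from reading OSCTEST.html), and the title pattern is an assumption about what the question intended, since the pattern as posted contains a stray backtick.

```python
import re
from xml.dom.minidom import Document

# Assumed sample markup; in the question this would be open('OSCTEST.html').read()
html = (
    '<h1 class="title metadata_title content_perceived_text">Norway</h1>'
    '<span class="content_text"><P>First article</P></span>'
    '<span class="content_text"><P>Second article</P></span>'
)

# Assumed patterns, applied to a plain string rather than a soup object.
title_re = re.compile(
    r'<h1 class="title metadata_title content_perceived_text[^>]*>(.*?)</h1>',
    re.DOTALL | re.IGNORECASE)
text_re = re.compile(
    r'<span class="content_text">(.*?)</span>',
    re.DOTALL | re.IGNORECASE)

# Build the same root/countries skeleton as in the question.
doc = Document()
root = doc.createElement('root')
doc.appendChild(root)
countries = doc.createElement('countries')
root.appendChild(countries)

titles = title_re.findall(html)
for header in titles:
    title_elem = doc.createElement('title')
    title_elem.appendChild(doc.createTextNode(header))
    countries.appendChild(title_elem)

for item in text_re.findall(html):
    art_elem = doc.createElement('artikel')
    # Strip the paragraph tags, as the question's code does.
    text = item.replace('<P>', '').replace('</P>', '')
    art_elem.appendChild(doc.createTextNode(text))
    countries.appendChild(art_elem)

xml_text = doc.toprettyxml()
print(xml_text)
```

This keeps the question's XML-building logic and only changes what findall is given. Regular expressions remain fragile against real-world HTML, which is why the findAll approach above is the better default.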
