How to get span value using Python, BeautifulSoup

Developer https://www.devze.com 2023-02-28 10:43 Source: Internet
I am using BeautifulSoup for the first time and trying to collect several pieces of data, such as email, phone number, and mailing address, from a soup object.

Using regular expressions, I can identify the email address. My code to find the email is:

import re

def get_email(link):
    mail_list = []
    for i in link:
        # Convert each element back to an HTML string and match it with a regex
        a = str(i)
        email_pattern = re.compile(r'<a\s+href="mailto:([a-zA-Z0-9._@]*)">', re.IGNORECASE)
        ik = re.findall(email_pattern, a)
        if len(ik) == 1:
            mail_list.append(i)
        else:
            pass
    # Cut the href value out of the first matching element
    s_email = str(mail_list[0]).split('<a href="')
    t_email = str(s_email[1]).split('">')
    print t_email[0]

Now I also need to collect the phone number, mailing address, and website URL. I think there must be an easy way to find that specific data in BeautifulSoup.

A sample html page is as below:

<ul>
    <li>
        <span>Email:</span>
        <a href="mailto:abc@gmail.com">Message Us</a>
    </li>
    <li>
        <span>Website:</span>
        <a target="_blank" href="http://www.abcl.com">Visit Our Website</a>
    </li>
    <li>
        <span>Phone:</span>
        (123)456-789
    </li>
</ul>

Using BeautifulSoup, I am trying to collect the values that follow the Email, Website, and Phone spans.

Thanks in advance.


The most obvious problem with your code is that you're turning the object representing the link back into HTML and then parsing it with a regular expression again - that ignores much of the point of using BeautifulSoup in the first place. You might need to use a regular expression to deal with the contents of the href attribute, but that's it. Also, the else: pass is unnecessary - you can just leave it out entirely.
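To illustrate that last point, here is a minimal sketch of a regular expression applied only to the href string, not to the whole element's HTML. It assumes Python 3; the helper name `email_from_href` and the pattern `MAILTO_RE` are hypothetical and not part of BeautifulSoup.

```python
import re

# Hypothetical pattern: capture the address after "mailto:", stopping at any
# "?subject=..." query part that some links carry.
MAILTO_RE = re.compile(r"^mailto:([^?]+)", re.IGNORECASE)

def email_from_href(href):
    """Return the address in a mailto: href, or None for any other URL."""
    match = MAILTO_RE.match(href)
    return match.group(1) if match else None

print(email_from_href("mailto:abc@gmail.com"))
print(email_from_href("http://www.abcl.com"))
```

The href value itself would come straight from the parsed tag, so no HTML ever has to be rebuilt as a string.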

Here's some code that does something like what you want, and might be a useful starting point:

from BeautifulSoup import BeautifulSoup
import re

# Assuming that html is your input as a string:
soup = BeautifulSoup(html)

all_contacts = []

def mailto_link(e):
    '''Return the email address if the element is a mailto link,
    otherwise return None'''
    if e.name != 'a':
        return None
    for key, value in e.attrs:
        if key == 'href':
            m = re.search('mailto:(.*)',value)
            if m:
                return m.group(1)
    return None

for ul in soup.findAll('ul'):
    contact = {}
    for li in ul.findAll('li'):
        s = li.find('span')
        if not (s and s.string):
            continue
        if s.string == 'Email:':
            a = li.find(mailto_link)
            if a:
                contact['email'] = mailto_link(a)
        elif s.string == 'Website:':
            a = li.find('a')
            if a:
                contact['website'] = a['href']
        elif s.string == 'Phone:':
            contact['phone'] = unicode(s.nextSibling).strip()
    all_contacts.append(contact)

print all_contacts

That will produce a list with one dictionary per contact found; in this case, that will just be:

[{'website': u'http://www.abcl.com', 'phone': u'(123)456-789', 'email': u'abc@gmail.com'}]
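The code above targets Python 2 and BeautifulSoup 3. For anyone on a newer setup, the same approach might look roughly like the sketch below. It assumes Python 3 with the `bs4` package (BeautifulSoup 4) installed; the function name `extract_contacts` is made up for illustration, and the sample markup is the one from the question.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = """
<ul>
    <li>
    <span>Email:</span>
    <a href="mailto:abc@gmail.com">Message Us</a>
    </li>
    <li>
    <span>Website:</span>
    <a target="_blank" href="http://www.abcl.com">Visit Our Website</a>
    </li>
    <li>
    <span>Phone:</span>
    (123)456-789
    </li>
</ul>
"""

def extract_contacts(markup):
    """Return one dict per <ul>, with 'email', 'website' and 'phone' keys."""
    soup = BeautifulSoup(markup, "html.parser")
    contacts = []
    for ul in soup.find_all("ul"):
        contact = {}
        # Only look at the <li> items inside this particular <ul>.
        for li in ul.find_all("li"):
            span = li.find("span")
            if span is None:
                continue
            label = span.get_text(strip=True)
            if label == "Email:":
                a = li.find("a", href=True)
                if a and a["href"].startswith("mailto:"):
                    contact["email"] = a["href"][len("mailto:"):]
            elif label == "Website:":
                a = li.find("a", href=True)
                if a:
                    contact["website"] = a["href"]
            elif label == "Phone:":
                # The number is the bare text node right after the <span>.
                contact["phone"] = str(span.next_sibling or "").strip()
        contacts.append(contact)
    return contacts

print(extract_contacts(html))
```

Here the attribute is read with `a["href"]` and the prefix is removed with plain string slicing, so no regular expression is needed for this sample. A real page with messier links may still call for one.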