I was wondering if there is any way I could extract domain names from the body of email messages in python. I was thinking of using regular expressions, but I am not too great in writing them, and was wondering if someone could help me out. Here's a sample email body:
<tr><td colspan="5"><font face="verdana" size="4" color="#999999"><b>Resource Links - </b></font><span class="snv"><a href="http://clk.about.com/?zi=4/RZ">Get Listed Here</a></span></td><td class="snv" valign="bottom" align="right"><a href="http://sprinks.about.com/faq/index.htm">What Is This?</a></td></tr><tr><td colspan="6" bgcolor="#999999"><img height="1" width="1"></td></tr><tr><td colspan="6"><map name="sgmap"><area href="http://x.about.com/sg/r/3412.htm?p=0&ref=fooddrinksl_sg" shape="rect" coords="0, 0, 600, 20"><area href="http://x.about.com/sg/r/3412.htm?p=1&ref=fooddrinksl_sg" shape="rect" coords="0, 55, 600, 75"><area href="http://x.about.com/sg/r/3412.htm?p=2&ref=fooddrinksl_sg" shape="rect" coords="0, 110, 600, 130"></map><img border="0" src="http://z.about.com/sg/sg.gif?cuni=3412" usemap="#sgmap" width="600" height="160"></td></tr><tr><td colspan="6"> </td></tr>
<tr><td colspan="6"><a name="d"><font face="verdana" size="4" color="#cc0000"><b>Top Picks - </b></font></a><a href="http://slclk.about.com/?zi=1/BAO" class="srvb">Fun Gift Ideas</a><span class="snv">
from your <a href="http://chinesefood.about.com">Chinese Cuisine</a> Guide</span></td></tr><tr><td colspan="6" bgcolor="cc0000"><img height="1" width="1"></td></tr><tr><td colspan="6" class="snv">
So I would need "clk.about.com" etc.
Thanks!
The cleanest way to do it is with cssselect from lxml.html and urlparse. Here is how:
from lxml import html
from urlparse import urlparse

doc = html.fromstring(html_data)
links = doc.cssselect("a")        # all anchor elements
domains = set()
for link in links:
    try:
        href = link.attrib['href']
    except KeyError:
        continue                  # skip anchors without an href
    parsed = urlparse(href)
    domains.add(parsed.netloc)    # netloc is the domain part
print domains
First you load the HTML data into a document object with fromstring. You query the document for links using standard CSS selectors with cssselect. You traverse the links, grab their URLs with .attrib['href'], and skip those that don't have one (except KeyError: continue). Parse each URL into a named tuple with urlparse and put the domain (netloc) into a set. Voila!
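On Python 3 the same approach works, except the urlparse function lives in urllib.parse instead of the urlparse module. A minimal sketch of just the netloc extraction step, using one URL from the sample:

```python
from urllib.parse import urlparse  # Python 3 home of urlparse

# urlparse splits a URL into named parts; netloc holds the domain.
parsed = urlparse("http://clk.about.com/?zi=4/RZ")
print(parsed.netloc)  # clk.about.com
```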
Avoid regular expressions when good libraries are available: regexes are hard to maintain, and a no-go for HTML parsing.
UPDATE: The href filter suggested in the comments is very helpful; with it, the code looks like this:
from lxml import html
from urlparse import urlparse

doc = html.fromstring(html_data)
links = doc.cssselect("a[href]")  # only anchors that have an href
domains = set()
for link in links:
    href = link.attrib['href']
    parsed = urlparse(href)
    domains.add(parsed.netloc)
print domains
You don't need the try/except block, since the a[href] filter ensures you only pick up anchors that actually have an href attribute.
You can use HTMLParser from the Python standard library to get to certain parts of the document.
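A minimal sketch of that idea, using Python 3's html.parser (the class name and structure here are illustrative, not from the original answer):

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class DomainExtractor(HTMLParser):
    """Collects the domains of all href attributes on <a> and <area> tags."""
    def __init__(self):
        super().__init__()
        self.domains = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag in ("a", "area"):
            for name, value in attrs:
                if name == "href":
                    netloc = urlparse(value).netloc
                    if netloc:          # skip relative links with no domain
                        self.domains.add(netloc)

extractor = DomainExtractor()
extractor.feed('<a href="http://clk.about.com/?zi=4/RZ">Get Listed Here</a>'
               '<area href="http://x.about.com/sg/r/3412.htm">')
print(sorted(extractor.domains))  # ['clk.about.com', 'x.about.com']
```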
HTMLParser is the clean way to do it. If you want something quick and dirty, or just want to see what a moderately complex regex looks like, here's an example regex to find hrefs (off the top of my head, not tested):
r'<a\s+href="\w+://[^/"]+[^"]*">'
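To check what that pattern actually matches against the sample markup, here is a quick demo (the two capture groups for the scheme and domain are added here for illustration; they are not in the original pattern):

```python
import re

# Same pattern with capture groups added: (scheme) and (domain).
pattern = re.compile(r'<a\s+href="(\w+)://([^/"]+)[^"]*">')

m = pattern.search('<a href="http://clk.about.com/?zi=4/RZ">Get Listed Here</a>')
print(m.group(2))  # clk.about.com

# Limitation: the pattern requires '>' right after the closing quote, so
# anchors with more attributes after href (e.g. class="srvb") are missed.
print(pattern.search('<a href="http://slclk.about.com/?zi=1/BAO" class="srvb">'))  # None
```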
from lxml import etree
from StringIO import StringIO
from urlparse import urlparse
html = """<tr><td colspan="5"><font face="verdana" size="4" color="#999999"><b>Resource Links - </b></font><span class="snv"><a href="http://clk.about.com/?zi=4/RZ">Get Listed Here</a></span></td><td class="snv" valign="bottom" align="right"><a href="http://sprinks.about.com/faq/index.htm">What Is This?</a></td></tr><tr><td colspan="6" bgcolor="#999999"><img height="1" width="1"></td></tr><tr><td colspan="6"><map name="sgmap"><area href="http://x.about.com/sg/r/3412.htm?p=0&ref=fooddrinksl_sg" shape="rect" coords="0, 0, 600, 20"><area href="http://x.about.com/sg/r/3412.htm?p=1&ref=fooddrinksl_sg" shape="rect" coords="0, 55, 600, 75"><area href="http://x.about.com/sg/r/3412.htm?p=2&ref=fooddrinksl_sg" shape="rect" coords="0, 110, 600, 130"></map><img border="0" src="http://z.about.com/sg/sg.gif?cuni=3412" usemap="#sgmap" width="600" height="160"></td></tr><tr><td colspan="6"> </td></tr><tr><td colspan="6"><a name="d"><font face="verdana" size="4" color="#cc0000"><b>Top Picks - </b></font></a><a href="http://slclk.about.com/?zi=1/BAO" class="srvb">Fun Gift Ideas</a><span class="snv"> from your <a href="http://chinesefood.about.com">Chinese Cuisine</a> Guide</span></td></tr><tr><td colspan="6" bgcolor="cc0000"><img height="1" width="1"></td></tr><tr><td colspan="6" class="snv">"""
parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
r = tree.xpath("//a")
links = []
for i in r:
    try:
        links.append(i.attrib['href'])
    except KeyError:
        pass

for link in links:
    print urlparse(link)
From here on, the domain is available as netloc. The XPath is probably not the best here (someone please suggest an improvement), but it should suit your needs.
Given that you always have an http:// protocol specifier in front of the domains, this should work (txt is your example):
import re
[groups[0] for groups in re.findall(r'http://(\w+(\.\w+){1,})(/\w+)*', txt)]
The pattern for domains is not perfect, though.
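For instance, against two of the URLs from the sample (note that findall returns one tuple per match because the pattern has several groups, so the comprehension picks out the first group):

```python
import re

txt = ('<a href="http://clk.about.com/?zi=4/RZ">'
       '<area href="http://x.about.com/sg/r/3412.htm">')
matches = re.findall(r'http://(\w+(\.\w+){1,})(/\w+)*', txt)
domains = [groups[0] for groups in matches]
print(domains)  # ['clk.about.com', 'x.about.com']
```

One concrete flaw: \w does not match hyphens, so a domain like foo-bar.com would not match at all.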