I am reading the contents of a webpage using BeautifulSoup. What I want is to grab just the <a> links whose href starts with http://. I know BeautifulSoup lets you search by attributes; I guess I am just having a syntax issue. I would imagine it would go something like this:
page = urllib2.urlopen("http://www.linkpages.com")
soup = BeautifulSoup(page)
for link in soup.findAll('a'):
    if link['href'].startswith('http://'):
        print link
But that returns:
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python26\lib\BeautifulSoup.py", line 598, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'
Any ideas? Thanks in advance.
EDIT
This isn't for any site in particular. The script gets the URL from the user, so internal link targets would be an issue; that's also why I only want the <a> tags from the pages. If I point it at www.reddit.com, it parses the first few links, then gets to this:
<a href="http://www.reddit.com/top/">top</a>
<a href="http://www.reddit.com/saved/">saved</a>
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python26\lib\BeautifulSoup.py", line 598, in __getitem__
    return self._getAttrMap()[key]
KeyError: 'href'
from BeautifulSoup import BeautifulSoup
import re
import urllib2

page = urllib2.urlopen("http://www.linkpages.com")
soup = BeautifulSoup(page)
for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
    print link
Do you possibly have some <a> tags without href attributes? Internal link targets, perhaps?
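Tag objects support a dict-style .get() accessor (in both the old BeautifulSoup module and the newer bs4 package), which returns None instead of raising KeyError. A minimal sketch of that approach, using bs4 and a made-up inline snippet in place of a real page:

```python
from bs4 import BeautifulSoup

# A named internal anchor has no href attribute at all.
soup = BeautifulSoup(
    '<a href="http://example.com/">out</a><a name="top">anchor</a>',
    'html.parser')

hrefs = []
for link in soup.find_all('a'):
    href = link.get('href')  # None for the bare anchor, no KeyError
    if href and href.startswith('http://'):
        hrefs.append(href)

print(hrefs)  # ['http://example.com/']
```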
Please give us an idea of what you're parsing here - as Andrew points out, it seems likely that there are some anchor tags without associated hrefs.
If you really want to ignore them, you could wrap the lookup in a try block and catch the error with:

except KeyError:
    pass

But that has its own issues.
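That pattern looks roughly like this (a sketch using bs4 and an illustrative inline snippet; note that the bare except KeyError silently skips every anchor without an href, which is why it can mask other problems):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    '<a href="http://example.com/">out</a><a name="top">anchor</a>',
    'html.parser')

found = []
for link in soup.find_all('a'):
    try:
        href = link['href']  # raises KeyError when there is no href
    except KeyError:
        continue  # skip anchors without an href entirely
    if href.startswith('http://'):
        found.append(href)

print(found)  # ['http://example.com/']
```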
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.reddit.com'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

with open('Links.txt', 'w') as f:
    for item in soup.find_all('a'):
        if 'href' in item.attrs:
            f.write(item.attrs['href'] + ',\n')
A less efficient solution.