So I am looking for a dynamic way to crawl a website and grab links from each page. I decided to experiment with BeautifulSoup. Two questions: First, how do I do this more dynamically than by nesting while statements that search for links? I want to get all the links from this site, but I don't want to keep adding nested while loops.
topLevelLinks = self.getAllUniqueLinks(baseUrl)
listOfLinks = list(topLevelLinks)

length = len(listOfLinks)
count = 0

while count < length:
    twoLevelLinks = self.getAllUniqueLinks(listOfLinks[count])
    twoListOfLinks = list(twoLevelLinks)
    twoCount = 0
    twoLength = len(twoListOfLinks)

    for twoLinks in twoListOfLinks:
        listOfLinks.append(twoLinks)

    count = count + 1

    while twoCount < twoLength:
        threeLevelLinks = self.getAllUniqueLinks(twoListOfLinks[twoCount])
        threeListOfLinks = list(threeLevelLinks)

        for threeLinks in threeListOfLinks:
            listOfLinks.append(threeLinks)

        twoCount = twoCount + 1

print '--------------------------------------------------------------------------------------'

# remove all duplicates
finalList = list(set(listOfLinks))
print finalList
My second question: is there any way to tell whether I got all the links from the site? Please forgive me, I am somewhat new to Python (a year or so) and I know some of my processes and logic might be childish, but I have to learn somehow. Mainly I just want to do this more dynamically than with nested while loops. Thanks in advance for any insight.
Spidering a web site and collecting all of its links is a common problem. If you Google "spider web site python" you will find libraries that do this for you. Here's one I found:
http://pypi.python.org/pypi/spider.py/0.5
Even better, Google found this question already asked and answered here on StackOverflow:
Anyone know of a good Python based web crawler that I could use?
If you are using BeautifulSoup, why don't you use the findAll() method? Basically, in my crawler I do:
self.soup = BeautifulSoup(HTMLcode)

for frm in self.soup.findAll(str('frame')):
    try:
        if not frm.has_key('src'):
            continue
        src = frm[str('src')]
        # rest of URL processing here
    except Exception, e:
        print 'Parser <frame> tag error: ', str(e)
That handles the frame tag; the same goes for "img src" and "a href" tags. I like the topic, though; maybe it's me who has something wrong here... Edit: there is of course a top-level instance, which saves the URLs and gets the HTML code from each link later...
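For example, here is a minimal sketch of the same pattern applied to anchor tags, assuming Python 2 and BeautifulSoup 3 to match the snippet above (the URL is just a placeholder):

import urllib2
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3; in bs4 the method is find_all()

# fetch one page and pull out every href on it
HTMLcode = urllib2.urlopen('http://example.com').read()
soup = BeautifulSoup(HTMLcode)

hrefs = []
for a in soup.findAll('a'):
    # skip anchors that have no href attribute at all
    if not a.has_key('href'):
        continue
    hrefs.append(a['href'])

print hrefs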
To answer your question from the comment, here's an example (it's in Ruby, but I don't know Python, and they are similar enough for you to follow along easily):
#!/usr/bin/env ruby

require 'open-uri'

hyperlinks = []
visited = []

# add all the hyperlinks from a url to the array of urls
def get_hyperlinks url
  links = []
  begin
    s = open(url).read
    s.scan(/(href|src)\w*=\w*[\",\']\S+[\",\']/) do
      link = $&.gsub(/((href|src)\w*=\w*[\",\']|[\",\'])/, '')
      link = url + link if link[0] == '/'

      # add to array if not already there
      links << link unless links.include? link
    end
  rescue
    puts 'Looks like we can\'t be here...'
  end
  links
end

print 'Enter a start URL: '
hyperlinks << gets.chomp
puts 'Off we go!'

while true
  break if hyperlinks.length == 0
  link = hyperlinks.shift
  next if visited.include? link
  visited << link
  puts "Connecting to #{link}..."
  links = get_hyperlinks(link)
  puts "Found #{links.length} links on #{link}..."
  hyperlinks = links + hyperlinks
  puts "Moving on with #{hyperlinks.length} links left...\n\n"
end
Sorry about the Ruby, but it's a better language :P and shouldn't be hard to adapt or, like I said, to understand.
1) In Python, we do not count the elements of a container and then use the count to index into it; we simply iterate over its elements, because that is what we want to do.
2) To handle multiple levels of links, we can use recursion.
def followAllLinks(self, from_where):
    for link in list(self.getAllUniqueLinks(from_where)):
        self.followAllLinks(link)
This does not handle cycles of links, but neither did the original approach. You can handle that by building a set of already-visited links as you go.
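For instance, here is a minimal sketch of that idea, assuming the same getAllUniqueLinks method from the question; the visited set is the only addition:

def followAllLinks(self, from_where, visited=None):
    if visited is None:
        visited = set()
    if from_where in visited:
        return visited              # already crawled this page; stop the cycle
    visited.add(from_where)
    for link in self.getAllUniqueLinks(from_where):
        self.followAllLinks(link, visited)
    return visited

The returned set doubles as the final list of unique links, so the list(set(...)) de-duplication step from the question is no longer needed.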
Use scrapy:
Scrapy is a fast high-level screen scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.
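For instance, a minimal sketch of a Scrapy spider that collects and follows links (this assumes a recent Scrapy version; the spider name and start URL are placeholders):

import scrapy

class LinkSpider(scrapy.Spider):
    name = 'links'
    start_urls = ['http://example.com']   # placeholder start page

    def parse(self, response):
        # extract every href on the page, record it, and follow it
        for href in response.css('a::attr(href)').extract():
            url = response.urljoin(href)
            yield {'link': url}                             # record the link
            yield scrapy.Request(url, callback=self.parse)  # follow it

You can run something like this with scrapy runspider link_spider.py -o links.json; Scrapy's built-in duplicate filter keeps it from requesting the same URL twice, which also addresses the "did I get them all" concern for pages reachable from the start URL.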