Is there a way to make a web robot like websiteoutlook.com does? I need something that searches the internet for URLs only... I don't need links, descriptions, etc.
What is the best way to do this without getting too technical? I guess it could even be a cronjob that runs a PHP script grabbing URLs from Google, or is there a better way?
A simple example or a link to more information would be much appreciated.
I've just had a quick look at the site you mentioned - it appears to fetch info for one domain rather than crawl for URLs.
Anyway, you would write a script which takes a URL from a queue, fetches the page contents, parses out the URLs within, and adds those to the queue. Then add a starting URL to the queue and run the script from cron.
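A minimal sketch of that loop in PHP, assuming a placeholder seed URL, output file, and page limit (a real crawler should also honour robots.txt and stay polite with its request rate):

<?php
// crawl.php - minimal breadth-first URL collector (sketch only).
$queue   = ['http://example.com/'];   // seed URL (placeholder assumption)
$seen    = [];                        // URLs already fetched
$maxUrls = 500;                       // stop after this many pages (assumption)

while ($queue && count($seen) < $maxUrls) {
    $url = array_shift($queue);
    if (isset($seen[$url])) {
        continue;
    }
    $seen[$url] = true;

    // Fetch the page contents.
    $html = @file_get_contents($url);
    if ($html === false) {
        continue;
    }

    // Parse out absolute http(s) links and queue any we haven't seen yet.
    if (preg_match_all('#href=["\'](https?://[^"\'\s]+)["\']#i', $html, $m)) {
        foreach ($m[1] as $link) {
            if (!isset($seen[$link])) {
                $queue[] = $link;
            }
        }
    }

    sleep(1); // be polite: roughly one request per second
}

// Append the collected URLs to a plain text file.
file_put_contents('urls.txt', implode(PHP_EOL, array_keys($seen)) . PHP_EOL, FILE_APPEND);

You could then schedule it with a crontab entry such as `*/30 * * * * php /path/to/crawl.php` (the interval is just an example).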
Around 4 million unique URLs can be found at DMOZ.org. You are allowed to crawl the catalogue at a rate of no more than one page per second. As a crawler you can use site-downloading software like HTTrack (it has an option to comply with robots.txt rules). All you have to do then is parse the downloaded pages for URLs (and properly attribute the site afterwards).
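A rough sketch of that last parsing step in PHP, assuming the downloaded pages sit in a local directory named `mirror/` (the directory name and output file are assumptions, not anything HTTrack mandates):

<?php
// extract_urls.php - scan a directory of downloaded pages for URLs (sketch only).
$dir  = new RecursiveDirectoryIterator('mirror/');
$iter = new RecursiveIteratorIterator($dir);
$urls = [];

foreach ($iter as $file) {
    // Only look at downloaded HTML pages.
    if (!$file->isFile() || !preg_match('/\.html?$/i', $file->getFilename())) {
        continue;
    }
    $html = file_get_contents($file->getPathname());

    // Collect every absolute http(s) link found in the page.
    if (preg_match_all('#https?://[^\s"\'<>]+#i', $html, $m)) {
        foreach ($m[0] as $url) {
            $urls[$url] = true;   // de-duplicate via array keys
        }
    }
}

file_put_contents('urls.txt', implode(PHP_EOL, array_keys($urls)) . PHP_EOL);
echo count($urls) . " unique URLs written to urls.txt\n";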