I have written myself a web crawler using simplehtmldom, and have got the crawl process working quite nicely. It crawls the start page, adds all links to a database table, sets a session pointer, and meta-refreshes the page to carry on to the next page. That keeps going until it runs out of links.
That works fine; however, the crawl time for larger websites is obviously pretty tedious. I wanted to speed things up a bit, and possibly make it a cron job.
Any ideas on making it as quick and efficient as possible, other than raising the memory limit / execution time?
Looks like you're running your script in a web browser. Consider running it from the command line instead. You can then execute multiple scripts that crawl different pages at the same time. That should speed things up.
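A minimal sketch of launching several worker processes from one controller script. In a real setup each worker would be your crawler script (e.g. `php crawler.php`); here the spawned command is an inline `php -r` no-op so the sketch is self-contained:

```php
<?php
// Launch several PHP worker processes concurrently via popen().
// In practice, replace the inline php -r command with your crawler
// script, e.g. 'php crawler.php' (hypothetical name).
$workers = [];
for ($i = 0; $i < 4; $i++) {
    $cmd = 'php -r ' . escapeshellarg('usleep(1000); echo "worker done\n";');
    $workers[$i] = popen($cmd, 'r'); // process starts immediately
}
// Collect output; all four workers run concurrently while we read.
foreach ($workers as $w) {
    echo fread($w, 8192);
    pclose($w);
}
```

Each worker then claims its own slice of the link queue, so four workers can roughly quarter the wall-clock crawl time when the bottleneck is network I/O.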
Memory should not be a problem for a crawler.
Once you are done with one page and have written all relevant data to the database, you should get rid of all variables you created for that page.
The memory usage after 100 pages should be the same as after 1 page. If this is not the case, find out why.
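A sketch of checking that per-page memory stays flat. The parsing and database calls are left as comments; note that simplehtmldom objects specifically need `$dom->clear()` before `unset()` because their internal references otherwise keep the DOM alive:

```php
<?php
// Sketch: per-page work should free everything it allocates.
function crawl_one(string $url): void {
    $html = str_repeat('<a href="#">x</a>', 1000); // stand-in for a fetched page
    // ... parse $html with simplehtmldom, write links to the DB ...
    // For simplehtmldom: $dom->clear(); before unset($dom);
    unset($html); // release the document before the next page
}

$before = memory_get_usage();
for ($i = 0; $i < 100; $i++) {
    crawl_one("http://example.com/page$i"); // illustrative URL
}
gc_collect_cycles(); // break any leftover reference cycles
$after = memory_get_usage();
// Usage after 100 pages should be close to usage after 1 page.
echo ($after - $before) < 1024 * 1024 ? "memory is flat\n" : "possible leak\n";
```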
You can split up the work between different processes: parsing a page usually does not take as long as loading it, so you can write all the links you find to a database and have multiple other processes that just download the documents to a temp directory. If you do this, you must ensure that
- no link is downloaded by two workers.
- your processes wait for new links if there are none.
- temp files are removed after each scan.
- the download processes stop when you run out of links. You can achieve this by setting a "kill flag"; this can be a file with a special name or an entry in the database.
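The points above can be sketched as a worker loop. This assumes a SQLite queue table `links(url, status)` and a kill flag stored as a row in a `flags` table; all table and column names are illustrative, and the actual download is reduced to collecting the URL:

```php
<?php
// Sketch of a download worker, using an in-memory SQLite queue.
$db = new PDO('sqlite::memory:');
$db->exec("CREATE TABLE links (url TEXT PRIMARY KEY, status TEXT DEFAULT 'new')");
$db->exec("CREATE TABLE flags (name TEXT PRIMARY KEY)");
$db->exec("INSERT INTO links (url) VALUES ('http://example.com/a'), ('http://example.com/b')");

function claim_link(PDO $db): ?string {
    // Claim one link inside a transaction so no two workers download it.
    $db->beginTransaction();
    $url = $db->query("SELECT url FROM links WHERE status = 'new' LIMIT 1")
              ->fetchColumn();
    if ($url !== false) {
        $stmt = $db->prepare("UPDATE links SET status = 'taken' WHERE url = ?");
        $stmt->execute([$url]);
        $db->commit();
        return $url;
    }
    $db->commit();
    return null;
}

$downloaded = [];
// Stop as soon as the kill flag appears in the database.
while (!$db->query("SELECT 1 FROM flags WHERE name = 'kill'")->fetchColumn()) {
    $url = claim_link($db);
    if ($url === null) {
        // Queue empty: set the kill flag here to end the sketch; a real
        // worker would sleep and re-check, since the parser may add more.
        $db->exec("INSERT INTO flags (name) VALUES ('kill')");
        continue;
    }
    $downloaded[] = $url; // real code: fetch $url into a temp directory
}
echo count($downloaded), " links downloaded\n";
```

The transaction around the `SELECT` + `UPDATE` is what keeps two workers from grabbing the same link; with a server database you would use `SELECT ... FOR UPDATE` or an atomic `UPDATE ... RETURNING` instead.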