
Most optimized way to store crawler states?


I'm currently writing a web crawler (using the python framework scrapy).

Recently I had to implement a pause/resume system.

The solution I implemented is of the simplest kind: basically, it stores links when they get scheduled and marks them as 'processed' once they actually are.

Thus, I'm able to fetch those links when resuming the spider (obviously a little more is stored than just a URL: the depth value, the domain the link belongs to, etc.), and so far everything works well.

Right now, I've just been using a MySQL table to handle those storage actions, mostly for fast prototyping.
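
For illustration, here is a minimal sketch of the kind of table and queries described above. It uses sqlite3 from the standard library as a stand-in for MySQL so it runs as-is; the table and column names are my own assumptions, not the actual schema.

    import sqlite3

    # Stand-in for the MySQL table described above; names are assumptions.
    conn = sqlite3.connect("crawler_state.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS scheduled_links (
            url       TEXT PRIMARY KEY,
            depth     INTEGER NOT NULL,
            domain    TEXT NOT NULL,
            processed INTEGER NOT NULL DEFAULT 0
        )
    """)

    def schedule(url, depth, domain):
        # Record a link as soon as it gets scheduled.
        conn.execute(
            "INSERT OR IGNORE INTO scheduled_links (url, depth, domain) VALUES (?, ?, ?)",
            (url, depth, domain))
        conn.commit()

    def mark_processed(url):
        # Flag the link once it has actually been crawled.
        conn.execute("UPDATE scheduled_links SET processed = 1 WHERE url = ?", (url,))
        conn.commit()

    def pending_links():
        # On resume, fetch everything that was scheduled but never processed.
        return conn.execute(
            "SELECT url, depth, domain FROM scheduled_links WHERE processed = 0").fetchall()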

Now I'd like to know how I could optimize this, since I believe a database shouldn't be the only option available here. By optimize, I mean using a very simple and lightweight system that can still handle a great amount of data written in a short time.

For now, it should be able to handle the crawling of a few dozen domains, which means storing a few thousand links a second ...

Thanks in advance for suggestions


The fastest way of persisting things is typically to just append them to a log -- such a totally sequential access pattern minimizes disk seeks, which are typically the largest part of the time costs for storage. Upon restarting, you re-read the log and rebuild the memory structures that you were also building on the fly as you were appending to the log in the first place.

Your specific application could be further optimized since it doesn't necessarily require 100% reliability -- if you miss writing a few entries due to a sudden crash, ah well, you'll just crawl them again. So, your log file can be buffered and doesn't need to be obsessively fsync'ed.
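
As a rough illustration of that idea (not the answerer's actual code), a buffered append-only log plus a replay step on startup might look something like this; the record format and file name are assumptions.

    import json

    LOG_PATH = "crawl_state.log"   # assumption: one JSON record per line

    seen = set()        # in-memory structure rebuilt from the log
    pending = {}        # url -> metadata for links scheduled but not yet processed

    def replay_log():
        # On restart, re-read the log and rebuild the in-memory structures.
        try:
            with open(LOG_PATH) as f:
                for line in f:
                    rec = json.loads(line)
                    if rec["event"] == "scheduled":
                        pending[rec["url"]] = {"depth": rec["depth"], "domain": rec["domain"]}
                        seen.add(rec["url"])
                    elif rec["event"] == "processed":
                        pending.pop(rec["url"], None)
        except FileNotFoundError:
            pass  # first run, nothing to replay

    # Buffered writer: no fsync, so a crash may lose the last few entries,
    # which is acceptable here -- those links simply get crawled again.
    log_file = open(LOG_PATH, "a", buffering=1024 * 1024)

    def log_event(event, url, **extra):
        log_file.write(json.dumps({"event": event, "url": url, **extra}) + "\n")

Writes stay purely sequential, and the only non-trivial cost on restart is one pass over the log.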

I imagine the search structure would also fit comfortably in memory (if it's only for a few dozen sites you could probably just keep a set with all their URLs, no need for bloom filters or anything fancy) -- if it didn't, you might have to keep in memory only a set of recent entries, and periodically dump that set to disk (e.g., merging all entries into a Berkeley DB file); but I'm not going into excruciating details about these options since it does not appear you will require them.
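
If the URL set ever did outgrow memory, a hedged sketch of the "keep recent entries in memory, periodically merge to disk" approach could use the standard-library dbm module as a lightweight stand-in for a Berkeley DB file; the file name and flush threshold are assumptions.

    import dbm

    DB_PATH = "seen_urls.db"     # assumption: on-disk store of URLs already seen
    FLUSH_THRESHOLD = 10_000     # assumption: merge to disk every 10k new URLs

    recent = set()               # only the most recent entries stay in memory

    def already_seen(url):
        if url in recent:
            return True
        with dbm.open(DB_PATH, "c") as db:
            return url.encode() in db

    def remember(url):
        recent.add(url)
        if len(recent) >= FLUSH_THRESHOLD:
            # Periodically merge the in-memory set into the on-disk store.
            with dbm.open(DB_PATH, "c") as db:
                for u in recent:
                    db[u.encode()] = b"1"
            recent.clear()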


There was a talk at PyCon 2009 that you may find interesting, Precise state recovery and restart for data-analysis applications by Bill Gribble.

Another quick way to save your application state may be to use pickle to serialize your application state to disk.
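
A minimal sketch of that approach, assuming the state is just a picklable dict of pending links (the file name and structure are assumptions):

    import pickle

    STATE_PATH = "crawler_state.pkl"   # assumption

    def save_state(state):
        # Serialize the whole application state to disk in one go.
        with open(STATE_PATH, "wb") as f:
            pickle.dump(state, f)

    def load_state():
        # Restore the state on resume; start fresh if nothing was saved yet.
        try:
            with open(STATE_PATH, "rb") as f:
                return pickle.load(f)
        except FileNotFoundError:
            return {"pending": {}, "seen": set()}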
