I have a filesystem with a few hundred million files (several petabytes) and I want to capture pretty much everything that stat would return and store it in some sort of database. Right now, we have an MPI program that is fed directory names from a central queue, and worker nodes that slam NFS (which can handle this without trying too hard) with stat calls. The worker nodes then hit Postgres to store the results.
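Roughly, each worker does something like this for every directory it pulls off the queue (a stripped-down sketch; the Postgres inserts and queue plumbing are elided):

```c
#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

/* Stat every entry in one directory pulled from the central queue.
 * In the real program the results are batched into Postgres and any
 * subdirectories are pushed back onto the queue. */
static void scan_dir(const char *dir)
{
    DIR *d = opendir(dir);
    if (d == NULL)
        return;

    struct dirent *ent;
    while ((ent = readdir(d)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;

        char path[4096];
        snprintf(path, sizeof(path), "%s/%s", dir, ent->d_name);

        struct stat st;
        if (lstat(path, &st) != 0)
            continue;               /* entry vanished or NFS error */

        /* ... INSERT path + fields of st into Postgres here ... */

        if (S_ISDIR(st.st_mode)) {
            /* ... push path back onto the central queue here ... */
        }
    }
    closedir(d);
}
```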
Although this works, it's very slow: a single run takes over 24 hours on a modern 30-node cluster.
Does anyone have any ideas for splitting up the directory structure among workers instead of having a centralized queue (I'm under the impression that partitioning it optimally is NP-hard)? Also, I've been considering replacing Postgres with something like MongoDB's autosharding with several routers (since Postgres is currently a huge bottleneck).
I'm pretty much just looking for ideas in general on how this setup could be improved.
Unfortunately, using something like the 2.6 kernel audit subsystem is probably out of the question, since it would be extremely difficult politically to get that running on every machine that hits this filesystem.
If it matters, every machine (several thousand) using this filesystem is running Linux 2.6.x.
The actual primary purpose of this is to find files that are older than a certain date so that we can delete them. We also want to collect general data on how the filesystem is being used.
Expanding on my comments.
Having the files in a central location is one of the biggest bottlenecks. If you can't optimize the filesystem access times in other ways, probably the best approach is to have one (or a couple) of workers doing the stat calls. You will not see performance improvements from adding more than a couple of workers, because they are all accessing the same filesystem.
Because of this, I think that putting the workers on the node where the filesystem is located (instead of accessing it through NFS) should give you a great performance boost.
On the other hand, the database writes can be optimized by changing your DB engine. As anticipated in the comments, the Redis key-value model is better suited for such a task (yes, it is pretty fast): you can use its hash type to store the result of each stat call, using the full pathname as the key.
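For example, with the hiredis C client (an assumption on my part; any client would do), one stat result could be written as a hash like this. The field names and the HMSET layout are just one reasonable choice:

```c
#include <sys/stat.h>
#include <hiredis/hiredis.h>

/* Store one lstat() result as a Redis hash keyed by the full pathname. */
static int store_stat(redisContext *c, const char *path)
{
    struct stat st;
    if (lstat(path, &st) != 0)
        return -1;

    redisReply *reply = redisCommand(c,
        "HMSET %s mode %u uid %u gid %u size %lld atime %lld mtime %lld ctime %lld",
        path,
        (unsigned)st.st_mode, (unsigned)st.st_uid, (unsigned)st.st_gid,
        (long long)st.st_size,
        (long long)st.st_atime, (long long)st.st_mtime, (long long)st.st_ctime);
    if (reply == NULL)
        return -1;              /* connection error */
    freeReplyObject(reply);
    return 0;
}
```

With hundreds of millions of files you would also want to pipeline these commands (redisAppendCommand/redisGetReply) rather than paying a network round trip per file.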
Additionally, Redis will support clustering in the near future.
We ended up creating our own solution for this (using Redis) and brought the run time down from about 24 hours to about 2.5 hours; the libcircle pattern is sketched below the links.
- http://github.com/hpc/libcircle for distributing the work.
- http://github.com/hpc/purger for the tools to control everything.
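For reference, libcircle's programming model is a pair of callbacks over a distributed, work-stealing queue. A minimal sketch of the pattern (not purger's actual code; the root path is illustrative, and the database write is left as a comment):

```c
#include <stdio.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>
#include <libcircle.h>

/* Seed the distributed queue with the filesystem root.
 * "/mnt/bigfs" is just an illustrative path. */
static void create_work(CIRCLE_handle *handle)
{
    char root[] = "/mnt/bigfs";
    handle->enqueue(root);
}

/* Pop one directory, stat its entries, and enqueue subdirectories.
 * libcircle spreads the queue across MPI ranks and steals work
 * between them, so there is no central queue to bottleneck on. */
static void process_work(CIRCLE_handle *handle)
{
    char dir[CIRCLE_MAX_STRING_LEN];
    handle->dequeue(dir);

    DIR *d = opendir(dir);
    if (d == NULL)
        return;

    struct dirent *ent;
    while ((ent = readdir(d)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;

        char path[CIRCLE_MAX_STRING_LEN];
        snprintf(path, sizeof(path), "%s/%s", dir, ent->d_name);

        struct stat st;
        if (lstat(path, &st) != 0)
            continue;

        /* ... write path + st to Redis here (e.g. an HMSET as above) ... */

        if (S_ISDIR(st.st_mode))
            handle->enqueue(path);
    }
    closedir(d);
}

int main(int argc, char **argv)
{
    CIRCLE_init(argc, argv, CIRCLE_DEFAULT_FLAGS);
    CIRCLE_cb_create(&create_work);
    CIRCLE_cb_process(&process_work);
    CIRCLE_begin();             /* runs until the global queue drains */
    CIRCLE_finalize();
    return 0;
}
```

The point of this structure is that the queue itself is distributed: each MPI rank keeps a local queue and libcircle rebalances work between ranks, which removes the central-queue bottleneck from the original design.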