Assuming that the given directory tree is of reasonable size: say an open source project like Twisted or Python, what is the fastest 开发者_运维知识库way to traverse and iterate over the absolute path of all files/directories inside that directory?
I want to do this from within Python. os.path.walk is slow. So I tried ls -lR and tree -fi. For a project with about 8337 files (including tmp, pyc, test, .svn files):
$ time tree -fi > /dev/null
real 0m0.170s
user 0m0.044s
sys 0m0.123s
$ time ls -lR > /dev/null
real 0m0.292s
user 0m0.138s
sys 0m0.152s
$ time find . > /dev/null
real 0m0.074s
user 0m0.017s
sys 0m0.056s
$
tree
appears to be faster than ls -lR
(though ls -R
is faster than tree
, but it does not give full paths). find
is the fastest.
Can anyone think of a faster and/or better approach? On Windows, I may simply ship a 32-bit binary tree.exe or ls.exe if necessary.
Update 1: Added find
Update 2: Why do I want to do this? ... I am trying to make a smart replacement for cd, pushd, etc.. and wrapper commands for other commands relying on passing paths (less, more, cat, vim, tail). The program will use file traversal occasionally to do that (eg: typing "cd sr grai pat lxml" would automatically translate to "cd src/pypm/grail/patches/lxml"). I won't be satisfied if this cd replacement took, say, half a second to run. See http://github.com/srid/pf
Your approach in pf is going to be hopelessly slow, even if os.path.walk
took no time at all. Doing a regex match containing 3 unbounded closures across all extant paths will kill you right there. Here is the code from Kernighan and Pike that I referenced, this is a proper algorithm for the task:
/* spname: return correctly spelled filename */
/*
* spname(oldname, newname) char *oldname, *newname;
* returns -1 if no reasonable match to oldname,
* 0 if exact match,
* 1 if corrected.
* stores corrected name in newname.
*/
#include <sys/types.h>
#include <sys/dir.h>
spname(oldname, newname)
char *oldname, *newname;
{
char *p, guess[DIRSIZ+1], best[DIRSIZ+1];
char *new = newname, *old = oldname;
for (;;) {
while (*old == '/') /* skip slashes */
*new++ = *old++;
*new = '\0';
if (*old == '\0') /* exact or corrected */
return strcmp(oldname,newname) != 0;
p = guess; /* copy next component into guess */
for ( ; *old != '/' && *old != '\0'; old++)
if (p < guess+DIRSIZ)
*p++ = *old;
*p = '\0';
if (mindist(newname, guess, best) >= 3)
return -1; /* hopeless */
for (p = best; *new = *p++; ) /* add to end */
new++; /* of newname */
}
}
mindist(dir, guess, best) /* search dir for guess */
char *dir, *guess, *best;
{
/* set best, return distance 0..3 */
int d, nd, fd;
struct {
ino_t ino;
char name[DIRSIZ+1]; /* 1 more than in dir.h */
} nbuf;
nbuf.name[DIRSIZ] = '\0'; /* +1 for terminal '\0' */
if (dir[0] == '\0') /* current directory */
dir = ".";
d = 3; /* minimum distance */
if ((fd=open(dir, 0)) == -1)
return d;
while (read(fd,(char *) &nbuf,sizeof(struct direct)) > 0)
if (nbuf.ino) {
nd = spdist(nbuf.name, guess);
if (nd <= d && nd != 3) {
strcpy(best, nbuf.name);
d = nd;
if (d == 0) /* exact match */
break;
}
}
close(fd);
return d;
}
/* spdist: return distance between two names */
/*
* very rough spelling metric:
* 0 if the strings are identical
* 1 if two chars are transposed
* 2 if one char wrong, added or deleted
* 3 otherwise
*/
#define EQ(s,t) (strcmp(s,t) == 0)
spdist(s, t)
char *s, *t;
{
while (*s++ == *t)
if (*t++ == '\0')
return 0; /* exact match */
if (*--s) {
if (*t) {
if (s[1] && t[1] && *s == t[1]
&& *t == s[1] && EQ(s+2, t+2))
return 1; /* transposition */
if (EQ(s+1, t+1))
return 2; /* 1 char mismatch */
}
if (EQ(s+1, t))
return 2; /* extra character */
}
if (*t && EQ(s, t+1))
return 2; /* missing character */
return 3;
}
Note: this code was written way before ANSI C, ISO C, or POSIX anything was even imagined when one read directory files raw. The approach of the code is far more useful than all the pointer slinging.
It would be hard to get much better than find
in performance, but the question is how much faster and why do you need it to be so fast? You claim that os.path.walk is slow, indeed, it is ~3 times slower on my machine over a tree of 16k directories. But then again, we're talking about the difference between 0.68 seconds and 1.9 seconds for Python.
If setting a speed record is your goal, you can't beat hard coded C which is completely 75% system call bound and you can't make the OS go faster. That said, 25% of the Python time is spent in system calls. What is it that you want to do with the traversed paths?
One solution you have not mentioned is 'os.walk'. I'm not sure it'd be any faster than os.path.walk, but it's objectively better.
You have not said what you're going to do with the list of directories when you have it, so it's hard to give more specific suggestions.
Although I doubt you have multiple read heads, here's how you can traverse a few million files (we've done 10M+ in a few minutes).
https://github.com/hpc/purger/blob/master/src/treewalk/treewalk.c
精彩评论