
os.walk() caching/speeding up

https://www.devze.com 2023-01-12 07:26 Source: web

I have a prototype server[0] that's doing an os.walk()[1] for each query a client[0] makes.

I'm currently looking into ways of:

  • caching this data in memory,
  • speeding up queries, and
  • hopefully allowing for expansion into storing metadata and data persistence later on.

I find SQL complicated for tree structures, so I thought I would get some advice before actually committing to SQLite.

Are there any cross-platform, embeddable or bundle-able non-SQL databases that might be able to handle this kind of data?

  • I have a small (10k-100k files) list.
  • I have an extremely small amount of connections (maybe 10-20).
  • I want to be able to scale to handling metadata as well.

[0] the server and client are actually the same piece of software; this is a P2P application designed to share files over a local trusted network without a main server, using zeroconf for discovery and Twisted for pretty much everything else

[1] query time is currently 1.2s with os.walk() on 10,000 files

Here is the related function in my Python code that does the walking:

def populate(self, string):
    for name, sharedir in self.sharedirs.items():
        for root, dirs, files in os.walk(sharedir):
            for dir in dirs:
                if fnmatch.fnmatch(dir, string):
                    # os.sep instead of "/" keeps the split cross-platform
                    yield os.path.join(name, *os.path.join(root, dir)[len(sharedir):].split(os.sep))
            for file in files:
                if fnmatch.fnmatch(file, string):
                    yield os.path.join(name, *os.path.join(root, file)[len(sharedir):].split(os.sep))


You don't need to persist a tree structure -- in fact, your code is busily dismantling the natural tree structure of the directory tree into a linear sequence, so why would you want to restart from a tree next time?

Looks like what you need is just an ordered sequence:

i   X   result

where X, a string, names either a file or a directory (you treat them just the same), i is a monotonically increasing integer (to preserve the order), and the result column, also a string, is the result of os.path.join(name, *os.path.join(root, &c.

This is perfectly easy to put in a SQL table, of course!

To create the table the first time, just remove the if fnmatch.fnmatch guards (and the string argument) from your populate function, yield the dir or file before the os.path.join result, and use cursor.executemany to save the enumerated results (or use a self-incrementing column, your pick). To use the table, populate becomes essentially a:

select result from thetable where X LIKE '%foo%' order by i

where string is foo.
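A minimal sketch of the whole approach, using the stdlib sqlite3 module. The table name thetable comes from the answer; the column names x and result, the build_index helper, and the rowid-as-i shortcut are assumptions for illustration. Note that LIKE '%foo%' is substring matching, which is looser than fnmatch's glob semantics:

```python
import os
import sqlite3


def build_index(conn, name, sharedir):
    # Walk the share once, storing every directory and file name (X)
    # alongside its already-joined result path; sqlite's implicit rowid
    # plays the role of the ordering column i.
    conn.execute("CREATE TABLE IF NOT EXISTS thetable (x TEXT, result TEXT)")
    rows = []
    for root, dirs, files in os.walk(sharedir):
        for entry in dirs + files:
            joined = os.path.join(root, entry)[len(sharedir):]
            rows.append((entry, os.path.join(name, *joined.split(os.sep))))
    conn.executemany("INSERT INTO thetable (x, result) VALUES (?, ?)", rows)
    conn.commit()


def populate(conn, string):
    # Substring match instead of re-walking the disk on every query.
    for (result,) in conn.execute(
            "SELECT result FROM thetable WHERE x LIKE ? ORDER BY rowid",
            ("%" + string + "%",)):
        yield result
```

Rebuilding the table (or updating individual rows) when the share changes is left to the application; that is also where per-file metadata columns could be added later.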


I misunderstood the question at first, but I think I have a solution now (and sufficiently different from my other answer to warrant a new one). Basically, you do the normal query the first time you run walk on a directory, but you store the yielded values. The second time around, you just yield those stored values. I've wrapped the os.walk() call because it's short, but you could just as easily wrap your generator as a whole.

cache = {}

def os_walk_cache(dir):
    if dir in cache:
        # Replay the stored results instead of hitting the disk.
        for x in cache[dir]:
            yield x
    else:
        cache[dir] = []
        for x in os.walk(dir):
            cache[dir].append(x)
            yield x

I'm not sure of your memory requirements, but you may want to consider periodically cleaning out cache.
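One way to do that cleaning, sketched here with a hypothetical time-based expiry (the CACHE_TTL name and the 300-second value are arbitrary choices, not from the original code). This variant also only commits an entry to the cache once the walk has been fully consumed, so an abandoned generator cannot leave a half-filled entry behind:

```python
import os
import time

CACHE_TTL = 300  # seconds; arbitrary, tune to how fresh results must be

cache = {}  # maps dir -> (timestamp, list of os.walk() tuples)


def os_walk_cache(dir):
    now = time.time()
    entry = cache.get(dir)
    if entry is not None and now - entry[0] < CACHE_TTL:
        # Fresh enough: replay the stored walk results.
        for x in entry[1]:
            yield x
        return
    # Stale or missing: walk again, then store with a new timestamp.
    results = []
    for x in os.walk(dir):
        results.append(x)
        yield x
    cache[dir] = (now, results)
```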


Have you looked at MongoDB? What about mod_python? mod_python should allow you to do your os.walk() and just store the data in Python data structures, since the script is persistent between connections.
