List files in a folder as a stream to begin processing immediately

Developer https://www.devze.com 2023-01-30 08:03 Source: the web
I have a folder with 1 million files in it.

I would like to begin processing immediately, while the files in this folder are still being listed, in Python or another scripting language.

The usual functions (os.listdir in Python, for example) are blocking, so my program has to wait for the whole listing to finish, which can take a long time.

What's the best way to list huge folders?


If convenient, change your directory structure; but if not, you can use ctypes to call opendir and readdir.

Here is a copy of that code; all I did was indent it properly, add the try/finally block, and fix a bug. You might have to debug it, particularly the struct layout.

Note that this code is not portable. You would need to use different functions on Windows, and I think the structs vary from Unix to Unix.

#!/usr/bin/env python3
"""
An equivalent of os.listdir, but as a generator using ctypes.

Not portable: the c_dirent layout below matches struct dirent on
64-bit Linux with glibc and must be adjusted for other platforms.
"""

import os
from ctypes import (CDLL, POINTER, Structure, c_char, c_char_p, c_int,
                    c_long, c_ubyte, c_ulong, c_ushort, get_errno)
from ctypes.util import find_library

class c_dir(Structure):
    """Opaque type for a directory stream, corresponds to DIR"""
    pass
c_dir_p = POINTER(c_dir)

class c_dirent(Structure):
    """Directory entry"""
    # FIXME platform-specific; these types match glibc on 64-bit Linux
    _fields_ = (
        ('d_ino', c_ulong), # inode number
        ('d_off', c_long), # offset to the next dirent
        ('d_reclen', c_ushort), # length of this record
        ('d_type', c_ubyte), # type of file; not supported by all file system types
        ('d_name', c_char * 256) # null-terminated filename
        )
c_dirent_p = POINTER(c_dirent)

# use_errno=True lets us read errno after a failed call
c_lib = CDLL(find_library("c"), use_errno=True)
opendir = c_lib.opendir
opendir.argtypes = [c_char_p]
opendir.restype = c_dir_p

# readdir is fine as long as each directory stream is used by one thread only
readdir = c_lib.readdir
readdir.argtypes = [c_dir_p]
readdir.restype = c_dirent_p

closedir = c_lib.closedir
closedir.argtypes = [c_dir_p]
closedir.restype = c_int

def listdir(path):
    """
    A generator to return the names of files in the directory passed in
    """
    dir_p = opendir(os.fsencode(path))  # the C API expects a byte string
    if not dir_p:
        # opendir returned NULL: report the failure instead of crashing in readdir
        err = get_errno()
        raise OSError(err, os.strerror(err), path)
    try:
        while True:
            p = readdir(dir_p)
            if not p:
                break
            name = p.contents.d_name  # bytes, up to the first NUL
            if name not in (b".", b".."):
                yield os.fsdecode(name)
    finally:
        closedir(dir_p)

if __name__ == "__main__":
    for name in listdir("."):
        print(name)


This feels dirty but should do the trick:

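import subprocess

def listdirx(dirname='.', cmd='ls'):
    # -f turns off sorting, so ls can print names as it reads them; it also
    # implies -a, hence the "." and ".." filter. Names containing newlines break this.
    proc = subprocess.Popen([cmd, '-f', dirname], stdout=subprocess.PIPE,
                            universal_newlines=True)
    try:
        for line in proc.stdout:
            filename = line.rstrip('\n')
            if filename not in ('.', '..'):
                yield filename
    finally:
        proc.stdout.close()
        proc.wait()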

Usage: listdirx('/something/with/lots/of/files')


For people coming in off Google, PEP 471 added a proper solution to the Python 3.5 standard library, and it was backported to Python 2.6+ and 3.2+ as the scandir package on PyPI.

Source: https://stackoverflow.com/a/34922054/435253

Python 3.5+:

  • os.walk has been updated to use this infrastructure for better performance.
  • os.scandir returns an iterator over DirEntry objects.

Python 2.6/2.7 and 3.2/3.3/3.4:

  • scandir.walk is a more performant version of os.walk
  • scandir.scandir returns an iterator over DirEntry objects.

The scandir() iterators wrap opendir/readdir on POSIX platforms and FindFirstFileW/FindNextFileW on Windows.

The point of returning DirEntry objects is to allow metadata to be cached, minimizing the number of system calls made (e.g. DirEntry.stat(follow_symlinks=False) never makes a system call on Windows, because the FindFirstFileW and FindNextFileW functions throw in stat information for free).

Source: https://docs.python.org/3/library/os.html#os.scandir
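
To make the two points above concrete, here is a minimal sketch of streaming a large folder with os.scandir and reusing the cached DirEntry metadata. The helper name iter_files and the limit of 100 are made up for illustration, and the `with` form assumes Python 3.6 or later.

```python
import os
from itertools import islice


def iter_files(path="."):
    """Yield (name, size_in_bytes) for regular files, one entry at a time."""
    # os.scandir is lazy: entries are fetched as the loop advances,
    # so the caller can start working before the listing is complete.
    with os.scandir(path) as entries:
        for entry in entries:
            # is_file() and stat() reuse whatever the platform already
            # returned with the directory entry, where that is available.
            if entry.is_file(follow_symlinks=False):
                yield entry.name, entry.stat(follow_symlinks=False).st_size


if __name__ == "__main__":
    # Handle only the first 100 files, then stop without walking the rest.
    for name, size in islice(iter_files("."), 100):
        print(name, size)
```

Because iter_files is a generator, wrapping it in islice (or breaking out of the loop) closes the directory handle early instead of reading all one million entries.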


Here is your answer on how to traverse a large directory file by file on Windows!

I searched like a maniac for a Windows DLL that would let me do what is done on Linux, but had no luck.

So I concluded that the only way was to create my own DLL exposing those static functions, but then I remembered pywintypes. And, YEEY! This is already done there. Even better, an iterator function is already implemented! Cool!

A Windows DLL with FindFirstFile(), FindNextFile() and FindClose() may still be out there somewhere, but I didn't find it. So I used pywintypes.

EDIT: They were hiding in plain sight in kernel32.dll. Please see ssokolow's answer, and my comment to it.

Sorry for the dependency. But I think you can extract win32file.pyd from the ...\site-packages\win32 folder, along with any dependencies it needs, and distribute it with your program independently of win32types if you have to.

I found this question when searching on how to do this, and some others as well.

Here:

How to copy first 100 files from a directory of thousands of files using python?

I posted the full code there, with the Linux version of listdir() from here (by Jason Orendorff) and the Windows version that I present here.

So anyone wanting a more or less cross-platform version can go there or combine the two answers themselves.

EDIT: Or better still, use the scandir module, or os.scandir() in Python 3.5 and later. It handles errors and some other things better.

from win32file import FindFilesIterator
import errno
import os
import stat

def listdir(path):
    """
    A generator to return the names of files in the directory passed in
    """
    if "*" not in path and "?" not in path:
        st = os.stat(path) # Raises an error if the dir doesn't exist or access is denied to us
        # Check that we got a dir and not something else
        if not stat.S_ISDIR(st.st_mode):
            raise OSError(errno.ENOTDIR, "Not a directory", path)
        path = path.rstrip("\\/") + "\\*"
    # Else: assume the caller passed a wildcard pattern on purpose
    for file in FindFilesIterator(path):
        name = file[-2] # file name entry of the WIN32_FIND_DATA tuple
        # Unfortunately, only drives (e.g. C:) don't include "." and ".." in the list,
        # so filter them out here:
        if name in (".", ".."):
            continue
        yield name
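
For comparison, here is a rough sketch of the same generator built on os.scandir, as the EDIT above suggests. It is not a drop-in replacement: the name listdir_stream is hypothetical, wildcard patterns are not handled, and the close() call assumes Python 3.6 or later.

```python
import os


def listdir_stream(path="."):
    """Yield the names in `path` lazily; a hypothetical stand-in for listdir()."""
    # os.scandir itself raises FileNotFoundError, NotADirectoryError or
    # PermissionError, so no separate os.stat() check is needed. Inside a
    # generator, that error surfaces on the first iteration.
    entries = os.scandir(path)
    try:
        for entry in entries:
            # os.scandir never returns "." or "..", so no filtering is needed.
            yield entry.name
    finally:
        # Release the directory handle even if the consumer stops early.
        entries.close()
```

The same function runs unchanged on Windows and on POSIX systems, which removes the need to combine the ctypes and pywin32 versions by hand.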