I have a folder with 1 million files in it.
I would like to begin processing immediately, while the files in this folder are still being listed, in Python or another scripting language.
The usual functions (os.listdir in Python...) are blocking, and my program has to wait for the whole listing to finish, which can take a long time.
What's the best way to list huge folders?
If convenient, change your directory structure; but if not, you can use ctypes to call opendir and readdir.
Here is a copy of that code; all I did was indent it properly, add the try/finally block, and fix a bug. You might have to debug it, particularly the struct layout.
Note that this code is not portable. You would need to use different functions on Windows, and I think the structs vary from Unix to Unix.
#!/usr/bin/env python3
"""
An equivalent of os.listdir, but as a generator using ctypes
"""
import os
from ctypes import CDLL, c_char_p, c_int, c_long, c_ushort, c_byte, c_char, Structure, POINTER
from ctypes.util import find_library


class c_dir(Structure):
    """Opaque type for directory entries, corresponds to struct DIR"""
    pass


c_dir_p = POINTER(c_dir)


class c_dirent(Structure):
    """Directory entry"""
    # FIXME not sure these are the exactly correct types!
    _fields_ = (
        ('d_ino', c_long),  # inode number
        ('d_off', c_long),  # offset to the next dirent
        ('d_reclen', c_ushort),  # length of this record
        ('d_type', c_byte),  # type of file; not supported by all file system types
        ('d_name', c_char * 4096)  # filename
    )


c_dirent_p = POINTER(c_dirent)

c_lib = CDLL(find_library("c"))
opendir = c_lib.opendir
opendir.argtypes = [c_char_p]
opendir.restype = c_dir_p

# FIXME Should probably use readdir_r here
readdir = c_lib.readdir
readdir.argtypes = [c_dir_p]
readdir.restype = c_dirent_p

closedir = c_lib.closedir
closedir.argtypes = [c_dir_p]
closedir.restype = c_int


def listdir(path):
    """
    A generator to return the names of files in the directory passed in
    """
    # c_char_p needs bytes on Python 3, so encode the path first
    dir_p = opendir(os.fsencode(path))
    if not dir_p:
        # opendir returns NULL on failure; closedir(NULL) would crash
        raise OSError("opendir failed for %r" % (path,))
    try:
        while True:
            p = readdir(dir_p)
            if not p:
                break
            name = os.fsdecode(p.contents.d_name)
            if name not in (".", ".."):
                yield name
    finally:
        closedir(dir_p)


if __name__ == "__main__":
    for name in listdir("."):
        print(name)
This feels dirty but should do the trick:
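import subprocess


def listdirx(dirname='.', cmd='ls'):
    # universal_newlines=True makes stdout yield text rather than bytes,
    # so the comparison with '' below also works on Python 3
    proc = subprocess.Popen([cmd, dirname], stdout=subprocess.PIPE,
                            universal_newlines=True)
    filename = proc.stdout.readline()
    while filename != '':
        yield filename.rstrip('\n')
        filename = proc.stdout.readline()
    proc.communicate()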
Usage: listdirx('/something/with/lots/of/files')
For people coming in off Google, PEP 471 added a proper solution to the Python 3.5 standard library, and it was backported to Python 2.6+ and 3.2+ as the scandir module on PyPI.
Source: https://stackoverflow.com/a/34922054/435253
Python 3.5+:
os.walk has been updated to use this infrastructure for better performance.
os.scandir returns an iterator over DirEntry objects.
Python 2.6/2.7 and 3.2/3.3/3.4:
scandir.walk is a more performant version of os.walk.
scandir.scandir returns an iterator over DirEntry objects.
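One way to support both cases from a single script is an import fallback. This is only a sketch: it assumes the third-party scandir package is installed on the older interpreters, and iter_names is a hypothetical helper name, not part of either module.

```python
try:
    # Python 3.5+: scandir is part of the standard library.
    from os import scandir
except ImportError:
    # Older interpreters: assumes the third-party `scandir` package is installed.
    from scandir import scandir


def iter_names(path="."):
    """Yield entry names one at a time instead of building a full list first."""
    for entry in scandir(path):
        yield entry.name
```

Processing can then start on the first name, for example `for name in iter_names('/something/with/lots/of/files'): ...`.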
The scandir() iterators wrap opendir/readdir on POSIX platforms and FindFirstFileW/FindNextFileW on Windows.
The point of returning DirEntry objects is to allow metadata to be cached to minimize the number of system calls made. (e.g. DirEntry.stat(follow_symlinks=False) never makes a system call on Windows because the FindFirstFileW and FindNextFileW functions throw in stat information for free.)
Source: https://docs.python.org/3/library/os.html#os.scandir
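To illustrate the cached metadata, here is a minimal sketch that sums file sizes in one directory using only DirEntry methods. The helper name total_size is hypothetical, and the with-statement form of os.scandir needs Python 3.6 or later.

```python
import os


def total_size(path="."):
    """Sum the sizes of regular files directly inside `path` (no recursion)."""
    total = 0
    # The context manager closes the directory handle as soon as the loop ends.
    with os.scandir(path) as entries:
        for entry in entries:
            # is_file() can usually be answered from the directory entry itself.
            if entry.is_file(follow_symlinks=False):
                # On Windows this is served from the directory listing; on POSIX
                # it costs one system call, then the result is cached on the entry.
                total += entry.stat(follow_symlinks=False).st_size
    return total
```

Compared with os.listdir followed by os.stat on every name, this should avoid building the full list and save system calls where the platform allows it.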
Here is your answer on how to traverse a large directory file by file on Windows!
I searched like a maniac for a Windows DLL that will allow me to do what is done on Linux, but no luck.
So, I concluded that the only way is to create my own DLL that will expose those static functions to me, but then I remembered pywintypes. And, YEEY! this is already done there. And, even more, an iterator function is already implemented! Cool!
A Windows DLL with FindFirstFile(), FindNextFile() and FindClose() may still be out there somewhere, but I didn't find it. So, I used pywintypes.
EDIT: They were hiding in plain sight in kernel32.dll. Please see ssokolow's answer, and my comment to it.
Sorry for the dependency. But I think that you can extract win32file.pyd, and any dependencies it has, from the ...\site-packages\win32 folder and distribute it with your program independently of win32types if you have to.
I found this question when searching on how to do this, and some others as well.
In my answer to "How to copy first 100 files from a directory of thousands of files using python?", I posted the full code, with the Linux version of listdir() from here (by Jason Orendorff) and the Windows version that I present here.
So anyone wanting a more or less cross-platform version, go there or combine two answers yourself.
EDIT: Or better still, use the scandir module, or os.scandir() in Python 3.5 and later. It handles errors and some other things better.
from win32file import FindFilesIterator
import errno
import os


def listdir(path):
    """
    A generator to return the names of files in the directory passed in
    """
    if "*" not in path and "?" not in path:
        st = os.stat(path)  # Raise an error if dir doesn't exist or access is denied to us
        # Check if we got a dir or something else!
        # Check gotten from stat.py (for fast checking):
        if (st.st_mode & 0o170000) != 0o040000:
            raise OSError(errno.ENOTDIR, "Not a directory", path)
        path = path.rstrip("\\/") + "\\*"
    # Else: Decide that user knows what she/he is doing
    for info in FindFilesIterator(path):
        name = info[-2]
        # Unfortunately, only drives (eg. C:) don't include "." and ".." in the list:
        if name in (".", ".."):  # the original test used "and", which never matched
            continue
        yield name
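Because listdir() is a generator, a caller can stop after the first few entries without waiting for the rest of the directory to be read. A minimal sketch of that pattern, using itertools.islice; first_n and fake_listdir are hypothetical names, and fake_listdir only stands in for a lazy directory listing so the example runs on any platform.

```python
from itertools import islice


def first_n(names, n):
    """Return at most n items from any iterator of names, consuming no more."""
    return list(islice(names, n))


def fake_listdir(count):
    """Stand-in for a lazy directory listing (hypothetical, for illustration)."""
    for i in range(count):
        yield "file%06d.txt" % i


# Only three names are ever generated, even though a million are "available".
print(first_n(fake_listdir(1000000), 3))
```

With the real generator this would read `first_n(listdir(path), 100)`.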