Filter files in a very large folder

开发者 https://www.devze.com 2022-12-19 06:48 Source: web

I have a folder with 100k text files. I want to put files with over 20 lines in another folder. How do I do this in Python? I used os.listdir, but there isn't enough memory even to load the filenames. Is there a way to get maybe 100 filenames at a time?

Here's my code:

import os
import shutil

dir = '/somedir/'

def file_len(fname):
    f = open(fname,'r')
    for i, l in enumerate(f):
        pass
    f.close()
    return i + 1

filenames = os.listdir(dir+'labels/')

i = 0
for filename in filenames:
    flen = file_len(dir+'labels/'+filename)
    print flen
    if flen > 15:
        i = i+1
        shutil.copyfile(dir+'originals/'+filename[:-5], dir+'filteredOrigs/'+filename[:-5])
print i

And the output:

Traceback (most recent call last):
  File "filterimage.py", line 13, in <module>
    filenames = os.listdir(dir+'labels/')
OSError: [Errno 12] Cannot allocate memory: '/somedir/'

Here's the modified script:

import os
import shutil
import glob

topdir = '/somedir'

def filelen(fname, many):
    f = open(fname,'r')
    for i, l in enumerate(f):
        if i > many:
            f.close()
            return True
    f.close()
    return False

path = os.path.join(topdir, 'labels', '*')
i=0
for filename in glob.iglob(path):
    print filename
    if filelen(filename,5):
        i += 1
print i

It works on a folder with fewer files, but with the larger folder all it prints is "0". It works on a Linux server, but prints 0 on a Mac.


You might try using glob.iglob, which returns an iterator:

import glob
import os

pattern = os.path.join('/somedir', 'labels', '*')
for filename in glob.iglob(pattern):
    if file_len(filename) > 15:  # file_len is the line counter from the question
        pass  # do stuff

Also, please don't use dir for a variable name: you're shadowing the built-in.

Another major improvement you can make is to your line-counting function. Yours reads every file to the end, which is the slowest alternative. If you replace it with the following, which stops reading as soon as the threshold is passed, you'll save a lot of time:

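def many_line(fname, many=15):
    # Return True as soon as the file is known to have more than `many` lines.
    with open(fname) as f:
        for i, line in enumerate(f, 1):  # count lines from 1
            if i > many:
                return True
    return False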
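
The same early-exit idea can also be sketched with itertools.islice. The name has_more_lines is made up for this illustration and is not part of the original answer; it reads at most limit + 1 lines and then stops.

```python
from itertools import islice


def has_more_lines(fname, limit=15):
    """Return True if the file has more than `limit` lines.

    Hypothetical helper: reads at most limit + 1 lines, then stops.
    """
    with open(fname) as f:
        # islice stops after limit + 1 lines, so long files are not read fully.
        return sum(1 for _ in islice(f, limit + 1)) > limit
```

A file with exactly `limit` lines gives False; one more line gives True.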


A couple of thoughts. First, you might use the glob module to get smaller groups of files. Second, filtering by line count is going to be very time-consuming, as you have to open every file and count lines. If you can partition by byte count instead, you can avoid opening the files at all by using os.stat (the size is in st_size). If it's crucial that the split happens at 20 lines, you can at least cut out large swaths of files by working out the minimum number of bytes that a 20-line file of your type will have, and not opening any file smaller than that.
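
The size check could look roughly like the sketch below. The function name and the min_bytes parameter are assumptions for illustration; the right cutoff depends on your data. It uses os.scandir (Python 3.6+ for the with form), so files below the cutoff are never opened.

```python
import os


def candidates_by_size(folder, min_bytes):
    """Yield paths of regular files that are at least `min_bytes` long.

    `min_bytes` is an assumed, data-dependent cutoff: the smallest size a
    file with enough lines could have. Smaller files are skipped unopened.
    """
    with os.scandir(folder) as entries:
        for entry in entries:
            # entry.stat() reads metadata only; file contents are not touched.
            if entry.is_file() and entry.stat().st_size >= min_bytes:
                yield entry.path
```

Only the paths it yields still need their lines counted.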


import os
import shutil

os.chdir("/mydir/")
numlines = 20
destination = os.path.join("/destination", "dir1")
for name in os.listdir("."):
    if os.path.isfile(name):
        too_long = False
        with open(name) as f:
            # Count from 1 and stop reading once the threshold is passed.
            for n, line in enumerate(f, 1):
                if n > numlines:
                    too_long = True
                    break
        if too_long:
            try:
                shutil.move(name, destination)
            except Exception as e:
                print(e)
            else:
                print("%s moved to %s" % (name, destination))


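How about using a shell script? You could handle one file at a time:

for f in *; do
  # wc -l < "$f" prints only the line count; -f skips anything that is not a regular file
  if [ -f "$f" ] && [ $(wc -l < "$f") -gt 20 ]; then
    mv "$f" newfolder/
  fi
done

Please correct me if I am wrong in any way.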


The currently accepted answer just plain doesn't work. This function:

def many_line(fname, many=15):
    for i, line in enumerate(line):
        if i > many:
            return True
    return False

has two problems: Firstly, the fname arg is not used and the file is not opened. Secondly, the call to enumerate(line) will fail because line is not defined.

Changing enumerate(line) to enumerate(open(fname)) will fix it.


You can use os.scandir, which returns an iterator and therefore does not read all file names at once. It is built in from Python 3.5; on older versions, install the backport with pip install scandir.

Example:

    import os
    # path and do_something_with_file are placeholders
    for entry in os.scandir(path):
        do_something_with_file(entry.path)

scandir documentation: https://pypi.org/project/scandir/
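
To get "100 filenames at a time", as the question asks, the scandir iterator can be consumed in chunks. This is only a sketch: filename_batches and batch_size are names invented here, and the with form of os.scandir assumes Python 3.6 or later.

```python
import os
from itertools import islice


def filename_batches(folder, batch_size=100):
    """Yield lists of up to `batch_size` file paths from `folder`.

    Hypothetical helper: only one batch of names is held in memory at a time.
    """
    with os.scandir(folder) as entries:
        files = (e.path for e in entries if e.is_file())
        while True:
            batch = list(islice(files, batch_size))
            if not batch:
                return
            yield batch
```

Each batch can then be passed to whatever line-count check and copy step you use. The order of names within and across batches is whatever the operating system returns.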
