I have a folder with 100k text files. I want to put files with over 20 lines in another folder. How do I do this in python? I used os.listdir, but of course there isn't even enough memory to load all the filenames at once. Is there a way to get maybe 100 filenames at a time?
Here's my code:
import os
import shutil

dir = '/somedir/'

def file_len(fname):
    f = open(fname, 'r')
    for i, l in enumerate(f):
        pass
    f.close()
    return i + 1

filenames = os.listdir(dir + 'labels/')
i = 0
for filename in filenames:
    flen = file_len(dir + 'labels/' + filename)
    print flen
    if flen > 15:
        i = i + 1
        shutil.copyfile(dir + 'originals/' + filename[:-5], dir + 'filteredOrigs/' + filename[:-5])
print i
And Output:
Traceback (most recent call last):
File "filterimage.py", line 13, in <module>
filenames = os.listdir(dir+'labels/')
OSError: [Errno 12] Cannot allocate memory: '/somedir/'
Here's the modified script:
import os
import shutil
import glob

topdir = '/somedir'

def filelen(fname, many):
    f = open(fname, 'r')
    for i, l in enumerate(f):
        if i > many:
            f.close()
            return True
    f.close()
    return False

path = os.path.join(topdir, 'labels', '*')
i = 0
for filename in glob.iglob(path):
    print filename
    if filelen(filename, 5):
        i += 1
print i
It works on a folder with fewer files, but on the larger folder all it prints is "0"... It works on the Linux server but prints 0 on the Mac... oh well...
you might try using glob.iglob, which returns an iterator:

topdir = os.path.join('/somedir', 'labels', '*')
for filename in glob.iglob(topdir):
    if filelen(filename) > 15:
        # do stuff

Also, please don't use dir for a variable name: you're shadowing the built-in.
Another major improvement that you can introduce is to your filelen function. If you replace it with the following, you'll save a lot of time. Trust me, what you have now is the slowest alternative:

def many_line(fname, many=15):
    for i, line in enumerate(open(fname)):
        if i > many:
            return True
    return False
A couple of thoughts. First, you might use the glob module to get smaller groups of files. Second, sorting by line count is going to be very time consuming, as you have to open every file and count lines. If you can partition by byte count, you can avoid opening the files by using the stat module. If it's crucial that the split happens at 20 lines, you can at least cut out large swaths of files by figuring out a minimum number of characters that a 20-line file of your type will have, and not opening any file smaller than that.
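A hedged sketch of that pre-filtering idea (the 20-byte lower bound is an assumption; the real minimum depends on your file format, and the helper names here are mine):

```python
import glob
import os

def over_n_lines(fname, n=20):
    # Read lazily and stop as soon as line n + 1 is seen.
    with open(fname) as f:
        for i, _ in enumerate(f):
            if i >= n:  # i is 0-based, so reaching index n means more than n lines
                return True
    return False

def candidates(pattern, min_bytes=20):
    # Yield matching paths, skipping files too small to hold more than
    # 20 lines. min_bytes=20 is an assumed lower bound: a file with more
    # than 20 lines contains at least 20 newline bytes.
    for fname in glob.iglob(pattern):
        if os.stat(fname).st_size >= min_bytes:
            yield fname
```

Only the files that survive the cheap os.stat check ever get opened for line counting.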
import os, shutil

os.chdir("/mydir/")
numlines = 20
destination = os.path.join("/destination", "dir1")
for file in os.listdir("."):
    if os.path.isfile(file):
        flag = 0
        for n, line in enumerate(open(file)):
            if n > numlines:
                flag = 1
                break
        if flag:
            try:
                shutil.move(file, destination)
            except Exception, e:
                print e
            else:
                print "%s moved to %s" % (file, destination)
how about using a shell script? you could process one file at a time:

for f in *; do
    if [ "$(wc -l < "$f")" -gt 20 ]; then
        mv "$f" newfolder/
    fi
done

ppl please correct if i am wrong in any way
The currently accepted answer just plain doesn't work. This function:

def many_line(fname, many=15):
    for i, line in enumerate(line):
        if i > many:
            return True
    return False

has two problems: firstly, the fname arg is not used and the file is not opened. Secondly, the call to enumerate(line) will fail because line is not defined. Changing enumerate(line) to enumerate(open(fname)) will fix it.
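With that change applied (and a `with` block so the file handle is closed promptly), a working version looks like:

```python
def many_line(fname, many=15):
    # Open the file (the bug fix) and bail out as soon as enough
    # lines have been seen, instead of reading the whole file.
    with open(fname) as f:
        for i, line in enumerate(f):
            if i > many:
                return True
    return False
```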
You can use os.scandir, which returns an iterator and therefore does not read all the file names into memory at once (built in since Python 3.5; on older versions, just pip install scandir). Example:

import os

for entry in os.scandir(path):
    do_something_with_file(entry.path)
scandir documentation: https://pypi.org/project/scandir/
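Putting this together with the original question (the 20-line cutoff comes from the question; the folder arguments and helper name are mine), a sketch that streams directory entries and moves the long files might look like:

```python
import os
import shutil

def move_long_files(src, dest, max_lines=20):
    # Stream directory entries with os.scandir instead of building a
    # full listing, and move files with more than max_lines lines.
    moved = 0
    for entry in os.scandir(src):
        if not entry.is_file():
            continue
        with open(entry.path) as f:
            # any() short-circuits at the first index >= max_lines,
            # i.e. as soon as line max_lines + 1 is seen.
            long_enough = any(i >= max_lines for i, _ in enumerate(f))
        if long_enough:
            shutil.move(entry.path, os.path.join(dest, entry.name))
            moved += 1
    return moved
```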