I have two path
path1 = "/home/x/nearline"
path2 = "/home/x/sge_jobs_output"
In path1, I have a bunch of fastq files:
ERR001268_1.recal.fastq.gz
ERR001268_2.recal.fastq.gz
ERR001269_1.recal.fastq.gz
ERR001269_2.recal.fastq.gz
.............
In path2, I have many .txt corresponding to the fastq files in path1:
ERR001268_1.txt
ERR001268_2.txt
ERR001269_1.txt
ERR001269_2.txt
.............
NOw I've made script to calculate fastq_seq_num from fastq files in path1, see below:
for file in os.listdir(path1):
if re.match('.*\.recal.fastq.gz', file):
fullpath1 = os.path.join(path1, file)
#To calculate the sequence number in fastq.gz files
result = commands.getoutput('zcat ' + fullpath1 + ' |wc -l')
fastq_seq_num = int(result)/4.0
print file,fastq_seq_num
And also calculate num_seq_processed_sai from .txt files in path2, see below:
for file in os.listdir(path2):
if re.match('.*\.txt', file):
fullpath2 = os.path.join(path2, file)
#To calculate how many sequences have been processed in .sai file
linelist = open (fullpath2,'r').readlines
lastline = linelist[len(linelist)-1]
num_seq_processed_sai = lastline.split(']')[1].split()[0]
print file,num_seq_processed_sai
OK, now my problem is : I want to create a loop, in which I calculate fastq_seq_num for the FIRST fastq file in path1; then calcula开发者_StackOverflow社区te num_seq_processed for the FIRST txt file in path2; then compare this two numbers; then end the loop. Then the second loop starts...How can I desgin some loop to achieve this? thanks!!!
Iterate the fastq files, check for the corresponding .txt file and if the pair exists, run your processing and compare the outputs
import commands
import glob
from os import path
dir1 = '/home/x/nearline'
dir2 = '/home/x/sge_jobs_output'
for filepath in glob.glob(path.join(dir1, '*.recal.fastq.gz')):
filename = path.basename(filepath)
job_id = filename.split('.', 1)[0]
## Look for corresponding .txt file
txt_filepath = path.join(dir2, '%s.txt' % job_id)
## Fail early if corresponding .txt file is missing
if not path.isfile(txt_filepath):
print('Missing %s for %s' % (txt_filepath, filepath))
continue
## Both exist, process each
## This is from your code snippet
result = commands.getoutput('zcat ' + fullpath1 + ' |wc -l')
fastq_seq_num = int(result)/4.0
linelist = open(txt_filepath).readlines()
lastline = linelist[len(linelist)-1]
num_seq_processed_sai = lastline.split(']')[1].split()[0]
if fastq_seq_num == num_seq_processed_sai:
print "Sequence numbers match (%d : %d)" % (fastq_seq_num, num_seq_processed_sai)
else:
print "Sequence numbers do not match (%d : %d)" % (fastq_seq_num, num_seq_processed_sai)
I suggest using glob.glob()
to list your files (as I've done in this snippet).
I also suggest replacing commands
with subprocess
, it's newer and I think commands is deprecated.
Also, reading all lines of a .txt file to get at the last line is inefficient and may cause memory issues if the files are large. Consider calling tail
to get the last line (*nix only I think).
Are the number of files in both directories guaranteed to be the same? If so, you can use the zip
function to accomplish this:
for fastqFile, txtFile in zip(glob.glob(path1+'/*.recal.fastq.gz'), glob.glob(path2+'/*.txt')):
result = commands.getoutput('zcat ' + fastqFile + ' |wc -l')
fastq_seq_num = int(result)/4.0
lastline = linelist[-1]
num_seq_processed_sai = lastline.split(']')[1].split()[0]
print fastqFile, fastq_seq_num
print txtFile, num_seq_processed_sai
EDIT: As side notes, using glob.glob()
is almost always preferable to manually filtering the output of os.listdir()
and the word file
is a built-in type in Python, which you should never use as a variable name. In addition, to get to the last item of a list
, you only need to access it with listName[-1]
. Accessing it with listName[len(listName)-1]
is unpythonic.
精彩评论