I have a huge text file. It looks as follows:
> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
n/a
> <ITC: Delta_G0 kJ/mole 1>
n/a
> <Enzymologic: Ki nM 1>
5000
> <Enzymologic: EC50/IC50 nM 1>
1000
.....
Now I want to write a Python script that finds lines like (> <Enzymologic: Ki nM 1>, > <Enzymologic: EC50/IC50 nM 1>) and prints the line following each one in tab-delimited format, as follows:
> <Enzymologic: Ki nM 1> > <Enzymologic: EC50/IC50 nM 1>
257000 n/a
5000 1000
....
I tried the following code:
infile = path of the file
lines = infile.readlines()
infile.close()
searchtxt = "> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>"
for i, line in enumerate(lines):
    if searchtxt in line and i+1 < len(lines):
        print lines[i+1]
But it doesn't work. Can anybody suggest some code to achieve this?
Thanks in advance.
s = '''Enzymologic: Ki nM 1
257000
Enzymologic: IC50 nM 1
n/a
ITC: Delta_G0 kJ/mole 1
n/a
Enzymologic: Ki nM 1
5000
Enzymologic: IC50 nM 1
1000'''
from collections import defaultdict
lines = [x for x in s.splitlines() if x]
keys = lines[::2]
values = lines[1::2]
result = defaultdict(list)
for key, value in zip(keys, values):
    result[key].append(value)
print dict(result)
>>> {'ITC: Delta_G0 kJ/mole 1': ['n/a'], 'Enzymologic: Ki nM 1': ['257000', '5000'], 'Enzymologic: IC50 nM 1': ['n/a', '1000']}
Then format the output as you like.
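One way the formatting step might look is sketched below. It is written for Python 3 (the snippets in this thread are Python 2), and `to_tab_delimited` and its `fill` parameter are hypothetical names introduced only for this illustration. It assumes the grouped dictionary shown above and pads shorter columns so every row has the same number of fields.

```python
from itertools import zip_longest

def to_tab_delimited(result, names, fill="n/a"):
    """Return a tab-delimited table: one column per name, one row per occurrence."""
    columns = [result.get(name, []) for name in names]
    rows = ["\t".join(names)]
    # zip_longest pads shorter columns so every row has the same width
    for row in zip_longest(*columns, fillvalue=fill):
        rows.append("\t".join(row))
    return "\n".join(rows)

# The dictionary produced by the grouping code above
result = {
    'ITC: Delta_G0 kJ/mole 1': ['n/a'],
    'Enzymologic: Ki nM 1': ['257000', '5000'],
    'Enzymologic: IC50 nM 1': ['n/a', '1000'],
}
table = to_tab_delimited(result, ['Enzymologic: Ki nM 1', 'Enzymologic: IC50 nM 1'])
print(table)
```

Passing the column names explicitly keeps the column order under your control instead of relying on dictionary ordering.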
I think your problem comes from the fact that you do if searchtxt in line instead of doing if pattern in line for each pattern in your searchtxt. Here is what I'd do:
>>> path = 'D:\\temp\\Test.txt'
>>> lines = open(path).readlines()
>>> searchtxt = "Enzymologic: IC50 nM 1", "Enzymologic: Ki nM 1"
>>> from collections import defaultdict
>>> dict_patterns = defaultdict(list)
>>> for i, line in enumerate(lines):
...     for pattern in searchtxt:
...         if pattern in line and i+1 < len(lines):
...             dict_patterns[pattern].append(lines[i+1])
...
>>> dict_patterns
defaultdict(<type 'list'>, {'Enzymologic: Ki nM 1': ['257000\n', '5000\n'],
'Enzymologic: IC50 nM 1': ['n/a\n', '1000']})
Using a dict lets you group the results by pattern (defaultdict is a convenient way to avoid initializing each list yourself).
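To make the difference concrete, here is a small Python 3 sketch (the variable names `tuple_test_raised` and `matches_any` are made up for the illustration). Testing a tuple against a string with `in` raises a TypeError, whereas testing each pattern in turn gives the intended result.

```python
searchtxt = ("> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>")
line = "> <Enzymologic: Ki nM 1>\n"

# A tuple on the left of `in` with a str on the right is an error, not a search
try:
    searchtxt in line
    tuple_test_raised = False
except TypeError:
    tuple_test_raised = True

# Testing each pattern in turn is what was intended
matches_any = any(pattern in line for pattern in searchtxt)
print(tuple_test_raised, matches_any)
```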
You really have two separate problems:
Parse the file and extract the data from it
import itertools
# let's imitate a file
pseudo_file = """
> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
n/a
> <ITC: Delta_G0 kJ/mole 1>
n/a
> <Enzymologic: Ki nM 1>
5000
> <Enzymologic: EC50/IC50 nM 1>
1000
""".split('\n')
def iterate_on_couple(iterable):
    """
    Iterate on two elements, by two elements
    """
    iterable = iter(iterable)
    for x in iterable:
        yield x, next(iterable)
plain_lines = (l for l in pseudo_file if l.strip()) # ignore empty lines
results = {}
# store all results in a dictionary
for name, value in iterate_on_couple(plain_lines):
    results.setdefault(name, []).append(value)
# now you got a dictionary with all values linked to a name
print results
Now, this code assumes that your file is not corrupted and that it always has the structure:
- blank
- name
- value
If not, you may need something more robust.
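As a rough idea of what "more robust" could mean, the Python 3 sketch below pairs each value with the header just before it instead of relying on strict alternation. It rests on an assumption not stated in the question: header lines start with '>' and value lines never do. `parse_pairs` is a hypothetical helper name.

```python
def parse_pairs(lines):
    """Collect the value following each '> <name>' header line.

    Assumption: header lines start with '>' and a value never does.
    """
    results = {}
    current = None  # header still waiting for its value
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines carry no information
        if line.startswith(">"):
            current = line  # a header with no value is simply replaced
        elif current is not None:
            results.setdefault(current, []).append(line)
            current = None
        # any other line is a stray value with no header; skip it
    return results

# A deliberately damaged sample: one header has no value, one value has no header
sample = """
> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
> <Enzymologic: Ki nM 1>
5000
stray
""".split("\n")
parsed = parse_pairs(sample)
print(parsed)
```

Whether a missing value should be skipped silently, recorded as a placeholder, or reported as an error depends on your data, so treat the skipping behaviour here as one choice among several.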
Secondly, this stores all the values in memory, which could be a problem if you have a lot of values. In that case, you'll need to look at a storage solution such as the shelve module or sqlite.
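For the sqlite route, a minimal Python 3 sketch using the standard sqlite3 module might look like this. The table name `readings`, its two columns, and the helper `store_pairs` are all made up for the illustration; an in-memory database is used only to keep the example self-contained.

```python
import sqlite3

def store_pairs(conn, pairs):
    """Insert (name, value) pairs into a table called `readings` (a made-up name)."""
    conn.execute("CREATE TABLE IF NOT EXISTS readings (name TEXT, value TEXT)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", pairs)
    conn.commit()

# ":memory:" keeps the demo self-contained; pass a file path to persist to disk
conn = sqlite3.connect(":memory:")
pairs = [
    ("> <Enzymologic: Ki nM 1>", "257000"),
    ("> <Enzymologic: IC50 nM 1>", "n/a"),
    ("> <Enzymologic: Ki nM 1>", "5000"),
]
store_pairs(conn, pairs)

# Read one column back in insertion order
ki_values = [row[0] for row in conn.execute(
    "SELECT value FROM readings WHERE name = ? ORDER BY rowid",
    ("> <Enzymologic: Ki nM 1>",))]
print(ki_values)
```

Since executemany accepts any iterable, the pairs can come straight from a generator over the file, so the whole data set never has to sit in memory at once.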
Save the results into a file
import csv
def get(iterable, index, default):
    """
    Return an item from array or default if IndexError
    """
    try:
        return iterable[index]
    except IndexError:
        return default
names = results.keys() # get a list of all names
# now we write our tab separated file using the csv module
out = csv.writer(open('/tmp/test.csv', 'w'), delimiter='\t')
# first the header
out.writerow(names)
# get the size of the longest column
max_size = list(reversed(sorted(len(results[name]) for name in names)))[0]
# then write the lines one by one
for i in xrange(max_size):
    line = [get(results[name], i, "-") for name in names]
    out.writerow(line)
Since I'm writing the whole code for you, I deliberately used some advanced Python idioms so you'll have some food for thought while using it.
import itertools
def search(lines, terms):
    results = [[t] for t in terms]
    lines = iter(lines)
    for l in lines:
        for i,t in enumerate(terms):
            if t in l:
                results[i].append(lines.next().strip())
                break
    return results
def format(results):
    s = []
    rows = list(itertools.izip_longest(*results, fillvalue=""))
    for row in rows:
        s.append("\t".join(row))
        s.append('\n')
    return ''.join(s)
And here's how to call the functions:
example = """> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
n/a
> <ITC: Delta_G0 kJ/mole 1>
n/a
> <Enzymologic: Ki nM 1>
5000
> <Enzymologic: EC50/IC50 nM 1>
1000"""
def test():
    terms = ["> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>"]
    lines = example.split('\n')
    result = search(lines, terms)
    print format(result)
>>> test()
> <Enzymologic: IC50 nM 1>	> <Enzymologic: Ki nM 1>
n/a	257000
The above example separates each column by a single tab. If you need fancier formatting (as per your example), the format function gets a bit more complicated:
import math
def format(results):
    maxcolwidth = [0] * len(results)
    rows = list(itertools.izip_longest(*results, fillvalue=""))
    for row in rows:
        for i,col in enumerate(row):
            w = int(math.ceil(len(col)/8.0))*8
            maxcolwidth[i] = max(maxcolwidth[i], w)
    s = []
    for row in rows:
        for i,col in enumerate(row):
            s += col
            padding = maxcolwidth[i]-len(col)
            tabs = int(math.ceil(padding/8.0))
            s += '\t' * tabs
        s += '\n'
    return ''.join(s)
import re
pseudo_file = """
> <Enzymologic: Ki nM 1>
257000
> <Enzymologic: IC50 nM 1>
n/a
> <ITC: Delta_G0 kJ/mole 1>
n/a
> <Enzymologic: Ki nM 1>
5000
> <Enzymologic: EC50/IC50 nM 1>
1000"""
searchtxt = "nzymologic: Ki nM 1>", "<Enzymologic: IC50 nM 1>"
regx_AAA = re.compile('([^:]+: )([^ \t]+)(.*)')
tu = tuple(regx_AAA.sub('\\1.*?\\2.*?\\3',x) for x in searchtxt)
model = '%%-%ss %%s\n' % len(searchtxt[0])
regx_BBB = re.compile(('%s[ \t\r\n]+(.+)[ \t\r\n]+'
'.+?%s[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)') % tu)
print 'tu ==',tu
print 'model==',model
print 'regx_BBB.findall(pseudo_file)==\n',regx_BBB.findall(pseudo_file)
with open('woof.txt','w') as f:
    f.write(model % searchtxt)
    f.writelines(model % x for x in regx_BBB.findall(pseudo_file))
Result:
tu == ('nzymologic: .*?Ki.*? nM 1>', '<Enzymologic: .*?IC50.*? nM 1>')
model== %-20s %s
regx_BBB.findall(pseudo_file)==
[('257000', 'n/a'), ('5000', '1000')]
and content of file 'woof.txt' is:
> <Enzymologic: Ki nM 1> > <Enzymologic: IC50 nM 1>
257000 n/a
5000 1000
To obtain regx_BBB, I first compute a tuple tu, because you want to catch a line such as > <Enzymologic: EC50/IC50 nM 1> while searchtxt only contains > <Enzymologic: IC50 nM 1>.
So the tuple tu inserts .*? into the strings of searchtxt, so that regx_BBB can catch lines CONTAINING IC50 and not only lines strictly EQUAL to the elements of searchtxt.
Note that I put the strings "nzymologic: Ki nM 1>" and "<Enzymologic: IC50 nM 1>" in searchtxt, rather than the ones you use, to show that the regexes are built so that the result is still obtained.
The only condition is that there must be at least ONE character before the ':' in each of the strings of searchtxt.
EDIT 1
I assumed that in the file, a line '> <Enzymologic: IC50 nM 1>' or '> <Enzymologic: EC50/IC50 nM 1>' always follows a line '> <Enzymologic: Ki nM 1>'.
But after reading the other answers, I think that is not obvious (that's the common problem with questions: they don't give enough information or precision).
If every line must be caught independently, the following simpler regex regx_BBB can be used:
regx_AAA = re.compile('([^:]+: )([^ \t]+)(.*)')
li = [ regx_AAA.sub('\\1.*?\\2.*?\\3',x) for x in searchtxt]
regx_BBB = re.compile('|'.join(li).join('()') + '[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)')
But formatting the output file will be harder. I'd rather not write a whole new program without knowing precisely what is wanted.
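For comparison, here is a much simpler Python 3 sketch of the "each line caught independently" idea. It is not the regex above: it escapes each search term with re.escape, joins the terms into one alternation, and captures the line that follows any line containing one of them. The function name `find_values` and the shortened terms are assumptions made for the illustration.

```python
import re

def find_values(text, terms):
    """Return (term, next_line) pairs for every line containing one of terms."""
    # re.escape keeps characters such as '/' or '.' in the terms literal
    alternation = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(
        r"^.*(%s).*\n[ \t]*(.+?)[ \t\r]*$" % alternation,
        re.MULTILINE,
    )
    return pattern.findall(text)

text = (
    "> <Enzymologic: Ki nM 1>\n257000\n"
    "> <Enzymologic: EC50/IC50 nM 1>\n1000\n"
    "> <ITC: Delta_G0 kJ/mole 1>\nn/a\n"
)
pairs = find_values(text, ["Ki nM", "IC50 nM"])
print(pairs)
```

Because the term "IC50 nM" is matched anywhere in the line, it also catches the EC50/IC50 header, which is the behaviour the tuple tu was built to achieve.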
Probably the simplest way to find a string in a line and then print the next line is to use itertools.islice:
from itertools import islice
searchtxt = "<Enzymologic: IC50 nM 1>"
with open('file.txt', 'r') as itfile:
    for line in itfile:
        if searchtxt in line:
            print line
            # islice pulls exactly one more line from the same file iterator
            print ''.join(islice(itfile, 1))
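The same idea can be sketched in Python 3 with the built-in next(), which takes a default and so avoids an exception when the match is on the last line. An io.StringIO object stands in for the real file here, and `lines_after` is a hypothetical helper name.

```python
import io

def lines_after(handle, searchtxt):
    """Yield the line that follows each line containing searchtxt."""
    for line in handle:
        if searchtxt in line:
            # next() with a default avoids StopIteration at end of file
            following = next(handle, None)
            if following is not None:
                yield following.strip()

# StringIO imitates an open text file for the purpose of the demo
fake_file = io.StringIO(
    "> <Enzymologic: Ki nM 1>\n257000\n"
    "> <Enzymologic: IC50 nM 1>\nn/a\n"
)
found = list(lines_after(fake_file, "<Enzymologic: IC50 nM 1>"))
print(found)
```

Because the file is consumed line by line, this approach keeps memory use flat even for a very large input.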