开发者

Finding multiple words and print next line using Python

开发者 https://www.devze.com 2023-03-12 14:38 出处:网络
I have huge text file.It looks as follows > <Enzymologic: Ki nM 1> 257000 > <Enzymologic: IC50 nM 1>

I have huge text file. It looks as follows

> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000

.....

Now i want to create python script to find words like (> <Enzymologic: Ki nM 1>, > <Enzymologic: EC50/IC50 nM 1>) and print next line to each word in tab delimited format as follows

> <Enzymologic: Ki nM 1>     > <Enzymologic: EC50/IC50 nM 1>
257000                       n/a
5000                         1000
.... 

I tried following code

infile = path of th开发者_如何学Pythone file
lines = infile.readlines()
infile.close()
searchtxt = "> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>"
for i, line in enumerate(lines): 
     if searchtxt in line and i+1 < len(lines):
         print lines[i+1]

But it doesnt work can any body suggest some code...to acheive it.

Thanks in advance


s = '''Enzymologic: Ki nM 1

257000

Enzymologic: IC50 nM 1

n/a

ITC: Delta_G0 kJ/mole 1

n/a

Enzymologic: Ki nM 1

5000

Enzymologic: IC50 nM 1

1000'''
from collections import defaultdict

lines = [x for x in s.splitlines() if x]
keys = lines[::2]
values = lines[1::2]
result = defaultdict(list)
for key, value in zip(keys, values):
    result[key].append(value)
print dict(result)

>>> {'ITC: Delta_G0 kJ/mole 1': ['n/a'], 'Enzymologic: Ki nM 1': ['257000', '5000'], 'Enzymologic: IC50 nM 1': ['n/a', '1000']}

Then format output as you like.


I think your problem comes from the fact that you do if searchtxt in line instead of doing if pattern in line for each pattern in your searchtxt. Here is what I'd do:

>>> path = 'D:\\temp\\Test.txt'
>>> lines = open(path).readlines()
>>> searchtxt = "Enzymologic: IC50 nM 1", "Enzymologic: Ki nM 1"
>>> from collections import defaultdict
>>> dict_patterns = defaultdict(list)
>>> for i, line in enumerate(lines):
    for pattern in searchtxt:
        if pattern in line and i+1 < len(lines):
             dict_patterns[pattern].append(lines[i+1])

>>> dict_patterns
defaultdict(<type 'list'>, {'Enzymologic: Ki nM 1': ['257000\n', '5000\n'],
                            'Enzymologic: IC50 nM 1': ['n/a\n', '1000']})

The use of the dict allows to group results by pattern (defaultdict is a convenient way not to be forced to initialize your object).


You really have too separate problems:

Parse the file and extract the data from it

import itertools

# let's imitate a file
pseudo_file = """
> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000
""".split('\n')

def iterate_on_couple(iterable):
  """
    Iterate on two elements, by two elements
  """
  iterable = iter(iterable)
  for x in iterable:
    yield x, next(iterable)

plain_lines = (l for l in pseudo_file  if l.strip()) # ignore empty lines

results = {}

# store all results in a dictionary
for name, value in iterate_on_couple(plain_lines):
  results.setdefault(name, []).append(value)

# now you got a dictionary with all values linked to a name
print results

Now this code make the assumption that your files are not corrupted and that you have always the structure:

  • blank
  • name
  • value

If not you may need something more robust.

Secondly, this stores all the values in memory, which could be a problem if your have a lot of values. In that case, you'll need to look at some storage solution such as the shelve module or sqlite.

Save the results into a file

import csv

def get(iterable, index, default):
  """
    Return an item from array or default if IndexError
  """
  try:
      return iterable[index]
  except IndexError:
      return default

names = results.keys() # get a list of all names

# now we write our tab separated file using the csv module
out = csv.writer(open('/tmp/test.csv', 'w'), delimiter='\t')

# first the header
out.writerow(names)

# get the size of the longest column
max_size = list(reversed(sorted(len(results[name]) for name in names)))[0]

# then write the lines one by one
for i in xrange(max_size):
    line = [get(results[name], i, "-") for name in names]
    out.writerow(line)

Since I'm writting the whole code for you, I deliberatly used some advanced Python idioms so you'll have some food for thought while using it.


import itertools

def search(lines, terms):
    results = [[t] for t in terms]
    lines = iter(lines)
    for l in lines:
        for i,t in enumerate(terms):
            if t in l:
                results[i].append(lines.next().strip())
                break
    return results

def format(results):
    s = []
    rows = list(itertools.izip_longest(*results, fillvalue=""))
    for row in rows:
        s.append("\t".join(row))
        s.append('\n')
    return ''.join(s)

And here's how to call the functions:

example = """> <Enzymologic: Ki nM 1>
257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000"""

def test():
    terms = ["> <Enzymologic: IC50 nM 1>", "> <Enzymologic: Ki nM 1>"]
    lines = example.split('\n')
    result = search(lines, terms)
    print format(result)
>>> test()
> <Enzymologic: IC50 nM 1>   > <Enzymologic: Ki nM 1>
n/a 257000

The above example separates each column by a single tab. If you need fancier formatting (as per your example), the format function gets a bit more complicated:

import math

def format(results):
    maxcolwidth = [0] * len(results)
    rows = list(itertools.izip_longest(*results, fillvalue=""))
    for row in rows:
        for i,col in enumerate(row):
            w = int(math.ceil(len(col)/8.0))*8
            maxcolwidth[i] = max(maxcolwidth[i], w)

    s = []
    for row in rows:
        for i,col in enumerate(row):
            s += col
            padding = maxcolwidth[i]-len(col)
            tabs = int(math.ceil(padding/8.0))
            s += '\t' * tabs
        s += '\n'

    return ''.join(s)


import re

pseudo_file = """
> <Enzymologic: Ki nM 1>
 257000

> <Enzymologic: IC50 nM 1>
n/a

> <ITC: Delta_G0 kJ/mole 1>
n/a

> <Enzymologic: Ki nM 1>
5000

> <Enzymologic: EC50/IC50 nM 1>
1000"""

searchtxt = "nzymologic: Ki nM 1>", "<Enzymologic: IC50 nM 1>"

regx_AAA = re.compile('([^:]+: )([^ \t]+)(.*)')

tu = tuple(regx_AAA.sub('\\1.*?\\2.*?\\3',x) for x in searchtxt)

model = '%%-%ss  %%s\n' % len(searchtxt[0])

regx_BBB = re.compile(('%s[ \t\r\n]+(.+)[ \t\r\n]+'
                       '.+?%s[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)') % tu)


print 'tu   ==',tu
print 'model==',model
print 'regx_BBB.findall(pseudo_file)==\n',regx_BBB.findall(pseudo_file)



with open('woof.txt','w') as f:
    f.write(model % searchtxt)
    f.writelines(model % x for x in regx_BBB.findall(pseudo_file))

result

tu   == ('nzymologic: .*?Ki.*? nM 1>', '<Enzymologic: .*?IC50.*? nM 1>')
model== %-20s  %s

regx_BBB.findall(pseudo_file)==
[('257000', 'n/a'), ('5000', '1000')]

and content of file 'woof.txt' is:

> <Enzymologic: Ki nM 1>  > <Enzymologic: IC50 nM 1>
257000                    n/a
5000                      1000

To obtain regx_BBB, I first compute a tuple tu because you want to catch a line > but there is only "> " in searchtxt

So, the tuple tu introduces .*? in the strings of searchtxt in order that the regex regx_BBB is able to catch lines CONTAINING IC50 and not only the lines strictly EQUAL to the elements of searchtxt

Note that I put strings "nzymologic: Ki nM 1>" and "<Enzymologic: IC50 nM 1>" in searchtxt, other than the ones you utilize, to show that the regexes are build so that the result is obtained yet.

The only condition is that there must be at least ONE character before the ':' in each of the strings of searchtxt

.

EDIT 1

I thought that in the file, a line '> <Enzymologic: IC50 nM 1>' or '> <Enzymologic: EC50/IC50 nM 1>' should always follow a line '> <Enzymologic: Ki nM 1>'

But after having read the answer of others, I think it is not evident (that's the common problem of questions: they don't give enough information and precisions)

If every line must be catched independantly, the following simpler regex regx_BBB can be used:

regx_AAA = re.compile('([^:]+: )([^ \t]+)(.*)')

li = [ regx_AAA.sub('\\1.*?\\2.*?\\3',x) for x in searchtxt]

regx_BBB = re.compile('|'.join(li).join('()') + '[ \t\r\n]+(.+?)[ \t]*(?=\r?\n|\Z)')

But the formatting of the recording file will be harder. I am tired to write a new complete code without knowing what is precisely wanted


Probably the simplest way to find a string in a line then print the next line is to use itertools islice:

    from itertools import islice
    searchtxt = "<Enzymologic: IC50 nM 1>"
    with open ('file.txt','r') as itfile:
            for line in itfile:
                    if searchtxt in line:
                            print line
                            print ''.join(islice(itfile,1)
0

精彩评论

暂无评论...
验证码 换一张
取 消