I have data in tab delimited format that looks like:
0/0:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00 0/0:74:-1.02,-23.28,-301.81:99.00
I am only interested in the first 3 characters of each entry (ie 0/0 and 0/1). I figured the best way to do this would be to use match
and the genfromtxt
in numpy. This example is as far as I have gotten:
import re
csvfile = 'home/python/batch1.hg19.table'
from numpy import genfromtxt
data = genfromtxt(csvfile, delimiter="\t", dtype=None)
for i in data[1]:
m = re.match('[0-9]/[0-9]', i)
if m:
print m.group(0),
else:
pri开发者_Go百科nt "NA",
This works for the first row of the data which but I am having a hard time figuring out how to expand it for every row of the input file.
Should I make it a function and apply it to each row seperately or is there a more pythonic way to do this?
Unless you really want to use NumPy, try this:
file = open('home/python/batch1.hg19.table')
for line in file:
for cell in line.split('\t'):
print(cell[:3])
Which just iterates through each line of the file, tokenizes the line using the tab character as the delimiter, then prints the slice of the text you are looking for.
Numpy is great when you want to load in an array of numbers. The format you have here is too complicated for numpy to recognize, so you just get an array of strings. That's not really playing to numpy's strength.
Here's a simple way to do it without numpy:
result=[]
with open(csvfile,'r') as f:
for line in f:
row=[]
for text in line.split('\t'):
match=re.search('([0-9]/[0-9])',text)
if match:
row.append(match.group(1))
else:
row.append("NA")
result.append(row)
print(result)
yields
# [['0/0', '0/1', '0/0'], ['NA', '0/1', '0/0']]
on this data:
0/0:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00 0/0:74:-1.02,-23.28,-301.81:99.00
---:23:-1.03,-7.94,-83.75:69.15 0/1:34:-1.01,-11.24,-127.51:99.00 0/0:74:-1.02,-23.28,-301.81:99.00
Its pretty easy to parse the whole file without regular expressions:
for line in open('yourfile').read().split('\n'):
for token in line.split('\t'):
print token[:3] if token else 'N\A'
I haven't written python in a while. But I would probably write it as such.
file = open("home/python/batch1.hg19.table")
for line in file:
columns = line.split("\t")
for column in columns:
print column[:3]
file.close()
Of course if you need to validate the first three characters, you'll still need the regex.
精彩评论