
Python: Parsing a colon delimited file with various counts of fields

Developer https://www.devze.com 2023-02-07 21:46 Source: the web

I'm trying to parse a few files with the following format in 'clientname'.txt:

hostname:comp1
time: Fri Jan 28 20:00:02 GMT 2011
ip:xxx.xxx.xx.xx
fs:good:45
memory:bad:78
swap:good:34
Mail:good

Each field is delimited by a :, but lines 0, 2 and 6 have 2 fields, while lines 1 and 3-5 have 3 or more. A big issue I've had trouble with is the time: line, since 20:00:02 is really a time and not 3 separate fields.

I have several files like this that I need to parse. There are many more lines in some of these files with multiple fields.

...
for i in clients:
    if os.path.isfile(rpt_path + i + rpt_ext):          # if the rpt exists then do this
        rpt = rpt_path + i + rpt_ext
        l_count = 0
        for line in open(rpt, "r"):
            s_line = line.rstrip()
            part = s_line.split(':')
            print part
            l_count = l_count + 1
    else:                                               # else break
        break

I first check whether the file exists; if it does, I open it and (eventually) parse it. For now I'm just printing the output (print part) to make sure it's parsing correctly. Honestly, the only trouble I'm having at this point is the time: field. How can I treat that line differently from all the others? The time field is ALWAYS the 2nd line in all of my report files.


The split method has the syntax split([sep[, maxsplit]]); if maxsplit is given, it makes at most maxsplit+1 parts. In your case, you just have to give maxsplit as 1: split(':', 1) splits only at the first colon, which would solve your problem with the time: line.
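
To illustrate how maxsplit behaves on the sample lines from the question, here is a minimal sketch. It is written in Python 3 syntax (print is a function), and the variable names are only for illustration:

```python
# Sample lines taken from the report format in the question.
time_line = "time: Fri Jan 28 20:00:02 GMT 2011"
fs_line = "fs:good:45"

# Without maxsplit, every colon is a separator, so the clock time is broken up.
print(time_line.split(":"))
# ['time', ' Fri Jan 28 20', '00', '02 GMT 2011']

# With maxsplit=1, only the first colon is used, so at most 2 parts come back.
print(time_line.split(":", 1))
# ['time', ' Fri Jan 28 20:00:02 GMT 2011']

# The same call on a three-field line leaves the remainder unsplit.
print(fs_line.split(":", 1))
# ['fs', 'good:45']
```

Note that applying maxsplit=1 to every line would also leave good:45 as a single field, so it fits best on lines where only the first colon is a real separator.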


If time is a special case, you could do:

[...]
s_line = line.rstrip()
if line.startswith('time:'):
    part = s_line.split(':', 1)
else:
    part = s_line.split(':')
print part
[...]

This would give you:

['hostname', 'comp1']
['time', ' Fri Jan 28 20:00:02 GMT 2011']
['ip', 'xxx.xxx.xx.xx']
['fs', 'good', '45']
['memory', 'bad', '78']
['swap', 'good', '34']
['Mail', 'good']

And it doesn't rely on the position of the time: line in the file.
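
If the goal is eventually to look values up by name rather than print them, the same special case could be wrapped in a small function. This is only a sketch in Python 3 syntax; parse_report and the dictionary layout are assumptions for illustration, not something from the question:

```python
def parse_report(lines):
    """Map each record name to its list of fields (hypothetical helper)."""
    report = {}
    for line in lines:
        s_line = line.rstrip()
        if not s_line:
            continue  # skip blank lines
        if s_line.startswith("time:"):
            parts = s_line.split(":", 1)  # keep 20:00:02 intact
        else:
            parts = s_line.split(":")
        # The first part is the record name; the rest are its values.
        report[parts[0]] = [p.strip() for p in parts[1:]]
    return report


sample = """hostname:comp1
time: Fri Jan 28 20:00:02 GMT 2011
fs:good:45
Mail:good"""

report = parse_report(sample.splitlines())
print(report["time"])  # ['Fri Jan 28 20:00:02 GMT 2011']
print(report["fs"])    # ['good', '45']
```

A line without any colon ends up as a key with an empty list, which may or may not be what you want for your reports.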


Design considerations:

Robustly handle extraneous whitespace, including blank lines, and missing colons.

Extract a record_type, which is then used to decide how to parse the remainder of the line.

>>> def munched(s, n=None):
...     if n is None:
...         n = 99999999 # this kludge should not be necessary
...     return [x.strip() for x in s.split(':', n)]
...
>>> def parse_line(line):
...     if ':' not in line:
...         return [line.strip(), '']
...     record_type, remainder = munched(line, 1)
...     if record_type == 'time':
...         data = [remainder]
...     else:
...         data = munched(remainder)
...     return record_type, data
...
>>> for guff in """
... hostname:comp1
... time: Fri Jan 28 20:00:02 GMT 2011
... ip:xxx.xxx.xx.xx
... fs:good:45
...     memory   :    bad   :   78
... missing colon
... Mail:good""".splitlines(True):
...    print repr(guff), parse_line(guff)
...
'\n' ['', '']
'hostname:comp1\n' ('hostname', ['comp1'])
'time: Fri Jan 28 20:00:02 GMT 2011\n' ('time', ['Fri Jan 28 20:00:02 GMT 2011'])
'ip:xxx.xxx.xx.xx\n' ('ip', ['xxx.xxx.xx.xx'])
'fs:good:45\n' ('fs', ['good', '45'])
'    memory   :    bad   :   78    \n' ('memory', ['bad', '78'])
'missing colon\n' ['missing colon', '']
'Mail:good' ('Mail', ['good'])
>>>
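
The 99999999 placeholder in munched can probably be avoided: str.split treats a negative maxsplit as "no limit" (-1 is the documented default), so a default of n=-1 should behave the same way. Below is a sketch in Python 3 syntax. As a deliberate change from the session above, it returns a tuple in both branches, with an empty data list when the colon is missing:

```python
def munched(s, n=-1):
    # maxsplit=-1 means "split at every separator", so no sentinel is needed.
    return [x.strip() for x in s.split(":", n)]


def parse_line(line):
    if ":" not in line:
        # No colon: treat the whole line as the record type, with no data.
        return line.strip(), []
    record_type, remainder = munched(line, 1)
    if record_type == "time":
        data = [remainder]  # keep the timestamp as one field
    else:
        data = munched(remainder)
    return record_type, data


print(parse_line("time: Fri Jan 28 20:00:02 GMT 2011"))
print(parse_line("    memory   :    bad   :   78    "))
print(parse_line("missing colon"))
```

Returning the same shape from every branch lets callers unpack record_type, data without first checking whether the line had a colon.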


If the time field is always the 2nd line, why can't you skip it and parse it separately?

Something like:

for i, line in enumerate(open(rpt, "r").read().splitlines()):
    if i == 1:  # special parsing for the time: line (always the 2nd line)
        data = line[5:]  # everything after the leading 'time:'
    else:
        part = line.split(':')  # your normal parsing logic
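
Once the time line is handled separately, its value could also be turned into a real datetime object. The sketch below is one possible way to do that in Python 3 syntax. parse_time_line is a hypothetical helper, and the format string assumes every report uses exactly the layout shown in the question, with a literal GMT and English day and month names:

```python
from datetime import datetime


def parse_time_line(line):
    """Turn 'time: Fri Jan 28 20:00:02 GMT 2011' into a datetime (hypothetical helper)."""
    value = line[5:].strip()  # drop the 'time:' prefix and surrounding spaces
    # 'GMT' is matched literally here; this assumes every report uses GMT.
    return datetime.strptime(value, "%a %b %d %H:%M:%S GMT %Y")


stamp = parse_time_line("time: Fri Jan 28 20:00:02 GMT 2011")
print(stamp.isoformat())  # 2011-01-28T20:00:02
```

The result is a naive datetime (no timezone attached), and strptime raises ValueError if a line does not match the expected layout.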