Using sed/awk and regex to process logs

Developer https://www.devze.com 2023-03-27 03:11 Source: Internet

I have thousands of log files generated by a very verbose PHP script. The general structure is as follows:

###Unknown no of lines, which I want to ignore###
=================================================
$insert_vars['cdr_pkey']=17568
$id<TAB>$g1<TAB>$i1<TAB>rating1<TAB>$g2<TAB>$i2<TAB>rating2 #<TAB>more $gX,$iX,$ratingX
#numerical values of $id $g1 $i1 etc. separated by tab
#numerical values of ---""---
#I do not know how many lines will be there (unique column is $id)
=================================================
###Unknown no of lines, which I want to ignore###

I have to process these log files, create an Excel sheet (I am thinking CSV format), and report the data back. I am really bad at Excel, but I thought of outputting something like:

cdr_pkey<TAB>id<TAB>g1<TAB>i1<TAB>rating1<TAB>g2<TAB>rating2 #and so on
17568<TAB>1349<TAB>0.0004532<TAB>0.01320<TAB>2.014E-4<TAB>...#rest of numerical values
17568<TAB>1364<TAB>...#values for id=1364
17568<TAB>1321<TAB>...#values for id=1321
...
17569<TAB>1048<TAB>...#values for id=1048
17569<TAB>1426<TAB>...#values for id=1426
...
...

So cdr_pkey is the key column in the sheet, and each $cdr_pkey has multiple $ids, each with its own set of $g1, $i1, $rating1, and so on.

I tested this format and Excel can read it. Now I just want to extend it to all those thousands of files.

I am just not sure how to proceed further. What's the next step?

The following bash script does something that might be related to what you want. It is parameterized by what you meant when you said <TAB>. I assume you mean the ASCII tab character, but if your logs are so verbose that they spell out <TAB>, you will need to modify the variable $WHAT_DID_YOU_MEAN_BY_TAB accordingly. Note that there is very little about this script that does The Right Thing™: it reads the whole delimited block of the file into a string variable, which might not even be possible depending on how big your log files are. On the upside, the script could easily be modified to make two passes instead, if you think that's better.

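#!/bin/bash

# A literal tab character; change this if your logs really spell out "<TAB>".
WHAT_DID_YOU_MEAN_BY_TAB=$'\t'

if [[ $# -ne 1 ]] ; then echo "Requires one argument: the file to process" >&2 ; exit 1 ; fi

FILENAME="$1"

# Keep the lines between the two ===== delimiters, then drop the delimiters themselves.
RELEVANT=$(sed -n '/^==*$/,/^==*$/p' "$FILENAME" | sed '1d;$d')
# Take whatever follows "=" on the $insert_vars['cdr_pkey'] line.
CDR_PKEY=$(echo "$RELEVANT" | \
    grep '$insert_vars\['"'cdr_pkey'\]" | \
    sed 's/.*=\(.*\)/\1/')
# Drop the cdr_pkey line and the column-name line, then prefix every row with the key.
echo "$RELEVANT" | sed '1,2d' | \
    sed "s/.*/${CDR_PKEY}${WHAT_DID_YOU_MEAN_BY_TAB}&/"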
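
The memory caveat above can be sidestepped by streaming the file instead of holding the block in a variable. Here is a minimal sketch, assuming each log contains exactly one block between two lines made only of = characters, with the cdr_pkey line first and the column-name line second, as in the question. The function name extract_rows is made up for illustration, and awk replaces the sed/grep pipeline here, so treat it as an alternative rather than the script above.

```shell
#!/bin/bash
# Sketch only: stream one log file and print "cdr_pkey<TAB>row" for every
# data row inside the first ===== ... ===== block.
# extract_rows is a hypothetical helper name, not part of the original answer.
extract_rows() {
    awk '
        /^=+$/ {                      # a delimiter line made only of "="
            if (inside) exit          # second delimiter: the block is finished
            inside = 1; seen = 0
            next
        }
        !inside { next }              # skip everything outside the block
        { seen++ }
        seen == 1 {                   # first line of the block: ...=17568
            key = $0
            sub(/^.*=/, "", key)      # keep what follows the last "="
            next
        }
        seen == 2 { next }            # second line: the column-name line
        { print key "\t" $0 }         # data rows, prefixed with the key
    ' "$1"
}
```

Called as extract_rows some.log, it writes the prefixed rows to standard output, one file at a time, without buffering the block in the shell.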

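The following find command is an example use, but your case will depend on how your logs are organized. LOG_PATTERN and THIS_SCRIPT are placeholders for your log file name pattern and the path to the script.

find . -name 'LOG_PATTERN' -exec THIS_SCRIPT '{}' \;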

Lastly, I have ignored the issue of putting the CSV headers on the output. This is easily done out-of-band.
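
One way to do that out-of-band is to print the header once and let the data rows pass through behind it. This is a rough sketch: the column names are assumptions copied from the question, with only two gX/iX/ratingX groups, and the function names print_header and with_header are invented for illustration.

```shell
#!/bin/bash
# Sketch only: emit a header line once, then copy the data rows unchanged.
# The column names below are assumptions taken from the question; extend the
# list to match however many gX/iX/ratingX groups the logs really contain.
print_header() {
    printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
        cdr_pkey id g1 i1 rating1 g2 i2 rating2
}

# Read rows on standard input, write header plus rows on standard output.
with_header() {
    print_header
    cat
}
```

For example, the output of the find command could be piped through with_header and redirected to a single file, so the header appears exactly once no matter how many logs are processed.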

(Edit: updated the script to reflect discussion in the comments.)

EDIT: James tells me that changing the sed in the last echo from ... 1d ... to ... 1,2d ... and dropping the grep -v 'id' should do the trick.
I confirmed that it works, so I have changed it below. Thanks again to James Wilcox.

Based on @James's script, this is what I came up with. Originally I just piped the final echo to grep -v 'id'.
Thanks again, James Wilcox.

#!/bin/bash

# A literal tab character.
WHAT_DID_YOU_MEAN_BY_TAB=$'\t'

if [[ $# -lt 1 ]] ; then echo "Requires at least one argument: the files to process" >&2 ; exit 1 ; fi

# Header line, printed once for the whole sheet.
echo -e "key\tid\tg1\ti1\td1\tc1\tr1\tg2\ti2\td2\tc2\tr2\tg3\ti3\td3\tc3\tr3"

for i in "$@"
do
    FILENAME="$i"
    # Keep the lines between the two ===== delimiters, without the delimiters.
    RELEVANT=$(sed -n '/^==*$/,/^==*$/p' "$FILENAME" | sed '1d;$d')
    CDR_PKEY=$(echo "$RELEVANT" | \
        grep '$insert_vars\['"'cdr_pkey'\]" | \
        sed 's/.*=\(.*\)/\1/')
    # '1,2d' drops the cdr_pkey line and the column-name line.
    echo "$RELEVANT" | sed '1,2d' | \
        sed "s/.*/${CDR_PKEY}${WHAT_DID_YOU_MEAN_BY_TAB}&/"
    # The earlier version, which filtered the column-name line with grep, looked like:
    #echo "$RELEVANT" | sed '1d' | \
    #    sed "s/.*/${CDR_PKEY}${WHAT_DID_YOU_MEAN_BY_TAB}&/" | grep -v 'id'
done