开发者

How do I get all statistics-related information from a file, using awk?

开发者 https://www.devze.com 2023-03-29 14:12 出处:网络
I am trying to parse some CSV files using awk. The CSV file I am working with looks like this: fnName,minAccessTime,maxAccessTime

I am trying to parse some CSV files using awk.

The CSV file I am working with looks like this:

fnName,minAccessTime,maxAccessTime
getInfo,300,600
getStage,600,800
getStage,600,800
getInfo,250,620
getInfo,200,700
getStage,700,1000
getInfo,280,600

I need to find the minimum, maximum and average figures for columns 2 and 3, both across all data and individual functi开发者_JAVA技巧ons.


I realize you're not looking for non-awk solution, but I thought I'd share some R code to demonstrate how seamless it is to summarize data.

# read in data
awk <- read.table(textConnection("fnName,minAccessTime,maxAccessTime
getInfo,300,600
getStage,600,800
getStage,600,800
getInfo,250,620
getInfo,200,700
getStage,700,1000
getInfo,280,600"), header = TRUE, sep = ",")

# split according to the function
awk.split <- split(awk, awk$fnName)

# for each function, calculate full summary for columns 2 and 3
lapply(X = awk.split, FUN = function(x) {
            summary(x[2:3])
        })

Result:

$getInfo
 minAccessTime   maxAccessTime
 Min.   :200.0   Min.   :600  
 1st Qu.:237.5   1st Qu.:600  
 Median :265.0   Median :610  
 Mean   :257.5   Mean   :630  
 3rd Qu.:285.0   3rd Qu.:640  
 Max.   :300.0   Max.   :700  

$getStage
 minAccessTime   maxAccessTime   
 Min.   :600.0   Min.   : 800.0  
 1st Qu.:600.0   1st Qu.: 800.0  
 Median :600.0   Median : 800.0  
 Mean   :633.3   Mean   : 866.7  
 3rd Qu.:650.0   3rd Qu.: 900.0  
 Max.   :700.0   Max.   :1000.0 


This awk script should give you all the skills necessary to get what you want.

It basically runs through all lines in your input file, ignoring those where the second field is minAccessTime (the CSV header).

On all other records, it updates the count of, minimum-of-minima, maximum-of-minima, minimum-of-maxima, maximum-of-maxima, sum-of-minima, and sum-of-maxima for both the overall data plus each individual function name.

The former is stored in count, min_min, max_min, min_max, max_max, sum_min and sum_max. The latter are stored in associative arrays with similar names (with _arr appended).

Then, once all records are read, the END section outputs the information.

NR > 1 {
    count++;
    sum_min += $2;
    sum_max += $3;
    if (count == 1) {
        min_min = $2;
        max_min = $2;
        min_max = $3;
        max_max = $3;
    } else {
        if ($2 < min_min) { min_min = $2; }
        if ($2 > max_min) { max_min = $2; }
        if ($3 < min_max) { min_max = $3; }
        if ($3 > max_max) { max_max = $3; }
    }

    count_arr[$1]++;
    sum_min_arr[$1] += $2;
    sum_max_arr[$1] += $3;
    if (count_arr[$1] == 1) {
        min_min_arr[$1] = $2;
        max_min_arr[$1] = $2;
        min_max_arr[$1] = $3;
        max_max_arr[$1] = $3;
    } else {
        if ($2 < min_min_arr[$1]) { min_min_arr[$1] = $2; }
        if ($2 > max_min_arr[$1]) { max_min_arr[$1] = $2; }
        if ($3 < min_max_arr[$1]) { min_max_arr[$1] = $3; }
        if ($3 > max_max_arr[$1]) { max_max_arr[$1] = $3; }
    }
}

END {
    print "Overall:"
    print "   Total records = " count
    print "   Sum of minima = " sum_min
    print "   Sum of maxima = " sum_max
    if (count > 0) {
        print "   Min of minima = " min_min
        print "   Max of minima = " max_min
        print "   Min of maxima = " min_max
        print "   Max of maxima = " max_max
        print "   Avg of minima = " sum_min / count
        print "   Avg of maxima = " sum_max / count
    }
    for (task in count_arr) {
       print "Function " task ":"
        print "   Total records = " count_arr[task]
        print "   Sum of minima = " sum_min_arr[task]
        print "   Sum of maxima = " sum_max_arr[task]
        print "   Min of minima = " min_min_arr[task]
        print "   Max of minima = " max_min_arr[task]
        print "   Min of maxima = " min_max_arr[task]
        print "   Max of maxima = " max_max_arr[task]
        print "   Avg of minima = " sum_min_arr[task] / count_arr[task]
        print "   Avg of maxima = " sum_max_arr[task] / count_arr[task]
    }
}

Storing that script into qq.awk and placing your sample data into qq.in, then running:

awk -F, -f qq.awk qq.in

generates the following output, which I'm relatively certain will give you every possible piece of information you need:

Overall:
   Total records = 7
   Sum of minima = 2930
   Sum of maxima = 5120
   Min of minima = 200
   Max of minima = 700
   Min of maxima = 600
   Max of maxima = 1000
   Avg of minima = 418.571
   Avg of maxima = 731.429
Function getStage:
   Total records = 3
   Sum of minima = 1900
   Sum of maxima = 2600
   Min of minima = 600
   Max of minima = 700
   Min of maxima = 800
   Max of maxima = 1000
   Avg of minima = 633.333
   Avg of maxima = 866.667
Function getInfo:
   Total records = 4
   Sum of minima = 1030
   Sum of maxima = 2520
   Min of minima = 200
   Max of minima = 300
   Min of maxima = 600
   Max of maxima = 700
   Avg of minima = 257.5
   Avg of maxima = 630


If you insist on Awk...

$ awk -F, '
> func newmin(fname, array, value) { if (!(fname in array) || array[fname]>value) array[fname] = value }
> func newmax(fname, array, value) { if (!(fname in array) || array[fname]<value) array[fname] = value }
> NR>1 {
>   newmin($1,min2,$2)
>   newmin("global",min2,$2)
>   newmax($1,max2,$2)
>   newmax("global",max2,$2)
>   newmin($1,min3,$3)
>   newmin("global",min3,$3)
>   newmax($1,max3,$3)
>   newmax("global",max3,$3)
> }
> END { for (fname in min2) { print fname, min2[fname], max2[fname], min3[fname], max3[fname] } }'
0

精彩评论

暂无评论...
验证码 换一张
取 消