AWK/BASH: How to remove duplicate rows from file with known field range?_问答_开发者

AWK/BASH: How to remove duplicate rows from file with known field range?

开发者 https://www.devze.com 2023-01-20 13:42 出处：网络

I was wondering if there was a way to use bash开发者_运维百科/awk to remove duplicate rows based on a known field range. For example:

Easy Going                  USA:22 May 1926
Easy Going Gordon               USA:6 August 1925   
Easy Life                   USA:20 May 1944
Easy Listening                  USA:14 January 2002 
Easy Listening                  USA:10 October 2002 
Easy Listening                  USA:27 January 2004 
Easy Living                     USA:7 July 1937 
Easy Living                     USA:16 July 1937
Easy Living                     USA:4 September 2009

I would like to remove duplicate move titles. The movie title will always be from $1 through $(NF-3). Ideally I would like to stick with the first occurrence (earliest date), but if that's not possible then it doesn't matter.

Thanks,

Tomek

#!/bin/bash

awk 'BEGIN{
   m=split("January|February|March|April|May|June|July|August|September|October|November|December",d,"|")
   for(o=1;o<=m;o++){
      months[d[o]]=sprintf("%02d",o)
   }
}
{
   sub(/.*:/,"",$(NF-2))
   t=mktime($(NF)" "months[$(NF-1)]" "$(NF-2)" 0 0 0")
   time[t]=$(NF-2) FS $(NF-1) FS $(NF)
   $(NF-2)=$(NF-1)=$(NF)=""
   gsub(/ +$/,"")
   if (!($0 in array)){array[$0]=99999999999999}
   if ( t <= array[$0] ){ array[$0]=t }
}
END{
  for(i in array){ print "->",i,time[array[i]]  }
} ' file

output

$ ./shell.sh
-> Easy Living 7 July 1937
-> Easy Going Gordon 6 August 1925
-> Easy Listening 14 January 2002
-> Easy Going 22 May 1926
-> Easy Life 20 May 1944

awk '
    {
        line = $0
        $(NF-2) = $(NF-1) = $NF = ""
        if ( ! ($0 in movies)) 
            movies[$0] = line
    }
    END {
        for (m in movies) print movies[m] 
    }
' movies.txt

That does not preserve the original line ordering. You might want to sort the output.

This could be a quick answer