Does grep allow searching for duplicates?

I have many (nearly 100) big CSV files with a sellID in the first column. I know that some sellIDs are repeated 2 or more times across 2 or more files. Is it possible with grep to find all these duplicate sellIDs (and create a map of sellID to file_name)? Or is there another open source application for this purpose? My OS is CentOS.


Here's a very simple, somewhat crude awk script to accomplish something pretty close to what you seem to be describing:

#!/usr/bin/awk -f

# If we have seen this first-column value before, print it twice: once with
# the file it was previously seen in, and once with the current file.
{
    if ($1 in seenbefore) {
        printf("%s\t%s\n", $1, seenbefore[$1]);
        printf("%s\t%s\n", $1, FILENAME);
    }
    # Remember (or update) the file this value was last seen in.
    seenbefore[$1] = FILENAME;
}

As you can hopefully surmise, all we're doing is building an associative array keyed on each value found in the first column/field (set FS in the BEGIN special block to change the input field separator ... for a trivially naive form of CSV support). Whenever we encounter a duplicate we print out the dupe, the file we previously saw it in, and the current filename. In either case we then add/update the array entry with the current file's name.
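
For instance, a hypothetical invocation (assuming the script above is saved as dupes.awk; the -F, option sets the comma field separator from the command line instead of a BEGIN block):

awk -F, -f dupes.awk *.csv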

With more code you could store and print the line numbers of each occurrence, append filename/line number tuples to a list, move all the output to an END block where you summarize it in a more concise format, and so on.
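
A rough, untested sketch of that sort of extension might look like this (it prints duplicate IDs only, each followed by every file:line location where the ID was seen):

#!/usr/bin/awk -f

# Count how many times each first-column value is seen and record
# every file:line location where it appears.
{
    count[$1]++;
    if (count[$1] > 1)
        locations[$1] = locations[$1] ", ";
    locations[$1] = locations[$1] FILENAME ":" FNR;
}

# Summarize at the end: print only the values seen more than once.
END {
    for (id in count)
        if (count[id] > 1)
            printf("%s\t%s\n", id, locations[id]);
}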

For any of that I'd personally shift to Python, where the data types are richer (actual lists and tuples rather than having to concatenate them into strings or build an array of arrays) and where I'd have much more power available (an actual CSV parser that can handle various flavors of quoted CSV and alternative delimiters, and where producing sorted results is trivially easy).
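
Purely as an illustration, a minimal Python 3 sketch along those lines (untested; it assumes plain comma-delimited files and maps each duplicated sellID to every file and line where it occurs):

#!/usr/bin/env python3
# Minimal sketch: map duplicated first-column IDs to the files and lines
# where they occur. Usage: ./dupes.py *.csv
import csv
import sys
from collections import defaultdict

seen = defaultdict(list)    # sellID -> list of (filename, line number)

for filename in sys.argv[1:]:
    with open(filename, newline="") as f:
        for lineno, row in enumerate(csv.reader(f), start=1):
            if row:         # skip blank lines
                seen[row[0]].append((filename, lineno))

# Report only the IDs seen more than once, in sorted order.
for sellid in sorted(seen):
    if len(seen[sellid]) > 1:
        for filename, lineno in seen[sellid]:
            print("%s\t%s:%d" % (sellid, filename, lineno))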

However, this should, hopefully, get you on the right track.


Related question: https://serverfault.com/questions/66301/removing-duplicate-lines-from-file-with-grep

You could cat all the files into a single one and then look for dupes as suggested in the link above.
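
For example, something along these lines (untested; it assumes the sellID is the plain, unquoted first comma-separated field):

cat *.csv | cut -d, -f1 | sort | uniq -d

This only lists the duplicated IDs, though; it doesn't tell you which files they came from.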

BTW, it is not clear if you want to keep only the dupes or remove them.


Yet another answer:

If your sellID is of fixed length (say, 6 characters) you can use:

sort data.txt | uniq -w 6 -D

This will print out all lines whose first 6 characters are not unique.

If all you want to do is automatically remove records with a duplicate sellID (keeping only one of them), try:

sort -t, -u --key=1,1 data.txt


Try this:

# Save the duplicate IDs (first-column values that occur more than once)
find path -type f -name '*.csv' -exec cut -d, -f1 {} + | sort | uniq -d \
  > duplicate-ids.log
# List the records containing those IDs (grep prints the matching file names)
find path -type f -name '*.csv' -exec grep -F -f duplicate-ids.log {} +

Note: I did not test it.
