开发者

parsing issue with comma separated csv file

开发者 https://www.devze.com 2022-12-23 02:54 出处:网络
I am trying to extract 4th column from csv file (comma separated, and skipping first 2 header lines) using this command,

I am trying to extract 4th column from csv file (comma separated, and skipping first 2 header lines) using this command,

 awk 'NR <2 {next}{FS =","}{print $4}' filename.csv | more

However, it doesn't work because the first column cantains comma, thus 4th column is not 开发者_运维知识库really 4th. Below is an example of a row:

"sdfsdfsd, sfsdf", 454,fgdfg, I_want_this_column,sdfgdg,34546, 456465, etc


Unless you have specific reasons for using awk, I would recommend using a CSV parsing library. Many scripting languages have one built-in (or at least available) and they'll save you from these headaches.


if your first column has quotes always,

 $ awk 'BEGIN{ FS="\042[ ]*," } { m=split($2,a,","); print a[3] } ' file
 I_want_this_column

if the column you want is always the last 2nd,

$ awk -F"," '{print $(NF-1)}' file
 I_want_this_column

You can try this demo script to break down the columns

awk 'BEGIN{ FS="," }
{
   for(i=1;i<=NF;i++){
      # save normal
      if($i !~ /^[ ]*\042|[ ]*\042[ ]*$/){
        a[++j]=$i
      }
      # if quotes at the end
      if(f==1 && $i ~ /[ ]*\042[ ]*$/){
        s=s","$i
        a[++j]=s
        #reset
        s="";f=0
      }
      # if quotes in front
      if($i ~ /^[ ]*\042/){
        s=s $i
        f=1
      }
      if(f==1 && ( $i !~/\042/ ) ){
         s=s","$i
      }
   }
}
END{
  # print columns
  for(p=1;p<=j;p++){
     print "Field "p,": "a[p]
  }
} ' file

output

$ cat file
"sdfsdfsd, sfsdf", "454,fgdfg blah , words ", I_want_this_column,sdfgdg

$ ./shell.sh
Field 1 : "sdfsdfsd, sfsdf"
Field 2 : fgdfg blah
Field 3 :  "454,fgdfg blah , words "
Field 4 :  I_want_this_column
Field 5 : sdfgdg


You shouldn't use awk here. Use Python csv module or Perl Text::CSV or Text::CSV_XS modules or another real csv parser.

Related question - parse csv file using gawk


If you can't avoid awk, this piece of code does the job you need:

BEGIN {FS=",";}

{
        f=0;
        j=0;
        for (i = 1; i <=NF ; ++i) {
                if (f) {
                        a[j] = a[j] "," $(i);
                        if ($(i) ~ "\"$") {
                                f = 0;
                        }
                }
                else {
                        ++j;
                        a[j] = $(i);
                        if ((a[j] ~ "^\"[^\"]*$")) {
                                f = 1;
                        }
                }
        }
        for (i = 1; i <= j; ++i) {
                gsub("^\"","",a[i]);
                gsub("\"$","",a[i]);
                gsub("\"\"","\"",a[i]);
print "i = \"" a[i] "\"";
        }
}


Working with CSV files that have quoted fields with commas inside can be difficult with the standard UNIX text tools.

I wrote a program called csvquote to make the data easy for them to handle. In your case, you could use it like this:

csvquote filename.csv | awk 'NR <2 {next}{FS =","}{print $4}' | csvquote -u | more

or you could use cut and tail like this:

csvquote filename.csv | tail -n +3 | cut -d, -f4 | csvquote -u | more

The code and docs are here: https://github.com/dbro/csvquote

0

精彩评论

暂无评论...
验证码 换一张
取 消