
Parsing CSV File enclosed with quotes in C#

Developer https://www.devze.com 2023-03-02 01:28 Source: Internet
I've seen lots of samples of parsing CSV files, but this one is kind of an annoying file...

So how do you parse this kind of CSV?

"1",1/2/2010,"The sample ("adasdad") asdada","I was pooping in the door "Stinky", so I'll be damn","AK"


The best answer in most cases is probably @Jim Mischel's. TextFieldParser seems to be exactly what you want for most conventional cases -- though it strangely lives in the Microsoft.VisualBasic namespace! But this case isn't conventional.

The last time I ran into a variation on this issue where I needed something unconventional, I embarrassingly gave up on regexp'ing and bullheaded a char by char check. Sometimes, that's not-wrong enough to do. Splitting a string isn't as difficult a problem if you byte push.

So I rewrote for this case as a string extension. I think this is close.

Do note that, "I was pooping in the door "Stinky", so I'll be damn", is an especially nasty case. Without the *** STINKY CONDITION *** code, below, you'd get I was pooping in the door "Stinky as one value and so I'll be damn" as the other.

The only way to do better than that for any anonymous weird splitter/escape case would be to have some sort of algorithm to determine the "usual" number of columns in each row, and then check for, in this case, fixed length fields like your AK state entry or some other possible landmark as a sort of normalizing backstop for nonconformist columns. But that's serious crazy logic that likely isn't called for, as much fun as it'd be to code. As @Vash points out, you're better off following some standard and coding a little more OFfensively.

But the problem here is probably easier than that. The only lexically meaningful case is the one in your example -- ", -- double quote, comma, and then a space. So that's what the *** STINKY CONDITION *** code checks. Even so, this code is getting nastier than I'd like, which means you have ever stranger edge cases, like "This is also stinky," a f a b","Now what?" Heck, even "A,"B","C" doesn't work in this code right now, iirc, since I treat the begin and end chars as having been escape pre- and post-fixed. So we're largely back to @Vash's comment!

Apologies for all the brackets for one-line if statements, but I'm stuck in a StyleCop world right now. I'm not necessarily suggesting you use this -- that strictEscapeToSplitEvaluation plus the STINKY CONDITION makes this a little complex. But it's worth keeping in mind that a normal csv parser that's intelligent about quotes is significantly more straightforward to the point of being tedious, but otherwise trivial.

namespace YourFavoriteNamespace 
{
    using System;
    using System.Collections.Generic;
    using System.Text;

    public static class Extensions
    {
        public static Queue<string> SplitSeeingQuotes(this string valToSplit, char splittingChar = ',', char escapeChar = '"', 
            bool strictEscapeToSplitEvaluation = true, bool captureEndingNull = false)
        {
            Queue<string> qReturn = new Queue<string>();
            StringBuilder stringBuilder = new StringBuilder();

            bool bInEscapeVal = false;

            for (int i = 0; i < valToSplit.Length; i++)
            {
                if (!bInEscapeVal)
                {
                    // Escape values must come immediately after a split.
                    // abc,"b,ca",cab has an escaped comma.
                    // abc,b"ca,c"ab does not.
                    if (escapeChar == valToSplit[i] && (!strictEscapeToSplitEvaluation || (i == 0 || (i != 0 && splittingChar == valToSplit[i - 1]))))
                    {
                        bInEscapeVal = true;    // not capturing escapeChar as part of value; easy enough to change if need be.
                    }
                    else if (splittingChar == valToSplit[i])
                    {
                        qReturn.Enqueue(stringBuilder.ToString());
                        stringBuilder = new StringBuilder();
                    }
                    else
                    {
                        stringBuilder.Append(valToSplit[i]);
                    }
                }
                else
                {
                    // Can't use switch b/c we're comparing to a variable, I believe.
                    if (escapeChar == valToSplit[i])
                    {
                        // Repeated escape always reduces to one escape char in this logic.
                        // So if you wanted "I'm ""double quote"" crazy!" to come out with 
                        // the double double quotes, you're toast.
                        if (i + 1 < valToSplit.Length && escapeChar == valToSplit[i + 1])
                        {
                            i++;
                            stringBuilder.Append(escapeChar);
                        }
                        else if (!strictEscapeToSplitEvaluation)
                        {
                            bInEscapeVal = false;
                        }
                        // *** STINKY CONDITION ***  
                        // Kinda defense, since only `", ` really makes sense.
                        else if ('"' == escapeChar && i + 2 < valToSplit.Length &&
                            valToSplit[i + 1] == ',' && valToSplit[i + 2] == ' ')
                        {
                            i = i+2;
                            stringBuilder.Append("\", ");
                        }
                        // *** EO STINKY CONDITION ***  
                        else if (i+1 == valToSplit.Length || (i + 1 < valToSplit.Length && valToSplit[i + 1] == splittingChar))
                        {
                            bInEscapeVal = false;
                        }
                        else
                        {
                            stringBuilder.Append(escapeChar);
                        }
                    }
                    else
                    {
                        stringBuilder.Append(valToSplit[i]);
                    }
                }
            }

            // NOTE: The `captureEndingNull` flag is not tested.
            // Catch null final entry?  "abc,cab,bca," could be four entries, with the last an empty string.
            // The Length check keeps an empty input from indexing out of range.
            if ((captureEndingNull && valToSplit.Length > 0 && splittingChar == valToSplit[valToSplit.Length-1]) || (stringBuilder.Length > 0))
            {
                qReturn.Enqueue(stringBuilder.ToString());
            }

            return qReturn;
        }
    }
}

Probably worth mentioning that the "answer" you gave yourself doesn't have the "Stinky" problem in its sample string. ;^)

[Understanding that we're three years after you asked,] I will say that your example isn't as insane as folks here make out. I can see wanting to treat escape characters (in this case, ") as escape characters only when they're the first value after the splitting character or, after finding an opening escape, stopping only if you find the escape character before a splitter; in this case, the splitter is obviously ,.

If the row of your csv is abc,bc"a,ca"b, I would expect that to mean we've got three values: abc, bc"a, and ca"b.

Same deal in your "The sample ("adasdad") asdada" column -- quotes that don't begin and end a cell value aren't escape characters and don't necessarily need doubling to maintain meaning. So I added a strictEscapeToSplitEvaluation flag here.

Enjoy. ;^)


I very strongly recommend using TextFieldParser. Hand-coded parsers that use String.Split or regular expressions almost invariably mishandle things like quoted fields that have embedded quotes or embedded separators.

I would be surprised, though, if it handled your particular example. As others have said, that line is, at best, ambiguous.
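For reference, here is a minimal sketch of what using TextFieldParser can look like, assuming the project references the Microsoft.VisualBasic assembly. The class and method names (TextFieldParserSketch, ParseLine) are made up for illustration, and returning null when the parser throws MalformedLineException is just one possible way to flag a line it refuses.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class TextFieldParserSketch
{
    // Returns the fields of one CSV line, or null if the parser rejects the line.
    public static string[] ParseLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;

            try
            {
                // ReadFields strips the enclosing quotes and collapses "" to ".
                return parser.ReadFields();
            }
            catch (MalformedLineException)
            {
                // The line does not follow the quoting rules the parser expects.
                return null;
            }
        }
    }
}
```

Given the question's line with its inner quotes doubled, ParseLine returns five fields. What it does with the original, undoubled line is best checked by running it against your own data.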


Split based on

",

I would use MyString.IndexOf("\",")

And then substring the parts. Other than that, I'm sure someone has written a CSV parser out there that can handle this :)
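A rough sketch of that IndexOf/Substring idea, with hypothetical names (QuoteCommaSplitSketch, SplitOnQuoteComma). It only cuts at the two-character sequence ", and leaves the remaining quotes in place, so treat it as an illustration of the approach rather than a finished parser.

```csharp
using System;
using System.Collections.Generic;

public static class QuoteCommaSplitSketch
{
    // Cuts the line at every occurrence of a double quote followed by a comma.
    public static List<string> SplitOnQuoteComma(string line)
    {
        const string marker = "\",";
        var parts = new List<string>();
        int start = 0;
        int hit;

        while ((hit = line.IndexOf(marker, start, StringComparison.Ordinal)) >= 0)
        {
            parts.Add(line.Substring(start, hit - start));
            start = hit + marker.Length;
        }

        // Whatever follows the last marker is the final piece.
        parts.Add(line.Substring(start));
        return parts;
    }
}
```

Two limits are worth knowing before relying on this. An unquoted field such as 1/2/2010 does not end in ", so it stays glued to the field that follows it. And a value that itself contains ", (like the "Stinky", one in the question) is cut in two.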


I found a way to parse this malformed CSV. I looked for a pattern and found it.... I first replace (",") with a character... like "¤" and then split it...

from this:

"Annoying","CSV File","poop@mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby","yeah!"

to this:

"Annoying¤CSV File¤poop@mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby¤yeah!"

then split it:

ArrayA[0]: "Annoying //this value will be trimmed by replace("\"","") same as ArrayA[3]
ArrayA[1]: CSV File
ArrayA[2]: poop@mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby
ArrayA[3]: yeah!"

after splitting it, I will replace strings from ArrayA[2] ", and ," with ¤ and then split it again

from this

ArrayA[2]: poop@mypants.com",1999,01-20-2001,"oh,boy",01-20-2001,"yeah baby

to this

ArrayA[2]: poop@mypants.com¤1999,01-20-2001¤oh,boy¤01-20-2001¤yeah baby

then split it again and would turn to this

ArrayB[0]: poop@mypants.com
ArrayB[1]: 1999,01-20-2001
ArrayB[2]: oh,boy
ArrayB[3]: 01-20-2001
ArrayB[4]: yeah baby

and lastly... I'll split the Year only and the date from ArrayB[1] with , to ArrayC

It's tedious but there's no other way to do it...
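The first two passes described above could be sketched like this. The names (ReplaceAndSplitSketch, FirstPass, SecondPass) are invented for the example, and the code assumes the placeholder character ¤ never occurs in the data itself.

```csharp
using System;

public static class ReplaceAndSplitSketch
{
    // Placeholder character; assumed never to occur in the data itself.
    private const char Placeholder = '\u00A4'; // ¤

    // Pass 1: cut wherever two quoted fields meet (quote, comma, quote).
    public static string[] FirstPass(string line)
    {
        return line.Replace("\",\"", Placeholder.ToString()).Split(Placeholder);
    }

    // Pass 2: inside a leftover piece, cut where a quoted field meets an unquoted one.
    public static string[] SecondPass(string piece)
    {
        string marked = piece
            .Replace("\",", Placeholder.ToString())
            .Replace(",\"", Placeholder.ToString());
        return marked.Split(Placeholder);
    }
}
```

On the sample row, FirstPass yields the four ArrayA pieces and SecondPass on the third piece yields the five ArrayB pieces. The unquoted pair 1999,01-20-2001 still needs its own comma split afterwards, and the leading and trailing quotes still need to be stripped, so this only works because the column layout of this particular file is known.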


There is another open source library, Cinchoo ETL, that handles quoted strings fine. Here is sample code.

string csv = @"""1"",1/2/2010,""The sample(""adasdad"") asdada"",""I was pooping in the door ""Stinky"", so I'll be damn"",""AK""";

using (var r = ChoCSVReader.LoadText(csv)
    .QuoteAllFields()
    )
{
    foreach (var rec in r)
        Console.WriteLine(rec.Dump());
}

Output:

[Count: 5]
Key: Column1 [Type: Int64]
Value: 1
Key: Column2 [Type: DateTime]
Value: 1/2/2010 12:00:00 AM
Key: Column3 [Type: String]
Value: The sample(adasdad) asdada
Key: Column4 [Type: String]
Value: I was pooping in the door Stinky, so I'll be damn
Key: Column5 [Type: String]
Value: AK


You could split the string by ",". It is recommended that each cell value in the CSV file be enclosed in quotes, like "1","2","3".....


I don't see how you could if each line is different. This line is malformed CSV. Quotes contained within a value must be doubled as shown below. I can't even tell for sure where the values should be terminated.

"1",1/2/2010,"The sample (""adasdad"") asdada","I was pooping in the door ""Stinky"", so I'll be damn","AK"

Here's my code to parse a CSV file but I don't see how any code would know how to handle your line because it's malformed.
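The code this answer referred to is not reproduced here. As a stand-in, this is a minimal sketch of a conventional quote-aware splitter for a single well-formed line, where a quote inside a quoted value is written as two quotes. The names (ConventionalCsvSketch, SplitLine) are hypothetical, and the sketch ignores quoted values that span several lines.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ConventionalCsvSketch
{
    // Splits one line of well-formed CSV into its field values.
    public static List<string> SplitLine(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];

            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // doubled quote becomes one literal quote
                    i++;
                }
                else
                {
                    inQuotes = false;    // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote, not part of the value
            }
            else if (c == ',')
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString());  // last field has no trailing comma
        return fields;
    }
}
```

Fed the corrected line above, SplitLine returns five values with the inner quotes restored to single quotes. Fed the original line, it has no way of knowing which quotes were meant literally, which is the point of this answer.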


You might want to give CsvReader a try. It will handle quoted strings fine, so you will just have to remove the leading and trailing quotes.

It will fail if your strings contain a comma. To avoid this, the quotes need to be doubled, as said in other answers.


As no (decent) .csv parser can parse non-csv-data correctly, the task isn't to parse the data, but to fix the file(s) (and then to parse the correct data).

To fix the data you need a list of bad rows (to be sent to the person responsible for the garbage for manual editing). To get such a list, you can

  1. use Access with a correct import specification to import the file. You'll get a list of import failures.

  2. write a script/program that opens the file via the OLEDB text driver.

Sample file:

"Id","Remark","DateDue"
1,"This is good",20110413
2,"This is ""good""",20110414
3,"This is ""good"","bad",and "ugly",,20110415
4,"This is ""good""" again,20110415

Sample SQL/Result:

 SELECT * FROM [badcsv01.csv]
 Id Remark               DateDue   
  1 This is good         4/13/2011 
  2 This is "good"       4/14/2011 
  3 This is "good",        NULL    
  4 This is "good" again 4/15/2011 

SELECT * FROM [badcsv01.csv] WHERE DateDue Is Null
 Id Remark          DateDue 
  3 This is "good",  NULL   


First, do it for the column names:

            // Needs System.Collections.Generic, System.Data, System.Linq and System.Text;
            // cmd is assumed to be an already prepared OracleCommand.
            DataTable pbResults = new DataTable();
            OracleDataAdapter oda = new OracleDataAdapter(cmd);
            oda.Fill(pbResults);

            StringBuilder sb1 = new StringBuilder();
            StringBuilder sb2 = new StringBuilder();
            IEnumerable<string> columnNames = pbResults.Columns.Cast<DataColumn>().Select(column => column.ColumnName);

            // Join the names with "," and wrap the whole header line in quotes.
            sb1.Append(string.Join("\",\"", columnNames));
            sb2.Append("\"");
            sb2.Append(sb1);
            sb2.AppendLine("\"");

Second, do it for each row:

            foreach (DataRow row in pbResults.Rows)
            {
                // Double any embedded quotes so a value containing " cannot end its field early.
                IEnumerable<string> fields = row.ItemArray.Select(field => field.ToString().Replace("\"", "\"\""));
                sb2.Append("\"");
                sb2.Append(string.Join("\",\"", fields));
                sb2.AppendLine("\"");
            }