Quotes in tab-delimited file

开发者 https://www.devze.com 2022-12-22 10:50 Source: web
I've got a simple application that opens a tab-delimited text file, and inserts that data into a database.

I'm using this CSV reader to read the data: http://www.codeproject.com/KB/database/CsvReader.aspx

And it is all working just fine!

Now my client has added a new field to the end of the file, which is "ClaimDescription", and in some of these claim descriptions, the data has quotes in it, example:

"SUMISEI MARU NO 2" - sea of Japan

This seems to be causing a major headache for my app. I get an exception which looks like this:

The CSV appears to be corrupt near record '1470' field '26 at position '181'. Current raw data : ...

And in that "raw data", sure enough the claim description field shows data with quotes in it.

I want to know if anyone has ever had this problem before, and got round it? Obviously I can ask the client to change the data they originally send to me, but this is an automated process that they use to generate the tab-delimited file; and I'd rather use that as a last resort.

I was thinking I could maybe open the file using a standard TextReader beforehand, escape any quotes, write the content back into a new file, then feed that file into the CSV reader. It is probably worth mentioning that the average size of these tab-delimited files is around 40MB.

Any help is greatly appreciated!

Cheers, Sean


Check the comment on the codeproject article about quotes:

http://www.codeproject.com/Messages/3382857/Re-Quotes-inside-of-the-Field.aspx

You need to specify in the constructor that you want another character besides " to be used as quotes.


Use the FileHelpers library instead. It is widely used and will cope with quoted fields, or fields that contain quotes.


I recently solved a similar issue. CsvReader was working properly on all but a few lines of my TSV file, and what solved my problem in the end was setting a custom delimiter in the constructor of CsvReader:

// Requires: using System; using System.IO; using LumenWorks.Framework.IO.Csv;
public static void ParseTSV(string filepath)
{
    // The third constructor argument sets the delimiter to a tab.
    using (CsvReader csvReader = new CsvReader(new StreamReader(filepath), true, '\t')) {
        // If that didn't work, passing unlikely characters into the other params
        // (quote, escape, comment) might help:
        // using (CsvReader csvReader = new CsvReader(new StreamReader(filepath), true, '\t', '~', '`', '~', ValueTrimmingOptions.None)) {
        int fieldcount = csvReader.FieldCount;

        // Does not work, since Delimiter is a read-only property:
        // csvReader.Delimiter = "\t";

        string[] headers = csvReader.GetFieldHeaders();

        while (csvReader.ReadNextRecord()) {
            for (int i = 0; i < fieldcount; i++) {
                // Print each value under its header name.
                string msg = String.Format("{0}\r{1};", headers[i], csvReader[i]);
                Console.Write(msg);
            }
            Console.WriteLine();
        }
    }
}


Use OleDbConnection: http://social.msdn.microsoft.com/Forums/en/winformsdatacontrols/thread/98fce7d7-b02d-4027-ad2e-2df3a28bd439


Maybe you can open the file with your application and replace each quote with another character and then process it.
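
A minimal sketch of that idea, assuming the producer never uses quotes to wrap fields (so every quote is literal data) and that no field contains a line break. `QuoteScrubber`, `ReplaceQuotes` and `RewriteFile` are hypothetical names, and the replacement character is up to you. It streams line by line, so a 40MB file never has to sit in memory all at once:

```csharp
using System;
using System.IO;

public static class QuoteScrubber
{
    // Swaps every double quote in a single line for the replacement character.
    public static string ReplaceQuotes(string line, char replacement)
    {
        return line.Replace('"', replacement);
    }

    // Copies inputPath to outputPath one line at a time, scrubbing quotes as it goes.
    // Note: line endings are rewritten using the platform default.
    public static void RewriteFile(string inputPath, string outputPath, char replacement)
    {
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(ReplaceQuotes(line, replacement));
            }
        }
    }
}
```

You would then call something like `QuoteScrubber.RewriteFile("claims.tsv", "claims.clean.tsv", '\'')` (file names made up here) and feed the cleaned file to the reader. The drawback is that the stored descriptions no longer match what the client sent.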


I did some searching, and there is an RFC for CSV files (RFC 4180), and that does explicitly prohibit what they are doing:

Each field may or may not be enclosed in double quotes (however some programs, such as Microsoft Excel, do not use double quotes at all). If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.

Basically, if they want to do that, they need to enclose the whole field in quotes and, per the same RFC, double each quote that appears inside it, like so:

,"""SUMISEI MARU NO 2"" - sea of Japan",

So if you want you can throw this problem back at them and insist they send you a "proper" RFC 4180 CSV file.

Since you have access to the source files for that CSV reader, another option would be to modify it to handle the kind of quoted strings they are feeding you.

This kind of situation is exactly why it is vital to have source code access to your toolset.

If instead you'd like to preprocess (hack) their files before feeding them to your tool, the correct method would be to look for fields with a quote not immediately in front of or behind a separator, enclose the whole field in another set of quotes, and double the quotes already inside it.
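
A rough sketch of that preprocessing step, under a simplifying assumption: the producer never quotes fields itself, so any field containing a quote needs wrapping. `Rfc4180Quoter`, `QuoteField` and `QuoteLine` are hypothetical names, not part of any library. A fuller version would have to detect fields that are already correctly quoted and leave them alone.

```csharp
using System;
using System.Linq;

public static class Rfc4180Quoter
{
    // Wraps a field in double quotes and doubles its embedded quotes,
    // but only when the field actually contains a quote.
    public static string QuoteField(string field)
    {
        if (field.IndexOf('"') < 0)
        {
            return field;
        }
        return "\"" + field.Replace("\"", "\"\"") + "\"";
    }

    // Splits a line on the delimiter, fixes each field, and joins it back up.
    // Assumes the delimiter never occurs inside a field.
    public static string QuoteLine(string line, char delimiter)
    {
        return string.Join(delimiter.ToString(),
                           line.Split(delimiter).Select(QuoteField));
    }
}
```

Applied to each line with `'\t'` as the delimiter, this turns the claim description from the question into `"""SUMISEI MARU NO 2"" - sea of Japan"`, which a reader that follows RFC 4180 quoting should decode back to the original text.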


Right - after a late night of Red Bull and scratching my head, I eventually found the problem: it was commas in the "Claim_Description" field. I didn't even think about that because I was using a tab-delimited file, but as soon as I did a find and replace on all commas in the file it worked absolutely fine!

The next step is to find out how to replace those commas before processing.

Again, thanks for all the suggestions.

Cheers, Sean
