开发者

How to bulk insert from CSV when some fields have new line character?

开发者 https://www.devze.com 2022-12-24 19:20 出处:网络
I have a CSV dump from another DB that looks like this (id, name, notes): 1001,John Smith,15 Main Street

I have a CSV dump from another DB that looks like this (id, name, notes):

1001,John Smith,15 Main Street

1002,Jane Smith,"2010 Rockliffe Dr.

Pleasantville, IL

USA"

1003,Bill Karr,2820 West Ave.

The last field may contain carriage returns and commas, in which case it is surrounded by double quotes. And I need to preserve those returns and commas.

I use this code to import CSV into my table:

BULK INSERT CSVTest
FROM 'c:\csvfile.csv'
W开发者_如何学GoITH
(
  FIELDTERMINATOR = ',',
  ROWTERMINATOR = '\n'
)

SQL Server 2005 bulk insert cannot figure out that carriage returns inside quotes are not row terminators.

How to overcome?


UPDATE:

Looks like the only way to keep line breaks inside a field is to use different row separator. So, I want to mark all row separating line breaks by putting a pipe in front of them. How can I change my CSV to look like this?

1001,John Smith,15 Main Street|

1002,Jane Smith,"2010 Rockliffe Dr.

Pleasantville, IL

USA"|

1003,Bill Karr,2820 West Ave.|


Bulk operations on SQL Server do not specifically support CSV even though they can import them if the files are carefully formatted. My suggestion would be to enclose all field values in quotes. BULK INSERT might then allow the carriage returns within a field value. If it does not, then your next solution might be an Integration Services package.

See Preparing Data for Bulk Export or Import for more.


you can massage these line breaks into one line with a script, eg you can use GNU sed to remove line breaks. eg

$ more file
1001,John Smith,15 Main Street
1002,Jane Smith,"2010 Rockliffe Dr.
Pleasantville, IL
USA"
1003,Bill Karr,"2820
West Ave"

$ sed '/"/!s/$/|/;/.*\".*[^"]$/{ :a;N };/"$/ { s/$/|/ }' file
1001,John Smith,15 Main Street|
1002,Jane Smith,"2010 Rockliffe Dr.
Pleasantville, IL
USA"|
1003,Bill Karr,"2820
West Ave"|

then you can bulk insert.

Edit:

Save this :/"/!s/$/|/;/.*\".*[^"]$/{ :a;N };/"$/ { s/$/|/ } in a file , say myformat.sed. then do this on the command line

c:\test> sed.exe -f myformat.sed myfile


According to the source of all knowledge (Wikipedia), csv uses new lines to separate records. So what you have is not valid csv.

My suggestion is that you write a perl program to process your file and add each record to the db.

If you're not a perl person, then you could use a programming site or see if some kind SO person will write the parsing section of the program for you.

Added:

Possible Solution

Since the OP states that he can change the input file, I'd change all the new lines that do not follow a " to be a reserved char sequence, eg XXX

This can be an automated replacement in many editors. In Windows, UltraEdit includes regexp find/replace functionality

Then import into the dbms since you'll no longer have the embedded new lines.

Then use SQL Replace to change the XXX occurrences back into new lines.


If you have control over the contents of the CSV file, you could replace the in-field line breaks (CRLF) with a non-linebreak character (perhaps just CR or LF), then run a script after the import to replace them with CRLF again.

This is how MS Office products (Excel, Access) deal with this problem.


OK, here's a small Java program that I end up writing to solve the problem.
Comments, corrections and optimizations are welcome.

import java.io.*;

public class PreBulkInsert
{
    public static void main(String[] args)
    {
        if (args.length < 3)
        {
            System.out.println ("Usage:");
            System.out.println ("  java PreBulkInsert input_file output_file separator_character");
            System.exit(0);
        }

        try
        {
            boolean firstQuoteFound = false;
            int fromIndex;
            int lineCounter = 0;
            String str;

            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            BufferedWriter out = new BufferedWriter(new FileWriter(args[1])); 
            String newRowSeparator = args[2];

            while ((str = in.readLine()) != null)
            {
                fromIndex = -1;
                do
                {
                    fromIndex = str.indexOf('"', fromIndex + 1);
                    if (fromIndex > -1)
                        firstQuoteFound = !firstQuoteFound;
                } while (fromIndex > -1);

                if (!firstQuoteFound)
                    out.write(str + newRowSeparator + "\r\n");
                else
                    out.write(str + "\r\n");
                lineCounter++;
            }
            out.close();
            in.close();
            System.out.println("Done! Total of " + lineCounter + " lines were processed.");
        }
        catch (IOException e)
        {
            System.out.println(e.getMessage());
            System.exit(1);
        }       
    }
}


You cannot import this unless the CSV is in valid format. So, you have to either fix the dump or manually using search & replace fix the unwanted new line characters.

0

精彩评论

暂无评论...
验证码 换一张
取 消