开发者

Line break issue from CSV to MySQL

开发者 https://www.devze.com 2023-02-03 21:34 出处:网络
I am importing a .csv file into MySQL and everything works fine, except the line breaks that are in the file.

I am importing a .csv file into MySQL and everything works fine, except the line breaks that are in the file.

One of my .csv rows looks like this:

42,E-A-R™ Classic™ Earplugs,ear,images/ear/classic.jpg,5%,"Proven size, shape, and foam
3M's most popular earplug
Corded and uncorded in a variety of individual packs
NRR 29 dB / CSA Class AL",312-1201,,"E-A-R™ Classic™ Uncorded Earplugs, in Poly Bag",310-1001,,E-A-R™ Classic™ Uncorded Earplugs in Pillow Pack,311-1101,,"E-A-R™ Classic™ Corded Earplugs, in Poly Bag"

The sixth field over should break into a new line when called, but it doesn't. When importing the .csv I select Lines terminated by \r. I have tried \n and auto but no luck.

Weird thing is, the field looks correct in the database with all of the appropriate breaks. If I manually go in to insert the line breaks in PHPmyadmin it prints co开发者_JS百科rrectly. Each field is set to UTF-8 as well.

Any ideas on this? Thanks.

edit: here is the MySQL statement

LOAD DATA LOCAL INFILE '/tmp/php89FC0F' REPLACE INTO TABLE `ohes_flyer_products`
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
ESCAPED BY '\\'
LINES TERMINATED BY '\r'


LOAD DATA LOCAL INFILE '/tmp/php89FC0F' REPLACE INTO TABLE `ohes_flyer_products`
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '\\'
LINES TERMINATED BY '\r\n'


maybe you could use fgetcsv to parse each csv line into an array and then dump that array into the database?

something along the lines of

$fd = fopen($csvfile, "r");
while ($line = fgetcsv($fd))
{
    $sql = sprintf("INSERT INTO tablename (...) VALUES ('%s', ...)", $line[0], ...);
    $res = mysql_query($sql);
}

note 1: code not ready for production, check SQL injections!

note 2: please, use prepared statements as using them will speed the thing a lot (or make one multi-row insert statement).

note 3: wrap all in a transaction.


Your CSV file has some qualities that you might be able to exploit.

  • The field containing carriage returns that do not terminate the record are enclosed in quotation marks.
  • The carriage return denoting the end of record follows a record with data enclosed in quotation marks. If this is true for all records, it is a way to possibly distinguish mid-field carriage returns from record terminators.

Knowing this, here are some things you can try:

  1. Using a program like UltraEdit (or Notepad++) and its find/replace features (that include regular expression handling):

    • Find all carriage returns that are preceded by a quotation mark and replace them with a unique character or string. I suggest the pipe character "|" but first ensure they aren't used anywhere in the CSV file. These will represent end-of-record.
    • Next, replace all carriage returns with spaces. This will bring your fields with unwanted carriage returns back into alignment with the other data.
    • Finally, replace all special end-of-record characters with carriage returns. The end result that the only carriage returns present are end-of-record indicators.
  2. Given that the carriage returns appear within a field that is enclosed by a delimiter (the quotation marks) you can specify that the import engine should only honor field and record delimiters outside of quotations. (MySQL LOAD DATA INFILE syntax) Specifically, look at the ENCLOSED BY 'char' parameter. Since not all of your fields use the delimiter, you will need to specify OPTIONALLY. In theory you should be able to specify how the CSV file is constructed and not need to parse it beforehand. I am of the opinion, however, that the in-field carriage returns should probably be removed so that the text will properly wrap when output in new context.


Your CSV appears to be non-standard, but that's often the reality of dealing with customer datasets.

As tools like MySQL's LOAD DATA statement are made to handle only the perfect use case, I've found that dealing with non-standard datasets like this requires code.

One way to handle this is to first scrub your CSV, replacing mid-field line breaks with a special, unique string (like ===MIDFIELD_LINE_BREAK===). Then I would write a custom CSV parser in a scripting language (Python, Ruby, PHP, Perl, etc).

In your CSV parser, iterate through lines in the file. For each line:

  • Swap the \n or \r characters back in for the ===MIDFIELD_LINE_BREAK=== characters.
  • Construct and execute an INSERT statement.


This worked for me:

$query = <<<EOT

LOAD DATA LOCAL INFILE '$file' REPLACE INTO TABLE `$table`
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
ESCAPED BY '\\\'
LINES TERMINATED BY '\\\n'
IGNORE 1 ROWS;

EOT;

I had to tweak @Krunal's answer, due to getting errors, by adding a few extra forward slashes.

Unix line returns used here, by the way.

DOS:     \\\r\\\n
Old Mac: \\\r
Unix:    \\\n
0

精彩评论

暂无评论...
验证码 换一张
取 消