Recently I've been informed by a reputable SO user, that TStringList
has splitting bugs which would cause it to fail parsing CSV data. I haven't been informed about the nature of these bugs, and a search on the internet including Quality Central did not produce any results, so I'm asking. What are TStringList splitting bugs?
Please note, I'm not interested in unfounded opinion based answers.
What I know:Not much... One is that, these bugs show up rarely with test data, but not so rarely in real world.
The other is, as stated, they prevent proper parsing of CSV. Thinking that it is difficult to reproduce the bugs with test data, I am (probably) seeking help from whom have tried us开发者_运维知识库ing a string list as a CSV parser in production code.
Irrelevant problems:
I obtained the information on a 'Delphi-XE' tagged question, so failing parsing due to the "space character being considered as a delimiter" feature do not apply. Because the introduction of the StrictDelimiter
property with Delphi 2006 resolved that. I, myself, am using Delphi 2007.
Also since the string list can only hold strings, it is only responsible for splitting fields. Any conversion difficulty involving field values (f.i. date, floating point numbers..) arising from locale differences etc. are not in scope.
Basic rules:
There's no standard specification for CSV. But there are basic rules inferred from various specifications.
Below is demonstration of how TStringList handles these. Rules and example strings are from Wikipedia. Brackets ([
]
) are superimposed around strings to be able to see leading or trailing spaces (where relevant) by the test code.
Test string: [1997, Ford , E350] Items: [1997] [ Ford ] [ E350]Fields with embedded commas must be enclosed within double-quote characters.
Test string: [1997,Ford,E350,"Super, luxurious truck"] Items: [1997] [Ford] [E350] [Super, luxurious truck]Fields with embedded double-quote characters must be enclosed within double-quote characters, and each of the embedded double-quote characters must be represented by a pair of double-quote characters.
Test string: [1997,Ford,E350,"Super, ""luxurious"" truck"] Items: [1997] [Ford] [E350] [Super, "luxurious" truck]Fields with embedded line breaks must be enclosed within double-quote characters.
Test string: [1997,Ford,E350,"Go get one now they are going fast"] Items: [1997] [Ford] [E350] [Go get one now they are going fast]In CSV implementations that trim leading or trailing spaces, fields with such spaces must be enclosed within double-quote characters.
Test string: [1997,Ford,E350," Super luxurious truck "] Items: [1997] [Ford] [E350] [ Super luxurious truck ]Fields may always be enclosed within double-quote characters, whether necessary or not.
Test string: ["1997","Ford","E350"] Items: [1997] [Ford] [E350]Testing code:
var
SL: TStringList;
rule: string;
function GetItemsText: string;
var
i: Integer;
begin
for i := 0 to SL.Count - 1 do
Result := Result + '[' + SL[i] + '] ';
end;
procedure Test(TestStr: string);
begin
SL.DelimitedText := TestStr;
Writeln(rule + sLineBreak, 'Test string: [', TestStr + ']' + sLineBreak,
'Items: ' + GetItemsText + sLineBreak);
end;
begin
SL := TStringList.Create;
SL.Delimiter := ','; // default, but ";" is used with some locales
SL.QuoteChar := '"'; // default
SL.StrictDelimiter := True; // required: strings are separated *only* by Delimiter
rule := 'Spaces are considered part of a field and should not be ignored.';
Test('1997, Ford , E350');
rule := 'Fields with embedded commas must be enclosed within double-quote characters.';
Test('1997,Ford,E350,"Super, luxurious truck"');
rule := 'Fields with embedded double-quote characters must be enclosed within double-quote characters, and each of the embedded double-quote characters must be represented by a pair of double-quote characters.';
Test('1997,Ford,E350,"Super, ""luxurious"" truck"');
rule := 'Fields with embedded line breaks must be enclosed within double-quote characters.';
Test('1997,Ford,E350,"Go get one now'#10#13'they are going fast"');
rule := 'In CSV implementations that trim leading or trailing spaces, fields with such spaces must be enclosed within double-quote characters.';
Test('1997,Ford,E350," Super luxurious truck "');
rule := 'Fields may always be enclosed within double-quote characters, whether necessary or not.';
Test('"1997","Ford","E350"');
SL.Free;
end;
If you've read it all, the question was :), what are "TStringList splitting bugs?"Not much... One is that, these bugs show up rarely with test data, but not so rarely in real world.
All it takes is one case. Test data is not random data, one user with one failure case should submit the data and voilà, we've got a test case. If no one can provide test data, maybe there's no bug/failure?
There's no standard specification for CSV.
That one sure helps with the confusion. Without a standard specification, how do you prove something is wrong? If this is left to one's own intuition, you might get into all kinds of troubles. Here's some from my own happy interaction with government issued software; My application was supposed to export data in CSV format, and the government application was supposed to import it. Here's what got us into a lot of trouble several years in a row:
- How do you represent empty data? Since there's no CSV standard, one year my friendly gov decided anything goes, including nothing (two consecutive commas). Next they decided only consecutive commas are OK, that is,
Field,"",Field
is not valid, should beField,,Field
. Had a lot of fun explaining to my customers that the gov app changed validation rules from one week to the next... - Do you export ZERO integer data? This was probably an bigger abuse, but my "gov app" decided to validate that also. At one time it was mandatory to include the
0
, then it was mandatory NOT to include the0
. That is, at one timeField,0,Field
was valid, nextField,,Field
was the only valid way...
And here's an other test-case where (my) intuition failed:
1997, Ford, E350, "Super, luxurious truck"
Please note the space between ,
and "Super
, and the very lucky comma that follows "Super
. The parser employed by TStrings
only sees the quote char if it immediately follows the delimiter. That string is parsed as:
[1997]
[ Ford]
[ E350]
[ "Super]
[ luxurious truck"]
Intuitively I'd expect:
[1997]
[ Ford]
[ E350]
[Super luxurious truck]
But guess what, Excel does it the same way Delphi does it...
Conclusion
TStrings.CommaText
is fairly good and nicely implemented, at least the Delphi 2010 version I looked at is quite effective (avoids multiple string allocations, uses aPChar
to "walk" the parsed string) and works about the same as Excel's parser does.- In the real world you'll need to exchange data with other software, written using other libraries (or no libraries at all), where people might have miss-interpreted some of the (missing?) rules of CSV. You'll have to adapt, and it'll probably not be a case of right-or-wrong but a case of "my clients need to import this crap". If that happens, you'll have to write your own parser, one that adapts to the requirements of the 3rd party app you'd be dealing with. Until that happens, you can safely use
TStrings
. And when it does happen, it might not beTString
's fault!
I'm going to go out on a limb and say that the most common failure case is the embedded linebreak. I know most of the CSV parsing I do ignores that. I'll use 2 TStringLists, 1 for the file I'm parsing, the other for the current line. So I'll end up with code similar to the following:
procedure Foo;
var
CSVFile, ALine: TStringList;
s: string;
begin
CSVFile := TStringList.Create;
ALine := TStringList.Create;
ALine.StrictDelimiter := True;
CSVFile.LoadFromFile('C:\Path\To\File.csv');
for s in CSVFile do begin
ALine.CommaText := s;
DoSomethingInteresting(ALine);
end;
end;
Of course, since I'm not taking care to make sure that each line is "complete", I can potentially run into cases where the input contains a quoted linebreak in a field and I miss it.
Until I run into real world data where it's an issue, I'm not going to bother fixing it. :-P
Another example... this TStringList.CommaText bug exists in Delphi 2009.
procedure TForm1.Button1Click(Sender: TObject);
var
list : TStringList;
begin
list := TStringList.Create();
try
list.CommaText := '"a""';
Assert(list.Count = 1);
Assert(list[0] = 'a');
Assert(list.CommaText = 'a'); // FAILS -- actual value is "a""
finally
FreeAndNil(list);
end;
end;
The TStringList.CommaText setter and related methods corrupt the memory of the string that holds the a
item (its null terminator character is overwritten by a "
).
Already tried use TArray<String>
split?
var
text: String;
arr: TArray<String>;
begin
text := '1997,Ford,E350';
arr := text.split([',']);
So arr
would be:
arr[0] = 1997;
arr[1] = Ford;
arr[2] = E350;
精彩评论