
Regular expression to split on commas that are outside the double quotes in a CSV file?

Developer https://www.devze.com 2022-12-08 22:25 Source: Internet

This is the sample

"abc","abcsds","adbc,ds","abc"

Output should be

abc
abcsds
adbc,ds
abc


Try this:

"(.*?)"

If you need to put this regex inside a string literal, don't forget to escape the quotes:

Regex re = new Regex("\"(.*?)\"");
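For comparison, here is a minimal sketch of the same pattern in Python (chosen only for illustration; the answer above uses C#). The function name quoted_fields is hypothetical. It assumes every field is quoted and that no field contains an embedded quote.

```python
import re

# Lazily capture whatever sits between each pair of double quotes.
QUOTED = re.compile(r'"(.*?)"')

def quoted_fields(line):
    """Return the contents of every double-quoted field in the line.

    Assumes all fields are quoted and none contains an escaped quote.
    """
    return QUOTED.findall(line)

if __name__ == "__main__":
    for field in quoted_fields('"abc","abcsds","adbc,ds","abc"'):
        print(field)
```

Because the match is lazy, it stops at the first closing quote, so a doubled quote inside a field would cut that field short.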


This is a tougher job than you realize: not only can there be commas inside the quotes, but there can also be quotes inside the quotes. Two consecutive quotes inside a quoted string do not signal the end of the string. Instead, they signal a single quote character embedded in the string, so for example:

"x", "y,""z"""

should be parsed as:

x
y,"z"

So, the basic sequence is something like this:

Find the first non-white-space character.
If it was a quote, read up to the next quote. Then read the next character.
    Repeat until that next character is not also a quote.
    If the next (non-whitespace) character is not a comma, input is malformed.
If it was not a quote, read up to the next comma.
Skip the comma, repeat the whole process for the next field.
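The steps above can be sketched roughly as follows, in Python. This is one possible reading of the sequence, not a reference implementation: the function name parse_csv_line is hypothetical, only spaces and tabs are treated as white space, and a single line is handled (no newlines inside quoted fields).

```python
def parse_csv_line(line):
    """Split one CSV line into fields, honoring quotes and doubled quotes."""
    fields = []
    i, n = 0, len(line)
    while True:
        # Find the first non-white-space character.
        while i < n and line[i] in " \t":
            i += 1
        if i < n and line[i] == '"':
            i += 1
            parts = []
            while True:
                # Read up to the next quote.
                j = line.find('"', i)
                if j == -1:
                    raise ValueError("unterminated quoted field")
                parts.append(line[i:j])
                if j + 1 < n and line[j + 1] == '"':
                    # A doubled quote is an embedded quote; keep reading.
                    parts.append('"')
                    i = j + 2
                else:
                    i = j + 1
                    break
            # After the closing quote only white space and a comma may follow.
            while i < n and line[i] in " \t":
                i += 1
            if i < n and line[i] != ",":
                raise ValueError("malformed input after quoted field")
            fields.append("".join(parts))
        else:
            # Unquoted field: read up to the next comma (or end of line).
            j = line.find(",", i)
            if j == -1:
                j = n
            fields.append(line[i:j])
            i = j
        if i >= n:
            break
        i += 1  # Skip the comma and repeat for the next field.
    return fields

if __name__ == "__main__":
    for field in parse_csv_line('"x", "y,""z"""'):
        print(field)
```

Note that this sketch keeps trailing white space in unquoted fields and treats a trailing comma as introducing a final empty field; both are choices made for the example.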

Note that despite the tag, I'm not providing a regex -- I'm not at all sure I've seen a regex that can really handle this properly.


This answer has a C# solution for dealing with CSV.

In particular, the line

private static Regex rexCsvSplitter = new Regex( @",(?=(?:[^""]*""[^""]*"")*(?![^""]*""))" );

contains the Regex used to split properly, i.e., taking quoting and escaping into consideration.

Basically what it says is, match any comma that is followed by an even number of quote marks (including zero). This effectively prevents matching a comma that is part of a quoted string, since the quote character is escaped by doubling it.

Keep in mind that the quotes in the above line are doubled for the sake of the string literal. It might be easier to think of the expression as

,(?=(?:[^"]*"[^"]*")*(?![^"]*"))
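To see the splitter in action, here is a small sketch in Python (the original is C#; the same pattern works unchanged with Python's re module). The helper names split_csv_line and unquote are hypothetical, and stripping the outer quotes and collapsing doubled quotes is an extra step added for this example, not something the regex does.

```python
import re

# A comma followed by an even number of quote marks (possibly zero)
# up to the end of the line, i.e. a comma outside any quoted field.
SPLITTER = re.compile(r',(?=(?:[^"]*"[^"]*")*(?![^"]*"))')

def unquote(field):
    """Strip surrounding quotes and collapse doubled quotes, if quoted."""
    field = field.strip()
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        return field[1:-1].replace('""', '"')
    return field

def split_csv_line(line):
    """Split on unquoted commas, then clean up each raw field."""
    return [unquote(raw) for raw in SPLITTER.split(line)]

if __name__ == "__main__":
    for field in split_csv_line('"abc","abcsds","adbc,ds","abc"'):
        print(field)
```

One thing to keep in mind: the lookahead rescans the rest of the line for every comma, so the cost grows roughly quadratically with line length.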


If you can be sure there are no inner, escaped quotes, then I guess it's ok to use a regular expression for this. However, most modern languages already have proper CSV parsers.
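As one illustration of such a parser, here is a minimal sketch using the csv module from Python's standard library. The function name parse_line is hypothetical; skipinitialspace=True is set as an assumption so that a space after a comma is ignored.

```python
import csv

def parse_line(line):
    """Parse a single CSV line with the standard library's csv module.

    Doubled quotes inside quoted fields are handled by the default dialect.
    """
    # csv.reader accepts any iterable of lines, so a one-item list works.
    return next(csv.reader([line], skipinitialspace=True))

if __name__ == "__main__":
    for field in parse_line('"abc","abcsds","adbc,ds","abc"'):
        print(field)
```

For whole files, passing an open file object (opened with newline="") to csv.reader also copes with line breaks inside quoted fields.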

Using a proper parser is the correct answer to this: Text::CSV for Perl, for example.

However, if you're dead set on using regular expressions, I'd suggest you "borrow" from some sort of module, like this one: http://metacpan.org/pod/Regexp::Common::balanced
