How can I optimize the performance of this regular expression?

Developer https://www.devze.com 2023-03-18 03:33 Source: Internet
I'm using a regular expression to replace commas that are not enclosed in text-qualifying quotes with tabs. I'm running the regex on file content through a script task in SSIS. The file content is over 6000 lines long. I saw an example of using a regex on file content that looked like this:

String FileContent = ReadFile(FilePath, ErrInfo);        
Regex r = new Regex(@"(,)(?=(?:[^""]|""[^""]*"")*$)");
FileContent = r.Replace(FileContent, "\t");

That replace can understandably take its sweet time on a decent-sized file.

Is there a more efficient way to run this regex? Would it be faster to read the file line by line and run the regex per line?


It seems you're trying to convert comma separated values (CSV) into tab separated values (TSV).

In this case, you should try to find a CSV library instead and read the fields with that library (and convert them to TSV if necessary).
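As a rough sketch of the library route: .NET ships a CSV-capable reader, Microsoft.VisualBasic.FileIO.TextFieldParser, which can be used from C#. The class name CsvLibraryExample and the method ToTsv below are made up for illustration, and an SSIS script task would presumably need a reference to the Microsoft.VisualBasic assembly. Note that the parser strips the qualifying quotes, so this sketch assumes no field contains a tab or a line break.

```csharp
using System;
using System.IO;
using System.Text;
using Microsoft.VisualBasic.FileIO;

// Hypothetical helper for illustration; not part of the original answer.
static class CsvLibraryExample
{
    // Reads CSV records from the reader and returns them as tab-separated lines.
    public static string ToTsv(TextReader csv)
    {
        var output = new StringBuilder();
        using (var parser = new TextFieldParser(csv))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;   // commas inside quotes stay in the field

            while (!parser.EndOfData)
            {
                // ReadFields returns the unquoted fields of the next record.
                string[] fields = parser.ReadFields();
                output.Append(string.Join("\t", fields)).Append('\n');
            }
        }
        return output.ToString();
    }
}
```

It could be called as CsvLibraryExample.ToTsv(new StreamReader(FilePath)), with FilePath being the variable from the question.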

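Alternatively, you can check whether each line contains any quotes. If it does not, a plain string replace of commas is enough, and only the lines that do contain quotes need the slower quote-aware method.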
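The per-line check could look something like the sketch below. The names CsvToTsv and ConvertLine are invented for this example, the regex is the one from the question, and the whole approach assumes that no quoted field spans more than one line.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the original answer.
static class CsvToTsv
{
    // The question's pattern: a comma followed only by unquoted characters
    // and complete quoted strings up to the end of the input.
    static readonly Regex UnquotedComma =
        new Regex(@",(?=(?:[^""]|""[^""]*"")*$)", RegexOptions.Compiled);

    public static string ConvertLine(string line)
    {
        // Fast path: with no quotes on the line, every comma is a separator.
        if (line.IndexOf('"') < 0)
            return line.Replace(',', '\t');

        // Slow path: only lines with quotes pay for the regex.
        return UnquotedComma.Replace(line, "\t");
    }
}
```

Each line read from the file would then go through CsvToTsv.ConvertLine before being written out. How much this helps depends on how many lines actually contain quotes.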


The problem is the lookahead, which scans all the way to the end of the input for each comma, resulting in O(n²) complexity, which is noticeable on long inputs. You can get it done in a single pass by skipping over quoted strings while replacing:

Regex csvRegex = new Regex(@"
    (?<Quoted>
        ""                  # Open quotes
        (?:[^""]|"""")*     # not quotes, or two quotes (escaped)
        ""                  # Closing quotes
    )
    |                       # OR
    (?<Comma>,)             # A comma
    ",
RegexOptions.IgnorePatternWhitespace);
content = csvRegex.Replace(content,
                        match => match.Groups["Comma"].Success ? "\t" : match.Value);

Here we match free commas and quoted strings. The Replace method takes a callback that checks whether the match is a comma and replaces accordingly: a comma becomes a tab, and a quoted string is put back unchanged.


The simplest optimization would be to compile the regex and run it line by line:

Regex r = new Regex(@"(,)(?=(?:[^""]|""[^""]*"")*$)", RegexOptions.Compiled);
foreach (var line in System.IO.File.ReadAllLines("input.txt"))
    Console.WriteLine(r.Replace(line, "\t"));

I haven't profiled it, but I wouldn't be surprised if the speedup was huge.

If that's not enough I suggest some manual labour:

// Requires: using System; using System.IO; using System.Text;
using (var input = new StreamReader("input.txt"))
{
    char[] toMatch = { ',', '"' };
    string line;
    while ((line = input.ReadLine()) != null)
    {
        var result = new StringBuilder(line);
        bool inquotes = false;

        // Jump from one comma or quote to the next; other characters are skipped.
        for (int index = 0; (index = line.IndexOfAny(toMatch, index)) != -1; index++)
        {
            bool isquote = (line[index] == '"');
            inquotes = inquotes != isquote;   // a quote toggles the state, a comma keeps it

            // Replace only commas that sit outside quotes.
            if (!(isquote || inquotes))
                result[index] = '\t';
        }
        Console.WriteLine(result);
    }
}

PS: I assumed @"\t" was a typo for "\t", but perhaps it isn't :)
