Improve performance of a regexp_问答_开发者_运维开发者技术经验分享

开发者 https://www.devze.com 2023-01-06 11:55 出处：网络

My software allows users to use regexp to prepare files. I am in the process of adding a default regexp library with common expressions that can be re-used t开发者_开发知识库o prepare a variety of formats. One common task is to remove crlf in specific parts of the files, but not in others. For instance, this:

    <TU>Lorem 
    Ipsum</TU>
    <SOURCE>This is a sentence
    that should not contain
    any line break.
    </SOURCE>

Should become:

    <TU>Lorem 
    Ipsum</TU>
    <SOURCE>This is a sentence that should not contain any line break.
    </SOURCE>

I have a rexep that does the job pretty nicely:

(?(?<=<SOURCE>(?:(?!</?SOURCE>).)*)(\r\n))

The problem is that it is processing intensive and with files above 500kb, it can take 30+ seconds. (regex is compiled, in this case, uncompiled is much slower)

It's not a big issue, but I wonder is there is a better way to achieve the same results with Regex.

Thanks in advance for your suggestions.

Try this:

\r\n(?=(?>[^<>]*(?><(?!/?SOURCE>)[^<>]*)*)</SOURCE>)

It starts out by matching \r\n, then uses a lookahead to see if the match is between <SOURCE> and </SOURCE>. It does that by looking for a </SOURCE>, but if it finds <SOURCE> first it fails. Atomic groups prevent it from saving the state information that would be needed for backtracking, because pass or fail, backtracking is never necessary.

I would prepend a negative lookahead assertion to the regex to make sure that you can in fact match a \r\n in the current position. Otherwise, the engine has to do the entire lookbehind (arbitrary-size to boot) on every single character in the entire file, only to find out then that there is no carriage return to be replaced.

(?=\r\n)(?(?<=<SOURCE>(?:(?!</?SOURCE>).)*)(\r\n))

should be a lot faster. At least in RegexBuddy, the regex engine needs a lot fewer steps to complete the match. If it doesn't in .NET, I don't know why. Perhaps the conditional regex is not as efficient (I must admit that at first I didn't recognize it and thought that there was a syntax error in your regex). I think you don't need a conditional regex in this scenario. How about

\r\n(?<=<SOURCE>(?:(?!</?SOURCE>).)*)

Is this faster? I'm assuming you're using RegexOptions.Singleline to compile the regex.

If not, well, then there probably are very many carriage returns and many other characters inside your <SOURCE> blocks, and the arbitrary-size lookbehind simply takes a long time. Then my other suggestion is to split up the task:

Match the <SOURCE> block
Replace all CRLFs inside the block (no regex required)
Replace the <SOURCE> block

"Compiling" a regular expression simply means converting it from a Deterministic Finite Automaton to a Non-deterministic Finite Automaton. It isn't "compiling into machine code" as you might expect.

NFAs are typically smaller than their corresponding DFA and can execute more space efficiently. Every DFA has at least one equivalent NFA and vice-versa. However, Perl Compatible Regular Expressions are aren't actually Regular. So I don't know that they are converted into NFAs or if "compiling" just another form of lexical analysis that done once need not be done again.

PCREs are slow according to Russ Cox, partly because of their non-regularity, and your expression above is quite non-regular.

Oh, and if you are trying to recognize nested tags with regexps, don't. Use a real (X|HT)?ML parser. You really don't want the pony to come