开发者

Find matching bracket with Regex

开发者 https://www.devze.com 2023-01-14 02:31 出处:网络
The input is a string representing a list of elements. A list is defined as an open curly { followed by 0 or more elements separated by whitespace followed by a closed curly }.

The input is a string representing a list of elements.

A list is defined as an open curly { followed by 0 or more elements separated by whitespace followed by a closed curly }.

An element is either a literal or a list of elements.

A literal is a succession of non-whitespace characters. If an element contains a curly bracket, it must be escaped with a backslash : \{ and \}. (Or you could assume curlies are not allowed inside literals, for simplicity)

Example:

"{abc { def ghi } 7 { 1 {2} {3 4} } {5 6} x\{yz \}foo }"

No curlies inside literals:

"{abc { def ghi } 7 { 1 {2} {3 4} } {5 6} xyz foo }"

(This is a simplified definition of a Tcl list.)

What I want to know is: can the input be split into the elements of the outermost loop using regex?

Expected output:

abc
{ def ghi }
7
{ 1 {2} {3 4} }
{5 6}
x{yz
}foo

The real qu开发者_Python百科estion is: can this be done with a Regex?

I'm most interested in the .NET flavour, but will accept any answers.

I'll post my own assumption in an answer, and see if it's validated or destroyed.


Unfortunately the answer is YES for some flavor of Regex, e.g. PCRE and .NET because they support recursive pattern and stack-like operations respectively.

The grammar can be written as

ELEMENT  -> (?!\{)\S+ | LIST
LIST     -> '\{\s*' ELEMENT? ('\s+' ELEMENT)* '\s*\}' 

thus in PCRE, this can be transformed into the pattern:

   \{\s*(?0)?(?:\s+(?0))*\s*\}|(?!\{)(?:[^\s}]|\}(?![\s}]))+

#  ---------------------------                   ^^^^^^^^^
#            LIST                    Make sure the } is not closing the group

See http://www.ideone.com/SnGsU for example (I have stripped the top-level { and } for simplicity).

(Of course, don't try this at work :) )

(BTW, I don't know how to transform this PCRE into .NET flavor. If someone knows, please try Converting PCRE recursive regex pattern to .NET balancing groups definition)


Well, the edit removes curly braces from tokens and takes the sting from the question, and now it is easily doable with .Net Regexes, using balancing groups. It is simply matching braces, which is a basic example.
Much like KennyTM's answer, this will only work if you remove the top level braces, or it will match the whole input.
Again, this is better used for recreational purposes:

(?:                    # try matching...
    (?:\\[{}]|[^\s{}])+\s*? # a literal (allow escaped curly braces)
    |                       # OR
    (?<Curly>{)\s*          # "{" and push to stack
    |                       # OR
    (?<-Curly>})\s*?        # "}", pop from stack and fail if the stack is empty
)+?                    # ...a few times, and stop whenever you can.
(?(Curly)(?!))         # Make sure there aren't any extra open curly braces

For much more details see this article: Regex Balancing Group in Depth


The traditional answer to this is a resounding "NO". As we have learned in the compilers class, a regular grammar can't be used to describe a language with a recursive definition (i.e. you can't use a finite state machine)

What's needed here is a context free parser, the implementation of which boils down to a finite state machine + a STACK.
See ANTLR, bison etc.


@Cristi is right about the regex: It is theoretically impossible to solve recursive expressions using a stackless, finite-state automaton. The solution, however, is simpler: you only need to keep a counter of the number of open parentheses, and make sure it doesn't drop below 0. It is more memory-saving than maintaining a stack, and you only need the count - not the contents - of the Parentheses.

Algorithm:

counter = 0                        // Number of open parens
For char c in string:              
    print c                        
    if c=='{':                     // Keep track on number of open parens
        counter++
    if c=='}':
        counter--
    if counter==1:                 // New line if we're back to the top level
        print "\n"
    elif counter<1:                // Error if the nesting is malformed
        print "ERROR: parentheses mismatch"
        break
0

精彩评论

暂无评论...
验证码 换一张
取 消