Regex to match specific syntax

Developer https://www.devze.com 2023-02-18 20:08 Source: the web
Hi, I want a regex pattern to match a very specific string syntax. Below is the pattern I have put together; it works in some cases but not in others, and I'm quite certain it is far too complicated:

\[\CONTENT\((?:(?:(?:(\w+) ?= ?((?:"(?:[^"]+)")|(?:'(?:[^']+)')|(?:(?:[^"',]+))) ?, ?)+(?:(?:\w+) ?= ?(?:(?:"(?:.+)")|(?:'(?:.+)')|(?:(?:[^"',]+)))))|(?:(?:\w+) ?= ?(?:(?:"(?:.+)")|(?:'(?:.+)')|(?:(?:[^"',]+)))))\)]

The string syntax that I'm trying to match is below:

[CONTENT(Name="value", Name2='value2', Name_3 = value3, Name4= "value 4 \" includes an escaped quote")] etc

The match groups I want returned are as follows

Match Group 1 - Match 1: [CONTENT(Name="value", Name2='value2', Name_3 = value3, Name4= "value 4 \" includes an escaped quote")]

Match Group 2 - Match 1: Name="value"
Match Group 2 - Match 2: Name
Match Group 2 - Match 3: value

Match Group 3 - Match 1: Name2='value2'
Match Group 3 - Match 2: Name2
Match Group 3 - Match 3: value2

Match Group 4 - Match 1: Name_3 = value3
Match Group 4 - Match 2: Name_3
Match Group 4 - Match 3: value3

Match Group 5 - Match 1: Name4= "value 4 \" includes an escaped quote"
Match Group 5 - Match 2: Name4
Match Group 5 - Match 3: value 4 \" includes an escaped quote

When I refer to match groups I mean Match Groups in .NET. The results don't have to be exactly as above, but similar if possible.

I'm quite good with simple regexes, but I can't get my head around look-arounds and the like. The "Name = Value" sets can repeat any number of times (possibly, though unlikely, without limit), each separated by a ',' (comma), except for the last set, which is not followed by a comma. There may or may not be spaces on either side of the '=' (equals) sign, and likewise on either side of the ',' (comma).

I don't know whether this is too complicated to do with a regex. If it is, I'm open to any suggestions for an alternative way to parse such a string.

Thanks for any help anyone can provide.

Chris


Assuming...

  1. There must be at least one attrib/value pair. AND
  2. Each attrib/value pair is separated by one comma and optional whitespace. AND
  3. Each attribute value is either a properly quoted string or a single "word". AND
  4. Quoted attribute value strings may contain escaped chars: (e.g. v1="That's not \"MY\" problem!" and/or v2='That\'s not "MY" problem!'). AND
  5. An attribute name or unquoted value "word" consists of alphanumerics, underscores and dashes only (i.e. [A-Za-z0-9_\-]+). (Note that the original question does not define this requirement clearly.)

Then this regex (in C#) will correctly match a [CONTENT(a1=v1, a2=v2...)] structure:

// Requires: using System.Text.RegularExpressions;
Regex regexObj = new Regex(
    @"# Match a [CONTENT(a1=v1, a2=v2...)] structure.
    \[CONTENT\(\s*                  # Opening delimiter
    # Match required first attrib/value pair.
    [\w\-]+                         # First attrib name (allow [A-Za-z0-9_-]).
    \s*=\s*                         # Name and value separated by =.
    (?:                             # Group value spec alternatives.
      ""[^""\\]*(\\.[^""\\]*)*""    # Either double quoted string,
    | '[^'\\]*(\\.[^'\\]*)*'        # or a single quoted string,
    |  [\w\-]+                      # or single unquoted ""word"".
    )                               # End group for value alternatives.
    # Match optional second, third... attrib/value pairs.
    (?:                             # Group to allow optional pairs.
      \s*,\s*                       # All pairs separated by comma.
      [\w\-]+                       # Attrib name.
      \s*=\s*                       # Name and value separated by =.
      (?:                           # Group value spec alternatives.
        ""[^""\\]*(\\.[^""\\]*)*""  # Either double quoted string,
      | '[^'\\]*(\\.[^'\\]*)*'      # or a single quoted string,
      |  [\w\-]+                    # or single unquoted ""word"".
      )                             # End group for value alternatives.
    )*                              # Zero or more optional A=V pairs.
    \s*\)\]                         # Closing delimiter.", 
    RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);

Once you have matched and captured a single [CONTENT(...)] structure, you can pick it apart using another regex which matches each attrib/value pair, one at a time.
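
The second step might look something like the sketch below. It is not part of the original answer: the class name ContentPairs, the method Split and the group names name, dq, sq and word are made up for illustration. It reuses the same value alternatives as the regex above and returns each value with its surrounding quotes removed, leaving backslash escapes untouched.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContentPairs
{
    // Matches one attrib/value pair. The value lands in exactly one of
    // the named groups dq, sq or word, depending on how it was written.
    private static readonly Regex PairRegex = new Regex(
        @"(?<name>[\w\-]+)                        # Attrib name.
          \s*=\s*                                 # Name and value separated by =.
          (?:                                     # Group value spec alternatives.
            ""(?<dq>[^""\\]*(?:\\.[^""\\]*)*)""   # Either double quoted string,
          | '(?<sq>[^'\\]*(?:\\.[^'\\]*)*)'       # or a single quoted string,
          | (?<word>[\w\-]+)                      # or single unquoted word.
          )                                       # End group for value alternatives.",
        RegexOptions.IgnorePatternWhitespace);

    // Takes an already validated [CONTENT(...)] structure and returns
    // its pairs in order of appearance.
    public static List<KeyValuePair<string, string>> Split(string structure)
    {
        var pairs = new List<KeyValuePair<string, string>>();
        foreach (Match m in PairRegex.Matches(structure))
        {
            string value = m.Groups["dq"].Success ? m.Groups["dq"].Value
                         : m.Groups["sq"].Success ? m.Groups["sq"].Value
                         : m.Groups["word"].Value;
            pairs.Add(new KeyValuePair<string, string>(m.Groups["name"].Value, value));
        }
        return pairs;
    }
}
```

Calling ContentPairs.Split on the example from the question should yield four pairs, with Name4 mapped to the text between its quotes, backslash included. The input is assumed to have passed the validating regex first; on arbitrary text this pattern will happily pick up anything that merely looks like name=value.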

And for goodness' sake, when writing a non-trivial regex such as this one, always use free-spacing mode and add comments and indentation!
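
Since the question is about .NET specifically, the validation and the splitting can probably also be done in a single pass, because .NET keeps every capture a group makes inside a repetition (Group.Captures) and allows the same group name to be used more than once. The sketch below is an assumption-laden illustration, not the answer's own code: ContentCaptures, Parse and the group names name and value are invented here, and values are returned with their quotes still attached.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ContentCaptures
{
    // Value alternatives: double quoted, single quoted, or unquoted word.
    private const string Value =
        @"(?:""[^""\\]*(?:\\.[^""\\]*)*""|'[^'\\]*(?:\\.[^'\\]*)*'|[\w\-]+)";

    // The groups name and value appear twice on purpose: .NET merges
    // same-named groups and records every capture in order.
    private static readonly Regex StructureRegex = new Regex(
        @"\[CONTENT\(\s*" +
        @"(?<name>[\w\-]+)\s*=\s*(?<value>" + Value + ")" +
        @"(?:\s*,\s*(?<name>[\w\-]+)\s*=\s*(?<value>" + Value + "))*" +
        @"\s*\)\]");

    // Returns one { name, value } array per pair, or an empty list
    // when no [CONTENT(...)] structure is found in the input.
    public static List<string[]> Parse(string input)
    {
        var result = new List<string[]>();
        Match m = StructureRegex.Match(input);
        if (!m.Success) return result;

        CaptureCollection names = m.Groups["name"].Captures;
        CaptureCollection values = m.Groups["value"].Captures;
        for (int i = 0; i < names.Count; i++)
        {
            result.Add(new[] { names[i].Value, values[i].Value });
        }
        return result;
    }
}
```

This relies on behaviour specific to the .NET regex engine; most other engines keep only the last capture of a repeated group, in which case the two-step approach remains the portable one.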


This is certainly not a job for regular expressions. Use a proper parser instead: recursive descent parsers are very easy to implement using parser combinators in C#. For example, see this or this.
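
To give a rough idea of the non-regex route, here is a minimal hand-written scanner. Note that it is a plain character-by-character parser, not the parser-combinator approach suggested above, and that the names ContentParser, TryParse and the helper methods are hypothetical. It assumes the input is exactly one [CONTENT(...)] structure, that names and unquoted values consist of letters, digits, '_' and '-', and that a backslash inside a quoted value escapes the next character.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ContentParser
{
    // Returns true and fills pairs when s follows the assumed syntax.
    public static bool TryParse(string s, out List<KeyValuePair<string, string>> pairs)
    {
        pairs = new List<KeyValuePair<string, string>>();
        const string open = "[CONTENT(";
        if (!s.StartsWith(open, StringComparison.Ordinal)) return false;
        int i = open.Length;

        while (true)
        {
            SkipSpaces(s, ref i);
            string name = ReadWord(s, ref i);
            if (name.Length == 0) return false;

            SkipSpaces(s, ref i);
            if (i >= s.Length || s[i] != '=') return false;
            i++;
            SkipSpaces(s, ref i);

            string value;
            if (i < s.Length && (s[i] == '"' || s[i] == '\''))
            {
                if (!TryReadQuoted(s, ref i, out value)) return false;
            }
            else
            {
                value = ReadWord(s, ref i);
                if (value.Length == 0) return false;
            }
            pairs.Add(new KeyValuePair<string, string>(name, value));

            SkipSpaces(s, ref i);
            if (i < s.Length && s[i] == ',') { i++; continue; }
            break;
        }
        // Only the closing delimiter may remain.
        return s.Substring(i) == ")]";
    }

    private static void SkipSpaces(string s, ref int i)
    {
        while (i < s.Length && char.IsWhiteSpace(s[i])) i++;
    }

    private static string ReadWord(string s, ref int i)
    {
        int start = i;
        while (i < s.Length && (char.IsLetterOrDigit(s[i]) || s[i] == '_' || s[i] == '-')) i++;
        return s.Substring(start, i - start);
    }

    // Reads a quoted value starting at the opening quote and unescapes
    // backslash sequences; fails when the closing quote is missing.
    private static bool TryReadQuoted(string s, ref int i, out string value)
    {
        char quote = s[i++];
        var sb = new StringBuilder();
        while (i < s.Length)
        {
            char c = s[i++];
            if (c == '\\' && i < s.Length) { sb.Append(s[i++]); continue; }
            if (c == quote) { value = sb.ToString(); return true; }
            sb.Append(c);
        }
        value = "";
        return false;
    }
}
```

Unlike the regex, this version returns values already unescaped, and each early return marks a spot where a more specific error message could be produced.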
