Regular Expression - Get Word Between Characters

开发者 https://www.devze.com 2023-03-08 06:26 Source: Internet
Given the following example string: "[ One].[Two ].[ Three ].[Four]" I want to match "One", "Two", "Three" and "Four".

In other words: I need to get the word between the brackets, regardless of how many whitespace characters surround this word.

I've tried it with the following expression:

(?<=\[)(?s)(.*?)(?=\s*\])

That results in " One", "Two", " Three" and "Four".

EDIT: It's a little bit more complicated than I first thought it would be:

  1. There are one or more words enclosed in brackets, which might be separated by an arbitrary char (e.g. "[one]" or "[one] [two][three].[four]").
  2. The brackets contain one single word and any amount of whitespace, possibly none (e.g. "[one]" or "[two ]" or "[ three ]").
  3. These blocks of words and their enclosing brackets are surrounded by a known sequence of chars: "These words [word-1] .. [word-n] are well known" or "These words [word-1] .. [word-n] are well known".

Please note that "[word-1] .. [word-n]" just stands for an arbitrary count of the blocks described above.

I want to match just the single word(s) between the brackets and eliminate the surrounding sequence ("These words" and "are well known") as well as any whitespace within the brackets and between the blocks. In addition, the char that may exist between the blocks (there can be at most one) should be eliminated, too. Hope that wasn't too weird ;)
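Requirement 3 (the known text around the blocks) can be handled in two passes: first isolate the region between the known prefix and suffix, then extract the bracketed words from it. The following is a minimal sketch in Python's re module, which is an assumption since the question names no language; the function name and the sample sentence are hypothetical.

```python
import re

# Assumed fixed text around the bracketed blocks (from the question's example).
PREFIX = "These words"
SUFFIX = "are well known"

def words_between(text):
    """Return the bracketed words found between PREFIX and SUFFIX."""
    # Pass 1: capture whatever sits between the known prefix and suffix.
    outer = re.fullmatch(
        re.escape(PREFIX) + r"(.*)" + re.escape(SUFFIX), text, flags=re.S
    )
    if outer is None:
        return []
    # Pass 2: inside that region, capture each run of characters that is
    # neither whitespace nor a bracket, allowing padding inside the brackets.
    return re.findall(r"\[\s*([^\s\[\]]+)\s*\]", outer.group(1))

print(words_between("These words [ one] [two ].[three] are well known"))
# prints ['one', 'two', 'three']
```

The separator between blocks is never captured, so it does not need to be removed explicitly.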


You can use this, with the "global" flag enabled:

\[\s*(\S+?)\s*\]

Explanation

\[      # a literal "["
\s*     # any number of white space
(\S+?)  # at least one non white-space character, non-greedily (group 1)
\s*     # any number of white space
\]      # a literal "]"
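As a quick check of the pattern above, here is a minimal sketch in Python (an assumption; the answer itself is engine-agnostic). Python has no "global" flag; re.findall plays that role by scanning the whole string.

```python
import re

# Sample input from the question.
s = "[ One].[Two ].[ Three ].[Four]"

# findall scans the whole string and, because the pattern has exactly one
# capturing group, returns the contents of group 1 for every match.
words = re.findall(r"\[\s*(\S+?)\s*\]", s)
print(words)  # prints ['One', 'Two', 'Three', 'Four']
```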

EDIT:

@Kobi noted that \S also matches "]", so a greedy \S+ can run past the closing bracket. In a target like "[ One]", group 1 would for a moment contain "One]".

But then there still is the \] at the end of the regex, at which point the regex engine would backtrack and give the "]" to \], so the expression can succeed.

It is vitally important to use non-greedy matching here (\S+?, as opposed to \S+): in "[One].[Two ]", a greedy \S+ runs through the first "]" up to the next space and captures "One].[Two". I got that wrong in the first version of my answer as well.
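The difference between the two quantifiers shows up on the question's own sample string. This is a small Python sketch (the language is an assumption) that runs both variants side by side:

```python
import re

s = "[ One].[Two ].[ Three ].[Four]"

# Lazy: \S+? stops as soon as the rest of the pattern can match.
lazy = re.findall(r"\[\s*(\S+?)\s*\]", s)
# Greedy: \S+ also consumes "]", ".", and "[" until it hits whitespace.
greedy = re.findall(r"\[\s*(\S+)\s*\]", s)

print(lazy)    # prints ['One', 'Two', 'Three', 'Four']
print(greedy)  # prints ['One].[Two', 'Three', 'Four']
```

The greedy version only goes wrong where two blocks are adjacent with no whitespace before the first "]".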

Further, the \S is very unspecific. If you have anything more specific in terms of what "a word" is for you - by all means, use it.
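For instance, if a word is assumed to consist only of letters, digits, underscores, and hyphens (an assumption made for this sketch, not something the question states), the character class can say so. A Python sketch with a hypothetical WORD constant:

```python
import re

s = "These words [word-1] .. [ word-n ] are well known"

# Assumption: a "word" is letters, digits, underscores and hyphens only.
WORD = r"[\w-]+"
pattern = re.compile(r"\[\s*(" + WORD + r")\s*\]")

words = pattern.findall(s)
print(words)  # prints ['word-1', 'word-n']
```

Because this class cannot match "]", the quantifier can safely be greedy.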


Non-greedy matching is the key. Try the following:

\[\s*(.+?)\s*\]

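It will match anything within brackets and capture it without the whitespace before or after. If the string within the brackets cannot contain spaces, I recommend the following stricter expression. It excludes "]" from the captured class, because a greedy \S+ would also match "]" and could run across adjacent blocks such as "[One].[Two ]".

\[\s*([^\s\]]+)\s*\]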
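The pattern \[\s*(.+?)\s*\] differs from the stricter one in that it keeps spaces inside the captured text. A short Python sketch (language and sample input are assumptions) shows this:

```python
import re

# Hypothetical input: the first block contains an inner space.
s = "[ New York ].[Two ]"

# .+? keeps inner spaces; the surrounding \s* absorbs the padding.
words = re.findall(r"\[\s*(.+?)\s*\]", s)
print(words)  # prints ['New York', 'Two']
```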


A simple solution is to use capturing groups to get the part of the match you really want:

\[\s*(.*?)\s*\]

Example:

// Requires: using System.Linq; using System.Text.RegularExpressions;
MatchCollection matches = Regex.Matches(s, @"\[\s*(.*?)\s*\]");
string[] words = matches.Cast<Match>().Select(m => m.Groups[1].Value).ToArray();

A similar option is to capture everything between the brackets and use Trim:

MatchCollection matches = Regex.Matches(s, @"\[([^\]]*)\]");
string[] words = matches.Cast<Match>().Select(m => m.Groups[1].Value.Trim()).ToArray();
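The same capture-then-trim idea carries over to other languages. A rough Python equivalent (an assumption, for comparison only) uses str.strip in place of Trim:

```python
import re

s = "[ One].[Two ].[ Three ].[Four]"

# Grab everything between "[" and the next "]", then strip the padding.
words = [inner.strip() for inner in re.findall(r"\[([^\]]*)\]", s)]
print(words)  # prints ['One', 'Two', 'Three', 'Four']
```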

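If you really want, you can use look-arounds. This relies on a variable-length look-behind, which .NET supports but many other regex engines do not: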

(?<=\[\s*)\S.*?(?=\s*\])

Example:

MatchCollection matches = Regex.Matches(s, @"(?<=\[\s*)\S.*?(?=\s*\])");
string[] words = matches.Cast<Match>().Select(m => m.Value).ToArray();
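The look-around version does not port everywhere. Python's built-in re module, for example, only accepts fixed-width look-behinds and raises an error for (?<=\[\s*). The sketch below (Python chosen as an assumption, for contrast with .NET) tries the look-around pattern and falls back to a capturing group:

```python
import re

s = "[ One].[Two ].[ Three ].[Four]"

try:
    # Python's re module rejects the variable-width look-behind (?<=\[\s*).
    pattern = re.compile(r"(?<=\[\s*)\S.*?(?=\s*\])")
    words = pattern.findall(s)
except re.error:
    # Fallback: consume the brackets and read the word from group 1.
    words = re.findall(r"\[\s*(\S.*?)\s*\]", s)

print(words)  # prints ['One', 'Two', 'Three', 'Four']
```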


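Is regex absolutely necessary? If not, I believe you could just Trim each block to get rid of the brackets, dots, and spaces. Note that Trim only removes characters from the start and end of the string, so input with several blocks would first have to be split into blocks.

char[] chars = new char[] {'[', ']', '.', ' '};
// Strips any of the listed characters from both ends of inputString.
inputString = inputString.Trim(chars);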
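For input containing several blocks, one regex-free variant is to turn every delimiter into a space and split on whitespace. This is a minimal Python sketch (the language is an assumption), and it presumes the words themselves never contain brackets, dots, or spaces:

```python
s = "[ One].[Two ].[ Three ].[Four]"

# Turn every bracket and dot into a space, then split on runs of whitespace.
table = str.maketrans({"[": " ", "]": " ", ".": " "})
words = s.translate(table).split()
print(words)  # prints ['One', 'Two', 'Three', 'Four']
```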