Regular expression matching any subset of a given set?

Developer (https://www.devze.com), 2023-03-30 20:17. Source: the web
Is it possible to write a regular expression which will match any subset of a given set of characters a1 ... an ?

That is, it should match any string in which each of these characters appears at most once, no other characters appear, and the relative order of the characters doesn't matter.

Some approaches that come to mind immediately:

1. [a1a2...an]* or (a1|a2|...|an)* - this allows a character to appear more than once.

2. (a1?a2?...an?) - no repeated characters, but the relative order matters: this matches any subsequence, not any subset.

3. ($|a1|...|an|a1a2|a2a1|...|a1...an|...|an...a1), i.e. enumerate every ordering of every subset (just hardcode all matching strings :)). Of course, this is not acceptable.

I also suspect that it may be theoretically impossible, because while parsing the string we need to remember which characters we have already met, and as far as I know regular expressions can recognize only right-linear (regular) languages.
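
To make the shortcomings of approaches 1 and 2 concrete, here is a minimal Python sketch. It assumes the set is {a, b, c} as a stand-in for a1 ... an; the helper name accepts is made up for this illustration.

```python
import re

# Approach 1: a starred character class, which permits repeats.
any_count = re.compile(r"[abc]*")
# Approach 2: each character optional, but only in one fixed order.
fixed_order = re.compile(r"a?b?c?")

def accepts(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None

for text in ["", "ac", "ca", "aa", "abd"]:
    print(repr(text), accepts(any_count, text), accepts(fixed_order, text))
```

Approach 1 wrongly accepts "aa", and approach 2 wrongly rejects "ca", which is why neither is sufficient on its own.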

Any help will be appreciated. Thanks in advance.


This doesn't really qualify for the language-agnostic tag, but...

^(?:(?!\1)a1()|(?!\2)a2()|...|(?!\n)an())*$

see a demo on ideone.com

The first time an element is matched, it gets "checked off" by the capturing group following it. Because the group has now participated in the match, a negative lookahead for its corresponding backreference (e.g., (?!\1)) will never match again, even though the group only captured an empty string. This is an undocumented feature that is nevertheless supported in many flavors, including Java, .NET, Perl, Python, and Ruby.

This solution also requires support for forward references (i.e., a reference to a given capturing group (\1) appearing in the regex before the group itself). This seems to be a little less widely supported than the empty-groups gimmick.


I can't think how to do it with a single regex, but here is one way to do it with n regexes (I will use 1, 2, ..., m, n for your a's):

^[23..n]*1?[23..n]*$
^[13..n]*2?[13..n]*$
...
^[12..m]*n?[12..m]*$

If all of the above match, your string is a subset of 12..mn (the full set included).

How this works: each line requires the string to consist exactly of:

  • any number of characters drawn from the set, except a particular one
  • perhaps that particular one
  • any number of characters drawn from the set, except a particular one

If this passes when every element in turn is considered as the particular one, we know:

  • there is nothing else in the string except the allowed elements
  • there is at most one of each of the allowed elements

as required.
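
The n-regex idea above could be sketched in Python roughly as follows. This is only an illustration: the function names build_patterns and is_subset_string are invented here, and the code assumes the allowed set is a non-empty string of distinct single characters.

```python
import re

def build_patterns(allowed):
    """Build one pattern per allowed character:
    others*, then this character at most once, then others*."""
    patterns = []
    for ch in allowed:
        others = "".join(re.escape(c) for c in allowed if c != ch)
        # With a single-character set there are no "others" at all.
        rest = "[" + others + "]*" if others else ""
        patterns.append(re.compile(rest + re.escape(ch) + "?" + rest))
    return patterns

def is_subset_string(text, allowed):
    """True only if every per-character pattern matches the whole string.
    Assumes `allowed` is non-empty and has no duplicate characters."""
    return all(p.fullmatch(text) for p in build_patterns(allowed))

for text in ["", "ba", "cab", "aba", "abd"]:
    print(repr(text), is_subset_string(text, "abc"))
```

Using fullmatch plays the role of the ^ and $ anchors in the patterns above.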


For completeness, I should say that I would only do this if I were under orders to "use regex"; otherwise I'd track which allowed elements have been seen and iterate over the characters of the string, doing the obvious thing.


Not sure you can get an extended regex to do that, but it's pretty easy to do with a simple traversal of your string.

You use a hash (or an array, or whatever) to record whether each of your allowed characters has already been seen in the string. Then you simply iterate over the elements of your string. If you encounter an element not in your allowed set, you bail out. If it's allowed, but you've already seen it, you bail out too.

In pseudo-code:

foreach char a in {a1, ..., an}
   hit[a] = false

foreach char c in string
   if c not in {a1, ..., an} => fail
   if hit[c] => fail
   hit[c] = true
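
A runnable Python version of this traversal might look like the sketch below. The function name uses_each_at_most_once is hypothetical, and a set of seen characters stands in for the hit table.

```python
def uses_each_at_most_once(text, allowed):
    """True if `text` contains only characters from `allowed`,
    each of them at most once, in any order."""
    allowed = set(allowed)
    seen = set()  # plays the role of the hit[] table
    for ch in text:
        if ch not in allowed:
            return False  # character outside the allowed set
        if ch in seen:
            return False  # allowed, but already encountered
        seen.add(ch)
    return True

for text in ["", "ba", "cab", "aba", "abd"]:
    print(repr(text), uses_each_at_most_once(text, "abc"))
```

This runs in a single pass over the string and does not depend on any regex flavor.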


Similar to Alan Moore's, using only \1, and doesn't refer to a capturing group before it has been seen:

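#!/usr/bin/perl
use strict;
use warnings;

# Each block is one character from [abc] that does not occur again later on.
my $re = qr/^(?:([abc])(?!.*\1))*$/;
foreach (qw(ba pabc abac a cc cba abcd abbbbc), '') {
    print "'$_' ", ($_ =~ $re) ? "matches" : "does not match", " \$re \n";
}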

We match any number of blocks (the outer (?:)), where each block must consist of "precisely one character from our preferred set, which is not followed by a string containing that character".

If the string might contain newlines or other funny stuff, it might be necessary to play with some flags to make ^, $ and . behave as intended, but this all depends on the particular RE flavor.
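
For readers not using Perl, the same pattern appears to carry over to Python's re module unchanged. The sketch below is only an illustration: subset_regex is a made-up helper that builds the character class from a non-empty string of allowed characters, and it uses fullmatch plus re.DOTALL instead of ^, $ and flag tweaking.

```python
import re

def subset_regex(allowed):
    """Compile a pattern matching strings whose characters all come from
    `allowed` (assumed non-empty), with no character used twice."""
    char_class = "[" + "".join(re.escape(c) for c in allowed) + "]"
    # DOTALL lets '.' in the lookahead step over newlines as well.
    return re.compile(r"(?:(" + char_class + r")(?!.*\1))*", re.DOTALL)

pattern = subset_regex("abc")
for text in ["ba", "pabc", "abac", "a", "cc", "cba", "abcd", "abbbbc", ""]:
    verdict = "matches" if pattern.fullmatch(text) else "does not match"
    print(repr(text), verdict)
```

Here fullmatch requires the entire string to be consumed, so a trailing newline is not silently accepted the way it could be with $.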

Just for silliness, one can use a positive look-ahead assertion to effectively AND two regexes, so we can test for any permutation of abc by asserting that the above matches, followed by an ordinary check for 'is N characters long and consists of these characters':

my $re2 = qr/^(?=$re)[abc]{3}$/;
foreach (qw(ba pabc abac a cc abcd abbbbc abc acb bac bca cab cba), '') {
    print "'$_' ", ($_ =~ $re2) ? "matches" : "does not match", " \$re2 \n";
}
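
The same look-ahead trick can be sketched in Python. The helper name permutation_regex is invented for this example, and the code assumes a non-empty string of distinct allowed characters; \Z is used inside the look-ahead so that only the true end of the string counts.

```python
import re

def permutation_regex(allowed):
    """Compile a pattern matching exactly the permutations of `allowed`
    (assumed to be a non-empty string of distinct characters)."""
    char_class = "[" + "".join(re.escape(c) for c in allowed) + "]"
    # Condition 1 (look-ahead): no allowed character occurs twice.
    no_repeats = r"(?:(" + char_class + r")(?!.*\1))*\Z"
    # Condition 2: exactly len(allowed) characters from the class.
    exact_length = char_class + "{%d}" % len(allowed)
    return re.compile(r"(?=" + no_repeats + r")" + exact_length, re.DOTALL)

pattern = permutation_regex("abc")
for text in ["ba", "abac", "abcd", "abc", "acb", "bca", "cba", ""]:
    verdict = "matches" if pattern.fullmatch(text) else "does not match"
    print(repr(text), verdict)
```

Because both conditions must hold at once, only strings that use every allowed character exactly once get through.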