Regex implementation with event driven matches?

This may sound a little odd, but it would be extremely useful to me. Are there any regex implementations (in any language, but preferably Java, JavaScript, C, or C++) that use an event-based model for matches?

I would like to be able to register a bunch of different regular expressions I am looking for in a string via an event-based model, feed the string through the regex engine, and just have the events fired off correctly. Does anything like this exist?

I realize this is bordering on the territory of a heavy-duty lexer/parser, but I would prefer to stay away from that if at all possible, as my search expressions would need to be completely dynamic.

Thanks

This is very easy to do in Perl regular expressions. All you do is insert your event callouts, written as (?{ ... }) code blocks, at the appropriate points in the pattern in the most straightforward manner imaginable.

First, imagine a pattern for pulling out decimal numbers from a string:

my $rx0 = qr/[+-]?(?:\d+(?:\.\d*)?|\.\d+)/;

Let’s expand that out so we can insert our callouts:

my $rx1 = qr{
    [+-] ?
    (?: \d+
        (?: \. \d* ) ?
      |
        \. \d+
    )
}x;

For callouts, I’ll just print some debugging, but you could do anything you want:

my $rx2 = qr{
    (?: [+-]                (?{ say "\tleading sign"                })
    ) ?
    (?: \d+                 (?{ say "\tinteger part"                })
        (?: \.              (?{ say "\tinternal decimal point"      })
            \d*             (?{ say "\toptional fractional part"    })
        ) ?
      |
        \.                  (?{ say "\tleading decimal point"       })
        \d+                 (?{ say "\trequired fractional part"    })
    )                       (?{ say "\tsuccess"                     })
}x;

Here’s the whole demo:

use 5.010;
use strict;

use utf8;

my $rx0 = qr/[+-]?(?:\d+(?:\.\d*)?|\.\d+)/;

my $rx1 = qr{
    [+-] ?
    (?: \d+
        (?: \. \d* ) ?
      |
        \. \d+
    )
}x;

my $rx2 = qr{
    (?: [+-]                (?{ say "\tleading sign"                })
    ) ?
    (?: \d+                 (?{ say "\tinteger part"                })
        (?: \.              (?{ say "\tinternal decimal point"      })
            \d*             (?{ say "\toptional fractional part"    })
        ) ?
      |
        \.                  (?{ say "\tleading decimal point"       })
        \d+                 (?{ say "\trequired fractional part"    })
    )                       (?{ say "\tsuccess"                     })
}x;

my $string = <<'END_OF_STRING';

    The Earth’s temperature varies between
    -89.2°C and 57.8°C, with a mean of 14°C.

    There are .25 quarts in 1 gallon.

    +10°F is -12.2°C.

END_OF_STRING

while ($string =~ /$rx2/gp) {
    print "Number: ${^MATCH}\n";
}

which when run produces this:

        leading sign
        integer part
        internal decimal point
        optional fractional part
        success
Number: -89.2
        integer part
        internal decimal point
        optional fractional part
        success
Number: 57.8
        integer part
        success
Number: 14
        leading decimal point
        leading decimal point
        required fractional part
        success
Number: .25
        integer part
        success
Number: 1
        leading decimal point
        leading sign
        integer part
        success
Number: +10
        leading sign
        integer part
        internal decimal point
        optional fractional part
        success
Number: -12.2
        leading decimal point

You may want to arrange a more grammatical regular expression for maintainability. This also helps for when you want to make a recursive descent parser out of it. (Yes, of course you can do that: this is Perl, after all. :)

Look at the last solution in this answer for what I mean by grammatical regexes. I also have larger examples elsewhere here on SO.

But it sounds like you should look at the Regexp::Grammars module by Damian Conway, which was built for just this sort of thing. This question talks about it, and has a link to the module proper.
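
Since the question prefers Java, here is a rough approximation of the callouts above, offered as a sketch rather than an equivalent. As far as I know, java.util.regex has no embedded code blocks, so this version gives each part of the number pattern a named group and inspects the groups after each match. The class name NumberParts and the group names are made up for illustration. Unlike the Perl callouts, which also fire during attempts that later fail, this only reports the parts of matches that succeeded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reports which parts of a decimal number took part in each match.
final class NumberParts {
    // Same grammar as the Perl pattern, with one named group per "callout" position.
    private static final Pattern NUMBER = Pattern.compile(
        "(?<sign>[+-])?"
        + "(?:(?<integer>\\d+)(?:(?<point>\\.)(?<fraction>\\d*))?"
        + "|(?<leadingPoint>\\.)(?<requiredFraction>\\d+))");

    private static final String[] PARTS = {
        "sign", "integer", "point", "fraction", "leadingPoint", "requiredFraction"
    };

    /** Returns one line per match: the matched text, then the groups that participated. */
    static List<String> describe(String input) {
        List<String> events = new ArrayList<>();
        Matcher m = NUMBER.matcher(input);
        while (m.find()) {
            StringBuilder line = new StringBuilder(m.group()).append(':');
            for (String part : PARTS) {
                // group(name) is null when that group did not take part in the match
                if (m.group(part) != null) {
                    line.append(' ').append(part);
                }
            }
            events.add(line.toString());
        }
        return events;
    }
}
```

Calling NumberParts.describe("-89.2 and .25") yields two lines, "-89.2: sign integer point fraction" and ".25: leadingPoint requiredFraction". Replacing the string building with listener calls would turn this into an event source.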

You might want to check out PIRE, a very fast automata-based regexp engine tuned to match zillions of lines of text against many regular expressions quickly. It's written in C++ and has some bindings.
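
If a native library is not an option, scanning once for many expressions can be loosely approximated in plain Java. The sketch below is not PIRE's API and does not use an automaton. It joins the registered expressions into one alternation, wraps each in a named group, and dispatches on whichever group matched. CombinedScanner and the event names are hypothetical. It assumes event names are letters and digits, and that the registered expressions use no numbered backreferences and no group names that clash with an event name. Only the first alternative that matches at a given position is reported, so overlapping matches from different expressions are lost.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical single-pass dispatcher built on java.util.regex (not thread-safe).
final class CombinedScanner {
    private final Map<String, String> sources = new LinkedHashMap<>();             // event name -> regex source
    private final Map<String, Consumer<String>> handlers = new LinkedHashMap<>();  // event name -> callback
    private Pattern combined;  // rebuilt lazily after each registration

    void register(String eventName, String regex, Consumer<String> handler) {
        Pattern.compile(regex);  // fail fast if the expression itself is invalid
        sources.put(eventName, regex);
        handlers.put(eventName, handler);
        combined = null;
    }

    void scan(String input) {
        if (sources.isEmpty()) {
            return;
        }
        if (combined == null) {
            StringBuilder sb = new StringBuilder();
            for (Map.Entry<String, String> e : sources.entrySet()) {
                if (sb.length() > 0) {
                    sb.append('|');
                }
                // Each expression becomes (?<eventName>regex)
                sb.append("(?<").append(e.getKey()).append('>').append(e.getValue()).append(')');
            }
            combined = Pattern.compile(sb.toString());
        }
        Matcher m = combined.matcher(input);
        while (m.find()) {
            for (Map.Entry<String, Consumer<String>> e : handlers.entrySet()) {
                String text = m.group(e.getKey());
                if (text != null) {  // this alternative produced the match
                    e.getValue().accept(text);
                    break;
                }
            }
        }
    }
}
```

Because expressions can be added at any time and the combined pattern is simply rebuilt, this stays compatible with fully dynamic search expressions, at the cost of a recompile after each registration.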

It's really not something that's too hard to put together yourself if you can't find any existing library.

Something like this:

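import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexNotifier {
   // Listeners registered for each pattern.
   private final Map<Pattern, List<RegexListener>> listeners = new HashMap<Pattern, List<RegexListener>>();

   public synchronized void register(Pattern pattern, RegexListener listener) {
      List<RegexListener> list = listeners.get(pattern);
      if (list == null) {
         list = new ArrayList<RegexListener>();
         listeners.put(pattern, list);
      }
      list.add(listener);
   }

   // Synchronized so the map cannot change while it is being iterated.
   public synchronized void process(String input) {
      for (Entry<Pattern, List<RegexListener>> entry : listeners.entrySet()) {
         // find() reports every occurrence inside the input;
         // matches() would only succeed if the whole input matched.
         Matcher matcher = entry.getKey().matcher(input);
         while (matcher.find()) {
            for (RegexListener listener : entry.getValue()) {
               listener.stringMatched(matcher.group(), entry.getKey());
            }
         }
      }
   }
}

interface RegexListener {
   void stringMatched(String matched, Pattern pattern);
}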

The only shortcoming I see with this is that Pattern doesn't override hashCode() and equals(), so equal patterns held in different instances get separate map entries and the input is scanned once for each of them. Pattern.compile() does not cache compiled patterns, so reuse the same Pattern instance when registering several listeners for one expression.
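
One possible way around the identity-based hashing, sketched under assumptions: wrap each Pattern in a small key object that compares the regex source and the compile flags. PatternKey is a hypothetical name, not part of the JDK. The comparison is purely textual, so "a|b" and "b|a" still count as different patterns.

```java
import java.util.regex.Pattern;

// Hypothetical map key: two keys are equal when regex source and flags are equal.
final class PatternKey {
    private final Pattern pattern;

    PatternKey(Pattern pattern) {
        this.pattern = pattern;
    }

    static PatternKey of(String regex, int flags) {
        return new PatternKey(Pattern.compile(regex, flags));
    }

    Pattern pattern() {
        return pattern;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof PatternKey)) {
            return false;
        }
        PatternKey that = (PatternKey) other;
        // Pattern.pattern() returns the source string, Pattern.flags() the compile flags.
        return pattern.pattern().equals(that.pattern.pattern())
            && pattern.flags() == that.pattern.flags();
    }

    @Override
    public int hashCode() {
        return 31 * pattern.pattern().hashCode() + pattern.flags();
    }
}
```

A listener map declared as Map<PatternKey, List<RegexListener>> would then collapse duplicate registrations of the same expression into one entry, so the input is scanned once per distinct expression.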