
REGEX (PCRE) matching only if zero or once

Developer https://www.devze.com 2023-01-24 09:51 Source: Internet

I have the following problem.

Let's take the input (wikitext)

======hello((my first program)) world======

I want to match "hello", "my first program" and " world" (notice the space).

But for the input:

======hello(my first program)) world======

I want to match "hello(my first program" and " world".

In other words, I want to match any letters and spaces, plus any symbol that appears singly (not doubled or repeated more times).

This should be done with the Unicode character properties like \p{L}, \p{S} or \p{Z}, as documented here.

Any ideas?

Addendum 1

The regex just has to stop before any doubled symbol or punctuation in Unicode terms, that is, before any \p{S}{2,} or \p{P}{2,}.

I'm not trying to parse the whole wikitext with this; read my question carefully. The regex I'm looking for is for the lexer I'm working on, and making it match such inputs will simplify my parser enormously.

Addendum 2

The pattern must work with preg_match(). I can imagine how I'd have to split it first. Perhaps it would use some lookahead, I don't know, I've tried everything that I could imagine.

Using only preg_match() is a requirement set in stone by the current implementation of the lexer. It must be that way, because that is naturally how lexers work: they match sequences in the input stream.


return preg_split('/([\pS\pP])\\1+/', $theString);

Result: http://www.ideone.com/YcbIf

(You need to get rid of the empty strings manually.)
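As a sketch of the split approach above: preg_split() accepts the PREG_SPLIT_NO_EMPTY flag, which drops the empty strings for you. The function name splitOnRepeatedSymbols is hypothetical, and the /u modifier is an addition that assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Splits on any run of the same symbol or punctuation character repeated
// two or more times, and drops empty pieces via PREG_SPLIT_NO_EMPTY.
function splitOnRepeatedSymbols(string $theString): array
{
    // /u is an assumption: it treats pattern and subject as UTF-8.
    $parts = preg_split('/([\pS\pP])\\1+/u', $theString, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure (for example, invalid UTF-8 with /u).
    return $parts === false ? [] : $parts;
}

// splitOnRepeatedSymbols('======hello((my first program)) world======')
//   gives ['hello', 'my first program', ' world']
// splitOnRepeatedSymbols('======hello(my first program)) world======')
//   gives ['hello(my first program', ' world']
```

Because PREG_SPLIT_DELIM_CAPTURE is not set, the captured delimiter character is not included in the result.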


Edit: as a preg_match regex:

'/(?:^|([\pS\pP])\\1+)((?:[^\pS\pP]|([\pS\pP])(?!\\3))*)/'

Take the second capture group when it is matched. Example: http://www.ideone.com/ErTVA

But you could just consume ([\pS\pP])\\1+ and discard it or, if that doesn't match, consume (?:[^\pS\pP]|([\pS\pP])(?!\\3))* and record it, since your lexer is going to use more than one regex anyway.
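A minimal sketch of that two-regex idea, assuming a lexer loop that calls preg_match() at a moving byte offset. The function name lexSegments, the \G anchors, the /u modifier and the use of + instead of * (so the text rule never matches empty) are assumptions added for illustration. The backreference is \1 here rather than \3 because each pattern stands alone.

```php
<?php
// Hypothetical lexer loop, not the original poster's implementation.
function lexSegments(string $input): array
{
    // A run of one symbol/punctuation character repeated at least twice.
    $delimiter = '/\G([\pS\pP])\\1+/u';
    // Non-symbols, or a symbol not immediately followed by itself.
    $text = '/\G(?:[^\pS\pP]|([\pS\pP])(?!\\1))+/u';

    $segments = [];
    $offset = 0;
    $length = strlen($input);

    while ($offset < $length) {
        if (preg_match($delimiter, $input, $m, 0, $offset) === 1) {
            // Delimiter run: consume and discard.
        } elseif (preg_match($text, $input, $m, 0, $offset) === 1) {
            // Text run: consume and record.
            $segments[] = $m[0];
        } else {
            // Neither rule matched (for example, invalid UTF-8): stop.
            break;
        }
        // Offsets are in bytes, so advance by the byte length of the match.
        $offset += strlen($m[0]);
    }

    return $segments;
}

// lexSegments('======hello((my first program)) world======')
//   gives ['hello', 'my first program', ' world']
// lexSegments('======hello(my first program)) world======')
//   gives ['hello(my first program', ' world']
```

In a real lexer the delimiter branch would probably emit a token rather than discard the run; it is dropped here only to mirror the suggestion above.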


Regular expressions are notoriously overused and ill-suited for parsing languages like this. You can get away with it for a little while, but eventually you will find something that breaks your parser, requiring tweak after tweak and a huge library of unit tests to ensure compliance.

You should seriously consider writing a proper lexer and parser instead.

