开发者

Implementing "*?" (lazy "*") regexp pattern in combinatorial GLR parsers

开发者 https://www.devze.com 2023-01-28 09:02 出处:网络
I have implemented combinatorial GLR parsers. Among them there are: char(·) parser which consumes specified character or range of characters.

I have implemented combinatorial GLR parsers. Among them there are:

  • char(·) parser which consumes specified character or range of characters.
  • many(·) combinator which repeats specified parser from zero to infinite times.

Example: "char('a').many()" will match a string with any number of "a"-s.

But many(·) combinator is greedy, so, for example, char('{') >> char('{') >> char('a'..'z').many() >> char('}') >> char('}') (where ">>" is sequential chaining of parsers) will successfully consume the whole "{{foo}}some{{bar}}" string.

I want to implement the lazy version of many(·) which, being used in previous example, will consume "{{foo}}" only. How can I do that?

Edit:

May be I confused ya all. In my program a parser is a function (or "functor" in terms of C++) which accepts a "step" and returns forest of "steps". A "step" may be of OK type (that means that parser has consumed part of input successfully) and FAIL type (that means the parser has encountered error). There are more types of steps but they are auxiliary.

Parser = f(Step) -> Collection of TreeNodes of Steps.

So when I parse input, I:

  • Compose simple predefined Parser functions to get complex Parser function representing required grammar.

  • Fo开发者_高级运维rm initial Step from the input.

  • Give the initial Step to the complex Parser function.

  • Filter TreeNodes with Steps, leaving only OK ones (or with minimum FAIL-s if there were errors in input).

  • Gather information from Steps which were left.


I have implemented and have been using GLR parsers for 15 years as language front ends for a program transformation system.

I don't know what a "combinatorial GLR parser" is, and I'm unfamiliar with your notation so I'm not quite sure how to interpret it. I assume this is some kind of curried function notation? I'm imagining your combinator rules are equivalent to definining a grammer in terms of terminal characters, where "char('a').many" corresponds to grammar rules:

 char = "a" ;
 char = char "a" ;

GLR parsers, indeed, produce all possible parses. The key insight to GLR parsing is its psuedo-parallel processing of all possible parses. If your "combinators" can propose multiple parses (that is, they produce grammar rules sort of equivalent to the above), and you indeed have them connected to a GLR parser, they will all get tried, and only those sequences of productions that tile the text will survive (meaning all valid parsess, e.g., ambiguous parses) will survive.

If you have indeed implemented a GLR parser, this collection of all possible parses should have been extremely clear to you. The fact that it is not hints what you have implemented is not a GLR parser.

Error recovery with a GLR parser is possible, just as with any other parsing technology. What we do is keep the set of live parses before the point of the error; when an error is found, we try (in psuedo-parallel, the GLR parsing machinery makes this easy if it it bent properly) all the following: a) deleting the offending token, b) inserting all tokens that essentially are FOLLOW(x) where x is live parse. In essence, delete the token, or insert one expected by a live parse. We then turn the GLR parser loose again. Only the valid parses (e.g., repairs) will survive. If the current token cannot be processed, the parser processing the stream with the token deleted survives. In the worst case, the GLR parser error recovery ends up throwing away all tokens to EOF. A serious downside to this is the GLR parser's running time grows pretty radically while parsing errors; if there are many in one place, the error recovery time can go through the roof.


Won't a GLR parser produce all possible parses of the input? Then resolving the ambiguity is a matter of picking the parse you prefer. To do that, I suppose the elements of the parse forest need to be labeled according to what kind of combinator produced them, eager or lazy. (You can't resolve the ambiguity incrementally before you've seen all the input, in general.)

(This answer based on my dim memory and vague possible misunderstanding of GLR parsing. Hopefully someone expert will come by.)


Consider the regular expression <.*?> and the input <a>bc<d>ef. This should find <a>, and no other matches, right?

Now consider the regular expression <.*?>e with the same input. This should find <a>bc<d>e, right?

This poses a dilemma. For the user's sake, we want the behavior of the combinator >> to be understood in terms of its two operands. Yet there is no way to produce the second parser's behavior in terms of what the first one finds.

One answer is for each parser to produce a sequence of all parses, ordered by preference, rather than the unordered set of all parsers. Greedy matching would return matches sorted longest to shortest; non-greedy, shortest to longest.


Non-greedy functionality is nothing more than a disambiguation mechanism. If you truly have a generalized parser (which does not require disambiguation to produce its results), then "non-greedy" is meaningless; the same results will be returned whether or not an operator is "non-greedy".

Non-greedy disambiguation behavior could be applied to the complete set of results provided by a generalized parser. Working left-to-right, filter the ambiguous sub-groups corresponding to a non-greedy operator to use the shortest match which still led to a successful parse of the remaining input.

0

精彩评论

暂无评论...
验证码 换一张
取 消