ANTLR Parser Question

Developer https://www.devze.com 2022-12-20 23:49 Source: Internet
I'm trying to parse a number of text records where elements in a record are separated by a '+' char, and where the entire record is terminated by a '#' char. For example E1+E2+E3+E4+E5+E6#

Individual elements can be required or optional. If an element is optional, its value is simply missing. For example, if E2 were missing, the input string would be: E1++E3+E4+E5+E6#.

When dealing with empty trailing elements, however, the separator char ('+') may be missing as well. If, for example, the last 3 elements were missing, the string could be: E1+E2+E3#, but it could also be: E1+E2+E3+++#

I have tried the following rule in ANTLR:

'R1' 'E1 + E2 + E3' '+'? 'E4'? '+'? 'E5'? '+'? 'E6'? '#'

but ANTLR complains that it's ambiguous, which of course is correct (every token following E3 could be E4, E5 or E6). The input syntax is fixed (it's from a legacy mainframe system), so I was wondering if anybody has a solution to this problem?

An alternative would be to specify all the different permutations in the rule, but that would be a major task.

Best regards and thanks,

Michael


That task sounds like overkill for ANTLR. Is there any reason you're not just splitting the string into an array using '+' as the separator?

If it's coming from a mainframe, it most likely was intended to be processed in a trivial way.

e.g.,
C++ : http://www.cplusplus.com/reference/clibrary/cstring/strtok/
PHP : http://us3.php.net/manual/en/function.explode.php
Java: http://java.sun.com/javase/6/docs/api/java/lang/String.html#split%28java.lang.String%29
C# : http://msdn.microsoft.com/en-us/library/system.string.split%28VS.71%29.aspx
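
For what it's worth, the splitting approach could look something like the sketch below, written in Python rather than any of the languages linked above. The function name `parse_record`, the six-element width, and the `required_indices` parameter are all assumptions made for illustration; the question does not say which elements are mandatory.

```python
def parse_record(record, total_fields=6, required_indices=()):
    """Split a '+'-separated, '#'-terminated record into a fixed-width list.

    Missing elements come back as None. `total_fields` and
    `required_indices` are illustrative assumptions, not part of the
    original format description.
    """
    if not record.endswith("#"):
        raise ValueError("record must end with '#'")

    # Drop the terminator, then split on the separator.
    fields = record[:-1].split("+")
    if len(fields) > total_fields:
        raise ValueError(f"expected at most {total_fields} elements")

    # Trailing separators may be omitted, so pad up to the full width.
    fields += [""] * (total_fields - len(fields))

    # An empty string means the optional element was left out.
    values = [field if field else None for field in fields]

    missing = [i for i in required_indices if values[i] is None]
    if missing:
        raise ValueError(f"required elements missing at indices {missing}")
    return values
```

With this, `E1+E2+E3#` and `E1+E2+E3+++#` both come out as the same six-slot list, so it no longer matters whether the trailing separators were written.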

Just a thought.


If this is ambiguous, it's likely because your Es all have the same format. (A more complicated case would be that your Es all start with the same k characters, where k is your lookahead, but I'm going to assume that's not the case. If it is, this will still work; it will just require an extra step.)

So it looks like you can have up to 6 Es and up to 5 +s. We'll say a "segment" is an optional E followed by a '+'. You can then have up to 5 segments, followed by an optional trailing E.

This grammar can be represented roughly like this (imperfect ANTLR syntax since I'm not very familiar with it):

r     : (e_opt? PLUS){1,5} e_opt? END ;  // {1,5} is pseudo-notation for "1 to 5 times"
e_opt : E ;                              // whatever your E is
PLUS  : '+' ;
END   : '#' ;

If ANTLR doesn't support anything like {1,5} then this is the same as:

(e_opt? PLUS) ((e_opt? PLUS) ((e_opt? PLUS) ((e_opt? PLUS) (e_opt? PLUS)?)?)?)?

which is not that clean, so maybe there is a nicer way to do it.
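
As a rough sanity check of the segment idea, here is a sketch using Python's `re` module rather than ANTLR. It assumes, purely for illustration, that an E is a run of letters and digits; `nested_optional` and `is_record` are hypothetical helpers, the first of which builds the nested-optional expansion so it can be compared with the `{1,5}` form.

```python
import re

# Assumption: an element is a run of letters/digits. Substitute the real format.
E = r"[A-Za-z0-9]+"
# A "segment" is an optional E followed by '+'.
SEGMENT = rf"(?:(?:{E})?\+)"
# Optional trailing E, then the record terminator.
TAIL = rf"(?:{E})?#"


def nested_optional(unit, max_count):
    """Build unit(unit(unit...)?)?, which matches 1..max_count repetitions."""
    expr = unit
    for _ in range(max_count - 1):
        expr = f"{unit}(?:{expr})?"
    return expr


# Bounded-repetition form and its nested-optional equivalent.
BOUNDED = re.compile(rf"{SEGMENT}{{1,5}}{TAIL}")
NESTED = re.compile(nested_optional(SEGMENT, 5) + TAIL)


def is_record(text, pattern=BOUNDED):
    """Return True if the whole string has the segment structure."""
    return pattern.fullmatch(text) is not None
```

Note that a regex engine resolves the choice between "another segment" and "the trailing E" by backtracking, so this only illustrates the shape of the language, not how ANTLR would predict alternatives.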
