I'm currently writing my own parser for a fictional Assembly language. The instructions are very similar to any normal assembly instruction:
[INSTRUCTION] [OP]*
where op can be 0-3 operands. I want to be able to use an expression that matches this. This is being written in C++ with boost::regex. I myself am a regexp noobie, trying to understand the boost documentation of what each symbol does.
开发者_如何学PythonNow, I already have an expression that can match 0-3 operands like so:
Sample Instructions:
MOVI 8 10
ADDI 8 8 10
NOP
BNEZI -1
Expression: ^([a-z]+)( ([-,0-9]+))*
However, I can't create a suitable expression that handles the same instructions when comma-delimited:
Sample Instructions:
MOVI 8, 10
ADDI 8, 8, 10
This is really tripping me up. I tried rewriting my expression like so:
^([a-z]+)( ([-,0-9]+))*(, ([-,0-9]+))*
This looks to be extremely green, poor regexp. It also isn't working correctly. I was thinking of using a recursive expression, but I looked at the documentation and I might as well scribble "overkill" on my forehead.
I realize I could just format the line to take out all the commas, but I would rather like to be able to write and understand a regexp expression first, then do it the easy way. Any help would be appreciated.
A string like:
ADDI 8, 8, 10
can be matched by a regex like this:
[a-zA-Z]+[ \t]+-?[0-9]+([ \t]*,[ \t]*-?[0-9]+)*
A (short) explanation:
[a-zA-Z]+ # match an instruction
[ \t]+ # match one or more spaces or tabs
-?[0-9]+ # match an integer value with an option minus sign in front of it
( # open group 1
[ \t]* # match zero or more spaces or tabs
, # match a comma
[ \t]* # match zero or more spaces or tabs
-?[0-9]+ # match an integer value with an option minus sign in front of it
)* # close group 1, and repeat it zero or more times
Having said all that, I must agree with dmckee's comment: a proper parser is the way to go, even if this is just a fictional language you're parsing.
try
^([a-z]+)( ([-,0-9]+)((,|\s)[-,0-9]+)*)*
There is a bison grammar as part of HLA assembler source, which you can get here
http://webster.cs.ucr.edu/AsmTools/HLA/frozen.html
I suggest using a proper grammar :)
If your serious about doing this, you have to account for every single line, all boundry conditions, errors, etc. Its not enough just to match a certian form, while not knowing anything about other forms.
If using regex, a cleaner way is to reserve a fixed number of buffers that can be analyzed. This is fairly complex for a rudimentary first level syntax checker.
This is a start:
/
^
(?!\s*$)
\s*
(?|
([a-zA-Z]+)
\s*
((?<=\s)
-?\d+|) (?!,\s*$) (?:,\s*|\s*$)
(-?\d+|) (?!,\s*$) (?:,\s*|\s*$)
(-?\d+|) \s*$
()
|
()()()()(.+)
)
$
/x;
A perl test case, using newline delimeted lines:
use strict;
use warnings;
my $rx = qr/
^
(?!\s*$)
\s*
(?|
([a-zA-Z]+)
\s*
((?<=\s)
-?\d+|) (?!,\s*$) (?:,\s*|\s*$)
(-?\d+|) (?!,\s*$) (?:,\s*|\s*$)
(-?\d+|) \s*$
()
|
()()()()(.+)
)
$/x;
my $cnt = 0; # line counter
while ( my $line = <DATA> )
{
++$cnt;
if ( $line =~ /$rx/ )
{
if (length $5) {
print "\nSyntax error ? (line $cnt) '$5'\n";
}
else {
print "\nInstruction: '$1'\n";
print " op1 = '$2'\n";
print " op2 = '$3'\n";
print " op3 = '$4'\n";
}
}
}
__DATA__
MOVI 8 10
ADDI 8 8 10
NOP
BNEZI -1
InstA 0
InstB 1,
InstC 2,3, 4
InstD 5,6, 7, 8
MOVI 7, 8
ADDI 9, 10, 11
ADDI 12, 13 14
Output:
Syntax error ? (line 2) 'MOVI 8 10'
Syntax error ? (line 3) 'ADDI 8 8 10'
Instruction: 'NOP'
op1 = ''
op2 = ''
op3 = ''
Instruction: 'BNEZI'
op1 = '-1'
op2 = ''
op3 = ''
Instruction: 'InstA'
op1 = '0'
op2 = ''
op3 = ''
Syntax error ? (line 8) 'InstB 1,'
Instruction: 'InstC'
op1 = '2'
op2 = '3'
op3 = '4'
Syntax error ? (line 10) 'InstD 5,6, 7, 8'
Instruction: 'MOVI'
op1 = '7'
op2 = '8'
op3 = ''
Instruction: 'ADDI'
op1 = '9'
op2 = '10'
op3 = '11'
Syntax error ? (line 14) 'ADDI 12, 13 14'
Alternative, all the data is in a string (same regex).
while ( $str =~ /$rx/mg ) { }
you ca try this :
^[\t ]*(?:([.A-Za-z0-9_] )[:])?(?:[\t ]*([A-Za-z]{2,4})(?:[\t ] (\[([A-Za-z0-9_] (([- ])[0-9] )?)\]|\". ?\"|\'. ?\'|[.A-Za-z0-9_] )(?:[\t ]*[,][\t ]*(\[([A-Za-z0-9_] (([- ])[0-9] )?)\]|\". ?\"|\'. ?\'|[.A-Za-z0-9_] ))?)?)?
can match label, OP, params
精彩评论