
Is this the job of the lexer?

开发者 https://www.devze.com 2023-03-13 16:24 Source: Internet

Let's say I was lexing a ruby method definition:

def print_greeting(greeting = "hi")  
end

Is it the lexer's job to maintain state and emit relevant tokens, or should it be relatively dumb? Notice in the above example the greeting param has a default value of "hi". In a different context, greeting = "hi" is variable assignment which sets greeting to "hi". Should the lexer emit generic tokens such as IDENTIFIER EQUALS STRING, or should it be context-aware and emit something like PARAM_NAME EQUALS STRING?


I tend to make the lexer as stupid as I possibly can, so I would have it emit the IDENTIFIER EQUALS STRING tokens. At lexical analysis time there is (most of the time) no information available about what the tokens should represent. Putting grammar rules like this in the lexer only pollutes it with (very) complex syntax rules, and handling those is the parser's job.


I think the lexer should be "dumb" and in your case should return something like this: DEF IDENTIFIER OPEN_PARENTHESIS IDENTIFIER EQUALS STRING CLOSE_PARENTHESIS END. The parser should do the validation; there is no reason to split that responsibility between the two.
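A "dumb" lexer of this kind can be sketched in a few lines. This is only an illustration, not how Ruby's own lexer works: the `lex` method and the `KEYWORDS` table are made-up names, the token names are the ones listed above, and only the handful of characters in the example are supported.

```ruby
require "strscan"

# Hypothetical keyword table; everything else word-like is an IDENTIFIER.
KEYWORDS = { "def" => :DEF, "end" => :END }.freeze

# Returns an array of [type, text] pairs. The lexer knows nothing about
# grammar: it cannot tell a parameter default from an assignment.
def lex(source)
  scanner = StringScanner.new(source)
  tokens = []
  until scanner.eos?
    next if scanner.skip(/\s+/) # drop whitespace between tokens

    if (text = scanner.scan(/[A-Za-z_]\w*/))
      tokens << [KEYWORDS.fetch(text, :IDENTIFIER), text]
    elsif (text = scanner.scan(/"[^"]*"/))
      tokens << [:STRING, text]
    elsif scanner.scan(/\(/)
      tokens << [:OPEN_PARENTHESIS, "("]
    elsif scanner.scan(/\)/)
      tokens << [:CLOSE_PARENTHESIS, ")"]
    elsif scanner.scan(/=/)
      tokens << [:EQUALS, "="]
    else
      raise ArgumentError, "unexpected character #{scanner.getch.inspect}"
    end
  end
  tokens
end
```

With this sketch, `lex('greeting = "hi"').map(&:first)` gives `[:IDENTIFIER, :EQUALS, :STRING]`, exactly the same three tokens that appear inside the parameter list of the method definition.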


The distinction between lexical analysis and parsing is an arbitrary one. In many cases you wouldn't want a separate step at all. That said, since performance is usually the most important issue (otherwise parsing would be a mostly trivial task), you need to decide, and probably measure, whether additional processing during lexical analysis is justified. There is no general answer.


I don't work with Ruby, but I do work on compiler and programming language design.

Both approaches work, but in real life it is easier to use generic identifiers for variables, parameters, and reserved words (a "dumb lexer" or "dumb scanner").

Later, you can "cast" those generic identifiers into other tokens, sometimes in the parser itself.
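As a rough sketch of that "cast" happening on the parser side, the method below re-tags the first IDENTIFIER of each parameter as PARAM_NAME once it has recognised a method definition. The method name `tag_parameters`, the `[type, text]` token shape, and the COMMA token are assumptions made for the example.

```ruby
# Re-tag generic tokens once the grammar context is known.
# Tokens are [type, text] pairs with hypothetical type names.
def tag_parameters(tokens)
  header = tokens.first(3).map(&:first)
  # Only a "def name(" prefix introduces a parameter list.
  return tokens unless header == %i[DEF IDENTIFIER OPEN_PARENTHESIS]

  result = tokens.map(&:dup) # leave the caller's tokens untouched
  expecting_param = true
  result.drop(3).each do |token|
    break if token[0] == :CLOSE_PARENTHESIS

    if expecting_param && token[0] == :IDENTIFIER
      token[0] = :PARAM_NAME # the "cast"
      expecting_param = false
    elsif token[0] == :COMMA
      expecting_param = true # the next identifier starts a new parameter
    end
  end
  result
end
```

The same IDENTIFIER EQUALS STRING run is left alone when it is not preceded by `def name(`, so a plain assignment keeps its generic tokens.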

Sometimes the lexer (or scanner), rather than the parser, has a code section that allows it to perform several "semantic" operations, including casting a generic identifier into a keyword, variable, type identifier, or whatever else is needed. The lexer rule detects a generic identifier token but returns a different token to the parser.
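A lexer action of that kind might look roughly like the following. The names `RESERVED_WORDS`, `identifier_action`, and `known_types` are invented for this sketch; a real generated lexer would have its own conventions for rule actions.

```ruby
require "set"

# Hypothetical reserved-word table consulted by the identifier rule.
RESERVED_WORDS = { "def" => :DEF, "end" => :END }.freeze

# Called after the generic identifier rule has matched `text`.
# Decides which token type is actually handed to the parser.
def identifier_action(text, known_types)
  return RESERVED_WORDS[text] if RESERVED_WORDS.key?(text)
  return :TYPE_IDENTIFIER if known_types.include?(text)

  :IDENTIFIER
end
```

Here `known_types` stands in for whatever table of declared type names the lexer has access to; keeping it as a parameter makes the extra state the lexer depends on explicit.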

Another similar, common case is a language or expression syntax that uses "+" and "-" both as binary operators and as unary sign operators.
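One common way to handle this, sketched here under assumed token names, is to let the lexer emit a generic PLUS or MINUS and decide afterwards from the preceding token whether the sign is unary or binary. The `classify_signs` method and the `OPERAND_END` list are illustrative only.

```ruby
# Token types after which a sign must be a binary operator.
# All type names here are hypothetical; tokens are [type, text] pairs.
OPERAND_END = %i[NUMBER IDENTIFIER CLOSE_PARENTHESIS].freeze

def classify_signs(tokens)
  previous_type = nil
  tokens.map do |type, text|
    if %i[PLUS MINUS].include?(type)
      # A sign is binary only when it directly follows a complete operand.
      role = OPERAND_END.include?(previous_type) ? "BINARY" : "UNARY"
      type = :"#{role}_#{type}"
    end
    previous_type = type
    [type, text]
  end
end
```

For the token stream of `-a - 1`, the first MINUS becomes UNARY_MINUS and the second becomes BINARY_MINUS, while the lexer itself never had to know the difference.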
