How do lexical analyzers handle comment and escape sequences?

Developer https://www.devze.com 2023-02-15 15:29 Source: Internet

Comments and escape sequences (such as those in string literals) are very exceptional compared with regular symbolic representation.

It's hard for me to understand how regular lexical analyzers tokenize them. How do lexical analyzers like lex or flex handle this kind of symbol? Is there a generic method, or is it handled case by case for each language?

I think the latter is true: it is handled case by case for each language.
If a comment starter appears inside a string literal, the lexer has to ignore it. Similarly, in C, if an escaped double quote \" appears inside a string literal, the lexer must not treat it as the end of the string.
For this purpose, flex has start conditions, which enable context-dependent analysis.
For instance, the flex texinfo manual has an example for scanning C comments (between /* and */):

/* IN_COMMENT must be declared as an exclusive start condition */
%x IN_COMMENT
%%
<INITIAL>"/*"   BEGIN(IN_COMMENT);
<IN_COMMENT>{
"*/"            BEGIN(INITIAL);
[^*\n]+         /* eat comment in chunks */
"*"             /* eat the lone star */
\n              yylineno++;
}
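The same idea can be sketched outside flex as an explicit state machine. The Python below is only an illustration, not what flex generates: the function name strip_c_comments and the extra IN_STRING state are my own assumptions. It ignores character literals and line counting. It shows why the lexer needs context: /* inside a string literal is copied through, while /* in the initial state starts a comment.

```python
def strip_c_comments(source):
    """Drop /* ... */ comments, but leave string literals untouched.

    The three states play the role of flex start conditions
    (hypothetical names, chosen for this sketch).
    """
    state = "INITIAL"
    out = []
    i = 0
    while i < len(source):
        ch = source[i]
        two = source[i:i + 2]
        if state == "INITIAL":
            if two == "/*":
                state = "IN_COMMENT"      # like BEGIN(IN_COMMENT)
                i += 2
                continue
            if ch == '"':
                state = "IN_STRING"       # opening quote is kept
            out.append(ch)
            i += 1
        elif state == "IN_COMMENT":
            if two == "*/":
                state = "INITIAL"         # like BEGIN(INITIAL)
                i += 2
            else:
                i += 1                    # eat the comment body
        else:  # IN_STRING
            if ch == "\\" and i + 1 < len(source):
                out.append(two)           # keep the escape pair, e.g. \"
                i += 2
                continue
            if ch == '"':
                state = "INITIAL"         # closing quote ends the string
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(strip_c_comments('x = "a /* b */"; /* c */ y'))
```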

Start conditions also enable string literal analysis. The flex texinfo manual has an example of matching C-style quoted strings with start conditions in the section Start Conditions, and an FAQ item titled How do I expand backslash-escape sequences in C-style quoted strings?
That will probably answer your question about string literals directly.
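As a rough idea of what such escape expansion involves, here is a hand-written Python sketch. It is not the code from the flex manual. The function name read_string_literal and the ESCAPES table are hypothetical, and only a handful of escapes are handled.

```python
# Escapes this sketch understands (an assumption; real C has more).
ESCAPES = {"\\": "\\", '"': '"', "n": "\n", "r": "\r", "t": "\t"}


def read_string_literal(text, start=0):
    """Scan a double-quoted literal beginning at text[start].

    Returns (expanded_value, index_just_past_closing_quote).
    """
    if text[start] != '"':
        raise ValueError("expected opening quote")
    chars = []
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == '"':                     # unescaped quote closes the string
            return "".join(chars), i + 1
        if ch == "\\":
            if i + 1 >= len(text):
                break                     # backslash at end of input
            nxt = text[i + 1]
            if nxt not in ESCAPES:
                raise ValueError(f"unknown escape \\{nxt}")
            chars.append(ESCAPES[nxt])    # expand the two-character escape
            i += 2
        else:
            chars.append(ch)
            i += 1
    raise ValueError("unterminated string literal")
```

The escaped quote \" is consumed as a pair, so it never reaches the check that closes the string.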

"Comments and escape sequences (such as those in string literals) are very exceptional compared with regular symbolic representation."

I’m not sure what you mean, but this statement is certainly wrong. Both comments (unless they may be nested) and strings with escape sequences admit a simple regular language description.

For example, an escape sequence allowing \\, \", \n and \r can be described by the following regular grammar (with start symbol E):

E -> \ S
S -> \
S -> "
S -> n
S -> r
…

And a string is just an opening quote, a repetition of zero or more unescaped symbols or escape sequences, and a closing quote. The repetition is a Kleene closure over the union of two regular languages, which is itself regular.
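To make this concrete, here is a sketch as Python regular expressions. It assumes only the four escapes listed above and that a raw newline is not allowed inside a string (an assumption borrowed from C). The names ESCAPE, PLAIN, STRING and COMMENT are made up for illustration. The comment pattern covers non-nested /* ... */ comments only.

```python
import re

ESCAPE = r'\\[\\"nr]'     # E -> \ S with S -> \ | " | n | r
PLAIN = r'[^"\\\n]'       # any symbol except quote, backslash, newline

# Opening quote, (plain symbol | escape sequence)*, closing quote.
STRING = re.compile(rf'"(?:{PLAIN}|{ESCAPE})*"')

# Non-nested C comment: "/*", then text that never completes "*/" early.
COMMENT = re.compile(r'/\*[^*]*\*+(?:[^/*][^*]*\*+)*/')

if __name__ == "__main__":
    print(bool(STRING.fullmatch(r'"say \"hi\"\n"')))
    print(bool(COMMENT.fullmatch("/* a * b */")))
```

Nested comments need a counter and so fall outside what a single regular expression like this can describe.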

I can't say anything about lex, but my own language uses C++-style // comments. Since my lexer has already split the input into lines (it's a Python-inspired language), I just have a regex that matches // followed by any number of any characters.
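A minimal sketch of that approach in Python. It is not my actual lexer, and the names LINE_COMMENT and strip_line_comments are invented for this example. Note that a pattern this simple would also cut off a // that appears inside a string literal, so string literals would have to be matched first in a real lexer.

```python
import re

# "//" followed by any characters up to the end of the line.
LINE_COMMENT = re.compile(r"//.*")


def strip_line_comments(source):
    """Split the input into lines and drop a trailing // comment from each."""
    return [LINE_COMMENT.sub("", line).rstrip() for line in source.splitlines()]


if __name__ == "__main__":
    for line in strip_line_comments("x = 1 // one\n// only a comment\ny = 2"):
        print(repr(line))
```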