Haskell: Parsing escape characters in single quotes

Developer (https://www.devze.com), 2022-12-20 09:21. Source: web

I'm currently making a scanner for a basic compiler I'm writing in Haskell. One of the requirements is that any character enclosed in single quotes (') is translated into a character literal token (type T_Char), and this includes escape sequences such as '\n' and '\t'. I've defined this part of the scanner function which works okay for most cases:

scanner ('\'':cs)
    | length cs == 0            = error "Illegal character!"
    | head cs == '\\'           = mkEscape (head (drop 1 cs)) : scanner (drop 3 cs)
    | head (drop 1 cs) == '\''  = T_Char (head cs) : scanner (drop 2 cs)
    where
        mkEscape        :: Char -> Token
        mkEscape 'n'    = T_Char '\n'
        mkEscape 'r'    = T_Char '\r'
        mkEscape 't'    = T_Char '\t'
        mkEscape '\\'   = T_Char '\\'
        mkEscape '\''   = T_Char '\''

However, this comes up when I run it in GHCi:

Main> scanner "abc '\\' def"
[T_Id "abc", T_Char '\'', T_Id "def"]

It can recognise everything else but gets escaped backslashes confused with escaped single quotes. Is this something to do with character encodings?

I don't think there's anything wrong with the scanner as far as this problem goes. Haskell processes escape sequences in string literals itself, so the literal "abc '\\' def" denotes the string

abc '\' def

in which \\ has already collapsed to a single backslash. So when the scanner reaches the first single quote, cs holds the character sequence \' def. head cs is a backslash, so the second guard fires and mkEscape is called.
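
To see what the scanner actually receives, it can help to print the string rather than read the literal. The following is a minimal sketch, not part of the original scanner: the names input and cs are introduced here for illustration, with cs computed the same way the ('\'':cs) pattern would bind it at the first single quote.

```haskell
-- The literal from the GHCi session; the compiler resolves \\ to one backslash.
input :: String
input = "abc '\\' def"

-- Illustrative only: the remainder after the first single quote,
-- i.e. what the scanner clause would bind to cs.
cs :: String
cs = drop 1 (dropWhile (/= '\'') input)

main :: IO ()
main = do
  putStrLn input        -- prints the raw characters: abc '\' def
  print (length input)  -- 11 characters, only one of them a backslash
  putStrLn cs           -- prints: \' def
  print (head cs)       -- '\\' (Show re-escapes the single backslash)
  print (cs !! 1)       -- '\''
```

Note that print and GHCi's default output go through show, which re-escapes special characters, while putStrLn writes the characters as they are.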

The argument it receives is head (drop 1 cs), which is ', so mkEscape returns T_Char '\'', which is exactly what you saw.
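
The same reasoning can be checked with a small standalone sketch. It assumes a cut-down Token type and two hypothetical helpers, scanCharLit and unescape, which are not the asker's functions. It uses pattern matching instead of head and drop, and it reports a malformed literal as Nothing rather than guessing. The point it illustrates is that the Haskell source needs four backslashes for the scanner to see an escaped backslash.

```haskell
-- Cut-down token type, assumed for this sketch only.
data Token = T_Char Char | T_Id String
  deriving (Show, Eq)

-- Hypothetical helper: map the character after a backslash to its meaning.
unescape :: Char -> Maybe Char
unescape 'n'  = Just '\n'
unescape 'r'  = Just '\r'
unescape 't'  = Just '\t'
unescape '\\' = Just '\\'
unescape '\'' = Just '\''
unescape _    = Nothing

-- Hypothetical helper: scan one character literal at the start of the
-- input, returning the token and the unconsumed remainder.
scanCharLit :: String -> Maybe (Token, String)
scanCharLit ('\'' : '\\' : e : '\'' : rest) =
  fmap (\c -> (T_Char c, rest)) (unescape e)
scanCharLit ('\'' : c : '\'' : rest)
  | c /= '\'' && c /= '\\' = Just (T_Char c, rest)
scanCharLit _ = Nothing

main :: IO ()
main = do
  -- Four backslashes in the source: the scanner sees '\\' def
  print (scanCharLit "'\\\\' def")
  -- Two backslashes in the source: the scanner sees '\' def,
  -- which this sketch treats as an unterminated literal
  print (scanCharLit "'\\' def")
```

With this stricter matching the two-backslash input is rejected, whereas the original guards accept it as an escaped single quote and skip three characters.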

Perhaps you should call