Big problem with regular expression in Lex (lexical analyzer)

Developer https://www.devze.com 2022-12-25 00:38 Source: Internet
I have some content like this:

    author = "Marjan Mernik  and Viljem Zumer",
    title = "Implementation of multiple attribute grammar inheritance in the tool LISA",
    year = 1999

    author = "Manfred Broy and Martin Wirsing",
    title = "Generalized
             Heterogeneous Algebras and
             Partial Interpretations",
    year = 1983

    author = "Ikuo Nakata and Masataka Sassa",
    title = "L-Attributed LL(1)-Grammars are
             LR-Attributed",
    journal = "Information Processing Letters"

And I need to catch everything between double quotes for title. My first try was this:

^(" "|\t)+"title"" "*=" "*"\"".+"\","

This catches the first example, but not the other two. The others span multiple lines, and that's the problem. I thought about adding \n somewhere to allow multiple lines, like this:

^(" "|\t)+"title"" "*=" "*"\""(.|\n)+"\","

But this doesn't help; instead, it catches everything.

Then I thought: what I want is between double quotes, so what if I catch everything until I find another " followed by ,? That way I would know whether I was at the end of the title, no matter how many lines it spans, like this:

^(" "|\t)+"title"" "*=" "*"\""[^"\""]+","

But this has another problem. The examples above don't show it, but a double quote (") can appear inside the title itself. For instance:

title = "aaaaaaa \"X bbbbbb",

And yes, it will always be preceded by a backslash (\).

Any suggestions to fix this regexp?


The classical regex to match strings in double quotes is:

\"([^\"]|\\.)*\"

In your case, you'll want something like this:

"title"\ *=\ *\"([^\"]|\\.)*\"

PS: IMHO, you're putting too many quotes in your regexes, it's hard to read.
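
One caveat worth checking before relying on that pattern: [^\"] also matches a backslash, so the two alternatives overlap. Because lex prefers the longest match, a string that ends in an escaped backslash (\\ right before the closing quote) could let the match run on to a later quote. Excluding the backslash from the class, as in \"([^"\\]|\\(.|\n))*\", should avoid that; the (.|\n) is there because . does not match a newline in lex. Below is a minimal hand-written C sketch of the same matching logic, which can be handy for checking sample inputs outside of lex. The function name quoted_length is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of lex): if s starts with a double-quoted
 * string, return how many characters it spans, both quotes included.
 * Return 0 if s does not start with a quote or the string never closes.
 * Mirrors the pattern  \"([^"\\]|\\(.|\n))*\"  */
size_t quoted_length(const char *s)
{
    size_t i = 1;

    assert(s != NULL);
    if (s[0] != '"')
        return 0;

    while (s[i] != '\0') {
        if (s[i] == '\\') {
            if (s[i + 1] == '\0')
                return 0;       /* dangling backslash at end of input */
            i += 2;             /* skip the backslash and the escaped char */
        } else if (s[i] == '"') {
            return i + 1;       /* unescaped quote closes the string */
        } else {
            i++;                /* ordinary character, newlines included */
        }
    }
    return 0;                   /* ran out of input before a closing quote */
}
```

Called on the text that starts at the opening quote of a title, it returns the span up to and including the closing quote, so the trailing comma is left out, and escaped quotes or line breaks inside the title do not end the match early.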


You could use start conditions to simplify each separate pattern, for example:

%x title
%%
"title"\ *=\ *\"  { /* mark title start */
  BEGIN(title);
  fputs("found title = <|", yyout);
}

<title>[^"\\]* { /* process title part, use ([^\"]|\\.)* to grab all at once */
  ECHO;
}

<title>\\. { /* process escapes inside title */
  char c = *(yytext + 1);
  fputc(c, yyout); /* double escaped characters */
  fputc(c, yyout);
}

<title>\" { /* mark end of title */
  fputs("|>", yyout);
  BEGIN(0); /* continue as usual */
}

To make an executable:

$ flex parse_ini.y
$ gcc -o parse_ini lex.yy.c -lfl

Run it:

$ ./parse_ini < input.txt 

Where input.txt is:

author = "Marjan\" Mernik  and Viljem Zumer",
title = "Imp\"lementation of multiple...",
year = 1999

Output:

author = "Marjan\" Mernik  and Viljem Zumer",
found title = <|Imp""lementation of multiple...|>,
year = 1999

It replaced the '"' around the title with '<|' and '|>'. Also, '\"' is replaced by '""' inside the title.
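
To make the role of the start condition more concrete, here is a rough C sketch of the same scanner written as an explicit two-state machine. It is not what flex generates, only an illustration of the idea: INITIAL echoes everything until it sees title = ", and IN_TITLE handles escapes and the closing quote. The names rewrite_titles, title_prefix, and emit are invented for this sketch, and output goes to a caller-supplied buffer instead of yyout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum scan_state { INITIAL, IN_TITLE };      /* IN_TITLE plays the role of <title> */

struct outbuf { char *buf; size_t len, cap; };

/* Append n bytes to the output; silently drops data that does not fit
 * (acceptable for a sketch, not for production code). */
static void emit(struct outbuf *o, const char *s, size_t n)
{
    if (o->len + n >= o->cap)
        return;
    memcpy(o->buf + o->len, s, n);
    o->len += n;
    o->buf[o->len] = '\0';
}

/* Length of a match for  "title"\ *=\ *\"  at p, or 0 if there is none. */
static size_t title_prefix(const char *p)
{
    size_t i = 5;

    if (strncmp(p, "title", 5) != 0)
        return 0;
    while (p[i] == ' ') i++;
    if (p[i] != '=')
        return 0;
    i++;
    while (p[i] == ' ') i++;
    return p[i] == '"' ? i + 1 : 0;
}

/* Copy in to out, rewriting each title the way the flex rules above do. */
void rewrite_titles(const char *in, char *out, size_t cap)
{
    struct outbuf o = { out, 0, cap };
    enum scan_state st = INITIAL;
    size_t n;

    assert(in != NULL && out != NULL && cap > 0);
    out[0] = '\0';

    while (*in != '\0') {
        if (st == INITIAL && (n = title_prefix(in)) > 0) {
            emit(&o, "found title = <|", strlen("found title = <|"));
            st = IN_TITLE;                  /* like BEGIN(title) */
            in += n;
        } else if (st == IN_TITLE && in[0] == '\\'
                   && in[1] != '\0' && in[1] != '\n') {
            emit(&o, in + 1, 1);            /* escaped character, written twice */
            emit(&o, in + 1, 1);
            in += 2;
        } else if (st == IN_TITLE && in[0] == '"') {
            emit(&o, "|>", 2);
            st = INITIAL;                   /* like BEGIN(0) */
            in += 1;
        } else {
            emit(&o, in, 1);                /* default action: echo */
            in += 1;
        }
    }
}
```

Feeding it the contents of input.txt should reproduce the output shown above. Because the title state never treats a newline specially, multi-line titles like the ones in the question go through the same path.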
