开发者

Writing a tokenizer, where to begin?

开发者 https://www.devze.com 2023-03-03 06:21 出处:网络
I\'m trying 开发者_StackOverflow社区to write a tokenizer for CSS in C++, but I have no idea how to write a tokenizer. I know that it should be greedy, reading as much input as possible, for each token

I'm trying 开发者_StackOverflow社区to write a tokenizer for CSS in C++, but I have no idea how to write a tokenizer. I know that it should be greedy, reading as much input as possible, for each token, and in theory I know how I could put that in code.

I have looked at Boost.Tokenizer, and it seems nice, but it doesn't help me whatsoever. It sure is a nice wrapper for a tokenizer, but the problem lies in writing the token splitter, the TokenizerFunction in Boost terms.

I have no idea how to write this tokenizer, are there any "neat" ways of doing it, like something that closely resembles the syntax itself?

Please note, I'm not looking for a parser! My application doesn't need to be able to understand CSS, just read a CSS file to a general internal tokenized format, process some things and output again.


Writing a "correct" lexer and/or parser is more difficult than you might think. And it can get ugly when you start dealing with weird corner cases.

My best suggestion is to invest some time in learning a proper lexer/parser system. CSS should be a fairly easy language to implement, and then you will have acquired an amazingly powerful tool you can use for all sorts of future projects.

I'm an Old Fart® and I use lex/yacc (or things that use the same syntax) for this type of project. I first learned to use them back in the early 80's and it has returned the effort to learn them many, many times over.

BTW, if you have anything approaching a BNF of the language, lex/yacc can be laughably easy to work with.


Boost.Spirit.Qi would be my first choice.

Spirit.Qi is designed to be a practical parsing tool. The ability to generate a fully-working parser from a formal EBNF specification inlined in C++ significantly reduces development time. Programmers typically approach parsing using ad hoc hacks with primitive tools such as scanf. Even regular-expression libraries (such as boost regex) or scanners (such as Boost tokenizer) do not scale well when we need to write more elaborate parsers. Attempting to write even a moderately-complex parser using these tools leads to code that is hard to understand and maintain.

The Qi tutorials even finish by implementing a parser for an XMLish language; writing a grammar for CSS should be considerably easier.

0

精彩评论

暂无评论...
验证码 换一张
取 消