Information Sources on Token Parsing Patterns

Developer https://www.devze.com 2023-02-11 06:48 Source: the web

To make a long story short, it looks as if I am going to be responsible for rewriting a text parsing engine where I work.

So, much as you would imagine: a block of text comes in with custom tags in it. Some are simple one-off replacements, some are blocks with content, some nest, and some take argument/value pairs.

While I have been coding for years and would call myself a mid-level regex user, I am the first to admit that hardcore text parsing is not my forte. And this needs to be fast, so optimization is a concern.

I am looking for information sources on patterns and commentary for this kind of parsing. I'm willing to read over anything that any of you offer. I need to educate myself before I even begin contemplating how to tackle this.

Thanks so much, in advance.

If this gets more complex than what you can do with a simple state machine that one person can easily understand, I would suggest using a tool to generate the tokenizer: flex, JFlex, etc.
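For a sense of what the "simple state machine" end of that range looks like, here is a minimal Python sketch. The bracket tag syntax ([b]...[/b]) and the names Token and tokenize are assumptions made up for the example; the real engine's tag format is not described in the question.

```python
from typing import Iterator, Tuple

# (kind, value) where kind is "TEXT", "OPEN", or "CLOSE".
Token = Tuple[str, str]


def tokenize(source: str) -> Iterator[Token]:
    """Split source into text and tag tokens using a two-state machine.

    Hypothetical syntax: [name ...] opens a tag, [/name] closes it.
    """
    state = "text"  # either "text" or "tag"
    buf = []        # characters collected for the current token
    for ch in source:
        if state == "text":
            if ch == "[":
                if buf:
                    yield ("TEXT", "".join(buf))
                buf = []
                state = "tag"
            else:
                buf.append(ch)
        else:  # state == "tag"
            if ch == "]":
                body = "".join(buf)
                if body.startswith("/"):
                    yield ("CLOSE", body[1:])
                else:
                    yield ("OPEN", body)
                buf = []
                state = "text"
            else:
                buf.append(ch)
    if state == "tag":
        raise ValueError("unterminated tag: [" + "".join(buf))
    if buf:
        yield ("TEXT", "".join(buf))


tokens = list(tokenize("Hi [b]there[/b]!"))
```

This only produces a flat token stream. It does not check that tags are balanced and does not split arguments out of the tag body; that is the parser's job. Once escaping, quoting, or several tag styles come into play, the number of states grows quickly, which is the point at which a generated tokenizer starts to pay off.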

You can also write a hand-crafted top-down parser if speed is a very big concern, or you can use a parser generator (ANTLR, for example). A hand-crafted parser is usually faster but has the potential to create some nasty corner cases :). You will need a good set of test cases for it.

I do recommend that you start from here: Parsing on Wikipedia. Look at recursive descent parsing (it is easier to write by hand and easy to follow if your language is not very complex).
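To make recursive descent concrete, here is a small Python sketch for a made-up bracket syntax such as [b]...[/b] with optional key=value arguments. The syntax, the Tag and Parser names, and the rule that every tag must be closed are all assumptions for illustration. The grammar it implements is roughly: nodes := (TEXT | OPEN nodes CLOSE)*.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Union

# Tokenizing is regular, so a regex is fine here: [name args] or [/name].
TOKEN_RE = re.compile(r"\[(/?)([^\[\]]*)\]")


@dataclass
class Tag:
    name: str
    args: Dict[str, str]
    children: List[Union["Tag", str]] = field(default_factory=list)


class Parser:
    def __init__(self, source: str):
        self.tokens = self._lex(source)
        self.pos = 0

    @staticmethod
    def _lex(source):
        tokens, last = [], 0
        for m in TOKEN_RE.finditer(source):
            if m.start() > last:
                tokens.append(("TEXT", source[last:m.start()]))
            kind = "CLOSE" if m.group(1) else "OPEN"
            tokens.append((kind, m.group(2).strip()))
            last = m.end()
        if last < len(source):
            tokens.append(("TEXT", source[last:]))
        return tokens

    def parse(self):
        return self._nodes(until=None)

    def _nodes(self, until):
        """Parse nodes until the closing tag named `until` (or end of input)."""
        nodes = []
        while self.pos < len(self.tokens):
            kind, value = self.tokens[self.pos]
            if kind == "CLOSE":
                if value != until:
                    raise ValueError(f"unexpected [/{value}]")
                return nodes  # the caller consumes the closing tag
            self.pos += 1
            if kind == "TEXT":
                nodes.append(value)
            else:
                nodes.append(self._tag(value))
        if until is not None:
            raise ValueError(f"missing [/{until}]")
        return nodes

    def _tag(self, body):
        name, *pairs = body.split()
        args = dict(p.split("=", 1) for p in pairs)
        children = self._nodes(until=name)  # recursion handles nesting
        self.pos += 1  # consume the matching CLOSE token
        return Tag(name, args, children)


tree = Parser("a[b]x[i size=2]y[/i][/b]").parse()
```

Each grammar rule becomes one method, and nesting falls out of _tag calling _nodes, which may call _tag again. One-off tags with no closing tag, quoted argument values, and error recovery are left out; those are the corner cases that need the good set of test cases.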

Well, first off, regular expressions cannot be used to parse arbitrarily nested structures. You'll have to write a parser. There are plenty of tools available to help you out, from the venerable yacc to ANTLR and many more. Check out the Wikipedia page.
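A quick way to see the nesting problem is to try it. The sketch below uses Python's re module and a made-up [b]...[/b] tag (an assumption for the example). Neither the lazy nor the greedy pattern pairs tags correctly, because the pattern has no way to count how many tags are open.

```python
import re

LAZY = re.compile(r"\[b\](.*?)\[/b\]", re.DOTALL)
GREEDY = re.compile(r"\[b\](.*)\[/b\]", re.DOTALL)

# Lazy matching stops at the first [/b], which belongs to the inner tag.
nested = "[b]outer [b]inner[/b] tail[/b]"
lazy_body = LAZY.search(nested).group(1)
print(lazy_body)    # outer [b]inner

# Greedy matching runs to the last [/b], merging two separate tags.
siblings = "[b]one[/b] and [b]two[/b]"
greedy_body = GREEDY.search(siblings).group(1)
print(greedy_body)  # one[/b] and [b]two
```

Some regex engines offer recursive extensions that can match balanced constructs, but Python's re module does not, and such patterns tend to be hard to maintain compared with a small parser.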

Use Perl 6 Rules. They are grammars folded into the language, and fairly powerful. They have not been called regular expressions since Perl 5.10, even though they look like regular expressions. Now they are an integral part of the language; code and regexes are indistinguishable.

http://tripatlas.com/Perl_6_rules
http://www.programmersheaven.com/2/Perl6-FAQ-Regex

You can also use the Marpa parser, which will give you the benefits of general, practical BNF parsing.

Absolutely do not attempt to use regexes for this. Use a parser. If the text is XML, there will be lots of parsers available in your favourite language. If it's not XML, then you will have to write your own custom parser.
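If the input does turn out to be well-formed XML, an off-the-shelf parser handles nesting and attributes for you. Here is a minimal sketch using Python's standard xml.etree.ElementTree; the sample document and the walk helper are made up for the example.

```python
import xml.etree.ElementTree as ET

doc = '<p>Hello <b weight="700">bold <i>nested</i></b> world</p>'
root = ET.fromstring(doc)


def walk(elem, depth=0):
    """Return an indented outline of element names, one entry per element."""
    lines = ["  " * depth + elem.tag]
    for child in elem:
        lines.extend(walk(child, depth + 1))
    return lines


outline = walk(root)
bold_attrs = root[0].attrib  # argument/value pairs arrive as a dict
```

Here outline comes back as ["p", "  b", "    i"] and bold_attrs as {"weight": "700"}. Note that ElementTree raises an error on input that is not well-formed, so this route only works if the custom tags really follow XML rules.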