Most efficient method to parse small, specific arguments

I have a command-line application that needs to support arguments of the following form:

  1. all: return everything
  2. search: return the first match to search
  3. all*search: return everything matching search
  4. X*search: return the first X matches to search
  5. search#Y: return the Yth match to search

Where search can be either a single keyword or a space-separated list of keywords delimited by single quotes. Keywords are sequences of one or more letters and digits - nothing else.

A few examples might be:

  1. 2*foo
  2. bar#8
  3. all*'foo bar'

This sounds just complex enough that flex/bison come to mind - but the application will have to parse strings like this very frequently, and I feel that a fully-fledged parser would incur entirely too much overhead. There's no counting or nesting involved, so the language is regular.

What would you recommend? A long series of string ops? A few beefy subpattern-capturing regular expressions? Is there actually a plausible argument for a "real" parser?

It might be useful to note that the syntax for this pseudo-grammar is not subject to change, so if the code turns out less-than-wonderfully-maintainable, I won't cry. This is all in C++, if that makes a difference.

Thanks!


I wouldn't recommend a full lex/yacc parser just for this. What you described can be matched by a simple regular expression:

 ((all|[0-9]+)\*)?('[A-Za-z0-9\t ]*'|[A-Za-z0-9]+)(#[0-9]+)?

If you have a regex engine that supports captures, it's easy to extract the individual pieces of information you need (most probably in captures 2, 3, and 4).

If I understood what you mean, you will probably want to check that capture 2 and capture 4 are not both non-empty, since the grammar never combines a '*' prefix with a '#Y' suffix.

If you need to further split the search terms, you could do it in a subsequent step, parsing capture 3.
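For reference, here is that pattern wired into C++11's <regex> as a minimal sketch; group numbering follows the standard left-to-right rule, so the count lands in capture 2, the search term in capture 3, and the '#Y' suffix in capture 4:

    #include <iostream>
    #include <regex>
    #include <string>

    int main() {
        // Captures: 2 = "all" or a count, 3 = the search term(s), 4 = "#Y".
        std::regex pattern(
            R"(((all|[0-9]+)\*)?('[A-Za-z0-9\t ]*'|[A-Za-z0-9]+)(#[0-9]+)?)");

        std::string input = "2*foo";
        std::smatch m;
        if (std::regex_match(input, m, pattern)) {
            std::cout << "count:  " << m[2] << "\n"   // "2"
                      << "search: " << m[3] << "\n"   // "foo"
                      << "index:  " << m[4] << "\n";  // "" (absent)
        }
        return 0;
    }

You'd still strip the quotes from capture 3 and the '#' from capture 4 afterwards, but those are trivial string ops.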

Even without regex, I would hand-write a function. It would be simpler than dealing with lex/yacc, and I suspect you could put together something even more efficient than a regular expression.
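For what it's worth, a hand-rolled version might look something like the sketch below; the Query struct and its conventions (count 0 meaning "all", index 0 meaning "no #Y") are my own invention for illustration, not anything mandated by the question:

    #include <cctype>
    #include <optional>
    #include <string>

    // Hypothetical result type: count == 0 encodes "all",
    // index == 0 encodes "no #Y suffix".
    struct Query {
        int count = 1;       // how many matches to return; 0 means all of them
        std::string search;  // keyword(s), quotes stripped; empty means "everything"
        int index = 0;       // return the Yth match, or 0 if no '#Y' was given
    };

    static bool allDigits(const std::string& s) {
        if (s.empty()) return false;
        for (char c : s)
            if (!std::isdigit(static_cast<unsigned char>(c))) return false;
        return true;
    }

    // Returns std::nullopt on malformed input.
    std::optional<Query> parse(std::string s) {
        Query q;

        if (s == "all") { q.count = 0; return q; }  // form 1: "all"

        // Optional "all*" / "X*" prefix (forms 3 and 4).
        size_t star = s.find('*');
        if (star != std::string::npos) {
            std::string head = s.substr(0, star);
            if (head == "all")        q.count = 0;
            else if (allDigits(head)) q.count = std::stoi(head);
            else                      return std::nullopt;
            s.erase(0, star + 1);
        }

        // Optional "#Y" suffix (form 5); the grammar never combines it with '*'.
        size_t hash = s.find('#');
        if (hash != std::string::npos) {
            if (star != std::string::npos) return std::nullopt;
            std::string tail = s.substr(hash + 1);
            if (!allDigits(tail)) return std::nullopt;
            q.index = std::stoi(tail);
            s.erase(hash);
        }

        // What remains is a bare keyword or a quoted keyword list.
        if (s.size() >= 2 && s.front() == '\'' && s.back() == '\'')
            s = s.substr(1, s.size() - 2);
        if (s.empty()) return std::nullopt;
        q.search = s;
        return q;
    }

It skips validating that the keywords contain only letters and digits; add a character check over q.search if you need that.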


The answer mostly depends on the balance between how much coding you want to do and how many libraries you want to depend on. If your application can depend on other libraries, you can use any of the many regular expression libraries - e.g., POSIX regex, which ships with all Linux/Unix flavors.

OR

If you just want those specific syntaxes, I would use the string tokenizer (strtok) - split on '*' and split on '#' - then handle each case.
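A minimal sketch of that idea (note that strtok modifies its argument, so the input must sit in a writable buffer):

    #include <cstdio>
    #include <cstring>

    int main() {
        char input[] = "bar#8";  // strtok needs a writable buffer

        if (std::strchr(input, '*')) {
            // "all*search" or "X*search"
            char* head   = std::strtok(input, "*");
            char* search = std::strtok(nullptr, "*");
            std::printf("prefix=%s search=%s\n", head, search);
        } else if (std::strchr(input, '#')) {
            // "search#Y"
            char* search = std::strtok(input, "#");
            char* index  = std::strtok(nullptr, "#");
            std::printf("search=%s index=%s\n", search, index);
        } else {
            // bare "all" or a lone search term
            std::printf("search=%s\n", input);
        }
        return 0;
    }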


In this case the strtok approach would be much better, since the number of argument forms to be parsed is small.
