Lua pattern matching for extracting hard-coded strings in a code base

Developer https://www.devze.com 2023-03-17 11:05 Source: Internet
I'm working with a C++ code base. Right now I have a C++ program that calls a Lua script to scan the entire code base and, hopefully, return a list of all the strings used in the program.

The strings in question are always wrapped in a JUCE macro called TRANS. Here are some examples from which a string should be extracted:

TRANS("Normal")
TRANS ( "With spaces" )
TRANS("")
TRANS("multiple"" ""quotations")
TRANS(")")
TRANS("spans \
multiple \
lines")

I'm sure you can imagine other string variants that could occur in a large code base. I'm building a tool that generates JUCE translation files, to automate the process as much as possible.

This is how far I've gotten with pattern matching to find these strings. I read the source code into a Lua string:

path = ...

--Open file and read source into string
file = io.open(path, "r")
str = file:read("*all")

and called

for word in string.gmatch(str, 'TRANS%s*%b()') do print(word) end

which finds text that starts with TRANS, followed by balanced parentheses. This gets me the full macro call, including the brackets, and from there I figured it would be easy to trim off what I don't need and keep just the string value.

However, this doesn't work for strings that unbalance the parentheses, e.g. TRANS(")") returns TRANS(") instead of TRANS(")").
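The imbalance is easy to reproduce in isolation. A minimal sketch, where the variable name src and its sample contents are made up for illustration:

```lua
-- Hypothetical sample input standing in for the contents of a source file.
local src = 'TRANS("Normal") TRANS(")")'

-- Collect every match of the balanced-parentheses pattern.
local found = {}
for word in string.gmatch(src, 'TRANS%s*%b()') do
  found[#found + 1] = word
end

for _, word in ipairs(found) do
  print(word)
end
```

The second match comes back as TRANS(") because %b() counts parentheses without knowing that the first ) sits inside a string literal.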

I revised my pattern to

for word in string.gmatch(str, 'TRANS%s*(%s*%b""%s*') do print(word) end

The idea is that the pattern starts with TRANS, then zero or more spaces, then a ( character followed by zero or more spaces. Once inside the brackets there should be a balanced pair of " marks, followed by zero or more spaces, and finally a ). Unfortunately, this does not return a single value when used. And even if it worked as I expected, there can be a \" inside the string, which would unbalance the quotes.

Any advice on extracting these strings? Should I keep looking for a pattern that works, or should I write a direct algorithm? Do you know why my second pattern returned no strings? Any other advice is welcome. I'm not looking to cover 100% of all possibilities, but being close to 100% would be awesome. Thanks! :D


I love Lua patterns as much as anyone, but you're bringing a knife to a gun fight. This is one of those problems where you really don't want to code the solution as regular expressions. To deal correctly with double-quote marks and backslash escapes, you want a real parser, and LPeg will manage your needs nicely.
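LPeg is a separate library that has to be installed. As a dependency-free illustration of the same idea (walk the literal character by character instead of leaning on one pattern), here is a minimal sketch of a hand-written scanner in plain Lua. It is not LPeg and not a full C++ lexer. The names read_literal and extract_trans are made up for this example, escapes are kept as written rather than decoded, and comments, raw string literals and identifiers that merely end in TRANS are not handled.

```lua
-- Read one string literal whose opening quote is at index i.
-- Returns the raw contents (escapes kept as written) and the index just
-- past the closing quote, or nil if the literal never closes.
local function read_literal(src, i)
  local parts = {}
  local j = i + 1
  while j <= #src do
    local c = src:sub(j, j)
    if c == '\\' then
      local nxt = src:sub(j + 1, j + 1)
      -- Backslash-newline is a line continuation: drop both characters.
      -- Any other escape (such as \") is kept verbatim.
      if nxt ~= '\n' then
        parts[#parts + 1] = c .. nxt
      end
      j = j + 2
    elseif c == '"' then
      return table.concat(parts), j + 1
    else
      parts[#parts + 1] = c
      j = j + 1
    end
  end
  return nil
end

-- Collect the text of every TRANS(...) call whose argument is one or more
-- adjacent string literals. Adjacent literals are concatenated.
local function extract_trans(src)
  local results = {}
  local pos = 1
  while true do
    local _, e = src:find('TRANS%s*%(%s*', pos)
    if not e then break end
    pos = e + 1
    local pieces = {}
    while src:sub(pos, pos) == '"' do
      local text, nxt = read_literal(src, pos)
      if not text then break end
      pieces[#pieces + 1] = text
      -- Skip whitespace between adjacent literals or before the ")".
      pos = src:find('%S', nxt) or (#src + 1)
    end
    if #pieces > 0 and src:sub(pos, pos) == ')' then
      results[#results + 1] = table.concat(pieces)
    end
  end
  return results
end

-- Hypothetical sample input; long brackets keep the backslashes literal.
local sample = [[
TRANS("multiple"" ""quotations")
TRANS(")")
TRANS("say \"hi\"")
]]

for _, text in ipairs(extract_trans(sample)) do
  print(text)
end
```

Keeping the escapes undecoded is a deliberate simplification; whether they should be decoded depends on what the translation file is expected to contain.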


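In the second case, you forgot to escape the parentheses. In a Lua pattern an unescaped ( opens a capture, so a literal parenthesis has to be written as %( or %). Try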

for word in string.gmatch(str, 'TRANS%s*%(%s*(%b"")%s*%)') do print(word) end
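To see what this pattern does and does not catch, here is a small check. The sample input src is made up for illustration.

```lua
local pattern = 'TRANS%s*%(%s*(%b"")%s*%)'

-- Hypothetical sample input; long brackets keep the backslashes literal.
local src = [[
TRANS("Normal")
TRANS ( "With spaces" )
TRANS(")")
TRANS("say \"hi\"")
]]

-- The capture around %b"" means gmatch yields only the quoted string.
local found = {}
for word in string.gmatch(src, pattern) do
  found[#found + 1] = word
end

for _, word in ipairs(found) do
  print(word)
end
```

The first three calls are captured, quotes included. The call containing \" is skipped, because %b"" stops at the first closing quote it meets and the %) that follows then fails to match. TRANS("multiple"" ""quotations") is missed for the same reason.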