I need to extract an entire javascript function from a script file. I know the name of the function, but I don't know what the contents of the function may be. This function may be embedded within any number of closures.
I need to have two output values:
- The entire body of the named function that I'm finding in the input script.
- The full input script with the found named function removed.
So, assume I'm looking for the findMe function in this input script:
function() {
function something(x,y) {
if (x == true) {
console.log ("Something says X is true");
// The regex should not find this:
console.log ("function findMe(z) { var a; }");
}
}
function findMe(z) {
if (z == true) {
console.log ("Something says Z is true");
}
}
findMe(true);
something(false,"hello");
}();
From this, I need the following two result values:
The extracted findMe script:
function findMe(z) {
if (z == true) {
console.log ("Something says Z is true");
}
}
The input script with the findMe function removed:
function() {
function something(x,y) {
if (x == true) {
console.log ("Something says X is true");
// The regex should not find this:
console.log ("function findMe(z) { var a; }");
}
}
findMe(true);
something(false,"hello");
}();
The problems I'm dealing with:
The body of the script to find could have any valid javascript code within it. The code or regex to find this script must be able to ignore values in strings, multiple nested block levels, and so forth.
If the function definition to find is specified inside of a string, it should be ignored.
Any advice on how to accomplish something like this?
Update:
It looks like regex is not the right way to do this. I'm open to pointers to parsers that could help me accomplish this. I'm looking at Jison, but would love to hear about anything else.
A regex can't do this. What you need is a tool that parses JavaScript in a compiler-accurate way, builds up a structure representing the shape of the JavaScript code, enables you to find the function you want and print it out, and enables you to remove the function definition from that structure and regenerate the remaining javascript text.
Our DMS Software Reengineering Toolkit can do this, using its JavaScript front end. DMS provides general parsing, abstract syntax tree building/navigating/manipulation, and prettyprinting of (valid!) source text from a modified AST. The JavaScript front end provides DMS with a compiler-accurate definition of JavaScript. You can point DMS/JavaScript at a JavaScript file (or even various kinds of dynamic HTML with embedded script tags containing JavaScript) and have it produce the AST. A DMS pattern can be used to find your function:
pattern find_my_function(r:type,a: arguments, b:body): declaration
" \r my_function_name(\a) { \b } ";
DMS can search the AST for a matching tree with the specified structure; because this is an AST match and not a string match, line breaks, whitespace, comments and other trivial differences won't fool it. [What you didn't say is what to do if you have more than one function with that name in different scopes: which one do you want?]
Having found the match, you can ask DMS to print just that matched code which acts as your extraction step. You can also ask DMS to remove the function using a rewrite rule:
rule remove_my_function(r:type,a: arguments, b:body): declaration->declaration
" \r my_function_name(\a) { \b } " -> ";";
and then prettyprint the resulting AST. DMS will preserve all the comments properly.
What this does not do is check that removing the function doesn't break your code. After all, the function may sit in a scope where it directly accesses variables defined locally in that scope. Moving it to another scope means it can no longer reference those variables.
To detect this problem, you need not only a parser but also a symbol table that maps identifiers in the code to their definitions and uses. The removal rule then has to add a semantic condition to check for this. DMS provides the machinery to build such a symbol table from the AST using an attribute grammar.
To fix this problem, when removing the function, it may be necessary to modify the function to accept additional arguments replacing the local variables it accesses, and modify the call sites to pass in what amounts to references to the local variables. This can be implemented with a modest sized set of DMS rewrite rules, that check the symbol tables.
So removing such a function can be a lot more complex than just moving the code.
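To make the scoping hazard concrete, here is a small hand-written sketch in plain JavaScript (it does not use DMS, and all names such as `makeCounter` and `incrementLifted` are made up for illustration). The nested function reads two variables from its enclosing scope; once it is lifted out, those have to be passed in explicitly, and the call site has to change as well.

```javascript
// Before: `increment` is nested and relies on `total` and `step`
// from the enclosing scope. Cutting it out as-is would break it.
function makeCounter(step) {
  let total = 0;
  function increment() {
    total += step; // free variables: `total` and `step`
    return total;
  }
  return increment;
}

// After: the function has been lifted out of the closure, so everything it
// used from that scope must now arrive as arguments (manual rewrite).
function incrementLifted(state, step) {
  state.total += step;
  return state.total;
}

function makeCounterLifted(step) {
  const state = { total: 0 }; // object stands in for a reference to `total`
  // The call site now passes what used to be captured implicitly.
  return () => incrementLifted(state, step);
}

const before = makeCounter(2);
const after = makeCounterLifted(2);
console.log(before(), before()); // both counters advance in steps of 2
console.log(after(), after());
```

Doing this rewrite automatically is exactly the part that needs a symbol table: a tool has to know that `total` and `step` are defined outside the function before it can decide what to pass in.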
If the script is included in your page (something you weren't clear about) and the function is publicly accessible, then you can just get the source to the function with:
functionXX.toString();
https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/Function/toString
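A minimal sketch of that idea, assuming the function can actually be reached from your code (the wrapper `outer` is hypothetical and only exists to mimic a closure that hands the function out). Note that this gives you the function text only; it does not remove anything from the script file, and older engines may normalize whitespace in the returned string.

```javascript
function outer() {
  function findMe(z) {
    if (z == true) {
      return "Z is true";
    }
  }
  // The nested function is reachable only because it is returned here.
  return findMe;
}

// Function.prototype.toString returns the source text of the function.
const source = outer().toString();
console.log(source);
```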
Other ideas:
1) Look at the open source code that does either JS minification or JS pretty indent. In both cases, those pieces of code have to "understand" the JS language in order to do their work in a fault tolerant way. I doubt it's going to be pure regex as the language is just a bit more complicated than that.
2) If you control the source at the server and want to modify a particular function in it, then just insert some new JS that replaces that function at runtime with your own function. That way, you let the JS compiler identify the function for you and you just replace it with your own version.
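As a rough sketch of idea 2, assuming the function is reachable through some object or global binding (the `app` object below is a made-up stand-in for the page's script):

```javascript
// Hypothetical namespace object standing in for the existing script.
const app = {
  findMe(z) {
    return z ? "original: true" : "original: false";
  },
  run() {
    return this.findMe(true);
  },
};

// Keep a handle on the old version in case the new one wants to delegate.
const originalFindMe = app.findMe;

// Replace the function at runtime; existing callers pick up the new version.
app.findMe = function (z) {
  return "replaced (" + originalFindMe.call(this, z) + ")";
};

console.log(app.run());
```

This does not help when the function is a private declaration inside a closure, as in the question's example, because nothing outside the closure can rebind it.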
3) For regex, here's what I've done which is not foolproof, but worked for me for some build tools I use:
I run multiple passes (using regex in python):
- Remove all comments delineated with /* and */.
- Remove all quoted strings
- Now, all that's left is non-string, non-comment javascript so you should be able to regex directly on your function declaration
- If you need the function source with strings and comments back in, you'll have to reconstitute that from the original, now that you know the beginning and end of the function
Here are the regexes I use (expressed in python's multi-line format):
import re

reStr = r"""
( # capture the non-comment portion
"(?:\\.|[^"\\])*" # capture double quoted strings
|
'(?:\\.|[^'\\])*' # capture single quoted strings
|
(?:[^/\n"']|/[^/*\n"'])+ # any code besides newlines or string literals
|
\n # newline
)
|
(/\* (?:[^*]|\*[^/])* \*/) # /* comment */
|
(?://(.*)$) # // single line comment
$"""
reMultiStart = r""" # start of a multiline comment that doesn't terminate on this line
(
/\* # /*
(
[^\*] # any character that is not a *
| # or
\*[^/] # * followed by something that is not a /
)* # any number of these
)
$"""
reMultiEnd = r""" # end of a multiline comment that didn't start on this line
(
^ # start of the line
(
[^\*] # any character that is not a *
| # or
\*+[^/] # * followed by something that is not a /
)* # any number of these
\*/ # followed by a */
)
"""
regExSingleKeep = re.compile("// /") # lines that have single lines comments that start with "// /" are single line comments we should keep
regExMain = re.compile(reStr, re.VERBOSE)
regExMultiStart = re.compile(reMultiStart, re.VERBOSE)
regExMultiEnd = re.compile(reMultiEnd, re.VERBOSE)
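The same strip-then-search idea can also be done in one pass that blanks out strings and comments instead of deleting them, so that character offsets in the cleaned text still line up with the original. This is a minimal sketch in JavaScript, not the Python build tool described above; the function name `maskStringsAndComments` is made up, and the sketch assumes the input contains no regex literals or template literals.

```javascript
// Replace every character inside string literals and comments with a space
// (newlines are kept), so the result has the same length as the input.
// Assumption: no regex literals or template literals in `src`.
function maskStringsAndComments(src) {
  const blank = (ch) => (ch === "\n" ? "\n" : " ");
  let out = "";
  let i = 0;
  while (i < src.length) {
    const ch = src[i];
    const next = src[i + 1];
    if (ch === '"' || ch === "'") {
      // String literal: keep the quotes, blank the contents.
      out += ch;
      i++;
      while (i < src.length && src[i] !== ch) {
        if (src[i] === "\\" && i + 1 < src.length) {
          // Escape sequence: blank the backslash and the escaped character.
          out += " " + blank(src[i + 1]);
          i += 2;
        } else {
          out += blank(src[i]);
          i++;
        }
      }
      if (i < src.length) {
        out += ch; // closing quote
        i++;
      }
    } else if (ch === "/" && next === "/") {
      // Line comment: blank up to, but not including, the newline.
      while (i < src.length && src[i] !== "\n") {
        out += " ";
        i++;
      }
    } else if (ch === "/" && next === "*") {
      // Block comment: blank through the closing */ (or to end of input).
      const end = src.indexOf("*/", i + 2);
      const stop = end === -1 ? src.length : end + 2;
      while (i < stop) {
        out += blank(src[i]);
        i++;
      }
    } else {
      out += ch;
      i++;
    }
  }
  return out;
}

const sample =
  'log("function findMe(z) { }"); // findMe {\nfunction findMe(z) { return 1; }';
const masked = maskStringsAndComments(sample);
// The only "function findMe" left in `masked` is the real declaration, and
// its index is valid in `sample` too because the lengths match.
console.log(masked.indexOf("function findMe"));
```

Because the masked text has the same length as the original, the last step in the list above (getting the function back with its strings and comments) becomes a plain `slice` on the original using the offsets found in the masked copy.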
This all sounds messy to me. You might be better off explaining what problem you're really trying to solve so folks can help find a more elegant solution to the real problem.
I built a solution in C# using plain old string methods (no regex) and it works for me with nested functions as well. The underlying principle is in counting braces and checking for unbalanced closing braces. Caveat: This won't work for cases where braces are part of a comment but you can easily enhance this solution by first stripping out comments from the code before parsing function boundaries.
I first added this extension method to extract all indices of matches in a string (Source: More efficient way to get all indexes of a character in a string)
/// <summary>
/// Source: https://stackoverflow.com/questions/12765819/more-efficient-way-to-get-all-indexes-of-a-character-in-a-string
/// </summary>
public static List<int> AllIndexesOf(this string str, string value)
{
if (String.IsNullOrEmpty(value))
throw new ArgumentException("the string to find may not be empty", "value");
List<int> indexes = new List<int>();
for (int index = 0; ; index += value.Length)
{
index = str.IndexOf(value, index);
if (index == -1)
return indexes;
indexes.Add(index);
}
}
I defined this struct for easy referencing of function boundaries:
private struct FuncLimits
{
public int StartIndex;
public int EndIndex;
}
Here's the main function where I parse the boundaries:
public void Parse(string file)
{
List<FuncLimits> funcLimits = new List<FuncLimits>();
List<int> allFuncIndices = file.AllIndexesOf("function ");
List<int> allOpeningBraceIndices = file.AllIndexesOf("{");
List<int> allClosingBraceIndices = file.AllIndexesOf("}");
for (int i = 0; i < allFuncIndices.Count; i++)
{
int thisIndex = allFuncIndices[i];
bool functionBoundaryFound = false;
int testFuncIndex = i;
int lastIndex = file.Length - 1;
while (!functionBoundaryFound)
{
//find the next function index or last position if this is the last function definition
int nextIndex = (testFuncIndex < (allFuncIndices.Count - 1)) ? allFuncIndices[testFuncIndex + 1] : lastIndex;
var q1 = from c in allOpeningBraceIndices where c > thisIndex && c <= nextIndex select c;
var qTemp = q1.Skip<int>(1); //skip the first element as it is the opening brace for this function
var q2 = from c in allClosingBraceIndices where c > thisIndex && c <= nextIndex select c;
int q1Count = qTemp.Count<int>();
int q2Count = q2.Count<int>();
if (q1Count == q2Count && nextIndex < lastIndex)
functionBoundaryFound = false; //next function is a nested function, move on to the one after this
else if (q2Count > q1Count)
{
//we found the function boundary... just need to find the closest unbalanced closing brace
FuncLimits funcLim = new FuncLimits();
funcLim.StartIndex = q1.ElementAt<int>(0);
funcLim.EndIndex = q2.ElementAt<int>(q1Count);
funcLimits.Add(funcLim);
functionBoundaryFound = true;
}
testFuncIndex++;
}
}
}
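For comparison, the brace-counting principle can be sketched more compactly in JavaScript for a single named function. This is an illustration under the same caveat as above, not a port of the C# code: braces inside strings or comments will throw the count off. The name `extractFunction` and the optional `scanText` parameter are made up; `scanText` lets you scan a same-length copy of the source in which strings and comments have been neutralized, while the result is still cut from the original. The sketch also assumes the parameter list contains no braces (no destructuring or object defaults).

```javascript
// Find `function <name>(...) { ... }` by tracking brace depth.
// `scanText` is the text that is searched; it must have the same length as
// `src` so that offsets carry over. By default the source itself is scanned.
function extractFunction(src, name, scanText = src) {
  const safeName = name.replace(/\$/g, "\\$&"); // `$` is legal in identifiers
  const header = new RegExp("\\bfunction\\s+" + safeName + "\\s*\\(");
  const match = header.exec(scanText);
  if (!match) return null;

  const start = match.index;
  const open = scanText.indexOf("{", start);
  if (open === -1) return null;

  let depth = 0;
  for (let i = open; i < scanText.length; i++) {
    if (scanText[i] === "{") depth++;
    else if (scanText[i] === "}") depth--;
    if (depth === 0) {
      const end = i + 1; // one past the matching closing brace
      return {
        extracted: src.slice(start, end),
        remainder: src.slice(0, start) + src.slice(end),
      };
    }
  }
  return null; // braces never balanced
}

const script =
  "function outer() {\n" +
  "  function findMe(z) {\n" +
  "    if (z) { return 1; }\n" +
  "  }\n" +
  "  findMe(true);\n" +
  "}";
const result = extractFunction(script, "findMe");
console.log(result.extracted);
console.log(result.remainder);
```

This returns both values the question asks for: the text of the named function and the script with that function cut out. It picks the first declaration with the given name, so it does not address the case of the same name being declared in several scopes.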
I am almost afraid that regex cannot do this job. I think it is the same as trying to parse XML or HTML with regex, a topic that has already caused various religious debates on this forum.
EDIT: Please correct me if this is NOT the same as trying to parse XML.
I guess you would have to use and construct a String-Tokenizer for this job.
function tokenizer(str){
    var stack = [];   // stack of opening-tokens
    var last = "";    // last opening-token
    // token pairs: subblocks, strings, regex
    var matches = {
        "}":"{",
        "'":"'",
        '"':'"',
        "/":"/"
    };
    // start with function declaration
    var needle = str.match(/function[ ]+findMe\([^\)]*\)[^\{]*\{/);
    if (needle === null) return str; // declaration not found: nothing to remove
    // move everything before needle to result
    var result = str.slice(0, needle.index);
    // everything after needle goes to the stream that will be parsed
    var stream = str.slice(needle.index + needle[0].length);
    // init stack
    stack.push("{");
    last = "{";
    // while still in this function
    while(stack.length > 0){
        // determine next token
        needle = stream.match(/(?:\{|\}|"|'|\/|\\)/);
        if (needle === null) break; // ran out of input before the braces balanced
        var token = needle[0];
        if(token == "\\"){
            // if this is an escape character => remove escaped character
            stream = stream.slice(needle.index + 2);
            continue;
        }else if(last == matches[token]){
            // if this ends something pop stack and set last
            stack.pop();
            last = stack[stack.length-1];
        }else if(last == "{"){
            // if we are not inside a string (last either " or ' or /)
            // push token to stack
            stack.push(token);
            last = token;
        }
        // cut away including token
        stream = stream.slice(needle.index + 1);
    }
    // everything before the function plus everything after its closing brace
    return result + stream;
}
oh, I forgot tokens for comments... but i guess you got an idea now of how it works...