Matching brackets in a string_问答_开发者_运维开发者技术经验分享

What is the most efficient or 开发者_开发技巧elegant method for matching brackets in a string such as:

"f @ g[h[[i[[j[2], k[[1, m[[1, n[2]]]]]]]]]] // z"

for the purpose of identifying and replacing [[ Part ]] brackets with the single character forms?

I want to get:

Matching brackets in a string

With everything else intact, such as the prefix @ and postfix // forms intact

An explanation of Mathematica syntax for those unfamiliar:

Functions use single square brackets for arguments: func[1, 2, 3]

Part indexing is done with double square brackets: list[[6]] or with single-character Unicode double brackets: list〚6〛

My intent is to identify the matching [[ ]] form in a string of ASCII text, and replace it with the Unicode characters 〚〛

Ok, here is another answer, a bit shorter:

Clear[replaceDoubleBrackets];
replaceDoubleBrackets[str_String, openSym_String, closeSym_String] := 
Module[{n = 0},
  Apply[StringJoin, 
   Characters[str] /. {"[" :> {"[", ++n}, 
     "]" :> {"]", n--}} //. {left___, {"[", m_}, {"[", mp1_}, 
      middle___, {"]", mp1_}, {"]", m_}, right___} /; 
       mp1 == m + 1 :> {left, openSym, middle, 
        closeSym, right} /. {br : "[" | "]", _Integer} :> br]]

Example:

In[100]:= replaceDoubleBrackets["f[g[h[[i[[j[2], k[[1, m[[1, n[2]]]]]]]]]]]", "(", ")"]

Out[100]= "f[g[h(i(j[2], k(1, m(1, n[2]))))]]"

EDIT

You can also use Mathematica built-in facilities, if you want to replace double brackets specifically with the symbols you indicated:

Clear[replaceDoubleBracketsAlt];
replaceDoubleBracketsAlt[str_String] :=
  StringJoin @@ Cases[ToBoxes@ToExpression[str, InputForm, HoldForm],
     _String, Infinity]

In[117]:= replaceDoubleBracketsAlt["f[g[h[[i[[j[2], k[[1, m[[1, n[2]]]]]]]]]]]"]

Out[117]= f[g[h[[i[[j[2],k[[1,m[[1,n[2]]]]]]]]]]]

The result would not show here properly, but it is a Unicode string with the symbols you requested.

When I wrote my first solution, I hadn't noticed that you just wanted to replace the [[ with 〚 in a string, and not an expression. You can always use HoldForm or Defer as

Matching brackets in a string

but I think you already knew that, and you want the expression as a string, just like the input (ToString@ on the above doesn't work)

As all the answers so far focus on string manipulations, I'll take a numeric approach instead of wrestling with strings, which is more natural to me. The character code for [ is 91 and ] is 93. So doing the following

Matching brackets in a string

gives the locations of the brackets as a 0/1 vector. I've negated the closing brackets, just to aid the thought process and for use later on.

NOTE: I have only checked for divisibility by 91 and 93, as I certainly don't expect you to be entering any of the following characters, but if, for some reason you choose to, you can easily AND the result above with a boolean list of equality with 91 or 93.

Matching brackets in a string

From this, the positions of the first of Part's double bracket pair can be found as

Matching brackets in a string

The fact that in mma, expressions do not start with [ and that more than two [ cannot appear consecutively as [[[... has been implicitly assumed in the above calculation.

Now the closing pair is trickier to implement, but simple to understand. The idea is the following:

For each non-zero position in closeBracket, say i, go to the corresponding position in openBracket and find the first non-zero position to the left of it (say j).
Set doubleCloseBrackets[[i-1]]=closeBracket[[i]]+openBracket[[j]]+doubleOpenBrackets[[j]].
You can see that doubleCloseBrackets is the counterpart of doubleOpenBrackets and is non-zero at the position of the first of Part's ]] pair.

Matching brackets in a string

So now we have a set of Boolean positions for the first open bracket. We simply have to replace the corresponding element in charCode with the equivalent of 〚 and similarly, with the Boolean positions for the first close bracket, we replace the corresponding element in charCode with the equivalent of 〛.

Matching brackets in a string

Finally, by deleting the element next to the ones that were changed, you can get your modified string with [[]] replaced by 〚〛

Matching brackets in a string

NOTE 2:

A lot of my MATLAB habits have crept in the above code, and is not entirely idiomatic in Mathematica. However, I think the logic is correct, and it works. I'll leave it to you to optimize it (me thinks you can do away with Do[]) and make it a module, as it would take me a lot longer to do it.

Code as text

Clear["Global`*"]
str = "f[g[h[[i[[j[2], k[[1, m[[1, n[2]]]]]]]]]]]";
charCode = ToCharacterCode@str;
openBracket = Boole@Divisible[charCode, First@ToCharacterCode["["]];
closeBracket = -Boole@
    Divisible[charCode, First@ToCharacterCode["]"]];
doubleOpenBracket = 
  Append[Differences@Accumulate[openBracket], 0] openBracket;
posClose = Flatten@Drop[Position[closeBracket, Except@0, {1}], 1];

doubleCloseBracket = ConstantArray[0, Dimensions@doubleOpenBracket];
openBracketDupe = openBracket + doubleOpenBracket;
Do[
  tmp = Last@
    Flatten@Position[openBracketDupe[[1 ;; i]], Except@0, {1}];
  doubleCloseBracket[[i - 1]] = 
   closeBracket[[i]] + openBracketDupe[[tmp]];
  openBracketDupe[[tmp]] = 0;,
  {i, posClose}];

changeOpen = 
  Cases[Range[First@Dimensions@charCode]  doubleOpenBracket, Except@0];
changeClosed = 
  Cases[Range[First@Dimensions@charCode]  doubleCloseBracket, 
   Except@0];
charCode[[changeOpen]] = ToCharacterCode["\[LeftDoubleBracket]"];
charCode[[changeClosed]] = ToCharacterCode["\[RightDoubleBracket]"];
FromCharacterCode@
 Delete[Flatten@charCode, 
  List /@ (Riffle[changeOpen, changeClosed] + 1)]

Here is my attempt. The pasted ASCII code is pretty unreadable due to the presence of special characters so I first provide a picture of how it looks in MMA.

Basically what it does is this: Opening brackets are always uniquely identifiable as single or double. The problem lies in the closing brackets. Opening brackets always have the pattern string-of-characters-containing-no-brackets + [ or [[. It is impossible to have either a [ following a [[ or vice versa without other characters in-between (at least, not in error-free code).

So, we use this as a hook and start looking for certain pairs of matching brackets, namely the ones that don't have any other brackets in-between. Since we know the type, either "[... ]" or "[[...]]", we can replace the latter ones with the double-bracket symbols and the former one with unused characters (I use smileys). This is done so they won't play a role anymore in the next iteration of the pattern matching process.

We repeat until all brackets are processed and finally the smileys are converted to single brackets again.

You see, the explanation takes mores characters than the code does ;-).

Matching brackets in a string

Ascii:

s = "f @ g[hh[[i[[jj[2], k[[1, m[[1, n[2]]]]]]]]]] // z";

myRep[s_String] :=
 StringReplace[s,
  {
   Longest[y : Except["[" | "]"] ..] ~~ "[" ~~ 
     Longest[x : Except["[" | "]"] ..] ~~ "]" :> 
    y <> "\[HappySmiley]" <> x <> "\[SadSmiley]",
   Longest[y : Except["[" | "]"] ..] ~~ "[" ~~ Whitespace ... ~~ "[" ~~
      Longest[x : Except["[" | "]"] ..] ~~ "]" ~~ Whitespace ... ~~ 
     "]" :> y <> "\[LeftDoubleBracket]" <> x <> "\[RightDoubleBracket]"
   }
  ]

StringReplace[FixedPoint[myRep, s], {"\[HappySmiley]" -> "[","\[SadSmiley]" -> "]"}]

Oh, and the Whitespace part is because in Mathematica double brackets need not be next to each other. a[ [1] ] is just as legal as is a[[1]].

You need a stack to do this right; there's no way to do it correctly using regular expressions.

You need to recognize [[ as well as the depth of those brackets, and match them with a ]] which has the same depth. (Stacks do this very nicely. As long as they don't overflow :P)

Without using some sort of a counter, this is not possible. Without having some maximum depth defined, it's not possible to represent this with a finite state automata, so it's not possible to do this with a regular expression.

Note: here's an example of a string that would not be parsed correctly by a regular expression:

[1+[[2+3]*4]] = 21

This would be turned into

[1 + 2 + 3] * 4 = 24

Here is some java-like pseudocode:

public String minimizeBrackets(String input){
    Stack s = new Stack();
    boolean prevWasPopped = false;
    for(char c : input){
        if(c=='['){
            s.push(i);
            prevWasPopped = false;
        }
        else if(c==']'){
            //if the previous step was to pop a '[', then we have two in a row, so delete an open/close pair
            if(prevWasPopped){
                input.setChar(i, " ");
                input.setChar(s.pop(), " ");
            }
            else s.pop();
            prevWasPopped = true;
        }
        else prevWasPopped = false;
    }
    input = input.stripSpaces();
    return input;
}

Note that I cheated a bit by simply turning them into spaces, then removing spaces afterwards... this will NOT do what I advertised, it will destroy all spaces in the original string as well. You could simply log all of the locations instead of changing them to spaces, and then copy over the original string without the logged locations.

Also note that I didn't check the state of the stack at the end. It is assumed to be empty, because every [ character in the input string is assumed to have its unique ] character, and vice versa. If the stack throws a "you tried to pop me when i'm empty" exception at any point, or is not empty at the end of the run, you know that your string was not well formed.

Other answers have made it moot, I think, but here's a more Mathematica-idiomatic version of yoda's first solution. For a long enough string, parts of it may be a bit more efficient, besides.

str = "f @ g[h[[i[[j[2], k[[1, m[[1, n[2]]]]]]]]]] // z";
charCode = ToCharacterCode@str;
openBracket = Boole@Thread[charCode == 91];
closeBracket = -Boole@Thread[charCode == 93];
doubleOpenBracket = openBracket RotateLeft@openBracket;
posClose = Flatten@Position[closeBracket, -1, {1}];
doubleCloseBracket = 0*openBracket;
openBracketDupe = openBracket + doubleOpenBracket;
Do[
 tmp = Last@DeleteCases[Range@i*Sign@openBracketDupe[[1 ;; i]], 0];
 doubleCloseBracket[[i - 1]] = 
  closeBracket[[i]] + openBracketDupe[[tmp]];
 openBracketDupe[[tmp]] = 0, {i, posClose}]
counter = Range@Length@charCode;
changeOpen = DeleteCases[doubleOpenBracket*counter, 0];
changeClosed = DeleteCases[doubleCloseBracket*counter, 0];
charCode[[changeOpen]] = First@ToCharacterCode["\[LeftDoubleBracket]"];
charCode[[changeClosed]] = 
  First@ToCharacterCode["\[RightDoubleBracket]"];
FromCharacterCode@Delete[charCode, List /@ Flatten@{1 + changeOpen, 1 + changeClosed}]

This way of setting "tmp" may be LESS efficient, but I think it's interesting.

I can offer a heavy approach (not too elegant). Below is my implementation of the bare-bones Mathematica parser (it will only work for strings containing Fullform of the code, with the possible exception for double brackets - which I will use here), based on rather general functionality of breadth-first parser that I developed mostly to implement an HTML parser:

ClearAll[listSplit, reconstructIntervals, groupElements, 
groupPositions, processPosList, groupElementsNested];

listSplit[x_List, lengthlist_List, headlist_List] := 
  MapThread[#1 @@ Take[x, #2] &, {headlist, 
    Transpose[{Most[#] + 1, Rest[#]} &[
      FoldList[Plus, 0, lengthlist]]]}];

reconstructIntervals[listlen_Integer, ints_List] := 
  Module[{missed, startint, lastint},
    startint  = If[ints[[1, 1]] == 1, {}, {1, ints[[1, 1]] - 1}];
    lastint = 
       If[ints[[-1, -1]] == listlen, {}, {ints[[-1, -1]] + 1, listlen}];
    missed = 
      Map[If[#[[2, 1]] - #[[1, 2]] > 1, {#[[1, 2]] + 1, #[[2, 1]] - 1}, {}] &, 
      Partition[ints, 2, 1]];
    missed = Join[missed, {lastint}];
    Prepend[Flatten[Transpose[{ints, missed}], 1], startint]];

groupElements[lst_List, poslist_List, headlist_List] /; 
 And[OrderedQ[Flatten[Sort[poslist]]], Length[headlist] == Length[poslist]] := 
  Module[{totalheadlist, allints, llist},
    totalheadlist = 
     Append[Flatten[Transpose[{Array[Sequence &, {Length[headlist]}], headlist}], 1], Sequence];
  allints = reconstructIntervals[Length[lst], poslist];
  llist = Map[If[# === {}, 0, 1 - Subtract @@ #] &, allints];
  listSplit[lst, llist, totalheadlist]];

  (* To work on general heads, we need this *)

groupElements[h_[x__], poslist_List, headlist_List] := 
   h[Sequence @@ groupElements[{x}, poslist, headlist]];

(* If we have a single head *)
groupElements[expr_, poslist_List, head_] := 
    groupElements[expr, poslist, Table[head, {Length[poslist]}]];


groupPositions[plist_List] :=
     Reap[Sow[Last[#], {Most[#]}] & /@ plist, _, List][[2]];


processPosList[{openlist_List, closelist_List}] :=
   Module[{opengroup, closegroup, poslist},
    {opengroup, closegroup} = groupPositions /@ {openlist, closelist} ;
    poslist =  Transpose[Transpose[Sort[#]] & /@ {opengroup, closegroup}];
    If[UnsameQ @@ poslist[[1]],
       Return[(Print["Unmatched lists!", {openlist, closelist}]; {})],
       poslist = Transpose[{poslist[[1, 1]], Transpose /@ Transpose[poslist[[2]]]}]
    ]
];

groupElementsNested[nested_, {openposlist_List, closeposlist_List}, head_] /; Head[head] =!= List := 
 Fold[
  Function[{x, y}, 
    MapAt[groupElements[#, y[[2]], head] &, x, {y[[1]]}]], 
  nested, 
  Sort[processPosList[{openposlist, closeposlist}], 
   Length[#2[[1]]] < Length[#1[[1]]] &]];

ClearAll[parse, parsedToCode, tokenize, Bracket ];

(* "tokenize" our string *)
tokenize[code_String] := 
 Module[{n = 0, tokenrules},
   tokenrules = {"[" :> {"Open", ++n}, "]" :> {"Close", n--}, 
       Whitespace | "" ~~ "," ~~ Whitespace | ""};
   DeleteCases[StringSplit[code, tokenrules], "", Infinity]];

(* parses the "tokenized" string in the breadth-first manner starting 
   with the outermost brackets, using Fold and  groupElementsNested*)

parse[preparsed_] := 
  Module[{maxdepth = Max[Cases[preparsed, _Integer, Infinity]], 
   popenlist, parsed, bracketPositions},
   bracketPositions[expr_, brdepth_Integer] := {Position[expr, {"Open", brdepth}], 
   Position[expr, {"Close", brdepth}]};  
   parsed = Fold[groupElementsNested[#1, bracketPositions[#1, #2], Bracket] &,
               preparsed, Range[maxdepth]];
   parsed =  DeleteCases[parsed, {"Open" | "Close", _}, Infinity];
   parsed = parsed //. h_[x___, y_, Bracket[z___], t___] :> h[x, y[z], t]];

 (* convert our parsed expression into a code that Mathematica can execute *)
 parsedToCode[parsed_] :=
 Module[{myHold},
   SetAttributes[myHold, HoldAll];   
   Hold[Evaluate[
     MapAll[# //. x_String :> ToExpression[x, InputForm, myHold] &, parsed] /.
      HoldPattern[Sequence[x__][y__]] :> x[y]]] //. myHold[x___] :> x

 ];

(note the use of MapAll in the last function). Now, here is how you can use it :)

In[27]:= parsed = parse[tokenize["f[g[h[[i[[j[2], k[[1, m[[1, n[2]]]]]]]]]]]"]]

Out[27]= {"f"["g"["h"[Bracket[
 "i"[Bracket["j"["2"], 
   "k"[Bracket["1", "m"[Bracket["1", "n"["2"]]]]]]]]]]]}

In[28]:= parsed //. a_[Bracket[b__]] :> "Part"[a, b]


Out[28]= {"f"["g"["Part"["h", 
"Part"["i", "j"["2"], 
 "Part"["k", "1", "Part"["m", "1", "n"["2"]]]]]]]}

Now you can use parseToCode:

In[35]:= parsedToCode[parsed//.a_[Bracket[b__]]:>"Part"[a,b]]//FullForm

Out[35]//FullForm= Hold[List[f[g[Part[h,Part[i,j[2],Part[k,1,Part[m,1,n[2]]]]]]]]]

EDIT

Here is an addition needed to make only the character-replacement, as requested:

Clear[stringify, part, parsedToString];
stringify[x_String] := x;
stringify[part[open_, x___, close_]] := 
   part[open, Sequence @@ Riffle[Map[stringify, {x}], ","], close];
stringify[f_String[x___]] := {f, "[",Sequence @@ Riffle[Map[stringify, {x}], ","], "]"};

parsedToString[parsed_] := 
 StringJoin @@ Flatten[Apply[stringify, 
  parsed //. Bracket[x__] :> part["yourOpenChar", x, "yourCloseChar"]] //. 
    part[x__] :> x];

Here is how we can use it:

In[70]:= parsedToString[parsed]

Out[70]= "f[g[h[yourOpenChari[yourOpenCharj[2],k[yourOpenChar1,m[\
  yourOpenChar1,n[2]yourCloseChar]yourCloseChar]yourCloseChar]\
   yourCloseChar]]]"

Edit

tl;dr version:

I'm on track for inadvertently solving the base problem, but regular expressions can't count brackets so use a stack implementation.

Longer version:

My esteemed colleagues are correct, the best way to approach this problem is a stack implementation. Regular expressions may be able to change [[ and ]] into [ and ] respectively if the same number of [[ exist within the string as the number of ]], however if the whole point of the exercise is to use the text within matching [] then regex isn't the way to go. Regular expressions cannot count brackets, nesting logic is just too complex for a simple regex to account for. So in a nutshell I believe that regular expressions can be used to address the basic requirement, which was to change matching [[]] into matching [], however you should really be using a stack because it allows easier manipulation of the resultant string.

And sorry, I completely missed the mathematica tag! I'll leave my answer in here though just in case someone gets excited and jumps the gun like I did.

End Edit

A regular expression utilising reluctant quantifiers should be able to progressively determine where [[ and ]] tokens are in a String, and ensure that matches are only made if the number of [[ equals the number of ]].

The required regex would be along the lines of [[{1}?(?!]])*?]]{1}?, which in plain English is:

[[{1}?, progress one character at a time from the start of the string until one instance of [[ is encountered
(?!]])*? if any characters exist that don't match ]], progress through them one at a time
]]{1}? match the closing bracket

To change the double-square-brackets into single-square-brackets, identify groups within the regex by adding brackets around the first and third particles:

([[{1}?)(?!]])*?(]]{1}?)

This allows you to select the [[ and ]] tokens, and then replace them with [ or ].

Edited (there was an error there)

Is this too naïve?

doubleB[x_String] :=
  StringReplace[
   ToString@StandardForm@
     ToExpression["Hold[" <> x <> "]"], 
  {"Hold[" -> "", RegularExpression["\]\)$"] -> "\)"}];

doubleB["f[g[h[[i[[j[2], k[[1, m[[1, n[2]]]]]]]]]]]"]
ToExpression@doubleB["f[g[h[[i[[j[2], k[[1, m[[1, n[2]]]]]]]]]]]"]

Matching brackets in a string

Just trying to exploit Mma's own parser ...

Here's another one with pattern matching, probably similar to what Sjoerd C. de Vries does, but this one operates on a nested-list structure that is created first, procedurally:

FirstStringPosition[s_String, pat_] :=
    Module[{f = StringPosition[s, pat, 1]},
      If[Length@f > 0, First@First@f, Infinity]
    ];
FirstStringPosition[s_String, ""] = Infinity;

$TokenizeNestedBracePairsBraces = {"[" -> "]", "{" -> "}", "(" -> ")"(*,
  "<"\[Rule]">"*)};
(*nest substrings based on parentheses {([*) (* TODO consider something like http://stackoverflow.com/a/5784082/524504, though non procedural potentially slower*)
TokenizeNestedBracePairs[x_String, closeparen_String] :=
    Module[{opString, cpString, op, cp, result = {}, innerResult,
      rest = x},

      While[rest != "",

        op = FirstStringPosition[rest,
          Keys@$TokenizeNestedBracePairsBraces];
        cp = FirstStringPosition[rest, closeparen];

        Assert[op > 0 && cp > 0];

        Which[
        (*has opening parenthesis*)
          op < cp

          ,(*find next block of [] *)
          result~AppendTo~StringTake[rest, op - 1];
          opString = StringTake[rest, {op}];
          cpString = opString /. $TokenizeNestedBracePairsBraces;
          rest = StringTake[rest, {op + 1, -1}];

          {innerResult, rest} = TokenizeNestedBracePairs[rest, cpString];
          rest = StringDrop[rest, 1];

          result~AppendTo~{opString, innerResult, cpString};

          , cp < Infinity
          ,(*found searched closing parenthesis and no further opening one \
earlier*)
          result~AppendTo~StringTake[rest, cp - 1];
          rest = StringTake[rest, {cp, -1}];
          Return@{result, rest}

          , True
          ,(*done*)
          Return@{result~Append~rest, ""}
        ]
      ]
    ];
(* TODO might want to get rid of empty strings "", { generated here:
TokenizeNestedBracePairs@"f @ g[h[[i[[j[2], k[[1, m[[1, n[2]]]]]]]]]] \
// z"
*)

TokenizeNestedBracePairs[s_String] :=
    First@TokenizeNestedBracePairs[s, ""]

and with these definitions then

StringJoin @@ 
 Flatten[TokenizeNestedBracePairs@
    "f @ g[h[[i[[j[2], k[[1, m[[1, n[2]]]]]]]]]] // z" //. {"[", {"", \
{"[", Longest[x___], "]"}, ""}, "]"} :> {"\[LeftDoubleBracket]", {x}, 
     "\[RightDoubleBracket]"}]

gives

Matching brackets in a string