How do I make this REGEX ignore = in a tag's attribute?

Alan Moore was very helpful in solving my earlier problem, but I didn't realise until just now that the REGEX he wrote for pulling out all of a tag's attributes will break prematurely if there's an equal sign in a URL. I've spent a good while on this, trying different modifications with lookaheads and lookbehinds, to no avail.

I need this regex to break on space + word + =, but it's breaking even when there's no space, only a letter followed by an =.

This is mainly an issue when I'm formatting a tag that has an onclick event with Javascript, such as opening a window or calling a script (form action).

Regex:

#(\s+[^\s=]+)\s*=\s*([^\s=]+(?>\s+[^\s=]+)*(?!\s*=))#i

Code to check on:

 onClick=window.open('http%3A%2F%2Fwww.stackoverflow.com%2Ffakeindex.php%3Fsomevariable%3Dsomevalue','popup','scrollbars=yes,resizable=yes,width=716,height=540,left=0,top=0,ScreenX=0,ScreenY=0'); class=someclass

What it does:

The above will break on the letter just before the first =. In this case the URL is encoded, so the first = is the one in "scrollbars=yes", and the match breaks on the final "s" of "scrollbars".

I can encode the URL to get around the =, but the rest of the javascript is still a problem, since it contains variables and values which require the =. If the REGEX allowed = inside a value and only broke on spaces followed by a word followed by an =, like it should be doing, then the javascript would work in the custom tags, since the entire onClick string would be captured as the value.
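To make the failure concrete, here is a minimal sketch that runs the regex above against a shortened version of the sample. The shortened input and the variable names are illustrative assumptions, not part of the original code. The value class [^\s=]+ can never consume an =, so the match stops at the first = inside the value, and the (?!\s*=) lookahead then forces it to give back one more character.

```php
<?php
// Regex from the question, unchanged.
$re = '#(\s+[^\s=]+)\s*=\s*([^\s=]+(?>\s+[^\s=]+)*(?!\s*=))#i';

// Shortened, hypothetical version of the sample input.
$attrs = " onClick=window.open('u','popup','scrollbars=yes,width=716'); class=someclass";

preg_match($re, $attrs, $m);

// $m[1] is the attribute name (with its leading space),
// $m[2] is whatever the regex captured as the value.
echo trim($m[1]), ' => ', $m[2], PHP_EOL;
```

The captured value ends at "scrollbar" instead of running up to the space before class=.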

Disclaimer:

As others have already stated/emphasized, using regex with HTML is fraught with potential gotchas. Doing so with a mix of two intermingled markup languages, like you have here, is even more perilous. There are lots of ways for this solution (and any like it) to fail.

That said...

Answering this question requires an understanding of your preceding question (PHP PREG_REPLACE Returning wrong result depending on order checked). Note that I added an answer to that question as well with a solution consisting of minimal change to the original code. What follows is another answer with a somewhat improved solution. (Both of these answers fix both specific problems.)

Some random comments on your original code:

The expression: [^\s]+ can be shortened to: \S+
With the foreach statement, elements are processed in the order in which they were added to the array. (The order is important here, so it depends on the array being declared in the intended order. Since the array is declared all at once, this is probably not an issue.)
You are using ([^\[]+) to capture the attribute value. I think you meant to use ([^\]]+) (but even that is not the best expression).
Using ([^\[]+) (or ([^\]]+)) to capture the attribute value does not allow for square brackets to appear within the value.
The regexes are not written in free spacing mode and contain no comments.
Having unquoted attribute values with multiple words introduces quite a bit of potential ambiguity. What if you wanted to have a title attribute like this: title="CSS class is specified: class=myclass"? You should really be delimiting these attribute values.

A (somewhat) better solution:

Assumptions:

All Ltags will be well formed.
Ltags are never nested.
Ltag attributes are separated by a "SPACE+WORD+=" sequence.
Other [specialtags] may appear anywhere inside an Ltag except within the "SPACE+WORD+=" attribute separator sequences.
Ltag attribute values never contain a "SPACE+WORD+=" sequence. This includes multi-word titles and Javascript snippets inside an onClick.

I assume you know precisely what will be occurring within the Ltag attributes and that they will conform to the above requirements.

Here is a somewhat improved version of replaceLTags(), which uses a callback function to parse and wrap each attribute value with double quotes. The complex regexes are fully commented.

// Convert all Ltags to HTML links.
function replaceLTags($str){
    // Case 1: No URL specified in Ltag open tag: "[l]URL[/l]"
    $re1 = '%\[l\](.*?)\[/l\]%i';
    $str = preg_replace($re1, '<a href="$1">$1</a>', $str);

    // Case 2: URL specified in Ltag open tag: "[l=URL attr=val]linktext[/l]"
    $re2 = '%
        # Match special Ltag construct: [l=url att=value]linktext[/l]
        \[l=                 # Literal start-of-open-Ltag sequence.
        (\S+)                # $1: link URL.
        (                    # $2: Any/all optional attributes.
          [^[\]]*            # {normal*} = Zero or more non-[]
          (?:                # "Unroll-the-loop" (See: MRE3)
            \[[^[\]]*\]      # {special} = matching [square brackets]
            [^[\]]*          # More {normal*} = Zero or more non-[]
          )*                 # End {(special normal*)*} construct.
        )                    # End $2: Optional attributes.
        \]                   # Literal end-of-open-Ltag sequence.
        (.*?)                # $3: Ltag link text contents.
        \[/l\]               # Literal close-Ltag sequence.
        %six';
    return preg_replace_callback($re2, '_replaceLTags_cb', $str);
}
// Callback function wraps values in quotes and converts to HTML.
function _replaceLTags_cb($matches) {
    // Wrap each attribute value in double quotes.
    $matches[2] = preg_replace('/
        # Match one Ltag attribute name=value pair.
        (\s+\w+=)        # $1: Space, attrib name, equals sign.
        (                # $2: Attribute value.
          (?:            # One or more non-start-of-next-attrib
            (?!\s+\w+=)  # If this char is not start of next attrib,
            .            # then match next char of attribute value.
          )+             # Step through value one char at a time.
        )                # End $2: Attribute value.
        /sx', '$1"$2"', $matches[2]);
    // Put humpty back together again.
    return '<a href="'. $matches[1] .'"'.
        $matches[2] .'>'. $matches[3] .'</a>';
}

The main function regex, $re2, matches an Ltag element, but does not attempt to parse individual open tag attributes - it globs (and captures into group $2) all the attributes into one substring. This substring containing all the attributes is then parsed by the regex in the callback function, which uses the desired "SPACE+WORD+=" expression as a separator between name=value pairs.
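As a rough illustration of the second step in isolation, the sketch below applies the same per-attribute regex as the callback (written on one line) to a made-up attribute substring. The sample string and variable names are assumptions for demonstration only.

```php
<?php
// Same per-attribute regex as in the callback, without free-spacing mode.
$re = '/(\s+\w+=)((?:(?!\s+\w+=).)+)/s';

// Hypothetical attribute substring, as it would arrive in capture group 2
// of the main regex.
$attrs = " onClick=window.open('u','popup','scrollbars=yes'); title=My home page class=someclass";

// Wrap each value in double quotes. An "=" inside a value is left alone
// unless it is preceded by whitespace plus a word.
$quoted = preg_replace($re, '$1"$2"', $attrs);

echo $quoted, PHP_EOL;
```

The = in scrollbars=yes does not end the value because no whitespace precedes "scrollbars", while the spaces inside "My home page" do not end it because the following word is not followed by =.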

Note that this function can be passed a string containing multiple Ltags and all will be processed in one go. It will also correctly handle IPv6 literal URL addresses such as: http://[::1:2:3:4:5:6:7] (which contain square brackets).

If you insist on going down this road, I would recommend using a delimiter for the attribute values. I know you said that you can't use the double quote for some reason, but you could use a special character such as '\1' (ASCII 001), then replace that with double quotes in the callback function. This would dramatically cut down on the list of possible ways for this to fail.
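A minimal sketch of that idea, assuming the values arrive wrapped in chr(1) characters (the input string and variable names here are hypothetical):

```php
<?php
// Hypothetical input: attribute values delimited by chr(1) instead of quotes.
$d = "\x01";
$attrs = " title={$d}CSS class is specified: class=myclass{$d} class={$d}someclass{$d}";

// With explicit delimiters a value may safely contain "SPACE+WORD+=",
// because the value simply runs until the next chr(1).
$html = preg_replace('/(\s+\w+)=\x01([^\x01]*)\x01/', '$1="$2"', $attrs);

echo $html, PHP_EOL;
```

Here the title value contains " class=myclass", which would be ambiguous without a delimiter, yet it is kept as part of the title.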

If you can guarantee that the separator pattern (whitespace, then a word, then =) will never occur inside an attribute value, you could split the string on this regex:

\s+(?=\w+=)
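For illustration, here is a small sketch of that split on a made-up attribute string (the sample input is an assumption). Because the word and the = sit inside a lookahead, they are not consumed and stay at the start of the next piece.

```php
<?php
// Hypothetical attribute string (everything after the URL in the open tag).
$attrs = "onClick=window.open('u','popup','scrollbars=yes'); title=My home page class=someclass";

// Split only on whitespace that is followed by WORD= ; the lookahead keeps
// the attribute name in the piece that follows.
$pairs = preg_split('~\s+(?=\w+=)~', $attrs);

print_r($pairs);
```

This yields three pieces: the whole onClick pair, "title=My home page", and "class=someclass".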

That actually simplifies the problem quite a bit. The code below assumes the URL (which may contain custom [fill] tags) ends at the first whitespace (if present) or at the closing bracket of the [l] tag. Everything after the first whitespace is assumed to be a series of whitespace-separated name=value pairs, where the name always matches ^\w+$ and the value never contains a match for \s+\w+=. Values may also contain [fill] tags.

function replaceLTags($originalString)
{
  return preg_replace_callback(
    '#\[l=((?>[^\s\[\]]++|\[\w+\])+)(?:\s+((?>[^\[\]]++|\[\w+\])+))?\](.*?)\[/l\]#',
    'replaceWithinTags', $originalString);  // callback name must be a quoted string
}

function replaceWithinTags($groups)
{
  $result = "<a href=\"$groups[1]\"";
  $attrs = preg_split('~\s+(?=\w+=)~', $groups[2]);
  foreach ($attrs as $a)
  {
    $result .= preg_replace('#\s*(\w+)=(.*)#', ' $1="$2"', $a);
  }
  $result .= ">$groups[3]</a>";
  return $result;
}

I'm also assuming there are no double-quotes in the attribute values. If there are, the replacement will still work but the resulting HTML will be invalid. If you can't guarantee the absence of double-quotes, you may have to URL-encode them or something before doing these replacements.
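One possible way to handle that, sketched here with HTML-escaping via htmlspecialchars() rather than URL-encoding (a swapped-in technique, not what the code above does). The helper name quoteAttribute is hypothetical, and it assumes every piece it receives contains at least one =.

```php
<?php
// Hypothetical helper: turn one "name=value" piece into a quoted HTML attribute.
// Assumes $pair contains at least one "=".
function quoteAttribute(string $pair): string
{
    // Split at the first "=" only, so any "=" inside the value survives.
    [$name, $value] = explode('=', $pair, 2);
    // ENT_QUOTES escapes both double and single quotes in the value.
    return ' ' . $name . '="' . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '"';
}

echo quoteAttribute('title=say "hi"'), PHP_EOL;
```

The double quotes inside the value come out as &quot; entities, so the generated attribute stays well formed.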