PHP PREG_REPLACE Returning wrong result depending on order checked_问答_开发者

I have stumbled upon a really odd bug with PHP's preg_replace function and some regex patterns. What I'm trying to开发者_开发技巧 do is replace custom tags delimited by brackets and convert them to HTML. The regex has to account for custom "fill" tags that will stay with the outputted HTML so that it can be replaced on-the-fly when the page loads (replacing with a site-name for instance).

Each regex pattern will work by itself, but for some reason, some of them will exit the function early if preceded by one of the other patterns is checked first. When I stumbled upon this, I used preg_match and a foreach loop to check the patterns before moving on and would return the result if found - so hypothetically it would seem fresh to each pattern.

This didn't work either.

Check Code:

function replaceLTags($originalString){
    $patterns = array(
                '#^\[l\]([^\s]+)\[/l\]$#i' => '<a href="$1">$1</a>',
                '#^\[l=([^\s]+)]([^\[]+)\[/l\]$#i'=> '<a href="$1">$2</a>',
                '#^\[l=([^\s]+) title=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" title="$2">$3</a>',
                '#^\[l=([^\s]+) rel=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" rel="$2">$3</a>',
                '#^\[l=([^\s]+) onClick=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" onClick="$2">$3</a>',
                '#^\[l=([^\s]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" style="$2">$3</a>',
                '#^\[l=([^\s]+) onClick=([^\[]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" onClick="$2" style="$3">$4</a>',
                '#^\[l=([^\s]+) class=([^\[]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" class="$2" style="$3">$4</a>',
                '#^\[l=([^\s]+) class=([^\[]+) rel=([^\[]+)] target=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" class="$2" rel="$3" target="$4">$5</a>'
            );

    foreach ($patterns as $pattern => $replace){
        if (preg_match($pattern, $originalString)){
            return preg_replace($pattern, $replace, $originalString);
        }
    }
}

$string = '[l=[site_url]/site-category/ class=hello rel=nofollow target=_blank]Hello there[/l]';

echo $alteredString = $format->replaceLTags($string);

The above "String" would come out as:

<a href="[site_url">/site-category/ class=hello rel=nofollow target=_blank]Hello there</a>

When it should come out as:

<a href="[site_url]/site-category/" class="hello" rel="nofollow" target="_blank">Hello there</a>

But if moved that pattern further up in the list to be checked sooner, it'd format correctly.

I'm stumped, because it seems like the string is being overwritten somehow every time it's checked even though that makes no sense.

Seems to me you're doing a lot more work than you need to. Instead of using a separate regex/replacement for each possible list of attributes, why not use preg_replace_callback to process the attributes in a separate step? For example:

function replaceLTags($originalString){
  return preg_replace_callback('#\[l=((?>[^\s\[\]]+|\[site_url\])+)([^\]]*)\](.*?)\[/l\]#',
                               replaceWithinTags, $originalString);
}

function replaceWithinTags($groups){
  return '<a href="' . $groups[1] . '"' . 
         preg_replace('#(\s+\w+)=(\S+)#', '$1="$2"', $groups[2]) .
         '>' . $groups[3] . '</a>';
}

See a complete demo here (updated; see comments).

Here's an updated version of the code based on new information that was provided in the comments:

function replaceLTags($originalString){
  return preg_replace_callback('#\[l=((?>[^\s\[\]]+|\[\w+\])+)([^\]]*)\](.*?)\[/l\]#',
                               replaceWithinTags, $originalString);
}

function replaceWithinTags($groups){
  return '<a href="' . $groups[1] . '"' . 
         preg_replace(
             '#(\s+[^\s=]+)\s*=\s*([^\s=]+(?>\s+[^\s=]+)*(?!\s*=))#',
             '$1="$2"', $groups[2]) .
         '>' . $groups[3] . '</a>';
}

demo

In the first regex I changed [site_url] to \[\w+\] so it can match any custom fill tag.

Here's a breakdown of the second regex:

(\s+[^\s=]+)   # the attribute name and its leading whitespace
\s*=\s*
(
  [^\s=]+   # the first word of the attribute value
  (?>\s+[^\s=]+)*  # the second and subsequent words, if any
  (?!\s*=)  # prevents the group above from consuming tag names
)

The trickiest part is matching multi-word attribute values. (?>\s+[^\s=]+)* will always consume the next tag name if there is one, but the lookahead forces it to backtrack. Normally it would only back off one character at a time, but the atomic group effectively forces it to backtrack by whole words or not at all.

You messed up the regular expressions. If you print the string on each iteration as:

foreach ($patterns as $pattern => $replace){
    echo "String: $originalString\n";
    if (preg_match($pattern, $originalString)){
        return preg_replace($pattern, $replace, $originalString);
    }
}

you will see that the string is not modified. From my run, I noticed that the second regular expression matches. I placed a third param to the preg_match call and printed the matches. Here is what I got:

Array (
    [0] => [l=[site_url]/site-category/ class=hello rel=nofollow target=_blank]Hello there[/l]
    [1] => [site_url
    [2] => /site-category/ class=hello rel=nofollow target=_blank]Hello there )

The cause of your immediate problem at hand is twofold:

First, there is a typo in the applicable regex (the last one in the array). It has an extraneous literal right square bracket before the: " target=". In other words, this:

'#^\[l=([^\s]+) class=([^\[]+) rel=([^\[]+)] target=([^\[]+)]([^\[]+)\[/l\]$#i'

Should read:

'#^\[l=([^\s]+) class=([^\[]+) rel=([^\[]+) target=([^\[]+)]([^\[]+)\[/l\]$#i'

Second, there are two regexes in the array which both match the same string, and unfortunately the more specific of the two (the regex above which is the one we want), comes second. The other more general regex that matches is the second one in the array:

'#^\[l=([^\s]+)]([^\[]+)\[/l\]$#i'

Placing the more general regex last and removing the extraneous square bracket solves the problem. Here is your original code fixed with the above two changes applied:

function replaceLTags($originalString){
    $patterns = array(
                '#^\[l\]([^\s]+)\[/l\]$#i' => '<a href="$1">$1</a>',
                '#^\[l=([^\s]+) title=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" title="$2">$3</a>',
                '#^\[l=([^\s]+) rel=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" rel="$2">$3</a>',
                '#^\[l=([^\s]+) onClick=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" onClick="$2">$3</a>',
                '#^\[l=([^\s]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" style="$2">$3</a>',
                '#^\[l=([^\s]+) onClick=([^\[]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" onClick="$2" style="$3">$4</a>',
                '#^\[l=([^\s]+) class=([^\[]+) style=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" class="$2" style="$3">$4</a>',
                '#^\[l=([^\s]+) class=([^\[]+) rel=([^\[]+) target=([^\[]+)]([^\[]+)\[/l\]$#i' => '<a href="$1" class="$2" rel="$3" target="$4">$5</a>',
                '#^\[l=([^\s]+)]([^\[]+)\[/l\]$#i'=> '<a href="$1">$2</a>'
            );

    foreach ($patterns as $pattern => $replace){
        if (preg_match($pattern, $originalString)){
            return preg_replace($pattern, $replace, $originalString);
        }
    }
}

$string = '[l=[site_url]/site-category/ class=hello rel=nofollow target=_blank]Hello there[/l]';

echo $alteredString = $format->replaceLTags($string);

Note that this only fixes the immediate specific error described in your question and does not address some more fundamental problems with what you are attempting to accomplish. I have presented a somewhat better solution as an answer to your follow-up question: How do I make this REGEX ignore = in a tag's attribute?.

But as others have mentioned, mixing two different markup languages together and processing with regex is asking for trouble.

Here is some general purpose code you can use to have less expressions, you could always remove any tags that are not allowed from the final string.

<?php

function replaceLTags($originalString) {
    if (preg_match('#^\[l\]([^\s]+)\[/l\]$#i', $originalString)) {
        // match a link with no description or tags
        return preg_replace('#^\[l\]([^\s]+)\[/l\]$#i', '<a href="$1">$1</a>', $originalString);
    } else if (preg_match('#^\[l=([^\s]+)\s*([^\]]*)\](.*?)\[/l\]#i', $originalString, $matches)) {
        // match a link with title and/or tags
        $attribs = $matches[2];
        $attrStr = '';
        if (preg_match_all('#([^=]+)=([^\s\]]+)#i', $attribs, $attribMatches) > 0) {
            $attrStr = ' ';
            for ($i = 0; $i < sizeof($attribMatches[0]); ++$i) {
                $attrStr .= $attribMatches[1][$i] . '="' . $attribMatches[2][$i] . '" ';
            }
            $attrStr = rtrim($attrStr);
        }

        return '<a href="' . $matches[1] . '"' . $attrStr . '>' . $matches[3] . '</a>';
    } else {
        return $originalString;
    }
}

$strings = array(
    '[l]http://www.stackoverflow.com[/l]',
    '[l=[site_url]/site-category/ class=hello rel=nofollow target=_blank]Hello there[/l]',
    '[l=[site_url]/page.php?q=123]Link[/l]',
    '[l=http://www.stackoverflow.com/careers/ target=_blank class=default]Stack overflow[/l]'
);

foreach($strings as $string) {
    $altered = replaceLTags($string);
    echo "{$altered}<br />\n";
}