Regex for multiple email address replacements

Developer https://www.devze.com 2023-04-08 17:04 Source: Internet
OK so here is my situation. I have a site that is run by WordPress. I need to ensure email obfuscation and as such have installed a plugin called 'Graceful Email Obfuscation'. This works great already. The catch is that I want a catchall in case someone does not follow the rules it specifies for entering email addresses (i.e. [email] test@example.com [/email]).

The following regex works great at grabbing all the emails BUT I don't want it to touch the ones that are correctly written as [email]test@example.com[/email]. What do I need to add?

// Match any a href="mailto: AND make it optional
$monster_regex = '`(\<a([^>]+)href\=\"mailto\:)?';  

// Match any email address
$monster_regex .= '([^0-9:\\r\\n][A-Z0-9_]+([.][A-Z0-9_]+)*[@][A-Z0-9_]+([.][A-Z0-9_]+)*[.][A-Z]{2,4})'; 

// Now include all its attributes AND make it optional
$monster_regex .= '(\"*\>)?';

// Match any information enclosed in the <a> tag AND make it optional
$monster_regex .= '(.*)?'; 

// Match the closing </a> tag AND make it optional
$monster_regex .= '(\<\/a\>)?`'; 

$monster_regex .= 'im'; // Set the modifiers

preg_match_all($monster_regex, $content, $matches, PREG_SET_ORDER);

My inputs for testing are this:

<a href="mailto:test@example.com">Tester</a>
test@example.com
<a href="mailto:test@hotmail.com">Hotmail Test</a>
[email]test@example.com[/email]

The output I am getting is this:

Array
(
    [0] => Array
        (
            [0] => <a href="mailto:test@example.com">Tester</a>

            [1] => <a href="mailto:
            [2] =>  
            [3] => test@example.com
            [4] => 
            [5] => 
            [6] => ">
            [7] => Tester</a>

        )

    [1] => Array
        (
            [0] => test@example.com

            [1] => 
            [2] => 
            [3] => test@example.com
            [4] => 
            [5] => 
            [6] => 
            [7] => 

        )

    [2] => Array
        (
            [0] => <a href="mailto:test@hotmail.com">Hotmail Test</a>

            [1] => <a href="mailto:
            [2] =>  
            [3] => test@hotmail.com
            [4] => 
            [5] => 
            [6] => ">
            [7] => Hotmail Test</a>

        )

    [3] => Array
        (
            [0] => [email]test@example.com[/email]

            [1] => 
            [2] => 
            [3] => [email]test@example.com
            [4] => 
            [5] => 
            [6] => 
            [7] => [/email]

        )
)

Thanks in advance.


So you want to match anything that looks like an email address unless it's already enclosed in [email]...[/email] tags? Try this:

'%(?>\b[A-Z0-9_]+(?:\.[A-Z0-9_]+)*@[A-Z0-9_]+(?:\.[A-Z0-9_]+)*\.[A-Z]{2,4}\b)(?!\s*\[/email\])%i'

NB: This answer only addresses the problem of how to match something that's not contained in some larger structure. I don't intend to get into a debate over how (or whether) to match email addresses with regexes. I simply extracted the core regex from the question, bracketed it with word boundaries (\b) and wrapped it in an atomic group ((?>...)).

Once a potential match is found, the negative lookahead asserts that the address isn't followed by a closing [/email] tag. Assuming the tags are correctly paired, that means the address is already properly tagged. And if they aren't correctly paired, it's the plugin's job to catch it.
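
To see the lookahead at work, here is a minimal sketch that runs the regex above over some sample text. The sample addresses and the variable names ($content, $untagged) are made up for illustration; they are not from the plugin or the question.

```php
<?php
// The regex from this answer: an email-like token that is NOT followed
// by optional whitespace and a closing [/email] tag.
$pattern = '%(?>\b[A-Z0-9_]+(?:\.[A-Z0-9_]+)*@[A-Z0-9_]+(?:\.[A-Z0-9_]+)*\.[A-Z]{2,4}\b)(?!\s*\[/email\])%i';

// Hypothetical test content: two untagged addresses, two tagged ones.
$content = <<<'TXT'
<a href="mailto:alice@example.com">Alice</a>
bob@example.com
[email]carol@example.com[/email]
[email] dave@example.com [/email]
TXT;

preg_match_all($pattern, $content, $matches);

// $matches[0] holds the full matches; the tagged addresses are skipped.
$untagged = $matches[0];
print_r($untagged); // alice@example.com and bob@example.com only
```

Note that the address inside the mailto: link is still matched, so whatever does the replacing would have to deal with the surrounding <a> tag separately; this sketch only covers the matching part.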


While I'm here, I'd like to offer some comments on your regex:

  • The range expression A-z appeared in some of your character classes. Probably just a typo, but some people use that as an idiom for matching uppercase or lowercase letters. That's an error because it also matches several punctuation characters whose code points happen to lie between the two letter ranges. (I fixed that when I edited the question.)

  • The characters <, >, :, ", @, = and / have no special meaning in regexes and don't need to be escaped. It doesn't hurt anything, but regexes are hard enough to read already; why throw in a bunch of backslashes and square brackets you don't need?

  • The question mark in (.*)? belongs inside the parens: (.*?). That way it will reluctantly match everything before the next </a>. If there's nothing to match, it will match nothing. Making it optional is not only redundant, it could lead to serious performance penalties.
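
The first and last points are easy to check for yourself. The snippet below is only a rough sketch; the sample HTML and the variable names are invented for the demonstration.

```php
<?php
// 1) [A-z] also covers the six ASCII characters that sit between 'Z' and 'a'.
$punctuation = ['[', '\\', ']', '^', '_', '`'];
$matchedByWideRange = [];
$matchedByTwoRanges = [];
foreach ($punctuation as $ch) {
    if (preg_match('/^[A-z]$/', $ch) === 1) {
        $matchedByWideRange[] = $ch;   // all six end up here
    }
    if (preg_match('/^[A-Za-z]$/', $ch) === 1) {
        $matchedByTwoRanges[] = $ch;   // none end up here
    }
}

// 2) (.*)? is still greedy; (.*?) stops at the first closing tag.
$html = '<a href="#">one</a> and <a href="#">two</a>';
preg_match('`<a[^>]*>(.*)?</a>`', $html, $greedy);
preg_match('`<a[^>]*>(.*?)</a>`', $html, $lazy);

echo $greedy[1], "\n"; // one</a> and <a href="#">two
echo $lazy[1], "\n";   // one
```

With two links on the same line, the greedy version swallows everything up to the last </a>, while the reluctant version captures only the text of the first link.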
