
Regex to find URLs with a subdomain in complex URLs

Developer (https://www.devze.com), 2023-03-31 11:23. Source: the web

Apologies if this has been answered somewhere before, but as with everything, Google gives a billion results, all leading to the wrong answer.

I have a URL/email parser that links URLs and email addresses on my website (PHP). Everything was fine until I gained some international customers with complex domain names (.com.au etc.).

This is the function I currently have...

    function linkScan($string1) {
        // URLs: optional http://, one label, then at least two more
        // dot-separated labels, then an optional path/query.
        $pattern1 = "/(?<![\/\d\w])(http:\/\/)?([\w\d\-]+)((\.([\w\d\-])+){2,})([\/\?\w\d\.\-_&=+%]*)?/i";
        // Email addresses: local part, @, domain.
        $pattern2 = "/([\w\d\.\-\_]+)@([\w\d\.\_\-]+)/mi";

        $replace1 = "<a href=\"http://$2$3$6\" target=\"_blank\">$0</a>";
        $replace2 = "<a href=\"mailto:$0\">$0</a>";

        // URLs are linked first, then email addresses.
        $string2 = preg_replace($pattern1, $replace1, $string1);
        $string3 = preg_replace($pattern2, $replace2, $string2);

        $string3 = convertSmartQuotes($string3);

        return $string3;
    }

It works fine until it finds an email address such as someone@somewhere.com.au.

Because it looks for the URLs first, it finds the somewhere.com.au portion and makes it a link; then, when the email scan happens, the address is ignored because of the HTML tags now embedded in it.

What I want to do is force the use of a subdomain in the URLs (whether that be www or otherwise), and not care whether there is an http:// in front of it. But because the regex only checks that there are three portions (subdomain, domain, .com), it mistakenly treats the .com in .com.au as the domain portion.

It should find...

subdomain.domain.com

subdomain.domain.com.au

It should not find...

domain.com

domain.com.au (which it is currently finding)

If anyone can help me with the regular expression, that would be fantastic. Thanks.


You need a list of all top-level domains and their structure. The Mozilla project has such a list; it is several hundred lines long, so incorporating it into a regex may be cumbersome, although certainly not impossible: https://wiki.mozilla.org/TLD_List (update: superseded by http://publicsuffix.org/).
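
As a rough sketch of how such a list could be used outside the regex itself: split the host into labels, find the longest known suffix, and require at least two labels in front of it. The `$publicSuffixes` array below is a tiny hand-picked subset for illustration, and `hasSubdomain` is a hypothetical helper name, not part of any library; the real list has far more entries, plus wildcard and exception rules that this ignores.

```php
<?php
// Hypothetical subset of public suffixes, for illustration only.
$publicSuffixes = ['com', 'org', 'net', 'au', 'com.au', 'net.au', 'uk', 'co.uk'];

/**
 * Returns true when $host has at least one label in front of its
 * registrable domain (longest matching suffix plus one label).
 */
function hasSubdomain(string $host, array $suffixes): bool
{
    $labels = explode('.', strtolower($host));
    $count  = count($labels);

    // Walk from the longest candidate suffix to the shortest,
    // so "com.au" is found before "au".
    for ($i = 0; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $suffixes, true)) {
            // $i labels precede the suffix: one is the domain itself,
            // any others are subdomains.
            return $i >= 2;
        }
    }

    return false; // unknown suffix: do not treat the host as linkable
}
```

With this subset, `hasSubdomain('subdomain.domain.com.au', $publicSuffixes)` is true while `hasSubdomain('domain.com.au', $publicSuffixes)` is false, so the URL regex could stay loose and each candidate host be filtered afterwards.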

Anyway, quite likely you are Doing It Wrong. What are you trying to accomplish?
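
If the underlying goal is just to stop the URL pass from eating the domain of an email address, one possible alternative is to match both in a single pass, with the email branch tried first. The sketch below is only that: `linkScanSinglePass` is a hypothetical name, the patterns are simplified versions of the ones in the question, and it assumes the input is plain text that needs no HTML escaping. It does not solve the .com.au question by itself, since a bare domain.com.au still has three labels.

```php
<?php
// Sketch: one alternation, email branch first, so an address is consumed
// whole before the URL branch can claim its domain.
function linkScanSinglePass(string $text): string
{
    // Simplified email pattern: local part, @, at least two domain labels.
    $email = '[\w.\-]+@[\w\-]+(?:\.[\w\-]+)+';
    // URL branch: optional http://, then at least three dot-separated
    // labels, then an optional path starting with a slash.
    $url = '(?<![\/\w@.\-])(?:http:\/\/)?[\w\-]+(?:\.[\w\-]+){2,}(?:\/[\w\/?.\-&=+%]*)?';

    return preg_replace_callback(
        "/(?<email>$email)|(?<url>$url)/i",
        function (array $m): string {
            if (!empty($m['email'])) {
                return '<a href="mailto:' . $m['email'] . '">' . $m['email'] . '</a>';
            }
            // Strip a leading http:// so it is not doubled in the href.
            $href = preg_replace('/^http:\/\//i', '', $m['url']);
            return '<a href="http://' . $href . '" target="_blank">' . $m['url'] . '</a>';
        },
        $text
    );
}
```

Because both branches are tried at the same starting position and the email branch comes first, someone@somewhere.com.au becomes a single mailto link instead of a half-linked address.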


Regex has a nice list of expressions and also includes a nice tester to make sure your expression works.
