PHP Regex for extracting subdomains of arbitrary domains

Developer https://www.devze.com 2023-01-27 05:27 Source: web
I want to extract the subdomain and domain part for domains with arbitrary top-level extensions.

Thus:

sub1.domain1.com --> Extract subdomain=sub1, domain=domain1.com

sub2.domain2.co.in --> Extract subdomain=sub2, domain=domain2.co.in

sub3.domain3.co.uk --> Extract subdomain=sub3, domain=domain3.co.uk

sub4.domain4.us --> Extract subdomain=sub4, domain=domain4.us

mydomain.com --> Extract subdomain="", domain=mydomain.com

mydomain.co.in --> Extract subdomain="", domain=mydomain.co.in

I am a bit confused about how to handle TLDs like co.in, co.uk, etc. I could do this the brute-force way by checking whether the last 5 characters contain a dot (.), but I am wondering whether there is a regex way to do this.


NOTE 1: As TToni pointed out, there can be ambiguities. However, I will put some constraints:

1) The "domain name" part (without the extension) will be at least 4 characters long.

2) The TLD extension part (.com, .co.in, .us, etc.) will either have a single dot or, if it has two dots, the penultimate part (the sub-TLD) will have at most 3 characters.

I have a feeling that these constraints will make the problem unambiguous and solvable using regex.

(Also, assume "www." has been stripped out already).


NOTE 2:

Example of above constraints

sub.dom.in --> domain="sub.dom.in"

sub.dom1.in --> domain="dom1.in", subdomain="sub"

This may sound buggy, but the reason is that I want this for my internal purposes: all my domain names have at least 4 characters, and all extensions either have a single dot or a penultimate part of at most 3 characters.


NOTE 3: I have a feeling I might make mistakes by using regex for this, so I am thinking of doing it the string-search way instead.

regards,

JP


Not sure you need regexes. Split the domain name on '.', then apply some heuristics to the result depending on the rightmost part, e.g. if the last part is "com", then the domain is the last two parts and the subdomain is the rest.
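The split-and-inspect idea can be sketched as follows. This is only an illustration: it uses the asker's own length constraints from NOTE 1 as the heuristic rather than a check on the last label, and the function name split_host_by_length is made up for this example. It assumes a bare host name with "www." already stripped.

```php
<?php
// Hypothetical helper: split a host name using the asker's length-based rules.
// Assumes the input is a bare host name (no scheme, no leading "www.").
function split_host_by_length(string $host): array
{
    $labels = explode('.', strtolower($host));
    $n = count($labels);

    // If there are at least three labels and the penultimate one has at most
    // 3 characters, treat the extension as two-part (co.in, co.uk), so the
    // domain spans the last three labels. Otherwise it spans the last two.
    $domainLabels = ($n >= 3 && strlen($labels[$n - 2]) <= 3) ? 3 : 2;
    $domainLabels = min($domainLabels, $n);

    return array(
        'subdomain' => implode('.', array_slice($labels, 0, $n - $domainLabels)),
        'domain'    => implode('.', array_slice($labels, $n - $domainLabels)),
    );
}
```

Under these rules "sub.dom.in" comes back whole as the domain, matching NOTE 2, because "dom" is short enough to be read as a sub-TLD.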

Or keep a list of "top-level" domains (in quotes because the meaning differs from the normal sense of top level), and iterate over the list, matching the right end of the domain name against each entry. On a match, remove the top-level part and return the rest as the subdomain. This could be put in a regex, but with a loss of clarity. The list would look something like

".edu", ".gov", ".mil", ".com", ".co.uk", ".gov.uk", ".nhs.uk", [...]

The regex would look something like

 \.(edu|gov|mil|com|co\.uk|gov\.uk|nhs\.uk|[...])$
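A minimal sketch of the suffix-list approach, assuming the short illustrative list above stands in for a real one. The helper name split_host_by_suffix and its return shape are invented for this example; it treats the last label before the matched suffix as the domain name and anything before that as the subdomain.

```php
<?php
// Illustrative suffix list only; a real list would be much longer.
$suffixes = array('edu', 'gov', 'mil', 'com', 'co.uk', 'gov.uk', 'nhs.uk');

// Hypothetical helper: returns array(subdomain, domain), or null when no
// suffix from the list matches the right end of the host name.
function split_host_by_suffix(string $host, array $suffixes): ?array
{
    // Escape each suffix so that its dots are matched literally.
    $quoted = array();
    foreach ($suffixes as $suffix) {
        $quoted[] = preg_quote($suffix, '/');
    }
    $pattern = '/\.(' . implode('|', $quoted) . ')$/';

    if (!preg_match($pattern, $host, $m, PREG_OFFSET_CAPTURE)) {
        return null;
    }

    // $m[0][0] is the matched suffix including its leading dot,
    // $m[0][1] is the offset at which it starts.
    $rest = substr($host, 0, $m[0][1]);
    $labels = explode('.', $rest);
    $name = array_pop($labels);

    return array(implode('.', $labels), $name . $m[0][0]);
}
```

A host whose ending is not in the list (for example one ending in ".us" with the list above) yields null, so the quality of the result depends entirely on how complete the list is.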


You can use this:

 (\b\w+\b(?:\.\b\w+\b)*?){0,1}?\.?(\b\w+\b(?:\.\b\w{1,3}\b)?\.\b\w{1,3}\b)

It isn't pretty, but the idea behind it is simple. It captures the subdomain in the first group and the domain in the second. It also splits things like "sub1.sub2.sub3.domain2.co.in" into "sub1.sub2.sub3" and "domain2.co.in".
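For reference, here is one way the pattern might be used from PHP. The ^ and $ anchors and the wrapper function split_host_by_regex are additions for this sketch, not part of the answer; the anchors force the match to cover the whole host name, since an unanchored preg_match can stop early on a short label (for "a.bc.domain.com" it would report "a.bc" as the domain).

```php
<?php
// The pattern from the answer, wrapped in ^...$ (the anchors are an
// assumption added here so the whole host name has to match).
$pattern = '/^(\b\w+\b(?:\.\b\w+\b)*?){0,1}?\.?(\b\w+\b(?:\.\b\w{1,3}\b)?\.\b\w{1,3}\b)$/';

// Hypothetical wrapper: returns array(subdomain, domain), or null when the
// pattern does not match.
function split_host_by_regex(string $host, string $pattern): ?array
{
    if (!preg_match($pattern, $host, $m)) {
        return null;
    }
    // Group 1 is reported as an empty string when there is no subdomain.
    return array($m[1], $m[2]);
}
```

Because the final label is limited to \w{1,3}, host names ending in a longer TLD such as ".info" do not match at all, which fits the asker's constraints but not domains in general.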


I collected the "top-level" domain names. It might be ugly, but it works.

$fix = array(
    'com', 'edu', 'gov', 'int', 'mil', 'net', 'org', 'biz', 'info', 'pro', 'name', 'museum', 'coop', 'aero',
    'xxx', 'idv', 'al', 'dz', 'af', 'ar', 'ae', 'aw', 'om', 'az', 'eg', 'et', 'ie', 'ee', 'ad', 'ao', 'ai', 'ag', 'at', 'au',
    'mo', 'bb', 'pg', 'bs', 'pk', 'py', 'ps', 'bh', 'pa', 'br', 'by', 'bm', 'bg', 'mp', 'bj', 'be', 'is', 'pr', 'ba', 'pl',
    'bo', 'bz', 'bw', 'bt', 'bf', 'bi', 'bv', 'kp', 'gq', 'dk', 'de', 'tl', 'tp', 'tg', 'dm', 'do', 'ru', 'ec', 'er', 'fr',
    'fo', 'pf', 'gf', 'tf', 'va', 'ph', 'fj', 'fi', 'cv', 'fk', 'gm', 'cg', 'cd', 'co', 'cr', 'gg', 'gd', 'gl', 'ge', 'cu',
    'gp', 'gu', 'gy', 'kz', 'ht', 'kr', 'nl', 'an', 'hm', 'hn', 'ki', 'dj', 'kg', 'gn', 'gw', 'ca', 'gh', 'ga', 'kh', 'cz',
    'zw', 'cm', 'qa', 'ky', 'km', 'ci', 'kw', 'cc', 'hr', 'ke', 'ck', 'lv', 'ls', 'la', 'lb', 'lt', 'lr', 'ly', 'li', 're',
    'lu', 'rw', 'ro', 'mg', 'im', 'mv', 'mt', 'mw', 'my', 'ml', 'mk', 'mh', 'mq', 'yt', 'mu', 'mr', 'us', 'um', 'as', 'vi',
    'mn', 'ms', 'bd', 'pe', 'fm', 'mm', 'md', 'ma', 'mc', 'mz', 'mx', 'nr', 'np', 'ni', 'ne', 'ng', 'nu', 'no', 'nf', 'na',
    'za', 'aq', 'gs', 'eu', 'pw', 'pn', 'pt', 'jp', 'se', 'ch', 'sv', 'ws', 'yu', 'sl', 'sn', 'cy', 'sc', 'sa', 'cx', 'st',
    'sh', 'kn', 'lc', 'sm', 'pm', 'vc', 'lk', 'sk', 'si', 'sj', 'sz', 'sd', 'sr', 'sb', 'so', 'tj', 'tw', 'th', 'tz', 'to',
    'tc', 'tt', 'tn', 'tv', 'tr', 'tm', 'tk', 'wf', 'vu', 'gt', 've', 'bn', 'ug', 'ua', 'uy', 'uz', 'es', 'eh', 'gr', 'hk',
    'sg', 'nc', 'nz', 'hu', 'sy', 'jm', 'am', 'ac', 'ye', 'iq', 'ir', 'il', 'it', 'in', 'id', 'uk', 'vg', 'io', 'jo', 'vn',
    'zm', 'je', 'td', 'gi', 'cl', 'cf', 'cn', 'ac', 'ad', 'ae', 'af', 'ag', 'ai', 'al', 'am', 'an', 'ao', 'aq', 'ar', 'as',
    'at', 'au', 'aw', 'az', 'ba', 'bb', 'bd', 'be', 'bf', 'bg', 'bh', 'bi', 'bj', 'bm', 'bn', 'bo', 'br', 'bs', 'bt', 'bv',
    'bw', 'by', 'bz', 'ca', 'cc', 'cd', 'cf', 'cg', 'ch', 'ci', 'ck', 'cl', 'cm', 'cn', 'co', 'cr', 'cu', 'cv', 'cx', 'cy',
    'cz', 'de', 'dj', 'dk', 'dm', 'do', 'dz', 'ec', 'ee', 'eg', 'eh', 'er', 'es', 'et', 'eu', 'fi', 'fj', 'fk', 'fm', 'fo',
    'fr', 'ga', 'gd', 'ge', 'gf', 'gg', 'gh', 'gi', 'gl', 'gm', 'gn', 'gp', 'gq', 'gr', 'gs', 'gt', 'gu', 'gw', 'gy', 'hk',
    'hm', 'hn', 'hr', 'ht', 'hu', 'id', 'ie', 'il', 'im', 'in', 'io', 'iq', 'ir', 'is', 'it', 'je', 'jm', 'jo', 'jp', 'ke',
    'kg', 'kh', 'ki', 'km', 'kn', 'kp', 'kr', 'kw', 'ky', 'kz', 'la', 'lb', 'lc', 'li', 'lk', 'lr', 'ls', 'lt', 'lu', 'lv',
    'ly', 'ma', 'mc', 'md', 'mg', 'mh', 'mk', 'ml', 'mm', 'mn', 'mo', 'mp', 'mq', 'mr', 'ms', 'mt', 'mu', 'mv', 'mw', 'mx',
    'my', 'mz', 'na', 'nc', 'ne', 'nf', 'ng', 'ni', 'nl', 'no', 'np', 'nr', 'nu', 'nz', 'om', 'pa', 'pe', 'pf', 'pg', 'ph',
    'pk', 'pl', 'pm', 'pn', 'pr', 'ps', 'pt', 'pw', 'py', 'qa', 're', 'ro', 'ru', 'rw', 'sa', 'sb', 'sc', 'sd', 'se', 'sg',
    'sh', 'si', 'sj', 'sk', 'sl', 'sm', 'sn', 'so', 'sr', 'st', 'sv', 'sy', 'sz', 'tc', 'td', 'tf', 'tg', 'th', 'tj', 'tk',
    'tl', 'tm', 'tn', 'to', 'tp', 'tr', 'tt', 'tv', 'tw', 'tz', 'ua', 'ug', 'uk', 'um', 'us', 'uy', 'uz', 'va', 'vc', 've',
    'vg', 'vi', 'vn', 'vu', 'wf', 'ws', 'ye', 'yt', 'yu', 'yr', 'za', 'zm', 'zw'
);

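function get_domain($url){
    global $fix;
    // parse_url() only reports a host when the input has a scheme;
    // fall back to the raw input for bare host names such as "sub1.domain1.com".
    $host = parse_url($url, PHP_URL_HOST);
    if (!$host) {
        $host = $url;
    }
    $list = explode('.', $host);
    $res = array();
    // Walk the labels from right to left, collecting known suffix labels,
    // and stop right after the first label that is not in the list.
    for ($i = count($list) - 1; $i >= 0; $i--) {
        $res[] = $list[$i];
        if (!in_array($list[$i], $fix)) {
            break;
        }
    }
    return implode('.', array_reverse($res));
}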
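get_domain() returns only the domain part, while the question also asks for the subdomain. A small sketch of how the remainder could be recovered once the domain is known; get_subdomain is a hypothetical helper written for this example, not part of the answer.

```php
<?php
// Hypothetical helper: given a host name and the domain part already
// extracted from it, return whatever precedes the domain ('' if nothing).
function get_subdomain(string $host, string $domain): string
{
    $suffix = '.' . $domain;
    $len = strlen($suffix);
    // Only strip when the host really ends in ".<domain>".
    if (strlen($host) > $len && substr($host, -$len) === $suffix) {
        return substr($host, 0, strlen($host) - $len);
    }
    return '';
}
```

For example, calling it with 'sub1.sub2.domain2.co.in' and 'domain2.co.in' gives 'sub1.sub2', and a host that equals its domain gives an empty string.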


You can use regex and any built-in function, but you'll never get correct results for complex domain zones (.co.uk, .a.bg, .fuso.aichi.jp, etc.).

You need to use a library that relies on the Public Suffix List for correct extraction. I recommend TLDExtract.

Here is some sample code:

$extract = new LayerShifter\TLDExtract\Extract();

$result = $extract->parse('mydomain.co.in');
$result->getSubdomain(); // will return null
$result->getHostname(); // will return 'mydomain'
$result->getSuffix(); // will return 'co.in'
$result->getFullHost(); // will return 'mydomain.co.in'
$result->getRegistrableDomain(); // will return 'mydomain.co.in'