How to retrieve the `scheme://domain` part of a URL without including subdomains?

I am using Ruby on Rails 3.0.10 and I would like to retrieve the scheme://domain part of a URL without including the subdomain part. That is, if I have the following URL

http://www.sub_domain.domain.com

I would like to retrieve

http://www.domain.com

How can I do that (should I use a regex?)?

UPDATE

@mu is too short rightly said in his/her comment (which made me think...):

You misunderstand. www.ac.uk is meaningless, the base domain for Oxford is ox.ac.uk; the ac.uk part means "academic UK" and is, semantically, one component. A few other countries have similar naming schemes.

So, the updated question is:

How can I iterate over a URL (for example http://www.maths.ox.ac.uk/), as in the following steps, so as to progressively delete subdomain parts until the last one?

http://www.maths.ox.ac.uk/ # Step 0 (start)
http://www.ox.ac.uk/       # Step 1
http://www.ac.uk/          # Step 2 (end)

This is a total hack, and I have no idea how it could be useful in the generic sense, but here you go.

ruby-1.8.7-p352 >   uri = URI.parse("http://www.foo.domain.com/")
 => #<URI::HTTP:0x105011840 URL:http://www.foo.domain.com/> 
ruby-1.8.7-p352 > uri.scheme + "://" + uri.host.split(/\./)[-2..-1].join(".")
 => "http://domain.com"

If you know that the URL ends in .com and follows the format you specified, you could try a regular expression like this:

\.[\w\-]+\.com

to match the domain together with the trailing .com. The match already begins with a dot, so prefix it with http://www and you should be all set.

There is no "general case" solution for this. Some URLs use a suffix with one dot (.com or .edu), while some use multiple dots (.co.jp, etc). You won't be able to solve this with something as simple as a regex.
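
To see why a fixed rule breaks down, here is a minimal sketch that keeps only the last two labels of the host. The helper name naive_base and the example.co.jp hostname are made up for illustration. It works for a .com name but returns the bare suffix for names under ac.uk or co.jp:

```ruby
require 'uri'

# Hypothetical helper: keep the scheme and the last two labels of the host.
def naive_base(url)
  uri = URI.parse(url)
  "#{uri.scheme}://#{uri.host.split('.').last(2).join('.')}"
end

puts naive_base('http://www.foo.domain.com/')  # http://domain.com
puts naive_base('http://www.maths.ox.ac.uk/')  # http://ac.uk  (not a site)
puts naive_base('http://www.example.co.jp/')   # http://co.jp  (not a site)
```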

What you may be able to do is to make a list of possible URL suffixes and construct a regex for each. If it matches your input string, use a variation of the above:

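# list_of_suffixes is assumed to hold entries such as '.com' or '.co.jp'
def base_url(url, list_of_suffixes)
  # Try longer suffixes first so '.ac.uk' wins over '.uk'
  list_of_suffixes.sort_by { |s| -s.length }.each do |s|
    # Escape the suffix so its dots match literally; stop at the end of the host
    match = url.match(/\.[\w\-]+#{Regexp.escape(s)}(?=[:\/]|\z)/)
    next if match.nil?
    return 'http://www' + match[0]   # match[0] already starts with a dot
  end
  nil   # no known suffix matched
end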

Note: code is off the top of my head and for illustration purposes only (it probably won't run exactly as-is, but you get the point)

The right way to deal with this is to use URI:

require 'uri'

# Parse and remove all the stuff you don't want.
u = URI.parse('http://www.sub-domain.domain.com/pancakes')
u.userinfo = nil
u.path     = ''
u.query    = nil
u.fragment = nil
# You might want to check u.scheme as well

host = u.host

And now you have to figure out what you want to do with host. You could start at the last component and work your way backwards until you get a domain name that resolves to something using Net::DNS:

require 'net/dns/resolver'
components = host.split('.')
# Try the last two labels first, then three, and so on up to the full host.
basename   = (2 .. components.length).
             map  { |i| components.last(i).join('.') }.
             find { |n| Resolver(n).answer.length > 0 }

# basename is now nil or something with a DNS A record
if basename.nil?
    # complain and bail out
end
u.host = basename
# Your trimmed URL is in u.to_s
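
The intermediate URLs from the updated question can also be listed without any DNS lookup. The following is only a sketch: the helper name strip_steps and the min_labels parameter are made up for illustration, and it assumes a leading www label should be kept while the labels after it are dropped one at a time:

```ruby
require 'uri'

# Hypothetical helper: list the URLs obtained by dropping one subdomain
# label at a time, keeping a leading "www" label if there is one.
def strip_steps(url, min_labels = 2)
  uri    = URI.parse(url)
  labels = uri.host.split('.')
  prefix = labels.first == 'www' ? [labels.shift] : []
  steps  = []
  while labels.length >= min_labels
    step = uri.dup
    step.host = (prefix + labels).join('.')
    steps << step.to_s
    labels.shift   # drop the leftmost remaining label
  end
  steps
end

puts strip_steps('http://www.maths.ox.ac.uk/')
# http://www.maths.ox.ac.uk/
# http://www.ox.ac.uk/
# http://www.ac.uk/
```

Each entry in the list is only a candidate; whether it names a real site still has to be checked.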

You have to check that the domain names resolve to something useful or you won't know if you have a valid one. You could try to track down all the various naming rules all over the world instead but there's no point.

This still won't guarantee that you have a useful URL; to be sure, you'd have to check whether the name you end up with responds to an HTTP request.
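
One possible way to make that check is with Net::HTTP from the standard library. This is a rough sketch written for a current Ruby, not something from the code above: the helper name responds_to_http?, the five-second timeout, and the choice to count only 2xx and 3xx responses as "useful" are all assumptions.

```ruby
require 'uri'
require 'net/http'

# Hypothetical helper: true if the URL answers a HEAD request with a
# 2xx or 3xx status, false for anything else (including any error).
def responds_to_http?(url, timeout = 5)
  uri = URI.parse(url)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
  return false unless uri.is_a?(URI::HTTP) && uri.host

  response = Net::HTTP.start(uri.host, uri.port,
                             use_ssl: uri.scheme == 'https',
                             open_timeout: timeout,
                             read_timeout: timeout) do |http|
    http.head(uri.request_uri)
  end
  response.is_a?(Net::HTTPSuccess) || response.is_a?(Net::HTTPRedirection)
rescue StandardError
  # Unparsable URL, DNS failure, refused connection, timeout, TLS error...
  false
end
```

Some servers reject HEAD requests, so a false result here does not prove that nothing is listening.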

To answer your original question:

should I use a regex?

Absolutely not. URLs are a lot more complicated than most people think, so you should use a real URL parser such as URI. Domain names are also more complicated than most people think, so you have to resort to DNS lookups to get anything sensible.