
Can I use a regular expression to extract the domain from a URL?

Developer https://www.devze.com 2023-01-07 23:23 Source: web

Suppose I want to turn this:

http://en.wikipedia.org/wiki/Anarchy

into this:

en.wikipedia.org

or even better, this:

wikipedia.org

Is this even possible in regex?


Why use a regex when Ruby has a library for it? The URI library:

ruby-1.9.1-p378 > require 'uri'
 => true 
ruby-1.9.1-p378 > uri = URI.parse("http://en.wikipedia.org/wiki/Anarchy")
 => #<URI::HTTP:0x000001010a2270 URL:http://en.wikipedia.org/wiki/Anarchy> 
ruby-1.9.1-p378 > uri.host
 => "en.wikipedia.org" 
ruby-1.9.1-p378 > uri.host.split('.')
 => ["en", "wikipedia", "org"] 

Splitting the host is one way to separate the domains, but I'm not aware of a reliable way to get the base domain -- you can't just count, in the event of a URL like "http://somedomain.otherdomain.school.ac.uk" vs "www.google.com".
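One way around the counting problem is to check the end of the host against a list of known multi-label suffixes before deciding how many labels to keep. The following is only a minimal sketch: `MULTI_LABEL_SUFFIXES` and `base_domain` are hypothetical names made up for illustration, and the three-entry list is an assumption standing in for something complete such as the Public Suffix List.

```ruby
require 'uri'

# Hypothetical, deliberately incomplete list of suffixes that span two labels.
# Real code should load a complete list (e.g. the Public Suffix List) instead.
MULTI_LABEL_SUFFIXES = %w[co.uk ac.uk com.au].freeze

# Returns the base domain for a host name, or the host unchanged when it
# has too few labels to trim.
def base_domain(host)
  labels = host.downcase.split('.')
  # Keep three labels when the last two form a known suffix, otherwise two.
  keep = MULTI_LABEL_SUFFIXES.include?(labels.last(2).join('.')) ? 3 : 2
  labels.last(keep).join('.')
end

host = URI.parse('http://en.wikipedia.org/wiki/Anarchy').host
puts base_domain(host)                                   # wikipedia.org
puts base_domain('somedomain.otherdomain.school.ac.uk')  # school.ac.uk
puts base_domain('www.google.com')                       # google.com
```

Any suffix missing from the list silently falls back to the two-label rule, so the result is only as good as the list behind it.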


/http:\/\/([^\/]*).*/ will produce en.wikipedia.org from the string you provided.

/http:\/\/.{0,3}\.([^\/]*).*/ will produce wikipedia.org.
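For reference, here is roughly how those two patterns behave when run from Ruby. This is a sketch rather than a recommendation; the constant names `HOST_RE` and `TAIL_RE` are made up for the example, and the extra URLs are there to show where the second pattern stops working.

```ruby
HOST_RE = /http:\/\/([^\/]*).*/
TAIL_RE = /http:\/\/.{0,3}\.([^\/]*).*/

url = 'http://en.wikipedia.org/wiki/Anarchy'

# String#[] with a regex and a group index returns that capture group,
# or nil when the pattern does not match.
puts url[HOST_RE, 1]   # en.wikipedia.org
puts url[TAIL_RE, 1]   # wikipedia.org

# The second pattern assumes the first label is at most three characters
# long, so these two do not match at all.
p 'http://google.com/'[TAIL_RE, 1]            # nil
p 'http://simple.wikipedia.org/'[TAIL_RE, 1]  # nil
```

Both patterns also hard-code the http:// scheme, so an https:// URL will not match either of them.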


Yes.

Now I know you haven't asked how, and you haven't specified a language, but I'll answer anyway... (note: this works for any two-letter language subsite, not just en.wikipedia...)

perl:

$url =~ s,http://[a-z]{2}\.(wikipedia\.org)/.*,$1,;

ruby:

url = url.sub(/http:\/\/[a-z]{2}\.(wikipedia\.org)\/.*/, '\1')

php:

$url = preg_replace('|http://[a-z]{2}\.(wikipedia\.org)/.*|', '$1', $url);

Of course, for this particular example, you don't even need a regex, just this will do:

url = 'wikipedia.org'

but I jest...

You probably want to handle any URL and pull out the domain part, and it should also work for domains in different countries, e.g. foo.co.uk.

In which case, I'd use Mark Rushakoff's solution to get the hostname and then a regex to pull out the domain:

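domain = host.sub(/^.*?\.([^.]+\.[^.]+(\.[a-z]{2})?)$/, '\1')

Note the lazy .*? at the start: with a greedy .* the optional country-code group never gets a chance to match, and www.foo.co.uk comes out as co.uk. This is still a heuristic, since it treats any two-letter final label that follows two other labels as a country code.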
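Putting the two steps together in Ruby might look like the sketch below. `registered_domain` and `DOMAIN_RE` are hypothetical names chosen for the example, and the pattern is the lazy-prefix form of the regex above, so it remains a guess about the domain rather than a lookup against real registry data.

```ruby
require 'uri'

# Heuristic: keep the last two labels, plus a trailing two-letter country
# code when one follows them. The lazy `.*?` lets that optional group
# take part in the match.
DOMAIN_RE = /^.*?\.([^.]+\.[^.]+(\.[a-z]{2})?)$/

# Hypothetical helper: get the host name via URI, then apply the regex.
def registered_domain(url)
  host = URI.parse(url).host
  return nil if host.nil?   # e.g. a string without a scheme
  host.sub(DOMAIN_RE, '\1')
end

puts registered_domain('http://en.wikipedia.org/wiki/Anarchy')  # wikipedia.org
puts registered_domain('http://www.foo.co.uk/index.html')       # foo.co.uk
puts registered_domain('http://wikipedia.org/')                 # wikipedia.org
```

The heuristic has blind spots: a bare http://foo.co.uk/ comes back as co.uk, because the pattern needs at least one label in front of the part it keeps.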

Hope this helps

Also, if you want to learn more, I have a regex tute online: http://tech.bluesmoon.info/2006/04/beginning-regular-expressions.html


Sure, all you would have to do is search on http://(.*)/wiki/Anarchy

In Perl (sorry, I don't know Ruby, but I expect it's similar):

$string_to_search =~ m{http://([^/]*)/};

should give you en.wikipedia.org in $1. To get rid of the en, you can simply search on m{http://en\.([^/]*)/} instead, which leaves wikipedia.org in $1. The \. after en matters: without it you end up with .wikipedia.org.

That should do it.

Update: In case you're not familiar with regex, I would recommend picking up a regex book. Mastering Regular Expressions really rocks and I like it; I saw it on half.com the other day for 14.99 used. To clarify what I suggested above: look for the string http://en., then for anything until you find a /. This is all captured in $1 (in Perl, not sure if it's the same in Ruby), so a simple print $1 will print the string.
