Regex: Getting content from URL

开发者 https://www.devze.com 2022-12-27 16:30 Source: web
I want to get "the-game" using regex from URLs like

  • http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/another-one/
  • http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/
  • http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/


What parts of the URL could vary and what parts are constant? The following regex matches whatever sits between "/en/" and the next slash, which is the-game in your example.

(?<=/en/).*?(?=/)
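As a quick check, the pattern above can be run against the three URLs from the question. This is only a minimal sketch using Python's re module; the names URLS, AFTER_EN and first_after_en are made up for illustration and are not part of the original answer.

```python
import re

# The three example URLs from the question.
URLS = [
    "http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/another-one/",
    "http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/",
    "http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/",
]

# Fixed-width lookbehind for "/en/", a lazy match, then a lookahead for the next "/".
AFTER_EN = re.compile(r"(?<=/en/).*?(?=/)")


def first_after_en(url):
    """Return the path segment right after "/en/", or "" if nothing matches."""
    match = AFTER_EN.search(url)
    return match.group(0) if match else ""


for url in URLS:
    print(first_after_en(url))
```

Each of the three URLs should print the-game. A URL with a different language code, such as /fr/, does not match at all, which is the main limitation of hard-coding "/en/".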

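The next one matches the contents of the second set of slashes in any URL containing "webdev", assuming the first set of slashes contains a two- or three-character language code. It relies on a variable-length lookbehind, which only some regex engines (.NET, for example) support.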

(?<=.*?webdev.*?/.{2,3}/).*?(?=/)
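Python's re module, for one, rejects a variable-width lookbehind like the one above, so a port needs a different shape. The following is a minimal sketch that assumes a capturing group is an acceptable replacement for the lookarounds. The names AFTER_LANGUAGE and second_segment are hypothetical, and [^/] is swapped in for . so that no part of the match can run across a slash.

```python
import re

# "webdev" somewhere in the host, then "/", a 2-3 character language code,
# another "/", and finally the wanted segment, captured as group 1.
# Assumption: [^/] replaces "." from the original pattern to stop at slashes.
AFTER_LANGUAGE = re.compile(r"webdev[^/]*/[^/]{2,3}/([^/]*)/")


def second_segment(url):
    """Return the segment after the language code, or "" if nothing matches."""
    match = AFTER_LANGUAGE.search(url)
    return match.group(1) if match else ""


print(second_segment(
    "http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/"
))
```

Unlike the "/en/" version, this accepts any two- or three-character code in the first path segment, but it only applies to URLs that contain "webdev".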

Hopefully you can tweak these examples to accomplish what you're looking for.


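// Example input; replace with the URL you want to inspect.
var subject = "http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/";
// Skip four "segment + slash" groups, then capture the next segment.
var myregexp = /^(?:[^\/]*\/){4}([^\/]+)/;
var match = myregexp.exec(subject);
var result;
if (match != null) {
    result = match[1];
} else {
    result = "";
}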

This matches whatever lies between the fourth and fifth slash and stores it in the variable result.
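For comparison, the same slash-counting idea might look like this in Python. It is a sketch only: FIFTH_SEGMENT and between_fourth_and_fifth_slash are made-up names, and the approach assumes the URL always has the scheme's "//" followed by exactly one path segment before the wanted one.

```python
import re

# Skip four "anything but a slash, then a slash" groups from the start of the
# string, then capture the next run of non-slash characters.
FIFTH_SEGMENT = re.compile(r"^(?:[^/]*/){4}([^/]+)")


def between_fourth_and_fifth_slash(subject):
    """Return the text after the fourth slash, or "" if there is no match."""
    match = FIFTH_SEGMENT.match(subject)
    return match.group(1) if match else ""


url = "http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/"
print(between_fourth_and_fifth_slash(url))  # the-game
```

Because the pattern is purely positional, url.split("/")[4] returns the same value for this URL. Either form breaks as soon as the path gains or loses a leading segment.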


You should probably use a URL-parsing library rather than resorting to regex.

In Python:

# Python 3; in Python 2 the import was "from urlparse import urlparse".
from urllib.parse import urlparse
url = urlparse('http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/another-one/')
print(url.path)

Which would yield:

/en/the-game/another-one/another-one/another-one/

From there, you can do simple things like stripping /en/ from the beginning of the path. Otherwise, you're bound to do something wrong with a regular expression. Don't reinvent the wheel!
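The "stripping /en/" step could be sketched as follows. This is written for Python 3, where urlparse lives in urllib.parse. The function name segment_after_language and its language parameter are assumptions made for illustration, not something the answer above specifies.

```python
from urllib.parse import urlparse


def segment_after_language(url, language="en"):
    """Return the path segment that follows the language code, or ""."""
    path = urlparse(url).path
    # Split the path on "/" and drop the empty strings produced by the
    # leading and trailing slashes.
    segments = [part for part in path.split("/") if part]
    if len(segments) >= 2 and segments[0] == language:
        return segments[1]
    return ""


print(segment_after_language(
    "http://www.somesite.com.domain.webdev.domain.com/en/the-game/another-one/another-one/another-one/"
))
```

For the URL from the question this prints the-game. Since the host name never reaches the string handling, odd domains like the one in the question cannot interfere with the match.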
