A Way to Group URLs

Developer https://www.devze.com 2023-02-19 11:20 Source: Internet

I have a list of URLs, each associated with a set of numbers. For example:

http://example.com/ - 0
http://example.com/login/ - 1
http://example.com/login/verify/ - 2
http://example.com/user123/home/ - 3
http://example.com/user254/home/ - 3
http://example.com/user123/edit/ - 4

I want some method to 'compress' this, maybe using regexp -- the catch is that for all URLs not in the list I can assume they map to whatever number I want.

So an output like this, where any URL is checked against each expression in this order and given a number according to the first match:

http://example.com/login/verify* - 2
http://example.com/login/* - 1
http://example.com/*/home/ - 3
http://example.com/*/edit - 4
http://example.com/* - 0

Note: There are multiple possible outputs like this that are acceptable. Also, I considered something like a tree, where each node contains an expression like one of the above, and the leaves at the end are the actual URLs to check against.
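The tree idea could be sketched roughly as a trie keyed on path segments, with "*" as a node that accepts any single segment. This is only an illustration under assumptions: the names Node, insert, and tree_lookup are hypothetical, the host is ignored, and URLs that reach no stored node return None (which the question allows, since unlisted URLs may map to anything).

```python
from urllib.parse import urlsplit

WILDCARD = "*"  # hypothetical marker: matches exactly one path segment


class Node:
    def __init__(self):
        self.children = {}  # segment (or "*") -> Node
        self.value = None   # number assigned to URLs that end at this node


def insert(root, path_pattern, value):
    """Store value at the node reached by following path_pattern's segments."""
    node = root
    for segment in (s for s in path_pattern.split("/") if s):
        node = node.children.setdefault(segment, Node())
    node.value = value


def tree_lookup(root, url):
    """Walk the tree along the URL path, trying a literal child before "*"."""
    segments = [s for s in urlsplit(url).path.split("/") if s]

    def walk(node, i):
        if i == len(segments):
            return node.value
        for key in (segments[i], WILDCARD):
            child = node.children.get(key)
            if child is not None:
                found = walk(child, i + 1)
                if found is not None:
                    return found  # otherwise backtrack and try the wildcard
        return None

    return walk(root, 0)


root = Node()
for pattern, value in [("/", 0), ("/login/", 1), ("/login/verify/", 2),
                       ("/*/home/", 3), ("/*/edit/", 4)]:
    insert(root, pattern, value)
```

Trying the literal child first and falling back to "*" gives the same "most specific wins" behaviour as the ordered list, without having to keep the rules sorted by hand.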

Another note: I said the URLs map to numbers for simplicity's sake. Actually, each URL maps to a set of numbers, and the whole set has to match. Just in case that helps someone come up with a solution (though I doubt it).

It looks like it will be easiest to use a different regex for each URL match. They would probably look something like this:

http://example\.com/login/verify
http://example\.com/login
http://example\.com/[^/]+/home
http://example\.com/[^/]+/edit
http://example\.com

Try to match the URL to each of these in order, then when it matches look up the number (or set) that corresponds to that match.
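A minimal sketch of that ordered lookup, assuming the five expressions above and the numbers from the question. The names RULES and first_match are made up for illustration.

```python
import re

# Ordered (pattern, number) rules; the most specific expressions come first.
RULES = [
    (re.compile(r"http://example\.com/login/verify"), 2),
    (re.compile(r"http://example\.com/login"), 1),
    (re.compile(r"http://example\.com/[^/]+/home"), 3),
    (re.compile(r"http://example\.com/[^/]+/edit"), 4),
    (re.compile(r"http://example\.com"), 0),
]


def first_match(url):
    """Return the number of the first rule that matches the start of url."""
    for pattern, number in RULES:
        if pattern.match(url):
            return number
    return None  # no rule matched at all (e.g. a different host)
```

Because pattern.match only anchors at the start, a trailing slash or extra path segments after the matched part are accepted. Replacing the numbers with frozensets would cover the "set of numbers" case without changing the loop.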

Alternatively you could use a single regex with capturing groups to determine which URL was actually matched, for example:

http://example\.com(?:(/login/verify)|(/login)|(/[^/]+/home)|(/[^/]+/edit))?

Here is a Rubular that shows how you could use the previous regex: http://www.rubular.com/r/tklqMs8U1Z

edit: Here is a Python function that does what I think you're looking for.

import re

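def url_match(url):
    base = "http://example.com"
    # (path ending, number) pairs, most specific first; "*" stands for one path segment.
    endings = [("/login/verify", 2), ("/login", 1), ("/*/home", 3), ("/*/edit", 4), ("", 0)]
    # Escape each ending, then turn the escaped "*" into "one or more non-slash characters".
    re_endings = ["(%s)" % re.escape(x[0]).replace(r"\*", "[^/]+") for x in endings]

    # One capturing group per ending, so the group that matched identifies the ending.
    pattern = re.compile("%s(?:%s)" % (re.escape(base), "|".join(re_endings)))
    match = pattern.match(url)

    if match is None:
        return None

    # Exactly one group is not None; its position indexes into endings.
    index = [i for i, x in enumerate(match.groups()) if x is not None]
    return endings[index[0]][1]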

url_match("http://example.com")              # 0
url_match("http://example.com/login")        # 1
url_match("http://example.com/login/verify") # 2
url_match("http://example.com/user123/home") # 3
url_match("http://example.com/user123/edit") # 4
url_match("http://sample.com")               # None

What you are asking for is clustering of the URLs based on the web path. You can check out K-means clustering of text documents, which explains this in detail.
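Short of full clustering, a much simpler heuristic may already cover cases like the example in the question. It groups URLs that share a number and a path depth, then replaces every segment on which the group disagrees with "*". This is not K-means, only a sketch under assumptions: the name generalize is hypothetical, the host is ignored, and nothing checks whether two groups with different numbers collapse to the same pattern.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def generalize(url_to_value):
    """Merge URLs sharing a value and a path depth into one wildcard pattern."""
    groups = defaultdict(list)
    for url, value in url_to_value.items():
        segments = tuple(s for s in urlsplit(url).path.split("/") if s)
        groups[(value, len(segments))].append(segments)

    rules = []
    for (value, depth), paths in groups.items():
        # Keep a segment if every path in the group agrees on it, else use "*".
        pattern = [col[0] if len(set(col)) == 1 else "*" for col in zip(*paths)]
        rules.append((depth, pattern.count("*"), "/" + "/".join(pattern), value))

    # More specific rules first: deeper paths, then fewer wildcards.
    rules.sort(key=lambda r: (-r[0], r[1]))
    return [(pattern, value) for _, _, pattern, value in rules]


urls = {
    "http://example.com/": 0,
    "http://example.com/login/": 1,
    "http://example.com/login/verify/": 2,
    "http://example.com/user123/home/": 3,
    "http://example.com/user254/home/": 3,
    "http://example.com/user123/edit/": 4,
}
rules = generalize(urls)
```

On the sample data the two "home" URLs merge into "/*/home", while "/user123/edit" stays literal because a single example gives nothing to generalize from.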
