Given a list of URLs known to be somewhat "RESTful", what would be a decent algorithm for grouping them so that URLs mapping to the same "controller/action/view" are likely to be grouped together?
For example, given the following list:
http://www.example.com/foo
http://www.example.com/foo/1
http://www.example.com/foo/2
http://www.example.com/foo/3
http://www.example.com/foo/1/edit
http://www.example.com/foo/2/edit
http://www.example.com/foo/3/edit
It would group them as follows:
http://www.example.com/foo

http://www.example.com/foo/1
http://www.example.com/foo/2
http://www.example.com/foo/3

http://www.example.com/foo/1/edit
http://www.example.com/foo/2/edit
http://www.example.com/foo/3/edit
Nothing is known about the order or structure of the URLs ahead of time. In my example, it would be somewhat easy since the IDs are obviously numeric. Ideally, I'd like an algorithm that does a good job even if IDs are non-numeric (as in http://www.example.com/products/rocket and http://www.example.com/products/ufo).
It's really just an effort to say, "Given these URLs, I've grouped them by removing what I think is the 'variable' ID part of the URL."
Aliza has the right idea: you want to look for the 'articulation points' (in REST, basically where a parameter is being passed). Looking only for a single point of change gets tricky, though.
Example:
http://www.example.com/foo/1/new
http://www.example.com/foo/1/edit
http://www.example.com/foo/2/edit
http://www.example.com/bar/1/new
These can be grouped in several equally good ways, since we have no idea of the URL semantics. It really boils down to one question: is a given piece of the URL part of the REST descriptor, or is it a parameter? If we know what all the descriptors are, the rest are parameters and we are done.
Given a sufficiently large dataset, we'd want to look at the statistics of all URLs at each depth, e.g., /x/y/z/t/. We would count the number of occurrences of each value in each slot and generate a large joint probability distribution table.
We can now look at the distribution of symbols. A high count of distinct values in a slot means it's likely a parameter. We would start from the bottom and look for conditional probability events, i.e., what is the probability of x being foo, then what is the probability of y being something given x, and so on. I'd have to think more to determine a systematic way of extracting these, but it seems like a promising start.
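As a rough sketch of the counting idea above (not a full probabilistic model): the Python below tabulates, for each prefix of path segments, which values follow it and how often. The function names and the `min_distinct` threshold are my own assumptions for illustration; a prefix followed by many distinct values is flagged as a likely parameter slot.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse


def path_segments(url):
    """Return the non-empty path segments of a URL."""
    return [s for s in urlparse(url).path.split("/") if s]


def conditional_slot_values(urls):
    """Map each prefix (tuple of preceding segments) to a Counter of the
    segment values observed immediately after that prefix."""
    table = defaultdict(Counter)
    for url in urls:
        segs = path_segments(url)
        for depth, seg in enumerate(segs):
            table[tuple(segs[:depth])][seg] += 1
    return table


def likely_parameter_slots(urls, min_distinct=3):
    """Return prefixes followed by at least `min_distinct` different values.
    The threshold is an arbitrary assumption; tune it for real data."""
    table = conditional_slot_values(urls)
    return {
        prefix: sorted(counter)
        for prefix, counter in table.items()
        if len(counter) >= min_distinct
    }


if __name__ == "__main__":
    urls = [
        "http://www.example.com/foo",
        "http://www.example.com/foo/1",
        "http://www.example.com/foo/2",
        "http://www.example.com/foo/3",
        "http://www.example.com/foo/1/edit",
        "http://www.example.com/foo/2/edit",
        "http://www.example.com/foo/3/edit",
    ]
    print(likely_parameter_slots(urls))  # {('foo',): ['1', '2', '3']}
```

Note that this conditions on the literal prefix, so statistics below a parameter slot (e.g. what follows foo/1 versus foo/2) stay split until that slot is replaced by a placeholder and the counts are redone.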
Split each URL into an array of strings, using '/' as the delimiter and dropping empty strings.
E.g. http://www.example.com/foo/1/edit will give the array [http:,www.example.com,foo,1,edit].
If two arrays (URLs) have the same length and share the same value at all indices except one, they will be in the same group.
E.g. http://www.example.com/foo/1/edit = [http:,www.example.com,foo,1,edit] and http://www.example.com/foo/2/edit = [http:,www.example.com,foo,2,edit]. The arrays match in all indices except for #3 (counting from 0), which is 1 in the first array and 2 in the second array. Therefore, the URLs belong to the same group.
It is easy to see that URLs like http://www.example.com/foo/3 and http://www.example.com/foo/1/edit will not belong to the same group according to this algorithm.
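One possible way to code this up in Python, as a sketch rather than a finished solution: instead of comparing every pair of arrays, mask each index in turn with a wildcard and bucket URLs by the masked array. Two URLs land in the same bucket exactly when they have the same length and differ only at the masked index. The names `split_url`, `group_by_single_difference` and the `*` wildcard are assumptions of mine, and a URL segment that is literally `*` would confuse it.

```python
from collections import defaultdict

WILDCARD = "*"  # assumed placeholder; must not occur as a real segment


def split_url(url):
    """Split on '/' and drop empty strings, so that
    'http://www.example.com/foo/1/edit' becomes
    ['http:', 'www.example.com', 'foo', '1', 'edit']."""
    return [part for part in url.split("/") if part]


def group_by_single_difference(urls):
    """Bucket URLs whose arrays are equal everywhere except one index.
    Returns {masked_array: [urls]} for buckets with more than one URL."""
    buckets = defaultdict(list)
    for url in dict.fromkeys(urls):  # de-duplicate, keep input order
        parts = split_url(url)
        for i in range(len(parts)):
            masked = tuple(parts[:i] + [WILDCARD] + parts[i + 1:])
            buckets[masked].append(url)
    return {key: members for key, members in buckets.items() if len(members) > 1}


if __name__ == "__main__":
    urls = [
        "http://www.example.com/foo",
        "http://www.example.com/foo/1",
        "http://www.example.com/foo/2",
        "http://www.example.com/foo/3",
        "http://www.example.com/foo/1/edit",
        "http://www.example.com/foo/2/edit",
        "http://www.example.com/foo/3/edit",
    ]
    for pattern, members in group_by_single_difference(urls).items():
        print("/".join(pattern[2:]), members)
```

On the question's list this yields two buckets, foo/* and foo/*/edit, and leaves http://www.example.com/foo ungrouped. A URL can appear in more than one bucket when it differs from different neighbours at different indices, which is the ambiguity the other answer describes.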