I can't seem to find a question on SO about my particular problem, so forgive me if this has been asked before!
Anyway, I'm writing a script to loop through a set of URLs and give me a list of unique URLs with unique parameters.
The trouble I'm having is actually comparing the parameters to eliminate multiple duplicates. It's a bit hard to explain, so some examples are probably in order:
Say I have a list of URLs like this:
- hxxp://www.somesite.com/page.php?id=3&title=derp
- hxxp://www.somesite.com/page.php?id=4&title=blah
- hxxp://www.somesite.com/page.php?id=3&c=32&title=thing
- hxxp://www.somesite.com/page.php?b=33&id=3
I have it parsing each URL into a list of lists, so eventually I have a list like this:
sort = [['id', 'title'], ['id', 'c', 'title'], ['b', 'id']]
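For reference, here is a minimal sketch of how a list like that could be built with the standard library. The helper name `param_names` is hypothetical, not from my actual script, and the sketch assumes exact duplicates are dropped while collecting:

```python
from urllib.parse import urlsplit, parse_qsl

def param_names(url):
    """Return the query parameter names of a URL, in order of appearance.

    Hypothetical helper for illustration only.
    """
    query = urlsplit(url).query
    # keep_blank_values=True so parameters with empty values are not dropped
    return [name for name, _ in parse_qsl(query, keep_blank_values=True)]

urls = [
    "hxxp://www.somesite.com/page.php?id=3&title=derp",
    "hxxp://www.somesite.com/page.php?id=4&title=blah",
    "hxxp://www.somesite.com/page.php?id=3&c=32&title=thing",
    "hxxp://www.somesite.com/page.php?b=33&id=3",
]

# Collect each distinct list of parameter names once, keeping first-seen order
sort = []
for url in urls:
    names = param_names(url)
    if names not in sort:
        sort.append(names)

print(sort)
```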
I need to figure out a way to give me just 2 lists in my list at that point:
new = [['id', 'c', 'title'], ['b', 'id']]
Right now I have a bit of code that sorts it out a little. I know I'm close, and I've been slamming my head against this for a couple of days now :(. Any ideas?
Thanks in advance! :)
EDIT: Sorry for not being clear! This script is aimed at finding unique entry points for web applications post-spidering. Basically if a URL has 3 unique entry points
['id', 'c', 'title']
I'd prefer that to the same link with 2 unique entry points, such as:
['id', 'title']
So I need my new list of lists to eliminate the one with 2 and prefer the one with 3 ONLY if the smaller variables are in the larger set. If it's still unclear let me know, and thank you for the quick responses! :)
I'll assume that subsets are considered "duplicates" (non-commutatively, of course)...
Start by converting each query into a set and ordering them all from largest to smallest. Then add each query to a new list if it isn't a subset of an already-added query. Since any set is a subset of itself, this logic covers exact duplicates:
a = []
for q in sorted((set(q) for q in sort), key=len, reverse=True):
    if not any(q.issubset(Q) for Q in a):
        a.append(q)
a = [list(q) for q in a]  # Back to lists, if you want
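Converting to sets and back loses the original order of the parameter names. If that order matters, the same idea can be wrapped in a function that carries the original lists along. This is only a sketch: the function name `drop_subsets` is made up for illustration, and it assumes each inner list holds hashable parameter names.

```python
def drop_subsets(param_lists):
    """Keep only the lists whose names are not a subset of a kept list.

    Hypothetical helper: same largest-first subset check as above, but it
    returns the original lists so the parameter order is preserved.
    """
    kept = []  # pairs of (set of names, original list)
    # Largest first, measured on the set so repeated names don't skew the order
    for names in sorted(param_lists, key=lambda n: len(set(n)), reverse=True):
        current = set(names)
        # A set is a subset of itself, so exact duplicates are skipped too
        if not any(current <= seen for seen, _ in kept):
            kept.append((current, names))
    return [names for _, names in kept]

sort = [['id', 'title'], ['id', 'c', 'title'], ['b', 'id']]
new = drop_subsets(sort)
print(new)
```

With the example list from the question, this prints `[['id', 'c', 'title'], ['b', 'id']]`.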