python statistical analysis_问答_开发者_运维开发者技术经验分享

Given 15 players - 2 Goalkeepers, 5 defenders, 5 midfielders and 3 strikers, and the fact that each has a value and a score, I want to calculate the highest scoring team for the money I have. Each team must consist of 1 GK then a formation e.g. 4:4:2, 4:3:3 etc. I started with sample data such as this

player role points cost

I then did the following to evaluate all combinations

read each line into a list (for each role) then use itertools in a nested run to get all combinations

if line[1] == "G": G.append(line[0])
if line[1] == "D": D.append(line[0])
if line[1] == "M": M.append(line[0])
if line[1] == "S": S.append(line[0])

for gk in itertools.combinations(G,1):
    for de in itertools.combinations(D,4):
        for mi in itertools.combinations(M,4):
            for st in itertools.combinations(S,2):
                开发者_如何学Goteams[str(count)]= " ".join(gk)+" "+" ".join(de)+" "+" ".join(mi)+" "+" ".join(st)
                count +=1

Having got the teams, I calculate their points value, and the team cost. If it's lower than the threshold, I print it.

But if I now make this 20 goalkeepers, 150 defenders, 150 midfielders and 100 strikers, I understandably get out of memory.

What could I do to perform this analysis? Is it a generator rather than a recursive function that I need?

Many thanks

You might be able to solve this problem with recursion. The following shows the basic outline, but ignores details like a team being composed of a certain number of certain types of players.

players=[{'name':'A','score':5,'cost':10},
         {'name':'B','score':10,'cost':3},
         {'name':'C','score':6,'cost':8}]

def player_cost(player):
    return player['cost']
def player_score(player):
    return player['score']
def total_score(players):
    return sum(player['score'] for player in players)

def finance_team_recurse(budget, available_players):
    affordable_players=[]
    for player in available_players:
        if player_cost(player)<=budget:
            # Since we've ordered available players, the first player appended
            # will be the one with the highest score.
            affordable_players.append(player)
    result=[]
    if affordable_players:
        candidate_player=affordable_players[0]
        other_players=affordable_players[1:]
        # if you include candidate_player on your team
        team_with_candidate=finance_team_recurse(budget-player_cost(candidate_player),
                                                 other_players)
        team_with_candidate.append(candidate_player)
        score_of_team_with_candidate=total_score(team_with_candidate)
        if score_of_team_with_candidate>total_score(other_players):
            result=team_with_candidate
        else:
            # if you exclude candidate_player from your team
            team_without_candidate=finance_team_recurse(budget, other_players)
            score_of_team_without_candidate=total_score(team_without_candidate)
            if score_of_team_with_candidate>score_of_team_without_candidate:
                result=team_with_candidate
            else:
                result=team_without_candidate
    return result

def finance_team(budget, available_players):
    tmp=available_players[:]
    # Sort so player with highest score is first. (Greedy algorithm?)
    tmp.sort(key=player_score, reverse=True)
    return finance_team_recurse(budget,tmp)

print(finance_team(20,players))
# [{'score': 6, 'cost': 8, 'name': 'C'}, {'score': 10, 'cost': 3, 'name': 'B'}]

20 choose 1 = 20 combinations
150 choose 4 = 20260275 combinations
100 choose 2 = 4950 combinations

So there are a total of 20*20260275*20260275*4950 = 40637395564486875000L items in the teams dict. That takes a lot of memory.

for gk in itertools.combinations(G,1):
    for de in itertools.combinations(D,4):
        for mi in itertools.combinations(M,4):
            for st in itertools.combinations(S,2):    
                #Don't collect the results into a dict.
                #That's what's killing you (memory-wise).
                #Just compute the cost and
                #Just print the result here.

PS. 40637395564486875000L is on the order of 10**19. Assuming your program can process 10**6 combinations per second, it will take about 1.3 millions years for the program to complete...

Functions and generators help a lot:

def make_teams(G, D, M, S):
    """ returns all possible teams """
    for gk in itertools.combinations(G,1):
        for de in itertools.combinations(D,4):
            for mi in itertools.combinations(M,4):
                for st in itertools.combinations(S,2):
                    yield gk, de, mi, st

def get_cost( team ):
    return sum( member.cost for member in team )

def good_teams( min_score=0):
    for team in make_teams(G, D, M, S):
        if get_cost( team ) > min_score:
            yield team

for team in good_teams(min_score=100):
    print team

It still generates all possible combinations, so you'll probably run out of time now, instead of memory.

What you're doing seems like a variation of the knapsack problem - you can do better than try all possible combination, but not much better.

One way to get a good solution fast is to sort the players by their score per money. You should get the top scoring teams first, but there are no guarantees that you get the best possible solution. Wikipedia calls this the "Greedy approximation algorithm".

def score_per_cost( player ):
    return player.score / player.cost

def sorted_combinations(seq, n):
    return itertools.combinations(
        sorted(seq, key=score_per_cost, reverse=True),n)

def make_teams(G, D, M, S):
    """ returns all possible teams """
    for gk in sorted_combinations(G,1):
        for de in sorted_combinations(D,4):
            for mi in sorted_combinations(M,4):
                for st in sorted_combinations(S,2):
                    yield gk, de, mi, st

def get_cost( team ):
    return sum( member.cost for member in team )

def top_teams(n):
    return itertools.islice(make_teams(G, D, M, S),n)

for team in top_teams(100):
    print team

I'll leave adding the "cost per team < threshold" requirement to the reader (hint: it's one line in make_teams :p).