FireLeast FireLeast - 6 months ago 21
Python Question

Python Pandas self join for merge cartesian product to produce all combinations and sum

I am brand new to Python, seems like it has a lot of flexibility and is faster than traditional RDBMS systems.

Working on a very simple process to create random fantasy teams. I come from an RDBMS background (Oracle SQL) and that does not seem to be optimal for this data processing.

I made a dataframe using pandas read from csv file and now have a simple dataframe with two columns -- Player, Salary:

` Name Salary
0 Jason Day 11700
1 Dustin Johnson 11600
2 Rory McIlroy 11400
3 Jordan Spieth 11100
4 Henrik Stenson 10500
5 Phil Mickelson 10200
6 Justin Rose 9800
7 Adam Scott 9600
8 Sergio Garcia 9400
9 Rickie Fowler 9200`


What I am trying to do via python (pandas) is produce all combinations of 6 players which salary is between a certain amount 45000 -- 50000.

In looking up python options, I found the itertools combination interesting, but it would result a massive list of combinations without filtering the sum of salary.

In traditional SQL, I would do a massive merge cartesian join w/ SUM, but then I get the players in different spots..

Such as A, B, C then, C, B, A..

My traditional SQL which doesn't work well enough is something like this:

` SELECT distinct
ONE.name AS "1",
TWO.name AS "2",
THREE.name AS "3",
FOUR.name AS "4",
FIVE.name AS "5",
SIX.name AS "6",
sum(one.salary + two.salary + three.salary + four.salary + five.salary + six.salary) as salary
FROM
nl.pgachamp2 ONE,nl.pgachamp2 TWO,nl.pgachamp2 THREE, nl.pgachamp2 FOUR,nl.pgachamp2 FIVE,nl.pgachamp2 SIX
where ONE.name != TWO.name
and ONE.name != THREE.name
and one.name != four.name
and one.name != five.name
and TWO.name != THREE.name
and TWO.name != four.name
and two.name != five.name
and TWO.name != six.name
and THREE.name != four.name
and THREE.name != five.name
and three.name != six.name
and five.name != six.name
and four.name != six.name
and four.name != five.name
and one.name != six.name
group by ONE.name, TWO.name, THREE.name, FOUR.name, FIVE.name, SIX.name`


Is there a way to do this in Pandas/Python?

Any documentation that can be pointed to would be great!

Answer

Here's a non-Pandas solution using a simple algorithm. It generates combinations recursively from a list of players sorted by salary. This lets it skip generating combinations that exceed the salary cap.

As piRSquared mentions, there are no teams of 6 that fall within the salary limits stated in the question, so I chose limits to generate a small number of teams.

#!/usr/bin/env python3

''' Limited combinations

    Generate combinations of players whose combined salaries fall within given limits

    See http://stackoverflow.com/q/38636460/4014959

    Written by PM 2Ring 2016.07.28
'''

data = '''\
0              Jason Day   11700
1         Dustin Johnson   11600
2           Rory McIlroy   11400
3          Jordan Spieth   11100
4         Henrik Stenson   10500
5         Phil Mickelson   10200
6            Justin Rose    9800
7             Adam Scott    9600
8          Sergio Garcia    9400
9          Rickie Fowler    9200
'''

data = [s.split() for s in data.splitlines()]
all_players = [(' '.join(u[1:-1]), int(u[-1])) for u in data]
all_players.sort(key=lambda t: t[1])
for i, row in enumerate(all_players):
    print(i, row)
print('- '*40)

def choose_teams(free, num, team=(), value=0):
    num -= 1
    for i, p in enumerate(free):
        salary = all_players[p][1]
        newvalue = value + salary
        if newvalue <= hi:
            newteam = team + (p,)
            if num == 0:
                if newvalue >= lo:
                    yield newteam, newvalue
            else:
                yield from choose_teams(free[i+1:], num, newteam, newvalue)
        else:
            break

#Salary limits
lo, hi = 55000, 60500

#Indices of players that can be chosen for a team 
free = tuple(range(len(all_players)))

for i, (t, s) in enumerate(choose_teams(free, 6), 1):
    team = [all_players[p] for p in t]
    names, sals = zip(*team)
    assert sum(sals) == s
    print(i, t, names, s)

output

0 ('Rickie Fowler', 9200)
1 ('Sergio Garcia', 9400)
2 ('Adam Scott', 9600)
3 ('Justin Rose', 9800)
4 ('Phil Mickelson', 10200)
5 ('Henrik Stenson', 10500)
6 ('Jordan Spieth', 11100)
7 ('Rory McIlroy', 11400)
8 ('Dustin Johnson', 11600)
9 ('Jason Day', 11700)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
1 (0, 1, 2, 3, 4, 5) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson') 58700
2 (0, 1, 2, 3, 4, 6) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Jordan Spieth') 59300
3 (0, 1, 2, 3, 4, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Rory McIlroy') 59600
4 (0, 1, 2, 3, 4, 8) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Dustin Johnson') 59800
5 (0, 1, 2, 3, 4, 9) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Jason Day') 59900
6 (0, 1, 2, 3, 5, 6) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Jordan Spieth') 59600
7 (0, 1, 2, 3, 5, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Rory McIlroy') 59900
8 (0, 1, 2, 3, 5, 8) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Dustin Johnson') 60100
9 (0, 1, 2, 3, 5, 9) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Henrik Stenson', 'Jason Day') 60200
10 (0, 1, 2, 3, 6, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Justin Rose', 'Jordan Spieth', 'Rory McIlroy') 60500
11 (0, 1, 2, 4, 5, 6) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Phil Mickelson', 'Henrik Stenson', 'Jordan Spieth') 60000
12 (0, 1, 2, 4, 5, 7) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Phil Mickelson', 'Henrik Stenson', 'Rory McIlroy') 60300
13 (0, 1, 2, 4, 5, 8) ('Rickie Fowler', 'Sergio Garcia', 'Adam Scott', 'Phil Mickelson', 'Henrik Stenson', 'Dustin Johnson') 60500
14 (0, 1, 3, 4, 5, 6) ('Rickie Fowler', 'Sergio Garcia', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson', 'Jordan Spieth') 60200
15 (0, 1, 3, 4, 5, 7) ('Rickie Fowler', 'Sergio Garcia', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson', 'Rory McIlroy') 60500
16 (0, 2, 3, 4, 5, 6) ('Rickie Fowler', 'Adam Scott', 'Justin Rose', 'Phil Mickelson', 'Henrik Stenson', 'Jordan Spieth') 60400

If you're using an older version of Python that doesn't support the yield from syntax you can replace

yield from choose_teams(free[i+1:], num, newteam, newvalue)

with

for t, v in choose_teams(free[i+1:], num, newteam, newvalue):
    yield t, v