Flow Nuwen - 1 month ago
Python Question

Collecting Summary Statistics on Dataframe built by randomly sampling other dataframes

My goal is to build a dataframe by randomly sampling from other dataframes, collect summary statistics on the new dataframe, and then append those statistics to a list. Ideally, I could repeat this process n times (e.g. for a bootstrap).

dfposlist = [OFdf, Firstdf, Seconddf, Thirddf, CFdf, RFdf, Cdf, SSdf]

OFdf.head()
      playerID        OPW POS    salary
87   bondsba01  62.061290  OF   8541667
785  ramirma02  35.785630  OF  13050000
966  walkela01  30.644305  OF   6050000
859  sheffga01  29.090699  OF   9916667
357  gilesbr02  28.160054  OF   7666666


All the dataframes in the list have the same headers. What I'm trying to do looks something like this:

teamdist = []
for df in dfposlist:
    frames = [df.sample(n=1)]
    team = pd.concat(frames)

    teamopw = team['OPW'].sum()
    teamsal = team['salary'].sum()
    teamplayers = team['playerID'].tolist()

    teamdic = {'Salary':teamsal, 'OPW':teamopw, 'Players':teamplayers}
    teamdist.append(teamdic)


The output I'm looking for is something like this:

teamdist = [{'Salary':4900000, 'OPW':78.452, 'Players':[bondsba01, etc, etc]}]


But for some reason, sum operations such as teamopw = team['OPW'].sum() do not work the way I'd like: they just return the individual elements of team['OPW'].


print(teamopw)
0.17118131814601256
38.10700006434629
1.5699939126695253
32.9068837019903
16.990760776263674
18.22428871113601
13.447706356730897


Any advice on how to get this working? Thanks!

Edit: Working solution as follows. Not sure if it is the most pythonic way, but it works.

teamdist = []
team = pd.concat([df.sample(n=1) for df in dfposlist])

teamopw = team[['OPW']].values.sum()
teamsal = team[['salary']].values.sum()
teamplayers = team['playerID'].tolist()

teamdic = {'Salary':teamsal, 'OPW':teamopw, 'Players':teamplayers}
teamdist.append(teamdic)
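
For comparison, here is a minimal sketch of why the original loop seemed to return individual elements. It assumes two tiny made-up frames (Cdf's rows and the player IDs catcher01/catcher02 are hypothetical, not real data). Inside the loop, each team is built from a single sampled row, so its sum is just that row's value; concatenating all the samples first gives one row per position, and a plain .sum() then returns a single number.

```python
import pandas as pd

# Hypothetical stand-ins for the per-position frames in the question.
OFdf = pd.DataFrame({"playerID": ["bondsba01", "ramirma02"],
                     "OPW": [62.0, 35.5],
                     "salary": [8541667, 13050000]})
Cdf = pd.DataFrame({"playerID": ["catcher01", "catcher02"],
                    "OPW": [10.0, 12.5],
                    "salary": [1000000, 2000000]})
dfposlist = [OFdf, Cdf]

# Inside the loop, each `team` holds exactly one sampled row,
# so summing a column just gives back that row's value.
rows_per_iteration = []
for df in dfposlist:
    team = pd.concat([df.sample(n=1)])
    rows_per_iteration.append(len(team))

# Sampling every frame first and concatenating once gives
# one row per position, so the sum spans the whole team.
team = pd.concat([df.sample(n=1) for df in dfposlist])
teamopw = team["OPW"].sum()
```

With the concatenation moved out of the loop, team['OPW'].sum() should work directly, without going through .values.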

Answer

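Here is one way to do it (with random data): sample one row from each dataframe, concatenate the samples once, and then sum the numeric columns of the result.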

import pandas as pd
import numpy as np

dfposlist = dict(zip(range(10),
                     [pd.DataFrame(np.random.randn(10, 5),
                                   columns=list('abcde'))
                     for i in range(10)]))
for df in dfposlist.values():
    df['f'] = list('qrstuvwxyz')

teamdist = []
team = pd.concat([df.sample(n=1) for df in dfposlist.values()])
print(team.info())

teamdic = team[['a', 'c', 'e']].sum().to_dict()
teamdic['f'] = team['f'].tolist()
teamdist.append(teamdic)
print(teamdist)

# Output:
## team.info():
<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 1 to 6
Data columns (total 6 columns):
a    10 non-null float64
b    10 non-null float64
c    10 non-null float64
d    10 non-null float64
e    10 non-null float64
f    10 non-null object
dtypes: float64(5), object(1)
memory usage: 560.0+ bytes
None

## teamdist:
[{'a': -3.5380097363724601,
  'c': 2.0951152809401776,
  'e': 3.1439230427971863,
  'f': ['r', 'w', 'z', 'v', 'x', 'q', 't', 'q', 'v', 'w']}]
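
To repeat this n times, as the question asks, the sampling step can be wrapped in a function and called in a loop. The following is only a sketch under assumptions: sample_team and n_iterations are names made up for illustration, and the two small frames stand in for the real per-position dataframes (Cdf's rows are invented).

```python
import pandas as pd

def sample_team(frames):
    """Draw one random row from each frame and summarise the team."""
    team = pd.concat([df.sample(n=1) for df in frames])
    return {
        "Salary": int(team["salary"].sum()),
        "OPW": float(team["OPW"].sum()),
        "Players": team["playerID"].tolist(),
    }

# Hypothetical stand-ins for the per-position frames.
OFdf = pd.DataFrame({"playerID": ["bondsba01", "ramirma02"],
                     "OPW": [62.0, 35.5],
                     "salary": [8541667, 13050000]})
Cdf = pd.DataFrame({"playerID": ["catcher01", "catcher02"],
                    "OPW": [10.0, 12.5],
                    "salary": [1000000, 2000000]})
dfposlist = [OFdf, Cdf]

# One dict of summary statistics per simulated team.
n_iterations = 100
teamdist = [sample_team(dfposlist) for _ in range(n_iterations)]

# Collecting the dicts in a dataframe makes the distribution easy to inspect.
summary = pd.DataFrame(teamdist)[["Salary", "OPW"]].describe()
```

Each element of teamdist has the shape the question was looking for, and summary holds the count, mean, spread and quartiles of Salary and OPW across the sampled teams.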