Rebin - 5 months ago
Python Question

Python: Pandas DataFrame for tuples

Is this a correct way of creating a DataFrame of tuples? (Assume that the tuples are created inside the code fragment.)

import pandas as pd
import numpy as np
import random

row = ['a','b','c']
col = ['A','B','C','D']

# use numpy for creating a ZEROS matrix
st = np.zeros((len(row),len(col)))
df2 = pd.DataFrame(st, index=row, columns=col)

# CONVERT each cell to an OBJECT for inserting tuples
for c in col:
    df2[c] = df2[c].astype(object)

print(df2)

for i in row:
    for j in col:
        # DataFrame.set_value was removed in pandas 1.0; .at is the scalar setter
        df2.at[i, j] = (i+j, np.round(random.uniform(0, 1), 4))

print(df2)

As you can see, I first created a zeros matrix in NumPy and then made each cell an object type in pandas so that I can insert tuples. Is this the correct way to do it, or is there a better solution for adding tuples to, and retrieving them from, matrices?

Results are fine:

   A  B  C  D
a  0  0  0  0
b  0  0  0  0
c  0  0  0  0

              A             B             C             D
a  (aA, 0.7134)   (aB, 0.006)  (aC, 0.1948)  (aD, 0.2158)
b  (bA, 0.2937)  (bB, 0.8083)  (bC, 0.3597)   (bD, 0.324)
c  (cA, 0.9534)  (cB, 0.9666)  (cC, 0.7489)  (cD, 0.8599)


First, to answer your literal question: You can construct DataFrames from a list of lists. The values in the list of lists can themselves be tuples:

import numpy as np
import pandas as pd

row = ['a','b','c']
col = ['A','B','C','D']

data = [[(i+j, round(np.random.uniform(0, 1), 4)) for j in col] for i in row]
df = pd.DataFrame(data, index=row, columns=col)


              A             B             C             D
a  (aA, 0.8967)  (aB, 0.7302)  (aC, 0.7833)  (aD, 0.7417)
b  (bA, 0.4621)  (bB, 0.6426)  (bC, 0.2249)  (bD, 0.7085)
c  (cA, 0.7471)  (cB, 0.6251)    (cC, 0.58)  (cD, 0.2426)
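
Since the question also asks about retrieving tuples, here is a minimal sketch of reading them back out, assuming the same row and col labels as above. The seeded generator and the names labels and values are my own additions for illustration, not part of the original code:

```python
import numpy as np
import pandas as pd

row = ['a', 'b', 'c']
col = ['A', 'B', 'C', 'D']

# Seeded only so the sketch is repeatable; the code above is unseeded.
rng = np.random.RandomState(0)
data = [[(i + j, round(rng.uniform(0, 1), 4)) for j in col] for i in row]
df = pd.DataFrame(data, index=row, columns=col)

# A single cell comes back as an ordinary Python tuple and can be unpacked.
name, p = df.at['b', 'C']

# Splitting the tuples apart requires an element-wise, Python-level pass.
labels = df.apply(lambda s: s.map(lambda t: t[0]))
values = df.apply(lambda s: s.map(lambda t: t[1])).astype(float)
```

Here labels holds only the strings and values only the numbers, which is essentially the two-DataFrame layout, just derived after the fact.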

Having said that, beware that storing tuples in DataFrames dooms you to Python-speed loops. To take advantage of fast Pandas/NumPy routines, you need to use native NumPy dtypes such as np.float64 (whereas, in contrast, tuples require "object" dtype).
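
The dtype difference is easy to check for yourself. A minimal sketch with made-up values; the names tuples and floats are chosen just for illustration:

```python
import numpy as np
import pandas as pd

# A column of tuples can only be stored as generic Python objects.
tuples = pd.Series([('aA', 0.8967), ('bA', 0.4621), ('cA', 0.7471)])
# A column of plain numbers gets a native NumPy dtype.
floats = pd.Series([0.8967, 0.4621, 0.7471])

print(tuples.dtype)   # object
print(floats.dtype)   # float64

# Vectorized reductions work directly on the float column...
print(floats.max())
# ...while the tuple column needs a Python-level step to pull the number out first.
print(max(t[1] for t in tuples))
```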

So perhaps a better solution for your purpose is to use two separate DataFrames, one for the strings and one for the numbers:

import numpy as np
import pandas as pd

row = ['a','b','c']
col = ['A','B','C','D']

# strings in one DataFrame, numbers (native float64) in another
prevstate = pd.DataFrame([[i+j for j in col] for i in row], index=row, columns=col)
prob = pd.DataFrame(np.random.uniform(0, 1, size=(len(row), len(col))).round(4),
                    index=row, columns=col)
#     A   B   C   D
# a  aA  aB  aC  aD
# b  bA  bB  bC  bD
# c  cA  cB  cC  cD

#         A       B       C       D
# a  0.8967  0.7302  0.7833  0.7417
# b  0.4621  0.6426  0.2249  0.7085
# c  0.7471  0.6251  0.5800  0.2426

To loop through the columns, find the row with maximum probability and retrieve the corresponding prevstate, you could use .idxmax and .loc:

# loop variable named `column` so it does not shadow the `col` list above
for column in prob.columns:
    idx = prob[column].idxmax()
    print('{}: {}'.format(prevstate.loc[idx, column], prob.loc[idx, column]))


aA: 0.8967
aB: 0.7302
aC: 0.7833
aD: 0.7417
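
The same lookup can also be written without an explicit loop over columns, since calling idxmax on a whole DataFrame returns the winning row label for every column at once. A sketch, hard-coding the prob values shown above so it is repeatable; best_rows and result are hypothetical names:

```python
import pandas as pd

row = ['a', 'b', 'c']
col = ['A', 'B', 'C', 'D']

prevstate = pd.DataFrame([[i + j for j in col] for i in row], index=row, columns=col)
prob = pd.DataFrame([[0.8967, 0.7302, 0.7833, 0.7417],
                     [0.4621, 0.6426, 0.2249, 0.7085],
                     [0.7471, 0.6251, 0.5800, 0.2426]],
                    index=row, columns=col)

# Row label of the maximum in each column (a Series indexed by column name).
best_rows = prob.idxmax()

# Pair each column's best probability with the matching prevstate entry.
result = pd.DataFrame({
    'prevstate': [prevstate.at[r, c] for c, r in best_rows.items()],
    'prob': prob.max(),
})
print(result)
```

Keeping the answer in a small DataFrame rather than printing line by line makes it easier to reuse in later steps.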