Rebin - 1 year ago 126

Python Question

Is this a correct way of creating a DataFrame of tuples? (Assume that the tuples are created inside the code fragment.)

```
import pandas as pd
import numpy as np
import random

row = ['a','b','c']
col = ['A','B','C','D']

# use numpy for creating a ZEROS matrix
st = np.zeros((len(row),len(col)))
df2 = pd.DataFrame(st, index=row, columns=col)

# CONVERT each cell to an OBJECT for inserting tuples
for c in col:
    df2[c] = df2[c].astype(object)

print df2

for i in row:
    for j in col:
        df2.set_value(i, j, (i+j, np.round(random.uniform(0, 1), 4)))

print df2
```

As you can see, I first created a `zeros(3,4)` matrix, converted each column to object dtype, and then inserted the tuples cell by cell.

Results are fine:

```
   A  B  C  D
a  0  0  0  0
b  0  0  0  0
c  0  0  0  0
              A             B             C             D
a  (aA, 0.7134)   (aB, 0.006)  (aC, 0.1948)  (aD, 0.2158)
b  (bA, 0.2937)  (bB, 0.8083)  (bC, 0.3597)   (bD, 0.324)
c  (cA, 0.9534)  (cB, 0.9666)  (cC, 0.7489)  (cD, 0.8599)
```

Answer

First, to answer your literal question: You can construct DataFrames from a list of lists. The values in the list of lists can themselves be tuples:

```
import numpy as np
import pandas as pd
np.random.seed(2016)
row = ['a','b','c']
col = ['A','B','C','D']
data = [[(i+j, round(np.random.uniform(0, 1), 4)) for j in col] for i in row]
df = pd.DataFrame(data, index=row, columns=col)
print(df)
```

yields

```
              A             B             C             D
a  (aA, 0.8967)  (aB, 0.7302)  (aC, 0.7833)  (aD, 0.7417)
b  (bA, 0.4621)  (bB, 0.6426)  (bC, 0.2249)  (bD, 0.7085)
c  (cA, 0.7471)  (cB, 0.6251)    (cC, 0.58)  (cD, 0.2426)
```

Having said that, beware that storing tuples in DataFrames dooms you to Python-speed loops. To take advantage of fast Pandas/NumPy routines, you need to use native NumPy dtypes such as `np.float64` (whereas, in contrast, tuples require "object" dtype).
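To make the dtype point concrete, here is a minimal sketch that builds the same numbers twice, once as a plain float frame and once wrapped in tuples, and compares how a per-column maximum has to be computed. The names `values`, `tuples`, `col_max`, and `col_max_from_tuples` are made up for this illustration and do not come from the question.

```python
import numpy as np
import pandas as pd

np.random.seed(2016)
row = ['a', 'b', 'c']
col = ['A', 'B', 'C', 'D']
values = np.random.uniform(0, 1, size=(len(row), len(col))).round(4)

# Numeric frame: every column holds native float64 values.
prob = pd.DataFrame(values, index=row, columns=col)

# Tuple frame (illustrative): every cell is a Python tuple,
# so each column falls back to object dtype.
tuples = pd.DataFrame(
    [[(i + j, values[r, c]) for c, j in enumerate(col)]
     for r, i in enumerate(row)],
    index=row, columns=col)

print(prob.dtypes.iloc[0])    # float64
print(tuples.dtypes.iloc[0])  # object

# The float frame supports vectorized reductions directly...
col_max = prob.max()

# ...while the tuple frame needs a Python-level loop over every cell.
col_max_from_tuples = tuples.apply(lambda s: max(t[1] for t in s))
```

Both computations should give the same numbers, but only the first one runs inside NumPy; the second visits each tuple from Python.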

So perhaps a better solution for your purpose is to use two separate DataFrames, one for the strings and one for the numbers:

```
import numpy as np
import pandas as pd
np.random.seed(2016)
row=['a','b','c']
col=['A','B','C','D']
prevstate = pd.DataFrame([[i+j for j in col] for i in row], index=row, columns=col)
prob = pd.DataFrame(np.random.uniform(0, 1, size=(len(row), len(col))).round(4),
                    index=row, columns=col)
print(prevstate)
#     A   B   C   D
# a  aA  aB  aC  aD
# b  bA  bB  bC  bD
# c  cA  cB  cC  cD
print(prob)
#         A       B       C       D
# a  0.8967  0.7302  0.7833  0.7417
# b  0.4621  0.6426  0.2249  0.7085
# c  0.7471  0.6251  0.5800  0.2426
```
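If the tuples already live in a DataFrame (as the question assumes), one possible way to get to the two-frame layout is to pull each tuple element out column by column. This is only a sketch: it assumes every cell is a 2-tuple of `(label, number)`, and it rebuilds the tuple frame `df` from the first example so that it runs on its own.

```python
import numpy as np
import pandas as pd

np.random.seed(2016)
row = ['a', 'b', 'c']
col = ['A', 'B', 'C', 'D']
data = [[(i + j, round(np.random.uniform(0, 1), 4)) for j in col] for i in row]
df = pd.DataFrame(data, index=row, columns=col)

# First tuple element (the label) of every cell, one column at a time.
prevstate = pd.DataFrame({c: df[c].map(lambda t: t[0]) for c in df.columns})

# Second tuple element (the number), cast so the columns are float64.
prob = pd.DataFrame(
    {c: df[c].map(lambda t: t[1]) for c in df.columns}).astype(float)

print(prevstate)
print(prob.dtypes)
```

The split itself still runs at Python speed, but it only has to happen once; everything afterwards can work on the float64 frame.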

To loop through the columns, find the row with maximum probability, and retrieve the corresponding `prevstate`, you could use `.idxmax` and `.loc`:

```
for col in prob.columns:
    idx = prob[col].idxmax()
    print('{}: {}'.format(prevstate.loc[idx, col], prob.loc[idx, col]))
```

yields

```
aA: 0.8967
aB: 0.7302
aC: 0.7833
aD: 0.7417
```
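With the numbers in their own float frame, the per-column loop can probably be dropped altogether. The sketch below is one possible variant, not the only way to do it: it takes the positional `argmax` of each column and uses NumPy integer indexing to pick the matching cells from both frames. The names `best_rows`, `best_cols`, and `best` are invented for the example, and `to_numpy()` assumes pandas 0.24 or newer.

```python
import numpy as np
import pandas as pd

np.random.seed(2016)
row = ['a', 'b', 'c']
col = ['A', 'B', 'C', 'D']
prevstate = pd.DataFrame([[i + j for j in col] for i in row],
                         index=row, columns=col)
prob = pd.DataFrame(np.random.uniform(0, 1, size=(len(row), len(col))).round(4),
                    index=row, columns=col)

# Positional row index of the largest probability in each column.
best_rows = prob.to_numpy().argmax(axis=0)
best_cols = np.arange(prob.shape[1])

# Integer-array indexing selects one (row, column) cell per column.
best = pd.DataFrame({
    'prevstate': prevstate.to_numpy()[best_rows, best_cols],
    'prob': prob.to_numpy()[best_rows, best_cols],
}, index=prob.columns)
print(best)
```

This relies on `prevstate` and `prob` sharing the same row and column order, which holds here because both were built from the same `row` and `col` lists.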