everial - 2 months ago
Python Question

Setting pandas.DataFrame string dtype (not file based)

I'm having trouble with the dtype argument of the pandas.DataFrame constructor. I'd like to preserve string values, but the following snippets always convert to a numeric type and then yield NaNs.

from __future__ import unicode_literals
from __future__ import print_function


import numpy as np
import pandas as pd


def main():
    columns = ['great', 'good', 'average', 'bad', 'horrible']
    # minimal example, dates are coming (as strings) from some
    # non-file source.
    example_data = {
        'alice': ['', '', '', '2016-05-24', ''],
        'bob': ['', '2015-01-02', '', '', '2012-09-15'],
        'eve': ['2011-12-31', '', '1998-08-13', '', ''],
    }

    # first pass, yields dataframe full of NaNs
    df = pd.DataFrame(data=example_data, index=example_data.keys(),
                      columns=columns, dtype=str)  # or string, 'str', 'string', 'object'
    print(df.dtypes)
    print(df)
    print()

    # based on https://github.com/pydata/pandas/blob/master/pandas/core/frame.py
    # and https://github.com/pydata/pandas/blob/37f95cef85834207db0930e863341efb285e38a2/pandas/types/common.py
    # we're ultimately feeding dtype to numpy's dtype, so let's just use that:
    # (using np.dtype('S10') and converting to str doesn't work either)
    df = pd.DataFrame(data=example_data, index=example_data.keys(),
                      columns=columns, dtype=np.dtype('U'))
    print(df.dtypes)
    print(df)  # still full of NaNs... =(


if __name__ == '__main__':
    main()


What value(s) of dtype will preserve strings in the data frame?

For reference:

$ python --version
2.7.12
$ pip2 list | grep pandas
pandas (0.18.1)
$ pip2 list | grep numpy
numpy (1.11.1)

Answer

For the particular case in the OP, you can use the DataFrame.from_dict() constructor (see also the Alternate Constructors section of the DataFrame documentation).

from __future__ import unicode_literals
from __future__ import print_function

import pandas as pd

columns = ['great', 'good', 'average', 'bad', 'horrible']
example_data = {
    'alice': ['', '', '', '2016-05-24', ''],
    'bob': ['', '2015-01-02', '', '', '2012-09-15'],
    'eve': ['2011-12-31', '', '1998-08-13', '', ''],
}
df = pd.DataFrame.from_dict(example_data, orient='index')
df.columns = columns

print(df.dtypes)
# great       object
# good        object
# average     object
# bad         object
# horrible    object
# dtype: object

print(df)
#             great        good     average         bad    horrible
# bob                2015-01-02                          2012-09-15
# eve    2011-12-31              1998-08-13                        
# alice                                      2016-05-24     

You can even specify dtype=str in DataFrame.from_dict(), though it is not necessary in this example.
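As a quick sketch of that option, the snippet below passes dtype=str to from_dict() on the same data. It assumes a recent pandas on Python 3.7+ (where dicts keep insertion order), not the 0.18.1 / 2.7 setup from the question. The dtype that gets reported can differ between pandas versions, so no expected output is shown for it.

```python
import pandas as pd

columns = ['great', 'good', 'average', 'bad', 'horrible']
example_data = {
    'alice': ['', '', '', '2016-05-24', ''],
    'bob': ['', '2015-01-02', '', '', '2012-09-15'],
    'eve': ['2011-12-31', '', '1998-08-13', '', ''],
}

# orient='index' makes each dict key a row label; dtype=str asks pandas
# to keep the values as strings instead of inferring another type.
df = pd.DataFrame.from_dict(example_data, orient='index', dtype=str)
df.columns = columns

print(df.dtypes)
print(df.loc['alice', 'bad'])  # the date string stored for alice under 'bad'
```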

EDIT: The DataFrame constructor interprets a dictionary as a collection of columns:

print(pd.DataFrame(example_data))

#         alice         bob         eve
# 0                          2011-12-31
# 1              2015-01-02            
# 2                          1998-08-13
# 3  2016-05-24                        
# 4              2012-09-15            

(I'm dropping the data=, since data is the first argument in the function's signature anyway). Your code confuses rows and columns:

print(pd.DataFrame(example_data, index=example_data.keys(), columns=columns))

#       great good average  bad horrible
# alice   NaN  NaN     NaN  NaN      NaN
# bob     NaN  NaN     NaN  NaN      NaN
# eve     NaN  NaN     NaN  NaN      NaN   

(though I'm not exactly sure how it ends up giving you a DataFrame of NaNs). It would be correct to do

print(pd.DataFrame(example_data, columns=example_data.keys(), index=columns))

#                alice         bob         eve
# great                             2011-12-31
# good                  2015-01-02            
# average                           1998-08-13
# bad       2016-05-24                        
# horrible              2012-09-15   
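A possible explanation for the all-NaN frame: when data is a dict, its keys are taken as column labels, and columns= then selects among those labels instead of renaming them. A requested label that matches no key gets an empty (NaN) column. The sketch below illustrates this with a made-up toy dict (toy and the labels 'a', 'b', 'c', 'd' are hypothetical) and assumes a recent pandas.

```python
import pandas as pd

toy = {'a': ['x', 'y'], 'b': ['p', 'q']}  # hypothetical data, not from the question

# 'a' matches a dict key and keeps its values; 'c' matches nothing,
# so pandas creates it as a column of NaN.
partial = pd.DataFrame(toy, columns=['a', 'c'])
print(partial)

# Same mix-up as in the question: the requested columns match no key at all,
# and the dict keys are passed as row labels instead.
swapped = pd.DataFrame(toy, index=list(toy.keys()), columns=['c', 'd'])
all_missing = bool(swapped.isna().all().all())
print(swapped)
print(all_missing)
```

Seen this way, dtype is not the culprit: no value from the dict ever reaches the frame, so there is nothing for dtype to preserve.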

Specifying the column names is actually unnecessary, since they are already taken from the dictionary keys:

print(pd.DataFrame(example_data, index=columns))

#                alice         bob         eve
# great                             2011-12-31
# good                  2015-01-02            
# average                           1998-08-13
# bad       2016-05-24                        
# horrible              2012-09-15                     

What you want is actually the transpose of this, so you can also simply take that transpose:

print(pd.DataFrame(example_data, index=columns).T)

#             great        good     average         bad    horrible
# alice                                      2016-05-24            
# bob                2015-01-02                          2012-09-15
# eve    2011-12-31              1998-08-13
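To check that the transpose and the from_dict() route agree, the two results can be compared directly. This is only a sketch: it assumes a recent pandas (to_numpy() does not exist in 0.18.1) and Python 3.7+, where dict order is preserved so both frames list the rows in the same order.

```python
import pandas as pd

columns = ['great', 'good', 'average', 'bad', 'horrible']
example_data = {
    'alice': ['', '', '', '2016-05-24', ''],
    'bob': ['', '2015-01-02', '', '', '2012-09-15'],
    'eve': ['2011-12-31', '', '1998-08-13', '', ''],
}

# Route 1: build with people as columns, then transpose.
via_transpose = pd.DataFrame(example_data, index=columns).T

# Route 2: build with people as rows directly, then name the columns.
via_from_dict = pd.DataFrame.from_dict(example_data, orient='index')
via_from_dict.columns = columns

# Compare row/column labels and cell values separately.
same_labels = (list(via_transpose.index) == list(via_from_dict.index)
               and list(via_transpose.columns) == list(via_from_dict.columns))
same_values = bool((via_transpose.to_numpy() == via_from_dict.to_numpy()).all())
print(same_labels, same_values)
```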