Aaron Aaron - 3 months ago 19
Python Question

Unable to merge multiIndexed pandas dataframes

I believe I am ultimately looking for a way to change the dtype of data frame indices. Please allow me to explain:

Each df is multi-indexed on (the same) four levels. One level consists of mixed labels of integers, integer and letters (like D8), and just letters.

However, for df1, the integers within the index labels are surrounded by quotation marks, while for df2, the same integer lables are free of any quotes; i.e.,

df1.index.levels[1]
Index(['Z5', '02', '1C', '26', '2G', '2S', '30', '46', '48', '5M', 'CSA', etc...'], dtype='object', name='BMDIV')

df2.index.levels[1]
Index([ 26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y', '8F',
'8J', 'AN', 'AS', 'C3', 'CA', etc.
dtype='object', name='BMDIV')


When I try to merge these tables

df_merge = pd.merge(df1, df2, how='left', left_index=True, right_index=True)


I get:


TypeError: type object argument after * must be a sequence, not map


Is there a way to change, for example, the type of label in df2 so that the numbers are in quotes and therefore presumably match the corresponding labels in df1?

Answer

One way to change the level values is to build a new MultiIndex and re-assign it to df.index:

import pandas as pd

df = pd.DataFrame(
    {'index':[ 26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y', 
               '8F', '8J', 'AN', 'AS', 'C3', 'CA'], 
     'foo':1, 'bar':2})
df = df.set_index(['index', 'foo'])
level_values = [df.index.get_level_values(i) for i in range(index.nlevels)]
level_values[0] = level_values[0].astype(str)
df.index = pd.MultiIndex.from_arrays(level_values)

which makes the level values strings:

In [53]: df.index.levels[0]
Out[56]: 
Index(['1C', '26', '30', '46', '48', '5M', '72', '7D', '7Y', '8F', '8J', 'AN',
       'AS', 'C3', 'CA'],
      dtype='object', name='index')

Alternatively, you could avoid the somewhat low-level messiness by using reset_index and set_value:

import pandas as pd

df = pd.DataFrame(
    {'index':[ 26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y', 
               '8F', '8J', 'AN', 'AS', 'C3', 'CA'], 
     'foo':1, 'bar':2})
df = df.set_index(['index', 'foo'])

df = df.reset_index('index')
df['index'] = df['index'].astype(str)
df = df.set_index('index', append=True)
df = df.swaplevel(0, 1, axis=0)

which again produces string-valued index level values:

In [67]: df.index.levels[0]
Out[67]: 
Index(['1C', '26', '30', '46', '48', '5M', '72', '7D', '7Y', '8F', '8J', 'AN',
       'AS', 'C3', 'CA'],
      dtype='object', name='index')

Of these two options, using_MultiIndex is faster:

N = 1000
def make_df(N):
    df = pd.DataFrame(
        {'index': np.random.choice(np.array(
            [26, 30, 46, 48, 72, '1C', '5M', '7D', '7Y', 
             '8F', '8J', 'AN', 'AS', 'C3', 'CA'], dtype='O'), size=N), 
         'foo':1, 'bar':2})
    df = df.set_index(['index', 'foo'])
    return df

def using_MultiIndex(df):
    level_values = [df.index.get_level_values(i) for i in range(index.nlevels)]
    level_values[0] = level_values[0].astype(str)
    df.index = pd.MultiIndex.from_arrays(level_values)
    return df

def using_reset_index(df):
    df = df.reset_index('index')
    df['index'] = df['index'].astype(str)
    df = df.set_index('index', append=True)
    df = df.swaplevel(0, 1, axis=0)
    return df


In [81]: %%timeit df = make_df(1000)
   ....: using_MultiIndex(df)
   ....: 
1000 loops, best of 3: 693 µs per loop

In [82]: %%timeit df = make_df(1000)
   ....: using_reset_index(df)
   ....: 
100 loops, best of 3: 2.09 ms per loop
Comments