Lucidnonsense - 6 months ago
Python Question

Overloading in python - pandas

I'm building a database-type object which, when an index is not found, uses an API to retrieve the information, saves it to the object/file and returns it.

I'd like to do this by overloading the .loc[x, y] method of the pandas DataFrame, but I can't work out how to do this!

At the moment I have:

import pandas as pd

class DataBase(pd.DataFrame):
    """DataBase Object which can be updated by external api"""
    def __init__(self, path, api=None):
        super(DataBase, self).__init__(pd.read_hdf('store.h5','df'))
        self.api = api

I may want to change the read_hdf call to include an extra argument so I can read only what I need to.

I can't think of a way to overload the .loc method properly!

Also, HDF5 is just one storage method. I'd like to retain the ability to use other storage methods such as SQL, or even CSV if necessary.


loc is a property that returns an attribute called _loc if it is not None; otherwise it creates a pandas.core.indexing._LocIndexer on demand. Indexers, by default, have access to the DataFrame that created them, so you can modify the DataFrame on a key miss.
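
To make that mechanism concrete, here is a minimal, pandas-free sketch of the lazy-indexer pattern just described. Indexer and Frame are hypothetical stand-ins invented for illustration, not pandas classes; only the _loc attribute name and the (obj, name) argument order are taken from this answer.

```python
class Indexer:
    """Hypothetical stand-in for an indexer; it keeps a reference to its owner."""

    def __init__(self, obj, name):
        self.obj = obj    # the object that created this indexer
        self.name = name

    def __getitem__(self, key):
        # Look the key up in the owner's data.
        return self.obj.data[key]


class Frame:
    """Hypothetical stand-in for a DataFrame-like container."""

    def __init__(self, data):
        self.data = dict(data)
        self._loc = None  # no indexer has been built yet

    @property
    def loc(self):
        # Reuse the cached indexer if there is one, otherwise build it on demand.
        if self._loc is None:
            self._loc = Indexer(self, "loc")
        return self._loc


frame = Frame({"a": 1})
print(frame.loc["a"])          # 1
print(frame.loc is frame.loc)  # True: the indexer is created once and cached
print(frame.loc.obj is frame)  # True: the indexer can reach its owner
```

Because the indexer holds a reference to its owner, anything it does inside __getitem__ can read or modify the owner's data, which is what makes the key-miss hook possible.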

You can override the behaviour of DataFrame.loc by subclassing DataFrame and _LocIndexer as follows.

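from pandas import DataFrame
from pandas.core.indexing import _LocIndexer

class MyLocIndexer(_LocIndexer):
    def __getitem__(self, key):
        try:
            return super().__getitem__(key)
        except KeyError:
            # Key miss: db is not defined here; it stands for whatever
            # object wraps the external api.
            item = db.fetch_item(key)
            self[key] = item
            return item
            # `return self[key]' is better as it also works when accessing a
            # whole axis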

class MyDataFrame(DataFrame):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._loc = MyLocIndexer(self, "loc")

The above is written in Python 3, so you will have to fix the super() calls if you are using Python 2.
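
The fetch-on-miss idea itself does not depend on pandas internals. As a rough illustration, the sketch below applies the same try/except KeyError fallback to a plain dict-backed store. CachingStore and fake_api are hypothetical names invented for this example; a real version would call the external API where fake_api is used.

```python
class CachingStore:
    """Hypothetical dict-backed store that fetches missing keys on demand."""

    def __init__(self, fetch):
        self._data = {}
        self._fetch = fetch  # callable standing in for the external API

    def __getitem__(self, key):
        try:
            return self._data[key]
        except KeyError:
            # Key miss: fetch the value, cache it, then return the cached copy.
            self._data[key] = self._fetch(key)
            return self._data[key]


calls = []

def fake_api(key):
    """Stand-in for a remote lookup; records each call so caching is visible."""
    calls.append(key)
    return key.upper()


store = CachingStore(fake_api)
print(store["abc"])  # ABC (fetched through fake_api)
print(store["abc"])  # ABC (served from the cache)
print(calls)         # ['abc']
```

Returning the freshly stored value through the normal lookup path, rather than the raw fetched item, mirrors the `return self[key]' suggestion in the comment above.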