Sangeeth Saravanaraj Sangeeth Saravanaraj - 3 months ago 8x
Python Question

What is the best data structure for storing a set of four (or more) values?

Say I have the following

and its corresponding
which represents a

name = 'abc'
age = 23
weight = 60
height = 174

Please note that the
could be of different
, reference-to-any-other-object, etc).

There will be many
(at least >100,000). Each and every
will be
when all these four
(actually its
) are put together. In other words, there exists no
with all 4
are the same.

I am trying to find an efficient data structure in
which will allow me to (store and) retrieve
based on any one of these
time complexity.

For example:

def retrieve(name=None,age=None,weight=None,height=None)
if name is not None and age is None and weight is None and height is None:
/* get all records with the given name */
if name is None and age is not None and weight is None and height is None:
/* get all records with the given age */
return records

The way the
should be called is as follows:


The above should return
[{name:'abc', age:23, wight:50, height=175}, {name:'abc', age:28, wight:55, height=170}, etc]


The above should return
[{name:'abc', age:23, wight:50, height=175}, {name:'def', age:23, wight:65, height=180}, etc]

And, I may need to add one or two more
to this record in future. For example, say,
sex = 'm'
. So, the
function must be scalable.

So in short: Is there a data structure in
which will allow
storing a record
number of
(name, age, sex, weigh, height, etc) and
retrieving records
based on any (one) of the
(or ideally
constant - O(1)
look-up time) complexity?


There isn't a single data structure built into Python that does everything you want, but it's fairly easy to use a combination of the ones it has to achieve your goals and do so fairly efficiently.

For example, say your input was the following data in a comma-separated-value file called employees.csv with field names defined as shown by the first line:

Bob Barker,25,175,6ft 2in
Ted Kingston,28,163,5ft 10in
Mary Manson,27,140,5ft 6in
Sue Sommers,27,132,5ft 8in
Alice Toklas,24,124,5ft 6in

The following is working code which illustrates how to read and store this data into a list of records, and automatically create separate look-up tables for finding records associated with the values of contained in the fields each of these record.

The records are instances of a class created by namedtuple which is a very memory efficient because each one lacks a __dict__ attribute that class instances normally contain. Using them will make it possible to access the fields of each by name using dot syntax, like record.fieldname.

The look-up tables are defaultdict(list) instances, which provide dictionary-like O(1) look-up times on average, and also allow multiple values to be associated with each one. So the look-up key is the value of the field value being sought and the data associated with it will be a list of the integer indices of the Person records stored in the employees list with that value — so they'll all be relatively small.

Note that the code for the class is completely data-driven in that it doesn't contain any hardcoded field names which instead are all taken from the first row of csv data input file when it's read in. When an instance is used, any actual retrieve() method calls must, of course, contain valid field name keyword arguments.


Modified to not create a lookup table for every unique value of every field when the data file is first read. Now the retrieve() method creates them only as needed (and saves/caches the result for future use).

from collections import defaultdict, namedtuple
import csv

class DataBase(object):
    def __init__(self, csv_filename, recordname):
        # read data from csv format file into a list of namedtuples
        with open(csv_filename, 'rb') as inputfile:
            csv_reader = csv.reader(inputfile, delimiter=',')
            self.fields =  # read header row
            self.Record = namedtuple(recordname, self.fields)
            self.records = [self.Record(*row) for row in csv_reader]
            self.valid_fieldnames = set(self.fields)
        # create an empty table of lookup tables for each field name that maps
        # each unique field value to a list of record-list indices of the ones
        # that contain it
        self.lookup_tables = defaultdict(lambda: defaultdict(list))

    def retrieve(self, **kwargs):
        """ Fetch a list of records with a field name with the value supplied
            as a keyword arg (or return None if there aren't any). """
        if len(kwargs) != 1: raise ValueError(
            'Exactly one fieldname/keyword argument required for function '
            '(%s specified)' % ', '.join([repr(k) for k in kwargs.keys()]))
        field, value = kwargs.items()[0]  # get only keyword arg and value
        if field not in self.valid_fieldnames:
            raise ValueError('keyword arg "%s" isn\'t a valid field name' % field)
        if field not in self.lookup_tables:  # must create field look up table
            for index, record in enumerate(self.records):
                value = getattr(record, field)
        matches = [self.records[index]
                    for index in self.lookup_tables[field].get(value, [])]
        return matches if matches else None

if __name__ == '__main__':
    empdb = DataBase('employees.csv', 'Person')
    print "retrieve(name='Ted Kingston'):", empdb.retrieve(name='Ted Kingston')
    print "retrieve(age='27'):", empdb.retrieve(age='27')
    print "retrieve(weight='150'):", empdb.retrieve(weight='150')
        print "retrieve(hight='5ft 6in'):", empdb.retrieve(hight='5ft 6in')
    except ValueError as e:
        print "ValueError('{}') raised as expected".format(e)
        raise type('NoExceptionError', (Exception,), {})(
            'No exception raised from "retrieve(hight=\'5ft\')" call.')


retrieve(name='Ted Kingston'): [Person(name='Ted Kingston', age='28', weight='163', height='5ft 10in')]
retrieve(age='27'): [Person(name='Mary Manson', age='27', weight='140', height='5ft 6in'),
                     Person(name='Sue Sommers', age='27', weight='132', height='5ft 8in')]
retrieve(weight='150'): None
retrieve(hight='5ft 6in'): ValueError('keyword arg "hight" is an invalid fieldname')
                           raised as expected