Abdurrezzak EFE - 3 months ago
Python Question

How to create a specific type of object for .hdf5 files?

My question is about creating an object type (or document type) for .hdf5 files. Each object will have three attributes: an id, a user_id, and a boolean array of size 64. I need to create about 10,000,000 (ten million) of them.

Think of how you would use MongoDB; I need to use these objects in the same way. I have to query the objects that belong to a particular user_id, as well as all of the objects at once.

Any suggestions or help would be appreciated.

Answer

I would use a dictionary-style layout for this case, since I think dictionaries scale well. Because the queries are on user_id, I would make it the key.

The structure would look like this:

{
    'user_id-xyz': {
        'id': 'id-1212',
        'boolarray': [True, False, ...],
    },
    'user_id-abc': {
        ...
    }
}
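The dictionary layout above can be sketched in plain Python. This is only an illustration: the sample records are made up, and it assumes one user may own several objects, so each key maps to a list rather than a single record.

```python
from collections import defaultdict

# Hypothetical sample records: (id, user_id, 64 boolean flags).
records = [
    ("id-1", "user_id-xyz", [True] * 64),
    ("id-2", "user_id-abc", [False] * 64),
    ("id-3", "user_id-xyz", [i % 2 == 0 for i in range(64)]),
]

# Index the records by user_id so a per-user query is a single key lookup.
by_user = defaultdict(list)
for rec_id, user_id, flags in records:
    by_user[user_id].append({"id": rec_id, "boolarray": flags})

# Query the objects of one particular user.
xyz_ids = [rec["id"] for rec in by_user["user_id-xyz"]]

# Query everything by walking all keys.
total = sum(len(recs) for recs in by_user.values())
```

Keying on user_id makes the per-user query cheap, while the "all objects" query is a walk over every key.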

To achieve this, I would use a custom NumPy datatype.

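import h5py
import numpy as np

# One record: a 15-byte string id plus 64 booleans ('?'), 79 bytes in total.
element = np.dtype([('id', 'S15'), ('boolarray', '?', (64,))])

f = h5py.File('foo.hdf5', 'w')
# Flat dataset of records, 64 (bools) + 15 (id) bytes each.
dset = f.create_dataset('blocky', (1000000,), dtype=element)
# One group per user_id; the group name acts as the dictionary key.
grp = f.create_group('user_id-xyz')
# Create a dataset inside the group for each id.
subdataset = grp.create_dataset('ele', (1,), dtype=element)

# Test of membership.
'user_id-xyz' in f
# Retrieval.
f.get('user_id-xyz')
# All keys.
list(f.keys())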
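As an alternative to one group per user, a single flat table with a user_id field can be filtered with a boolean mask. The following is a minimal NumPy-only sketch, not the approach given in this answer: it assumes integer ids, and the field names and sample values are illustrative.

```python
import numpy as np

# Assumed layout: integer id and user_id, plus 64 boolean flags per record.
record = np.dtype([("id", "i8"), ("user_id", "i8"), ("boolarray", "?", (64,))])

# Three hypothetical records; zeros() leaves every flag False.
table = np.zeros(3, dtype=record)
table["id"] = [1, 2, 3]
table["user_id"] = [7, 8, 7]
table["boolarray"][0, :4] = True  # set the first four flags of record 0

# Query one user with a boolean mask; use `table` itself for all records.
mine = table[table["user_id"] == 7]
```

A structured dtype like this can also be passed as the dtype argument of create_dataset in h5py, as the snippet above does with element.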

I hope this helps.
