Max Power Max Power - 1 month ago 12
Python Question

Python Read HDF5 Rows Where Index Not in List

I read a random subset of rows from an HDF table with the following Python/Pandas code:

hdf_store = pd.HDFStore('path_to_data.h5')
total_rows = hdf_store.get_storer('hdf_table_name').nrows

num_rows = int(total_rows * .25)
row_indices = np.random.randint(0,rows_indices,size=num_rows)

my_df = pd.read_hdf(hdf_store, 'hdf_table_name', where=pd.Index(row_indices))


Later in the program, I would like to pull the rest of the rows of data from the HDF5 table. But the following threw errors:

rest_of_rows = pd.read_hdf(hdf_store, 'hdf_table_name',
where=pd.Index(not in (row_indices)))

rest_of_rows = pd.read_hdf(hdf_store, 'hdf_table_name',
where=not pd.Index(row_indices))


Is there a way to pull HDF rows by records not in a list of indices?

Because the table is larger than my RAM, I would like to avoid pulling all the rows from HDF upfront (even in chunks) and then splitting it to hold both tables at once. I could map the indices to another column, and subset on rows not in the mapped value of that column. But that would presumably be a lot slower than querying on the index directly.

Answer

You can use Index.difference method.

Demo:

# randomly select 25% of index elements (without duplicates `replace=False`)
sample_idx = np.random.choice(np.arange(total_rows), total_rows//4, replace=False)

# select remaining index elements
rest_idx = pd.Index(np.arange(total_rows)).difference(sample_idx)

# get rest rows by index
rest = store.select('hdf_table_name', where=rest_idx)

PS optionally you may want to select rest of rows in chunks...

Comments