I have tried to puzzle out an answer to this question for many months while learning pandas. I use SAS for my day-to-day work and it is great for it's out-of-core support. However, SAS is horrible as a piece of software for numerous other reasons.
One day I hope to replace my use of SAS with python and pandas, but I currently lack an out-of-core workflow for large datasets. I'm not talking about "big data" that requires a distributed network, but rather files too large to fit in memory but small enough to fit on a hard-drive.
My first thought is to use
if var1 > 2 then newvar = 'A' elif var2 = 4 then newvar = 'B'
I routinely use tens of gigabytes of data in just this fashion e.g. I have tables on disk that I read via queries, create data and append back.
Details which will affect how you store your data, like:
Give as much detail as you can; and I can help you develop a structure.
Ensure you have pandas at least
Since pytables is optimized to operate on row-wise (which is what you query on), we will create a table for each group of fields. This way it's easy to select a small group of fields (which will work with a big table, but it's more efficient to do it this way... I think I may be able to fix this limitation in the future... this is more intuitive anyhow):
(The following is pseudocode.)
import numpy as np import pandas as pd # create a store store = pd.HDFStore('mystore.h5') # this is the key to your storage: # this maps your fields to a specific group, and defines # what you want to have as data_columns. # you might want to create a nice class wrapping this # (as you will want to have this map and its inversion) group_map = dict( A = dict(fields = ['field_1','field_2',.....], dc = ['field_1',....,'field_5']), B = dict(fields = ['field_10',...... ], dc = ['field_10']), ..... REPORTING_ONLY = dict(fields = ['field_1000','field_1001',...], dc = ), ) group_map_inverted = dict() for g, v in group_map.items(): group_map_inverted.update(dict([ (f,g) for f in v['fields'] ]))
Reading in the files and creating the storage (essentially doing what
for f in files: # read in the file, additional options hmay be necessary here # the chunksize is not strictly necessary, you may be able to slurp each # file into memory in which case just eliminate this part of the loop # (you can also change chunksize if necessary) for chunk in pd.read_table(f, chunksize=50000): # we are going to append to each table by group # we are not going to create indexes at this time # but we *ARE* going to create (some) data_columns # figure out the field groupings for g, v in group_map.items(): # create the frame for this group frame = chunk.reindex(columns = v['fields'], copy = False) # append it store.append(g, frame, index=False, data_columns = v['dc'])
Now you have all of the tables in the file (actually you could store them in separate files if you wish, you would prob have to add the filename to the group_map, but probably this isn't necessary).
This is how you get columns and create new ones:
frame = store.select(group_that_I_want) # you can optionally specify: # columns = a list of the columns IN THAT GROUP (if you wanted to # select only say 3 out of the 20 columns in this sub-table) # and a where clause if you want a subset of the rows # do calculations on this frame new_frame = cool_function_on_frame(frame) # to 'add columns', create a new group (you probably want to # limit the columns in this new_group to be only NEW ones # (e.g. so you don't overlap from the other tables) # add this info to the group_map store.append(new_group, new_frame.reindex(columns = new_columns_created, copy = False), data_columns = new_columns_created)
When you are ready for post_processing:
# This may be a bit tricky; and depends what you are actually doing. # I may need to modify this function to be a bit more general: report_data = store.select_as_multiple([groups_1,groups_2,.....], where =['field_1>0', 'field_1000=foo'], selector = group_1)
About data_columns, you don't actually need to define ANY data_columns; they allow you to sub-select rows based on the column. E.g. something like:
store.select(group, where = ['field_1000=foo', 'field_1001>0'])
They may be most interesting to you in the final report generation stage (essentially a data column is segregated from other columns, which might impact efficiency somewhat if you define a lot).
You also might want to:
Let me know when you have questions!