I need to manipulate a large amount of numerical/textual data, say a total of 10 billion entries, which could theoretically be organized as 1000 tables of 10000*1000 each.
Most calculations need to be performed on a small subset of data each time (specific rows or columns), such that I don't need all the data at once.
Therefore, I am interested in storing the data in some kind of database, so I can easily search it, retrieve multiple rows/columns matching defined criteria, perform some calculations, and update the database. The database should be accessible from both Python and Matlab; I use Python mainly for creating the raw data and putting it into the database, and Matlab for the data processing.
The whole project runs on Windows 7. What is the best, and above all the simplest, database I can use for this purpose? I have no prior experience with databases at all.
I would recommend SQLite. The default Python installation already has bindings for it.
To use it, install the appropriate SQLite Windows installer.
To create the database you can do something like (from the sqlite3 documentation):
import sqlite3

conn = sqlite3.connect('example.db')
c = conn.cursor()

# Create table
c.execute('''CREATE TABLE stocks
             (date text, trans text, symbol text, qty real, price real)''')

# Insert a row of data
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")

# Save (commit) the changes
conn.commit()

# We can also close the cursor if we are done with it
c.close()
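Since you also want to retrieve rows matching defined criteria and update them, here is a minimal sketch of querying and updating along the same lines (the table and values are just the hypothetical stocks example from above; an in-memory database is used here for illustration, but you can pass 'example.db' to work with the file on disk):

```python
import sqlite3

# ':memory:' keeps this example self-contained; use 'example.db' for the file above
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS stocks
             (date text, trans text, symbol text, qty real, price real)''')
c.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")

# Retrieve rows matching a criterion; use ? placeholders rather than
# string formatting to avoid SQL injection and quoting bugs
c.execute("SELECT date, qty, price FROM stocks WHERE symbol = ?", ('RHAT',))
rows = c.fetchall()

# Update the matching rows, then commit so the change is persisted
c.execute("UPDATE stocks SET qty = qty + ? WHERE symbol = ?", (50, 'RHAT'))
conn.commit()

c.execute("SELECT qty FROM stocks WHERE symbol = ?", ('RHAT',))
updated_qty = c.fetchone()[0]
conn.close()
```

Note that columns declared as "real" store numbers as floats, so the inserted qty of 100 comes back as 100.0.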
To access the database from Matlab, you can use mksqlite.
For more information you might want to check out: http://labrosa.ee.columbia.edu/millionsong/pages/sqlite-interfaces-python-and-matlab