Is it possible to set conditions (filters) for the DataFrame columns before reading a csv or tsv files, If I am already aware of the column names and types? If yes, how?
For Example: Consider there are two numerical columns (col1 and col2) in a very big file. I do not want to load whole file in the memory and select only those rows where col1 greater than col2. Therefore, first, I want to set the condition on the dataframe that it should read only those rows from the csv file where col1 is greater than col2. I hope my explanation make sense.
You can use blaze for this which is a handy tool to have alongside
Let's assume an input file of:
a,b 1,2 3,4 5,3 3,6 6,1
We then open the file and query the data - note that the query isn't executed until you attempt to materialise/access it:
import blaze import pandas as pd csv_data = blaze.Data('input.csv') query = csv_data[csv_data['a'] > csv_data['b']] df = pd.DataFrame.from_records(query, columns=query.fields)
That then gives
a b 0 5 3 1 6 1