Manu Manu - 1 month ago 10x
Python Question

Applying Conditions on Pandas DataFrame Columns before reading csv or tsv files

Is it possible to set conditions (filters) for the DataFrame columns before reading a csv or tsv files, If I am already aware of the column names and types? If yes, how?

For Example: Consider there are two numerical columns (col1 and col2) in a very big file. I do not want to load whole file in the memory and select only those rows where col1 greater than col2. Therefore, first, I want to set the condition on the dataframe that it should read only those rows from the csv file where col1 is greater than col2. I hope my explanation make sense.



You can use blaze for this which is a handy tool to have alongside pandas.

Let's assume an input file of:


We then open the file and query the data - note that the query isn't executed until you attempt to materialise/access it:

import blaze
import pandas as pd

csv_data = blaze.Data('input.csv')
query = csv_data[csv_data['a'] > csv_data['b']]
df = pd.DataFrame.from_records(query, columns=query.fields)

That then gives df as:

   a  b
0  5  3
1  6  1