Sevyns Sevyns - 7 months ago 35
Python Question

What is the appropriate way to replace a DataFrame with a subset of itself using the pandas.DataFrame.query() method?

This question is very similar to one I asked here:

Python Pandas SettingWithCopyWarning copies vs new objects

I'd like to understand how I can exclude records from a given DataFrame (i.e., operate on the DataFrame itself and not on a view of it) while still being able to apply additional operations to the results.

I'm struggling to understand how Python manages reference vs. value assignment when operating on pandas DataFrame objects. I'm working with a dataset held in a pandas DataFrame, and I'd like to reduce it based on certain attribute values. I'd also like to apply additional operations to the results of this reduction. My preferred method is .query(). Here is a simple example:

import pandas as pd
mydf = pd.DataFrame({'col1': ['A', 'B', 'C']})
# Keep only the rows where col1 equals 'A' and rebind the name to the result
mydf = mydf.query("col1 == 'A'")

This conceptually accomplishes what I'm looking for: a reduction of the dataset I'm working with, based on a query against it. My question is this:

"Is this the correct application of the query function, or should I be doing something else if I have additional operations to perform on 'mydf'?"

I've read through this documentation but still don't understand what pitfalls to watch out for...


I think this is the right approach if you don't need the rows that were filtered out. You can also chain your "additional operations" (which is pretty efficient) like this:

 mydf = mydf.query('col1 == "A"').func1(...).func2(...).func3(...)
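To make the chaining pattern concrete, here is a minimal sketch that swaps the placeholder func1/func2/func3 for real DataFrame methods. The col2 and col3 columns and their values are made up for illustration and are not part of the original question.

```python
import pandas as pd

# Hypothetical data: col2 is invented so the chained steps have something to act on.
mydf = pd.DataFrame({"col1": ["A", "B", "A"], "col2": [3, 2, 1]})

# Each method returns a new DataFrame, so the steps can be chained
# and the final result rebound to the same name.
mydf = (
    mydf.query('col1 == "A"')                       # keep only rows where col1 is "A"
        .sort_values("col2")                        # order the remaining rows by col2
        .reset_index(drop=True)                     # renumber the index from 0
        .assign(col3=lambda d: d["col2"] * 10)      # add a derived (hypothetical) column
)

print(mydf)
```

Wrapping the chain in parentheses is only a formatting choice; it lets each step sit on its own line without backslashes.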

Here is a link to the documentation, which has lots of examples of how to use the query() method.
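On the reference-vs-value part of the question, a small sketch may help. It assumes nothing beyond the col1 example from the question; the names original and subset and the flag column are hypothetical. query() returns a new DataFrame object, so the frame it was called on keeps all of its rows unless you rebind the name. In older pandas versions, modifying a filtered result in place can trigger SettingWithCopyWarning, and calling .copy() first is a common way to make the intent explicit.

```python
import pandas as pd

original = pd.DataFrame({"col1": ["A", "B", "C"]})

# query() hands back a new DataFrame; `original` still has all three rows.
subset = original.query("col1 == 'A'")
print(subset is original)            # False
print(len(original), len(subset))    # 3 1

# Take an explicit copy before modifying the subset in place,
# so the change clearly applies to an independent object.
subset = subset.copy()
subset["flag"] = True                # hypothetical extra column

print("flag" in original.columns)    # False
```

Writing mydf = mydf.query(...) simply points the name mydf at the new, smaller frame; the old object is discarded once nothing else refers to it.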