Asked by vpk, 2 months ago

Pandas: Filter dataframe for values that are too frequent or too rare

On a pandas dataframe, I know I can groupby on one or more columns and then filter values that occur more/less than a given number.

But I want to do this on every column on the dataframe. I want to remove values that are too infrequent (let's say that occur less than 5% of times) or too frequent. As an example, consider a dataframe with following columns:

city of origin, city of destination, distance, type of transport (air/car/foot), time of day, price-interval.

import pandas as pd
import string
import numpy as np

cols = ['city of origin', 'city of destination',
        'distance, type of transport (air/car/foot)',
        'time of day, price-interval']
vals = [(c, np.random.choice(list(string.ascii_lowercase), 100)) for c in cols]
df = pd.DataFrame(dict(vals))
>>> df.head()
  city of destination city of origin distance, type of transport (air/car/foot) time of day, price-interval
0                   f              p                                           a                           n
1                   k              b                                           a                           f
2                   q              s                                           n                           j
3                   h              c                                           g                           u
4                   w              d                                           m                           h


If this is a big dataframe, it makes sense to remove rows containing spurious values, for example if time of day = night occurs only 3% of the time, or if the foot mode of transport is rare, and so on.

I want to remove all such values from all columns (or from a list of columns). One idea I have is to run value_counts on every column, transform the result, and add one column per value_counts; then filter based on whether the counts are above or below a threshold. But I suspect there must be a better way to achieve this?
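The transform idea described above can be sketched as follows, on a hypothetical single-column frame (the column name "transport" and the 25% cutoff are made up for illustration):

```python
import pandas as pd

demo = pd.DataFrame({"transport": ["air", "air", "car", "car", "car", "foot"]})

# Broadcast each value's count back to its rows, then convert to a frequency
freq = demo.groupby("transport")["transport"].transform("count") / len(demo)
kept = demo[freq >= 0.25]  # drop values occurring in fewer than 25% of rows
```

Here "foot" appears in 1 of 6 rows (about 17%), so its row is dropped while the "air" and "car" rows survive.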

Answer

This procedure goes through each column of the DataFrame and eliminates rows whose value in that column occurs less often than a given threshold percentage, shrinking the DataFrame on each iteration.

This answer is similar to that provided by @Ami Tavory, but with a few subtle differences:

  • It normalizes the value counts, so you can use a simple fractional threshold.
  • It calculates the counts only once per column instead of twice, which makes it faster.

Code:

threshold = 0.03  # drop values occurring in 3% of rows or fewer
for col in df:
    # relative frequency of each value in this column
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]
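The question also asks about removing values that are too frequent. A two-sided variant of the same loop is a small change: keep only values whose frequency falls inside a band. The toy frame, column name "A", and cutoffs below are illustrative assumptions:

```python
import pandas as pd

# "x" is too frequent (90%), "z" too rare (2%); only "y" (8%) survives
demo = pd.DataFrame({"A": ["x"] * 90 + ["y"] * 8 + ["z"] * 2})

low, high = 0.05, 0.50
for col in demo:
    counts = demo[col].value_counts(normalize=True)
    keep = counts[(counts > low) & (counts < high)].index
    demo = demo.loc[demo[col].isin(keep)]
```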

Code timing:

df2 = pd.DataFrame(np.random.choice(list(string.ascii_lowercase), (1_000_000, 4)),
                   columns=list('ABCD'))

%%timeit df=df2.copy()
threshold = 0.03
for col in df:
    counts = df[col].value_counts(normalize=True)
    df = df.loc[df[col].isin(counts[counts > threshold].index), :]

1 loops, best of 3: 485 ms per loop

%%timeit df=df2.copy()
m = 0.03 * len(df)
for c in df:
    df = df[df[c].isin(df[c].value_counts()[df[c].value_counts() > m].index)]

1 loops, best of 3: 688 ms per loop
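For completeness, one could also build a single boolean mask in one pass instead of shrinking the frame column by column. Note this is not exactly equivalent: it computes every column's frequencies on the original frame, whereas the loop above recomputes them on the already-filtered frame. A sketch:

```python
import string
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.choice(list(string.ascii_lowercase), (1000, 4)),
                  columns=list('ABCD'))

threshold = 0.03
# Map each cell to its column-wise relative frequency, then keep rows
# where every column's value clears the threshold.
mask = pd.concat(
    [df[c].map(df[c].value_counts(normalize=True)) > threshold for c in df],
    axis=1,
).all(axis=1)
filtered = df[mask]
```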