I couldn't find an answer to this:
how do you write a SQL query that first groups the rows of a dataset by a few columns and then keeps only the groups with more rows than a specified size?
Here is an example of what I am trying to achieve, using a pandas DataFrame:
df.groupby([cols_to_group]).filter(lambda x: len(x) > minimum_group_size)
I think this could be a solution:
SELECT *
FROM (
    SELECT *,
           COUNT(*) OVER (PARTITION BY cols_to_group) AS cnt  -- cnt is the number of rows in each group
    FROM yourTable
) t
WHERE t.cnt > minimum_group_size;
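To sanity-check the idea, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The columns a, b and val, the sample rows, and the threshold MINIMUM_GROUP_SIZE are all made up for illustration. Besides the window-function query, it also runs a different formulation (GROUP BY ... HAVING joined back to the table), which should return the same rows and does not need window-function support. The window query is only executed when the bundled SQLite is 3.25 or newer, since older versions lack window functions.

```python
import sqlite3

MINIMUM_GROUP_SIZE = 2  # hypothetical threshold, stands in for minimum_group_size

conn = sqlite3.connect(":memory:")
# Hypothetical table: a and b play the role of cols_to_group.
conn.execute("CREATE TABLE yourTable (a TEXT, b TEXT, val INTEGER)")
conn.executemany(
    "INSERT INTO yourTable VALUES (?, ?, ?)",
    [
        ("x", "p", 1),
        ("x", "p", 2),
        ("x", "p", 3),  # group (x, p) has 3 rows
        ("y", "q", 4),
        ("y", "q", 5),  # group (y, q) has 2 rows
        ("z", "r", 6),  # group (z, r) has 1 row
    ],
)

# Alternative formulation: find the large groups with GROUP BY ... HAVING,
# then join back to the table to recover the full rows.
having_query = """
SELECT t.a, t.b, t.val
FROM yourTable AS t
JOIN (
    SELECT a, b
    FROM yourTable
    GROUP BY a, b
    HAVING COUNT(*) > ?
) AS big
  ON t.a = big.a AND t.b = big.b
ORDER BY t.val
"""
having_rows = conn.execute(having_query, (MINIMUM_GROUP_SIZE,)).fetchall()

# Window-function formulation: attach the group size to every row, then filter.
window_query = """
SELECT a, b, val
FROM (
    SELECT *, COUNT(*) OVER (PARTITION BY a, b) AS cnt
    FROM yourTable
) AS t
WHERE t.cnt > ?
ORDER BY val
"""
window_rows = None
if sqlite3.sqlite_version_info >= (3, 25, 0):  # window functions need SQLite 3.25+
    window_rows = conn.execute(window_query, (MINIMUM_GROUP_SIZE,)).fetchall()

print(having_rows)
print(window_rows)
```

One difference to keep in mind: the join in the HAVING version compares the grouping columns with =, so rows where a grouping column is NULL would be dropped, whereas the window version keeps them as their own group.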