vollkorn - 3 months ago
Python Question

Count occurrences of certain values in dask.dataframe

I have a dataframe like this:

df.head()
day time resource_record
0 27 00:00:00 AAAA
1 27 00:00:00 A
2 27 00:00:00 AAAA
3 27 00:00:01 A
4 27 00:00:02 A


and want to find out how many occurrences of certain resource_record values exist.

My first try was using the Series returned by value_counts(), which seems great, but it does not allow me to exclude some labels afterwards, because drop() is not implemented in dask.Series.

So I tried just to not print the undesired labels:

for row in df.resource_record.value_counts().iteritems():
    if row[0] in ['AAAA']:
        continue
    print('\t{0}\t{1}'.format(row[1], row[0]))


This works fine, but what if I ever want to work further with this data and really want it 'cleaned'? So I searched the docs a bit more and found mask(), but this feels a bit clumsy as well:

records = df.resource_record.mask(df.resource_record.map(lambda x: x in ['AAAA'])).value_counts()


I looked for a method that would let me count individual values, but count() counts all values that are not NaN.

Then I found str.contains(), but I don't know how to handle the undocumented Scalar type I get returned by this code:

print(df.resource_record.str.contains('A').sum())


Output:

dd.Scalar<series-..., dtype=int64>


But even after looking at Scalar's code in dask/dataframe/core.py, I didn't find a way of getting its value.

How would you efficiently count the occurrences of a certain set of values in your dataframe?

Answer

In most cases pandas syntax will work with dask as well, with the necessary addition of .compute() (or dask.compute) to actually perform the action. Until you compute, you are merely constructing the graph that defines the action.

I believe the simplest solution to your question is this:

df[df.resource_record!='AAAA'].resource_record.value_counts().compute()

where the expression inside the selector square brackets could be some mapping or function.