vollkorn vollkorn - 1 year ago 56
Python Question

Count occurrences of certain values in dask.dataframe

I have a dataframe like this:

   day      time resource_record
0   27  00:00:00            AAAA
1   27  00:00:00               A
2   27  00:00:00            AAAA
3   27  00:00:01               A
4   27  00:00:02               A

and want to find out how many occurrences of certain resource_record values there are.

My first try was using the Series returned by value_counts(), which seems great, but does not allow me to exclude some labels afterwards, because the dask Series does not implement a method for dropping them.

So I tried just to not print the undesired labels:

for row in df.resource_record.value_counts().iteritems():
    if row[0] in ['AAAA']:
        print('\t{0}\t{1}'.format(row[1], row[0]))

Which works fine, but what if I ever want to work further on this data and really want it 'cleaned'? So I searched the docs a bit more and found mask(), but this feels a bit clumsy as well:

records = df.resource_record.mask(df.resource_record.map(lambda x: x in ['AAAA'])).value_counts()

I looked for a method which would allow me to just count individual values, but count() counts all values that are not NaN.

Then I found
, but I don't know how to handle the undocumented Scalar type I get returned with this code:



dd.Scalar<series-..., dtype=int64>

But even after looking at Scalar's source code, I didn't find a way of getting its value.

How would you efficiently count the occurrences of a certain set of values in your dataframe?

Answer Source

In most cases pandas syntax works with dask as well, with the necessary addition of .compute() (or dask.compute) to actually perform the action. Until you call compute, you are merely constructing the graph that defines the action.

I believe the simplest solution to your question is this:


Where the expression in the selector square brackets could be some mapping or function.