Noobie Noobie - 5 months ago 25
Python Question

Dask: very low CPU usage and multiple threads? is this expected?

I am using dask as in "how to parallelize many (fuzzy) string comparisons using apply in Pandas?"

Basically I do some computations (without writing anything to disk) that invoke Pandas and Fuzzywuzzy (that may not be releasing the GIL apparently, if that helps) and I run something like:

dmaster = dd.from_pandas(master, npartitions=4)
dmaster = dmaster.assign(my_value=dmaster.original.apply(lambda x: helper(x, slave), name='my_value'))
dmaster.compute(get=dask.multiprocessing.get)


However, a variant of the code has been running for 10 hours now and has not finished yet. I notice in Windows Task Manager that:


  • RAM utilization is pretty low, corresponding to the size of my data

  • CPU usage bounces from 0% up to 5% every 2-3 seconds or so

  • I have about 20 Python processes whose size is 100MB, and one Python process that likely contains the data that is 30GB in size (I have a 128 GB machine with an 8-core CPU)



Question is: is that behavior expected? Am I obviously terribly wrong in setting some dask options here?

Of course, I understand the specifics depend on what exactly I am doing, but maybe the patterns above can already tell that something is horribly wrong?

Many thanks!!

Answer

Of course, I understand the specifics depend on what exactly I am doing, but maybe the patterns above can already tell that something is horribly wrong?

This is pretty spot on. Identifying performance issues is tricky, especially when parallel computing comes into play. Here are some things that come to mind.

  1. The multiprocessing scheduler has to move data between processes every time a task runs. This serialization/deserialization cycle can be quite expensive. The distributed scheduler would handle this better.
  2. Your function helper could be doing something odd.
  3. Generally, using apply, even in Pandas, is best avoided.
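
One rough way to check the first two points is to time helper on a handful of rows and compare that with the cost of serializing the data each worker needs. The sketch below uses only the standard library. The helper here is a hypothetical stand-in built on difflib (not Fuzzywuzzy), the data is synthetic, and plain pickle is only a proxy for whatever serialization the scheduler actually performs, so treat the numbers as an order-of-magnitude guide.

```python
import pickle
import time
from difflib import SequenceMatcher


def helper(value, candidates):
    """Hypothetical stand-in for the question's helper: return the closest candidate."""
    return max(candidates, key=lambda c: SequenceMatcher(None, value, c).ratio())


def time_per_call(func, sample, *args):
    """Average wall-clock seconds per call of func over a small sample."""
    start = time.perf_counter()
    for item in sample:
        func(item, *args)
    return (time.perf_counter() - start) / len(sample)


def pickle_round_trip_seconds(obj):
    """Seconds to serialize and deserialize obj once (a proxy for inter-process transfer)."""
    start = time.perf_counter()
    pickle.loads(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    return time.perf_counter() - start


# Synthetic data, assumed for illustration only.
slave = [f"company {i} ltd" for i in range(500)]
sample = [f"compnay {i} limited" for i in range(10)]

per_call = time_per_call(helper, sample, slave)
transfer = pickle_round_trip_seconds(slave)

print(f"helper: {per_call:.5f} s per row")
print(f"pickle round trip of slave: {transfer:.5f} s")
print(f"estimated single-core time for 1e6 rows: {per_call * 1e6 / 3600:.2f} h")
```

If the per-row time multiplied by the number of rows already comes to many hours, the bottleneck is probably helper itself. If the transfer time dominates, the scheduler choice is the more likely culprit.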

Generally, a good way to pin down these problems is to create a minimal, complete, verifiable example that others can reproduce and play with easily. Often, in creating such an example, you find the solution to your problem anyway. If that doesn't happen, at least you can then pass the buck to the library maintainers. Until such an example exists, most library maintainers don't bother to spend their time on it; there are almost always too many details specific to the problem at hand to warrant free service.
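
As a starting point for such an example, here is a minimal sketch that builds small, deterministic stand-ins for the master and slave data from the question. The function name make_example, the sizes, and the typo scheme are all assumptions for illustration; the real data may look quite different.

```python
import random
import string


def make_example(n_master=100, n_slave=50, seed=0):
    """Build small, deterministic stand-ins for the question's master and slave data."""
    rng = random.Random(seed)  # fixed seed so everyone gets the same strings

    def random_name(length=12):
        return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))

    def with_typo(name):
        # Overwrite one position with a random letter (possibly the same one)
        # to imitate a near-duplicate entry that needs fuzzy matching.
        pos = rng.randrange(len(name))
        return name[:pos] + rng.choice(string.ascii_lowercase) + name[pos + 1:]

    slave = [random_name() for _ in range(n_slave)]
    master = [with_typo(rng.choice(slave)) for _ in range(n_master)]
    return master, slave


master, slave = make_example()
print(len(master), len(slave))
print(master[:3])
```

The two lists can then be wrapped in a Pandas DataFrame and passed through the same dd.from_pandas call as in the question, which gives others something small to run and time.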