wl2776 wl2776 - 3 months ago 24
Python Question

How to find why a task fails in dask distributed?

I am developing a distributed computing system using dask.distributed. Tasks that I submit to it with the map function sometimes fail, while others, seemingly identical, run successfully.

Does the framework provide any means to diagnose problems?

By failing I mean that the counter of failed tasks in the Bokeh web UI provided by the scheduler increases. The counter of finished tasks increases too.

The function that the tasks run communicates with a database, retrieves some rows from a table, performs calculations, and updates values.

I have more than 40000 tasks in the map call, so studying the logs is a bit tedious.


If a task fails, then any attempt to retrieve the result will raise the same error that occurred on the worker:

In [1]: from distributed import Client

In [2]: c = Client()

In [3]: def div(x, y):
   ...:     return x / y

In [4]: future = c.submit(div, 1, 0)

In [5]: future.result()
<ipython-input-3-398a43a7781e> in div()
      1 def div(x, y):
----> 2     return x / y

ZeroDivisionError: division by zero
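With tens of thousands of futures from a map call, calling result() on each one stops at the first error. A minimal sketch of an alternative is to ask every future for its exception and group the failures by error type and message. The helper below relies only on future.exception(), which as far as I know blocks until the task finishes and returns the raised exception, or None on success. The names summarize_failures, run_demo and process are hypothetical, and process is only a stand-in for the real database task.

```python
def summarize_failures(inputs, futures):
    """Group failed tasks by exception type and message.

    Relies only on ``future.exception()``, which waits for the task to
    finish and returns the raised exception, or None if it succeeded.
    """
    failures = {}  # (exception type name, message) -> list of failing inputs
    for arg, future in zip(inputs, futures):
        exc = future.exception()
        if exc is not None:
            key = (type(exc).__name__, str(exc))
            failures.setdefault(key, []).append(arg)
    return failures


def run_demo():
    # Imported lazily so the helper above can be used without a cluster.
    from distributed import Client

    def process(x):
        # Hypothetical stand-in for the real task; fails when x % 5 == 0.
        return 10 / (x % 5)

    client = Client(processes=False)  # local in-process cluster for testing
    inputs = list(range(20))
    futures = client.map(process, inputs)
    for (name, message), args in summarize_failures(inputs, futures).items():
        print(f"{name}: {message} ({len(args)} tasks, e.g. input {args[0]})")
    client.close()
```

Calling run_demo() needs distributed installed. In this toy example the inputs that are multiples of 5 end up in the failure summary. Once a failing input is known, resubmitting it alone with submit and calling result() gives the traceback shown above.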

However, other things can go wrong. For example, you might not have the same software on your workers as on your client, your network might not let connections go through, or you might hit any of the other problems that happen in real-world networks. To help diagnose these, there are a few options:

  1. You can use the web interface to track the progress of your tasks and workers
  2. You can start IPython kernels in the scheduler or workers to inspect them directly