leekaiinthesky leekaiinthesky - 6 months ago 20
Python Question

How do I diagnose an egregious difference in Python script (pandas aggregate) runtime between two machines?

I have the same Python script running on two machines, on the same input and giving the same output. It takes about 7 minutes to run on my laptop and 56 minutes to run on a server that I uploaded it to.

Obviously, the environment is different, but the difference in speed (8x!) was a little too egregious for me to just write it off as different environments. Both machines have plenty of memory, and the server has more cores than my laptop, but also more stuff running on it (plus I don't think the cores matter for this operation). It's normal for me to run stuff on both my laptop and the server. I've never had a 2x difference in speed before, not to mention a 10x one.

The script is pretty sample, most of the work is a

pandas.DataFrame.groupby().aggregate(lambda)
, but on a large (~400 MB) table.

How can I diagnose what's causing the speed difference?




Below is what I tried so far, but I know very little about this, so it may or may not be interesting.

cProfile

I tried running
python -m cProfile
in both environments to see if any particular function call was taking a long time, but it seemed to be across the board:

Laptop:

ncalls tottime percall cumtime percall filename:lineno(function)
93 56.976 0.613 372.694 4.007 {method 'get_result' of 'pandas.lib.SeriesGrouper' objects}
16338970 46.629 0.000 250.104 0.000 aggregate_sv_calls_by_sv_id.py:42(<lambda>)
18442616 33.808 0.000 56.950 0.000 {sorted}
18442645 25.395 0.000 172.113 0.000 series.py:1033(__iter__)
78056747/78056745 15.405 0.000 15.420 0.000 {isinstance}
18446903 14.235 0.000 34.129 0.000 dtypes.py:68(is_dtype)
18443264 13.515 0.000 21.058 0.000 internals.py:3806(dtype)
18442666 13.447 0.000 30.854 0.000 common.py:2192(is_datetime_or_timedelta_dtype)
18449428 13.250 0.000 13.250 0.000 {hasattr}
18442793 13.001 0.000 19.134 0.000 internals.py:3833(internal_values)
...


Server:

ncalls tottime percall cumtime percall filename:lineno(function)
312253983 175.564 0.000 275.203 0.000 {isinstance}
93 153.294 1.648 3323.806 35.740 groupby.py:1885(_aggregate_series_pure_python)
16338970 151.336 0.000 749.431 0.000 aggregate_sv_calls_by_sv_id.py:42(<lambda>)
18443129 146.950 0.000 607.143 0.000 internals.py:2482(make_block)
18442874 136.363 0.000 529.415 0.000 series.py:120(__init__)
18443241 125.725 0.000 248.772 0.000 generic.py:2674(__setattr__)
18442596 106.353 0.000 1336.629 0.000 internals.py:3868(get_slice)
18442617 106.296 0.000 172.363 0.000 {sorted}
18442596 89.203 0.000 2105.284 0.000 series.py:689(_get_values)
18443048 84.777 0.000 91.065 0.000 base.py:506(_get_attributes_dict)
...


Since it seemed to be across the board rather than any one function I could isolate, I tried to see if I could reduce the problem to a minimum, complete, verifiable example... This was the best I got:

%timeit

Laptop:

In [5]: df = pd.DataFrame(np.random.randint(10, size=(10000, 50)))

In [6]: %timeit df.groupby(0).sum()
100 loops, best of 3: 5.54 ms per loop

In [7]: %timeit df.groupby(0).agg(lambda x: sum(x))
1 loops, best of 3: 124 ms per loop

In [8]: %timeit df.groupby(0, 1).agg(lambda x: sum(x))
1 loops, best of 3: 155 ms per loop


Server:

In [5]: df = pd.DataFrame(np.random.randint(10, size=(10000, 50)))

In [6]: %timeit df.groupby(0).sum()
100 loops, best of 3: 6.08 ms per loop

In [7]: %timeit df.groupby(0).agg(lambda x: sum(x))
1 loops, best of 3: 215 ms per loop

In [8]: %timeit df.groupby(0, 1).agg(lambda x: sum(x))
1 loops, best of 3: 327 ms per loop


So this doesn't quite capture the 8x speed difference. It only shows about a 2x speed difference.

But this size of difference (2x) I feel like I am used to chalking up to some kind of environmental factors that maybe I won't be able to figure out, like other load on the server or something like that, or different processors. If I'm willing to write off a 2x difference as different machines, does that mean that I also have to be willing to accept an 8x difference as just different machines? Or is there more interesting stuff to dig into here?




Conclusion:

My
pandas
version was 0.17.1 on my laptop and 0.18.1 on the server. I compared the same code running on my laptop between 0.17.1 and 0.18.1, and 0.17.1 ran about 4x as fast. So ~4x can be attributed to that, and ~2x can be attributed to platform differences between the laptop and server.

Answer

I suspect you're looking at two, maybe even three different causes.

By comparing the two profiles, I see:

  • the same routine (sorted), part of the standard distribution, executed around the same number of times, shows a 3X time difference. This routine is CPU and memory bound, has no other needs, so this may indicate a platform difference (faster CPU, slower memory, massive page faulting and possibly swap thrashing). The same can be seen in calls to aggregate_sv_calls_by_sv_id.py.

  • said routine executes one time less on the server, which indicates a difference between the run of the same algorithm. This might be the same as the next element, or indicate some sort of indeterminacy.

  • what appears to be the same routine, and is executed the same nontrivial number of times (93) on laptop and server, is called {method 'get_result' of 'pandas.lib.SeriesGrouper' objects} here, and groupby.py:1885(_aggregate_series_pure_python) there. This means there's a difference in the software base.

All in all, I'd go out on a limb and say that your software version is, per se, around 2.5 times faster than the server's. Then, the server is running the script 3 times slower. And 2.5 * 3 gives around that 8X factor that you observe, if this is reproducible. Otherwise, either the causes are really three and not two, and the above indeterminacy plays a part, or the 3X slowdown on server is due to aleatory circumstances (e.g. system load, very likely).