George Sovetov George Sovetov - 3 months ago 13x
Python Question

Histogram in Python in functional way in O(n)

Not sure how to formulate question properly, please correct me if I'm wrong.

Say I want to make histogram in functional way in pure Python.

def add_value(bins, x):
i = int(x // 10)
return (*bins[:i], bins[i] + 1, bins[i + 1:])

def histogram(data):
return reduce(data, add_value, (0,) * 10)

It's O(n^2), right? Is it possible somehow to optimize it to O(n)?

I'm tempted to change tuple to list:

def add_value(bins, x):
bins[int(x // 10)] += 1
return bins

def histogram(data):
return reduce(data, add_value, [0] * 10)

But I don't feel confident about mixing mutable lists and functional style that imply immutability. Are there any dangerous pitfall connected to this? I haven't found any.

I'm asking about merely technical problems, not something ideological or opinion-based.

bsa bsa

Let me start by saying I'm still learning my way around functional programming myself, so I'm no expert.

To answer your first question, yes, I agree that your functional implementation is O(n^2), as you are rebuilding the entire tuple at each step.

For your second question, here is a functional implementation of histogram in Python which I believe to be O(n):

from itertools import groupby

def histogram(data):
    return dict(map(lambda x: (x[0], sum(1 for i in x[1])), groupby(data, lambda x: x // 10)))

It doesn't look much like your original program, and the histogram is returned as a dictionary instead of a list. If you want an immutable result, call tuple instead of dict. Or just leave it as a map result.

You might think it's not very readable, and I would tend to agree. As much as I love Python, it's not the best language for functional programming. Not terrible, but not the best. Idiomatic Python tends to discourage the use of map, reduce etc. A Pythonic histogram implementation would initialize an empty list or hash, and iterate over the data in a plain old for loop. But I'm digressing.

To make the implementation a little clearer, take a look at this equivalent implementation in Ruby. I picked Ruby because it's pretty close to Python, but slightly more biased towards functional programming. Functional code in Ruby tends to be more readable than in Python.

def histogram(data)
    data.group_by { |i| i / 10 }.map { |b, values| [b, values.length] }.to_h

As for "dangerous pitfalls" and "technical problems": in the case of single threaded execution, there is no problem using mutable data structures in functional programs.

But, imagine that your histogram calculation was to run on a large data set, using multiple CPU cores or even a cluster of CPUs. The input data is broken up into chunks, and your add_value function is executed concurrently in multiple threads of execution.

If all the threads operate on a shared mutable list, two threads can modify an element in your bins list at the same time, like this:

  1. Thread A reads a value from data. Let's say, 33. 33 // 10 == 3
  2. Thread B reads a value from data. Let's say, 36. 36 // 10 == 3
  3. Thread A reads bins[3] == 100.
  4. Thread B also reads bins[3] == 100.
  5. Thread A calculates 100 + 1, and writes hist[3] = 101.
  6. Thread B calculates 100 + 1, and writes hist[3] = 101. Uh oh.

To prevent this, you could create or use an existing data structure with built in locking mechanisms. These have a performance cost though, which may become a bottleneck as you use more threads.

If you used functional code and immutable data structures, you wouldn't have to manage concurrent access to shared data. Instead, each thread independently calculates a partial histogram. Then, these partial histograms would be merged by reduction into a single result. Merging histograms would be done by simply "adding them together".

I hope that makes things clearer and answers your questions.