tlg tlg - 13 days ago 9
Python Question

Speed to iterate multiple times over a generator compared to a list

I expected that in the case of multiple loops list iteration will be much faster than using a generator, and my code suggests this is false.

My understanding is (by operation I mean any expression defining an element):




  • a list requires n operations to be initialized

  • but then every loop over the list is just grabbing an element from memory

  • thus, m loops over a list require only n operations

  • a generator does not require any operations to be initialized

  • however, looping over generator runs operations in fly

  • thus, one loop over a generator requires n operations

  • but m loops over a generator require n x m operations




And I checked my expectations using the following code:

from timeit import timeit

def pow2_list(n):
"""Return a list with powers of 2"""

results = []

for i in range(n):
results.append(2**i)

return results

def pow2_gen(n):
"""Generator of powers of 2"""

for i in range(n):
yield 2**i

def loop(iterator, n=1000):
"""Loop n times over iterable object"""

for _ in range(n):
for _ in iterator:
pass

l = pow2_list(1000) # point to a list
g = pow2_gen(1000) # point to a generator


time_list = \
timeit("loop(l)", setup="from __main__ import loop, l", number=10)

time_gen = \
timeit("loop(g)", setup="from __main__ import loop, g", number=10)

print("Loops over list took: ", time_list)
print("Loops over generator took: ", time_gen)


And the results surprised me...

Loops over list took: 0.20484769299946493
Loops over generator took: 0.0019217690005461918


Somehow using generators appears much faster than lists, even when looping over 1000 times. And in this case we are talking about two orders of magnitude! Why?

EDIT:

Thanks for answers. Now I see my mistake. I wrongly assumed that generator starts from beginning on a new loop, like range:

>>> x = range(10)
>>> sum(x)
45
>>> sum(x)
45


But this was naive (range is not a generator...).

Regarding possible duplicate comment: my problem concerned multiple loops over generator, which is not explained in the other thread.

Answer

Your generator is actually only looping once. Once created with pow2_gen, g stores a generator; the very first time through loop, this generator is consumed, and emits StopIteration. The other times through loop, g.next() just continues to throw StopIteration, so, in effect g represents an empty sequence.

To make the comparison more fair, you would need to re-create the generator every time you loop.

A further difficulty with the way you've approached this is that you're calling append to build your list, which is likely the very slowest way to construct a list. More often, lists are built with list comprehensions.

The following code lets us pick apart the timing a bit more carefully. create_list and create_gen create lists and generators, respectively, using list comprehension and generator expressions. time_loop is like your loop method, while time_apply is a version of loop that re-creates the iterable each time through the loop.

def create_list(n=1000):
    return [2**i for i in range(n)]

def create_gen(n=1000):
    return (2**i for i in range(n))

def time_loop(iterator, n=1000):
    for t in range(n):
        for v in iterator:
            pass

def time_apply(create_fn, fn_arg, n=1000):
    for t in range(n):
        iterator = create_fn(fn_arg)
        time_loop(iterator, 1)

print('time_loop(create_list): %.3f' % timeit("time_loop(create_list(1000))",
                                              setup="from __main__ import *",
                                              number=10))

print('time_loop(create_gen): %.3f' % timeit("time_loop(create_gen(1000))",
                                             setup="from __main__ import *",
                                             number=10))

print('time_apply(create_list): %.3f' % timeit("time_apply(create_list, 1000)",
                                               setup="from __main__ import *",
                                               number=10))

print('time_apply(create_gen): %.3f' % timeit("time_apply(create_gen, 1000)",
                                              setup="from __main__ import *",
                                              number=10))

Results on my box suggest that building a list (time_apply(create_list)) is similar in time to (or maybe even faster than) building a generator (time_apply(create_gen)).

time_loop(create_list): 0.244
time_loop(create_gen): 0.028
time_apply(create_list): 21.190
time_apply(create_gen): 21.555

You can see the same effect you've documented in your question, which is that time_loop(create_gen) is an order of magnitude faster than time_loop(create_list). Again, this is because the generator created is only being iterated over once, rather than the many loops over the list.

As you hypothesise, building a list once and iterating over it many times (time_loop(create_list)) is faster than iterating over a generator many times (time_apply(create_gen)) in this particular scenario.

The trade-off between list and generator is going to be strongly dependent on how big the iterator you're creating is. With 1000 items, I would expect lists to be pretty fast. With 100,000 items, things might look different.

print('create big list: %.3f' % timeit("l = create_list(100000)",
                                       setup="from __main__ import *",
                                       number=10))

print('create big gen: %.3f' % timeit("g = create_gen(100000)",
                                      setup="from __main__ import *",
                                      number=10))

Here I'm getting:

create big list: 209.748
create big gen: 0.023

Python uses up between 700 and 800 MB of memory building the big list; the generator uses almost nothing at all. Memory allocation and garbage cleanup are computationally expensive in Python, and predictably make your code slow; generators are a very simple way to avoid gobbling up your machine's RAM, and can make a big difference to runtime.