EbraHim - 2 years ago 64
Python Question

# Why creating a generator using `()` need a lot of memory?

Let assume that I want to print square of all the numbers smaller than

`20000000`
in a file for example.

I have three options to have these numbers I guess:

First option is creating a list of squares:

``````import time
import psutil
import gc

gc.collect()

mem_before = psutil.virtual_memory()[3]
time1 = time.time()
x = [i**2 for i in range(20000000)]
time2 = time.time()
mem_after =  psutil.virtual_memory()[3]

print "Used Mem = ", (mem_after - mem_before)/(1024**2) #Byte to Megabyte convert.
print "Calculation time = ", time2 - time1
``````

It is really slow and time consuming:

``````Used Mem =  1270 #Megabytes
Calculation time =  33.9309999943 #Seconds
``````

Second options is creating a generator using
`()`

``````import time
import psutil
import gc

gc.collect()

mem_before = psutil.virtual_memory()[3]
time1 = time.time()
x = (i**2 for i in range(20000000))
time2 = time.time()
mem_after =  psutil.virtual_memory()[3]

print "Used Mem = ", (mem_after - mem_before)/(1024**2)
print "Calculation time = ", time2 - time1
``````

It is much more faster than option1, but still uses a lot of memory:

``````Used Mem =  611
Calculation time =  0.278000116348
``````

Third option and the most efficient way is defining a function that returns generator:

``````mem_before = psutil.virtual_memory()[3]
time1 = time.time()
def f(n):
i = -1
while(i<n):
i += 1
yield i**2
x = f(20000000)
time2 = time.time()
mem_after =  psutil.virtual_memory()[3]

print "Used Mem = ", (mem_after - mem_before)/(1024**2)
print "Calculation time = ", time2 - time1
``````

Its consumption:

``````Used Mem =  0
Calculation time =  0.0
``````

The question is:

1. What's the difference between first and second solution? If using
`()`
we create a generator, so why it need a lot of memory?

2. Is there any built-in function equivalent to my third option?

1. As others have pointed out in the comments, `range` creates a `list` in Python 2. Hence, it is not the generator per se that uses up the memory, but the `range` that the generator uses:

``````x = (i**2 for i in range(20000000))
# builds a 2*10**7 element list, not for the squares , but for the bases

>>> sys.getsizeof(range(100))
872
>>> sys.getsizeof(xrange(100))
40
>>> sys.getsizeof(range(1000))
8720
>>> sys.getsizeof(xrange(1000))
40
>>> sys.getsizeof(range(20000000))
160000072
>>> sys.getsizeof(xrange(20000000))
40
``````

This also explains why your second version (the generator expression) uses around half the memory of the first version (the list comprehension) as the first one builds two lists (for the bases and the squares) while the second only builds one list for the bases.

2. `xrange(20000000)` thus, greatly improves memory usage as it returns a lazy iterator. This is essentially the built-in memory efficient way to iterate over a range of numbers that mirrors your third version (with the added flexibility of `start`, `stop` and `step`):

``````x = (i**2 for i in xrange(20000000))
``````

In Python 3, `range` is essentially what `xrange` used to be in Python 2. However, the Python 3 `range` object has some nice features that Python 2's `xrange` doesn't have, like `O(1)` slicing, contains, etc.

Some references:

Python2 xrange docs

Python3 range docs

Should you always favor xrange() over range()?

This excellent answer by Martijn Pieters

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download