stackErr - 1 month ago
Python Question

Python map on a group

There is a list of strings in this format:

r = ["Name - Version - Author - Message", .., ..]


I run the following code on this data structure:

from itertools import groupby

m = map(lambda x: x.split(" - ", 3), r)
group = groupby(m, lambda y: y[0]) # group by name


This works fine (output is grouped by name) but then running the following:

import re

dot = re.compile(r'\.')
# "Names" have to be changed from first.last to first-last
output = map(lambda (n,g): (dot.sub('-', n), g), group)


This does the regex substitution correctly but it loses all the grouped data. Why? What's happening here? How do I fix this?

When I run this loop:

for name, grp in output:
    print list(grp)


It outputs an empty list for each grp.

Answer

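Py2's map is eager, while the groups produced by groupby are lazy views over one shared underlying iterator. As soon as you advance groupby to the next group, the previous group becomes invalid (the underlying iterator has moved past it). Because map builds its whole result list up front, it steps through every group before your loop reads any of them, so each grp is already spent by the time you call list on it.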
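A minimal sketch of the effect, using made-up sample rows (the data and the names rows, pairs, stale and fresh are assumptions for illustration, not the asker's code). Wrapping groupby in list() stands in for what an eager map does: it walks through every group before any of them is read. The snippet should run on both Python 2 and 3. Depending on the interpreter version, the last group may still hold a trailing row, so the comments only make claims about the earlier groups.

```python
from itertools import groupby

# Hypothetical sample rows, already split and ordered by name
rows = [
    ["a.b", "1.0", "x", "first"],
    ["a.b", "1.1", "x", "second"],
    ["c.d", "2.0", "y", "third"],
]

# Collecting the (key, group) pairs eagerly advances the shared iterator
# past the earlier groups before any of them is read
pairs = list(groupby(rows, lambda row: row[0]))
stale = [(name, list(grp)) for name, grp in pairs]

# Reading each group before asking for the next key keeps its rows
fresh = [(name, list(grp)) for name, grp in groupby(rows, lambda row: row[0])]
```

Here the first entry of stale has lost its rows, while fresh holds both "a.b" rows followed by the single "c.d" row.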

You have two options:

  1. Use a lazy map function
  2. Convert the group to a realized sequence as you iterate so it isn't lost

The former is easy:

from future_builtins import map

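which gets you the Py3 version of map (which returns a lazy iterator, not an eagerly filled list). Each group is then produced only when your loop asks for it, so it is still valid as long as you read it before moving on to the next one.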
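A rough sketch of the lazy-map option, under the assumption that each group is read inside the loop body before the next one is requested. The sample rows and the names rename and seen are hypothetical. The try/except lets the same snippet run on Python 3, where the built-in map is already lazy and future_builtins does not exist, and the lambda indexes its argument instead of unpacking a tuple parameter, since that syntax is Python 2 only.

```python
import re
from itertools import groupby

try:
    from future_builtins import map  # Python 2: swap in the lazy map
except ImportError:
    pass  # Python 3: the built-in map is already lazy

dot = re.compile(r'\.')

# Hypothetical sample rows, already split and ordered by name
rows = [
    ["a.b", "1.0", "x", "first"],
    ["a.b", "1.1", "x", "second"],
    ["c.d", "2.0", "y", "third"],
]

group = groupby(rows, lambda row: row[0])

# Lazy map: each (name, group) pair is produced only when the loop asks for it
rename = map(lambda pair: (dot.sub('-', pair[0]), pair[1]), group)

seen = []
for name, grp in rename:
    # Read the group now, before the loop advances to the next key
    seen.append((name, [row[3] for row in grp]))
```

If the loop stored grp and read it later instead, the rows would be gone again, which is why the list(g) approach is the safer of the two.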

The latter is just as easy: wrap g in the list (or, if you prefer, tuple) constructor:

output = map(lambda (n,g): (dot.sub('-', n), list(g)), group)

Side-note: Using map with a lambda is slower than the more Pythonic list comprehension or generator expression. If you need a lambda to use map (or filter), skip map/filter; the listcomp or genexpr equivalent will generally be faster.

So if you really need the speed, pass C built-ins to map when they are available; otherwise, use a listcomp or genexpr.
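If you want to check this on your own interpreter, a small timeit harness along these lines is one way to do it. This is only a sketch: the sample data, the function names and the repeat count are assumptions, and the numbers will vary by machine and Python version, so no timings are claimed here.

```python
import timeit
from operator import methodcaller

# Hypothetical sample data in the question's format
lines = ["first.last - 1.0 - someone - a message"] * 1000

def with_lambda():
    # map driven by a Python-level lambda
    return list(map(lambda x: x.split(" - ", 3), lines))

def with_methodcaller():
    # map driven by a C-implemented callable
    return list(map(methodcaller('split', " - ", 3), lines))

def with_listcomp():
    # plain list comprehension, no map at all
    return [x.split(" - ", 3) for x in lines]

candidates = [with_lambda, with_methodcaller, with_listcomp]

# Seconds taken for 100 runs of each variant
timings = dict((fn.__name__, timeit.timeit(fn, number=100)) for fn in candidates)
```

All three variants build the same list, so timings only compares the cost of the mapping mechanism itself.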

Example:

import re
from future_builtins import map  # Iterator-based map avoids temporary lists
from itertools import groupby
from operator import itemgetter, methodcaller

dot = re.compile(r'\.')

m = map(methodcaller('split', " - ", 3), r)
# Or with genexpr:
m = (x.split(" - ", 3) for x in r)

group = groupby(m, itemgetter(0)) # group by name

# No good way to do this without lambda, so use listcomp or genexpr
output = [(dot.sub('-', n), list(g)) for n, g in group]