cᴏʟᴅsᴘᴇᴇᴅ - 1 month ago
Python Question

Apply a function pairwise on a pandas series

I have a pandas Series whose elements are frozensets:

import pandas as pd

data = {0: frozenset({'apple', 'banana'}),
        1: frozenset({'apple', 'orange'}),
        2: frozenset({'banana'}),
        3: frozenset({'kumquat', 'orange'}),
        4: frozenset({'orange'}),
        5: frozenset({'orange', 'pear'}),
        6: frozenset({'orange', 'pear'}),
        7: frozenset({'apple', 'banana', 'pear'}),
        8: frozenset({'banana', 'persimmon'}),
        9: frozenset({'apple'}),
        10: frozenset({'banana'}),
        11: frozenset({'apple'})}

tokens = pd.Series(data); tokens

0 (apple, banana)
1 (orange, apple)
2 (banana)
3 (orange, kumquat)
4 (orange)
5 (orange, pear)
6 (orange, pear)
7 (apple, banana, pear)
8 (persimmon, banana)
9 (apple)
10 (banana)
11 (apple)
Name: Tokens, dtype: object


I want to apply a function pairwise. For example, tokens.diff() gives me the set difference between consecutive rows:

0 NaN
1 (orange)
2 (banana)
3 (orange, kumquat)
4 ()
5 (pear)
6 ()
7 (apple, banana)
8 (persimmon)
9 (apple)
10 (banana)
11 (apple)
Name: Tokens, dtype: object


I'd like the same thing, but instead of set difference, I want a set union on consecutive rows. So, I'd ideally like:

0 NaN
1 (orange, apple, banana)
2 (banana, orange, apple)
3 (orange, kumquat, banana)
4 (orange, kumquat)
...


How can I achieve this with pandas? I know I can do this with zip and a list comprehension, but I'm hoping there's a better way.

Answer Source

There are a couple of ways to do this.

Option 1] list comprehension

In [3631]: pd.Series([x[0].union(x[1])
                      for x in zip(tokens, tokens.shift(-1).fillna(''))],
                     index=tokens.index)
Out[3631]:
0              (orange, banana, apple)
1              (orange, apple, banana)
2            (orange, kumquat, banana)
3                    (orange, kumquat)
4                       (orange, pear)
5                       (orange, pear)
6        (orange, pear, banana, apple)
7     (persimmon, pear, banana, apple)
8           (apple, persimmon, banana)
9                      (apple, banana)
10                     (banana, apple)
11                             (apple)
dtype: object
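
A note on why fillna('') works here, as a minimal sketch in plain Python (the variable names are made up for illustration). shift(-1) leaves NaN in the last row, and frozenset.union() accepts any iterable, so an empty string acts as an empty filler. The | operator is stricter and would not accept it:

```python
last = frozenset({'apple'})

# .union() takes any iterable; an empty string contributes no elements,
# so the last row is unioned with "nothing" and comes back unchanged.
padded = last.union('')

# The | operator only works between sets, so the same filler fails there.
try:
    last | ''
    operator_ok = True
except TypeError:
    operator_ok = False

print(padded, operator_ok)
```

Keep in mind that a non-empty filler string would add its individual characters to the set, so the filler should stay empty (or be frozenset() if you prefer to be explicit).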

Option 2] map

In [3632]: pd.Series(map(lambda x: x[0].union(x[1]), 
                         zip(tokens, tokens.shift(-1).fillna(''))),
                     index=tokens.index)
Out[3632]:
0              (orange, banana, apple)
1              (orange, apple, banana)
2            (orange, kumquat, banana)
3                    (orange, kumquat)
4                       (orange, pear)
5                       (orange, pear)
6        (orange, pear, banana, apple)
7     (persimmon, pear, banana, apple)
8           (apple, persimmon, banana)
9                      (apple, banana)
10                     (banana, apple)
11                             (apple)
dtype: object

Option 3] Using concat and apply

In [3633]: pd.concat([tokens, tokens.shift(-1).fillna('')],
                     axis=1).apply(lambda x: x[0].union(x[1]), axis=1)
Out[3633]:
0              (orange, banana, apple)
1              (orange, apple, banana)
2            (orange, kumquat, banana)
3                    (orange, kumquat)
4                       (orange, pear)
5                       (orange, pear)
6        (orange, pear, banana, apple)
7     (persimmon, pear, banana, apple)
8           (apple, persimmon, banana)
9                      (apple, banana)
10                     (banana, apple)
11                             (apple)
dtype: object
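
All three options pair each row with the next one (shift(-1)), so the last row is left as-is. The question asked for the diff-style alignment instead, where each row is paired with the previous one and row 0 is NaN. A rough sketch of that variant, assuming the same list-comprehension approach; only the first five rows of the data are used, and the names prev and unions are hypothetical:

```python
import numpy as np
import pandas as pd

tokens = pd.Series([
    frozenset({'apple', 'banana'}),
    frozenset({'apple', 'orange'}),
    frozenset({'banana'}),
    frozenset({'kumquat', 'orange'}),
    frozenset({'orange'}),
])

# shift(1) moves every value down one row, so prev[i] is tokens[i - 1]
# and the first row has no predecessor (NaN).
prev = tokens.shift(1)

# Union each row with its predecessor; keep NaN where there is none,
# mirroring what diff() does for the first row.
unions = pd.Series(
    [(cur | p) if isinstance(p, frozenset) else np.nan
     for cur, p in zip(tokens, prev)],
    index=tokens.index,
)

print(unions)
```

The isinstance check is used instead of fillna('') so that the first row stays NaN rather than being filled with its own value.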

Timings

In [3647]: tokens.shape
Out[3647]: (60000L,)

In [3648]: %timeit pd.Series([x[0].union(x[1]) for x in zip(tokens, tokens.shift(-1).fillna(''))], index=tokens.index)
10 loops, best of 3: 35 ms per loop

In [3649]: %timeit pd.Series(map(lambda x: x[0].union(x[1]), zip(tokens, tokens.shift(-1).fillna(''))), index=tokens.index)
10 loops, best of 3: 40.9 ms per loop

In [3650]: %timeit pd.concat([tokens, tokens.shift(-1).fillna('')], axis=1).apply(lambda x: x[0].union(x[1]), axis=1)
1 loop, best of 3: 2.2 s per loop

Unrelated, but for the sake of comparison, here is the timing for diff:

In [3653]: %timeit tokens.diff()
10 loops, best of 3: 10.8 ms per loop
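
The %timeit lines above need IPython. If you want to reproduce the comparison in a plain script, something along these lines should work with the standard timeit module. This is only a sketch: the test data, its size (6000 rows) and the function names are made up here, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical test data: a short pattern repeated to 6000 rows.
base = [frozenset({'apple', 'banana'}),
        frozenset({'orange'}),
        frozenset({'orange', 'pear'})]
tokens = pd.Series(base * 2000)


def list_comp(s):
    # Option 1: zip the series with its shifted copy.
    nxt = s.shift(-1).fillna('')
    return pd.Series([a.union(b) for a, b in zip(s, nxt)], index=s.index)


def concat_apply(s):
    # Option 3: build a two-column frame and apply row by row.
    # Column labels 'cur' and 'nxt' are chosen here for readability.
    pairs = pd.concat([s, s.shift(-1).fillna('')], axis=1, keys=['cur', 'nxt'])
    return pairs.apply(lambda row: row['cur'].union(row['nxt']), axis=1)


for func in (list_comp, concat_apply):
    seconds = timeit.timeit(lambda: func(tokens), number=3) / 3
    print(f'{func.__name__}: {seconds * 1000:.1f} ms per loop')
```

Both functions return the same values; the row-wise apply is expected to be the slower of the two, in line with the timings above.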