eleanora - 2 months ago
Python Question

How to count rows efficiently with one pass over the dataframe

I have a dataframe made of strings like this:

ID_0 ID_1
g k
a h
c i
j e
d i
i h
b b
d d
i a
d h


For each pair of strings, I can count how many rows contain either string as follows:

import pandas as pd
import itertools

df = pd.read_csv("test.csv", header=None, prefix="ID_", usecols=[0, 1])

# Build the set of all strings appearing in either column.
alphabet_1 = set(df['ID_0'])
alphabet_2 = set(df['ID_1'])
alphabet = alphabet_1 | alphabet_2

# Iterate over all pairs and count how many rows contain either string in either column.
for (x, y) in itertools.combinations(alphabet, 2):
    print(x, y, len(df.loc[df['ID_0'].isin([x, y]) | df['ID_1'].isin([x, y])]))


This gives:

a c 3
a b 3
a e 3
a d 5
a g 3
a i 5
a h 4
a k 3
a j 3
c b 2
c e 2
c d 4
[...]


The problem is that my dataframe is very large, the alphabet has 200 symbols, and this method does an independent pass over the whole dataframe for every pair of letters.

Is it possible to get the same output by doing a single pass over the dataframe somehow?




Timings

I created some data with:

import numpy as np
import pandas as pd
from string import ascii_lowercase
n = 10**4
data = np.random.choice(list(ascii_lowercase), size=(n,2))
df = pd.DataFrame(data, columns=['ID_0', 'ID_1'])

# Testing Parfait's answer
def f(row):
    ser = len(df[(df['ID_0'] == row['ID_0']) | (df['ID_1'] == row['ID_0']) |
                 (df['ID_0'] == row['ID_1']) | (df['ID_1'] == row['ID_1'])])
    return ser

%timeit df.apply(f, axis=1)
1 loops, best of 3: 37.8 s per loop



I would like to be able to do this for n = 10**8. Can this be sped up?

Answer

You can avoid the per-pair sub-iteration over the rows by using some clever combinatorics/set theory to do the counting:

# Count occurrences of individual characters and of ordered pairs.
char_count = df['ID_0'].append(df.loc[df['ID_0'] != df['ID_1'], 'ID_1']).value_counts().to_dict()
pair_count = df.groupby(['ID_0', 'ID_1']).size().to_dict()

# Get the counts via inclusion-exclusion.
df['count'] = [char_count[x] if x == y
               else char_count[x] + char_count[y] - (pair_count[x, y] + pair_count.get((y, x), 0))
               for x, y in df[['ID_0', 'ID_1']].values]

The resulting output:

  ID_0 ID_1  count
0    g    k      1
1    a    h      4
2    c    i      4
3    j    e      1
4    d    i      6
5    i    h      6
6    b    b      1
7    d    d      3
8    i    a      5
9    d    h      5
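If you also want the question's original per-pair output (one count for every pair in the alphabet, not just the pairs that occur as rows), the same two dictionaries answer every query without another pass over the frame. A sketch using the question's sample data (`pd.concat` stands in for `Series.append`, which was removed in pandas 2.0; `results` is my own name, not from the answer):

```python
import itertools
import pandas as pd

# Reconstruct the question's sample frame.
df = pd.DataFrame({'ID_0': list('gacjdibdid'),
                   'ID_1': list('khieihbdah')})

# The same two dictionaries as in the answer.
char_count = pd.concat([df['ID_0'],
                        df.loc[df['ID_0'] != df['ID_1'], 'ID_1']]).value_counts().to_dict()
pair_count = df.groupby(['ID_0', 'ID_1']).size().to_dict()

# One count per unordered pair of alphabet symbols, with no further scans of df.
alphabet = sorted(set(df['ID_0']) | set(df['ID_1']))
results = {}
for x, y in itertools.combinations(alphabet, 2):
    both = pair_count.get((x, y), 0) + pair_count.get((y, x), 0)
    results[(x, y)] = char_count[x] + char_count[y] - both
    print(x, y, results[(x, y)])
```

The values agree with the question's brute-force output (e.g. `a c 3`, `a d 5`, `c d 4`), though the iteration order differs because the question iterates over an unsorted set.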

I've compared the output of my method to the row-level iteration method on a dataset with 5000 rows, and all of the counts match.
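That sanity check is easy to reproduce. A sketch with my own names (`brute` is a hypothetical helper, not part of the answer; `pd.concat` replaces the since-removed `Series.append`) that compares the vectorized counts against a direct scan on a small random frame:

```python
import numpy as np
import pandas as pd
from string import ascii_lowercase

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.choice(list(ascii_lowercase[:6]), size=(200, 2)),
                  columns=['ID_0', 'ID_1'])

# Vectorized counts via inclusion-exclusion.
char_count = pd.concat([df['ID_0'],
                        df.loc[df['ID_0'] != df['ID_1'], 'ID_1']]).value_counts().to_dict()
pair_count = df.groupby(['ID_0', 'ID_1']).size().to_dict()
fast = [char_count[x] if x == y
        else char_count[x] + char_count[y] - (pair_count[x, y] + pair_count.get((y, x), 0))
        for x, y in df[['ID_0', 'ID_1']].values]

# Brute force: scan the whole frame once per row.
def brute(row):
    pair = {row['ID_0'], row['ID_1']}
    return int((df['ID_0'].isin(pair) | df['ID_1'].isin(pair)).sum())

slow = df.apply(brute, axis=1).tolist()
assert fast == slow
```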

Why does this work? It essentially just relies on the inclusion-exclusion formula for the cardinality of a union of two sets: |A ∪ B| = |A| + |B| − |A ∩ B|.

The cardinality of the set of rows containing a given element is just its char_count. When the two elements are distinct, the cardinality of the intersection is the count of rows pairing them in either order. When the two elements are identical, the formula reduces to the char_count alone.
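Concretely, take the row (a, h) from the sample data: 'a' appears in 2 rows, 'h' in 3, and exactly one row contains both, giving 2 + 3 − 1 = 4, the value shown in the output above. A minimal sketch of that arithmetic (`pd.concat` is used in place of `Series.append`, which was removed in pandas 2.0):

```python
import pandas as pd

# The question's sample data.
df = pd.DataFrame({'ID_0': list('gacjdibdid'),
                   'ID_1': list('khieihbdah')})

# The same dictionaries as in the answer.
char_count = pd.concat([df['ID_0'],
                        df.loc[df['ID_0'] != df['ID_1'], 'ID_1']]).value_counts().to_dict()
pair_count = df.groupby(['ID_0', 'ID_1']).size().to_dict()

# Inclusion-exclusion for the pair (a, h).
n_a, n_h = char_count['a'], char_count['h']      # rows containing 'a', rows containing 'h'
n_both = pair_count.get(('a', 'h'), 0) + pair_count.get(('h', 'a'), 0)
print(n_a, n_h, n_both, n_a + n_h - n_both)      # prints: 2 3 1 4
```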

Timings

Using the timing setup in the question, and the following function for my answer:

def root(df):
    char_count = df['ID_0'].append(df.loc[df['ID_0'] != df['ID_1'], 'ID_1']).value_counts().to_dict()
    pair_count = df.groupby(['ID_0', 'ID_1']).size().to_dict()
    df['count'] = [char_count[x] if x == y
                   else char_count[x] + char_count[y] - (pair_count[x, y] + pair_count.get((y, x), 0))
                   for x, y in df[['ID_0', 'ID_1']].values]
    return df

I get the following timings for n=10**4:

%timeit root(df.copy())
10 loops, best of 3: 25 ms per loop

%timeit df.apply(f, axis=1)
1 loop, best of 3: 49.4 s per loop

I get the following timing for n=10**6:

%timeit root(df.copy())
10 loops, best of 3: 2.22 s per loop

It appears that my solution scales approximately linearly.
