JasperDangs7 JasperDangs7 - 1 month ago 11
Python Question

Update Pandas Cells based on Column Values and Other Columns

I am looking to update many columns based on the values in one column; this is easy with a loop but takes far too long for my application when there are many columns and many rows. What is the most elegant way to get the desired counts for each letter?

Desired Output:

Things count_A count_B count_C count_D
['A','B','C'] 1 1 1 0
['A','A','A'] 3 0 0 0
['B','A'] 1 1 0 0
['D','D'] 0 0 0 2

Answer

The most elegant is definitely the CountVectorizer from sklearn.

I'll show you how it works first, then I'll do everything in one line, so you can see how elegant it is.

First, we'll do it step by step:

let's create some data

raw = ['ABC', 'AAA', 'BA', 'DD']

things = [list(s) for s in raw]

Then read in some packages and initialize count vectorizer

from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

cv = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)

Next we generate a matrix of counts

matrix = cv.fit_transform(things)

names = ["count_"+n for n in cv.get_feature_names()]

And save as a data frame

df = pd.DataFrame(data=matrix.toarray(), columns=names, index=raw)

Generating a data frame like this:

    count_A count_B count_C count_D
ABC 1   1   1   0
AAA 3   0   0   0
BA  1   1   0   0
DD  0   0   0   2

Elegant version:

Everything above in one line

df = pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names()], index=raw)

Timing:

You mentioned that you're working with a rather large dataset, so I used the %%timeit function to give a time estimate.

Previous response by @piRSquared (which otherwise looks very good!)

pd.concat([s, s.apply(lambda x: pd.Series(x).value_counts()).fillna(0)], axis=1)

100 loops, best of 3: 3.27 ms per loop

My answer:

pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names()], index=raw)

1000 loops, best of 3: 1.08 ms per loop

According to my testing, CountVectorizer is about 3x faster.

Comments