I am looking to update many columns based on the values in one column; this is easy with a loop but takes far too long for my application when there are many columns and many rows. What is the most elegant way to get the desired counts for each letter?
Things count_A count_B count_C count_D
['A','B','C'] 1 1 1 0
['A','A','A'] 3 0 0 0
['B','A'] 1 1 0 0
['D','D'] 0 0 0 2
The most elegant is definitely the CountVectorizer from sklearn.
I'll show you how it works first, then I'll do everything in one line, so you can see how elegant it is.
let's create some data
raw = ['ABC', 'AAA', 'BA', 'DD'] things = [list(s) for s in raw]
Then read in some packages and initialize count vectorizer
from sklearn.feature_extraction.text import CountVectorizer import pandas as pd cv = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False)
Next we generate a matrix of counts
matrix = cv.fit_transform(things) names = ["count_"+n for n in cv.get_feature_names()]
And save as a data frame
df = pd.DataFrame(data=matrix.toarray(), columns=names, index=raw)
Generating a data frame like this:
count_A count_B count_C count_D ABC 1 1 1 0 AAA 3 0 0 0 BA 1 1 0 0 DD 0 0 0 2
Everything above in one line
df = pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names()], index=raw)
You mentioned that you're working with a rather large dataset, so I used the %%timeit function to give a time estimate.
Previous response by @piRSquared (which otherwise looks very good!)
pd.concat([s, s.apply(lambda x: pd.Series(x).value_counts()).fillna(0)], axis=1)
100 loops, best of 3: 3.27 ms per loop
pd.DataFrame(data=cv.fit_transform(things).toarray(), columns=["count_"+n for n in cv.get_feature_names()], index=raw)
1000 loops, best of 3: 1.08 ms per loop
According to my testing, CountVectorizer is about 3x faster.