CF84 - 11 days ago
Python Question

Pandas: assign categorical column based on string values in another column

Say I have a dataframe like this:

          A   B
1  05010001  17
2  05020001   5
3  05020002  11
4  05020003   2
5  05030001  86
6  07030001  84
7  07030002  10
8  08010001  16


I want to add a third column, C, which assigns 1 to all rows whose A value begins with 05, 2 to all rows whose A value begins with 07, and so forth.

The result would be:

          A   B  C
1  05010001  17  1
2  05020001   5  1
3  05020002  11  1
4  05020003   2  1
5  05030001  86  1
6  07030001  84  2
7  07030002  10  2
8  08010001  16  3


What is the most efficient way of doing this? My actual dataframe has 3,000,000 rows.

Answer
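# Sort the data frame by column A and take the first two characters of each
# value (column A must hold strings for the .str accessor to work).
firstTwo = df.sort_values('A').A.str[:2]
# Flag every row where the prefix differs from the previous row, then use
# cumsum to turn those flags into consecutive ids. The assignment aligns on
# the index, so the original row order of df is preserved.
df['C'] = (firstTwo != firstTwo.shift()).cumsum()
df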
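
As a possible alternative to the sort-and-cumsum approach above, here is a minimal sketch using pd.factorize, which maps each distinct prefix to an integer code without sorting the whole frame. It assumes column A is stored as strings (so the leading zeros are kept) and that the ids should follow the sorted order of the prefixes. The sample frame is rebuilt from the question's data, and the names prefix, codes and uniques are only illustrative. No timing was done here, so whether it is faster on 3,000,000 rows would need to be measured.

```python
import pandas as pd

# Sample data copied from the question; A is kept as strings (assumption).
df = pd.DataFrame(
    {
        "A": ["05010001", "05020001", "05020002", "05020003",
              "05030001", "07030001", "07030002", "08010001"],
        "B": [17, 5, 11, 2, 86, 84, 10, 16],
    },
    index=range(1, 9),
)

# First two characters of every value in A.
prefix = df["A"].str[:2]

# factorize returns zero-based integer codes plus the distinct values;
# sort=True numbers the codes in sorted order of the prefixes.
codes, uniques = pd.factorize(prefix, sort=True)

# Shift to one-based ids to match the desired output.
df["C"] = codes + 1
```

Unlike the cumsum version, this does not depend on the rows being sorted by A, and uniques keeps the prefix that corresponds to each id.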

