vino88 - 2 months ago
Python Question

How to create a new column in Python Dataframe by referencing two other columns?

I have a dataframe that looks something like this:

df = pd.DataFrame({'Name':['a','a','a','a','b','b','b'], 'Year':[1999,1999,1999,2000,1999,2000,2000], 'Name_id':[1,1,1,1,2,2,2]})

  Name  Name_id  Year
0    a        1  1999
1    a        1  1999
2    a        1  1999
3    a        1  2000
4    b        2  1999
5    b        2  2000
6    b        2  2000


What I'd like to have is a new column 'yr_name_id' that increases for each unique Name_id-Year combination and then begins anew with each new Name_id.

  Name  Name_id  Year  yr_name_id
0    a        1  1999           1
1    a        1  1999           1
2    a        1  1999           1
3    a        1  2000           2
4    b        2  1999           1
5    b        2  2000           2
6    b        2  2000           2


I've tried a variety of things and looked at a few posts on groupby and enumerate.

At first I tried creating a unique dictionary after combining Name_id and Year and then using map to assign values, but when I try to combine Name_id and Year as strings via:

df['yr_name_id'] = str(df['Name_id']) + str(df['Year'])


Every row of the new column then holds the same non-unique value, which looks like
0 0 1\n1 1\n2 1\n3 1\n4 2\n5 2...
which I don't really understand.
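
For reference, the behaviour described above can be reproduced with a small sketch. It assumes a cut-down frame with the same column names; the variable names `whole` and `combined` are only illustrative. Calling str() on a whole Series returns a single string (the printed form of the Series, newlines included), whereas .astype(str) converts element by element.

```python
import pandas as pd

# Cut-down frame with the same column names as in the question.
df = pd.DataFrame({'Name_id': [1, 1, 2], 'Year': [1999, 2000, 1999]})

# str() on a Series gives ONE Python string: the printed form of the Series.
whole = str(df['Name_id'])
print(type(whole))   # a plain str, not a Series
print(repr(whole))   # contains '\n' between the printed rows

# astype(str) converts each element, so + concatenates row by row.
combined = df['Name_id'].astype(str) + '-' + df['Year'].astype(str)
print(combined.tolist())
```

Assigning the single string to a column broadcasts it to every row, which would explain why the result is not unique per row.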

A more promising approach, where I think I just need help with the lambda, is to use groupby:

df['yr_name_id'] = df.groupby(['Name_id', 'Year'])['Name_id'].transform(lambda x: )#unsure from this point


I am very unfamiliar with lambdas, so any guidance on how I might do this would be greatly appreciated.

Answer

IIUC, you can number each unique Name_id-Year combination this way:

In [99]: df['yr_name_id'] = pd.Categorical(pd.factorize(df['Name_id'].astype(str) + '-' + df['Year'].astype(str))[0] + 1)

In [100]: df
Out[100]:
  Name  Name_id  Year yr_name_id
0    a        1  1999          1
1    a        1  1999          1
2    a        1  1999          1
3    a        1  2000          2
4    b        2  1999          3
5    b        2  2000          4
6    b        2  2000          4

In [101]: df.dtypes
Out[101]:
Name            object
Name_id          int64
Year             int64
yr_name_id    category
dtype: object
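
To make the factorize step above easier to follow, here is a minimal sketch on a hand-written list of combined keys (the `keys` values are made up for illustration). pd.factorize returns integer codes in order of first appearance plus the unique values, so adding 1 gives labels that start at 1.

```python
import pandas as pd

# Hypothetical combined "Name_id-Year" keys, as the concatenation would produce.
keys = pd.Series(['1-1999', '1-1999', '1-2000', '2-1999', '2-2000'])

# codes: one integer per row, assigned in order of first appearance (from 0).
# uniques: the distinct keys, in that same order.
codes, uniques = pd.factorize(keys)

labels = codes + 1   # shift so numbering starts at 1
print(labels.tolist())
print(list(uniques))
```

Because the codes are assigned over the whole column, the numbering keeps counting across different Name_id values rather than restarting.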

But looking at your desired DF, it looks like you want to categorize just the Year column, not the combination of Name_id + Year:

In [102]: df['yr_name_id'] = pd.Categorical(pd.factorize(df.Year)[0] + 1)

In [103]: df
Out[103]:
  Name  Name_id  Year yr_name_id
0    a        1  1999          1
1    a        1  1999          1
2    a        1  1999          1
3    a        1  2000          2
4    b        2  1999          1
5    b        2  2000          2
6    b        2  2000          2

In [104]: df.dtypes
Out[104]:
Name            object
Name_id          int64
Year             int64
yr_name_id    category
dtype: object
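
Note that factorizing Year over the whole frame matches the desired output here only because every Name_id has the same years in the same order. If the numbering really needs to restart within each Name_id, one possible sketch (not necessarily what the answer above intended) is to factorize Year inside a groupby, which also fills in the lambda from the question. The second frame `other` and the column name `yr_rank` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['a', 'a', 'a', 'a', 'b', 'b', 'b'],
                   'Year': [1999, 1999, 1999, 2000, 1999, 2000, 2000],
                   'Name_id': [1, 1, 1, 1, 2, 2, 2]})

# Within each Name_id, number the distinct years in order of first appearance.
df['yr_name_id'] = (df.groupby('Name_id')['Year']
                      .transform(lambda s: pd.factorize(s)[0] + 1))

# Alternative: number the distinct years by sorted value within each Name_id.
df['yr_rank'] = df.groupby('Name_id')['Year'].rank(method='dense').astype(int)

# Hypothetical frame where the two groups cover different years.
other = pd.DataFrame({'Name_id': [1, 1, 2, 2],
                      'Year': [1999, 2000, 2000, 2001]})
global_codes = pd.factorize(other['Year'])[0] + 1
per_group = (other.groupby('Name_id')['Year']
                  .transform(lambda s: pd.factorize(s)[0] + 1))

print(df)
print(global_codes.tolist(), per_group.tolist())
```

On the `other` frame the whole-column factorize keeps counting into the second Name_id, while the grouped version starts again at 1. The two grouped variants differ only when years are not already sorted within a group.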