john doe john doe - 24 days ago 6
Python Question

How to create a unique id column given an ordered numerical series?

I am working with large dataframe column that have the following structure:

In:

df1 = pd.DataFrame({'A': [0,0,0,1,2,0,1,0,1,2,3,4,5,6,7]})


Out:

A
0 0
1 0
2 0
3 1
4 2
5 0
6 1
7 0
8 1
9 2
10 3
11 4
12 5
13 6
14 7


As you can see,
A
is an ordered sequence from 0 to n. That represents an order in my data. For example:

A
2 0
3 1
4 2
5 0


Let's take a chunk of
df1
(from index
2
to
5
), in column
A
the
0
represents the beginning and the
2
represents the end of the sequence. On the other hand, if there is no consecutive number (e.g.
0
) it represent that sequence ends. Thus, my question is how to generate efficiently a new column (e.g.
id
) conformed by a unique key or number based in the numerical order of
A
:

A id
0 0 -> begin and ends 1

1 0 -> begin and ends 2

2 0 -> begin 3
3 1 -> continue 3
4 2 -> ends 3

5 0 -> begin 4
6 1 -> ends 4

7 0 -> begin 5
8 1 -> continue 5
9 2 -> continue 5
10 3 -> continue 5
11 4 -> continue 5
12 5 -> continue 5
13 6 -> continue 5
14 7 -> ends 5


I added a diagram in order to be more clear.

Answer

I think you can use:

print ((df1.A.diff() < 1).cumsum() + 1)
0     1
1     2
2     3
3     3
4     3
5     4
6     4
7     5
8     5
9     5
10    5
11    5
12    5
13    5
14    5
Name: A, dtype: int32

More general with dealing with NaN after diff:

dif = df1.A.diff()
dif.iloc[0] = df1.loc[0,'A']
print ((dif < 1).cumsum())
0     1
1     2
2     3
3     3
4     3
5     4
6     4
7     5
8     5
9     5
10    5
11    5
12    5
13    5
14    5
Name: A, dtype: int32

Explanation:

First find differences by diff:

dif = df1.A.diff()
print (dif)
0     NaN
1     0.0
2     0.0
3     1.0
4     1.0
5    -2.0
6     1.0
7    -1.0
8     1.0
9     1.0
10    1.0
11    1.0
12    1.0
13    1.0
14    1.0
Name: A, dtype: float64

Then set first value (NaN) to original value:

dif.iloc[0] = df1.loc[0,'A']

Get mask:

print (dif < 1)
0      True
1      True
2      True
3     False
4     False
5      True
6     False
7      True
8     False
9     False
10    False
11    False
12    False
13    False
14    False
Name: A, dtype: bool

Last use cumsum with boolean mask:

print ((dif < 1).cumsum())
0     1
1     2
2     3
3     3
4     3
5     4
6     4
7     5
8     5
9     5
10    5
11    5
12    5
13    5
14    5
Name: A, dtype: int32