john doe - 1 year ago 51
Python Question

# How to create a unique id column given an ordered numerical series?

I am working with large dataframe column that have the following structure:

In:

``````df1 = pd.DataFrame({'A': [0,0,0,1,2,0,1,0,1,2,3,4,5,6,7]})
``````

Out:

``````    A
0   0
1   0
2   0
3   1
4   2
5   0
6   1
7   0
8   1
9   2
10  3
11  4
12  5
13  6
14  7
``````

As you can see,
`A`
is an ordered sequence from 0 to n. That represents an order in my data. For example:

``````    A
2   0
3   1
4   2
5   0
``````

Let's take a chunk of
`df1`
(from index
`2`
to
`5`
), in column
`A`
the
`0`
represents the beginning and the
`2`
represents the end of the sequence. On the other hand, if there is no consecutive number (e.g.
`0`
) it represent that sequence ends. Thus, my question is how to generate efficiently a new column (e.g.
`id`
) conformed by a unique key or number based in the numerical order of
`A`
:

``````      A                             id
0     0 ->  begin and ends          1

1     0 ->  begin and ends          2

2     0 ->  begin                   3
3     1 ->  continue                3
4     2 ->  ends                    3

5     0 ->  begin                   4
6     1 ->  ends                    4

7     0 ->  begin                   5
8     1 ->  continue                5
9     2 ->  continue                5
10    3 ->  continue                5
11    4 ->  continue                5
12    5 ->  continue                5
13    6 ->  continue                5
14    7 ->  ends                    5
``````

I added a diagram in order to be more clear.

I think you can use:

``````print ((df1.A.diff() < 1).cumsum() + 1)
0     1
1     2
2     3
3     3
4     3
5     4
6     4
7     5
8     5
9     5
10    5
11    5
12    5
13    5
14    5
Name: A, dtype: int32
``````

More general with dealing with `NaN` after `diff`:

``````dif = df1.A.diff()
dif.iloc[0] = df1.loc[0,'A']
print ((dif < 1).cumsum())
0     1
1     2
2     3
3     3
4     3
5     4
6     4
7     5
8     5
9     5
10    5
11    5
12    5
13    5
14    5
Name: A, dtype: int32
``````

Explanation:

First find differences by `diff`:

``````dif = df1.A.diff()
print (dif)
0     NaN
1     0.0
2     0.0
3     1.0
4     1.0
5    -2.0
6     1.0
7    -1.0
8     1.0
9     1.0
10    1.0
11    1.0
12    1.0
13    1.0
14    1.0
Name: A, dtype: float64
``````

Then set first value (`NaN`) to original value:

``````dif.iloc[0] = df1.loc[0,'A']
``````

``````print (dif < 1)
0      True
1      True
2      True
3     False
4     False
5      True
6     False
7      True
8     False
9     False
10    False
11    False
12    False
13    False
14    False
Name: A, dtype: bool
``````

Last use `cumsum` with boolean mask:

``````print ((dif < 1).cumsum())
0     1
1     2
2     3
3     3
4     3
5     4
6     4
7     5
8     5
9     5
10    5
11    5
12    5
13    5
14    5
Name: A, dtype: int32
``````
Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download