Thanos Thanos - 7 months ago 20
Python Question

GroupBy - How to extract seconds from DateTime with diff()

I have the following dataframe:

In [372]: df_2
Out[372]:
A ID3 DATETIME
0 B-028 b76cd912ff 2014-10-08 13:43:27
1 B-054 4a57ed0b02 2014-10-08 14:26:19
2 B-076 1a682034f8 2014-10-08 14:29:01
3 B-023 b76cd912ff 2014-10-08 18:39:34
4 B-023 f88g8d7sds 2014-10-08 18:40:18
5 B-033 b76cd912ff 2014-10-08 18:44:30
6 B-032 b76cd912ff 2014-10-08 18:46:00
7 B-037 b76cd912ff 2014-10-08 18:52:15
8 B-046 db959faf02 2014-10-08 18:59:59
9 B-053 b76cd912ff 2014-10-08 19:17:48
10 B-065 b76cd912ff 2014-10-08 19:21:38


And I want to find the difference between different entries - grouped by
'ID3'
.

I am trying to use
transform()
on a
GroupBy
like this:

In [379]: df_2['diff'] = df_2.sort_values(by='DATETIME').groupby('ID3')['DATETIME'].transform(lambda x: x.diff()); df_2['diff']
Out[379]:
0 NaT
1 NaT
2 NaT
3 1970-01-01 04:56:07
4 NaT
5 1970-01-01 00:04:56
6 1970-01-01 00:01:30
7 1970-01-01 00:06:15
8 NaT
9 1970-01-01 00:25:33
10 1970-01-01 00:03:50
Name: diff, dtype: datetime64[ns]


I have also tried with
x.diff().astype(int)
for
lambda
, with the exact same result.

Datatype of both
'DATETIME'
and
'diff'
is:
datetime64[ns]


What I want to achieve is have
diff
represented in seconds and not some time in relation to Epoch time.

I have figured out that I can convert
df_2['diff']
to
TimeDelta
and the extract seconds in one chained call after this point like this:

In [405]: df_2['diff'] = pd.to_timedelta(df_2['diff']).map(lambda x: x.total_seconds()); df_2['diff']
Out[407]:
0 NaN
1 NaN
2 NaN
3 17767.0
4 NaN
5 296.0
6 90.0
7 375.0
8 NaN
9 1533.0
10 230.0
Name: diff, dtype: float64


Is there a way to achieve this (seconds as values for
df_2['diff']
) in one step in the
transform
instead of having to take a couple of steps in the process?

To make it clear, I have tried making conversion to
TimeDelta
in
transform
without any success.

Thanks for the help!

Answer

UPDATE: transform() from class NDFrameGroupBy(GroupBy) doesn't seem to do downcasting and works as expected:

In [220]: (df_2[['ID3','DATETIME']]
   .....:      .sort_values(by='DATETIME')
   .....:      .groupby('ID3')
   .....:      .transform(lambda x: x.diff().dt.total_seconds())
   .....: )
Out[220]:
    DATETIME
0        NaN
1        NaN
2        NaN
3    17767.0
4        NaN
5      296.0
6       90.0
7      375.0
8        NaN
9     1533.0
10     230.0

the transform() from class SeriesGroupBy(GroupBy) tries to do the following:

result = _possibly_downcast_to_dtype(result, dtype)

which could (i'm not sure) cause your problem

OLD answer:

try this:

In [168]: df_2.sort_values(by='DATETIME').groupby('ID3')['DATETIME'].diff().dt.total_seconds()
Out[168]:
0         NaN
1         NaN
2         NaN
3     17767.0
4         NaN
5       296.0
6        90.0
7       375.0
8         NaN
9      1533.0
10      230.0
dtype: float64