Chris - 7 months ago 44
Python Question

# Take max of non missing date

I have a Pandas DataFrame with nan and date values (mix of float and datetime):

``````import datetime
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [np.nan,
datetime.date(2016, 1, 2),
np.nan],
"B": [datetime.date(2016, 1, 1),
np.nan,
datetime.date(2016, 1, 3)]
})
``````

I want to take the maximum date across each row. However, if I take
`df.max(axis=1)`
I get this:

``````Out[77]:
0   NaN
1   NaN
2   NaN
dtype: float64
``````

How can I return the maximum non-null date in each row?

You need convert both columns `to_datetime` first, because mixed values - `dates` and `nan`. Then `nan` are converted to `NaT`

``````df.A = pd.to_datetime(df.A)
df.B = pd.to_datetime(df.B)
print (df)
A          B
0        NaT 2016-01-01
1 2016-01-02        NaT
2        NaT 2016-01-03

print (df.max())

A   2016-01-02
B   2016-01-03
dtype: datetime64[ns]

print (df.max(axis=1))
0   2016-01-01
1   2016-01-02
2   2016-01-03
dtype: datetime64[ns]
``````

More dynamic solution with `concat` and `list comprehension`:

``````df[['A','B','C']] = pd.concat([pd.to_datetime(df[col]) for col in df[['A','B','C']]], axis=1)
print (df)
A          B
0        NaT 2016-01-01
1 2016-01-02        NaT
2        NaT 2016-01-03
``````

Or use `apply`:

``````df[['A','B','C']] = df[['A','B','C']].apply(pd.to_datetime)
``````

Timings:

``````In [28]: %timeit (c(df2))
100 loops, best of 3: 4.55 ms per loop

In [29]: %timeit (b(df1))
100 loops, best of 3: 12.8 ms per loop

In [30]: %timeit (a(df))
100 loops, best of 3: 12.8 ms per loop
``````

Code for timings:

``````df = pd.DataFrame({"A": [np.nan,
datetime.date(2016, 1, 2),
np.nan],
"B": [datetime.date(2016, 1, 1),
np.nan,
datetime.date(2016, 1, 3)],
"C": [datetime.date(2016, 1, 1),
np.nan,
datetime.date(2016, 1, 3)]
})

print (df)
#[300000 rows x 3 columns]
df = pd.concat([df]*100000).reset_index(drop=True)
df1 = df.copy()
df2 = df.copy()

def a(df):
df[['A','B','C']] = pd.concat([pd.to_datetime(df[col]) for col in df[['A','B','C']]], axis=1)
return df

def b(df):
df[['A','B','C']] = df[['A','B','C']].apply(pd.to_datetime)
return df

def c(df):
df.A = pd.to_datetime(df.A)
df.B = pd.to_datetime(df.B)
df.C = pd.to_datetime(df.C)
return df
``````
Source (Stackoverflow)