Pedro Braz Pedro Braz - 7 months ago 11
Python Question

Fastest waty to create DataFrame from last available data

I had no success looking for answers for this question in the forum since it is hard to put it in keywords. Any keywords suggestions are appreciated so that I cane make this question more accessible so that others can benefit from it.

The closest question I found doesn't really answer mine.

My problem is the following:

I have one DataFrame that I called

ref
, and a dates list called
pub
.
ref
has dates for indexes but those dates are different (there will be a few matching values) from the dates in
pub
. I want to create a new DataFrame that contains all the dates from
pub
but fill it with the "last available data" from
ref
.

Thus, say
ref
is:

Dat col1 col2
2015-01-01 5 4
2015-01-02 6 7
2015-01-05 8 9


And
pub


2015-01-01
2015-01-04
2015-01-06


I'd like to create a DataFrame like:

Dat col1 col2
2015-01-01 5 4
2015-01-04 6 7
2015-01-06 8 9


For this matter performance is an issue. So i'm looking for the fastest / a fast way of doing that.

Thanks in advance.

Answer

You can do an outer merge, set the new index to Dat, sort it, forward fill, and then reindex based on the dates in pub.

dates = ['2015-01-01', '2015-01-04', '2015-01-06']
pub = pd.DataFrame([dt.datetime.strptime(ts, '%Y-%m-%d').date() for ts in dates], 
                   columns=['Dat'])

>>> (ref
     .merge(pub, on='Dat', how='outer')
     .set_index('Dat')
     .sort_index()
     .ffill()
     .reindex(pub.Dat))
            col1  col2
Dat                   
2015-01-01     5     4
2015-01-04     6     7
2015-01-06     8     9