Runner Bean Runner Bean - 3 months ago 9
Python Question

Python pandas make new column from data in existing column and from another dataframe

I have a DataFrame called 'mydata', and if I do

len(mydata.loc['2015-9-2'])


It counts the number of rows in mydata that have that date, and returns a number like

1067


I have another DataFrame called 'yourdata' which looks something like

timestamp
51 2015-06-22
52 2015-06-23
53 2015-06-24
54 2015-06-25
43 2015-07-13


Now I want use each date in yourdata so instead of typing in each date

len(mydata.loc['2015-9-2'])


I can iterate through 'yourdata' using them like

len(mydata.loc[yourdata['timestamp']])


and produce a new DataFrame with the results or just add a new column to yourdata with the result for each date, but I'm lost as how to do this?

The following does not work

yourdata['result'] = len(mydata.loc[yourdata['timestamp']])


neither does this

yourdata['result'] = len(mydata.loc[yourdata.iloc[:,-3]])


this does work

yourdata['result'] = len(mydata.loc['2015-9-2'])


buts that no good as I want to use the date in each row not some fixed date.

Edit: first few rows of mydata

timestamp BPM
0 2015-08-30 16:48:00 65
1 2015-08-30 16:48:10 65
2 2015-08-30 16:48:15 66
3 2015-08-30 16:48:20 67
4 2015-08-30 16:48:30 70

Answer
import numpy as np
import pandas as pd

mydata = pd.DataFrame({'timestamp': ['2015-06-22 16:48:00']*3 +
                                    ['2015-06-23 16:48:00']*2 +
                                    ['2015-06-24 16:48:00'] +
                                    ['2015-06-25 16:48:00']*4 +
                                    ['2015-07-13 16:48:00',
                                     '2015-08-13 16:48:00'],
                       'BPM': [65]*8 + [70]*4})
mydata['timestamp'] = pd.to_datetime(mydata['timestamp'])
print(mydata)

#     BPM           timestamp
# 0    65 2015-06-22 16:48:00
# 1    65 2015-06-22 16:48:00
# 2    65 2015-06-22 16:48:00
# 3    65 2015-06-23 16:48:00
# 4    65 2015-06-23 16:48:00
# 5    65 2015-06-24 16:48:00
# 6    65 2015-06-25 16:48:00
# 7    65 2015-06-25 16:48:00
# 8    70 2015-06-25 16:48:00
# 9    70 2015-06-25 16:48:00
# 10   70 2015-07-13 16:48:00
# 11   70 2015-08-13 16:48:00

yourdata = pd.Series(['2015-06-22', '2015-06-23', '2015-06-24',
                      '2015-06-25', '2015-07-13'], name='timestamp')
yourdata = pd.to_datetime(yourdata).to_frame()
print(yourdata)

# 0   2015-06-22
# 1   2015-06-23
# 2   2015-06-24
# 3   2015-06-25
# 4   2015-07-13

result = (mydata.set_index('timestamp').resample('D')
                .size().loc[yourdata['timestamp']]
                .reset_index())
result.columns = ['timestamp', 'result']
print(result)

#    timestamp  result
# 0 2015-06-22       3
# 1 2015-06-23       2
# 2 2015-06-24       1
# 3 2015-06-25       4
# 4 2015-07-13       1