adele adele - 6 months ago 33
Python Question

Histogram from Pandas data and specify groupby time

How do I make a histogram and find 90% and 95% values from the following output of bars:

bars = ticks.Volume.resample('1s', how=sum)
bars = bars.dropna()
bars

Timestamp
2015-12-27 23:00:25 1.0
2015-12-27 23:01:11 10.0
2015-12-27 23:02:03 1.0
2015-12-27 23:02:14 2.0
2015-12-27 23:07:27 1.0
2015-12-27 23:14:58 2.0
2015-12-27 23:17:45 1.0
2015-12-27 23:21:38 1.0
2015-12-27 23:37:29 2.0
2015-12-27 23:37:32 1.0
2015-12-27 23:47:35 2.0
2015-12-27 23:47:38 12.0
2015-12-28 00:18:48 1.0
2015-12-28 00:26:19 1.0
2015-12-28 00:42:52 4.0
2015-12-28 01:25:52 1.0
2015-12-28 01:38:52 4.0
2015-12-28 02:03:47 4.0
2015-12-28 02:04:25 4.0
2015-12-28 02:39:15 3.0
2015-12-28 02:54:11 5.0
2015-12-28 03:07:43 1.0
2015-12-28 03:20:04 1.0
2015-12-28 03:30:00 6.0
2015-12-28 03:42:16 1.0
2015-12-28 04:11:03 6.0
2015-12-28 05:13:37 1.0
2015-12-28 05:15:20 1.0
2015-12-28 05:45:51 2.0
2015-12-28 05:48:14 29.0


Also how do I restrict this to only 09:30 - 16:15? Should I use groupby? If yes, can you please show how to do this?

thanks

Answer

These tasks can easily be done using Numpy's histogram and percentile methods.

But first we filter by time of day by first converting the index to datetime objects. In the following example I have changed the target times to include observations in your example dataframe.

import numpy as np
import pandas as pd

#EDIT: added code to rename a column
##
# Rename column
##
bars.columns  # check the original column names
>>>Index([u'Unnamed: 1'], dtype='object')

# rename the 'Unnamed: 1' column
bars.rename(columns={'Unnamed: 1': 'Value'}, inplace=True)

bars.columns
>>>Index([u'Value'], dtype='object')

##
# Filter by time of day
##

# Convert to a datetime. 
# WARNING this is operation is very expensive. For very large dataframes, it is much faster
# to use keep the indices as text and use a different filtering function.
bars.index = bars.index.to_datetime()

# Changed the target times to include values in the sample df
start = (2, 30)
end = (5, 15)

# Filter to only keeps times of day that fall within the desired times
idx = pd.Series(bars.index).apply(lambda x: x.replace(hour=start[0], minute=start[1]) < x < x.replace(hour=end[0], minute=end[1])).values

bars_filtered = bars[idx]

bars_filtered 
                     Value
2015-12-28 02:03:47    4.0
2015-12-28 02:04:25    4.0
2015-12-28 02:39:15    3.0
2015-12-28 02:54:11    5.0
2015-12-28 03:07:43    1.0
2015-12-28 03:20:04    1.0
2015-12-28 03:30:00    6.0
2015-12-28 03:42:16    1.0
2015-12-28 04:11:03    6.0

Calculating the histograms and percentiles is very easy.

##
# Histograms and Percentiles
##

# Get the histograms
num_bins = 10
hist, edges = np.histogram(bars.Value, bins=num_bins)

hist
array([20,  7,  0,  2,  0,  0,  0,  0,  0,  1])

# Edges defining the histogram bins
edges
array([  1. ,   3.8,   6.6,   9.4,  12.2,  15. ,  17.8,  20.6,  23.4,
    26.2,  29. ])

# Calculate the percentiles
p_90 = np.percentile(bars_filtered.Value, q=90)
p_95 = np.percentile(bars_filtered.Value, q=95)

p_90
6.0

p_95
6.0