kiril kiril - 1 month ago 8
Python Question

Query dataframe using index value between index entries

I'm reading temperature entries stored in a file.
Each entry is generated when the temperature value changes, so it is not stored in regular intervals.

An example of the data could be as follows:

timestamp | temperature
-----------+------------
1477400000 | 31
1477400001 | 31.5
1477400003 | 32
1477400010 | 31.5
1477400200 | 32
1477400201 | 32.5


I would need a fast way to get the temperature at any timestamp, even if it is not in the index. For instance, temperature at
1477400002
would be
31.5
, but
1477400002
is not in the index.

For sake of easier reproducibility, the same dataframe may be generated as follows:

df = pd.DataFrame(data={'temperature': [31, 31.5, 32, 31.5, 32, 32.5]},
index=[1477400000, 1477400001, 1477400003, 1477400010, 1477400200, 1477400201])

Answer

Assuming that the index is sorted, you can use np.searchsorted to return the ordinal position and use iloc to index into the df:

In [84]:
df.iloc[max(0, np.searchsorted(df.index, 1477400002 ) -1)]

Out[84]:
temperature    31.5
Name: 1477400001, dtype: float64

Here I subtract 1 from the result of np.searchsorted to return the lower bound, additionally to protect against the situation where it returns the first entry I also calc the max between 0 and the returned value so if you tried to find 1477400000 then this will still return the first entry