Arne - 1 year ago 65

Python Question

I have a dataframe with multiple columns and a few 1000 rows with text data. One column contains floats that represent time in ascending order (0, 0.45, 0.87, 1.10 etc). From this I want to build a new dataframe that contains only all the rows where these time values are closest to the integers x = 0,1,2,3......etc

Here on Stackoverflow I found an answer to a very similar question, answer posted by DSM. The code is essentially this, modified (hopefully) to give -the- closest number to x, df is my data frame.

`df.loc[(df.ElapsedTime-x).abs().argsort()[:1]]`

This seems to essentially do what I need for one x value but I can't figure out how to iterate this over the -entire- data frame to extract -all- rows where the column value is closest to x = 0,1,2,3....in ascending order. This code gives me a data frame, there must be a way to loop this and append the resulting data frames to get the desired result?

I have tried this:

`L=[]`

for x in np.arange(len(df)):

L.append(df.loc[(df.ElapsedTime-x).abs().argsort()[:1]])

L

L, in principle has the right rows but it is a messy list and it takes a long time to execute because for loops are not a great way to iterate over a data frame. I'd prefer to get a data frame as the result.

I feel I am missing something trivial.

Not sure how to post the desired dataframe.

Lets say the timevalues are (taken from my dataframe):

`0.00,0.03,0.58,1.59,1.71,1.96,2.21,2.33,2.46,2.58,2.7,2.83,2.95,3.07`

The values grabbed for 0,1,2,3 would be 0, .58, 1.96, 2.95

@beroe: if the numbers are 0.8, 1.1, 1.4, 2.8, in this case 1.1 should be grabbed for 1 and 1.4 should be grabbed for 2. If as an example the numbers are 0.5 1.5 2.5. While I think it is unlikely this will happen in my data I think it would be fine to grab 1.5 as 1 and 2.5 as 2. In this application I don't think it is that critical, although I am not sure how I would implement this.

Please let me know if anyone needs any additional info.

Answer Source

Don't know how fast this would be, but you could round the times to get "integer" candidates, take the absolute value of the difference to give yourself a way to find the closest, the sort by difference, and then `groupby`

the integer time to return just the rows that are close to integers:

```
# setting up my fake data
df=pd.DataFrame()
df['ElapsedTime']=pd.Series([0.5, 0.8, 1.1, 1.4, 1.8, 2.2, 3.1])
# To use your own data set, set df = Z, and start here...
df['bintime'] = df.ElapsedTime.round()
df['d'] = abs(df.ElapsedTime - df.bintime)
dfindex = df.sort('d').groupby('bintime').first()
```

For the fake time series defined above, the contents of `dfindex`

is:

```
ElapsedTime d
bintime
0 0.5 0.5
1 1.1 0.1
2 1.8 0.2
3 3.1 0.1
```