Leosar - 8 months ago 49
R Question

# From which row a data.frame variable have a constant value

I would like to calculate the mean of a variable on a data.frame in R from the row which another variable start to have a constant value. I usually use dplyr for this database kind of task but I dont figure out how to do this, here is an example:

``````s<-"no Spc PSize
2                0           6493
2                0           9281
2               12          26183
2               12          36180
2               12          37806
2               12          37765
3               12          36015
3               12          26661
3                0          14031
3                0           5564
3                1          17701
3                1          20808
3                1          31511
3                1          44746
3                1          50534
3                1          54858
3                1          58160
3                1          60326"

mean(d[1:10,3])
sd(d[1:10,3])
``````

From the row 11 the variable spc have a constant value, so this is the place I want to split the data.frame

``````mean(d[11:18,3])
sd(d[11:18,3])
``````

I can calculate it by hand, but that is not the idea...

You can do it by adding a column that checks whether or not the entry matches the value above, then use `cumsum` to find the places where the count changes. I `group_by`'d that, and calculated the summaries you wanted -- I also added an output of which rows were included to demonstrate where it was grabbing from.

``````d %>%
mutate(
row = 1:n()
, isDiff = Spc != lag(Spc, default = Spc[1])
, whichGroup = cumsum(isDiff)) %>%
group_by(whichGroup, Spc) %>%
summarise(mean = mean(PSize)
, sd = sd(PSize)
, whichRows = paste(range(row), collapse = ":"))
``````

Gives:

``````  whichGroup   Spc    mean        sd whichRows
<int> <int>   <dbl>     <dbl>     <chr>
1          0     0  7887.0  1971.414       1:2
2          1    12 33435.0  5486.794       3:8
3          2     0  9797.5  5987.073      9:10
4          3     1 42330.5 16866.591     11:18
``````

If you only want the last group, which I can't tell from your post if you do or not, you could instead use `filter`, like this:

``````d %>%
mutate(
row = 1:n()
, isDiff = Spc != lag(Spc, default = Spc[1])
, whichGroup = cumsum(isDiff)) %>%
filter(whichGroup == max(whichGroup)) %>%
summarise(Spc = Spc[1]
, mean = mean(PSize)
, sd = sd(PSize)
, whichRows = paste(range(row), collapse = ":"))
``````

Which gives:

``````  Spc    mean       sd whichRows
1   1 42330.5 16866.59     11:18
``````

Based on the comment, you appear to want the last group vs. the rest, you can get that with this approach:

``````d %>%
mutate(
row = 1:n()
, isDiff = Spc != lag(Spc, default = Spc[1])
, whichGroup = cumsum(isDiff)) %>%
group_by(isLast = whichGroup == max(whichGroup)) %>%
summarise(mean = mean(PSize)
, sd = sd(PSize)
, whichRows = paste(range(row), collapse = ":"))
``````

which gives:

``````  isLast    mean       sd whichRows
<lgl>   <dbl>    <dbl>     <chr>
1  FALSE 23597.9 13521.32      1:10
2   TRUE 42330.5 16866.59     11:18
``````