Leosar Leosar - 1 month ago 13
R Question

From which row a data.frame variable have a constant value

I would like to calculate the mean of a variable on a data.frame in R from the row which another variable start to have a constant value. I usually use dplyr for this database kind of task but I dont figure out how to do this, here is an example:

s<-"no Spc PSize
2 0 6493
2 0 9281
2 12 26183
2 12 36180
2 12 37806
2 12 37765
3 12 36015
3 12 26661
3 0 14031
3 0 5564
3 1 17701
3 1 20808
3 1 31511
3 1 44746
3 1 50534
3 1 54858
3 1 58160
3 1 60326"

d<-read.delim(textConnection(s),sep="",header=T)

mean(d[1:10,3])
sd(d[1:10,3])


From the row 11 the variable spc have a constant value, so this is the place I want to split the data.frame

mean(d[11:18,3])
sd(d[11:18,3])


I can calculate it by hand, but that is not the idea...

Answer

You can do it by adding a column that checks whether or not the entry matches the value above, then use cumsum to find the places where the count changes. I group_by'd that, and calculated the summaries you wanted -- I also added an output of which rows were included to demonstrate where it was grabbing from.

d %>%
  mutate(
    row = 1:n()
    , isDiff = Spc != lag(Spc, default = Spc[1])
    , whichGroup = cumsum(isDiff)) %>%
  group_by(whichGroup, Spc) %>%
  summarise(mean = mean(PSize)
            , sd = sd(PSize)
            , whichRows = paste(range(row), collapse = ":"))

Gives:

  whichGroup   Spc    mean        sd whichRows
       <int> <int>   <dbl>     <dbl>     <chr>
1          0     0  7887.0  1971.414       1:2
2          1    12 33435.0  5486.794       3:8
3          2     0  9797.5  5987.073      9:10
4          3     1 42330.5 16866.591     11:18

If you only want the last group, which I can't tell from your post if you do or not, you could instead use filter, like this:

d %>%
  mutate(
    row = 1:n()
    , isDiff = Spc != lag(Spc, default = Spc[1])
    , whichGroup = cumsum(isDiff)) %>%
  filter(whichGroup == max(whichGroup)) %>%
  summarise(Spc = Spc[1]
            , mean = mean(PSize)
            , sd = sd(PSize)
            , whichRows = paste(range(row), collapse = ":"))

Which gives:

  Spc    mean       sd whichRows
1   1 42330.5 16866.59     11:18

Based on the comment, you appear to want the last group vs. the rest, you can get that with this approach:

d %>%
  mutate(
    row = 1:n()
    , isDiff = Spc != lag(Spc, default = Spc[1])
    , whichGroup = cumsum(isDiff)) %>%
  group_by(isLast = whichGroup == max(whichGroup)) %>%
  summarise(mean = mean(PSize)
            , sd = sd(PSize)
            , whichRows = paste(range(row), collapse = ":"))

which gives:

  isLast    mean       sd whichRows
   <lgl>   <dbl>    <dbl>     <chr>
1  FALSE 23597.9 13521.32      1:10
2   TRUE 42330.5 16866.59     11:18
Comments