Leosar - 11 months ago 57

R Question

I would like to calculate the mean of a variable on a data.frame in R from the row which another variable start to have a constant value. I usually use dplyr for this database kind of task but I dont figure out how to do this, here is an example:

`s<-"no Spc PSize`

2 0 6493

2 0 9281

2 12 26183

2 12 36180

2 12 37806

2 12 37765

3 12 36015

3 12 26661

3 0 14031

3 0 5564

3 1 17701

3 1 20808

3 1 31511

3 1 44746

3 1 50534

3 1 54858

3 1 58160

3 1 60326"

d<-read.delim(textConnection(s),sep="",header=T)

mean(d[1:10,3])

sd(d[1:10,3])

From the row 11 the variable spc have a constant value, so this is the place I want to split the data.frame

`mean(d[11:18,3])`

sd(d[11:18,3])

I can calculate it by hand, but that is not the idea...

Answer Source

You can do it by adding a column that checks whether or not the entry matches the value above, then use `cumsum`

to find the places where the count changes. I `group_by`

'd that, and calculated the summaries you wanted -- I also added an output of which rows were included to demonstrate where it was grabbing from.

```
d %>%
mutate(
row = 1:n()
, isDiff = Spc != lag(Spc, default = Spc[1])
, whichGroup = cumsum(isDiff)) %>%
group_by(whichGroup, Spc) %>%
summarise(mean = mean(PSize)
, sd = sd(PSize)
, whichRows = paste(range(row), collapse = ":"))
```

Gives:

```
whichGroup Spc mean sd whichRows
<int> <int> <dbl> <dbl> <chr>
1 0 0 7887.0 1971.414 1:2
2 1 12 33435.0 5486.794 3:8
3 2 0 9797.5 5987.073 9:10
4 3 1 42330.5 16866.591 11:18
```

If you only want the last group, which I can't tell from your post if you do or not, you could instead use `filter`

, like this:

```
d %>%
mutate(
row = 1:n()
, isDiff = Spc != lag(Spc, default = Spc[1])
, whichGroup = cumsum(isDiff)) %>%
filter(whichGroup == max(whichGroup)) %>%
summarise(Spc = Spc[1]
, mean = mean(PSize)
, sd = sd(PSize)
, whichRows = paste(range(row), collapse = ":"))
```

Which gives:

```
Spc mean sd whichRows
1 1 42330.5 16866.59 11:18
```

Based on the comment, you appear to want the last group vs. the rest, you can get that with this approach:

```
d %>%
mutate(
row = 1:n()
, isDiff = Spc != lag(Spc, default = Spc[1])
, whichGroup = cumsum(isDiff)) %>%
group_by(isLast = whichGroup == max(whichGroup)) %>%
summarise(mean = mean(PSize)
, sd = sd(PSize)
, whichRows = paste(range(row), collapse = ":"))
```

which gives:

```
isLast mean sd whichRows
<lgl> <dbl> <dbl> <chr>
1 FALSE 23597.9 13521.32 1:10
2 TRUE 42330.5 16866.59 11:18
```