Cebs - 1 year ago 78

R Question

I need to perform some operations on a data frame, and since they are a little unusual I don't have a clue how to do them. Here is some data:

```
x <- 1:250
pos <- seq(1000, 1249, 1)
pval <- c(rep(0.25, 40), rep(0.0001, 10), rep(0.14, 100), rep(0.0005, 20),
          rep(0.58, 10), rep(0.00001, 20), rep(0.85, 50))
len <- rep(0.1, 250)
nsnp <- rep(33.7, 250)
data <- data.frame(x, pos, pval, len, nsnp)
```

My problem is that I need to create a new data frame from this one by combining consecutive rows according to data$pval. That is, with the rows sorted by data$x, I need to join every run of consecutive rows that have data$pval <= 0.05 and, for each run, compute:

- The mean of data$pos between the first and last consecutive elements with data$pval <= 0.05
- The sum of data$len over the consecutive rows with data$pval <= 0.05
- The sum of data$nsnp over the consecutive rows with data$pval <= 0.05

Since our data frame (data) contains 3 regions of consecutive data$x values with data$pval <= 0.05, the final data frame should look like this:

```
        pos len nsnp
[1,] 1044.5   1  337
[2,] 1159.5   2  674
[3,] 1189.5   2  674
```

These numbers can be obtained like this:

```
data2 <- subset(data, data$pval <= 0.05)

mean(data2$pos[data2$pos >= 1040 & data2$pos <= 1049])
sum(data2$len[data2$pos >= 1040 & data2$pos <= 1049])
sum(data2$nsnp[data2$pos >= 1040 & data2$pos <= 1049])

mean(data2$pos[data2$pos >= 1150 & data2$pos <= 1169])
sum(data2$len[data2$pos >= 1150 & data2$pos <= 1169])
sum(data2$nsnp[data2$pos >= 1150 & data2$pos <= 1169])

mean(data2$pos[data2$pos >= 1180 & data2$pos <= 1199])
sum(data2$len[data2$pos >= 1180 & data2$pos <= 1199])
sum(data2$nsnp[data2$pos >= 1180 & data2$pos <= 1199])
```

I hope my problem is clear now. What I could not find out is how to select the consecutive rows according to data$x automatically. In my example these consecutive rows are: pos 1040-1049, pos 1150-1169 and pos 1180-1199.

Answer Source

It seems that this can be done by grouping by `pval` (in the sample data each significant region has its own distinct `pval`), so using `dplyr`:

```
library(dplyr)
# data2 is the subset with pval <= 0.05 from the question
data2 %>%
  group_by(pval) %>%
  summarise(pos = mean(pos), len = sum(len), nsnp = sum(nsnp))
# A tibble: 3 × 4
#    pval    pos   len  nsnp
#   <dbl>  <dbl> <dbl> <dbl>
#1  1e-05 1189.5     2   674
#2  1e-04 1044.5     1   337
#3  5e-04 1159.5     2   674
```
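To illustrate when grouping by `pval` alone stops being enough, here is a minimal sketch in base R. The data frame `toy` is made up for this illustration and is not part of the question: it holds two separate runs of positions that happen to share the same p-value.

```r
# Hypothetical toy data: two separate runs of positions with the same p-value
toy <- data.frame(
  pos  = c(1000:1002, 1010:1012),
  pval = rep(0.001, 6)
)

# Grouping by pval alone puts both runs into a single group
by_pval <- tapply(toy$pos, toy$pval, mean)

# A run id that increases whenever pos jumps by more than 1 keeps them apart
run_id <- cumsum(c(1, diff(toy$pos) != 1))
by_run <- tapply(toy$pos, run_id, mean)

length(by_pval)  # number of groups when keyed on pval
length(by_run)   # number of groups when keyed on consecutive pos
```

With this toy input the `pval` key yields one group, while the run id yields one group per stretch of consecutive positions.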

However, if regions are not uniquely identified by their `pval`, we can group by runs of consecutive `pos` values instead:

```
library(dplyr)
data2 %>%
  # start a new group whenever pos does not follow the previous row by exactly 1
  group_by(new = cumsum(c(1, diff(pos) != 1))) %>%
  summarise(pos = mean(pos), len = sum(len), nsnp = sum(nsnp))
# A tibble: 3 × 4
#    new    pos   len  nsnp
#  <dbl>  <dbl> <dbl> <dbl>
#1     1 1044.5     1   337
#2     2 1159.5     2   674
#3     3 1189.5     2   674
```
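If `dplyr` is not available, or if `pos` does not increase in steps of exactly 1, the same idea can be sketched in base R by labelling runs of `pval <= 0.05` with `rle()`. This is only one possible variant, not the method from the answer above; the names `sig`, `runs`, `run_id`, `keep` and `res` are introduced here for illustration.

```r
# Rebuild the sample data from the question
data <- data.frame(
  x    = 1:250,
  pos  = seq(1000, 1249, 1),
  pval = c(rep(0.25, 40), rep(0.0001, 10), rep(0.14, 100), rep(0.0005, 20),
           rep(0.58, 10), rep(0.00001, 20), rep(0.85, 50)),
  len  = rep(0.1, 250),
  nsnp = rep(33.7, 250)
)

data <- data[order(data$x), ]   # make sure rows are in x order
sig  <- data$pval <= 0.05       # TRUE for rows below the threshold

# rle() finds runs of identical values; give every run its own id
runs   <- rle(sig)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Keep only the significant rows and summarise each run
keep     <- data[sig, ]
keep$run <- run_id[sig]
res <- data.frame(
  pos  = as.numeric(tapply(keep$pos,  keep$run, mean)),
  len  = as.numeric(tapply(keep$len,  keep$run, sum)),
  nsnp = as.numeric(tapply(keep$nsnp, keep$run, sum))
)
res
```

Because the run id comes from the threshold test rather than from `diff(pos)`, this variant depends only on the row order given by `x`, not on the spacing of `pos`.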