Cebs - 2 months ago 16
R Question

# Selecting consecutive rows in a data frame... r

I need to perform some operations on a data frame, and since they are a bit unusual I don't have a clue how to do them. Here is some data:

``````
x <- 1:250
pos <- seq(1000, 1249, 1)
pval <- c(rep(0.25, 40), rep(0.0001, 10), rep(0.14, 100), rep(0.0005, 20),
          rep(0.58, 10), rep(0.00001, 20), rep(0.85, 50))
len <- rep(0.1, 250)
nsnp <- rep(33.7, 250)
data <- data.frame(x, pos, pval, len, nsnp)
``````

My problem is that I need to create a new data frame from this one by combining consecutive rows according to `data$pval`. That is, with the rows sorted by `data$x`, I need to merge every run of consecutive rows that have `data$pval <= 0.05`, and for each run compute:

1. The mean of `data$pos` between the first and last consecutive rows with `data$pval <= 0.05`

2. The sum of `data$len` over the consecutive rows with `data$pval <= 0.05`

3. The sum of `data$nsnp` over the consecutive rows with `data$pval <= 0.05`

Since our data frame (`data`) contains 3 such regions of consecutive `data$x` values, the final data frame should look like this:

``````
        pos len nsnp
[1,] 1044.5   1  337
[2,] 1159.5   2  674
[3,] 1189.5   2  674
``````

These numbers can be obtained by hand like this:

``````
data2 <- subset(data, data$pval <= 0.05)
mean(data2$pos[data2$pos >= 1040 & data2$pos <= 1049])
sum(data2$len[data2$pos >= 1040 & data2$pos <= 1049])
sum(data2$nsnp[data2$pos >= 1040 & data2$pos <= 1049])
mean(data2$pos[data2$pos >= 1150 & data2$pos <= 1169])
sum(data2$len[data2$pos >= 1150 & data2$pos <= 1169])
sum(data2$nsnp[data2$pos >= 1150 & data2$pos <= 1169])
mean(data2$pos[data2$pos >= 1180 & data2$pos <= 1199])
sum(data2$len[data2$pos >= 1180 & data2$pos <= 1199])
sum(data2$nsnp[data2$pos >= 1180 & data2$pos <= 1199])
``````

I hope my problem is clear now. What I could not work out is how to select the consecutive rows according to `data$x`. In my example these consecutive rows are: pos 1040-1049, pos 1150-1169 and pos 1180-1199.

In this example each region has its own distinct `pval`, so it seems this can be done by grouping the filtered data (`data2`) by `pval`. Using `dplyr`:

``````
library(dplyr)

data2 %>%
  group_by(pval) %>%
  summarise(pos = mean(pos), len = sum(len), nsnp = sum(nsnp))
# A tibble: 3 × 4
#   pval    pos   len  nsnp
#  <dbl>  <dbl> <dbl> <dbl>
#1 1e-05 1189.5     2   674
#2 1e-04 1044.5     1   337
#3 5e-04 1159.5     2   674
``````
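Grouping by `pval` works here only because the three significant regions have three different p-values. A minimal sketch of the case where it breaks, using a made-up `toy` data frame (hypothetical, not from the question) in which two separate regions share the same p-value:

```r
library(dplyr)

# Hypothetical toy data: positions 1-3 and 7-8 are two separate regions,
# but every row has the same p-value.
toy <- data.frame(
  pos  = c(1, 2, 3, 7, 8),
  pval = rep(0.001, 5),
  len  = rep(0.1, 5)
)

by_pval <- toy %>%
  group_by(pval) %>%
  summarise(pos = mean(pos), len = sum(len))

# Only one group comes back, so the two regions have been merged.
nrow(by_pval)
```

With data like this, the grouping key has to come from the row order rather than from the p-value itself.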

However, if the p-values do not uniquely identify each region, we can group by runs of consecutive `pos` values instead:

``````
library(dplyr)

data2 %>%
  group_by(new = cumsum(c(1, diff(pos) != 1))) %>%
  summarise(pos = mean(pos), len = sum(len), nsnp = sum(nsnp))
# A tibble: 3 × 4
#    new    pos   len  nsnp
#  <dbl>  <dbl> <dbl> <dbl>
#1     1 1044.5     1   337
#2     2 1159.5     2   674
#3     3 1189.5     2   674
``````
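The `diff(pos) != 1` trick assumes that `pos` increases by exactly 1 from row to row. If that cannot be relied on, one possible alternative is to label the runs on the unfiltered data with base R's `rle()` applied to the condition `pval <= 0.05`. This is only a sketch, and the names `sig`, `runs`, `run_id`, `hits` and `result` are chosen here for illustration:

```r
# Rebuild the example data from the question
data <- data.frame(
  x    = 1:250,
  pos  = seq(1000, 1249, 1),
  pval = c(rep(0.25, 40), rep(0.0001, 10), rep(0.14, 100), rep(0.0005, 20),
           rep(0.58, 10), rep(0.00001, 20), rep(0.85, 50)),
  len  = rep(0.1, 250),
  nsnp = rep(33.7, 250)
)

data <- data[order(data$x), ]   # runs are defined in order of x
sig  <- data$pval <= 0.05       # TRUE for rows below the threshold
runs <- rle(sig)                # lengths of consecutive TRUE/FALSE stretches

# One id per stretch, repeated for every row in that stretch
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Keep only the significant rows, each tagged with the id of its run
hits <- data[sig, ]
hits$run_id <- run_id[sig]

# Summarise each run: mean position, total length, total nsnp
result <- data.frame(
  pos  = as.vector(tapply(hits$pos,  hits$run_id, mean)),
  len  = as.vector(tapply(hits$len,  hits$run_id, sum)),
  nsnp = as.vector(tapply(hits$nsnp, hits$run_id, sum))
)
result
```

For the example data this yields one row per region, matching the three rows of the expected table. Because the run ids come from the row order rather than from `pos`, gaps or irregular spacing in `pos` do not split a run.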
Source (Stackoverflow)