Cebs - 2 months ago
R Question

Selecting consecutive rows in a data frame...

I need to perform some operations on a data frame, and since they are a bit unusual I don't have a clue how to do them. Here is some data:

x <- 1:250
pos <- seq(1000, 1249, 1)
pval <- c(rep(0.25, 40), rep(0.0001, 10), rep(0.14, 100), rep(0.0005, 20), rep(0.58, 10), rep(0.00001, 20), rep(0.85, 50))
len <- rep(0.1, 250)
nsnp <- rep(33.7, 250)
data <- data.frame(x, pos, pval, len, nsnp)


My problem is that I need to create a new data frame from this one by combining consecutive rows according to data$pval. That is, with the rows sorted by data$x, I need to join each run of consecutive rows that have data$pval <= 0.05 and, for each run, compute:


  1. The mean of data$pos between the first and last consecutive rows with data$pval <= 0.05

  2. The sum of data$len over the consecutive rows with data$pval <= 0.05

  3. The sum of data$nsnp over the consecutive rows with data$pval <= 0.05



Since the data frame (data) contains 3 regions of consecutive data$x values with data$pval <= 0.05, the final data frame should look like this:

        pos len nsnp
[1,] 1044.5   1  337
[2,] 1159.5   2  674
[3,] 1189.5   2  674


These numbers can be obtained manually like this:

data2 <- subset(data, pval <= 0.05)
# Region 1: pos 1040-1049
mean(data2$pos[data2$pos >= 1040 & data2$pos <= 1049])
sum(data2$len[data2$pos >= 1040 & data2$pos <= 1049])
sum(data2$nsnp[data2$pos >= 1040 & data2$pos <= 1049])
# Region 2: pos 1150-1169
mean(data2$pos[data2$pos >= 1150 & data2$pos <= 1169])
sum(data2$len[data2$pos >= 1150 & data2$pos <= 1169])
sum(data2$nsnp[data2$pos >= 1150 & data2$pos <= 1169])
# Region 3: pos 1180-1199
mean(data2$pos[data2$pos >= 1180 & data2$pos <= 1199])
sum(data2$len[data2$pos >= 1180 & data2$pos <= 1199])
sum(data2$nsnp[data2$pos >= 1180 & data2$pos <= 1199])


I hope my problem is clear now. What I could not figure out is how to select the consecutive rows according to data$x without hard-coding the ranges. In my example these consecutive rows are: pos 1040-1049, pos 1150-1169 and pos 1180-1199.

Answer

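In this example each run of significant rows has its own distinct pval, so it seems this can be done by grouping by pval. Using dplyr on the filtered data frame data2 from the question: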

library(dplyr)
data2 %>% 
  group_by(pval) %>% 
  summarise(pos = mean(pos), len = sum(len), nsnp = sum(nsnp))
# A tibble: 3 × 4
#   pval    pos   len  nsnp
#  <dbl>  <dbl> <dbl> <dbl>
#1 1e-05 1189.5     2   674
#2 1e-04 1044.5     1   337
#3 5e-04 1159.5     2   674
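
Grouping by pval only separates the runs when every run has a different p-value. As a minimal sketch of the failure case, the hypothetical toy data frame below (not from the question) has two separate significant runs that share the same p-value, and grouping by pval alone merges them:

```r
# Hypothetical toy data: two separate significant runs share the same p-value.
toy <- data.frame(
  pos  = c(1, 2, 3, 4, 5, 6, 7),
  pval = c(0.001, 0.001, 0.5, 0.5, 0.001, 0.001, 0.001),
  len  = 0.1
)
sig <- subset(toy, pval <= 0.05)

# Count rows per pval group: rows 1-2 and rows 5-7 end up in the same group.
by_pval <- aggregate(len ~ pval, data = sig, FUN = length)
nrow(by_pval)  # 1 group, although there are 2 separate runs
```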

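However, if that's not the case, we can group by runs of consecutive pos values instead. Here diff(pos) != 1 marks every row where pos jumps by more than one step, and cumsum() turns those marks into a group index: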

library(dplyr)
data2 %>% 
  group_by(new = cumsum(c(1, diff(pos) != 1))) %>% 
  summarise(pos = mean(pos), len = sum(len), nsnp = sum(nsnp))
# A tibble: 3 × 4
#    new    pos   len  nsnp
#  <dbl>  <dbl> <dbl> <dbl>
#1     1 1044.5     1   337
#2     2 1159.5     2   674
#3     3 1189.5     2   674
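
To see how the group index is built, and as a possible base R alternative, here is a sketch that first applies the cumsum/diff step to a small vector and then labels runs of the condition pval <= 0.05 directly with rle(). The names sig, runs, run_id, keep and result are illustrative and not from the original answer. This variant assumes the rows are already sorted by x, and it does not rely on pos being consecutive integers.

```r
# Rebuild the question's data.
data <- data.frame(
  x    = 1:250,
  pos  = seq(1000, 1249, 1),
  pval = c(rep(0.25, 40), rep(0.0001, 10), rep(0.14, 100), rep(0.0005, 20),
           rep(0.58, 10), rep(0.00001, 20), rep(0.85, 50)),
  len  = rep(0.1, 250),
  nsnp = rep(33.7, 250)
)

# How the run index works on a small vector: a new group starts at every gap.
p <- c(1040, 1041, 1042, 1150, 1151, 1180)
cumsum(c(1, diff(p) != 1))  # 1 1 1 2 2 3

# Base R alternative: label runs of the condition itself with rle().
sig    <- data$pval <= 0.05
runs   <- rle(sig)
run_id <- rep(seq_along(runs$lengths), runs$lengths)  # one id per run, per row

# Keep only the significant rows, each tagged with the id of its run.
keep     <- data[sig, ]
keep$run <- run_id[sig]

# Sum len and nsnp per run, then add the mean pos per run.
result     <- aggregate(cbind(len, nsnp) ~ run, data = keep, FUN = sum)
result$pos <- aggregate(pos ~ run, data = keep, FUN = mean)$pos
result[, c("pos", "len", "nsnp")]
```

Because the run ids come from the logical condition rather than from pval or pos, two adjacent significant regions are still kept apart whenever at least one non-significant row lies between them.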