Jonathan Dunne - 3 months ago
R Question

How can I show dataframe rows based on computed indices in R

I am working on a problem related to a dataframe: retrieving specific rows based on the indices of rows that match certain criteria.

# Create dataframe

position <- c("START" , "MIDDLE", "END" ,"START" , "MIDDLE",
"MIDDLE", "MIDDLE", "MIDDLE" ,"MIDDLE" ,"MIDDLE",
"MIDDLE", "MIDDLE", "MIDDLE" ,"END", "START" ,
"START" , "START" , "MIDDLE", "MIDDLE", "END",
"START" , "START", "MIDDLE", "MIDDLE", "MIDDLE",
"END" ,"START", "MIDDLE", "MIDDLE", "MIDDLE",
"END", "START" , "MIDDLE", "MIDDLE", "MIDDLE",
"MIDDLE" ,"MIDDLE" ,"MIDDLE", "MIDDLE" ,"MIDDLE" ,
"MIDDLE" ,"MIDDLE", "MIDDLE", "MIDDLE", "MIDDLE",
"MIDDLE" ,"MIDDLE", "MIDDLE" ,"MIDDLE" ,"MIDDLE" ,
"MIDDLE", "MIDDLE", "MIDDLE", "END")

text <- c("First line", "Middle Line", "Last Line", "First line","Middle Line",
"Middle Line", "Middle Line", "Middle Line", "Middle Line", "Middle Line",
"Middle Line", "Middle Line", "Middle Line", "Last Line", "First line",
"First line", "First line", "Middle Line", "Middle Line", "Last Line",
"First line", "First line", "Middle Line", "Middle Line", "Middle Line",
"Last Line", "First line", "Middle Line", "Middle Line", "Middle Line",
"Last Line", "First line", "Middle Line", "Middle Line", "Middle Line",
"Middle Line", "Middle Line", "Middle Line", "Middle Line", "Middle Line",
"Middle Line", "Middle Line", "Middle Line", "Middle Line", "Middle Line",
"Middle Line", "Middle Line", "Middle Line", "Middle Line", "Middle Line",
"Middle Line", "Middle Line", "Middle Line", "Last Line")

# Combine both vectors into the dataframe used below
a_df <- data.frame(position, text, stringsAsFactors = FALSE)


Which essentially shows lines like the following:

> head(a_df, 3)
  position        text
1    START  First line
2   MIDDLE Middle Line
3      END   Last Line


Basically, I want to be able to show subsets of the overall dataframe. Each subset should contain a start line, its middle lines and an end line.

After some reading online, I am trying to generate indices as follows:

# Generate indices
index_start <- grep("START", a_df$position)
index_end <- grep("END", a_df$position)


Which gives the required output:

> index_start
[1] 1 4 15 16 17 21 22 27 32
> index_end
[1] 3 14 20 26 31 54


I realise the indices are imbalanced (I am removing these imbalances), but I am wondering how I can use the above output to seed the values in the following subset commands:

a_df[c(1:3),]
a_df[c(4:14),]
a_df[c(17:20),]
a_df[c(22:26),]
a_df[c(27:31),]
a_df[c(32:54),]


Thanks in advance
Jonathan

Answer

It is not clear how the elements of 'index_start' should be selected, but based on the code shown in the OP's post, it seems that for each element of 'index_end' we need the last element of 'index_start' that is less than it. To get that last element, we create a grouping variable with findInterval and then use tapply with tail to take the last element of 'index_start' in each group.
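To see what the grouping step does, here is a minimal sketch using the two index vectors exactly as printed in the question (typed in by hand so the block runs on its own). The names grp, groups and last_start are made up for illustration, and split/vapply is used only to make the groups visible; it is an alternative to the tapply call, not a replacement for it.

```r
# Index vectors copied from the output shown in the question
index_start <- c(1, 4, 15, 16, 17, 21, 22, 27, 32)
index_end   <- c(3, 14, 20, 26, 31, 54)

# For each start index, count how many end indices are <= it.
# Starts that fall before the same END get the same group number.
grp <- findInterval(index_start, index_end)
grp
# [1] 0 1 2 2 2 3 3 4 5

# Show the groups explicitly: group "2" holds 15, 16, 17 and
# group "3" holds 21, 22
groups <- split(index_start, grp)

# Keep only the last start index of each group
last_start <- unname(vapply(groups, tail, numeric(1), n = 1))
last_start
# [1]  1  4 17 22 27 32
```

Only the last START before each END survives, which is why rows 15, 16 and 21 do not appear in any subset.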

Then, we build the sequence between corresponding elements of 'index_start1' and 'index_end' with Map, and subset the rows of the dataset by each sequence to get a list of data.frames.

index_start1 <- unname(tapply(index_start, findInterval(index_start, index_end),
                           FUN = tail, 1))    
index_start1
#[1]  1  4 17 22 27 32

lst <- Map(function(x, y) a_df[x:y,], index_start1, index_end)
lst
#[[1]]
#  position        text
#1    START  First line
#2   MIDDLE Middle Line
#3      END   Last Line

#[[2]]
#   position        text
#4     START  First line
#5    MIDDLE Middle Line
#6    MIDDLE Middle Line
#7    MIDDLE Middle Line
#8    MIDDLE Middle Line
#9    MIDDLE Middle Line
#10   MIDDLE Middle Line
#11   MIDDLE Middle Line
#12   MIDDLE Middle Line
#13   MIDDLE Middle Line
#14      END   Last Line

#[[3]]
#   position        text
#17    START  First line
#18   MIDDLE Middle Line
#19   MIDDLE Middle Line
#20      END   Last Line

#[[4]]
#   position        text
#22    START  First line
#23   MIDDLE Middle Line
#24   MIDDLE Middle Line
#25   MIDDLE Middle Line
#26      END   Last Line

#[[5]]
#   position        text
#27    START  First line
#28   MIDDLE Middle Line
#29   MIDDLE Middle Line
#30   MIDDLE Middle Line
#31      END   Last Line

#[[6]]
#   position        text
#32    START  First line
#33   MIDDLE Middle Line
#34   MIDDLE Middle Line
#35   MIDDLE Middle Line
#36   MIDDLE Middle Line
#37   MIDDLE Middle Line
#38   MIDDLE Middle Line
#39   MIDDLE Middle Line
#40   MIDDLE Middle Line
#41   MIDDLE Middle Line
#42   MIDDLE Middle Line
#43   MIDDLE Middle Line
#44   MIDDLE Middle Line
#45   MIDDLE Middle Line
#46   MIDDLE Middle Line
#47   MIDDLE Middle Line
#48   MIDDLE Middle Line
#49   MIDDLE Middle Line
#50   MIDDLE Middle Line
#51   MIDDLE Middle Line
#52   MIDDLE Middle Line
#53   MIDDLE Middle Line
#54      END   Last Line

NOTE: It is better to keep the data.frames in the list, as most operations can be applied to every element of the list at once.
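As a rough illustration of working on the list directly, the sketch below rebuilds the question's data in a compact way (from the reported indices rather than the long vectors, which is an assumption made only to keep the block short), recreates lst, and then runs two typical list operations. The column name block and the object combined are hypothetical.

```r
# Rebuild the question's 54-row data from the reported indices
index_start <- c(1, 4, 15, 16, 17, 21, 22, 27, 32)
index_end   <- c(3, 14, 20, 26, 31, 54)
position <- rep("MIDDLE", 54)
position[index_start] <- "START"
position[index_end]   <- "END"
lookup <- c(START = "First line", MIDDLE = "Middle Line", END = "Last Line")
a_df <- data.frame(position, text = unname(lookup[position]),
                   stringsAsFactors = FALSE)

# Same steps as in the answer
index_start1 <- unname(tapply(index_start,
                              findInterval(index_start, index_end),
                              FUN = tail, 1))
lst <- Map(function(x, y) a_df[x:y, ], index_start1, index_end)

# Number of rows in each subset
sapply(lst, nrow)
# [1]  3 11  4  5  5 23

# Tag each subset with its position in the list and stack them again
combined <- do.call(rbind, Map(function(d, i) { d$block <- i; d },
                               lst, seq_along(lst)))
nrow(combined)
# [1] 51
```

The stacked result has 51 rows rather than 54 because the three unmatched START rows (15, 16 and 21) are not part of any subset.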
