Jonathan Dunne - 2 months ago
R Question

Remove a data frame row in R with a match over multiple Rows

I have a data frame that looks like this:

content                                                ChatPosition
This is a start line                                   START
This is a middle line                                  MIDDLE
This is a middle line                                  MIDDLE
This is the last line                                  END
This is a start line with a subsequent middle or end   START
This is another start line without a middle or an end  START
This is a start line                                   START
This is a middle line                                  MIDDLE
This is the last line                                  END

content <- c("This is a start line" , "This is a middle line" , "This is a middle line" ,"This is the last line" ,
"This is a start line with a subsequent middle or end" , "This is another start line without a middle or an end" ,
"This is a start line" , "This is a middle line" , "This is the last line")
ChatPosition <- c("START" , "MIDDLE" , "MIDDLE" , "END" , "START" ,"START" , "START" ,"MIDDLE" , "END")
df <- data.frame(content, ChatPosition)


I'd like to delete each row that contains a START, but only if the next row doesn't contain a MIDDLE or END in the ChatPosition column. The desired result is:

content                ChatPosition
This is a start line   START
This is a middle line  MIDDLE
This is a middle line  MIDDLE
This is the last line  END
This is a start line   START
This is a middle line  MIDDLE
This is the last line  END

# Look for a START that is immediately followed by another START.
# Stop at the second-to-last row so that jjj + 1 never runs past the end of df.
for (jjj in seq_len(nrow(df) - 1))
{
  if (df$ChatPosition[jjj] == "START" && df$ChatPosition[jjj + 1] == "START")
  {
    print(df$content[jjj])
  }
}


I was able to use the above code to print out the two lines I want to delete. What is the most elegant way to remove these lines?

Also, is a for loop with a nested if the right approach here, or is there a library that can do this kind of thing more easily?

Regards
Jonathan

lmo
Answer

This should work for you.

df[!(as.character(df$ChatPosition) == "START" & 
   c(tail(as.character(df$ChatPosition), -1), "END") == "START"), ]

                     content ChatPosition
1       This is a start line        START
2      This is a middle line       MIDDLE
3      This is a middle line       MIDDLE
4      This is the last line          END
7       This is a start line        START
8      This is a middle line       MIDDLE
9      This is the last line          END

The first argument in [] is a logical vector that tells R which rows to keep. I use tail(x, -1) to drop the first element of df$ChatPosition, so that each position lines up with the next observation for comparison. Note that it is necessary to convert df$ChatPosition to character in the second line in order to concatenate "END" in the final position, since df$ChatPosition is a factor variable.
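
To make the logic easier to follow, here is a minimal sketch that splits the one-liner into named steps. The names next_pos, drop_row and result are only illustrative, and the sketch assumes the data frame is built with stringsAsFactors = FALSE so that the as.character() calls are not needed; with factor columns, keep the conversion as shown above.

```r
# Rebuild the example data from the question.
# Assumption: stringsAsFactors = FALSE, so both columns are plain character vectors.
content <- c("This is a start line", "This is a middle line",
             "This is a middle line", "This is the last line",
             "This is a start line with a subsequent middle or end",
             "This is another start line without a middle or an end",
             "This is a start line", "This is a middle line",
             "This is the last line")
ChatPosition <- c("START", "MIDDLE", "MIDDLE", "END", "START",
                  "START", "START", "MIDDLE", "END")
df <- data.frame(content, ChatPosition, stringsAsFactors = FALSE)

# Step 1: shift ChatPosition up by one, so element i holds the value of row i + 1.
# The last row has no successor, so "END" is padded in, as in the answer.
next_pos <- c(tail(df$ChatPosition, -1), "END")

# Step 2: flag every START whose successor is also a START.
drop_row <- df$ChatPosition == "START" & next_pos == "START"

# Step 3: keep only the rows that were not flagged.
result <- df[!drop_row, ]
```

Because the shifted vector is padded with "END", a START in the very last row would be kept. If such a trailing START should be removed as well, padding with "START" instead would flag it.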