user3091668 - 1 month ago
R Question

Split a data frame based on an ordered multi-level factor column

I would like to split a data frame into a list of data frames. The reason for the split is that we will always have "father" followed by "mother", which in turn is followed by "offspring". However, these family members might have more than one row each (the rows are always consecutive, e.g. "father" number 1 is in row 1 and row 2). In my example below I have two families, so I am trying to get a list with two data frames.

My input:

df <- 'Chr Start End Family
1 187546286 187552094 father
3 108028534 108032021 father
1 4864403 4878685 mother
1 18898657 18904908 mother
2 460238 461771 offspring
3 108028534 108032021 offspring
1 71481449 71532983 father
2 74507242 74511395 father
2 181864092 181864690 mother
1 71481449 71532983 offspring
2 181864092 181864690 offspring
3 160057791 160113642 offspring'

df <- read.table(text = df, header = TRUE)


Thus, my expected output dfout[[1]] would look like:

dfout <- 'Chr Start End Family
1 187546286 187552094 father
3 108028534 108032021 father
1 4864403 4878685 mother
1 18898657 18904908 mother
2 460238 461771 offspring
3 108028534 108032021 offspring'

dfout <- read.table(text = dfout, header = TRUE)

Answer

To split each family into a separate data frame, you need an index marking where one family ends and the next begins. I used "father" as the change-point. We cannot simply use indx <- df$Family == "father", because there can be several "father" rows in a row. Instead, we look for the switch from "offspring" to "father": diff() on the logical vector returns 1 exactly where a non-father row is followed by a father row. Prepending 1L marks the first row as the start of the first family, and cumsum() then numbers the families.
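
To make the change-point idea concrete, here is a minimal sketch on a toy vector. The vector fam and the intermediate names (is_father, change, starts) are made up for illustration and are not part of the original data. It walks through the intermediate values one step at a time:

```r
# Toy Family column: two families, some members with repeated rows (hypothetical data)
fam <- c("father", "father", "mother", "offspring",
         "father", "mother", "offspring", "offspring")

is_father <- fam == "father"      # TRUE on every father row
change    <- diff(is_father)      # 1: non-father -> father, -1: father -> non-father, 0: no change
starts    <- c(1L, change) == 1L  # prepend 1 so the first row also counts as a family start
indx      <- cumsum(starts)       # running count of family starts, i.e. a family id

indx
# [1] 1 1 1 1 2 2 2 2
```

Only the first "father" row of each run is flagged, so repeated "father" rows stay in the same group.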

indx <- cumsum(c(1L, diff(df$Family == "father")) == 1L)
split(df, indx)
# $`1`
#   Chr     Start       End    Family
# 1   1 187546286 187552094    father
# 2   3 108028534 108032021    father
# 3   1   4864403   4878685    mother
# 4   1  18898657  18904908    mother
# 5   2    460238    461771 offspring
# 6   3 108028534 108032021 offspring
# 
# $`2`
#    Chr     Start       End    Family
# 7    1  71481449  71532983    father
# 8    2  74507242  74511395    father
# 9    2 181864092 181864690    mother
# 10   1  71481449  71532983 offspring
# 11   2 181864092 181864690 offspring
# 12   3 160057791 160113642 offspring
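
Since the question asks for dfout[[1]], a short usage sketch may help. It stores the result of split() in a list and pulls out the first family. The name dfout is borrowed from the question, txt is an arbitrary name for the input text, and resetting the row names is an optional extra that the answer above does not require.

```r
# Rebuild the question's input so the sketch is self-contained
txt <- "Chr Start End Family
1 187546286 187552094 father
3 108028534 108032021 father
1 4864403 4878685 mother
1 18898657 18904908 mother
2 460238 461771 offspring
3 108028534 108032021 offspring
1 71481449 71532983 father
2 74507242 74511395 father
2 181864092 181864690 mother
1 71481449 71532983 offspring
2 181864092 181864690 offspring
3 160057791 160113642 offspring"
df <- read.table(text = txt, header = TRUE)

# Same index as in the answer, then keep the split result as a list
indx  <- cumsum(c(1L, diff(df$Family == "father")) == 1L)
dfout <- split(df, indx)

length(dfout)  # number of families found
dfout[[1]]     # first family, i.e. rows 1 to 6 of df

# Optional: restart the row names at 1 inside every family
dfout <- lapply(dfout, function(d) {
  rownames(d) <- NULL
  d
})
```

The list elements are named "1" and "2" after the index values, so dfout[["2"]] and dfout[[2]] refer to the same data frame.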