Kees Boogaard Kees Boogaard - 1 month ago 13
R Question

R Match rows in a data frame based on formula

I have a data frame containing 7 columns and I want to add a column with information about the 'parent-row'. This sounds vague, so I'll clarify with an example. Below you can see a data frame:

` Nclass0 Nclass1 BestSBestI impurity n
[1,] 5 5 4 36.0 0.2500000 10
[2,] 5 2 1 37.0 0.2040816 7
[3,] 4 0 -1 -1.0 0.0000000 4
[4,] 1 2 2 0.5 0.2222222 3
[5,] 1 0 -1 -1.0 0.0000000 1
[6,] 0 2 -1 -1.0 0.0000000 2
[7,] 0 3 -1 -1.0 0.0000000 3`


Using the nclass0 and nclass1, I want to add an 8th column in which matching pairs have the same id. The first row is the parent row (with id=0). The rows match if [rowX,1] + [rowY,1] are equal to the parents row nclass0 and [rowX,2] + [rowY,2] are equal to the parent rows nclass1. RowX and rowY are the child rows and should get id=1.

In this case the parent row [1,] has child rows [2,]&[7,] and these rows should get id=1. After this the second row becomes the parent row with its own child rows [3,] and [4,] with id=2, until all rows with child rows have been assigned an id.

I have made several attempts but failed miserably. Does anyone have a suggestion how this can be done? The desired output for this case would be:

` Nclass0 Nclass1 BestS BestI impurity n id
[1,] 5 5 4 36.0 0.2500000 10 0
[2,] 5 2 1 37.0 0.2040816 7 1
[3,] 4 0 -1 -1.0 0.0000000 4 2
[4,] 1 2 2 0.5 0.2222222 3 2
[5,] 1 0 -1 -1.0 0.0000000 1 4
[6,] 0 2 -1 -1.0 0.0000000 2 4
[7,] 0 3 -1 -1.0 0.0000000 3 1`

Answer

Here's a solution that makes use of a while loop. The loop will run until either every row has an id value, or until it has evaluated all of the rows in the data frame. I'm sure there are some weaknesses, but it's a good start:

Note: I think this could get unbearably slow in a large data frame, so I hope you don't need to do this on anything large (each outer takes about 1 second to complete on a vector of 10,000).

DF <- 
  structure(list(Nclass0 = c(5, 5, 4, 1, 1, 0, 0), 
                 Nclass1 = c(5, 2, 0, 2, 0, 2, 3), 
                 BestS = c(4, 1, -1, 2, -1, -1, -1), 
                 BestI = c(36, 37, -1, 0.5, -1, -1, -1), 
                 impurity = c(0.25, 0.2040816, 0, 0.2222222, 0, 0, 0), 
                 n = c(10, 7, 4, 3, 1, 2, 3)), 
            .Names = c("Nclass0", "Nclass1", "BestS", "BestI", "impurity", "n"), 
            row.names = c(NA, -7L), class = "data.frame")

DF[["id"]] <- c(0, rep(NA, nrow(DF) - 1))

i <- 1
while(sum(is.na(DF[["id"]])) > 0){
  cross0 <- outer(DF[["Nclass0"]], DF[["Nclass0"]], `+`)
  match0 <- cross0 == DF[["Nclass0"]][i] & lower.tri(cross0)

  cross1 <- outer(DF[["Nclass1"]], DF[["Nclass1"]], `+`)
  match1 <- cross1 == DF[["Nclass1"]][i] & lower.tri(cross1)

  rows <- as.vector(which(match0 & match1, arr.ind = TRUE))
  if (length(rows)) DF[["id"]][rows] <- i

  if (i == nrow(DF)) break else i <- i + 1
}

Explanation

To try to clarify your problem, you are looking for pairs where x1 + x2 = x_ref AND y1 + y2 = y_ref.

What this code does is make a matrix of all of the possible pairwise sums of a vector with itself. This is accomplished with outer.

outer(DF[["Nclass0"]], DF[["Nclass0"]], `+`)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]   10   10    9    6    6    5    5
[2,]   10   10    9    6    6    5    5
[3,]    9    9    8    5    5    4    4
[4,]    6    6    5    2    2    1    1
[5,]    6    6    5    2    2    1    1
[6,]    5    5    4    1    1    0    0
[7,]    5    5    4    1    1    0    0

When trying to find the x-match for the first row, we compare this matrix to DF$class0[1] (and take set the upper triangle to false to avoid duplicates).

match0 <- cross0 == DF[["Nclass0"]][1] & lower.tri(cross0)
match0

      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[5,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
[6,]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
[7,]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE

We repeat this process for Nclass1

cross1 <- outer(DF[["Nclass1"]], DF[["Nclass1"]], `+`)
match1 <- cross1 == DF[["Nclass1"]][1] & lower.tri(cross1)
match1

      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[3,]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[4,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[5,]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[7,] FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

To find the row indices, we want to find the intersection of these two match matrices--in other words which positions in both matrices are TRUE

as.vector(which(match0 & match1, arr.ind = TRUE))
[1] 7 2

So rows 7 and 2 are related to the first row. We can repeat this operation for each subsequent row until we've assigned an ID for every row.

Turning it into a function

Here's a function that takes a data frame, a column name for the x-match, the column name for the y-match, and a character to name the id variable. I've added some bells and whistles to check the inputs.

assign_id <- function(DF, class0, class1, id_var){
  check <- require(checkmate)
  if (!check) stop ("Install the checkmate package")

  checkmate::assert_character(x = class0,
                              len = 1)
  checkmate::assert_character(x = class1,
                              len = 1)
  checkmate::assert_character(x = id_var,
                              len = 1)

  checkmate::assert_subset(c(class0, class1),
                           choices = names(DF))

  i <- 1

  DF[[id_var]] <- c(0, rep(NA, nrow(DF) - 1))

  while(sum(is.na(DF[[id_var]])) > 0){
    cross0 <- outer(DF[[class0]], DF[[class0]], `+`)
    match0 <- cross0 == DF[[class0]][i] & lower.tri(cross0)

    cross1 <- outer(DF[[class1]], DF[[class1]], `+`)
    match1 <- cross1 == DF[[class1]][i] & lower.tri(cross1)

    rows <- as.vector(which(match0 & match1, arr.ind = TRUE))
    if (length(rows)) DF[[id_var]][rows] <- i

    if (i == nrow(DF)) break else i <- i + 1
  }

  DF
}

assign_id(DF, "Nclass0", "Nclass1", "id")